Tuesday, May 13, 2008

Data conversion


Data conversion is the conversion of one form of computer data to another: the changing of bits from one format to a different one, usually for the purpose of application interoperability or to make use of new features. At the simplest level, data conversion can be exemplified by conversion of a text file from one character encoding to another.
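
For example, here is a minimal Python sketch of such an encoding conversion (the file names are hypothetical):

```python
# Convert a text file from Latin-1 to UTF-8.
# "input.txt" and "output.txt" are hypothetical file names.
with open("input.txt", "r", encoding="latin-1") as src:
    text = src.read()

with open("output.txt", "w", encoding="utf-8") as dst:
    dst.write(text)
```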

Data Transformations:
A data transformation converts data from a source data format into a destination data format.

Data transformation can be divided into two steps:
1. Data mapping, which maps data elements from the source to the destination and captures any transformation that must occur
2. Code generation, which creates the actual transformation program

Data element to data element mapping is frequently complicated by complex transformations that require one-to-many and many-to-one transformation rules.
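
As a rough illustration, here is a minimal Python sketch of such a mapping, including one many-to-one rule and one one-to-many rule; the source and destination layouts are invented for this example:

```python
def transform(source: dict) -> dict:
    """Map a hypothetical source record to a destination record."""
    dest = {}
    # One-to-one mapping with a rename.
    dest["customer_id"] = source["cust_no"]
    # Many-to-one: two source elements feed one destination element.
    dest["full_name"] = f'{source["first_name"]} {source["last_name"]}'
    # One-to-many: one source element is split into several.
    city, state = source["city_state"].split(", ")
    dest["city"] = city
    dest["state"] = state
    return dest

record = {"cust_no": 42, "first_name": "Ada", "last_name": "Lovelace",
          "city_state": "Schenectady, NY"}
print(transform(record))
```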

Data mapping is the process of creating data element mappings between two distinct data models. Data mapping is used as a first step for a wide variety of data integration tasks including:
Data transformation or data mediation between a data source and a destination
Identification of data relationships as part of data lineage analysis
Discovery of hidden sensitive data, such as the last four digits of a social security number hidden in another user ID, as part of a data masking or de-identification project (see the sketch after this list)
Consolidation of multiple databases into a single database, identifying redundant columns of data for consolidation or elimination
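
As a rough sketch of the discovery case, a regular expression can flag user IDs that embed a four-digit run as candidates for review; the IDs are invented, and a real project would add context checks before treating a match as sensitive:

```python
import re

# Flag user IDs that embed a four-digit run which *might* be the last
# four digits of a social security number. Note the false positives a
# real project must filter out (e.g. a year like 2024).
user_ids = ["jsmith", "mbrown6789", "alee_2024", "kchan1234x"]

candidates = [uid for uid in user_ids if re.search(r"\d{4}", uid)]
print(candidates)  # ['mbrown6789', 'alee_2024', 'kchan1234x']
```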

Metadata:
Metadata is "data about data": information that describes, or supplements, the central data.
Example: "12345" is data, and with no additional context is meaningless. When "12345" is given a meaningful name (metadata) of "ZIP code", one can understand (at least in the United States, and further placing "ZIP code" within the context of a postal address) that "12345" refers to the General Electric plant in Schenectady, New York.

Thursday, May 1, 2008

ETL In Data Warehousing

Extract, Transform, and Load (ETL)

Extract, transform, and load (ETL) is a process in data warehousing that involves
· extracting data from outside sources,
· transforming it to fit business needs, and ultimately
· loading it into the data warehouse.
ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.

Extract
The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization / format.
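
A minimal Python sketch of extraction from two differently formatted sources; the data is inlined here so the example runs as-is:

```python
import csv
import io
import json

# Extract rows from two hypothetical source systems that organize the
# same data differently: one CSV export, one JSON export.
csv_export = io.StringIO("order_id,qty\n7,3\n")
json_export = io.StringIO('[{"order_id": 8, "qty": 1}]')

rows = list(csv.DictReader(csv_export)) + json.load(json_export)
print(rows)
```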

Transform
The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources will require very little manipulation of data. In other cases, one or more of the following transformation types may be required (a sketch follows the list):
· Selecting only certain columns to load (or excluding columns, such as all-null columns, from the load)
· Translating coded values (e.g., if the source system stores M for male and F for female, but the warehouse stores 1 for male and 2 for female)
· Encoding free-form values (e.g., mapping "Male" and "M" and "Mr" onto 1)
· Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
· Joining together data from multiple sources (e.g., lookup, merge, etc.)
· Summarizing multiple rows of data (e.g., total sales for each region)
· Generating surrogate key values
· Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
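
The sketch below applies a few of these transformation types to a hypothetical source row; the column names, code values, and gender encoding are assumptions for illustration:

```python
# Translating coded and free-form values onto warehouse codes.
GENDER_CODES = {"M": 1, "Male": 1, "Mr": 1, "F": 2, "Female": 2, "Ms": 2}

def transform(row: dict) -> dict:
    return {
        # Selecting only certain columns to load (others are dropped).
        "order_id": row["order_id"],
        # Translating / encoding coded and free-form values.
        "gender": GENDER_CODES[row["gender"]],
        # Deriving a new calculated value.
        "sale_amount": row["qty"] * row["unit_price"],
    }

print(transform({"order_id": 7, "gender": "Mr", "qty": 3,
                 "unit_price": 9.50, "internal_note": "dropped"}))
```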

Load
The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.
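
As a minimal load sketch, SQLite stands in for the warehouse here; the table layout is an assumption:

```python
import sqlite3

# Load transformed rows into a warehouse table.
rows = [(7, 1, 28.50)]

con = sqlite3.connect("warehouse.db")
con.execute("""CREATE TABLE IF NOT EXISTS sales
               (order_id INTEGER, gender INTEGER, sale_amount REAL)""")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
con.commit()
con.close()
```
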
Data Integration:
Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. This important problem emerges in a variety of situations, both commercial and scientific.
Data Synchronization:
Data synchronization technologies are designed to synchronize a single set of data between two or more devices, automatically copying changes back and forth. For example, a user's contact list on one mobile device can be synchronized with other mobile devices or computers. Synchronization can be local, where the device and computer are side by side and data is transferred directly, or remote, where a user is mobile and the data is synchronized over a mobile network. The term also covers the ability to keep data in different databases up to date so that each repository contains the same information.
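
A toy last-writer-wins synchronization sketch in Python; the data and field layout are made up, and real synchronization protocols also handle conflicts and deletions, which this ignores:

```python
# Each device keeps (value, timestamp) per contact; the newer entry
# propagates when two devices sync.
device_a = {"alice": ("555-0101", 10), "bob": ("555-0202", 5)}
device_b = {"alice": ("555-0199", 12), "carol": ("555-0303", 8)}

def sync(a: dict, b: dict) -> dict:
    merged = dict(a)
    for key, (value, ts) in b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# alice takes device_b's newer value; bob and carol are both kept.
print(sync(device_a, device_b))
```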

Data Analysis:
Data analysis is the act of transforming data with the aim of extracting useful information and facilitating conclusions. Depending on the type of data and the question, this might include application of statistical methods, curve fitting, selecting or discarding certain subsets based on specific criteria, or other techniques. With respect to data mining, data analysis is usually understood more narrowly: it aims not at the discovery of unforeseen patterns hidden in the data, but at the verification or disproval of an existing hypothesis or model.
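
A small Python example of data analysis in this sense, discarding a subset by a specific criterion and then applying simple statistics; the readings and the cutoff are made up:

```python
import statistics

readings = [12.1, 11.8, 12.4, 55.0, 12.0, 11.9]

# Discard a subset based on a specific criterion (an obvious outlier).
kept = [r for r in readings if r < 30]

print(statistics.mean(kept), statistics.stdev(kept))
```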

Data Quality:
Data Quality refers to the quality of data. Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Typical data quality activities include:
1. Data profiling - initially assessing the data to understand its quality challenges
2. Data standardization - a business rules engine that ensures that data conforms to quality rules
3. Geo-coding - for name and address data; corrects data to US and worldwide postal standards
4. Matching or linking - a way to compare data so that similar, but slightly different, records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data; it often recognizes that 'Bob' and 'Robert' may be the same individual. It might also be able to manage 'householding', or finding links between husband and wife at the same address, for example. Finally, it can often build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record (a fuzzy-matching sketch follows this list).
5. Monitoring - keeping track of data quality over time and reporting variations in the quality of data.
6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to build the processes into enterprise applications to keep it clean.
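
A sketch of the matching step using Python's difflib for fuzzy comparison; the nickname table and similarity threshold are assumptions:

```python
import difflib

# Map common nicknames onto canonical first names before comparing.
NICKNAMES = {"bob": "robert", "bill": "william"}

def normalize(name: str) -> str:
    first, last = name.lower().split()
    return f"{NICKNAMES.get(first, first)} {last}"

def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

print(likely_same("Bob Smith", "Robert Smith"))  # True
print(likely_same("Bob Smith", "Alice Jones"))   # False
```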

Data Profiling:
Data profiling is the first phase of any data migration or data integration project. Broadly speaking, data profiling helps you in two different ways:
1. It identifies potential problems in the current data, which helps in avoiding late project surprises.
2. It gives a better understanding of your current data, which helps in, for example, planning your final data schema.
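
A minimal profiling pass in Python over a few invented rows, computing per-column null counts and distinct values, the kind of facts that surface quality problems early:

```python
from collections import Counter

rows = [
    {"zip": "12345", "state": "NY"},
    {"zip": None,    "state": "NY"},
    {"zip": "12345", "state": "ny"},  # inconsistent casing
]

for col in rows[0]:
    values = [r[col] for r in rows]
    nulls = sum(v is None for v in values)
    distinct = Counter(v for v in values if v is not None)
    print(f"{col}: {nulls} nulls, {len(distinct)} distinct values")
```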