Cluster computing

Wednesday, December 18, 2013

We continue our discussion of the data warehouse.
Extraction of data from one place to another seems deceptively simple but usually turns out to be much more complex and larger than initially assumed.
We see some of these transformations now:
1) The extraction of the data from the operational environment to the data warehouse environment requires a change in technology, often including operating systems, hardware and data.
2) The selection of data from the operational environment may require reading several tables and following relational constraints, and it may have to be done during the online operational window.
3) Input keys from operational data have to be restructured and converted before they are written to the data warehouse.
4) Non-key data is reformatted. For example, date-time formats may change to be more consistent across the warehouse.
5) Data is also cleansed before passing to the warehouse with cross-record verification and domain checks.
6) Multiple sources of data exist and must pass into the data warehouse. This means that we tap into each data source with a defined logic.
7) When there are multiple input files, key resolution must be done before the files can be merged.
8) If there are multiple files, the sequence of files may not be compatible and this may involve record migrations as well.
9) The transformation logic that creates the data warehouse may produce different summarization levels for different data sources, and hence different outputs that need to be reconciled.
10) There have to be default values when there is no source of data.
11) The efficiency of the selection of data is often a performance consideration because only a few records from the original source may need to be pulled.
12) Data also needs to be summarized and combined into the profile record.
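
A few of the steps above (key conversion, date reformatting, domain checks, default values and summarization into a profile record) can be sketched in Python. This is only a minimal illustration under assumptions: the source names, field names, key map, date formats and the default region are all hypothetical and do not come from any particular warehouse.

```python
from datetime import datetime

# Hypothetical lookup: (source system, operational key) -> warehouse key.
KEY_MAP = {("billing", "C-001"): 1001, ("crm", "7"): 1001}

# Hypothetical date formats used by the different source systems.
SOURCE_DATE_FORMATS = ("%m/%d/%Y", "%d.%m.%Y", "%Y-%m-%d")

# Assumed default for records whose source carries no region.
DEFAULT_REGION = "UNKNOWN"


def normalize_date(text):
    """Return the date as YYYY-MM-DD, trying each known source format in turn."""
    for fmt in SOURCE_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(source, record):
    """Convert one operational record into a warehouse row."""
    amount = float(record["amount"])
    if amount < 0:  # domain check: amounts are assumed to be non-negative
        raise ValueError(f"amount out of domain: {amount}")
    return {
        "customer_key": KEY_MAP[(source, record["id"])],  # key conversion
        "activity_date": normalize_date(record["date"]),  # non-key reformatting
        "amount": amount,
        "region": record.get("region") or DEFAULT_REGION,  # default value
    }


def build_profiles(rows):
    """Combine rows that share a warehouse key into one profile record each."""
    profiles = {}
    for row in rows:
        profile = profiles.setdefault(row["customer_key"], {
            "customer_key": row["customer_key"],
            "region": row["region"],
            "total_amount": 0.0,
            "last_activity": row["activity_date"],
        })
        profile["total_amount"] += row["amount"]
        # ISO dates sort correctly as plain strings.
        profile["last_activity"] = max(profile["last_activity"], row["activity_date"])
        if profile["region"] == DEFAULT_REGION:
            profile["region"] = row["region"]
    return profiles


rows = [
    transform("billing", {"id": "C-001", "date": "12/18/2013", "amount": "10.50", "region": "west"}),
    transform("crm", {"id": "7", "date": "17.12.2013", "amount": "4.50"}),
]
profiles = build_profiles(rows)
```

Here two records from different sources resolve to the same warehouse key before they are merged, which is the order item 7 calls for. A real pipeline would also need the cross-record verification and resequencing steps, which this sketch leaves out.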
