Thursday, December 19, 2013

We continue our discussion on data transformation from operational systems to the warehouse:
We mentioned summarizing data by properly sequencing the operational data as one of the functionalities required. Others are:
Input records can be in different proprietary or non-standard formats, syntaxes, and even semantics.
Conversions for these records need to be authored, and they can be complex (see the sketch after this list).
Semantic relationships in the operational data are harder to translate and often come at the price of reverse engineering.
Data format conversion must be done to keep the data consistent in the warehouse.
Massive volumes of input must be accounted for, including parallel loads or reads.
The warehouse must conform to a corporate data model. Since the business changes more often than the applications written for it, these mismatches need to be accounted for in designing and building the warehouse. There is order and discipline to the design and structure of the data warehouse.
The data warehouse reflects the historical need for information, so during transformation we add this element of time where it is missing.
The warehouse must be built for the informational needs of the organization, not its clerical needs.
Since hardware, operating systems, and environments change between operational and data warehouse systems, the newly generated output files from transformation may need some local conversion.
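
To make the conversion and time-stamping steps concrete, here is a minimal sketch in Python; the fixed-width layout and field names are hypothetical:

from datetime import datetime, timezone

def convert_record(raw: str) -> dict:
    """Convert one fixed-width operational record (hypothetical layout)
    into the standard key-value form used by the warehouse."""
    return {
        "customer_id": raw[0:8].strip(),
        "amount": int(raw[8:16]) / 100,  # amount stored as cents upstream
        "branch": raw[16:20].strip(),
        # The operational record carries no timestamp, so the element
        # of time is added during transformation.
        "load_ts": datetime.now(timezone.utc).isoformat(),
    }

print(convert_record("C0012345000159990023"))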
ETL software is often used for these extract-transform-load operations.
An alternative to ETL software is ELT software, which performs extract-load-transform operations; there the transformation can be deferred or skipped, but skipping it does not benefit the warehouse. The difference is only one of ordering, as the sketch below illustrates.
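
A toy sketch of that ordering difference, with hypothetical stand-ins for the extract, transform, and load steps:

from typing import Iterable, Iterator

def extract(source: Iterable[str]) -> Iterator[str]:
    yield from source

def transform(raw: str) -> str:
    return raw.upper()  # stand-in for the real conversions above

class Warehouse:
    def __init__(self):
        self.rows = []
    def load(self, row: str) -> None:
        self.rows.append(row)
    def transform_in_place(self) -> None:
        self.rows = [transform(r) for r in self.rows]

source = ["a", "b"]

etl = Warehouse()
for raw in extract(source):      # ETL: transform before load,
    etl.load(transform(raw))     # so only conforming data lands

elt = Warehouse()
for raw in extract(source):      # ELT: load the raw data first...
    elt.load(raw)
elt.transform_in_place()         # ...and transform later, a step often deferred or skipped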
The transformation and population of the data warehouse are triggered with an 'event-snapshot' interaction, i.e. as and when snapshots become available, they are moved into the warehouse.
Events can be activity-generated or time-generated.
Events that are triggered by time are usually cyclic.
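
A minimal sketch of a time-generated, cyclic trigger; the interval and the populate step are hypothetical:

import itertools
import time

def time_generated_events(interval_seconds: float):
    """Yield cyclic, time-generated events at a fixed interval."""
    while True:
        yield time.time()
        time.sleep(interval_seconds)

def populate_warehouse(snapshot: dict) -> None:
    print("moving snapshot into warehouse:", snapshot)

# As each event fires, whatever snapshot is available at that moment
# is moved into the warehouse.
for event_time in itertools.islice(time_generated_events(1.0), 3):
    populate_warehouse({"taken_at": event_time})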
We look at the components of the snapshot next. As mentioned earlier, a snapshot contains a unit of time, a key, non-key primary data, and secondary data. The first usually marks the moment when the snapshot is taken. The key identifies the snapshot. The third is the data that relates to the key, and the last is data that is circumstantial to the primary data.
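
Those four components map naturally onto a record layout; a minimal sketch, with field names that are assumptions:

from dataclasses import dataclass
from datetime import datetime
from typing import Any

@dataclass
class Snapshot:
    taken_at: datetime            # unit of time: the moment the snapshot is taken
    key: str                      # identifies the snapshot
    primary: dict[str, Any]       # non-key primary data related to the key
    secondary: dict[str, Any]     # circumstantial data captured alongside

snap = Snapshot(
    taken_at=datetime(2013, 12, 19, 9, 30),
    key="subscriber-42",
    primary={"plan": "gold", "balance": 12.50},
    secondary={"source_system": "billing", "extract_batch": 7},
)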
We look at the components of profile records as well.
Data in a warehouse is not all stable, and some of it is susceptible to change due to various factors. For such data, some kind of aggregate record is preferable. Such a record is called a profile record. For example, the monthly call records of a subscriber are aggregated to provide a single composite record. Notice how this is similar to HBase, in which the columns and the column counts don't matter. Records can have a varying number of columns because they are structured as key-value pairs in a column family. The records allow access to individual values by key lookup for that record in that collection. A profile record in the data warehouse is a single composite record of all the changes captured until the event that triggers the profile record. Note that the comparison is drawn in terms of the layout and not the content.
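A small sketch of that layout point, using plain Python dictionaries rather than a real column-family store:

# Two profile records with different "columns"; each is just a set of
# key-value pairs, so the column count can vary from record to record.
profiles = {
    "subscriber-42": {"calls": 117, "minutes": 804, "roaming_minutes": 55},
    "subscriber-99": {"calls": 3, "minutes": 12},  # fewer columns, still valid
}

# Individual values are reached by key lookup within the record.
print(profiles["subscriber-42"]["minutes"])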
Since the profile record groups many detail records, the aggregation can take many different forms:
Values from the operational data can be summarized.
Frequencies can be maintained, units of the operational data can be tallied, their minimum, maximum, and average can be found, and they can be placed into ranges with boundaries, as the sketch after this list shows.
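
Here is a minimal sketch of these aggregation forms over one subscriber's hypothetical call durations for a month:

from statistics import mean

calls = [3, 12, 47, 5, 9, 30, 2]  # call durations in minutes (hypothetical)

profile_record = {
    "total_minutes": sum(calls),              # values summarized
    "call_count": len(calls),                 # frequency / tally of units
    "min_minutes": min(calls),
    "max_minutes": max(calls),
    "avg_minutes": mean(calls),
    "short_calls": sum(1 for c in calls if c < 10),   # ranges with boundaries
    "long_calls": sum(1 for c in calls if c >= 10),
}
print(profile_record)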
A desirable benefit of using a single profile record is that it saves the end user from having to compute these aggregates.
