Friday, December 20, 2013

We continue our discussion of the data warehouse with the indirect use of warehouse data. An emerging pattern runs through the examples mentioned in the previous post. The warehouse is periodically analyzed by a program that examines relevant characteristics and criteria. This analysis creates a small file in the online environment that contains succinct information about the business of the enterprise. The small online file is then used quickly and efficiently, as is customary in the operational environment.
We look at these pieces in more detail now. First, the program that runs against the warehouse is typically a machine learning program with free rein to read anywhere in the warehouse. It runs in the background, at a pace in harmony with the rate at which the warehouse data changes. The refresh occurs periodically and relatively infrequently, and it replaces the existing online data rather than updating it in place.
The online pre-analyzed data file holds only a small amount of data per unit it tracks and may contain only what the online user or clerk actually needs. Its ownership and access lie entirely within the operational environment.
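A minimal sketch of this pattern follows, assuming a hypothetical warehouse table of transactions and a hypothetical per-customer lookup file consumed online; all of the table, column, and file names are illustrative, not taken from the post.

import json
import sqlite3
from datetime import date

# Hypothetical background job: scans the large, granular warehouse,
# distills a compact per-customer summary, and wholly replaces the small
# online lookup file rather than updating it in place.
def refresh_online_profile(warehouse_path="warehouse.db",
                           online_file="customer_profile.json"):
    conn = sqlite3.connect(warehouse_path)
    rows = conn.execute(
        """
        SELECT customer_id,
               COUNT(*)      AS txn_count,
               SUM(amount)   AS total_spend,
               MAX(txn_date) AS last_activity
        FROM   transactions           -- granular, historical data
        GROUP  BY customer_id
        """
    ).fetchall()
    conn.close()

    # Keep only what the online clerk needs: a few fields per customer.
    profiles = {
        str(cid): {"txns": n, "spend": round(total or 0.0, 2), "last": last}
        for cid, n, total, last in rows
    }

    # Replace, do not update: the whole file is rewritten on each run.
    with open(online_file, "w") as f:
        json.dump({"as_of": date.today().isoformat(),
                   "profiles": profiles}, f)

# Run periodically (for example nightly) from a scheduler, not per transaction.
if __name__ == "__main__":
    refresh_online_profile()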
In the data warehouse, a normalized approach produces an optimal, flexible design. It fits well with granular data, it is not optimized for any given set of processing requirements, and it fits very nicely with the data model. Deviations from the normalized design usually come from requirements common to several departments, for example storing data for all months. Data warehousing as a whole also involves snowflake-like designs, which include star joins, fact tables, and dimensions.
Unlike data warehouses, data marts are very much shaped by processing requirements, whereas data warehouses serve a much broader audience. If we look at the entity relationship diagram for a data mart, all of the entities look equal; in reality, some entities carry far more data than others. The design that best exploits this characteristic is the star join. As a simple example, the entity with the most data carries foreign keys to the identifiers of all of its associated entities.
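The contrast can be sketched with a few illustrative tables (the names are assumptions, not part of the post): the warehouse side keeps granular, normalized tables, while the mart side uses a star join whose high-volume fact table carries foreign keys to each surrounding dimension.

import sqlite3

conn = sqlite3.connect(":memory:")

# Warehouse side: normalized and granular -- one row per sale event.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id  INTEGER REFERENCES product(product_id),
    sale_date   TEXT,
    amount      REAL
);
""")

# Data mart side: a star join.  sales_fact is the high-volume entity at the
# center; it holds foreign keys to the surrounding dimension tables.
conn.executescript("""
CREATE TABLE date_dim     (date_key     INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    total_amount REAL,          -- already summarized for the department
    txn_count    INTEGER
);
""")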
One of the issues with data warehouses and data marts is how data gets from the warehouse to the marts. Data in the data warehouse is very granular, while data in the data mart is compact and summarized, so periodically data must be moved from the warehouse to the mart.
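One way to picture that periodic move, again with illustrative names only: read the granular warehouse rows, summarize them to the grain the department wants (say, monthly totals per product), and rebuild the mart's fact table from the result.

import sqlite3
from collections import defaultdict

def load_mart(warehouse="warehouse.db", mart="marketing_mart.db"):
    # Granular input: one warehouse row per individual sale.
    wh = sqlite3.connect(warehouse)
    sales = wh.execute(
        "SELECT product_id, substr(sale_date, 1, 7) AS month, amount FROM sale"
    ).fetchall()
    wh.close()

    # Summarize to the grain the department cares about: product x month.
    totals = defaultdict(lambda: [0.0, 0])
    for product_id, month, amount in sales:
        totals[(product_id, month)][0] += amount
        totals[(product_id, month)][1] += 1

    # Rebuild the mart table from scratch on each periodic load.
    dm = sqlite3.connect(mart)
    dm.execute("DROP TABLE IF EXISTS monthly_sales_fact")
    dm.execute("""CREATE TABLE monthly_sales_fact
                  (product_id INTEGER, month TEXT,
                   total_amount REAL, txn_count INTEGER)""")
    dm.executemany(
        "INSERT INTO monthly_sales_fact VALUES (?, ?, ?, ?)",
        [(pid, m, amt, n) for (pid, m), (amt, n) in totals.items()],
    )
    dm.commit()
    dm.close()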
As mentioned earlier, data marts are designed from the requirements of a department, so different departments will have different data marts. Star joins involving fact tables and dimensions suit a data mart but not a warehouse, simply because the data warehouse does not cater to any one department. Data mart structures are therefore, in general, neither reusable nor flexible and do not reconcile easily across departments, whereas the granular data in the warehouse is all of those things. When data passes from the data warehouse to the data marts, it moves from a normalized world to a multidimensional world.
Unlike the data mart, which is shaped by departmental processing requirements, the data warehouse is designed from enterprise requirements. Those enterprise requirements are gathered using what is known as the Zachman framework.
This is a convenient device that lays out a matrix covering all perspectives; each cell of the matrix pairs a perspective with a context. The rows correspond to perspectives: scope, the enterprise model, the system model, the technology model, components, and the functioning system. The columns correspond to aspects: data, function, network, people, time, and motivation. By exhaustively covering every perspective, this approach lets the enterprise requirements be articulated from the resulting blueprint. These requirements subsequently lead to the enterprise data model, which helps in building the data warehouse. A good deal of order and discipline is required to use this framework well.
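A toy representation of the matrix, just to make the row and column idea concrete; the cell contents here are placeholders I am assuming for illustration, not deliverables prescribed by the framework.

# Rows are perspectives, columns are aspects; each cell stands for the
# artifact produced for that perspective and aspect.
perspectives = ["scope", "enterprise model", "system model",
                "technology model", "components", "functioning system"]
aspects = ["data", "function", "network", "people", "time", "motivation"]

zachman = {p: {a: f"{a} artifact at the {p} level" for a in aspects}
           for p in perspectives}

# For example, the enterprise-level data cell feeds the enterprise data
# model, which in turn drives the design of the data warehouse.
print(zachman["enterprise model"]["data"])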
