Cluster computing

Tuesday, December 31, 2013

We now look at advanced topics of the data warehouse. Some of these don't fit well with others so they are covered here. We start with the requirements. Where do these come from ? They come from the common data model.The end user requirements influence the model. Their role is indirect. The model has different levels - high, mid and low-levels. The high levels consist of the subject areas. The mid level identifies the keys, attributes and relationships and other aspects of the details of the data warehouse. The low level is where the physical database design is done including partitioning, foreign key relationships and indexes. When the relational database is ready, the end user requirements are used to show how the data is to be reshaped.
Resource contentions in the data warehouse often occur due to statistical processing. If they occur periodically, the statistical processing can be interspersed. If there is no room for the statistical processing, it is better to build out a special warehouse called exploration data warehouse for the statistical processing. The purpose of an exploration warehouse is to provide a foundation for heavy statistical analysis.
Another reason for separation of the exploration warehouse from the main is that the statistical processing requires very different environment from the regular analytic processing. Another reason is database design. For example , a convenience field is often added for the purpose of streamlining the statistical analysis and to recast the data warehouse. The exploration warehouse is seldom a direct copy of the data found in the data warehouse.
There is a difference between the exploration warehouse and the data mining warehouse. They are similar but the difference is that the exploration warehouse is used to create assertions, hypothesis and observations while the data mining warehouse is used to prove the strength of the truth of the hypothesis. The exploration warehouse must provide a breadth of information. The data mining warehouse needs to have depth.
The exploration warehouse is not refreshed with the most current detailed data unlike the data warehouse. This is because the processing on the exploration warehouse requires the data to be frozen.
One other factor in the differences between the main data warehouse and the exploration warehouse is that the external data fits easily into the exploration warehouse. External results and internal results often provide interesting comparisons. On the other hand, the external data is difficult to integrate with the fine detailed structure of the warehouse.
Another topic is whether data marts or data warehouses should share the same processor. There are several reasons why : First, the processors are expensive, separating out the data marts onto separate machines brings the costs down. Second the workloads are different so they are more manageable. Third, different departments can take ownership of the data marts.
There is a life cycle of data as it enters the corporation and then is used. It is first captured, then edited, then loaded into the application, then online data is integrated, and passed into the warehouse, where the frequency of usage drops, then sent to near line storage and subsequently retired to archival storage.
We also mention testing and the data warehouse. A test environment is a good staging ground. This means that the data changes can be made without affecting the public before being made visible. The size of the data in the warehouse could be just enough to test the changes.
Tracing the flow of data through the warehouse is another practice. This means there the data is traced as it flows through an initial capture, an ETL enterprise integration, a Datamart transformation, an exploration warehouse or data mining and near-line storage . Tracing gives information on the data such as what revenue when there may be mutliple definitions or unclear origin or lineage. One common issue with tracing is the name change or value change. Another less common issue is the recalculation of the unit of data. Both these data and metadata changes need to be considered when tracing.
Another factor related to the tracing is the velocity. The velocity of data refers to the speed at which it passes through the warehouse from the initial capture to the point of use by an analyst. Integration may hinder velocity
Data may also be pushed from the operational environment and into the data warehouse or pulled by data marts from the warehouse. Both affect its velocity.
An important application of data warehouse is eBusiness. From the start eBusiness systems were kept separate from the operational environment. With the data warehouse and ODS, its easier to interface with the web and corporate information.

Cluster computing

Tuesday, December 31, 2013

No comments:

Post a Comment