Saturday, December 21, 2013

We saw how the granularity of the data warehouse affects the design and how it is decided. The next step is to anticipate the needs of the different architectural entities that will be fed from the data warehouse. This means the data warehouse has to serve as the lowest common denominator that suits all of those entities: data that is too fine can always be summarized, but the reverse is much harder.
There is a feedback loop between these systems and the warehouse. We try to keep it harmonious with the following techniques:
Start by developing the warehouse in rapid prototypes of very small steps, seek early feedback, and make fast adjustments.
Use the existing granularity in the participating systems to gain insight into what works.
Make recommendations based on the experience of each iteration and keep the changes visible to the users.
Meet for joint application design sessions and simulate the output for feedback.
If the granularity has to be raised, it can be done in many ways, such as:
Summarizing the data from the sources as it goes to the target.
Averaging the data, or selecting only the minimum and maximum values, into the target.
Selecting only the data that is needed into the target.
Selecting a subset of the data using conditional logic.
Thus aggregation is used for summarizing, and it can be based on well-known operations such as average, min, and max, or on user-defined aggregations, as in the sketch below.
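To make this concrete, here is a small Python sketch; it is only an illustration, not from any particular warehouse: the call-detail-style source rows, the field names, and the billed-minutes rule are all made up. It shows granularity being raised from individual events to one summary row per customer per day using count, average, min, max, conditional selection, and a user-defined aggregate.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fine-grained source rows: one record per call event.
source_rows = [
    {"customer": "C1", "date": "2013-12-20", "duration_sec": 120, "intl": False},
    {"customer": "C1", "date": "2013-12-20", "duration_sec": 300, "intl": True},
    {"customer": "C2", "date": "2013-12-20", "duration_sec": 45,  "intl": False},
]

def raise_granularity(rows):
    """Summarize event-level rows into one row per (customer, date).

    Combines the techniques listed above: aggregation (count, average,
    min, max), conditional selection (international calls counted
    separately), and a user-defined aggregate (billed minutes).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["customer"], row["date"])].append(row)

    target = []
    for (customer, date), calls in groups.items():
        durations = [c["duration_sec"] for c in calls]
        target.append({
            "customer": customer,
            "date": date,
            "call_count": len(calls),
            "avg_duration_sec": mean(durations),
            "min_duration_sec": min(durations),
            "max_duration_sec": max(durations),
            # Conditional logic: count only the subset that is international.
            "intl_call_count": sum(1 for c in calls if c["intl"]),
            # User-defined aggregation: round each call up to whole minutes.
            "billed_minutes": sum(-(-c["duration_sec"] // 60) for c in calls),
        })
    return target

if __name__ == "__main__":
    for summary in raise_granularity(source_rows):
        print(summary)
```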
The feedback loop is probably the single most distinguishing characteristic of the process of building the warehouse. Without the feedback loop with the DSS analyst described above, it becomes very hard to build the warehouse right.
Dual levels of granularity are often used in different industries. The first level is, say, sixty days' worth of operational data, and the second level is a lightly summarized ten-year history.
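Here is a rough sketch of the dual levels, with the caveat that the sixty-day window, the monthly roll-up, and the sales-style fields are just assumptions for illustration: recent rows are kept at full detail while older rows are lightly summarized by month.

```python
from collections import defaultdict
from datetime import date, timedelta

def split_dual_granularity(rows, today, window_days=60):
    """Keep the last `window_days` of rows at full detail (first level)
    and lightly summarize everything older by month (second level)."""
    cutoff = today - timedelta(days=window_days)
    detail = []
    history = defaultdict(lambda: {"row_count": 0, "total_amount": 0.0})

    for row in rows:
        if row["date"] >= cutoff:
            detail.append(row)                           # full operational detail
        else:
            key = (row["date"].year, row["date"].month)  # lightly summarized history
            history[key]["row_count"] += 1
            history[key]["total_amount"] += row["amount"]
    return detail, dict(history)

# Illustrative usage with made-up sales rows.
rows = [
    {"date": date(2013, 12, 1), "amount": 10.0},
    {"date": date(2013, 6, 15), "amount": 25.0},
    {"date": date(2013, 6, 20), "amount": 5.0},
]
detail, history = split_dual_granularity(rows, today=date(2013, 12, 21))
print(len(detail), history)
```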
Since data from the data warehouse is usually fed into the data marts, the granularity of the data warehouse has to be at least as fine as the lowest level of detail required by any of the data marts.
When the data spills over to overflow storage, granularity is no longer a consideration. To store data in the overflow, two pieces of software are used: a cross-media storage manager (CMSM) that manages the traffic between disk and the alternate storage, and an activity monitor that determines the spill cutoff.
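Neither the CMSM interface nor any particular cutoff rule is spelled out here, but as a sketch of the idea, an activity monitor could track the last access to each data segment and nominate for spilling whatever has been idle past a threshold. The segment names and the ninety-day threshold below are assumptions.

```python
from datetime import datetime, timedelta

class ActivityMonitor:
    """Tracks how recently each data segment was accessed so that a spill
    cutoff can be chosen; segments idle past the threshold would be moved
    to alternate storage by the cross-media storage manager (not shown)."""

    def __init__(self, idle_threshold=timedelta(days=90)):
        self.idle_threshold = idle_threshold
        self.last_access = {}          # segment id -> last access time

    def record_access(self, segment_id, when=None):
        self.last_access[segment_id] = when or datetime.now()

    def segments_to_spill(self, now=None):
        now = now or datetime.now()
        return [seg for seg, ts in self.last_access.items()
                if now - ts > self.idle_threshold]

# Illustrative usage: the "2013-Q1" segment has gone cold, so it is
# nominated for spilling, while "2013-Q4" stays on disk.
monitor = ActivityMonitor()
monitor.record_access("2013-Q1", datetime(2013, 6, 1))
monitor.record_access("2013-Q4", datetime(2013, 12, 20))
print(monitor.segments_to_spill(now=datetime(2013, 12, 21)))
```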
The data warehouse has to manage large amounts of data; the size could run into terabytes or petabytes. The issue of storing and managing such volumes plays into every technology used in the warehouse. There are technologies for managing volume, for managing multiple storage media, for indexing and monitoring freely, and for receiving data from and passing it to a wide variety of other technologies.
The hierarchy of data storage, in terms of speed and access, is as follows: main memory, expanded memory, cache, DASD, magnetic tape, near-line storage, optical disk, and fiche.
Technology that supports indexing is also important. The cost of creating and using an index must not be significant, and a wide variety of index types may need to be supported. Monitoring the data warehouse is equally important and involves detecting when reorganization is required, when an index is poorly structured, whether there is little or no spillover data, the composition of the data, and the remaining available space.
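As a rough illustration of such monitoring, the snippet below checks a snapshot of metrics against simple thresholds. The metric names and threshold values are assumptions chosen for the example, not anything prescribed above.

```python
def monitor_warehouse(metrics,
                      min_free_space_ratio=0.15,
                      max_full_scan_ratio=0.30):
    """Return warnings based on simple, assumed thresholds.

    `metrics` is a hypothetical snapshot such as:
        {"free_space_ratio": 0.10,           # remaining space / total space
         "full_scan_ratio": 0.45,            # queries that bypassed indexes
         "rows_out_of_cluster_ratio": 0.25}  # rows stored out of sort order
    """
    warnings = []
    if metrics["free_space_ratio"] < min_free_space_ratio:
        warnings.append("Low remaining space: plan to add or spill storage.")
    if metrics["full_scan_ratio"] > max_full_scan_ratio:
        warnings.append("Many full scans: indexes may be poorly structured.")
    if metrics["rows_out_of_cluster_ratio"] > 0.20:
        warnings.append("Data drifting out of order: reorganization may be required.")
    return warnings

print(monitor_warehouse({"free_space_ratio": 0.10,
                         "full_scan_ratio": 0.45,
                         "rows_out_of_cluster_ratio": 0.25}))
```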
