Sunday, December 15, 2013

In today's post, we will read from a book "building the data warehouse" by Inmon. We discussed briefly about CIF Architecture and comparison with Star-Kimball model. We look at data warehouse in more detail now.
Data Warehouse is an accumulation of data. It has different levels of detail. There is an older level of detail, a current level of detail and a level of lightly summarized data (the data mart level), and a level of highly summarized data Before data makes its way into the warehouse, there's tremendous transformation involved. When data ages, it moves from current detail to older detail. When data is summarized it moves from current detail to lightly summarized and then to highly summarized.  The data across levels is organized based on subject areas. Typical subject areas in a corporate data model involve Customer, Product, Transaction, Policy, Claim and Account. A subject area could consist of several (order of tens or hundreds of) physical tables. If we take a subject area such as a Customer, there could be a physical table for base customer say between 1985-87, another for 1988-1990. There could be a cumulative table for 1986-1990 and a summary record for each customer record based on customer activity per month. All of the physical tables are related by a common key - the customer ID. Direct Access Storage Device, magnetic tape, fiche and optical disk can all be used to store data. a fiche is good for storing data that does not have to be reproduced in an electronic medium again
Some tables are organized on a from date to date basis while others are cumulative.
Due to their accumulation and size, Data warehouses are not built all at once. For example, on day 1 there may be just the legacy systems doing operational, transactional processing. On day 2, the first few tables of the first subject area of data warehouse are populated. Users start to discover warehouse and analytical processing. On day 3, there are more subjects and more curiosity drawing more users and a DSS analyst. On day 4, more data is extracted from operational environment, transformed and placed in the data warehouse. Many DSS applications spring up. Access needs to be granted for usage and some users are
On day 5, departmental databases come up. Departments now extract the data from the warehouse to their level for analysis.
On day 6, the land rush to departmental, multidimensional systems takes place. This is because the departmental data is now directly available from the warehouse. And users no longer look at the data level details but focus on the departmental analytical processing.
On day n, the architecture is fully developed. Although at this point it looks like a spider web, it is not one. "The Cabinet Design" article on May 1991 explains why.
Note that the level of the detail or the summarization unit of the data plays a very critical and pervasive role in the design of the data warehouse. This is referred to as Granularity.
In operational systems, the data is mandatorily stored at the lowest granularity. In data warehouses, such assumptions are not made. For eg. subscriber call records for a month are a high level of detail and a low level of granularity. But the summary of same subscriber calls are a low level of detail and a high level of granularity. Note the number of records to search the existence of one record is order of magnitudes greater in the first than in the second.
In most retail companies, there are usually dual-levels of granularity. There is a level of granularity for the true archival level  and a lightly summarized data in the other level. The lightly summarized data can come from say 30 days' details maintained by the operational data.

 

No comments:

Post a Comment