Sunday, December 29, 2013

In this post, we will discuss the really large data warehouse. As we have seen, data warehouses grow at a rapid rate, from a few gigabytes to terabytes in a few years. This is primarily due to the following contributors:
1) history - nowhere else in the system do we keep historical data the way we do in the warehouse. Application developers are quick to remove historical data because it affects performance.
2) granularity - the data warehouse collects data at the most granular level. This lets developers look at the data like never before.
3) integration - the data warehouse brings together data from different operational systems. It is the database for the entire enterprise, with a common data model, common keys, common reference tables, and common definitions.
These are summarized in the book as:
historical data x diverse data x detailed data = lots of data
This volume of data has three impacts: 1) the associated costs, 2) the value in terms of how useful the data actually is, and 3) the maintenance effort dictated by the rules of data management.
Basic data management activities include loading the data, creating indexes, and so on.
The real costs of storage include the costs of the processor software (DBMS and operating system), the communications controller, the disk controller, and the data store itself.
As the data grows, the percentage of data that is actually used shrinks. For example, at 50 GB almost all the data is used; at 500 GB only about 75% is used; at 2 TB about 35% is used; and at 10 TB only about 5% is used. This is important to understand because it enables efficient planning of storage technology.
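To make the planning implication concrete, here is a minimal sketch that splits a warehouse into actively used and dormant data. The percentages are the rough figures cited above; the helper function and the idea of an active/dormant split are purely illustrative.

# Illustrative sketch: estimate active vs dormant data at different warehouse sizes,
# using the rough usage percentages cited in the post.

USAGE_BY_SIZE_GB = {
    50: 1.00,      # ~all data used
    500: 0.75,     # ~75% used
    2_000: 0.35,   # ~35% used
    10_000: 0.05,  # ~5% used
}

def split_active_dormant(total_gb: float, usage_fraction: float) -> tuple[float, float]:
    """Split total storage into actively used and dormant (rarely touched) data."""
    active = total_gb * usage_fraction
    return active, total_gb - active

for size_gb, fraction in USAGE_BY_SIZE_GB.items():
    active, dormant = split_active_dormant(size_gb, fraction)
    print(f"{size_gb:>6} GB warehouse: ~{active:,.0f} GB active, ~{dormant:,.0f} GB dormant")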
To calculate this usage of data, we first determine the size of the data access. We calculate this as:
number of users x queries per day x bytes per query x 200 days
We can also divide by an overlap factor that accounts for the same data being pulled more than once.
Typical numbers are 250 GB of data servicing 600 users.
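As a worked sketch of that formula: the 600-user and 200-day figures come from the post, but the queries per day, bytes per query, and overlap factor below are assumed values chosen only for illustration.

# Illustrative sketch: estimate the volume of data accessed per year.
# Only the user count and the 200 working days come from the post; the rest are assumptions.

users = 600                  # number of users (typical figure cited above)
queries_per_day = 10         # assumed average queries per user per day
bytes_per_query = 500_000    # assumed average bytes pulled per query
working_days = 200           # working days per year, as in the formula above
overlap_factor = 2.5         # assumed: same data pulled more than once

raw_access_bytes = users * queries_per_day * bytes_per_query * working_days
distinct_access_bytes = raw_access_bytes / overlap_factor

print(f"raw access per year:     {raw_access_bytes / 1e9:.0f} GB")
print(f"distinct bytes accessed: {distinct_access_bytes / 1e9:.0f} GB")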
Next, the usage of data is calculated as a ratio:
usage ratio = actual bytes used / total data warehouse bytes
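A minimal sketch of that ratio follows; both input figures are assumptions made up for the example, not numbers from the book.

# Illustrative sketch of the usage ratio; both inputs are assumed figures.

actual_bytes_used = 0.24e12       # bytes actually touched, e.g. the distinct access estimate above
total_warehouse_bytes = 2e12      # total warehouse size, e.g. a 2 TB warehouse

usage_ratio = actual_bytes_used / total_warehouse_bytes
print(f"usage ratio: {usage_ratio:.0%}")   # fraction of the warehouse actually used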
Note that the separation between frequently used and infrequently used data differs from system to system. In an OLTP environment, access is random and the two are interspersed. In a DSS environment, the division between frequently and infrequently used data is much sharper.

