Cluster computing: Data Warehouses

Friday, February 1, 2013

Data Warehouses

Data Warehouses are large historical databases that get loaded with data periodically. They are then used for decision support which account for nearly one third of all DBMS activity. They have come to require custom query optimization and execution engine support. The first of these is that they are not suited for online transaction processing OLTP where response time was critical. Historical data, on the other hand, was used for analysis such as which items to put on promotion and where layout was important for the upcoming season. These kind of queries were based on the wisdom that historical information enabled better stock management and more savings. The second of these is that such databases required a schema very typical and different from others such as star, multi-level star or snowflake schema where customers, products, stores, times etc. were the dimesions or fact tables. Many dimensions were naturally hierarchical. and hence the use of multi-level star or snowflake. The third of these is that they required very different data structures from the B+-trees which were optimized for fast insertion, deletion and update of records. In contrast, a data warehouse performs an initial load and then the data is static for months. Bitmaps helped store this information with far fewer bits and were also advantageous for various sophisticated bitmap arithmetic including conjunctive filters. The fourth difference in these were that these databases required fast loads of large data periodically. These loads not only were accumulated data but also loaded from systems during say night time.Loads and queries could potentially conflict and hence there was a need to provide techniques such as update-in-place and historical queries. Updates were timestamped and MVCC isolation were provided by a few vendors. The fifth difference is that the joins on the data were very costly to perform and hence there was a need to capture the view and persist it till the next update. The sixth difference is that these databases supported predictable queries that consisted of various aggregates. These aggregates often called data cubes were special class of materialized views that allowed user to navigate the data. Lastly, such systems required special hardware and some vendors like Teradata and Netezza provided proprietary hardware.

Cluster computing

Friday, February 1, 2013

Data Warehouses

No comments:

Post a Comment