Cluster computing

Sunday, December 22, 2013

Some more technological requirements from the data warehouse involve:
the ability to load the warehouse efficiently
the ability to use indexes efficiently
the requirement to store data in a compact way
the requirement to support composite keys.
Data is loaded into the warehouse either on a record by record basis or through a language interface or in bulk with a utility. The latter is faster however it is better to repeatedly move a few at a time. As the burden of the volume of loading becomes an issue, the load is often parallelized. Once the input data is divided, each job stream is executed independently of the other job stream. If its' possible to defer loading until all the transformations are complete, that may be an advantage as well. In this way, another related approach to the efficient loading of very large amounts of data is staging the data prior to loading. Staging helps when the complexity of processing is high or when merges of data is necessary.
Data warehouse may need to support the creation and loading of new indexes and these indexes must be accessed efficiently. Indexes can be bit maps, or have multiple levels. The index may be stored entirely in memory or compacted when the order of the data being indexed allows such compaction, creating the selective indexes or range indexes. Compaction of data is a crucial tool in efficient data warehouse. Data in the warehouse is seldom updated once inserted, and hence this comes useful. Compaction and de-compaction are CPU bound and not I/O bound, so their execution is not a concern.
Similarly, compound keys are important to the data warehouse because subjects are usually in different tables related by foreign keys and because the data is time variant. Both these factors require compound keys.
We take a look at some of the database features.
In the data warehouse, since data is relatively stable, variable length records pose no problems.
Moreover, due to the variety of data found in the data warehouse, variable length structuring of data must be supported.
Lock Manager in databases is used widely to prevent updates to the same record at the same time. Since the data warehouse doesn't have any updates and because lock manager consumes significant amount of resources, turning this feature on or off is helpful to the performance of the warehouse.
Index only processing is another feature, It is possible to service a request by simply using the index instead of the associated data. In the case of data warehouse, such a feature will tremendously improve performance . For the feature to work, the query should have a hint that the index has been specified.
Fast Restore - A restore from a direct access storage device (DASD) is usually the norm. That's why there is copying between the recovery repository and the DASD. But when the restore can occur directly from the secondary storage, the data warehouse benefits tremendously. Fast restore helps with error detection.
Finally it is preferable to not use transacted operations in the data warehouse. They become too complicated and are prone to a lot of resource consumption. Instead a robust non-transacted with retry logic can be used.

Cluster computing

Sunday, December 22, 2013

No comments:

Post a Comment