Tuesday, December 24, 2013

After the data warehouse is built, the focus shifts from construction to day-to-day operations. It turns out that the cost of operating and maintaining a warehouse is high, and the volume of data keeps growing. Usage by the DSS community may cause contention for resources, but the biggest expense comes from the periodic refreshes of legacy data.
In order to refresh the data, it must be read from old legacy databases. Repeated and direct reads of old legacy databases are costly operations. First, the legacy DBMS must be active and online during the read process. The read must occur within a window of time, and stretching that window is not welcome. Worse, the refreshment scan must process the entire legacy file when only a small percentage of it is actually needed.
One mitigation for such repeated reads is to trap the data as it is being updated; the trapped data can then be processed offline.
Data can be trapped either by data replication or by change data capture. Both techniques are effective. The first, replication, requires the data of interest to be identified prior to the update; a trigger then traps the update activity, so the content and format of the trapped data are clean and well defined.
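To make the replication-style trap concrete, here is a minimal Python sketch; it is illustrative only, and every name in it (CustomerTable, TrappedChange, trap_queue) is invented, since a real DBMS would do this with native triggers rather than application code.

from dataclasses import dataclass
from typing import Any

@dataclass
class TrappedChange:
    table: str
    key: Any
    before: dict
    after: dict

trap_queue: list[TrappedChange] = []   # changes staged for offline processing

class CustomerTable:
    # The columns of interest are identified *prior* to update, as replication requires.
    replicated_columns = {"balance", "status"}

    def __init__(self):
        self.rows: dict[int, dict] = {}

    def update(self, key: int, changes: dict) -> None:
        before = dict(self.rows.get(key, {}))
        self.rows.setdefault(key, {}).update(changes)
        # The "trigger": trap the update if it touches a replicated column.
        if self.replicated_columns & changes.keys():
            trap_queue.append(TrappedChange("customer", key, before, dict(self.rows[key])))

The warehouse refresh later drains trap_queue offline instead of re-reading the legacy database.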
The second approach is change data capture, which captures all the changes that have occurred over some period, say a day, by reading the DBMS log or journal tape. While that can be cumbersome, it runs offline and captures every update.
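Here is a similarly minimal sketch of the log-reading approach, assuming a toy text journal with one txn,table,key,column,value record per update; real DBMS logs are binary and far richer, so this only shows the shape of the processing.

import csv
from collections import defaultdict

def capture_changes(log_path: str, committed_txns: set) -> dict:
    """Read the day's journal offline and return the net change per row."""
    net_changes = defaultdict(dict)
    with open(log_path, newline="") as log:
        for txn, table, key, column, value in csv.reader(log):
            if txn not in committed_txns:
                continue                               # ignore rolled-back work
            net_changes[(table, key)][column] = value  # last write wins
    return net_changes

Only the net changes are then applied to the warehouse, so the full legacy file never needs to be rescanned.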
As with any production database environment, data warehouse testing seems to call for two parallel environments to be set up: one for staging and one for production. The staging environment is where programmers test out new programs.
However, due to the size of the warehouse, providing resources for even one of these environments is hard at times. Moreover, programs are run against the warehouse in a heuristic, exploratory manner rather than in the repetitive manner of conventional databases. If a program causes an issue with the data in the warehouse, the program is simply corrected and rerun.
To summarize, data warehouses must support:
1) robust language interface
2) compound keys and variable length data
3) manage large amounts of data
4) easily index and monitor data
5) interface with a wide number of technologies
6) allow the data to be placed directly on disk
7) store and access data in parallel
8) have metadata control in the warehouse
9) efficiently load the warehouse (see the sketch after this list)
10) efficiently use the indexes
11) store data in a compact way
12) support compound keys
13) selectively turn off the lock manager
14) do index-only processing
15) quickly restore from bulk-storage
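As a rough illustration of items 9, 10, and 13, the sketch below uses SQLite purely as a stand-in for a warehouse DBMS; the table and index names are made up, and real warehouse DBMSs have dedicated bulk-load utilities.

import sqlite3

def bulk_load(rows):
    con = sqlite3.connect("warehouse.db")
    con.execute("PRAGMA synchronous = OFF")       # relax safety machinery for the load window (item 13 in spirit)
    con.execute("PRAGMA journal_mode = MEMORY")
    con.execute("CREATE TABLE IF NOT EXISTS sales(day TEXT, sku TEXT, qty INT)")
    con.execute("DROP INDEX IF EXISTS idx_sales_day")   # cheaper to rebuild after the load
    with con:                                     # one big transaction: item 9
        con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.execute("CREATE INDEX idx_sales_day ON sales(day)")  # item 10: indexes back in place
    con.close()

bulk_load([("2013-12-24", "A1", 5), ("2013-12-24", "B2", 3)])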

We also discussed that multidimensional OLAP technology is suited to data mart processing rather than data warehouse processing, and that metadata plays a different role in the data warehouse than it does in the operational environment.
Most organizations build a single centralized data warehouse environment.

Yet data warehouses are not always centralized: businesses may be distributed geographically or across multiple, differing product lines.
Distributed data warehouses are of three different types:
1) local and global data warehouses, where the former represents data and processing at a remote site and the latter represents integration across the business (sketched after this list)
2) technologically distributed data warehouses, where a single logical data warehouse is spread out over multiple physical data warehouses
3) independently evolving distributed data warehouses, where one warehouse comes up and then another, in an uncoordinated manner
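As a toy illustration of the first type, the sketch below (site names and schema invented) shows local warehouses holding site-level detail while a global step integrates across the business:

local_warehouses = {
    "emea": [{"sku": "A1", "revenue": 120.0}, {"sku": "B2", "revenue": 80.0}],
    "apac": [{"sku": "A1", "revenue": 200.0}],
}

def integrate_globally(locals_by_site: dict) -> dict:
    """Roll site-level detail up into the corporate (global) view."""
    global_view = {}
    for site, rows in locals_by_site.items():
        for row in rows:
            global_view[row["sku"]] = global_view.get(row["sku"], 0.0) + row["revenue"]
    return global_view

print(integrate_globally(local_warehouses))   # {'A1': 320.0, 'B2': 80.0}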
 
