Tuesday, December 24, 2013

After the data warehouse is built, the focus shifts from building the warehouse to day-to-day operations. It turns out that the cost of operating and maintaining a warehouse is high, and the volume of data keeps increasing. Usage by DSS analysts can cause contention for resources, but the biggest expense comes from the periodic refreshes of legacy data.
In order to refresh the data, it must be read from the old legacy databases. Repeated, direct reads of legacy databases are costly. First, the legacy DBMS must be active and online during the read. The read must also occur within a processing window, and stretching that window is not welcome. Worse, the refreshment scan must process the entire legacy file when only a small percentage of it is actually required.
A mitigation to such repeated reads is to trap the data as it is being updated; the trapped data can then be processed offline.
Data can be trapped either by data replication or by change data capture. Both techniques are effective. The first requires the data of interest to be identified prior to update, and a trigger causes the update activity to be trapped; the content and format of the trapped data are clean.
The second approach is change data capture. This approach captures all the changes that have occurred, say, in a day by reading the log or journal tape. While that can be cumbersome, it runs offline and captures all updates.
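A minimal sketch of how trapped changes might be processed offline, assuming a hypothetical change log laid out as (timestamp, table, key, operation, payload) records; only the latest image of each changed record is kept for the refresh:

# Hypothetical sketch: fold a day's captured changes into a staging area
# offline, so the legacy database never has to be rescanned in full.
import csv

def apply_change_log(log_path):
    """Read captured changes (one CSV row per change) and keep the latest
    image of each changed record, ready to be loaded into the warehouse."""
    staged = {}                        # (table, key) -> latest captured row
    with open(log_path, newline="") as f:
        for ts, table, key, op, payload in csv.reader(f):
            if op == "DELETE":
                staged.pop((table, key), None)
            else:                      # INSERT or UPDATE: keep the newest image
                staged[(table, key)] = {"ts": ts, "table": table,
                                        "key": key, "data": payload}
    return list(staged.values())

# Usage (file name is illustrative): changes = apply_change_log("changes_2013_12_24.csv")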
As with any production database environment, data warehouse testing seems to call for two parallel environments to be set up, one for staging and one for production. The staging environment is where programmers test out new programs.
However, due to the size of the warehouse, providing resources for even one of these environments is hard at times. Moreover, programs are run against the warehouse in a heuristic rather than repetitive manner, unlike with conventional databases. If there is an issue with the data in the warehouse, the program is simply rerun.
As a summary, data warehouses must support:
1) robust language interface
2) compound keys and variable length data
3) manage large amounts of data
4) easily index and monitor data
5) interface with a wide variety of technologies
6) allow the data to be placed directly on disk
7) store and access data in parallel
8) have metadata control in the warehouse
9) efficiently load the warehouse
10) efficiently use the indexes
11) store data in a compact way
12) support compound keys
13) selectively turn off the lock manager
14) do index-only processing
15) quickly restore from bulk-storage

We also discussed that multi-dimensional OLAP technology is suited for data mart processing rather than data warehouse processing, and that metadata plays a different role in the data warehouse.
Most organizations build a single centralized data warehouse environment.

Data warehouses are not always centralized. Businesses may be distributed geographically or over multiple, differing product lines.
Distributed data warehouses are of three different types:
1) local and global data warehouses, where the former represents data and processing at a remote site and the latter represents the integration across the business
2) technologically distributed data warehouses, where a single logical data warehouse is spread out over different physical data warehouses
3) independently evolving distributed data warehouses, where one warehouse comes up and then another, in an uncoordinated manner
 

Monday, December 23, 2013

The data warehouse has special needs, so DBMSs have evolved to include data warehouse-specific features. Instead of transaction processing, the data warehouse calls for load-and-access processing. Data is integrated, transformed, and loaded into the warehouse from the operational environments and the ODS. The data warehouse does not have updates. Instead, data is stored as a series of snapshot records: when there is a change, a new snapshot record is added rather than an existing record being updated.
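As a minimal illustration of the snapshot idea (the customer table and fields are invented for the example), a change produces a new snapshot row keyed by business key and snapshot date rather than an update in place:

# Hypothetical sketch: changes are recorded as new snapshot rows keyed by
# (business key, snapshot date) instead of updating an existing row.
from datetime import date

warehouse = []   # list of snapshot records

def record_snapshot(customer_id, balance, as_of):
    warehouse.append({
        "customer_id": customer_id,
        "snapshot_date": as_of,
        "balance": balance,
    })

record_snapshot("C-100", 250.00, date(2013, 12, 1))
record_snapshot("C-100", 310.00, date(2013, 12, 22))   # a change: new snapshot, no update

# History is preserved; the "current" value is simply the latest snapshot.
latest = max((r for r in warehouse if r["customer_id"] == "C-100"),
             key=lambda r: r["snapshot_date"])
print(latest["balance"])   # 310.0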
A data warehouse also requires different storage media, often used in a dual environment: one processing environment is the DASD environment, where online interactive processing is done, and the other is tape or mass store, where data is accumulated.
The DASD environment could use a different vendor from the mass store environment. It is also possible that the DASD environment is split over more than one vendor. As long as these splits are deliberate rather than political or historical, they should not pose issues.
When comparing the operational environment to the data warehouse environment, the role of metadata is also very different. In the operational environment, the audience for the metadata is the IT professional, who is usually computer-savvy, whereas in the data warehouse the metadata is used by the DSS analyst, who can use as much help as possible; hence the need for metadata services.
In the data warehouse, the metadata tracks the significant transformations to the data as it passes from the operational environment to the data warehouse.
Moreover, data in a data warehouse exists for a lengthy span of time and the warehouse changes its structure in this time. Keeping track of this change in structure is a natural task for the metadata. In the operational environment on the other hand, the metadata keeps track of one and only one correct definition of the structure of the data.
Also, operational systems focus on current data, such as the current balance or the current inventory. Trends, however, are not available from current data; they become apparent only when data has accumulated over time. For the span of time over which this data accumulates, a new dimension called data context becomes important. Data context is all the contextual information, such as changes in the data source, that may explain discrepancies in trends drawn from the accumulated data alone.
Three levels of context information must be maintained - simple, complex and external.
Simple contextual information relates to the basic structure of the data itself and includes structure, encoding, naming conventions, and metrics. Complex contextual information includes such things as product definitions, marketing territories, pricing, and packaging. External contextual information includes such things as inflation, financial trends, taxation, and economic growth. Complex and external contextual information can vary a lot and are harder to keep in a structured form.
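One way to picture the three levels, as a hedged sketch in which every field name is illustrative rather than prescribed: simple context travels with the table definition, while complex and external context are looser, dated notes kept alongside it.

# Hypothetical sketch of the three levels of contextual information kept
# alongside warehouse metadata; all field names are invented for illustration.
context = {
    "simple": {             # structure of the data itself
        "table": "monthly_sales",
        "encoding": "currency in USD cents",
        "naming_convention": "snake_case",
        "metrics": ["units_sold", "revenue"],
    },
    "complex": [            # product definitions, territories, pricing, packaging
        {"effective": "2013-01-01", "note": "Western territory split into NW and SW"},
    ],
    "external": [           # inflation, financial trends, taxation, economic growth
        {"effective": "2013-07-01", "note": "VAT raised from 8% to 10%"},
    ],
}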
In the past, collecting contextual information has been difficult because the audience was different, the interest was passive, the resources had vanished, and even when it was collected, it was restricted to simple contextual information.

Sunday, December 22, 2013

Some more technological requirements from the data warehouse involve:
the ability to load the warehouse efficiently
the ability to use indexes efficiently
the requirement to store data in a compact way
the requirement to support composite keys.
Data is loaded into the warehouse either record by record through a language interface or in bulk with a utility. The latter is faster, though it is often better to repeatedly move smaller increments rather than everything at once. As the burden of load volume becomes an issue, the load is often parallelized: once the input data is divided, each job stream is executed independently of the others. If it is possible to defer loading until all the transformations are complete, that may be an advantage as well. A related approach to efficiently loading very large amounts of data is staging the data prior to loading. Staging helps when the complexity of processing is high or when merges of data are necessary.
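A rough sketch of such a parallelized, staged load; the round-robin partitioning rule and the load_partition stand-in for a bulk-load utility are assumptions for illustration:

# Hypothetical sketch: stage transformed data into partitions, then load the
# partitions as independent job streams in parallel.
from concurrent.futures import ProcessPoolExecutor

def partition(rows, n_streams):
    """Divide the staged input so each job stream gets its own slice."""
    buckets = [[] for _ in range(n_streams)]
    for i, row in enumerate(rows):
        buckets[i % n_streams].append(row)
    return buckets

def load_partition(rows):
    """Stand-in for a bulk-load utility call; here it just counts rows."""
    return len(rows)

def parallel_load(staged_rows, n_streams=4):
    with ProcessPoolExecutor(max_workers=n_streams) as pool:
        return sum(pool.map(load_partition, partition(staged_rows, n_streams)))

if __name__ == "__main__":
    print(parallel_load(list(range(100_000))))   # 100000 rows loaded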
The data warehouse may need to support the creation and loading of new indexes, and these indexes must be accessible efficiently. Indexes can be bitmaps or have multiple levels. An index may be stored entirely in memory, compacted when the order of the data being indexed allows it, or created as a selective or range index. Compaction of data is a crucial tool for an efficient data warehouse; because data in the warehouse is seldom updated once inserted, compaction is especially useful. Compaction and decompaction are CPU-bound rather than I/O-bound, so their cost is much less of a concern.
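A small sketch of a bitmap index over a low-cardinality column; the column and values are invented. Each distinct value gets a bit vector, queries combine vectors with bitwise operations, and when the data is ordered such vectors compress well (for example with run-length encoding):

# Hypothetical bitmap index over a low-cardinality column.
from collections import defaultdict

def build_bitmap_index(values):
    """One integer-as-bitset per distinct value; bit i is set if row i matches."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return index

regions = ["east", "west", "east", "north", "west", "east"]
idx = build_bitmap_index(regions)

# Rows where region is east OR west: a single bitwise OR, no table scan.
mask = idx["east"] | idx["west"]
matching_rows = [i for i in range(len(regions)) if (mask >> i) & 1]
print(matching_rows)    # [0, 1, 2, 4, 5]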
Similarly, compound keys are important to the data warehouse because subjects are usually spread across tables related by foreign keys and because the data is time-variant. Both factors require compound keys.
We take a look at some of the database features.
In the data warehouse, since data is relatively stable, variable-length records pose no problems.
Moreover, due to the variety of data found in the data warehouse, variable-length structuring of data must be supported.
A lock manager is widely used in databases to prevent simultaneous updates to the same record. Since the data warehouse has no updates, and because the lock manager consumes a significant amount of resources, being able to selectively turn this feature off helps the performance of the warehouse.
Index-only processing is another feature: it is possible to service a request using the index alone, without touching the associated data. In a data warehouse, such a feature can tremendously improve performance. For the feature to work, the request must be answerable from the index, and the query may need a hint indicating that the index should be used.
Fast restore - A restore from a direct access storage device (DASD) is usually the norm, which is why there is copying between the recovery repository and the DASD. But when the restore can occur directly from secondary storage, the data warehouse benefits tremendously. Fast restore also helps when errors are detected.
Finally, it is preferable not to use transacted operations in the data warehouse. They become too complicated and consume a lot of resources. Instead, robust non-transacted operations with retry logic can be used.
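A hedged sketch of the retry idea, assuming the load step is idempotent (reloading the same snapshot rows is harmless), so a failed batch is simply retried rather than wrapped in a long transaction:

# Hypothetical sketch: retry a non-transacted, idempotent load step instead of
# relying on a long-running transaction.
import time

def load_with_retry(load_batch, batch, attempts=3, backoff_seconds=5):
    """load_batch is any idempotent callable; retry it a few times on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return load_batch(batch)
        except Exception:              # in practice, catch the specific DB error type
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)

# Usage (bulk_insert and staged_rows are hypothetical):
# load_with_retry(bulk_insert, staged_rows)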
We looked at indexing and monitoring in a data warehouse. We now look at the interfaces to many technologies. The data warehouse receives data from the operational systems and the ODS and passes it to the data marts, DSS applications, and alternate storage. The passage of data through the warehouse should be smooth and easy. This is an important criterion for the data warehouse; note that it does not mean online passage of data, which is not supported. The tests of whether data passes smoothly through the DBMS are whether the passage requires changes to format, whether the data can be passed to multi-dimensional processing, whether changes to the data can be captured, and whether context is lost in translation.
When data is placed, the developer would like to improve the efficiency of access and update by controlling the placement of data at the physical block or page level.
Another requirement is to manage the data in parallel. This will improve performance.
Another technological requirement is to have a solid metadata control. And yet another technological requirement is to have a rich language interface to the data warehouse.
Parallel storage lets the volume grow arbitrarily. The entire issue of parallel storage and management of data is complex. Performance improves roughly in inverse proportion to the number of physical devices over which the data is physically distributed.
The use of metadata becomes all the more important with the heuristic, iterative development cycle of the data warehouse. The metadata must always represent the warehouse with up-to-date and accurate information. Metadata contains such things as:
table structures
table attributes
source data
mapping from source data to warehouse
data model
extract logging
common routines for access of data
definitions or descriptions of data
relationships between one unit of data and another.
As with data, every technology in the business environment has its own metadata. Report writers, ODS environments and ETL all have their own metadata.
Similarly, the language interface can tremendously improve the usage of data. The requirements of a language interface include the ability to access data a set at a time or a record at a time, use one or more indexes, have a SQL interface, and perform updates, deletes, and inserts.
Note that the CLR and SQL interfaces satisfy these conditions and were developed to that end.
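As a small, DBMS-neutral illustration, Python's built-in sqlite3 module shows the set-at-a-time versus record-at-a-time access described above:

# Illustration of set-at-a-time vs record-at-a-time access through a SQL
# language interface, using the built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 80.0), ("east", 40.0)])

cur = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
print(cur.fetchall())          # set at a time: [('east', 140.0), ('west', 80.0)]

cur = conn.execute("SELECT region, amount FROM sales")
print(cur.fetchone())          # record at a time: ('east', 100.0)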

Saturday, December 21, 2013

We saw how data warehouse granularity affects the design and how it is decided. The next step is to anticipate the needs of the different architectural entities that will be fed from the data warehouse. This means the data warehouse has to serve as the lowest common denominator to suit all the entities: data that is too fine-grained can always be summarized, but the reverse is harder.
There is a feedback loop between these systems and the warehouse. We attempt to make it harmonious with the following techniques:
Start with a rapid prototype of very small steps of developing the warehouse, seek early feedback and make fast adjustments.
Use the existing granularity in the participating systems to gain insight into what works
Make recommendations based on the iteration experience and keep the changes visible to the users.
Meet for joint application design and simulate the output for feedback.
If the granularity has to be raised, it can be done in many ways such as :
Summarizing the data from the sources as it goes to the target.
Averaging or selecting the minimum and maximum values into the target.
Selecting only the data that is needed into the target.
Selecting a subset of data using conditional logic.
Thus aggregation is used for summarizing, and it can be based on the well-known average, min, and max operations or on user-defined aggregations.
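A minimal sketch of raising granularity through aggregation; the daily-detail-to-monthly-summary shape and column names are assumptions for illustration:

# Hypothetical sketch: summarize fine-grained source rows into a coarser
# target (here, daily detail rolled up to monthly totals and averages).
from collections import defaultdict

detail = [  # (customer_id, date "YYYY-MM-DD", amount)
    ("C-1", "2013-12-01", 5.0),
    ("C-1", "2013-12-15", 7.0),
    ("C-2", "2013-12-02", 3.0),
]

def summarize_monthly(rows):
    groups = defaultdict(list)
    for cust, day, amount in rows:
        groups[(cust, day[:7])].append(amount)       # group by customer and month
    return [{"customer": cust, "month": month,
             "total": sum(v), "avg": sum(v) / len(v),
             "min": min(v), "max": max(v)}
            for (cust, month), v in groups.items()]

print(summarize_monthly(detail))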
The feedback loop is probably the single most distinguishing characteristic of a process for building the warehouse. Without the feedback loop with the DSS analyst as described above, it becomes very hard to build it right.
Dual levels of granularity are often used in different industries. The first level is, say, sixty days' worth of operational data, and the second level is a lightly summarized ten-year history.
  Since data from a data warehouse is usually fed into the data marts, the granularity of the data warehouse has to be at least at the lowest level required by any of the data marts.
When data spills over to the overflow, granularity is no longer a consideration. To store data in the overflow, two pieces of software are used: a cross-media storage manager (CMSM) that manages the traffic between disk and alternate storage, and an activity monitor to determine the spill cutoff.
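As a hedged sketch of the activity monitor's role (all names and the 90-day cutoff are illustrative, not from the book), the idea is to pick a spill cutoff from observed access dates so that rarely touched rows move through the CMSM to alternate storage:

# Hypothetical sketch: choose which rows spill to alternate (overflow) storage
# based on how recently they were accessed.
from datetime import date, timedelta

def split_for_overflow(rows, today, keep_days=90):
    """rows: dicts with a 'last_accessed' date. Keep recently used rows on disk;
    the rest go to alternate storage via the cross-media storage manager."""
    cutoff = today - timedelta(days=keep_days)
    on_disk = [r for r in rows if r["last_accessed"] >= cutoff]
    to_overflow = [r for r in rows if r["last_accessed"] < cutoff]
    return on_disk, to_overflow

rows = [{"id": 1, "last_accessed": date(2013, 12, 20)},
        {"id": 2, "last_accessed": date(2012, 3, 1)}]
print(split_for_overflow(rows, date(2013, 12, 21)))   # row 2 spills to overflow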
The data warehouse has to manage large amounts of data; its size can reach terabytes or even petabytes. The issue of storing and managing large amounts of data plays into every technology used in the warehouse. There are technologies for managing volume, managing multiple media, indexing and monitoring data freely, and receiving data from and passing it to a wide variety of technologies.
The hierarchy of data storage in terms of speed and access is as follows: main memory, expanded memory, cache, DASD, magnetic tape, near-line storage, optical disk, and fiche.
Technology that supports indexing is also important. The cost of creating and using an index cannot be significant, and a wide variety of indexes may need to be supported. Monitoring the data warehouse is equally important and involves such things as detecting when reorganization is required, when an index is poorly structured, or when there is very little spillover data, as well as tracking the composition of the data and the remaining available space.
We now look at granularity in the data warehouse. We choose it based on what matters to the environments that depend on the warehouse data as well as what is efficient for the system given the volume of data. By setting the granularity at the right level and spilling the excess volume of data to overflow storage, other aspects of the design and implementation can flow smoothly.
To determine the volume of data, we estimate the number of rows. If there are a few thousand rows, any granularity will do. If there are millions of rows, a low level of granularity can be chosen. If there are billions of rows, a higher level of granularity is required.
The rule of thumb cited in the book suggests the following:
1) For each known table:
we find how big a row can be, with both the biggest and smallest estimates
for the 1-year horizon, we find the number of rows possible
for the 5-year horizon, we find the number of rows possible
for each key of the table, we find its size in bytes
with the number of rows from above, we calculate the maximum 1-year space as the biggest row times the maximum 1-year rows, and the minimum 1-year space as the smallest row times the minimum 1-year rows plus index space
2) We repeat this for all the known tables.
This way we can estimate the volume of data to manage. The author has some interesting caveats here:
first, that projections of the size of the database are usually low; and second, that the growth rate of the warehouse is usually faster than projected.
These row estimates, together with the number of direct access storage devices, help with the planning process. Note that the number of records matters more than the size of the records. In fact, as long as all rows are reasonably sized, the calculation of index entries depends entirely on the number of rows, and these together affect the choice of granularity and of overflow storage.
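A small sketch of the rule of thumb above; the table figures are made up and the index space term is simplified to a fixed number of bytes per row:

# Hypothetical sketch of the space-estimation rule of thumb: for each known
# table, combine biggest/smallest row sizes with 1-year row estimates.
def estimate_space(tables, index_entry_bytes=16):
    """tables: dicts with biggest/smallest row bytes and max/min 1-year rows."""
    estimates = {}
    for t in tables:
        max_space = t["biggest_row_bytes"] * t["max_rows_1yr"]
        min_space = (t["smallest_row_bytes"] * t["min_rows_1yr"]
                     + index_entry_bytes * t["min_rows_1yr"])
        estimates[t["name"]] = (min_space, max_space)
    return estimates

tables = [{"name": "call_detail", "biggest_row_bytes": 200,
           "smallest_row_bytes": 120, "max_rows_1yr": 50_000_000,
           "min_rows_1yr": 20_000_000}]
print(estimate_space(tables))   # bytes: roughly (2.7 GB, 10 GB)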

Friday, December 20, 2013

We continue our discussion on the data warehouse with the indirect use of data warehouse data. We see an emerging pattern in the examples mentioned in the previous post: the warehouse is periodically analyzed by a program that examines relevant characteristics and criteria. This analysis creates a small file in the online environment that contains succinct information about the business of the enterprise. The small online file is used quickly and efficiently, as is customary in the operational environment.
We look at these in more detail now. First, the program that runs on the warehouse is typically a machine learning program and has free rein to run anywhere in the warehouse. It runs in the background, in harmony with the rate at which the data warehouse changes. The refreshment occurs periodically and relatively infrequently, and it replaces existing data rather than updating it.
The online pre-analyzed data file contains only a small amount of data per unit of data and may contain only what the online user/clerk needs. Its ownership and access are entirely within the operational environment.
In the data warehouse, a normalized approach produces the optimal design, with flexibility. It fits well with granular data, it is not optimized for any given set of processing requirements, and it fits very nicely with the data model. Deviations from the normalized design usually come from the common requirements of several departments, for example storing data for all months. Snowflake-like designs, which include star joins, fact tables, and dimensions, also appear in the data warehousing environment.
Unlike data warehouses, data marts are very much shaped by processing requirements, whereas data warehouses have a much broader audience. If we look at the entity relationship diagram for a data mart, all the entities appear equal; in reality, however, some entities have far more data associated with them than others. The design that best exploits this characteristic is the star join. As a simple example of a star join, the entity with the most data carries foreign keys to the identifiers of all associated entities.
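A minimal star-join sketch using Python's sqlite3 module; the fact and dimension tables are invented for illustration, with the fact table carrying foreign keys to each dimension:

# Hypothetical star join: a central fact table with foreign keys to dimensions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_store   VALUES (10, 'Austin');
INSERT INTO fact_sales  VALUES (1, 10, 9.5), (2, 10, 4.0), (1, 10, 3.0);
""")
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.name, s.city
""").fetchall()
print(rows)   # sales totals per product and city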
One of the issues with the data warehouses and data marts is how data gets from the data warehouse to the data marts. Data in the data warehouse is very granular.  Data in the data mart is compact and summarized. Periodically data must be moved from the data warehouse to the data mart.
As mentioned earlier, a data mart is designed based on the requirements of its department, so different departments will have different data marts. Star joins involving fact tables and dimensions are suitable for a data mart but not for a warehouse, simply because the data warehouse does not cater to any one department. Thus data mart data structures in general are not reusable or flexible for reconciliation across departments, whereas the granular data in the warehouse is all of those things. When data passes from the data warehouse to the data marts, it moves from a normalized world to a multidimensional world.
Unlike the data mart, which is designed from processing requirements, the data warehouse is designed from enterprise requirements. The enterprise requirements are gathered using what is known as the Zachman framework.
This is a convenient device that lays out a matrix covering all perspectives; the output is perspective and context. As an example, consider a matrix whose columns correspond to data, function, network, people, time, and motivation, and whose rows correspond to scope, the enterprise model, the system model, the technology model, detailed components, and the functioning system. By exhaustively covering all perspectives, this approach lets the enterprise requirements be articulated from the blueprint. These requirements subsequently lead to the enterprise data model, which helps in building the data warehouse. A lot of order and discipline is required to use this framework.