Wednesday, December 25, 2013

We looked at distributed data warehouses in detail in the previous post. Next we will look at executive information systems (EIS) and the data warehouse. EIS was the predecessor of the data warehouse. The notion behind EIS was that computational information should be available to executives through appealing screens, but it was more about presentation than computation. When the data warehouse first appeared, the EIS community did not embrace it, yet the warehouse provided exactly what EIS was lacking. It can be argued that EIS later morphed into business intelligence, especially in terms of key-ratio indicators, trend analysis, drill-down analysis, problem monitoring, competitive analysis, and key performance indicators. Trend analysis looks at measurements sequenced over time, drill-down analysis slices and dices the data, and key performance indicators include such things as cash on hand, customer pipeline, length of the sales cycle, collection time, product channel, and competitive products.
To support such EIS operations, there must be a data infrastructure. In fact, it has been observed that for every $1 spent on EIS software, $9 is spent on data preparation. Furthermore, the focus of executive analysis shifts frequently, which exacerbates the preparation requirements.
A data warehouse mitigates this. With the warehouse in place, the EIS analyst is relieved of searching for the definitive source of data, creating special extract programs from existing systems, dealing with unintegrated data, compiling and linking detailed and summary data, finding an appropriate time basis for the data, and chasing constantly changing business needs.
The EIS analyst can go to various levels in the warehouse to find such information: the departmental level of processing, the lightly summarized level of data, or the archival or dormant level of data. Drill-down also proceeds through the levels of the warehouse in this order.
One specific technique in the data warehouse that is useful to EIS processing is event mapping. The simplest way to show event mapping is to use a trend line, for example revenues varying over the months, and to map business events onto it. Superimposing trend lines on one another also reveals correlated information.
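To make the idea concrete, here is a minimal sketch with invented monthly figures: a revenue trend and a promotion-spend trend are superimposed as aligned columns, and their correlation gives a rough sense of whether the mapped event tracks the trend.

from statistics import correlation  # available in Python 3.10+

# Invented monthly figures, for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 150, 148, 170, 190]   # $K
promo_spend = [10, 12, 18, 17, 22, 25]     # $K, the "event" being mapped

# Superimpose the two trend lines as aligned columns.
print(f"{'month':<6}{'revenue':>8}{'promo':>7}")
for m, r, p in zip(months, revenue, promo_spend):
    print(f"{m:<6}{r:>8}{p:>7}")

# A rough measure of how closely the superimposed trends move together.
print("correlation:", round(correlation(revenue, promo_spend), 3))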
Detailed data is a prerequisite for EIS/DSS processing, but just how much detail is needed has also been discussed in the book. One argument holds that there is no limit to the amount of detail desired: since requirements vary, and vary frequently, being prepared with as much detail as possible is important. Another approach holds that only as much detail as the processing requires should be kept, an argument made by analogy with Zeno's paradox, in which it is suggested that a rabbit cannot outrun a turtle as long as the turtle has a head start. The reasons behind this second argument are storage and processing costs, volumes that challenge readability, and the difficulty of reusing previous analysis.
We mentioned the types of distributed data warehouses in the earlier post. We will look into them now. First, the data warehouse may be affected by many different development efforts, and these call for the data architect to manage the changes. For example, the development efforts could be based on different lines of products that are unintegrated, or the same warehouse could be distributed based on geography, such as north, south, east, and west regions. There could also be different levels of data within the same data warehouse, such as lightly summarized data, detailed data, and OLAP, built by different groups. Even the detailed, non-distributed part of the same data warehouse could be built by different groups.
The first case, completely unintegrated lines of business, poses little or no conflict, but such isolation is rare. The second case, multiple data warehouses across locations, is more common: different groups want to isolate their development effort, often at the risk of repeating effort across groups. The third case, building multiple levels of data simultaneously, is much easier to manage than either of the earlier cases; however, because the levels differ, they have different uses and expectations. The fourth case requires the most attention from the data warehouse architect, because the non-distributed warehouse is being built like slices of a pie that all need to come together; the architect is responsible for the consistency and growth of the warehouse. A different approach from the above is to develop completely independent warehouses, integrated only by a minimal corporate warehouse. The corporate warehouse also requires metadata just like the others, but its metadata is simply not tied to theirs, since there is no business integration.
Now we will look at building the warehouse at multiple levels. Here the main tool for the architect is the set of definitions for interconnectivity between the levels, and the primary data model is driven by the group building the current level of detail, because that group works directly from the data warehouse data model. Interconnectivity addresses compatibility of access at the call level, and the current level of detail must be sufficient to generate the lightly summarized data. Coordination proceeds through agreements, formal or informal, that are time-sequenced so that no team is left waiting for data that has not yet been made available.
When multiple groups form the current level of detail, little or no coordination is needed if the data is mutually exclusive between the groups. Otherwise, the overlap creates redundancy and cost, and this redundancy in turn introduces inconsistencies and the so-called spider web into the DSS environment. Therefore, a common data model needs to be established. With a common data model separated from the local data models, there is no redundancy across the local groups.
The same technology can be used for all the data models, or different platforms can be used for the different types of data found at the detailed level. In the latter case there are data transfers between platforms, and the boundary between technologies may be crossed many times. In all these cases, metadata sits on top of the data.
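As a rough illustration of metadata sitting on top of platform-distributed data, a registry like the following (all platform names invented) lets a user ask where each level of detail lives without touching the data itself:

# Hypothetical registry: metadata over data that lives on different
# platforms at different levels of detail (all names invented).
warehouse_metadata = {
    "current_detail":     {"platform": "dbms_a",     "grain": "transaction"},
    "lightly_summarized": {"platform": "dbms_b",     "grain": "daily totals"},
    "archival":           {"platform": "tape_store", "grain": "transaction"},
}

def locate(level: str) -> str:
    """Answer 'where does this level of data live?' from the metadata alone."""
    entry = warehouse_metadata[level]
    return f"{level} is held on {entry['platform']} at {entry['grain']} grain"

print(locate("lightly_summarized"))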

Tuesday, December 24, 2013

After the data warehouse is built, the focus shifts from building the warehouse to day-to-day operations. It turns out that the cost of operating and maintaining a warehouse is high, and the volume of data keeps increasing. DSS usage may cause contention for resources, but the biggest expense comes from the periodic refreshes of legacy data.
In order to refresh the data, it must be read from old legacy databases, and repeated direct reads of those databases are costly. First, the legacy DBMS must be active and online during the read process, which must fit into a window of time that cannot easily be stretched. Furthermore, the refreshment scan must process the entire legacy file when only a small percentage of it is actually needed.
A mitigation for such repeated reads is to trap the data as it is being updated; the trapped data can then be processed offline.
Data can be trapped either by data replication or by change data capture; both techniques are effective. Replication requires the data to be identified prior to update, and a trigger causes the update activity to be trapped. The content and format of the trapped data are clean and controlled.
The second approach is change data capture, which captures all the changes that have occurred, say, in a day by reading the log or journal tape. While reading the log can be cumbersome, it happens offline and it captures every update.
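A minimal sketch of the change-data-capture idea, assuming the legacy system writes one JSON line per update to a journal file: the refresh reads only the day's journal offline instead of rescanning the whole legacy database. The file name and record layout are invented.

import json

def refresh_from_journal(journal_path: str, warehouse: dict) -> None:
    """Apply only the day's captured changes (offline) instead of
    re-reading the entire legacy database."""
    with open(journal_path) as journal:
        for line in journal:
            change = json.loads(line)   # e.g. {"key": "C-101", "balance": 250.0}
            key = change.pop("key")
            warehouse[key] = change     # latest captured state wins

# Hypothetical usage; the journal holds only the records that changed today:
# refresh_from_journal("legacy_journal_20131224.log", warehouse_rows)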
As with any production database environment, data warehouse testing seems to call for two parallel environments to be set up, one for staging and one for production. The staging environment is where programmers test new programs.
However, due to the size of the warehouse, providing resources for even one of the environments is hard at times. Moreover, programs are run against the warehouse in a heuristic manner, not in the repetitive manner of conventional databases; if there is an issue with the data in the warehouse, the program is simply redone.
As a summary, data warehouses must support:
1) robust language interface
2) compound keys and variable length data
3) manage large amounts of data
4) easily index and monitor data
5) interface with a wide number of technologies
6) allow the data to be placed directly on disk
7) store and access data in parallel
8) have metadata control in the warehouse
9) efficiently load the warehouse
10) efficiently use the indexes
11) store data in a compact way
12) support compound keys
13) selectively turn off the lock manager
14) do index-only processing
15) quickly restore from bulk-storage

We also discussed that multi-dimensional OLAP technology is suited for data mart processing rather than data warehouse processing, and that metadata plays a different role in the data warehouse.
Most organizations build a single centralized data warehouse environment.

Data warehouses are not always centralized, however; businesses are distributed geographically or over multiple, differing product lines.
Distributed data warehouses are of three different types:
1) local and global data warehouses, where the former represents data and processing at a remote site and the latter represents the integration across the business
2) technologically distributed data warehouses, where a single logical data warehouse is spread out over different physical data warehouses
3) independently evolving distributed data warehouses, where one warehouse comes up and then another, in an uncoordinated manner.
 

Monday, December 23, 2013

The data warehouse has special needs, so DBMSs have evolved to include data warehouse-specific features. Instead of transaction processing, the data warehouse calls for load-and-access processing: data is integrated, transformed, and loaded into the warehouse from the operational environment and the ODS, and then accessed. The data warehouse does not do updates; instead, data is stored in a series of snapshot records. When there is a change, a new snapshot record is added rather than an update being done.
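A small sketch of the snapshot idea, using invented account records: a change is recorded by appending a new snapshot keyed by date, never by updating a row in place.

from datetime import date

# Each warehouse row is a snapshot keyed by (business key, snapshot date).
snapshots: dict[tuple[str, date], dict] = {}

def record_snapshot(account_id: str, as_of: date, balance: float) -> None:
    """A change is recorded by adding a new snapshot, never by updating."""
    snapshots[(account_id, as_of)] = {"balance": balance}

record_snapshot("A-42", date(2013, 12, 22), 100.0)
record_snapshot("A-42", date(2013, 12, 23), 250.0)   # the change = a new snapshot

# The full history is preserved, which is what makes trend analysis possible.
print(sorted(key for key in snapshots if key[0] == "A-42"))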
A data warehouse often requires different storage media, used in a dual environment: one processing environment is the DASD environment, where online interactive processing is done, and the other is tape or mass storage, where data is accumulated.
The DASD environment could use a different vendor from the mass storage environment, and the DASD environment itself may be split over more than one vendor. As long as these splits are deliberate rather than political or historical, they should not pose issues.
When comparing the operational environment to the data warehouse environment, the role of metadata is also very different. In the operational environment, the audience for metadata is the IT professional, who is usually computer-savvy, whereas in the data warehouse the metadata is used by the DSS analyst, who can use all the help available; hence the importance of metadata services.
In the data warehouse, the metadata tracks the significant transformations applied to the data as it passes from the operational environment to the data warehouse.
Moreover, data in a data warehouse exists over a lengthy span of time, and the warehouse changes its structure during that time. Keeping track of these changes in structure is a natural task for the metadata. In the operational environment, on the other hand, the metadata keeps track of one and only one correct definition of the structure of the data.
Also, operational systems focus on current data, as in the current balance or the current inventory, but trends are not available from current data; they become apparent only when data is accumulated over time. For the span of time over which this data is accumulated, a new dimension called data context becomes important. Data context is the contextual information, such as changes in the data source, that may explain discrepancies in trends computed over the accumulated data.
Three levels of context information must be maintained - simple, complex and external.
Simple contextual information relates to the basic structure of the data itself and includes structure, encoding, naming conventions, and metrics. Complex contextual information includes such things as product definitions, marketing territories, pricing, and packaging. External contextual information includes such things as inflation, financial trends, taxation, and economic growth. Complex and external contextual information vary a great deal and are harder to keep in a structured form.
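A rough sketch of how the three levels might be kept alongside a unit of accumulated data (every value below is invented):

# Illustrative only: the three levels of context kept alongside a metric.
data_context = {
    "simple": {                       # basic structure of the data itself
        "field": "monthly_revenue",
        "encoding": "USD, thousands",
        "naming_convention": "snake_case",
    },
    "complex": {                      # business definitions in force at the time
        "product_definition": "bundle includes support contract",
        "marketing_territory": "north region",
        "pricing": "list price, pre-discount",
    },
    "external": {                     # conditions outside the business
        "inflation_rate": 0.015,
        "economic_growth": 0.021,
    },
}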
In the past, collecting contextual information has been difficult because the audience was different, the interest was passive, and the resources had vanished; even when contextual information was collected, it was restricted to simple contextual information.

Sunday, December 22, 2013

Some more technological requirements of the data warehouse include:
the ability to load the warehouse efficiently
the ability to use indexes efficiently
the requirement to store data in a compact way
the requirement to support compound keys.
Data is loaded into the warehouse either record by record through a language interface or in bulk with a utility. The latter is faster; even so, it is often better to move the data repeatedly in smaller loads than in one massive load. As the volume of data to be loaded becomes a burden, the load is often parallelized: once the input data is divided, each job stream is executed independently of the others. If it is possible to defer loading until all the transformations are complete, that may be an advantage as well. A related approach to loading very large amounts of data efficiently is to stage the data prior to loading; staging helps when the complexity of processing is high or when merges of data are necessary.
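A minimal sketch of a parallelized load, assuming the input has already been split and transformed into independent job streams (the staged file names and row counts are placeholders standing in for a real bulk-load utility): each stream is loaded by its own worker, independently of the others.

from concurrent.futures import ThreadPoolExecutor

def load_stream(staged_file: str) -> int:
    """Load one pre-split, pre-transformed job stream into the warehouse.
    The body is a placeholder for the real bulk-load step."""
    return 1000   # pretend 1000 rows were loaded from this stream

# Hypothetical staged files, one per job stream.
job_streams = ["staged_part_1.dat", "staged_part_2.dat", "staged_part_3.dat"]

# Each job stream executes independently of the others.
with ThreadPoolExecutor(max_workers=len(job_streams)) as pool:
    rows_per_stream = list(pool.map(load_stream, job_streams))

print("rows loaded per stream:", rows_per_stream)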
The data warehouse may need to support the creation and loading of new indexes, and these indexes must be accessed efficiently. Indexes can be bitmaps or can have multiple levels; an index may be stored entirely in memory, or it may be compacted when the ordering of the indexed data allows, and selective or range indexes can be created. Compaction of data is a crucial tool for an efficient data warehouse, and since data in the warehouse is seldom updated once inserted, it is especially useful here. Compaction and decompaction are CPU-bound rather than I/O-bound, so their cost is not a major concern.
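As a rough sketch of why a bitmap index is so compact, assuming a low-cardinality column such as region (values invented): one bit per row is enough to answer membership and count questions without touching the rows themselves.

# Invented low-cardinality column: one bitmap (one bit per row) per value.
regions = ["north", "south", "north", "east", "north", "south"]

bitmaps: dict[str, int] = {}
for row_number, value in enumerate(regions):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)

# Answer "how many rows fall in the north region?" from the index alone.
north = bitmaps["north"]
print(bin(north), "->", bin(north).count("1"), "rows")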
Similarly, compound keys are important to the data warehouse because subjects are usually spread across different tables related by foreign keys and because the data is time-variant; both factors call for compound keys.
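A brief sketch of a compound key, using SQLite purely for illustration (table and values invented): the key is the identifier that ties the row back to its subject plus the snapshot date that makes the data time-variant.

import sqlite3

con = sqlite3.connect(":memory:")
# Time-variant subject data: the primary key is compound -- the identifier
# relating back to the subject plus the date of the snapshot.
con.execute("""
    CREATE TABLE customer_snapshot (
        customer_id   TEXT NOT NULL,
        snapshot_date TEXT NOT NULL,
        balance       REAL,
        PRIMARY KEY (customer_id, snapshot_date)
    )
""")
con.execute("INSERT INTO customer_snapshot VALUES ('C-7', '2013-12-21', 100.0)")
con.execute("INSERT INTO customer_snapshot VALUES ('C-7', '2013-12-22', 140.0)")
print(con.execute(
    "SELECT * FROM customer_snapshot WHERE customer_id = 'C-7' ORDER BY snapshot_date"
).fetchall())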
We take a look at some of the database features.
In the data warehouse, since data is relatively stable, variable-length records pose no problems. Moreover, due to the variety of data found in the data warehouse, variable-length structuring of data must be supported.
The lock manager in a database is widely used to prevent two updates to the same record at the same time. Since the data warehouse has no updates, and because the lock manager consumes a significant amount of resources, being able to selectively turn this feature off helps the performance of the warehouse.
Index-only processing is another feature. It is possible to service a request using just the index, without touching the associated data, and in a data warehouse such a feature can improve performance tremendously. For the feature to work, the index must contain the data that the request asks for.
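A small sketch of index-only processing, using SQLite purely for illustration (not necessarily the warehouse DBMS meant here): because the index carries every column the query needs, the request can be serviced from the index alone.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 10.0), ("south", 20.0), ("north", 30.0)])

# The index carries every column the query touches, so the query can be
# answered from the index alone, without reading the underlying rows.
con.execute("CREATE INDEX idx_region_amount ON sales (region, amount)")

plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT sum(amount) FROM sales WHERE region = 'north'"
).fetchall()
print(plan)   # the plan should report use of a covering index

The same idea scales up in a warehouse DBMS: the wider the index, the more requests can be satisfied without any I/O against the data itself.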
Fast restore - a restore from a direct access storage device (DASD) is usually the norm, which is why data is copied between the recovery repository and DASD. But when the restore can occur directly from secondary storage, the data warehouse benefits tremendously. A fast restore capability is especially valuable when an error is detected and the data must be brought back quickly.
Finally, it is preferable not to use transacted operations in the data warehouse; they become too complicated and consume a lot of resources. Instead, robust non-transacted operations with retry logic can be used.
We looked at indexing and monitoring in a data warehouse. We now look at the interfaces to many technologies. The data warehouse receives data from the operational systems and the ODS and passes it to the data marts, DSS applications, and alternate storage. The passage of data through the warehouse should be smooth and easy; this is an important criterion for the data warehouse, and note that it does not mean online passage of data, which is not supported. The tests of whether data passes smoothly through the DBMS include whether the passage requires format changes, whether the data can be passed on to multi-dimensional processing, whether changed data can be captured, and whether context is lost in translation.
When data is placed, the developer would like to be able to improve the efficiency of access and update by controlling the placement of data at the physical block or page level.
Another requirement is to manage the data in parallel. This will improve performance.
Another technological requirement is to have a solid metadata control. And yet another technological requirement is to have a rich language interface to the data warehouse.
Parallel storage lets the volume grow arbitrarily, though the whole issue of parallel storage and management of data is complex. Performance improves roughly in proportion to the number of physical devices over which the data is spread; equivalently, scan time shrinks with the inverse of the number of devices.
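A back-of-the-envelope illustration with invented numbers: a scan that takes 40 minutes on a single device takes roughly 40/N minutes when the data is spread evenly over N devices read in parallel.

# Invented numbers: scan time shrinks roughly as 1/N when the data is
# spread evenly over N independently readable devices.
single_device_minutes = 40.0
for n_devices in (1, 2, 4, 8):
    print(f"{n_devices} device(s): ~{single_device_minutes / n_devices:.1f} minutes")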
The use of metadata becomes all the more important with the heuristic, iterative development cycle of the data warehouse. The metadata must always represent the warehouse with up-to-date and accurate information. Metadata contains such things as the following (a small sketch of one such entry appears after the list):
table structures
table attributes
source data
mapping from source data to warehouse
data model
extract logging
common routines for access of data
definitions or descriptions of data
relationships of one unit of data to another.
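To make this concrete, a single metadata entry for one warehouse table might look roughly like this (every value is invented):

# Illustrative only: one metadata entry describing a single warehouse table.
table_metadata = {
    "table": "customer_snapshot",
    "attributes": ["customer_id", "snapshot_date", "balance"],
    "source": "legacy billing system",
    "source_to_warehouse_mapping": {"acct_no": "customer_id", "bal_amt": "balance"},
    "data_model_entity": "customer",
    "extract_log": "loaded 2013-12-20, 1.2M rows",
    "description": "daily balance snapshots per customer",
    "related_to": ["customer_demographics"],
}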
As with data, every technology in the business environment has its own metadata. Report writers, ODS environments and ETL all have their own metadata.
Similarly, the language interface can tremendously improve the usage of data. The requirements for a language interface include the ability to access data a set at a time or a record at a time, to use one or more indexes, to have a SQL interface, and to perform inserts, updates, and deletes.
Note that the SQL interface and, in some environments, the CLR satisfy these conditions and were developed to that end.
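As a small illustration of set-at-a-time versus record-at-a-time access through a SQL interface, here is a sketch using Python's built-in sqlite3 module (table and data invented):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id TEXT, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("A-1", 10.0), ("A-2", 20.0), ("A-3", 30.0)])

cur = con.execute("SELECT id, balance FROM accounts ORDER BY id")
print(cur.fetchall())                 # a set at a time

cur = con.execute("SELECT id, balance FROM accounts WHERE id = ?", ("A-2",))
print(cur.fetchone())                 # a record at a time

con.execute("INSERT INTO accounts VALUES (?, ?)", ("A-4", 40.0))            # insert
con.execute("UPDATE accounts SET balance = 25.0 WHERE id = ?", ("A-2",))    # update
con.execute("DELETE FROM accounts WHERE id = ?", ("A-1",))                  # delete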

Saturday, December 21, 2013

We saw how data warehouse granularity affects the design and how it is decided. The next step is to anticipate the needs of the different architectural entities that will be fed from the data warehouse; the warehouse has to serve as the lowest common denominator that suits all of them. Data that is too fine can always be summarized, but the reverse is much harder.
There is a feedback loop between these systems and the warehouse. We try to keep it harmonious with the following techniques:
Start with a rapid prototype of very small steps of developing the warehouse, seek early feedback and make fast adjustments.
Use the existing granularity in the participating systems to gain insight into what works
Make recommendations based on the iteration experience and keep the changes visible to the users.
Meet for joint application design and simulate the output for feedback.
If the granularity has to be raised, it can be done in many ways, such as:
Summarizing the data from the sources as it goes to the target.
Averaging or selecting the minimum and maximum values into the target.
Selecting only the data that is needed into the target.
Selecting a subset of data using conditional logic.
Thus aggregation is used for summarizing, and it can be based on the familiar average, minimum, and maximum operations or on user-defined aggregations.
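A minimal sketch of raising granularity through aggregation, with invented call records: the fine-grained rows are rolled up to one row per customer carrying a count, a total, and a maximum.

# Invented source rows at fine granularity: one row per call record.
call_records = [
    {"customer": "C-7", "minutes": 12},
    {"customer": "C-7", "minutes": 3},
    {"customer": "C-9", "minutes": 45},
]

# Raise the granularity by summarizing to one row per customer.
summary: dict[str, dict] = {}
for record in call_records:
    entry = summary.setdefault(record["customer"],
                               {"calls": 0, "total_minutes": 0, "max_minutes": 0})
    entry["calls"] += 1
    entry["total_minutes"] += record["minutes"]
    entry["max_minutes"] = max(entry["max_minutes"], record["minutes"])

print(summary)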
The feedback loop is probably the single most distinguishing characteristic of the process for building the warehouse. Without the feedback loop with the DSS analyst described above, it becomes very hard to build the warehouse right.
Different industries often use dual levels of granularity. The first level might be, say, sixty days' worth of operational detail, and the second level a lightly summarized ten-year history.
Since data from a data warehouse is usually fed into the data marts, the granularity of the data warehouse has to be at least as fine as the lowest level required by any of the data marts.
When the data spills over to overflow storage, granularity is no longer a consideration. To store data in the overflow, two pieces of software are used: a cross-media storage manager (CMSM) that manages the traffic between disk and the alternate storage, and an activity monitor to determine the spill cutoff.
The data warehouse has to manage large amounts of data, with sizes reaching terabytes and petabytes. The issue of storing and managing large amounts of data plays into every technology used in the warehouse: there are technologies for managing volume, managing multiple media, indexing and monitoring data freely, and receiving data from and passing it to a wide variety of technologies.
The hierarchy of data storage in terms of speed and access is: main memory, expanded memory, cache, DASD, magnetic tape, near-line storage, optical disk, and fiche.
Technology that supports indexing is also important: the cost of creating and using an index cannot be significant, and a wide variety of index types may need to be supported. Monitoring the data warehouse is equally important and involves such things as detecting when reorganization is required, when an index is poorly structured, when there is very little spillover data, what the composition of the data is, and how much space remains available.