Sunday, December 22, 2013

Some more technological requirements of the data warehouse involve:
the ability to load the warehouse efficiently
the ability to use indexes efficiently
the requirement to store data in a compact way
the requirement to support composite keys.
Data is loaded into the warehouse either on a record-by-record basis through a language interface or in bulk with a utility. The latter is faster, though in practice it is often better to load repeatedly in smaller increments. As the volume of data to be loaded becomes a burden, the load is often parallelized: the input data is divided, and each job stream executes independently of the others. If it is possible to defer loading until all the transformations are complete, that can be an advantage as well. A related approach to the efficient loading of very large amounts of data is staging the data prior to loading. Staging helps when the complexity of processing is high or when merges of data are necessary.
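A sketch only: stage transformed records to flat files, then bulk-load the staged partitions in parallel. The file layouts and the bulk_load stand-in below are assumptions, not a specific vendor utility.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    # Placeholder for the (possibly complex) processing done during staging.
    return {k.strip().lower(): v.strip() for k, v in record.items()}

def stage(input_path, staged_path):
    # Staging: do the heavy transformation work before the load begins.
    with open(input_path, newline="") as src, open(staged_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fields = [f.strip().lower() for f in reader.fieldnames]
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for record in reader:
            writer.writerow(transform(record))

def bulk_load(staged_path):
    # Stand-in for a bulk-load utility; here it just counts the rows it "loads".
    with open(staged_path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

def parallel_load(staged_paths, workers=4):
    # Divide the input into job streams and run each independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(bulk_load, staged_paths))
```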
The data warehouse may need to support the creation and loading of new indexes, and these indexes must be accessed efficiently. Indexes can be bit maps or have multiple levels. An index may be stored entirely in memory, or compacted when the order of the data being indexed allows such compaction; selective indexes and range indexes can also be created. Compaction of data is a crucial tool in an efficient data warehouse, and because data in the warehouse is seldom updated once inserted, compaction is especially useful. Compaction and decompaction are CPU bound rather than I/O bound, so their execution is not a major concern.
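For illustration only, here is a tiny sketch of the bitmap-index idea: one bit vector per distinct column value. The rows and column names are invented for the example.

```python
# Illustrative bitmap index: one bit vector (list of 0/1) per distinct value.
def build_bitmap_index(rows, column):
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index

rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 30},
]
by_region = build_bitmap_index(rows, "region")
east_rows = [r for r, bit in zip(rows, by_region["east"]) if bit]  # rows 0 and 2
```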
Similarly, compound keys are important to the data warehouse because subjects usually span different tables related by foreign keys and because the data is time variant. Both of these factors require compound keys.
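A minimal sketch of a compound key, assuming a hypothetical customer subject: the business key plus the time element together identify a row.

```python
from datetime import date

# Hypothetical customer snapshots keyed by (customer_id, snapshot_date):
# the business key plus the time element form the compound key.
snapshots = {
    ("C-1001", date(2013, 11, 30)): {"balance": 1250.00, "status": "active"},
    ("C-1001", date(2013, 12, 31)): {"balance": 1175.50, "status": "active"},
}

december = snapshots[("C-1001", date(2013, 12, 31))]
```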
We take a look at some of the database features.
In the data warehouse, since data is relatively stable, variable-length records pose no problems. Moreover, due to the variety of data found in the data warehouse, variable-length structuring of data must be supported.
The lock manager in databases is widely used to prevent simultaneous updates to the same record. Since the data warehouse does not have updates, and because the lock manager consumes a significant amount of resources, being able to turn this feature off helps the performance of the warehouse.
Index-only processing is another useful feature: it is possible to service a request by using the index alone, without touching the associated data. In a data warehouse, such a feature can improve performance tremendously. For the feature to work, the index must contain what the query needs, and the optimizer must recognize, or be told through a hint, that the index can satisfy the request.
Fast restore - a restore from a direct access storage device (DASD) is usually the norm, which is why data is copied between the recovery repository and the DASD. But when the restore can occur directly from secondary storage, the data warehouse benefits tremendously. Fast restore also helps when errors are detected and data needs to be recovered.
Finally, it is preferable not to use transacted operations in the data warehouse. They become too complicated and tend to consume a lot of resources. Instead, a robust non-transacted approach with retry logic can be used.
We looked at indexing and monitoring in a data warehouse. We now look at the interfaces to many technologies. The data warehouse receives data from the operational systems and the ODS and passes it to the data marts, DSS applications, and alternate storage. The passage of data through the warehouse should be smooth and easy; this is an important criterion for the data warehouse, though it does not mean online passage of data, which is not supported. The tests of whether data passes smoothly through the DBMS include whether the passage requires changes to the format, whether the data can be passed on to multidimensional processing, whether change data capture is possible, and whether context is lost in translation.
When data is placed, the developer would like to be able to improve the efficiency of access and update by controlling the placement of data at the physical block or page level.
Another requirement is to manage the data in parallel. This will improve performance.
Another technological requirement is to have a solid metadata control. And yet another technological requirement is to have a rich language interface to the data warehouse.
Parallel storage lets the volume grow arbitrarily. The entire issue of parallel storage and management of data is complex. Broadly, the elapsed time for accessing the data drops in inverse proportion to the number of physical devices over which the data is physically distributed.
The use of metadata becomes all the more important with the heuristic, iterative development cycle of the data warehouse. The metadata must always represent the warehouse with up-to-date and accurate information. Metadata contains such things as the following (a small illustrative entry appears after the list):
table structures
table attributes
source data
mapping from source data to warehouse
data model
extract logging
common routines for access of data
definitions or descriptions of data
relationships of one unit of data to another.
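As promised above, a small illustrative metadata entry; every field name here is hypothetical and only meant to mirror the list.

```python
# Hypothetical metadata entry for a single warehouse table.
customer_snapshot_metadata = {
    "table": "customer_snapshot",
    "attributes": ["customer_id", "snapshot_date", "balance", "status"],
    "source": "ops.customer_master",
    "source_to_warehouse_mapping": {"cust_no": "customer_id", "bal_amt": "balance"},
    "data_model": "corporate_model_v3",
    "extract_log": [{"run_date": "2013-12-20", "rows_extracted": 104233}],
    "access_routines": ["get_customer_history"],
    "description": "Month-end snapshot of customer accounts",
    "relationships": {"customer_id": "customer_dim.customer_id"},
}
```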
As with data, every technology in the business environment has its own metadata. Report writers, ODS environments and ETL all have their own metadata.
Similarly, the language interface can tremendously improve the usage of data. The requirements of a language interface include the ability to access data a set at a time or a record at a time, to use one or more indexes, to have a SQL interface, and to perform inserts, updates, and deletes.
Note that the CLR and SQL interfaces satisfy these conditions and were developed to that end.

Saturday, December 21, 2013

We saw how the data warehouse granularity affects the design and how it is decided. The next step is to anticipate the needs of the different architectural entities that will be fed from the data warehouse. This means that the data warehouse has to serve as the lowest common denominator to suit all the entities. Data that is too fine can always be summarized but the reverse is harder.
There is a feedback loop between these systems and the warehouse. We attempt to make it harmonious with the following techniques:
Start with a rapid prototype built in very small steps, seek early feedback, and make fast adjustments.
Use the existing granularity in the participating systems to gain insight into what works
Make recommendations based on the iteration experience and keep the changes visible to the users.
Meet for joint application design and simulate the output for feedback.
If the granularity has to be raised, it can be done in many ways, such as:
Summarizing the data from the sources as it goes to the target.
Averaging, or selecting only the minimum and maximum values, into the target.
Selecting only the data that is needed into the target.
Selecting a subset of data using conditional logic.
Thus aggregation is used for summarizing, and it can be based on well-known operations such as average, min, and max, or on user-defined aggregations.
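A minimal sketch of raising granularity by summarizing source rows into the target; the transaction records and field names are made up for the example.

```python
from collections import defaultdict

# Hypothetical source detail: one row per transaction.
transactions = [
    {"account": "A-17", "date": "2013-12-20", "amount": 25.00},
    {"account": "A-17", "date": "2013-12-20", "amount": 40.00},
    {"account": "A-17", "date": "2013-12-21", "amount": 15.00},
]

# Target at a higher level of granularity: one row per account per day,
# keeping only a summary plus the min/max values.
daily = defaultdict(lambda: {"total": 0.0, "min": float("inf"), "max": float("-inf")})
for txn in transactions:
    key = (txn["account"], txn["date"])
    daily[key]["total"] += txn["amount"]
    daily[key]["min"] = min(daily[key]["min"], txn["amount"])
    daily[key]["max"] = max(daily[key]["max"], txn["amount"])
```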
The feedback loop is probably the single most distinguishing characteristic of a process for building the warehouse. Without the feedback loop with the DSS analyst as described above, it becomes very hard to build it right.
There are often dual levels of granularity used in different industries. The first level might be, say, sixty days' worth of operational data, and the second level the lightly summarized ten-year history.
  Since data from a data warehouse is usually fed into the data marts, the granularity of the data warehouse has to be at least at the lowest level required by any of the data marts.
When the data spills over to the overflow, the granularity is not a consideration any more. To store in the overflow, two pieces of software are used  - a cross-media storage manager (CMSM) that manages the traffic between the disk and the alternate storage and an activity monitor to determine the spill cutoff.
The data warehouse has to manage large amounts of data; the size could be in terabytes or petabytes. The issue of storing and managing large amounts of data plays into every technology used in the warehouse. There must be technology for managing large volumes, managing multiple storage media, indexing and monitoring data freely, and receiving data from and passing data to a wide variety of technologies.
The hierarchy of data storage, in terms of speed and access, is as follows: main memory, expanded memory, cache, DASD, magnetic tape, near-line storage, optical disk, and fiche.
Technology that supports indexing is also important. The cost of creating and using an index cannot be significant, and a wide variety of indexes may need to be supported. Monitoring the data warehouse is equally important and involves such things as detecting when reorganization is required, when an index is poorly structured, or when there is very little spillover data, as well as tracking the composition of the data and the remaining available space.
We now look at the granularity in the data warehouse. We choose it based on what matters to the environments that depend on the warehouse data, as well as what is efficient for the system given the volume of the data. By setting the granularity at the right level and spilling the excess volume of data to overflow storage, other aspects of the design and implementation can flow smoothly.
To determine the volume of data, we estimate the number of rows. If there are a few thousand rows, any granularity will do. If there are millions of rows, a low level of granularity can be chosen. If there are billions of rows, a higher level of granularity is required.
The rule of thumb cited in the book suggests the following:
1) For each known table:
we find how big a row can be, estimating both the biggest and smallest row sizes
for the 1-year horizon, we find the maximum and minimum number of rows possible
for the 5-year horizon, we find the number of rows possible
for each key of the table, we find its size in bytes
with the row counts from above, we calculate the maximum 1-year space as the biggest row size times the maximum number of 1-year rows, and the minimum 1-year space as the smallest row size times the minimum number of 1-year rows, plus index space
2) We repeat this for all the known tables.
This way we can estimate the volume of data to manage. The author has some interesting caveats here: first, projections of the size of the database are usually low; and second, the growth rate of the warehouse is usually faster than projected.
These row estimates, along with the number of direct access storage devices, help with the planning process. Note that the number of records matters more than the size of the records. In fact, as long as all rows are reasonably sized, the calculation of index entries depends entirely on the number of rows, and together these drive the choice of granularity and the use of overflow storage.
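A worked sketch of the rule of thumb for a single table; every number below is invented purely to show the arithmetic.

```python
# Invented figures for one table, following the rule of thumb above.
biggest_row_bytes = 400
smallest_row_bytes = 250
key_bytes = 12               # index entry size per row

max_rows_1yr = 10_000_000
min_rows_1yr = 6_000_000

max_space_1yr = biggest_row_bytes * max_rows_1yr                               # data only
min_space_1yr = smallest_row_bytes * min_rows_1yr + key_bytes * min_rows_1yr   # data plus index space

print(max_space_1yr / 1e9, min_space_1yr / 1e9)  # rough sizes in gigabytes: 4.0 and about 1.57
```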

Friday, December 20, 2013

We continue our discussion of the data warehouse with the indirect use of data warehouse data. We see an emerging pattern in the examples mentioned in the previous post. The warehouse is periodically analyzed by a program that examines relevant characteristics and criteria. This analysis creates a small file in the online environment that contains succinct information about the business of the enterprise. The small online file is used quickly and efficiently, as is customary in the operational environment.
We look at these in more detail now. First, the program that runs on the warehouse is typically a machine learning program and has free rein to run anywhere in the warehouse. It runs in the background and is in harmony with the rate at which the data warehouse changes. The refreshment occurs periodically and relatively infrequently. And it replaces existing data as opposed to updating it.
The online pre-analyzed data file contains only a small amount of data per unit of data and may contain only what the online user/clerk needs. Its ownership and access are entirely within the operational environment.
In the data warehouse, a normalized approach produces the optimal design with flexibility. It fits well with granular data, it is not optimized for any given set of processing requirements, and it fits very nicely with the data model. Deviations from the normalized design usually come from the common requirements of several departments, for example storing data for all months. The broader data warehousing environment also involves snowflake-like designs, which include star joins, fact tables, and dimensions.
Unlike data warehouses, data marts are very much shaped by processing requirements; data warehouses have a much broader audience. If we look at the entity relationship diagram for a data mart, all the entities look equal. In reality, however, some entities have far more data volume associated with them than others. The design that best exploits this characteristic is the star join. As a simple example of a star join, the entity with the most data can carry foreign keys to the identifiers of all the associated entities.
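A minimal sketch of the star-join idea just described: the high-volume fact entity carries foreign keys to the identifiers of the associated (dimension) entities. Table and column names are illustrative only.

```python
# Illustrative star join: the entity with the most data (sales facts) holds
# foreign keys into the much smaller associated entities (dimensions).
product_dim = {1: {"name": "widget"}, 2: {"name": "gadget"}}
date_dim = {20131220: {"month": "2013-12"}}

sales_facts = [
    {"product_id": 1, "date_id": 20131220, "amount": 19.99},
    {"product_id": 2, "date_id": 20131220, "amount": 5.49},
]

# Resolving one fact row against its dimensions:
fact = sales_facts[0]
resolved = {
    "product": product_dim[fact["product_id"]]["name"],
    "month": date_dim[fact["date_id"]]["month"],
    "amount": fact["amount"],
}
```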
One of the issues with the data warehouses and data marts is how data gets from the data warehouse to the data marts. Data in the data warehouse is very granular.  Data in the data mart is compact and summarized. Periodically data must be moved from the data warehouse to the data mart.
As mentioned earlier, data marts are designed based on the requirements of a department, and different departments will have different data marts. Star joins involving fact tables and dimensions are suitable for a data mart, but not for a warehouse, simply because the data warehouse does not cater to a single department. Thus data mart data structures are, in general, not reusable or flexible for reconciliation across departments, while the granular data in the warehouse is all of those. When data passes from a data warehouse to the data marts, it moves from a normalized world to a multidimensional world.
Unlike the data mart, which is designed from processing requirements, the data warehouse is designed from the enterprise requirements. The enterprise requirements are gathered using what is known as the Zachman framework.
This is a convenient device that lays out a matrix covering all perspectives; each cell is described by its perspective and its context. As an example, the columns could correspond to data, function, network, people, time, and motivation, and the rows to the scope, the enterprise model, the system model, the technology model, components, and the functioning system. By exhaustively covering all perspectives, this approach lets the enterprise requirements be articulated from the blueprint. These requirements subsequently lead to the enterprise data model, which helps in building the data warehouse. A lot of order and discipline is required in using this framework.
Managing volume in a data warehouse
We discussed profile records in a previous post, and they come in handy for managing volume. Many detail records are reduced to a single profile record in the warehouse, which reduces volume. This comes at a loss of detail, but that is not a bad thing as long as the DSS analyst gets what he wants from the summary. What goes into a profile record may need to be worked out iteratively with the DSS analysts.
Another approach, used together with the iterative approach, is to create another level of historical data: an alternate level of detail that is not designed to be used frequently. Although rarely accessed, it comes in useful to save the architect whenever additional fields are demanded.
Multiple profile records can also be created from the same detail data, depending on the subject. The metadata records written for profile records are similar to those written for single-activity snapshots. These records are usually generated on the operational server and require sorting and merging of the data.
In general, data flows forward from the operational systems to the warehouse. In rare cases, it is possible to flow the data backward.
When data from a data warehouse is required by the operational systems, it can be queried directly by the applications. Alternatively, the results of the query can be cached and the application can make do with those results. Such direct queries typically have response times on the order of, say, 24 hours, and the requested data should be small, not on the order of megabytes or gigabytes. There must be compatibility between the operational and warehouse systems for direct query, and results may need to be transformed.
Indirect access of the data warehouse, on the other hand, is entertained. For example, in an airline reservation system, an airline employee talking to a travel agent has a window of two to three minutes to complete the order, and the flight date and flight status need to be looked up from historical information. A flight status file is created periodically by reading the historical data, and it then becomes easy for the employee to read the current bookings and the flight status table to facilitate the transaction.
Another example of the indirect use of the data warehouse in the operational environment is credit scoring in banking. When a customer applies for a loan, there is quite a bit of background checking involved, and this requires extensive historical data such as account history, payment history, job history, salary history, asset management history, and so on. To satisfy most customers in a short time, a program can periodically run against this history and produce a pre-qualification file containing the customer identification, the approved credit limit, and any specially approved limit. When the customer applies for a loan that is within the approved limit, it is granted.
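A minimal sketch of the pre-qualification idea: a batch step scans history and writes a small file, and the online step only consults that file. The scoring rule and field names are invented for illustration.

```python
# Invented batch step: reduce extensive history to one pre-qualification
# record per customer (here, half of the average salary as the limit).
def prequalify(salary_history):
    prequal = {}
    for customer_id, salaries in salary_history.items():
        prequal[customer_id] = {"approved_limit": 0.5 * sum(salaries) / len(salaries)}
    return prequal

# Online step: grant immediately if the request is within the approved limit.
def decide(prequal, customer_id, requested_amount):
    record = prequal.get(customer_id)
    return record is not None and requested_amount <= record["approved_limit"]

prequal_file = prequalify({"C-1001": [52000, 54000, 58000]})
print(decide(prequal_file, "C-1001", 20000))  # True: within the approved limit
```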


Thursday, December 19, 2013

We continue our discussion on data transformation from operational to warehouse:
We mentioned summarizing data by properly sequencing the operational data as one of the functionalities required. Others are:
Input records can be of different proprietary or non-standard formats, syntax and even semantics.
Conversions for these records need to be authored, and they can be complex.
Semantic relationships in the operational data are harder to translate and often come at the price of reverse engineering.
Data format conversion must be done to keep the data consistent in the warehouse.
Massive volumes of input must be accounted for including parallel loads or reads.
The warehouse must conform to a corporate data model. Since the business changes more frequently than the applications written for it, these mismatches need to be accounted for in designing and building the warehouse. There is order and discipline to the design and structure of the data warehouse.
The data warehouse reflects the historical need for information and during transformation, we add this element of time where missing.
The warehouse must be built for the informational need of the organization and not the clerical needs.
Since the hardware, operating systems, and environment change between the operational systems and the data warehouse systems, the newly generated output files from transformation may need some local conversion.
ETL software is often used for these extract-tranform-load operations.
An alternative to ETL software is ELT software, which performs extract-load-transform operations where the transformation step can be skipped; however, this does not benefit the warehouse.
The transformations or population of the data warehouse is triggered with an 'event-snapshot' interaction i.e. as and when snapshots become available, they are moved into the warehouse.
Events can be activity generated or time generated.
Events that are triggered by time are usually cyclic.
We look at the components of the snapshot next. As mentioned earlier, a snapshot contains a unit of time, a key, non-key primary data, and secondary data. The first usually marks the moment when the snapshot is taken, the key identifies the snapshot, the third is the data that relates directly to the key, and the last is data that is circumstantial to the actual data.
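A minimal sketch of a snapshot record with the four components just named; the field names and example values are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Snapshot:
    taken_at: datetime                 # unit of time: when the snapshot was taken
    key: str                           # identifies the snapshot
    primary_data: dict                 # non-key data that relates directly to the key
    secondary_data: dict = field(default_factory=dict)  # circumstantial data

snap = Snapshot(
    taken_at=datetime(2013, 12, 19, 8, 30),
    key="C-1001",
    primary_data={"balance": 1175.50},
    secondary_data={"branch": "downtown"},
)
```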
We look at the components of the profile records as well.
Not all data in a warehouse is stable; some of it is susceptible to change due to various factors, and for such data some kind of aggregate record is preferable. Such a record is called a profile record. For example, the monthly call records of a subscriber are aggregated to provide a single composite record. Notice how this is similar to HDFS, where the columns and the column counts don't matter: records can have a varying number of columns because they are structured as key-value pairs in a column family, and they allow access to individual values based on a key lookup for that record in that collection. A profile record in the data warehouse is a single composite record of all the changes captured up until the event that triggers the profile record. Note that the comparison is being drawn in terms of layout, not content.
Since the profile record groups many detail records, the aggregation can take many different forms:
Values from the operational data can be summarized
Frequencies can be maintained; units of the operational data can be tallied; their minimum, maximum, and average values can be found; and they can be put into ranges with boundaries.
A desirable benefit of using a single profile record is that it saves the end user from having to calculate it.
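A minimal sketch of rolling a subscriber's monthly call details into one profile record, precomputing the summary values so the end user never has to; the structure is illustrative only.

```python
# Illustrative aggregation of detail call records into a single profile record.
def build_profile(customer_id, month, minutes_per_call):
    return {
        "customer_id": customer_id,
        "month": month,
        "call_count": len(minutes_per_call),                       # frequency
        "total_minutes": sum(minutes_per_call),                    # summarized value
        "min_minutes": min(minutes_per_call),
        "max_minutes": max(minutes_per_call),
        "avg_minutes": sum(minutes_per_call) / len(minutes_per_call),
    }

profile = build_profile("C-1001", "2013-12", [12, 47, 3, 28])
# profile["call_count"] == 4, profile["avg_minutes"] == 22.5
```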