Saturday, December 21, 2013

We now look at granularity in the data warehouse. We choose the level of granularity based on what matters to the environments that depend on the warehouse data, as well as on what is efficient for the system given the volume of the data. By setting the granularity at the right level and spilling the excess volume of data to overflow storage, the other aspects of design and implementation can flow smoothly.
To determine the volume of data, we estimate the number of rows. If there are only a few thousand rows, any granularity will do. If there are millions of rows, a low level of granularity can be chosen. If there are billions of rows, a higher level of granularity is required.
The rule of thumb cited in the book suggests the following:
1) For each known table:
find how big a row can be, keeping both the biggest and the smallest estimates;
for the 1-year horizon, find the maximum and minimum number of rows possible;
for the 5-year horizon, find the maximum and minimum number of rows possible;
for each key of the table, find its size in bytes;
with the row counts from above, calculate the max 1-year space as the biggest row times the 1-year max rows plus index space, and the min 1-year space as the smallest row times the 1-year min rows plus index space.
2) We repeat this for all the known tables.
This way we can estimate the volume of data to manage. The author has some interesting caveats here: first, projections of the size of the database are usually low; second, the growth rate of the warehouse is usually faster than the projection.
These row and space estimates, together with the number of direct access storage devices required, help with the planning process. Note that the number of records matters more than the size of the records. In fact, as long as all rows are reasonably sized, the calculation of index entries depends entirely on the number of rows, and together these drive the choice of granularity and the use of overflow storage.
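To make the rule of thumb concrete, here is a minimal sketch of the 1-year space estimate for one table, with hypothetical field names and purely illustrative inputs (nothing here comes from the book's figures):

public class TableSpaceEstimate
{
    public long MaxRowsOneYear;      // estimated maximum rows at the 1-year horizon
    public long MinRowsOneYear;      // estimated minimum rows at the 1-year horizon
    public int BiggestRowBytes;      // biggest estimate of a row's size
    public int SmallestRowBytes;     // smallest estimate of a row's size
    public int KeySizeBytes;         // size of the key in bytes, used for index space

    // max 1-year space = biggest row * max rows, plus index space (one index entry per row)
    public long MaxSpaceOneYear() { return MaxRowsOneYear * BiggestRowBytes + MaxRowsOneYear * KeySizeBytes; }

    // min 1-year space = smallest row * min rows, plus index space
    public long MinSpaceOneYear() { return MinRowsOneYear * SmallestRowBytes + MinRowsOneYear * KeySizeBytes; }
}

The same calculation is repeated for the 5-year horizon and then summed across all the known tables to get the overall volume to plan for.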

Friday, December 20, 2013

We continue our discussion of the data warehouse with the indirect use of warehouse data. We see an emerging pattern in the examples mentioned in the previous post. The warehouse is periodically analyzed by a program that examines relevant characteristics and criteria. This analysis creates a small file in the online environment that contains succinct information about the business of the enterprise. The small online file is used quickly and efficiently, as is customary in the operational environment.
We look at these in more detail now. First, the program that runs on the warehouse is typically a machine learning program and has free rein to run anywhere in the warehouse. It runs in the background, in harmony with the rate at which the data warehouse changes. The refreshment occurs periodically and relatively infrequently, and it replaces existing data rather than updating it in place.
The online pre-analyzed data file contains only a small amount of data per unit of data and may contain only what the online user/clerk needs. Its ownership and access are entirely within the operational environment.
In the data warehouse, a normalized approach produces the optimal design with flexibility. It fits well with granular data, it is not optimized for any given set of processing requirements, and it fits very nicely with the data model. Deviations from the normalized design usually come from requirements common to several departments, e.g., storing the data for all months together. Data warehousing projects also involve snowflake-like designs with star joins, fact tables and dimensions, but as discussed below these belong to the data marts rather than to the warehouse itself.
Unlike data warehouses, data marts are very much shaped by processing requirements; data warehouses, in contrast, serve a much broader audience. If we look at the entity relationship diagram for a data mart, all the entities look equal. In reality, however, some entities have far more data associated with them than others. The design that best exploits this characteristic is the star join. As a simple example, the entity with the most data carries foreign keys to the identifiers of all its associated entities.
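To illustrate, here is a minimal sketch of a star join layout with invented entity names; the fact type carries the bulk of the data and points at the other entities through their identifiers:

using System;

public class SaleFact                 // the entity with the most data
{
    public DateTime SaleDate;         // key into a date/calendar dimension
    public int CustomerId;            // foreign key to the customer entity
    public int ProductId;             // foreign key to the product entity
    public decimal Amount;            // the measure being analyzed
}

public class Customer { public int CustomerId; public string Name; public string Region; }
public class Product  { public int ProductId;  public string Description; public string Category; }

The lightly populated entities stay compact, while the heavily populated one carries the foreign keys that tie each fact back to them.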
One of the issues with the data warehouses and data marts is how data gets from the data warehouse to the data marts. Data in the data warehouse is very granular.  Data in the data mart is compact and summarized. Periodically data must be moved from the data warehouse to the data mart.
As mentioned earlier, data marts are designed from the requirements of a department, so different departments will have different data marts. Star joins involving fact tables and dimensions are suitable for a data mart but not for a warehouse, simply because the data warehouse does not cater to any one department. Thus data mart structures are in general not reusable or flexible for reconciliation across departments, whereas the granular data in the warehouse is all of those things. When data passes from the data warehouse to the data marts, it moves from a normalized world to a multidimensional world.
Unlike the processing requirements for the data mart, the data warehouse is designed from the enterprise requirements. The enterprise requirements are gathered from what is known as the Zachman framework.
This is a convenient device that lays out a matrix covering all perspectives; each cell gives a perspective in a particular context. The columns correspond to data, function, network, people, time and motivation, while the rows correspond to the perspectives, from scope through the enterprise model, system model and technology model down to detailed components and the functioning system. By exhaustively covering all of these cells, the approach lets the enterprise requirements be articulated as a blueprint. These requirements subsequently lead to the enterprise data model, which helps in building the data warehouse. A lot of order and discipline is required in using this framework.
Managing volume in a data warehouse
We discussed profile records in a previous post, and they come in handy for managing volume. Many detail records are reduced to a single profile record in the warehouse, which reduces volume. This comes at a loss of detail, but that is not a bad thing as long as the DSS analyst gets what he wants from the summary. What goes into a profile record may need to be worked out iteratively with the DSS analysts.
Another approach, used together with the iterative one, is to create another level of historical data: an alternate level of detail that is not designed to be used frequently. This additional level comes in useful to save the architect whenever additional fields are demanded.
Multiple profile records can also be created from the same detail, depending on the subject. The metadata records written for profile records are similar to those written for single activity snapshots. Profile records are usually generated on the operational server, since they require sorting and merging of the detail data.
In general, data flows forward from the operational systems to the warehouse. In rare cases, it is possible to flow the data backwards.
When data from the data warehouse is required by the operational systems, it can be queried directly by the applications. Alternatively, the results of the query can be cached and the application can make do with those results. Such direct queries are typically answered in a time frame that may stretch to, say, 24 hours, and the requested data should be small, not on the order of MBs or GBs. There must be compatibility between the operational and warehouse systems for direct query, and the results may need to be transformed.
Indirect access of the data warehouse, on the other hand, is entertained. For example, in an airline reservation system, an airline employee talking to a travel agent has a window of 2-3 minutes to complete the order, and the flight date and flight status need to be looked up from historical information. A flight status file is therefore created periodically by reading the historical data; the employee then only reads the current bookings and this flight status table to complete the transaction.
Another example of indirect use of the data warehouse in the operational environment is credit scoring in banking. When a customer applies for a loan, there is quite a bit of background checking involved, and this requires extensive historical data such as account history, payment history, job history, salary history, asset management history, and so on. To satisfy most customers in a short time, a program can periodically run against the warehouse and produce a pre-qualification file containing customer identification, approved credit limit, and a specially approved limit. When the customer applies for a loan that is within the approved limit, it is granted.
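A rough sketch of that pre-qualification pass, with invented type and field names and a placeholder scoring rule (the real criteria would come from the bank's historical analysis):

using System.Collections.Generic;

public class CustomerHistory               // condensed view of the warehouse history for one customer
{
    public int CustomerId;
    public decimal AverageMonthlyBalance;  // stand-in for account/payment/salary history
}

public class PreQualification              // one row of the small online file
{
    public int CustomerId;
    public decimal ApprovedCreditLimit;
    public decimal SpeciallyApprovedLimit;
}

public static class PreQualifier
{
    // runs periodically in the background against the warehouse, not per loan request
    public static List<PreQualification> Build(IEnumerable<CustomerHistory> histories)
    {
        var file = new List<PreQualification>();
        foreach (var h in histories)
        {
            var limit = h.AverageMonthlyBalance * 3;   // placeholder scoring rule
            file.Add(new PreQualification
            {
                CustomerId = h.CustomerId,
                ApprovedCreditLimit = limit,
                SpeciallyApprovedLimit = limit * 2
            });
        }
        return file;
    }
}

The resulting file is small, owned by the operational environment, and answers the clerk's question without touching the warehouse during the transaction.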


Thursday, December 19, 2013

We continue our discussion on data transformation from operational to warehouse:
We mentioned summarizing data by properly sequencing the operational data as one of the functionalities required. Others are:
Input records can be of different proprietary or non-standard formats, syntax and even semantics.
Conversions for these records need to be authored, and they can be complex.
Semantic relationships in the operational data are harder to translate and often come at the price of reverse engineering.
Data format conversion must be done to keep the data consistent in the warehouse.
Massive volumes of input must be accounted for, including parallel loads or reads.
The warehouse must conform to a corporate data model. Since the business changes more frequently than the applications written for it, the mismatches between the two need to be accounted for in designing and building the warehouse. There is order and discipline to the design and structure of the data warehouse.
The data warehouse reflects the historical need for information, and during transformation we add this element of time where it is missing.
The warehouse must be built for the informational need of the organization and not the clerical needs.
Since hardware, operating systems and environments change between the operational systems and the data warehouse systems, the newly generated output files from the transformation may need some local conversion.
ETL software is often used for these extract-tranform-load operations.
An alternative to ETL software is ELT software, which performs extract-load-transform operations where the transformation step can be deferred or skipped; skipping it, however, does not benefit the warehouse.
The population of the data warehouse is triggered by an 'event-snapshot' interaction, i.e., as and when snapshots become available, they are moved into the warehouse.
Events can be activity generated or time generated.
Events that are triggered by time are usually cyclic.
We look at the components of the snapshot next. As mentioned earlier, a snapshot contains a unit of time, a key, non-key primary data and secondary data. The first usually marks the moment when the snapshot is taken, the key identifies the snapshot, the third is the data that relates directly to the key, and the last is data that is circumstantial to the actual data.
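A minimal sketch of that snapshot layout, with illustrative field types (the actual primary and secondary data would of course be richer than single strings):

using System;

public class Snapshot
{
    public DateTime UnitOfTime;    // the moment the snapshot was taken / the event occurred
    public string Key;             // identifies the snapshot
    public string PrimaryData;     // non-key data that relates directly to the key
    public string SecondaryData;   // circumstantial data captured along with the snapshot
}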
We look at the components of the profile records as well.
Data in a warehouse is not all stable; some of it is susceptible to change due to various factors. For such data some kind of aggregate record is preferable, and such a record is called a profile record. For example, the monthly call records of a subscriber are aggregated into a single composite record. Notice how this is similar to the column-family layout used on top of HDFS, in which the columns and column counts don't matter: records can have a varying number of columns because they are structured as key-value pairs within a column family, and individual values are accessed by key lookup within that record. A profile record in the data warehouse is likewise a single composite record of all the changes captured up to the event that triggers it. Note that the comparison is being drawn in terms of the layout and not the content.
Since the profile record contains grouping of many detail records, the aggregation can take many different forms:
Values from the operational data can be summarized
Frequencies can be maintained; units of the operational data can be tallied; their min, max and average can be found; and they can be put into ranges with boundaries.
A desirable benefit of using a single profile record is that it saves the end user from having to calculate it.
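As an illustration, a sketch of rolling a subscriber's monthly call detail records into one composite profile record; the fields, the 10-minute range boundary, and the aggregation choices are all made up for the example:

using System;
using System.Collections.Generic;
using System.Linq;

public class CallDetail                  // one operational detail record
{
    public string SubscriberId;
    public DateTime CallTime;
    public int DurationSeconds;
}

public class CallProfile                 // the single composite profile record
{
    public string SubscriberId;
    public int CallCount;                // frequency
    public int MinDuration;              // min of the operational units
    public int MaxDuration;              // max of the operational units
    public double AvgDuration;           // average of the operational units
    public int LongCalls;                // tally of calls falling in the "over 10 minutes" range
}

public static class Profiler
{
    // assumes at least one detail record exists for the subscriber
    public static CallProfile Summarize(string subscriberId, IEnumerable<CallDetail> details)
    {
        var durations = details.Select(d => d.DurationSeconds).ToList();
        return new CallProfile
        {
            SubscriberId = subscriberId,
            CallCount = durations.Count,
            MinDuration = durations.Min(),
            MaxDuration = durations.Max(),
            AvgDuration = durations.Average(),
            LongCalls = durations.Count(s => s > 600)
        };
    }
}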


Interview question: Find the min in a shifted (rotated) sorted array: {26, 33, 49, 63, 3, 7, 13, 18, 19, 23}

int Min(int[] numbers)
{
    if (numbers == null || numbers.Length <= 0) throw new ArgumentException("numbers");
    int min = numbers[0];
    for (int i = 1; i < numbers.Length; i++)
    {
        if (numbers[i] < min)
        {
            // in a shifted (rotated) sorted array, the first element smaller
            // than numbers[0] is the overall minimum, so we can stop here
            min = numbers[i];
            break;
        }
    }
    return min;
}

// binary chop: binary search for the rotation point
int Min(int[] numbers, int start, int end)
{
    if (numbers == null || start < 0 || end >= numbers.Length || start > end)
        throw new ArgumentException("invalid range");
    // termination condition: a single element is the minimum of its range
    if (start == end) return numbers[start];
    if (numbers[start] > numbers[end])
    {
        // the range is rotated, so the minimum lies at the point of the drop
        int mid = (start + end) / 2;
        if (numbers[start] > numbers[mid])
            return Min(numbers, start, mid);    // the drop is in the left half
        else
            return Min(numbers, mid + 1, end);  // the drop is in the right half
    }
    else
        return numbers[start]; // the range is already sorted, so its first element is the minimum
}
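A quick check against the example array, assuming the two Min methods above are made static members of the same class and a using System; directive is present:

static void Main()
{
    int[] numbers = { 26, 33, 49, 63, 3, 7, 13, 18, 19, 23 };
    Console.WriteLine(Min(numbers));                          // prints 3
    Console.WriteLine(Min(numbers, 0, numbers.Length - 1));   // prints 3
}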




Wednesday, December 18, 2013

We continue our discussion on data warehouse.
Extraction of data from one place to another seems deceptively simple but usually turns out to be much larger and more complex than initially assumed.
We see some of these transformations now:
1) The extraction of data from the operational (legacy) environment to the data warehouse environment requires a change in technology, including operating systems, hardware and data.
2) The selection of data from the operational environment may require reading several tables, following relational constraints, and completing within the online operational window.
3) Input keys from operational data have to be restructured and converted before they are written to the data warehouse (see the sketch after this list).
4) Non-key data is reformatted - for example date-time formats may change to be more consistent across the warehouse.
5) Data is also cleansed before passing to the warehouse with cross-record verification and domain checks.
6) Multiple sources of data exist and must pass into the data warehouse.  This means that we tap into each data source with a defined logic.
7) When there are multiple input files, key resolution must be done before the files can be merged.
8) If there are multiple files, the sequence of files may not be compatible and this may involve record migrations as well.
9) The transformation logic that creates the data warehouse may produce different summarization levels for different data sources and hence different outputs that need to be reconciled.
10) There have to be default values when there is no source of data.
11) The efficiency of the selection of data is often a performance consideration because only a few records from the original source may need to be pulled.
12) Data also needs to be summarized and combined into the profile record.
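A small sketch of the kind of per-record work items 3, 4 and 10 describe; the key scheme, the date formats and the default handling are invented for the example:

using System;
using System.Globalization;

public static class Transform
{
    // 3) restructure an operational key into the warehouse key, e.g. by prefixing the source system
    public static string WarehouseKey(string sourceSystem, string operationalKey)
    {
        return sourceSystem + ":" + operationalKey;
    }

    // 4) reformat a source-specific date into the single format the warehouse standardizes on
    public static string WarehouseDate(string operationalDate)
    {
        var d = DateTime.ParseExact(operationalDate, "MM/dd/yyyy", CultureInfo.InvariantCulture);
        return d.ToString("yyyy-MM-dd");
    }

    // 10) supply a default value when the source has nothing for a given element
    public static string WithDefault(string value, string defaultValue)
    {
        return string.IsNullOrEmpty(value) ? defaultValue : value;
    }
}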

Data warehouses are built for a wide variety of systems. However, each of these systems feeds the warehouse a common structure of data called a snapshot. This has four components: a key, a unit of time (timestamp), primary data that relates only to the key, and secondary data captured as part of the snapshot process with no direct relationship to the primary data.
The key identifies the record and the primary data.
The unit of time refers to the moment when the event being described by the snapshot occurred.
The primary data is the non-key data
The secondary data is circumstantial data and is typically used with DSS processing
Because the primary and the secondary data are adjacent in the snapshot, a relationship between them can be inferred. This inferred relationship is called an artifact.
We now discuss metadata. Metadata is very useful for describing the data in the warehouse. It acts like an index, and it stores the structure of the data as known to the programmer and the DSS analyst, the source of the data, the data model, and a history of extracts.
Reference data is another component commonly used with data warehouses and helps keep the data referentially consistent. Reference data tends to be ignored as data accumulates in the warehouse; however, it should be made time variant just like any other part of the warehouse.
There are two different designs for reference data.
The first approach takes a snapshot of the reference data every six months or so. This approach is quite simple, but it is not sufficient because it misses the activity between snapshots.
The second approach keeps track of all  the activities against the reference table.
The actual implementation might be something in between the first and the second approach.
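Under the first approach, a time-variant reference record could look something like the sketch below (fields invented for illustration); the second approach would instead log every individual change against the reference table with its own timestamp:

using System;

public class ReferenceEntry
{
    public string Code;             // e.g. a product or department code
    public string Description;      // the meaning of the code at the time of the snapshot
    public DateTime EffectiveFrom;  // when this version of the reference data took effect
    public DateTime EffectiveTo;    // end of the period for which this version is valid
}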
There is another important factor pervasive in the data warehouse. This is the length of time a change of data in the operational environment takes to be reflected in the data warehouse and is referred to as the cyclicity of data.
The data warehouse contains historical information. The changes in the operational system should take at least 24 hours  to show up in the warehouse. This introduced delay is referred to as the wrinkle of time.
A wrinkle of time smaller than 24 hours is expensive to achieve, but the point of the delay is to keep operational processing out of the warehouse and warehouse processing out of the operational environment.