Friday, December 20, 2013

Managing volume in a data warehouse
We discussed profile records in a previous post, and they come in handy for managing volume. Many detailed records are reduced to a single profile record in the warehouse, which cuts volume considerably. This comes at a loss of detail, but that isn't a bad thing as long as the DSS analyst gets what is needed from the summary. What goes into a profile record may need to be worked out iteratively with the DSS analysts.
Another approach, used together with the iterative one, is to create an additional level of historical detail. This alternate detail is not designed to be used frequently, but it saves the architect whenever additional fields are demanded later.
Multiple profile records can also be created from the same detail, depending on the subject. The metadata records written for profile records are similar to those written for single-activity snapshots. These records are usually generated on the operational server and require sorting and merging of data.
In general, data flows forward from the operational systems to the warehouse. In rare cases, it is possible to flow the data backwards.
When data from a data warehouse is required by the operational systems, it can be queried directly by the applications. Alternatively, the results of the query can be cached and the application can make do with the cached results. Such direct queries are typically answered within a window that may be as long as 24 hours, and the requested data should be small, not on the order of megabytes or gigabytes. There must be compatibility between the operational and warehouse systems for a direct query, and the results may need to be transformed.
Indirect access of the data warehouse, on the other hand, is entertained. For example, in an airline reservation system where an airline employee talks to a travel agent and has a window of two to three minutes to complete the order, the flight date and flight status need to be looked up from historical information. A flight status file is created periodically by reading historical data, and it then becomes easy for the employee to read the current bookings and the flight status table to facilitate the transaction.
Another example of the indirect use of a data warehouse in the operational environment is credit scoring in banking. When a customer applies for a loan, there is quite a bit of background checking involved, and this requires extensive historical data such as account history, payment history, job history, salary history, asset management history, and so on. To satisfy most customers in a short time, a program can periodically run and produce a pre-qualification file that holds the customer identification, the approved credit limit, and a specially approved limit. When the customer applies for a loan, if the loan is within the approved limit, it is granted.
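As a rough sketch of how such a pre-qualification file might be consulted at loan time; the type and member names here (PreQualification, CanGrantImmediately, and so on) are assumptions for illustration, not any particular banking system's API:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pre-qualification record produced periodically from warehouse history.
class PreQualification
{
    public string CustomerId;
    public decimal ApprovedLimit;
    public decimal SpeciallyApprovedLimit;
}

class LoanPreScreen
{
    private readonly Dictionary<string, PreQualification> byCustomer;

    public LoanPreScreen(IEnumerable<PreQualification> file)
    {
        // The file is rebuilt by a periodic batch job; the online system only reads it.
        byCustomer = file.ToDictionary(p => p.CustomerId);
    }

    // Grant immediately if the requested amount is within the pre-approved limits;
    // anything else falls back to the full (slow) background check.
    public bool CanGrantImmediately(string customerId, decimal requestedAmount)
    {
        PreQualification p;
        if (!byCustomer.TryGetValue(customerId, out p)) return false;
        return requestedAmount <= Math.Max(p.ApprovedLimit, p.SpeciallyApprovedLimit);
    }
}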


Thursday, December 19, 2013

We continue our discussion on data transformation from operational to warehouse:
We mentioned summarizing data by properly sequencing the operational data as one of the functionalities required. Others are:
Input records can be in different proprietary or non-standard formats, with differing syntax and even semantics.
Conversions for these records need to be authored, and they can be complex.
Semantic relationships in the operational data are harder to translate and often come at the price of reverse engineering.
Data format conversion must be done to keep the data consistent in the warehouse.
Massive volumes of input must be accounted for, including parallel loads or reads.
The warehouse must conform to a corporate data model. Since the business changes more frequently than the applications written for it, these mismatches need to be accounted for in designing and building the warehouse. There is order and discipline to the design and structure of the data warehouse.
The data warehouse reflects the historical need for information, so during transformation we add this element of time where it is missing.
The warehouse must be built for the informational needs of the organization and not the clerical needs.
Since hardware, operating systems, and environments change between the operational systems and the data warehouse systems, the newly generated output files from the transformation may need some local conversion.
ETL software is often used for these extract-transform-load operations.
An alternative is ELT software, which performs extract-load-transform operations where the transformation can be skipped; skipping it, however, does not benefit the warehouse.
The transformation, or population, of the data warehouse is triggered by an 'event-snapshot' interaction, i.e., as and when snapshots become available, they are moved into the warehouse.
Events can be activity generated or time generated.
Events that are triggered by time are usually cyclic.
We look at the components of the snapshot next. As mentioned earlier, a snapshot contains a unit of time, a key, non-key primary data, and secondary data. The first usually marks the moment when the snapshot is taken. The key identifies the snapshot. The third is the data that relates to the key, and the last is the data that is circumstantial to the actual data.
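A minimal sketch of how those four components might be laid out as a record; the field names are assumptions for illustration:

using System;

// The four components of a snapshot record.
class Snapshot
{
    public DateTime UnitOfTime;      // the moment the snapshot was taken
    public string Key;               // identifies the snapshot
    public string PrimaryData;       // non-key data that relates directly to the key
    public string SecondaryData;     // circumstantial data captured along with the snapshot
}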
We look at the components of the profile records as well.
Data in a warehouse is not all stable, and some of it is susceptible to changes due to various factors. For such data, some kind of aggregate record is preferable. Such a record is called a profile record. For example, the monthly call records of a subscriber are aggregated to provide a single composite record. Notice how this is similar to a column-family store such as HBase, in which the columns and column counts don't matter. Records can have a varying number of columns because they are structured as key-value pairs in a column family. The records allow access to individual values based on a key lookup for that record in that collection. A profile record in the data warehouse is a single composite record of all the changes captured until the event that triggers the profile record. Note that the comparison is being drawn in terms of the layout and not the content.
Since the profile record contains a grouping of many detail records, the aggregation can take many different forms:
Values from the operational data can be summarized.
Frequencies can be maintained; units of the operational data can be tallied; their minimum, maximum, and average can be found; and they can be put into ranges with boundaries.
A desirable benefit of using a single profile record is that it saves the end user from having to calculate it.
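A small sketch of such an aggregation for the monthly call-record example mentioned above; the record shapes (CallRecord, CallProfile) and field names are assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

class CallRecord
{
    public string SubscriberId;
    public DateTime CallTime;
    public int DurationSeconds;
}

// One composite record per subscriber, so the DSS analyst
// does not have to compute the aggregates at query time.
class CallProfile
{
    public string SubscriberId;
    public int CallCount;
    public int TotalSeconds;
    public int MinSeconds;
    public int MaxSeconds;
    public double AvgSeconds;
}

static class Profiler
{
    public static IEnumerable<CallProfile> Summarize(IEnumerable<CallRecord> details)
    {
        return details
            .GroupBy(d => d.SubscriberId)
            .Select(g => new CallProfile
            {
                SubscriberId = g.Key,
                CallCount = g.Count(),
                TotalSeconds = g.Sum(d => d.DurationSeconds),
                MinSeconds = g.Min(d => d.DurationSeconds),
                MaxSeconds = g.Max(d => d.DurationSeconds),
                AvgSeconds = g.Average(d => d.DurationSeconds)
            });
    }
}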


Interview question: Find the min in a shifted (rotated) sorted array: {26, 33, 49, 63, 3, 7, 13, 18, 19, 23}

// linear scan: in a shifted (rotated) sorted array the values increase until they
// wrap around, so the first value smaller than the starting element is the minimum
int Min(int[] numbers)
{
    if (numbers == null || numbers.Length == 0) throw new ArgumentException("numbers must be non-empty");
    int min = numbers[0];
    for (int i = 1; i < numbers.Length; i++)
    {
        if (numbers[i] < min)
        {
            min = numbers[i];
            break; // for a shifted sorted array, the first drop is the minimum
        }
    }
    return min;
}

// binary chop: O(log n) search for the minimum in a shifted sorted array
int Min(int[] numbers, int start, int end)
{
    // validate parameters as in the linear version

    // termination condition: a single-element range is its own minimum
    if (start == end) return numbers[start];

    if (numbers[start] > numbers[end])
    {
        // the rotation point lies somewhere in [start, end]
        int mid = start + (end - start) / 2;
        if (numbers[start] > numbers[mid])
            return Min(numbers, start, mid);    // the drop happens at or before mid
        else
            return Min(numbers, mid + 1, end);  // [start, mid] is sorted, so the drop is after mid
    }
    else
    {
        // the whole range is already sorted, so the first element is the minimum
        return numbers[start];
    }
}
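For the sample array in the question, both versions return 3:

int[] numbers = { 26, 33, 49, 63, 3, 7, 13, 18, 19, 23 };
Console.WriteLine(Min(numbers));                        // prints 3
Console.WriteLine(Min(numbers, 0, numbers.Length - 1)); // prints 3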




Wednesday, December 18, 2013

We continue our discussion on data warehouse.
Extraction of data from one place to another seems deceptively simple but usually turns out to be much more complex and larger than initially assumed.
We see some of these transformations now:
1) The extraction of data from the operational (legacy) environment to the data warehouse environment requires a change in technology, including operating systems, hardware, and data.
2) The selection of data from the operational environment may require reading several tables, following relational constraints, and may have to be done during the online operational window.
3) Input keys from operational data have to be restructured and converted before they are written to the data warehouse.
4) Non-key data is reformatted; for example, date-time formats may change to be more consistent across the warehouse (a small sketch of items 3 and 4 follows this list).
5) Data is also cleansed before passing to the warehouse with cross-record verification and domain checks.
6) Multiple sources of data exist and must pass into the data warehouse.  This means that we tap into each data source with a defined logic.
7) When there are multiple input files, key resolution has to be done before the files can be merged.
8) If there are multiple files, the sequence of files may not be compatible and this may involve record migrations as well.
9) The transformation logic that creates the data warehouse may produce different summarization levels for different data sources, and hence different outputs that need to be reconciled.
10) There have to be default values when there is no source of data.
11) The efficiency of the selection of data is often a performance consideration because only a few records from the original source may need to be pulled.
12) Data also needs to be summarized and combined into the profile record.
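A small sketch of items 3 and 4 above, restructuring an operational key and normalizing a date format on the way into the warehouse; the particular formats (branch-prefixed keys, MMddyyyy source dates) are assumptions for illustration:

using System;
using System.Globalization;

static class Transform
{
    // e.g. operational key "123" from branch "NY" becomes the warehouse key "NY-00123"
    public static string RestructureKey(string branch, string operationalKey)
    {
        return branch + "-" + operationalKey.PadLeft(5, '0');
    }

    // e.g. the operational "12202013" (MMddyyyy) becomes the warehouse-consistent "2013-12-20"
    public static string NormalizeDate(string operationalDate)
    {
        DateTime parsed = DateTime.ParseExact(operationalDate, "MMddyyyy", CultureInfo.InvariantCulture);
        return parsed.ToString("yyyy-MM-dd");
    }
}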

Data warehouses are built for a wide variety of systems. However, each system maintains a structure of data called a snapshot. This has four components: a key, a timestamp, primary data that relates only to the key, and secondary data captured as part of the snapshot process with no direct relationship to the primary data.
The key identifies the record and the primary data.
The unit of time refers to the moment when the event being described by the snapshot occurred.
The primary data is the non-key data.
The secondary data is circumstantial data and is typically used with DSS processing.
A relationship between the primary and the secondary can be inferred because they are adjacent. This is called an artifact.
We now discuss metadata. Metadata is very useful to describe the data in the warehouse. It acts like an index, and it stores the structure of the data as known to the programmer and the DSS analyst, a hint for the source data, the data model, and a history of extracts.
Reference data is another component commonly used with data warehouses. It keeps track of referential integrity. Reference data tends to be ignored as data accumulates; however, it should be made time-variant just like any other part of the warehouse.
There are two different designs for reference data.
The first approach requires taking a snapshot of the reference data every six months. This approach is quite simple, but it is not sufficient because it might miss the activity between snapshots.
The second approach keeps track of all the activities against the reference table.
The actual implementation might be something in between the first and the second approach.
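A minimal sketch of the second approach, logging every change to a reference entry so that its value can be reconstructed as of any date; the entry shape (a code and a description) is an assumption:

using System;
using System.Collections.Generic;
using System.Linq;

// Each change to a reference entry is logged with its effective date,
// making the reference data time variant.
class ReferenceChange
{
    public string Code;
    public string Description;
    public DateTime EffectiveFrom;
}

class ReferenceHistory
{
    private readonly List<ReferenceChange> changes = new List<ReferenceChange>();

    public void Record(string code, string description, DateTime when)
    {
        changes.Add(new ReferenceChange { Code = code, Description = description, EffectiveFrom = when });
    }

    // Returns the description that was in effect for a code on a given date.
    public string AsOf(string code, DateTime when)
    {
        return changes
            .Where(c => c.Code == code && c.EffectiveFrom <= when)
            .OrderByDescending(c => c.EffectiveFrom)
            .Select(c => c.Description)
            .FirstOrDefault();
    }
}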
There is another important factor pervasive in the data warehouse. This is the length of time a change of data in the operational environment takes to be reflected in the data warehouse and is referred to as the cyclicity of data.
The data warehouse contains historical information, and changes in the operational system should take at least 24 hours to show up in the warehouse. This introduced delay is referred to as the wrinkle of time.
A wrinkle of time smaller than 24 hours is expensive to achieve, and the goal is to avoid doing operational processing in the warehouse and warehouse processing in the operational environment.

Tuesday, December 17, 2013

In today's blog post, we continue our discussion of the data warehouse. We mentioned that the warehouse is built iteratively. This has been proved correct by the industry track record, because the end user is unable to give all the requirements up front, management doesn't make a commitment until there are some results, and visible results must be seen quickly.
We will now see the role of the data model in the iterative development. The data model serves as a roadmap for each of the development efforts and plays a critical role in them. First, the data model tells what needs to be done.
Second, the data model tells how one development effort is to be integrated with the others.
The development efforts are like puzzle pieces and the data model makes sure the efforts fall in place.
If the data model were absent, many of the efforts would not fit well with others and there would be overlap and redundancies.
The output of a data model is a set of tables, in fact a lot of small tables with little data in each. If a program has to go to several tables, it has to interconnect with all of them.
A better approach would be to physically merge some of the tables so that minimal I/O is consumed.
Merging tables is one approach. Another approach is to create an array of data, so that a sequence of values, for example one value per month, is stored together and each value can be accessed directly; creating an array by month is easy. It might be interesting to note that sometimes redundant data is inserted deliberately. For example, if a base parts table has a non-redundant field called description, then several parts tables have to access the base parts table for that field, which is expensive. If, instead, the description field is redundantly placed in many tables, its access is independent and less expensive. Note, however, that this benefit applies to access to the tables.
Likewise, if some fields in a table are used much more or much less frequently than others, they can be stored in separate tables so that the most frequently used ones stand by themselves.
Another technique is to use derived data in the physical database design. For example, if we calculate annual taxes at year end every year, we can store the calculated data, and all subsequent accesses can use the calculated field.
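A small sketch of this derived-data technique for the tax example; the shapes (TaxDetail, AnnualTaxSummary) are assumptions:

using System.Collections.Generic;
using System.Linq;

// The annual tax figure is computed once at year end and stored, so subsequent
// queries read the stored value instead of re-summing the detail records.
class TaxDetail { public int Year; public decimal TaxPaid; }

class AnnualTaxSummary
{
    private readonly Dictionary<int, decimal> taxByYear = new Dictionary<int, decimal>();

    public void CloseYear(int year, IEnumerable<TaxDetail> details)
    {
        taxByYear[year] = details.Where(d => d.Year == year).Sum(d => d.TaxPaid);
    }

    public decimal AnnualTax(int year)
    {
        return taxByYear[year]; // derived field; no recalculation needed
    }
}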
Another technique is to use a creative index or a creative profile. A creative index is created as data is passed from the operational to the warehouse environment. Since the data is being accessed unit by unit anyway, the index creation is easy, inexpensive, and based on the users' interests.
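As a rough sketch of a creative index built during that pass: because every record is already being touched, keeping, say, the ten largest orders costs almost nothing extra. The Order shape and the choice of "largest orders" are assumptions for illustration.

using System.Collections.Generic;
using System.Linq;

class Order { public string OrderId; public decimal Amount; }

class LargestOrdersIndex
{
    private const int Capacity = 10;
    private readonly List<Order> largest = new List<Order>();

    // Called once per record during the operational-to-warehouse pass.
    public void Observe(Order order)
    {
        largest.Add(order);
        if (largest.Count > Capacity)
        {
            // keep only the top entries; cheap because the list stays tiny
            var trimmed = largest.OrderByDescending(o => o.Amount).Take(Capacity).ToList();
            largest.Clear();
            largest.AddRange(trimmed);
        }
    }

    public IReadOnlyList<Order> Top { get { return largest; } }
}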
Finally, referential integrity can be used to establish relationships between databases. Each solution can be tested with a few use cases, such as name and dept.
We continue our discussion on the data warehouse. We look into the "design" of the interface from operational systems. The design is actually heuristic: the requirements for a warehouse are not known until it is partially populated and in use, so a phase-by-phase approach is used instead. One portion of data is populated. It is then used and scrutinized by the DSS analyst. Feedback from the user leads to data modifications, and then another portion of the data warehouse is built. This feedback loop continues throughout the life of the warehouse.
On the other hand, anticipating requirements is still important, so there is a trade-off.
Design usually begins with operational data, the source for the warehouse. However, it is not a simple extract-and-load operation; there are several transformations involved. Operational data comes from different applications, and those applications are often not integrated. Data, when it makes its way into the warehouse, has to be integrated across applications. Integration means standardizing the conventions across systems, keeping the syntax and semantics the same on merging, and correctly mapping source fields to destination fields.
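A tiny sketch of that integration step: two source applications that encode the same attribute differently are mapped to one warehouse convention. The specific encodings ("M"/"F" versus "1"/"0") are assumed for illustration.

static class GenderIntegration
{
    // Application A stores "M"/"F", application B stores "1"/"0";
    // the warehouse standardizes on "male"/"female".
    public static string FromApplicationA(string code)
    {
        return code == "M" ? "male" : code == "F" ? "female" : "unknown";
    }

    public static string FromApplicationB(string code)
    {
        return code == "1" ? "male" : code == "0" ? "female" : "unknown";
    }
}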
Three types of loads are made into the data warehouse from the operational environments:
archival data
data currently in the operational environment
periodic refreshes from updates to the data in the operational environment
Loading archival data is often skipped or done only once.
Loading current data is not disruptive because the data is loaded only once and usually via downloaded files.
Loading periodic refreshes, on the other hand, is a major concern because we have to find what is already present and what is not. Some common techniques used to limit the scanning of data are as follows:
1) Scan the time-stamped data of the source and exclude the records that are not in range; however, the source may not have timestamps (a small sketch of this technique follows the list).
2) Scan a delta file that contains only the changes made to the application data as a result of the transactions.
3) Scan the log file or the audit file that is generated as a by-product of transaction processing.
4) Modify the application code to manage the amount of data scanned; however, application code can be legacy, old, and fragile.
5) The last option is literally a last resort: take before and after images of the operational file and compare them, which dedicates a large amount of resources to the task.
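A minimal sketch of technique 1 above: limit the refresh scan to source records stamped since the last extract. The record shape (LastUpdated, Payload) is an assumption, and, as noted, the technique only works when the source actually carries timestamps.

using System;
using System.Collections.Generic;
using System.Linq;

class SourceRecord { public DateTime LastUpdated; public string Payload; }

static class PeriodicRefresh
{
    // Returns only the records changed since the last extract ran.
    public static IEnumerable<SourceRecord> ChangedSince(IEnumerable<SourceRecord> source, DateTime lastExtract)
    {
        return source.Where(r => r.LastUpdated > lastExtract);
    }
}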