Wednesday, January 1, 2014

We continue our discussion of advanced data warehouse topics, looking at the financial data warehouse first. Financials are a great starting point for the data warehouse: they usually involve a small amount of data and are close to the business. However, financial data is application oriented rather than corporate oriented, so some transformations are involved. The warehouse data may be based on a different time period, currency unit, or even different classifications than the application data. Therefore the financial analyst needs more information about the architecture than other users do.
As with financial data, the warehouse's system of record, such as for account balances, must be constantly maintained. A system of record is the definitive source for a piece of data. For example, if a bank maintains account balances for customers, there should be one place where the balance is accurately reflected; everything else is a copy of this data. As data is passed to the data warehouse environment, it changes from a current value to historical data. In this way a system of record for historical data is created, and this is then used for all kinds of DSS processing.
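As a minimal sketch of this idea (the account feed, balances, and dates here are hypothetical), the operational system of record holds only the current value, while the warehouse accumulates dated snapshots that become the historical record:

```python
from datetime import date

# Operational system of record: one current balance per account (hypothetical data).
current_balances = {"ACCT-1001": 2500.00, "ACCT-1002": 310.75}

# Data warehouse: historical snapshots keyed by account and snapshot date.
warehouse_history = []

def snapshot_to_warehouse(balances, as_of):
    """Copy the current values into the warehouse as dated, immutable history."""
    for account, balance in balances.items():
        warehouse_history.append(
            {"account": account, "balance": balance, "snapshot_date": as_of}
        )

snapshot_to_warehouse(current_balances, date(2014, 1, 1))

# The operational value may change later; the warehouse rows do not.
current_balances["ACCT-1001"] = 1800.00
snapshot_to_warehouse(current_balances, date(2014, 1, 2))

for row in warehouse_history:
    print(row)
```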
The Corporate Information Factory (CIF) that the data warehouse evolved into had two prominent features: the virtual operational data store and the addition of unstructured data. The virtual operational data store (VODS) was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture.
Archival data was another feature added to the CIF. Data would now be transferred from the data warehouse to nearline storage using a cross media storage manager (CMSM) and then retired to archival storage.
Another feature added to the CIF was unstructured visualization technology, which is to unstructured data what business intelligence is to quantitative data.
The Government Information Factory (GIF) was created along the same lines as the CIF, but there are differences: the need for widespread integration and sharing beyond the agency, the need to accommodate data for very long periods of time, and the need for security from the outset of design.
The emerging trends for the CIF now include directions involving analytics, ERP/SAP business intelligence, unstructured business intelligence, and the capture and management of massive volumes of data.

Tuesday, December 31, 2013

We now look at advanced topics of the data warehouse. Some of these don't fit well with the others, so they are covered here. We start with the requirements. Where do these come from? They come from the common data model. The end user requirements influence the model, but their role is indirect. The model has different levels: high, mid, and low. The high level consists of the subject areas. The mid level identifies the keys, attributes, relationships, and other details of the data warehouse. The low level is where the physical database design is done, including partitioning, foreign key relationships, and indexes. When the relational database is ready, the end user requirements are used to show how the data is to be reshaped.
Resource contention in the data warehouse often occurs due to statistical processing. If the contention occurs only periodically, the statistical processing can be interspersed with the regular workload. If there is no room for the statistical processing, it is better to build out a special warehouse, called the exploration data warehouse, for the statistical processing. The purpose of an exploration warehouse is to provide a foundation for heavy statistical analysis.
Another reason for separating the exploration warehouse from the main warehouse is that statistical processing requires a very different environment from regular analytic processing. Yet another reason is database design. For example, a convenience field is often added to streamline the statistical analysis, and the data is recast from the data warehouse. The exploration warehouse is seldom a direct copy of the data found in the data warehouse.
There is a difference between the exploration warehouse and the data mining warehouse. They are similar, but the exploration warehouse is used to create assertions, hypotheses, and observations, while the data mining warehouse is used to test the strength and truth of those hypotheses. The exploration warehouse must provide breadth of information; the data mining warehouse needs depth.
Unlike the data warehouse, the exploration warehouse is not refreshed with the most current detailed data. This is because the processing on the exploration warehouse requires the data to be frozen.
One other difference between the main data warehouse and the exploration warehouse is that external data fits easily into the exploration warehouse, where external results and internal results often make for interesting comparisons. External data, on the other hand, is difficult to integrate with the finely detailed structure of the main warehouse.
Another topic is whether data marts and the data warehouse should share the same processor. There are several reasons why they should not: first, the processors are expensive, and separating the data marts onto their own machines brings costs down; second, the workloads are different, so they are more manageable apart; third, different departments can take ownership of their data marts.
There is a life cycle to data as it enters the corporation and is used. It is first captured, then edited, then loaded into the application; online data is then integrated and passed into the warehouse, where the frequency of usage drops; from there it is sent to near-line storage and subsequently retired to archival storage.
We also mention testing and the data warehouse. A test environment is a good staging ground: data changes can be made and verified there before being made visible to the public. The volume of data in the test environment can be just large enough to exercise the changes.
Tracing the flow of data through the warehouse is another good practice. This means the data is traced as it flows through initial capture, ETL enterprise integration, data mart transformation, the exploration warehouse or data mining, and near-line storage. Tracing tells you about a piece of data, such as what "revenue" means when there may be multiple definitions or unclear origin or lineage. One common issue when tracing is a name change or value change; a less common issue is the recalculation of a unit of data. Both these data and metadata changes need to be considered when tracing.
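As a rough illustration of tracing, here is a sketch built around a hypothetical lineage log; the stage names, field names, and notes are assumptions, but the pattern of recording each hop so a field like revenue can be traced back through renames and recalculations is the point:

```python
# Hypothetical lineage log: one entry per hop a data element takes through the warehouse.
lineage_log = []

def record_hop(element, stage, name_at_stage, note=""):
    """Append a lineage entry; 'note' captures renames or recalculations."""
    lineage_log.append(
        {"element": element, "stage": stage, "name": name_at_stage, "note": note}
    )

# Trace the 'revenue' element from initial capture to a data mart.
record_hop("revenue", "initial capture", "ORD_AMT", "raw order amount from order entry")
record_hop("revenue", "ETL integration", "gross_revenue", "renamed; converted to USD")
record_hop("revenue", "data mart", "net_revenue", "recalculated: gross minus returns")

def trace(element):
    """Return the ordered list of hops for a data element."""
    return [hop for hop in lineage_log if hop["element"] == element]

for hop in trace("revenue"):
    print(hop["stage"], "->", hop["name"], "|", hop["note"])
```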
Another factor related to tracing is velocity. The velocity of data refers to the speed at which it passes through the warehouse, from initial capture to the point of use by an analyst. Integration may hinder velocity.
Data may also be pushed from the operational environment into the data warehouse, or pulled by data marts from the warehouse. Both affect velocity.
An important application of the data warehouse is eBusiness. From the start, eBusiness systems were kept separate from the operational environment. With the data warehouse and the ODS, it is easier to interface the web with corporate information.


Monday, December 30, 2013

We talk about the differences between the relational and the star join models as the basis for data warehouse design. The first difference is in terms of flexibility and performance. The relational model has high flexibility but is not optimized for performance. The multidimensional model targets a specific user community and hence performs better for it. When new requirements are added to the star join model, its optimizations for performance become moot. When the relational model is used, there is no particular optimization for performance one way or another, but the data is stored at the lowest level of granularity. If performance is required, it is better to pull data out of the relational model and then optimize access to that copy.
The relational model is shaped by a pure data model, while the multidimensional model is shaped by processing requirements, and this leads to several consequences. The first consequence is serviceability: as discussed, the relational model is more flexible and performs well when the data is accessed indirectly, whereas the multidimensional model is good for direct access of data. The second consequence is the level of abstraction and origins: the relational model organizes data at a higher level of abstraction and can support many different usages, while the dimensional model is shaped by specific processes and suits only some workflows.
The relational model supports shaping and reshaping of many different views of the data. For example, merging is easy because data is stored at the most granular, normalized level. With this granularity, it is easy to meet future, as yet unknown needs, because the atomic data can be combined into more meaningful representations once a need becomes known. Yet another advantage of the relational model is the ability to change gracefully: it is designed to be used in an indirect fashion, which means there is a lag between changes to the data and the direct uses of that data.
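A small sketch of this reshaping, using hypothetical normalized rows: the granular relational data is kept as-is, and a denormalized, star-join-style fact row is derived from it only when a specific analytical need appears.

```python
# Normalized, granular relational data (hypothetical rows).
customers = {1: {"name": "Acme Corp", "region": "West"}}
products = {10: {"name": "Widget", "category": "Hardware"}}
orders = [
    {"order_id": 100, "customer_id": 1, "product_id": 10, "qty": 5, "amount": 250.0},
    {"order_id": 101, "customer_id": 1, "product_id": 10, "qty": 2, "amount": 100.0},
]

def build_fact_rows():
    """Reshape granular relational data into denormalized fact rows
    (the kind of structure a star join / data mart would hold)."""
    facts = []
    for o in orders:
        c = customers[o["customer_id"]]
        p = products[o["product_id"]]
        facts.append(
            {"region": c["region"], "category": p["category"],
             "qty": o["qty"], "amount": o["amount"]}
        )
    return facts

# The same granular data can be reshaped differently for another requirement later.
for row in build_fact_rows():
    print(row)
```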
The multidimensional model is aligned with what is referred to as the independent data mart approach. A data mart is a data structure dedicated to serving the analytical needs of one group of people. With independent data marts, it is easier and less expensive to build systems as the organization's data grows. A dependent data mart, on the other hand, is built from data coming from the warehouse. The dependent data mart does not depend on legacy or operational data for its source; it depends only on the data warehouse. The dependent data mart requires forethought and investment.
However, independent data marts, as they crop up, tend to have redundancies between the systems. With the progression of time, this data may become unincorporated, uncorrelated, and, even worse, unsynchronized, and the redundancy grows exponentially. Once the momentum for independent data marts is fully established, it hurts the long-term perspective and maintenance. Consider the warehouse on the other hand: the data is more reusable, there is a limited number of interface programs, and results are reconcilable. Further, the architectural problems of the independent data marts would not have arisen.
Hence the recommendation against independent data marts and in favor of a data warehouse, despite the challenges encountered in building one.

Sunday, December 29, 2013

In this post, we will discuss the really large data warehouse. As we have seen, data warehouses grow at a rapid rate, from a few gigabytes to terabytes in a few years. This is primarily due to the following contributors:
1) history - nowhere else in the system do we keep historical data as we do in the warehouse. Application developers are quick to remove historical data because it affects performance.
2) granularity - the data warehouse collects data at the most granular level. This lets developers look at data like no one could before.
3) integration - the data warehouse brings together data from different operational systems. It is the database for the entire enterprise, with a common data model, common keys, common reference tables, and common definitions.
These are referred to in the book as:
Historical data × Diverse data × Detailed data = Lots of data
The impact of all this data is threefold: 1) the associated costs, 2) the value in terms of usefulness of the data, and 3) the maintenance required by the rules of data management.
Basic data management activities include loading, creating indexes, and so on.
The real costs of storage include the costs of the processor and its software (DBMS and operating system), the communications controller, the disk controller, and the data store itself.
When data grows, the percentage of data that is actually used drops. For example, at 50 GB almost all the data is used; at 500 GB only 75% is used; at 2 TB, 35% is used; and at 10 TB, only 5% is used. This is important to understand because it enables efficient planning of storage technology.
To calculate this usage of data, we first determine the size of the data access. We estimate this as:
Number of users × Queries per day × Bytes per query × 200 days
We can also divide by an overlap factor that accounts for the same data being pulled more than once.
Typical numbers are 250 GB of data servicing 600 users.
Next, the usage of data is calculated as a ratio:
Usage ratio = Actual bytes used / Total data warehouse bytes
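A worked sketch of this calculation, with assumed figures for queries per day, bytes per query, and the overlap factor (only the 250 GB and 600 users come from the text above):

```python
# Hypothetical figures for the data-access estimate.
users = 600
queries_per_day = 10
bytes_per_query = 100_000
working_days = 200
overlap_factor = 2.0   # same data pulled more than once

actual_bytes_used = (users * queries_per_day * bytes_per_query * working_days) / overlap_factor

total_warehouse_bytes = 250 * 10**9   # 250 GB warehouse

usage_ratio = actual_bytes_used / total_warehouse_bytes
print(f"Estimated bytes accessed: {actual_bytes_used:,.0f}")
print(f"Usage ratio: {usage_ratio:.2%}")
```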
Note that the separation of frequently used and infrequently used data differs across systems. In an OLTP environment, access is random and the two are interspersed. In a DSS environment, there is a much sharper division between frequently and infrequently used data.


Saturday, December 28, 2013

We now look at unstructured data and the warehouse. Combining structured data and unstructured data is like integrating two different worlds. CRM is an example: in CRM we have structured data corresponding to the customer demographics, and unstructured data corresponding to the documents and communications from the customer.
Text is common to both worlds, and with it we can match the communications with the customer. However, text is susceptible to misspellings, mismatched context, same names, nicknames, incomplete names, and word stems.
Using stop words and stemming, we can do a probabilistic match in which all the data that intersects is used to make the match. The strength of the match can be depicted on a numeric scale.
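A minimal sketch of such a probabilistic match, assuming a toy stop-word list and a crude suffix-stripping stemmer; a real implementation would use far richer word lists and a proper stemmer, but the idea of scoring the overlap on a numeric scale is the same:

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "for", "is"}

def stem(word):
    """Crude suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Lowercase, drop stop words, and reduce remaining words to stems."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return {stem(w) for w in words if w and w not in STOP_WORDS}

def match_strength(structured_text, unstructured_text):
    """Probabilistic match: fraction of structured terms found in the unstructured text."""
    a, b = terms(structured_text), terms(unstructured_text)
    return len(a & b) / len(a) if a else 0.0

# Example: a customer name/city field vs. a customer communication.
print(match_strength("Jonathan Smith, Denver",
                     "Complaint from Jon Smith regarding billing in Denver"))
```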
A themed match is another way to make a match, where the unstructured data is organized by themes such as sales, marketing, human resources, engineering, accounting, and distribution. Once the collection of industrially recognized themes has been gathered, the unstructured data is passed against all the themes for a tally.
Linkages between themes and theme words can be established via a raw match of the data.
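Continuing the same toy approach, a themed match might pass the unstructured text against an assumed word list per theme and tally the hits; the theme vocabularies below are illustrative, not an industry taxonomy:

```python
# Assumed theme word lists (illustrative only).
THEMES = {
    "sales": {"order", "quote", "discount", "invoice"},
    "human resources": {"hire", "benefits", "payroll", "vacation"},
    "engineering": {"design", "defect", "release", "specification"},
}

def theme_tally(text):
    """Tally how many words of each theme appear in the unstructured text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return {theme: len(words & vocab) for theme, vocab in THEMES.items()}

print(theme_tally("Customer requested a discount on the invoice for the last order"))
```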
Another way to link two documents is based on the metadata in the structured environment.
A two-tiered data warehouse is typically used when unstructured data enters the data warehouse: one tier holds the structured data and the other holds the unstructured data.
The unstructured data can be further subdivided, for example into communications and documents.
The relationship between the communications and the documents is formed on the basis of identifiers.
Unstructured data visualization is very different from that for structured data, which is more business intelligence. For unstructured data, we create a self-organizing map in which clusters are depicted.
Volume of data is an issue with every warehouse, whether it holds unstructured or structured data.
The two environments are fit together on the basis of identifiers. In the structured environment we keep both metadata and data pertaining to the unstructured environment: the metadata has information about the unstructured repository, while the records carry information about the data, an identifier, and close identifiers. Close identifiers are those for which there is a good indication that a match has been made.
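A small sketch of how the two tiers might be fit together, with hypothetical identifier records: the structured side keeps metadata about the unstructured repository plus linkage rows carrying an identifier (a confirmed match) and close identifiers (probable matches):

```python
# Structured tier: customer rows plus linkage records pointing into the unstructured tier.
customers = {"C-001": {"name": "Jon Smith", "region": "West"}}

# Metadata about the unstructured repository kept on the structured side (hypothetical).
unstructured_repository = {"name": "crm-docs", "location": "/archive/crm", "doc_count": 2}

# Linkage records: identifier = confirmed match, close_identifiers = probable matches.
linkages = [
    {"customer_id": "C-001", "identifier": "DOC-17", "close_identifiers": ["DOC-23"]},
]

def documents_for(customer_id):
    """Return confirmed and probable document identifiers for a customer."""
    for link in linkages:
        if link["customer_id"] == customer_id:
            return {"confirmed": [link["identifier"]], "probable": link["close_identifiers"]}
    return {"confirmed": [], "probable": []}

print(documents_for("C-001"))
```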

When discussing the migration of data to the architected environment during the build phase of the warehouse, it is important to discuss the feedback loop. First, the data warehouse is populated from existing environments. Then the DSS analyst uses it for analysis and gives new requirements to the architect. The architect then works against the existing systems to meet the requirements coming out of the warehouse. This is the feedback loop.
The loop requires some data to be available in the warehouse before the DSS analyst can come up with new requirements. The shorter the loop, the more successful the warehouse; the larger the volume of data, the longer the loop.
The architect owns the data model. All feedback is translated into changes to this data model, which is then used together with the existing system of record to improve the warehouse.
The architect also has to keep track of the agents of change: the aging of systems, the aging of technology, organizational upheaval, and drastically changed requirements.
These agents of change let the architect coordinate the efforts to rebuild the operational systems. This works even better than changing the data model alone.
The steps taken to restructure the operational environment are as follows:
identify how the data model differs from the existing systems,
study the impact of each delta item,
estimate the cost of fixing each delta item,
and prepare a report to management on what needs to be fixed, the estimated resources, the order of work, and the disruption analysis.
The methodology used for the data warehouse is called the spiral development methodology. It differs from the migration path described above: the migration path is about general activities, while the methodology describes the specific activities, their deliverables, and the order of the activities. Together, the methodology and the migration path cover all that is required for building the warehouse.
A word of caution about methodologies: there is an industry track record of even the most rationally appealing methodologies failing during implementation.
The failures happen for reasons such as these: methodologies prescribe a set of activities that often ignores iteration; they describe how to do something but not what to do; they ignore the scope or the detail of the work involved. Other reasons include mixing project management with development and design, not differentiating operational from DSS systems, and not including checkpoints and stops on failure.
In this sense the spiral methodology is different from the others in that it proceeds down the same path repeatedly, as opposed to, say, the waterfall methodology. Further, it is data driven.

Friday, December 27, 2013

Since we have discussed building the data warehouse, we will now discuss the migration to this environment and the ease of doing so. We mentioned earlier that this is a step-by-step activity; most successful implementations have been built one iteration at a time. The size and speed of the iterative development matter, and results must be delivered quickly.
But migration is different from iterative development. Many migrations have failed, so the book describes a generic migration plan that has worked in many cases.
This migration plan begins with a corporate data model. This data model includes major subjects, their definitions, relationships, and groupings.
Note that this model excludes derived data and DSS data because they are subject to change and are temporal.
After the corporate data model and the mid-level models are in place, the next activity is defining the system of record. The system of record is defined in terms of the corporation's existing systems and represents the best data in terms of completeness, timeliness, and accuracy.
Next the challenges in bringing the data into the data warehouse are identified.
These could include the following:
if the data arrives in large volumes, the arrival rate has to be accounted for in order to handle the influx of data;
voluminous data may also need to be summarized, aggregated, or partitioned (see the sketch after this list);
data may have to be extracted through interfaces, and special transformations such as integration, alteration of the time basis, condensation, and scanning may need to be done.
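As an illustration of condensation during load (the detail records and grouping keys are hypothetical), high-volume detail can be summarized by day and store before it enters the warehouse so the arrival rate stays manageable:

```python
from collections import defaultdict

# Hypothetical high-volume detail records arriving from an operational interface.
detail = [
    {"date": "2013-12-26", "store": "S1", "amount": 19.99},
    {"date": "2013-12-26", "store": "S1", "amount": 5.00},
    {"date": "2013-12-27", "store": "S1", "amount": 42.50},
]

def summarize_by_day(rows):
    """Condense detail into one row per (date, store) before loading the warehouse."""
    totals = defaultdict(lambda: {"amount": 0.0, "transactions": 0})
    for r in rows:
        key = (r["date"], r["store"])
        totals[key]["amount"] += r["amount"]
        totals[key]["transactions"] += 1
    return [{"date": d, "store": s, **v} for (d, s), v in totals.items()]

for row in summarize_by_day(detail):
    print(row)
```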
If the data modeling activity has been done properly, the design of the data warehouse is fairly simple. When the design is complete, the warehouse is organized by subject area.
Once the interfaces have been built, the first subject area is populated. The data is captured and transported into the warehouse environment. Sometimes only a fraction of data is required.
End users work by discovering the interfaces and then using them as the data is populated.
Often the first population of the data is incorrect, but it reveals many of the corrections needed to populate the data correctly going forward.
The population and feedback processes then continue.
By building in a lag time, usually 24 hours, the data warehouse developer serves the DSS needs of the company.
Sometimes there is a call for rapid population of the warehouse; this is referred to as active data warehousing, since it supports a small amount of online access processing.