Saturday, January 4, 2014

We continue our post on data warehouses with a discussion of the end-user community. The end users have a lot of say in how the data warehouse takes shape. Because they are a diverse group, we recognize four types: the farmers, the explorers, the miners, and the tourists. The farmer is the most predominant type of user found in the data warehouse environment. This is a predictable user: the queries submitted are short, go directly for the data, recur at about the same time each week, and usually succeed in finding the data.
The explorer is the user who does not know what he or she wants ahead of time, and hence takes more time and covers a larger volume of data in the search. The exploration proceeds in a heuristic mode. In many cases the exploration looks for something and never finds it, but there are also cases where the discoveries are especially interesting.
The miner is the user who digs into piles of data to test assertions, measuring how strongly the data supports them. This community usually works with statistical tools and therefore needs mathematical skills. The miner may work closely with the explorer: the explorer creates assertions and hypotheses, and the miner determines their strength.
The tourist is the user who knows where things can be found. This user has breadth of knowledge rather than depth, and is familiar with both formal and informal systems. He or she knows the metadata and the indexes, the structured data and the unstructured data, and the source code and how to read and interpret it.
These end users target different types of data. If warehouse data were arranged in bands by the probability of its use, the farmers would be very predictable and target only the small top band, while the explorers would reach across all of the data.
Cost justification and ROI analysis could be described for these user communities as follows:
The farmer's value and probability of success are very high; his queries are useful in decision support. The explorer's success rate is not as high, although his finds are much more valuable than the regular queries performed by the farmers. The ROI case for the warehouse should therefore be made on the farmer community rather than on the explorers.



Friday, January 3, 2014

We cover Corporate Information Compliance and data warehousing.
Corporate Information Compliance is a legal compliance. Some examples are Sarbanes-Oxley, Basel II, and HIPAA. These compliance rules were brought about because corporations such as Enron, WorldCom, and Global Crossing engaged in accounting fraud. The Sarbanes-Oxley Act, for instance, was introduced to enforce proper and honest accounting.
The data warehouse plays an important role in this compliance. Compliance is implemented over two kinds of data: financial transactions and corporate communications. The financial transactions are checked for completeness and for legitimacy of routing and classification, and then separated into past and present data; the past data makes its way to the data warehouse. On the corporate communications side, the communications and the compliance terms and phrases are fed into a sorter that builds a word-and-phrase context, and together with a simple index the actual messages are pushed into the data warehouse.
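As a rough sketch of these two flows in JavaScript (the same language used in the automation post below), where the record fields, the checks, and the cutoff rule are all assumptions made for illustration:

// Hypothetical sketch: route financial transactions toward the warehouse.
function routeTransaction(txn, cutoffDate) {
    var complete = txn.amount != null && txn.date != null && txn.controlNumber != null;
    var legitimate = txn.routedTo != null && txn.classification != null;
    if (!complete || !legitimate) return "reject for audit review";
    // past data goes to the warehouse; present data stays operational
    return (txn.date < cutoffDate) ? "past: load into data warehouse" : "present: keep in operational systems";
}

// Communications side: a simple sorter that flags compliance terms and builds an index entry.
function indexMessage(message, complianceTerms) {
    var hits = complianceTerms.filter(function (term) {
        return message.text.indexOf(term) != -1;
    });
    return { id: message.id, terms: hits, text: message.text }; // index entry plus actual message go to the warehouse
}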
The two basic compliance activities for a corporation are
to comply with financial requirements and controls and
to comply with organizational communications aspect of regulation.
Financial compliance deals with the recording, procedures, and reporting of financial transactions. These translate into a whole set of approvals and formalizations of procedure that corporations had usually never encountered before. Compliance scrutinizes the financial transactions not just at the micro level but at the macro level as well, so they must be cared for at both levels.
Most corporations start with the present aspect of the financial transactions audit. They use mini-audits up front to make sure the systems comply. Once the audit and the procedures are finished, the data makes its way into the warehouse.
However, compliance also requires looking at past data. Since the warehouse holds historical, granular, and integrated data, it becomes useful for financial auditing. Looked at broadly, corporate finance deals with two aspects: the what and the why.
The what is answered by the details of all financial transactions: amount, source, date, control number, classification, and so on. It covers questions such as whether all transactions are included, whether the transactions are recorded at the lowest level of granularity, whether the relevant information for each financial transaction is recorded properly, whether the recording is accurate, and whether the transactions have been classified.
One difference between data stored for compliance and data stored in the warehouse is that the former is seldom used, although both are large.
Another difference is that the data for compliance cannot be lost, because audits may always be needed. The data in the warehouse does not have the same sensitivity to loss.
Another difference is responsiveness to queries. For the warehouse, query response times can range widely; audit queries usually have to be completed within days.
Yet another difference between data warehouse data and data stored for compliance is retention. The length of time that compliance data needs to be stored depends on legislation, company comfort, and physical storage.
The why of the financial transactions is about the activities that take place before a transaction occurs. These include things such as proposals, commitments, terms, delivery, and guarantees.


Thursday, January 2, 2014

This post returns to the discussion of the data warehouse, starting with cost justification and return on investment. We look at the macro level of cost justification before we compare it with the micro level. The macro level refers to a discussion at a high level, such as increases in profits or stock price. However, the macro level is affected by many factors, not just by improvements in the warehouse, so a specific association with the warehouse is hard to establish.
For micro-level cost justification, the cost of each data pull from the operational systems, together with the integration it requires, is compared against the ease of getting the same information from a data warehouse. Information from the legacy environment is hard to obtain: the data may not be clean, undocumented APIs may be involved, guesses have to be made, and the process in general is very convoluted. Even when it isn't, there is still integration to do. Further, the data may not all be available at the same time, a staging area may be required, and finally a report has to be published.
The difference in the cost of information with or without a data warehouse is the basis for its cost justification.
When we look at the steps involved in building a warehouse, they are similar to what has just been described, with the difference that there are far fewer redundancies and more efficiency. Hence the cost of building a warehouse should be less than that of the same operations without it. Further, the cost of building a warehouse is incurred once, whereas without it the operations have to be repeated again and again.
There is also far less time required to get information out of the data warehouse, and this savings in time also translates into savings in cost.
The speed of information also matters for decision support, and is sometimes critical for new business. There is a period during which information is very valuable and a point after which it may even be worthless. This is called the time value of information; it helps in recognizing the significance of the warehouse, although it is difficult to quantify.
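To make the arithmetic concrete, here is a small JavaScript sketch of the break-even comparison and of the time value of information; every number, and the linear decay shape, is an assumption for illustration only:

// Hypothetical costs: a one-time warehouse build versus repeating the legacy pull for every request.
var buildCost = 500000;        // one-time cost of building the warehouse (assumed)
var legacyPullCost = 25000;    // cost of one pull-and-integrate from the legacy environment (assumed)
var warehouseQueryCost = 500;  // marginal cost of the same request against the warehouse (assumed)

var breakEvenRequests = buildCost / (legacyPullCost - warehouseQueryCost);
console.log("warehouse pays for itself after about " + Math.ceil(breakEvenRequests) + " requests");

// Time value of information: the worth of an answer decays with the delay in obtaining it.
function informationValue(initialValue, delayDays, worthlessAfterDays) {
    return Math.max(0, initialValue * (1 - delayDays / worthlessAfterDays));
}
console.log(informationValue(100000, 2, 30));  // a fast answer retains most of its value
console.log(informationValue(100000, 45, 30)); // a late answer is worthless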
Integrated information is best available from the warehouse; for example, customer-centric data may be very helpful in exploring new opportunities.
Historical data is also of real value and adds another dimension to the usefulness of a data warehouse.
Thus we see that a better way to proceed with a cost justification is at the micro level.
The data warehouse doesn't provide real-time information, but something near it can be provided by an operational data store (ODS). Data flows between the ODS and the warehouse in both directions. Profile records are often created and placed in the ODS.
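A profile record rolls many granular transactions up into one summary per customer. As a rough JavaScript sketch, with invented field names:

// Hypothetical sketch: aggregate granular transactions into a single ODS profile record.
function buildProfileRecord(customerId, transactions) {
    var total = 0;
    var lastDate = null;
    transactions.forEach(function (t) {
        total += t.amount;
        if (lastDate == null || t.date > lastDate) lastDate = t.date;
    });
    return { customerId: customerId, txnCount: transactions.length, totalAmount: total, lastActivity: lastDate };
}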






Wednesday, January 1, 2014

This is a short break to do UI automation testing with iOS. Here we look at some sample code (untested) for automating the Notes app after selecting it as the target in Instruments.
var testName = "Notes application testing";
UIALogger.logStart(testName); 
UIALogger.logMessage("Create a new note");
var target = UIATarget.localTarget();
var app = target.frontMostApp(); // assumes Notes has been selected in Instruments and starts on a fresh page
var page = app.mainWindow().textViews()[0];
var sentence = "The big brown bear ate the brownies in the big paper bag.";
var added = "";
UIALogger.logMessage("Repeat entry of : The big brown bear ate the brownies in the big paper bag.");
for (var i = 0; i < 6; i++)
{
    if (i == 0) app.navigationBar().buttons()["+"].tap(); // first pass creates the note
    else app.mainWindow().textFields()[0].tap(); // later passes reopen it from the list
    saveSentence(target, app, page, sentence);
    added = added.concat(sentence); // concat returns a new string, so reassign
}
UIALogger.logMessage("Save a log of the inserted lines in the note.");
saveLog(target, app, page);
UIALogger.logMessage("Retrieve the note created in #5 and confirm the lines are exactly the same.");
if (page.value().indexOf(added) != -1) {
    UIALogger.logPass(testName);
}
else {
    UIALogger.logFail(testName);
}
UIALogger.logMessage("Delete the note and confirm it is no longer stored in Notes.");
// install the alert handler before deleting so the confirmation alert gets dismissed
UIATarget.onAlert = function onAlert(alert) {
    alert.buttons()["Delete Note"].tap();
    return true;
}
deleteNotes(target, app, sentence, testName);
UIALogger.logMessage("Test completed.");
function saveSentence(target, app, page, sentence)
{
    page = app.mainWindow().textViews()[0];
    UIALogger.logMessage("Open Notes");
    target.delay(1);
    page.tap();
    UIALogger.logMessage("The new note must have the following line in it - The big brown bear ate the brownies in the big paper bag.");
    page.setValue(page.value() + sentence); // value() is a method on UIAElement
    UIALogger.logMessage("Close Notes");
    app.navigationBar().buttons()["Done"].tap();
    app.navigationBar().buttons()["Notes"].tap();
}
function saveLog(target, app, page)
{
    target.delay(1);
    app.mainWindow().textFields()[0].tap();
    page = app.mainWindow().textViews()[0];
    page.tap();
    page.logElementTree(); // dumps the element hierarchy to the Instruments log; it does not return a string to append
    app.navigationBar().buttons()["Done"].tap();
    app.navigationBar().buttons()["Notes"].tap();
    app.mainWindow().textFields()[0].tap();
    page = app.mainWindow().textViews()[0];
}
function deleteNotes(target, app, title, testName)
{
    target.delay(1);
    target.deactivateAppForDuration(10);
    app.navigationBar().buttons()["Delete"].tap();
    app.navigationBar().buttons()["Notes"].tap();
    if (app.mainWindow().textFields()[0].value() == title)
        UIALogger.logFail(testName);
}

We continue our discussion of advanced data warehouse topics. We look at the financial data warehouse next. Financials are a great starting point for the data warehouse: they usually involve a small amount of data and are close to the business. However, financial data is very application oriented rather than corporate oriented, so some transformations are involved. The basis for the warehouse data could be a different time period, currency unit, or classification scheme than that of the application. The financial analyst therefore needs more information about the architecture than other users do.
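As an illustration of such transformations, here is a JavaScript sketch that normalizes currency and maps application dates onto corporate fiscal periods; the rate table and the April start of the fiscal year are invented assumptions:

// Hypothetical exchange rates for normalizing application amounts to a corporate currency.
var ratesToUSD = { USD: 1.0, EUR: 1.37, GBP: 1.65 };

function toCorporateRecord(appRecord) {
    return {
        account: appRecord.account,
        amountUSD: appRecord.amount * ratesToUSD[appRecord.currency],
        fiscalPeriod: fiscalQuarter(appRecord.date)
    };
}

function fiscalQuarter(date) {
    // assume the corporate fiscal year starts in April, so April-June is Q1
    var monthsIntoFiscalYear = (date.getMonth() + 12 - 3) % 12;
    return "Q" + (Math.floor(monthsIntoFiscalYear / 3) + 1);
}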
Like the financial data itself, the warehouse's system of record, such as for an account balance, must be constantly maintained. A system of record is the definitive source for a piece of data. For example, if a bank maintains account balances for customers, there should be one place where each balance is accurately reflected; everything else is a copy of this data. As data is passed to the data warehouse environment, it changes from current values to historical data. A system of record for the historical data is thereby created, and this is then used for all kinds of DSS processing.
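A minimal JavaScript sketch of that idea, with invented account identifiers: one structure holds the single authoritative current value, and the warehouse accumulates dated snapshots that are never updated in place:

// Hypothetical system of record: the one authoritative place for current balances.
var systemOfRecord = { "acct-1001": 2500.00, "acct-1002": 130.25 };

// The warehouse side: append-only dated snapshots form the system of record for history.
var warehouseHistory = [];
function snapshotBalances(asOfDate) {
    for (var acct in systemOfRecord) {
        warehouseHistory.push({ account: acct, balance: systemOfRecord[acct], asOf: asOfDate });
    }
}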
The Corporate Information Factory (CIF) that the data warehouse evolved into had two prominent features: the virtual operational data store and the addition of unstructured data. The virtual ODS (VODS) was a feature that allowed organizations to access data on the fly without building an infrastructure. This meant that corporate communications could now be combined with corporate transactions to paint a more complete picture.
Archival data was another feature added to the CIF. Data would now be transferred from the data warehouse to near-line storage using a cross-media storage manager (CMSM) and then retired to archival storage.
Another feature added to the CIF was unstructured visualization technology, the equivalent for unstructured data of what business intelligence is for quantitative data.
The Government Information Factory was created along the same lines as the CIF but there are differences. These are the need for widespread integration and sharing beyond the agency, the need to accommodate data for very long periods of time and the need for security from the outset of design.
The emerging trends on the CIF now include directions involving analytics, ERP/SAP business intelligence, unstructured business intelligence, and the capture and management of massive volumes of data.

Tuesday, December 31, 2013

We now look at advanced topics of the data warehouse; some of these don't fit well with the others, so they are covered here. We start with the requirements. Where do these come from? They come from the common data model. The end-user requirements influence the model, but their role is indirect. The model has three levels: high, mid, and low. The high level consists of the subject areas. The mid level identifies the keys, attributes, relationships, and other details of the data warehouse. The low level is where the physical database design is done, including partitioning, foreign-key relationships, and indexes. When the relational database is ready, the end-user requirements are used to show how the data is to be reshaped.
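As a toy JavaScript illustration of the three levels for a single subject area, where the subject area, keys, and physical choices are all invented:

// Hypothetical sketch of the high, mid, and low levels of the data model.
var customerModel = {
    highLevel: { subjectArea: "Customer" },
    midLevel: {
        keys: ["customer_id"],
        attributes: ["name", "segment", "region"],
        relationships: ["Customer places Order"]
    },
    lowLevel: { // physical database design
        partitioning: "by region",
        foreignKeys: [{ from: "order.customer_id", to: "customer.customer_id" }],
        indexes: ["customer_id", "region"]
    }
};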
Resource contention in the data warehouse often occurs due to statistical processing. If the contention is only periodic, the statistical processing can be interspersed with the rest of the workload. If there is no room for it at all, it is better to build out a special warehouse, called the exploration data warehouse, for the statistical processing. The purpose of an exploration warehouse is to provide a foundation for heavy statistical analysis.
Another reason for separating the exploration warehouse from the main warehouse is that statistical processing requires a very different environment from regular analytic processing. Yet another reason is database design: for example, a convenience field is often added to streamline the statistical analysis, and the data is recast for it. The exploration warehouse is seldom a direct copy of the data found in the data warehouse.
There is a difference between the exploration warehouse and the data mining warehouse. They are similar, but the exploration warehouse is used to create assertions, hypotheses, and observations, while the data mining warehouse is used to test the strength of those hypotheses. The exploration warehouse must provide breadth of information; the data mining warehouse needs depth.
The exploration warehouse is not refreshed with the most current detailed data unlike the data warehouse. This is because the processing on the exploration warehouse requires the data to be frozen.
One other factor in the differences between the main data warehouse and the exploration warehouse is that the external data fits easily into the exploration warehouse. External results and internal results often provide interesting comparisons. On the other hand, the external data is difficult to integrate with the fine detailed structure of the warehouse.
Another topic is whether data marts and the data warehouse should share the same processor. There are several reasons to separate them: first, large processors are expensive, and moving the data marts onto separate, smaller machines brings costs down; second, the workloads are different, so they become more manageable; third, different departments can take ownership of their data marts.
There is a life cycle of data as it enters the corporation and is then used. It is first captured, then edited, then loaded into the application; the online data is integrated and passed into the warehouse, where the frequency of usage drops; it is then sent to near-line storage and subsequently retired to archival storage.
We also mention testing and the data warehouse. A test environment is a good staging ground: data changes can be tried there before being made visible to the public. The test environment needs only enough data to exercise the changes.
Tracing the flow of data through the warehouse is another practice. The data is traced as it flows through initial capture, ETL enterprise integration, data mart transformation, the exploration warehouse or data mining, and near-line storage. Tracing gives information about the data, such as which definition of revenue applies when there are multiple definitions, or what the origin or lineage of an unclear value is. One common issue with tracing is a name change or value change along the way; a less common issue is a recalculation of the unit of the data. Both the data and the metadata changes need to be considered when tracing.
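A small JavaScript sketch of the idea, with an invented revenue field: each hop appends a lineage entry so that renames and unit recalculations remain visible downstream:

// Hypothetical lineage helper: record each hop a value takes through the warehouse.
function withLineage(record, step, note) {
    record.lineage = (record.lineage || []).concat([{ step: step, note: note }]);
    return record;
}

var rec = { name: "rev_total", valueUSD: 12000 };
rec = withLineage(rec, "initial capture", "captured as 'revenue', recorded in cents");
rec = withLineage(rec, "ETL integration", "renamed to 'rev_total', recalculated from cents to dollars");
rec = withLineage(rec, "data mart transformation", "aggregated to monthly grain");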
Another factor related to tracing is velocity. The velocity of data refers to the speed at which it passes through the warehouse, from initial capture to the point of use by an analyst. Integration may hinder velocity.
Data may also be pushed from the operational environment into the data warehouse, or pulled from the warehouse by data marts; both affect its velocity.
An important application of the data warehouse is eBusiness. From the start, eBusiness systems were kept separate from the operational environment. With the data warehouse and the ODS, it is easier to interface the web with corporate information.


Monday, December 30, 2013

We talk about the differences between the relational and the star join models as the basis for data warehouse design. The first difference is in flexibility versus performance. The relational model is highly flexible but is not optimized for performance. The multidimensional model targets a specific user community and hence performs better, but when new requirements are added to a star join model, its performance optimizations become moot. The relational model carries no particular optimization one way or the other, but the data is stored at the lowest level of granularity; if performance is required, it is better to pull data out of the relational model and optimize that extract for access.
The relational model is shaped by a pure data model, while the multidimensional model is shaped by processing requirements, which has several consequences. The first is serviceability: as discussed, the relational model is more flexible and performs well when the data is accessed indirectly, while the multidimensional model is good for direct access to the data. The second is the level of abstraction and origins: the relational model organizes data at a higher level of abstraction and can support many different usages, while the dimensional model is shaped by particular processes and suits only some workloads.
The relational model supports shaping and reshaping the data into many different models. For example, merging is easy because data is stored at the most granular, normalized level. With this granularity it is easy to meet future, unknown needs, because the atomic data can be combined into more meaningful representations once the need becomes known. Yet another advantage of the relational model is its ability to change gracefully: because it is designed to be used indirectly, there is a lag between changes to the data and the direct uses of that data.
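To make the contrast concrete, here is a JavaScript sketch with invented table and column names: granular normalized rows on one side, and a pre-joined, star-style fact row shaped for a single known use on the other:

// Normalized, lowest granularity: one row per line item, easily recombined later.
var sales = [
    { orderId: 1, productId: "p1", customerId: "c1", qty: 2, amount: 40 },
    { orderId: 2, productId: "p1", customerId: "c2", qty: 1, amount: 20 }
];
var products = { p1: { category: "books" } }; // a separate normalized table

// Star-style fact rows: pre-joined and shaped for one known workload.
function toFactRow(row) {
    return { category: products[row.productId].category, amount: row.amount };
}
var factTable = sales.map(toFactRow); // fast for this use, harder to repurpose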
The multidimensional model is aligned with what is referred to as the independent data mart approach. A data mart is a data structure dedicated to serving the analytical needs of one group of people. With independent data marts, it is easier and less expensive to build systems as the organization's data grows. A dependent data mart, on the other hand, is built from data coming from the warehouse; it does not depend on legacy or operational data for its source, only on the data warehouse. The dependent data mart requires forethought and investment.
However, independent data marts, as they crop up, tend to have redundancies between the systems. With the progression of time, this data may become unincorporated, uncorrelated, and, even worse, unsynchronized; the redundancy in data grows exponentially. Yet the momentum behind independent data marts is now fully established, and this hurts the long-term perspective and maintenance. Consider the warehouse on the other hand: the data is more reusable, there are a limited number of interface programs, and the results are reconcilable. Further, the architectural problems of the independent data marts would not arise.
Hence the recommendation against independent data marts and in favor of a data warehouse, despite the challenges encountered.