Saturday, December 28, 2013

When discussing the migration of data to the architected environment during the build phase of the warehouse, it is important to discuss the feedback loop. First the data warehouse is populated from existing environments. Then the DSS analyst uses it for analysis and gives new requirements to the architect. The architect works on the existing systems to meet the requirements coming from the warehouse. This is the feedback loop.
The loop requires some data to be available in the warehouse for the DSS analyst to come up with new requirements. The shorter the loop, the more successful the warehouse; the larger the volume of data, the longer the loop becomes.
The architect owns the data model. All feedback is translated as changes to this data model which is then used together with the existing system of record to improve the warehouse.
The architect also has to keep track of the agents of change, namely the aging of systems, the aging of technology, organizational upheaval, and drastically changed requirements.
These agents of change allow the architect to coordinate the efforts to rebuild the operational systems. This works even better than changing the data model.
The steps taken to restructure the operational environment are as follows:
identify how the data model differs from the existing systems,
study the impact of each delta item,
estimate the cost of fixing each delta item,
and prepare a report to management on what needs to be fixed, the estimate of resources, the order of work, and the disruption analysis. (A sketch of such a report structure follows.)
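
As a sketch only, with field names of my own choosing, the delta items and the management report could be recorded like this:

from dataclasses import dataclass

@dataclass
class DeltaItem:
    subject: str          # subject area in the data model
    difference: str       # how the existing systems differ from the model
    impact: str           # impact of the delta item
    cost_estimate: float  # estimated cost to fix the delta item
    order: int            # suggested order of work
    disruption: str       # disruption analysis notes

def management_report(items):
    """Summarize the delta items for management in the suggested order of work."""
    lines = []
    for item in sorted(items, key=lambda d: d.order):
        lines.append(f"{item.order}. [{item.subject}] {item.difference} "
                     f"(impact: {item.impact}; est. cost: {item.cost_estimate}; "
                     f"disruption: {item.disruption})")
    return "\n".join(lines)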
The methodology used for data warehouse development is called the spiral development methodology. It differs from the migration path described above: the migration path describes general activities, while the methodology describes the specific activities, their deliverables, and the order of the activities. Together, the methodology and the migration path cover all that is required for building the warehouse.
A word of caution about methodologies: there is an industry track record of even the most rationally appealing methodologies failing during implementation.
The reasons include the following: they describe a prescribed set of activities that often ignores iteration; they describe how to do something but not what to do; they ignore the scope and the detail of the work involved; they mix project management with development and design; they don't differentiate operational from DSS systems; and they don't include checkpoints and stops on failure.
In this sense the spiral methodology is different from the others in that it proceeds down the same path repeatedly, as opposed to, say, the waterfall methodology. Further, it is said to be data driven.

Friday, December 27, 2013

Since we discussed building the data warehouse, we will now discuss the migration to this environment and the ease of use. We mentioned earlier that this is a step-by-step activity. Most successful implementations have been built one iteration at a time. The size and the speed of the iterative development are important, and the results must be delivered quickly.
But migration is different from iterative development. Many migrations have failed, and hence the book describes a generic migration plan that has worked in many cases.
This migration plan begins with a corporate data model. This data model includes major subjects, their definitions, relationships, and groupings.
Note that this model excludes the derived data or DSS data because they are subject to change and are temporal.
After the corporate data model and the mid-level models are in place, the next activity is defining the system of record. The system of record is defined in terms of the corporation's existing systems. It represents the best data in terms of completeness, timeliness, and accuracy.
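
As a rough illustration (the subject areas and system names below are hypothetical), the system of record can be captured as a simple mapping from each subject in the corporate data model to the existing system judged to hold the best data:

# Hypothetical mapping from subjects in the corporate data model to the
# operational system chosen as the system of record for each subject.
system_of_record = {
    "customer": {"source_system": "crm_master",  "why": "most complete and most current"},
    "product":  {"source_system": "erp_catalog", "why": "most accurate attribute values"},
    "order":    {"source_system": "order_entry", "why": "authoritative and timely history"},
}

def source_for(subject):
    """Return the system of record for a subject area, if one has been designated."""
    entry = system_of_record.get(subject)
    return entry["source_system"] if entry else None

print(source_for("customer"))   # crm_master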
Next the challenges in bringing the data into the data warehouse are identified.
These could include the following (a transformation sketch follows the list):
if the data arrives in large volumes, the arrival rate has to be accounted for when handling the influx of data;
data that is voluminous also has to be summarized, aggregated, or partitioned;
data may have to be extracted using the interfaces, and special transformations that include integration, alteration of the time basis, condensation, and scanning may need to be done.
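
Here is a minimal sketch of such a transformation step, combining integration of the key structure, alteration of the time basis, and condensation; the field names are hypothetical:

from collections import defaultdict
from datetime import datetime

def extract_and_transform(operational_rows):
    """Illustrative transformation: integrate keys, alter the time basis from
    transaction timestamps to a daily snapshot, and condense by summing amounts."""
    daily_totals = defaultdict(float)
    for row in operational_rows:
        # integration: normalize the key structure of the operational system
        customer_key = row["cust_id"].strip().upper()
        # alteration of time basis: transaction timestamp -> snapshot date
        snapshot_date = datetime.fromisoformat(row["txn_ts"]).date()
        # condensation: summarize voluminous detail into a daily total
        daily_totals[(customer_key, snapshot_date)] += row["amount"]
    return [
        {"customer_key": key, "snapshot_date": day, "total_amount": total}
        for (key, day), total in daily_totals.items()
    ]

rows = [
    {"cust_id": " c001 ", "txn_ts": "2013-12-27T09:15:00", "amount": 120.0},
    {"cust_id": "C001",   "txn_ts": "2013-12-27T17:40:00", "amount": 80.0},
]
print(extract_and_transform(rows))   # one condensed row for C001 on 2013-12-27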
If the data modeling activity has been done properly, the design of the data warehouse is fairly simple. When the design is complete, the warehouse is organized by subject area.
Once the interfaces have been built, the first subject area is populated. The data is captured and transported into the warehouse environment. Sometimes only a fraction of data is required.
End users work by discovering the interfaces and then using them to populate the data.
Often the first population of the data is incorrect, but it reveals many of the corrections needed to populate it correctly going forward.
The population and the feedback processes continue afterwards.
By adding a lag time, the data warehouse developer serves the DSS needs of the company. This lag time is usually 24 hours.
Sometimes there is a call for rapid population of the warehouse; this is referred to as active data warehousing, since it supports a small amount of online access processing.
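
A minimal sketch of the 24-hour lag, with a hypothetical captured_at field on the operational rows; for active data warehousing the lag would simply shrink toward near real time:

from datetime import datetime, timedelta

LAG = timedelta(hours=24)  # typical lag between operational capture and warehouse load

def rows_ready_for_load(operational_rows, now=None):
    """Select only rows captured more than LAG ago, so the warehouse remains a
    stable, time-lagged reflection of the operational environment."""
    now = now or datetime.now()
    cutoff = now - LAG
    return [row for row in operational_rows if row["captured_at"] <= cutoff]

rows = [
    {"id": 1, "captured_at": datetime(2013, 12, 26, 9, 0)},
    {"id": 2, "captured_at": datetime(2013, 12, 27, 18, 0)},
]
print(rows_ready_for_load(rows, now=datetime(2013, 12, 27, 20, 0)))  # only row 1 qualifies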
We continue our discussion on external data and the data warehouse. Today we look at the role of metadata for external data. The metadata includes the document ID, date of entry into the warehouse, description of the document, source, source date, classification, index words, purge date, physical location, length, and references. Associated with this metadata is another type of data called notification data.
This is merely a convenience for users to subscribe to alerts when certain types of external data become available.
External data is not directly stored in the warehouse. Its entries are made in the metadata of the data warehouse and the external data resides elsewhere.
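
A sketch of what such a metadata entry might look like in code, using the fields listed above (the class and field names are my own):

from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExternalDocMetadata:
    document_id: str
    date_of_entry: date        # when the entry was made in the warehouse
    description: str
    source: str
    source_date: date
    classification: str
    index_words: list
    purge_date: date
    physical_location: str     # where the document actually resides, outside the warehouse
    length: int
    references: list = field(default_factory=list)

@dataclass
class NotificationSubscription:
    user: str
    classification_of_interest: str   # alert the user when matching external data arrives

The physical_location field is what lets the warehouse point to external data that actually resides elsewhere.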
One of the important design considerations for external data is that it often contains many different components, some of which are more useful than others, so they have to be divided into manageable units.
Since external data cannot be reshaped, it has no place in the data model. The best that can be done here seems to be to keep references to the external data.
Summarization and trends of external data are often more useful than the raw data and are stored with it. Take the Dow Jones index, for example: the daily report is important, but the trends and the monthly average can be calculated and put together with the external data.
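
For instance, condensing daily index values into monthly averages is a simple aggregation; the numbers below are purely illustrative, not real Dow Jones values:

from collections import defaultdict

# Illustrative daily index values (not real market data).
daily_close = {
    "2013-11-29": 100.0,
    "2013-12-02": 102.0,
    "2013-12-03": 101.0,
    "2013-12-30": 105.0,
}

def monthly_averages(daily):
    """Condense daily external data into monthly averages stored alongside it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for day, value in daily.items():
        month = day[:7]                      # "YYYY-MM"
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sums}

print(monthly_averages(daily_close))   # {'2013-11': 100.0, '2013-12': 102.66...}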
Every piece of information, external or otherwise, has a lifetime at the end of which it can be discarded or archived. On the other hand, the cost of the metadata is pretty low, so that can be kept intact.
One of the advantages of using external data is that it can be compared with internal data. The comparison allows management a unique perspective. This comparison is usually done on a common key, but that common key is sometimes difficult to find. For example, the key structure of the external data may need to be converted to that of the internal data. Or the data may need to be cleaned, such as segregating products so that the comparison is apples to apples and not apples to oranges.
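
A small sketch of the key-conversion problem, with hypothetical key formats on both sides:

def to_internal_key(external_key):
    """Hypothetical conversion: the external source keys products as 'ABC-0042',
    while the internal systems use 'P00042'."""
    prefix, number = external_key.split("-")
    return "P" + number.zfill(5)

internal_sales = {"P00042": 1200, "P00077": 300}          # internal data, internal keys
external_share = {"ABC-0042": 0.15, "ABC-0077": 0.02}     # external data, its own keys

comparison = {
    to_internal_key(key): (internal_sales.get(to_internal_key(key)), share)
    for key, share in external_share.items()
}
print(comparison)   # {'P00042': (1200, 0.15), 'P00077': (300, 0.02)}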
In short, we have seen that external data holds a lot of information but comes with many issues to deal with. Hence it is captured, processed, stored on different media, set up with notifications, and referenced within the warehouse. The warehouse manages and tracks the external data without the latter pervasively invading the system.

---------------
In addition to our usual post today, I want to show sample code for using the UI automation library with iOS:
The following code tries to add a sentence six times to the Notes app.

var target = UIATarget.localTarget();
var app = target.frontMostApp();
var testName = 'Take Notes';
UIALogger.logStart(testName);

for (var i = 0; i < 6; i++) {
    // The first pass creates a new note; later passes reopen the first note in the list.
    if (i == 0) {
        app.navigationBar().buttons()["Add"].tap();
    } else {
        app.mainWindow().tableViews()[0].cells()[0].tap();
    }

    // Append the test sentence to whatever the note already contains.
    var page = app.mainWindow().textFields()[0];
    var currentContent = page.getValue();
    currentContent += "The big brown bear ate the brownies in the big paper bag.";
    page.setValue(currentContent);
    UIALogger.logMessage("Added sentence : " + i);

    // Return to the notes list before the next iteration.
    app.navigationBar().buttons()["Notes"].tap();
}

// Reopen the note and verify that the sentence was actually written.
app.mainWindow().tableViews()[0].cells()[0].tap();
if (app.mainWindow().textFields()[0].getValue().match(/bear/g)) {
    UIALogger.logPass(testName);
} else {
    UIALogger.logFail(testName);
}

Thursday, December 26, 2013

We return to the book on building the data warehouse:
We will now look into external data for the data warehouse. Up to now, all the data we have discussed comes from internal systems, i.e. systems internal to the corporation, and it has already been processed into a regularly occurring format. There are several issues that come with external data, but first let's look at why the data warehouse is the right place to store it. The external data should not invade the system. It should not lose the information about its source, and this is best ensured in a single centralized data warehouse. Further, if it doesn't come to the warehouse, it could come in through formats and channels that are not managed. Lastly, we want tracking of external data, which is facilitated by a warehouse.
On the other hand, external data is difficult to deal with for the following reasons:
External data doesn't have a set frequency of availability; it's not within the corporation's control. It may even call for constant monitoring and alerts.
The other problem is that external data is totally undisciplined. Considerable reformatting and restructuring may be required, and it breaks every so often.
The third factor that makes the external data hard to capture is its unpredictability. External data may come from practically any source at any time.
External data can be of the following types:
Records of external data collected regularly by some source
External data from random reports, articles and other sources.
There are several ways to capture and store external data. Near-line storage can help make external data accessible without the huge cost of keeping it all on disk. Indexes can be created on the external data and kept on disk to alleviate the traffic to the external store.
Another technique for handling external data is to create two stores - one to store all the data and another to store only a subset.
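
A rough sketch of this arrangement, with hypothetical paths: the full external collection lives on near-line storage, a frequently used subset lives on disk, and an on-disk index answers most lookups without touching the near-line store:

# Hypothetical locations for the two stores.
full_store_path = "/nearline/external/"     # complete external data: slow but cheap
subset_store_path = "/disk/external_hot/"   # frequently referenced subset: fast access

# Index kept on disk so most requests never touch the near-line store.
external_index = {
    "doc-001": {"keywords": ["competitor", "pricing"], "location": full_store_path + "doc-001.pdf"},
    "doc-002": {"keywords": ["market", "trend"],       "location": subset_store_path + "doc-002.pdf"},
}

def locate(keyword):
    """Answer lookups from the on-disk index, returning matching document IDs."""
    return [doc_id for doc_id, entry in external_index.items()
            if keyword in entry["keywords"]]

print(locate("trend"))   # ['doc-002']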

I will take a short break from our readings from the book on building a data warehouse to discuss the implementation of the Matsuo algorithm. I've mentioned that the algorithm proceeds with the following steps: preprocessing, selection of frequent terms, clustering of frequent terms, calculation of the expected probability as the ratio of term co-occurrences with a cluster to the total terms, calculation of the chi-square value based on the differences between expected and observed frequencies, and output of the keywords.
In this post, I want to revisit the step involving clustering of frequent terms. Note that we find the term frequencies, set the cutoff to the top 30% to find the frequent terms, and then build the co-occurrence matrix between all terms and the frequent terms, where the co-occurrences are counted as the number of sentences in which the two terms appear together.
Now we transpose the columns of this co-occurrence matrix and find the clusters taking two terms at a time. We find the mutual information measure between two terms x and y as log( P(x,y) / ( P(x) P(y) ) ), which is what makes the log(2.0) threshold below meaningful.
We can take P(x) as the expected probability, which is the total number of terms in the sentences where x appears divided by the total number of terms in the document.
We could use hierarchical clustering methods: we take terms pairwise and put them in a cluster if their mutual information is above a threshold, which we choose to be log(2.0). A term is said to co-occur with a cluster if it co-occurs with any term in the cluster.
Just like we populated the co-occurrence matrix, we take pairs of terms and cluster them. We maintain a term array containing cluster labels or numbers, where we start with 1 for the first cluster and increment for every new cluster. If one of the terms is already in a cluster, the other term joins that cluster. If the other term belongs to another cluster, the clusters are merged, and the higher of the two cluster numbers is used as the merged cluster number. When the clusters are merged, all members of both clusters are reassigned to this new cluster, and so on. If a term belongs to more than one cluster, it is merged with the nearest cluster. Eventually, all the terms will be assigned to one cluster each; more than one term may be assigned to a single cluster, and all clusters will have distinct labels or numbers.
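
Here is a sketch of that clustering step in Python under the definitions above (sentences are lists of terms; the helper names are mine). It computes mutual information as log(P(x,y)/(P(x)P(y))) from sentence co-occurrence and merges pairs whose value exceeds log(2.0), keeping the higher cluster number on a merge:

import math
from collections import defaultdict
from itertools import combinations

def cluster_frequent_terms(sentences, freq_cutoff=0.3, threshold=math.log(2.0)):
    """Cluster the top frequent terms by pairwise mutual information computed
    from sentence co-occurrence, merging clusters as pairs qualify."""
    # term frequencies and the frequent-term set (top 30% by frequency)
    term_freq = defaultdict(int)
    for sentence in sentences:
        for term in sentence:
            term_freq[term] += 1
    ranked = sorted(term_freq, key=term_freq.get, reverse=True)
    frequent = ranked[:max(1, int(len(ranked) * freq_cutoff))]
    total_terms = sum(term_freq.values())

    def p(term):
        # expected probability: terms in sentences containing `term` / total terms
        return sum(len(s) for s in sentences if term in s) / total_terms

    def p_joint(x, y):
        # analogous measure over sentences where both terms appear
        return sum(len(s) for s in sentences if x in s and y in s) / total_terms

    label = {}        # term -> cluster number
    next_label = 1
    for x, y in combinations(frequent, 2):
        pxy = p_joint(x, y)
        if pxy == 0:
            continue
        mutual_info = math.log(pxy / (p(x) * p(y)))
        if mutual_info <= threshold:
            continue
        if x not in label and y not in label:        # start a new cluster
            label[x] = label[y] = next_label
            next_label += 1
        elif x in label and y not in label:          # y joins x's cluster
            label[y] = label[x]
        elif y in label and x not in label:          # x joins y's cluster
            label[x] = label[y]
        elif label[x] != label[y]:                   # merge, keeping the higher number
            keep, drop = max(label[x], label[y]), min(label[x], label[y])
            for term, cluster in list(label.items()):
                if cluster == drop:
                    label[term] = keep
    for term in frequent:                            # unpaired terms get their own cluster
        if term not in label:
            label[term] = next_label
            next_label += 1
    return label

sentences = [["data", "warehouse"],
             ["data", "warehouse"],
             ["keyword", "extraction", "clustering"],
             ["keyword", "clustering", "terms"],
             ["summary", "trend", "analysis"]]
print(cluster_frequent_terms(sentences))   # {'data': 1, 'warehouse': 1}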




Wednesday, December 25, 2013

We looked at distributed data warehouses in detail in the previous post. We will look at executive information systems (EIS) and the data warehouse next. EIS was the predecessor of the data warehouse. EIS was the notion that computational information should be available to executives via appealing screens, but it was more about the presentation than the computation. When the data warehouse first appeared, it was not accepted by the EIS community, but the warehouse provided exactly what EIS had been lacking. EIS, it could be argued, later morphed into business intelligence, especially in terms of key-ratio indicators, trend analysis, drill-down analysis, problem monitoring, competitive analysis, and key performance indicators. Trend analysis works on time-sequenced measurements, drill-down analysis is the slicing and dicing of the data, and key performance indicators could be such things as cash on hand, customer pipeline, length of the sales cycle, collection time, product channel, and competitive products.
To support such EIS operations, there must be a data infrastructure. In fact, it has been seen that for every $1 spent on EIS software, $9 is spent on data preparation. Furthermore, the executive analysis shifts focus every now and then, exacerbating the data preparation requirements.
This is mitigated with a data warehouse. The warehouse takes care of searching for the definitive source of data, creating special extract programs from existing systems, dealing with unintegrated data, compiling and linking detailed and summary data, finding an appropriate time basis for the data, and coping with changing business needs.
The EIS analyst can go to various levels in the warehouse to find such information. For example, the departmental level of processing, the lightly summarized level of processing, or the archival or dormant level of data. The drill down for the data is also in this order of levels in the warehouse.
One specific technique in the data warehouse that is useful for EIS processing is event mapping. The simplest way to show event mapping is to use a trend line, for example revenues varying over the months. Superimposing trend lines also reveals correlated information.
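
A toy illustration of event mapping: superimpose known corporate events on a monthly revenue trend line (all numbers and events below are made up):

# Illustrative only: a monthly revenue trend with a hypothetical event superimposed.
revenue_by_month = {"2013-07": 90, "2013-08": 95, "2013-09": 94,
                    "2013-10": 110, "2013-11": 118, "2013-12": 121}
events = {"2013-10": "new product channel opened"}   # hypothetical event

for month, revenue in sorted(revenue_by_month.items()):
    marker = "  <-- " + events[month] if month in events else ""
    # crude text trend line: one '*' per 10 units of revenue
    print(f"{month}  {'*' * (revenue // 10)} {revenue}{marker}")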
Detailed data is a prerequisite for EIS/DSS processing, but just how much detail is needed has also been discussed in this book. One argument is that there is no limit to the amount of detail desired: since the requirements can vary, and vary frequently, being prepared with as much detail as necessary is important. Another approach says to keep just as much detail as the processing requires, and it invokes Zeno's paradox, in which it is suggested that a rabbit cannot outrun a turtle as long as the turtle has a head start on the rabbit. The underlying considerations in these arguments are storage and processing costs, volumes that present a challenge to readability, and the difficulty of reusing previous analysis.
We mentioned the types of distributed data warehouses in the earlier post. We will look into them now. First, the data warehouse may be affected by many different development efforts, and these call for the data architect to manage the changes. For example, the development efforts could be based on different lines of products which are unintegrated, or the same warehouse could be distributed based on geography such as north, south, east, or west. There could also be different levels of data within the same data warehouse, such as lightly summarized data, detailed data, and OLAP, that are built by different groups. Even the detailed, non-distributed part of the same data warehouse could be built by different groups. The unintegrated lines of business pose little or no conflict; however, that situation is rare in data warehouses. The second case, of multiple data warehouses across locations, is more common. Different groups want to isolate their development effort, often at the risk of repeating efforts across groups. The third case, of building multiple levels of data simultaneously, is much easier to manage than either of the two earlier cases. However, due to the difference in levels, there are different uses and expectations. The fourth case requires the most attention from the data warehouse architect. This is because the non-distributed warehouse is being built like slices of a pie, and they all need to come together. The architect is responsible for the consistency and growth of the data warehouse. Another approach, different from the above, is to develop completely independent warehouses integrated by a minimal common corporate warehouse. The corporate warehouse also requires metadata just like the others, but it is simply not tied to the others since there is no business integration.
Now we will look at building the warehouse at multiple levels. Here the main tool for the architect is the set of definitions for interconnectivity between the levels. The primary data model is driven by the group that is building the current level of detail, because they get to use the data warehouse data model. Interconnectivity addresses compatibility of access at the call level, and the current level of detail must be sufficient to generate the lightly summarized data. The coordination proceeds with agreements, either informal or formal, that are time-sequenced such that no team is left waiting for data that has not yet been made available.
In the case where there are multiple groups forming current level of detail, if the data is mutually exclusive between the groups, then there is little or no coordination. Otherwise the overlap can create redundancies and cost. Further, this redundancy introduces inconsistencies and the so-called spider web into the DSS environment. Therefore, a common data model needs to be established.  With the separation of a common versus local data models, there is no redundancy across the local groups.
The same technology can be used for all the data models. Another strategy could be to use different platforms for the different types of data found at the detailed level. There are data transfers between platforms, and the boundary between technologies may be crossed many times. In all these cases, metadata sits on top of all this data.