Friday, December 27, 2013

We continue our discussion of external data and the data warehouse. Today we look at the role of metadata for external data. The metadata includes document ID, date of entry into the warehouse, description of the document, source, source date, classification, index words, purge date, physical location, length, and references. Associated with external data is another type of data called notification data.
This is merely a convenience that lets users subscribe to alerts when certain types of external data become available.
External data is not stored directly in the warehouse. Only its entries are made in the metadata of the data warehouse; the external data itself resides elsewhere.
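A minimal sketch of such a metadata entry, shown here in Python and assuming one record per external document; the field list follows the description above, and notify_users stands in for the notification data (the names are illustrative, not a prescribed schema):

from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class ExternalDataEntry:
    # Metadata kept in the warehouse; the document itself resides elsewhere.
    document_id: str
    date_of_entry: date
    description: str
    source: str
    source_date: date
    classification: str
    index_words: List[str]
    purge_date: Optional[date]
    physical_location: str          # e.g. a path or URL to where the document lives
    length: int
    references: List[str] = field(default_factory=list)
    # Notification data: users who asked to be alerted when this kind of data arrives.
    notify_users: List[str] = field(default_factory=list)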
One of the important design considerations for external data is that it often contains many different components, some of which are more useful than others, so they have to be divided into manageable units.
Since external data cannot be reshaped, it has no place in the data model. The best that can be done is to keep references to the external data.
Summarizations and trends of external data are often more useful than the detail and are stored with the external data. Take the Dow Jones index, for example: the daily report is important, but the trends and the monthly average can be calculated and kept together with the external data.
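A small sketch of keeping a summary alongside the external data, with made-up daily closing values standing in for the external feed; the monthly average and trend are computed once and stored with the detail:

from statistics import mean

# Hypothetical daily closes for one month of an external index feed.
daily_close = [16200.1, 16250.3, 16178.9, 16310.5, 16294.7]

external_entry = {
    "detail": daily_close,                 # the raw external data
    "monthly_average": mean(daily_close),  # summary stored together with it
    "trend": "up" if daily_close[-1] > daily_close[0] else "down",
}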
Every piece of information, external or otherwise, has a lifetime at the end of which it can be discarded or archived. The cost of keeping the metadata, on the other hand, is quite low, so it can be kept intact.
One of the advantages of external data is that it can be compared with internal data, which gives management a unique perspective. This comparison is usually done on a common key, but that common key is sometimes difficult to find. For example, the key structure of the external data may need to be converted to that of the internal data. Or the data may not be clean, requiring, for instance, that products be segregated so that apples are compared with apples and not with oranges.
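A loose sketch of that comparison on a common key, assuming the external feed keys its figures by an industry code that must first be converted to the internal product key; the mapping and the numbers are invented for illustration:

# Hypothetical key conversion from external industry codes to internal product keys.
external_to_internal = {"IND-0042": "PRD-17", "IND-0055": "PRD-23"}

external_sales = {"IND-0042": 1200, "IND-0055": 950}    # external market data
internal_sales = {"PRD-17": 1100, "PRD-23": 990}        # internal warehouse data

for ext_key, ext_value in external_sales.items():
    int_key = external_to_internal.get(ext_key)
    if int_key is None:
        continue                                        # no common key: cannot compare
    print(int_key, "internal:", internal_sales[int_key], "external:", ext_value)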
In short, we have seen that external data holds a lot of information but comes with many issues. Hence it is captured, processed, stored on different media, set up with notifications, and referenced within the warehouse. The warehouse manages and tracks the external data without the latter pervasively invading the system.

---------------
In addition to our usual post today, I want to show sample code that uses the UI Automation library on iOS.
The following script appends a sentence to a note in the Notes app six times.

var target = UIATarget.localTarget();
var app = target.frontMostApp();
var testName = 'Take Notes';
UIALogger.logStart(testName);

for (var i = 0; i < 6; i++) {
    // On the first pass create a new note; afterwards reopen the first note in the list.
    if (i == 0) {
        app.navigationBar().buttons()["Add"].tap();
    } else {
        app.mainWindow().tableViews()[0].cells()[0].tap();
    }
    var page = app.mainWindow().textFields()[0];
    var currentContent = page.getValue();
    currentContent += "The big brown bear ate the brownies in the big paper bag.";
    page.setValue(currentContent);
    UIALogger.logMessage("Added sentence : " + i);
    // Return to the list of notes.
    app.navigationBar().buttons()["Notes"].tap();
}

// Reopen the note and verify that the sentence was written.
app.mainWindow().tableViews()[0].cells()[0].tap();
if (app.mainWindow().textFields()[0].getValue().match(/bear/)) {
    UIALogger.logPass(testName);
} else {
    UIALogger.logFail(testName);
}

Thursday, December 26, 2013

We return to the book on building the data warehouse:
We will now look into external data for the data warehouse. Up to now, all the data we have discussed comes from internal systems, i.e., internal to the corporation, and has already been processed into a regularly occurring format. There are several issues that come with external data, but first let's look at why the data warehouse is the right place to store it. External data should not pervasively invade the system. It should not lose the information about its source, and that provenance is best preserved in a single centralized data warehouse. Further, if it does not come into the warehouse, it may arrive in formats and through channels that are not managed. Lastly, we want tracking of external data, which a warehouse facilitates.
On the other hand, external data is difficult to deal with for the following reasons:
External data has no set frequency of availability; it is not within the corporation's control and may even call for constant monitoring and alerts.
The other problem is that external data is totally undisciplined. It may require repeated reformatting and restructuring, which breaks every so often.
The third factor that makes external data hard to capture is its unpredictability: external data may come from practically any source at any time.
External data can be of the following types:
Records of external data collected regularly by some source
External data from random reports, articles and other sources.
There are several ways to capture and store external data. Near-line storage can help make the external data accessible without the huge cost of keeping all of it on disk. Indexes to the external data can be created and kept on disk to alleviate the traffic to the external data itself.
Another technique for handling external data is to create two stores: one that stores all the data and another that stores only a subset.
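A crude sketch of the two-store idea combined with on-disk indexing; the dictionaries below merely simulate the full (near-line) store, the subset store, and the disk-resident index, and all of the names are invented:

# Full store: everything captured, kept on cheap near-line storage (simulated here).
full_store = {}       # document_id -> document text

# Subset store: only the frequently used portion, kept readily accessible.
subset_store = {}

# Index kept on disk: lets queries avoid unnecessary trips to the near-line store.
index = {}            # keyword -> set of document_ids

def capture(document_id, text, frequently_used=False):
    full_store[document_id] = text
    if frequently_used:
        subset_store[document_id] = text
    for word in set(text.lower().split()):
        index.setdefault(word, set()).add(document_id)

def lookup(keyword):
    # Serve from the subset when possible; otherwise fall back to the full store.
    ids = index.get(keyword.lower(), set())
    return {doc_id: subset_store.get(doc_id, full_store[doc_id]) for doc_id in ids}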

I will take a short break from our readings from the book on building a data warehouse to discuss the implementation of the Matsuo algorithm. I've mentioned that the algorithm proceeds with the following steps: preprocessing, selection of frequent terms, clustering of frequent terms, calculation of the expected probability as the ratio of a term's co-occurrences with a cluster to the total terms, calculation of the chi-square value based on the differences between expected and observed frequencies, and output of the keywords.
In this post, I want to revisit the step involving clustering of frequent terms. Note that we find the term frequencies, set the cutoff at the top 30% to select the frequent terms, and then build the co-occurrence matrix between all terms and the frequent terms, where a co-occurrence count is the number of sentences in which the two terms appear together.
Now we transpose the columns of this co-occurrence matrix and find the clusters by taking terms two at a time. We take the mutual information measure between two terms x and y as log( P(x, y) / (P(x) P(y)) ), so that the log(2.0) threshold used below corresponds to P(x, y) being at least twice P(x)P(y).
We can take P(x) as the expected probability, which is the total number of terms in sentences where x appears divided by the total number of terms in the document.
We could use hierarchical clustering methods. We take terms pairwise and put them in a cluster if their mutual information is above a threshold, which we choose to be log(2.0). A term is said to co-occur with a cluster if it co-occurs with any term in the cluster.
Just as we populated the co-occurrence matrix, we take pairs of terms and cluster them. We maintain a term array containing cluster labels or numbers, starting with 1 for the first cluster and incrementing for every new cluster. If one of the terms is already in a cluster, the other term joins that cluster. If the other term already belongs to a different cluster, the clusters are merged, and the higher of the two cluster numbers is used as the merged cluster's number. When clusters are merged, all members of both clusters are reassigned to the new cluster, and so on. If a term belongs to more than one cluster, it is merged with the nearest cluster. Eventually, every term is assigned to exactly one cluster, more than one term may be assigned to the same cluster, and all clusters have distinct labels or numbers.
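Here is a minimal sketch of this clustering step, assuming the sentences have already been tokenized and the frequent terms selected; the function and variable names are my own, and the probabilities are estimated exactly as described above:

import math

def cluster_frequent_terms(sentences, frequent_terms, threshold=math.log(2.0)):
    # Greedy pairwise clustering of frequent terms by mutual information.
    total_terms = sum(len(s) for s in sentences)

    def p(term):
        # Expected probability: terms in sentences containing `term` over all terms.
        return sum(len(s) for s in sentences if term in s) / total_terms

    def p_joint(x, y):
        # Joint probability estimated from sentences containing both terms.
        return sum(len(s) for s in sentences if x in s and y in s) / total_terms

    cluster_of = {}                      # term -> cluster label
    next_label = 1
    for i, x in enumerate(frequent_terms):
        for y in frequent_terms[i + 1:]:
            pxy = p_joint(x, y)
            if pxy == 0:
                continue
            mi = math.log(pxy / (p(x) * p(y)))
            if mi <= threshold:
                continue
            cx, cy = cluster_of.get(x), cluster_of.get(y)
            if cx is None and cy is None:
                cluster_of[x] = cluster_of[y] = next_label
                next_label += 1
            elif cx is None:
                cluster_of[x] = cy
            elif cy is None:
                cluster_of[y] = cx
            elif cx != cy:
                # Merge: keep the higher label and reassign members of the other cluster.
                keep, drop = max(cx, cy), min(cx, cy)
                for term, label in cluster_of.items():
                    if label == drop:
                        cluster_of[term] = keep
    # Any frequent term left unclustered gets a singleton cluster of its own.
    for term in frequent_terms:
        if term not in cluster_of:
            cluster_of[term] = next_label
            next_label += 1
    return cluster_of

This greedy pairwise merge is a simplification of full hierarchical clustering, but it keeps the bookkeeping of cluster labels and merges exactly as described in the paragraph above.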




Wednesday, December 25, 2013

We looked at distributed data warehouses in detail in the previous post. We will look at executive information systems (EIS) and the data warehouse next. EIS was the predecessor of the data warehouse. EIS was the notion that computational information should be available to executives via appealing screens, but it was more about the presentation than the computation. When the data warehouse first appeared, it was not accepted by the EIS community, yet the warehouse provided exactly what EIS was lacking. It could be argued that EIS later morphed into business intelligence, especially in terms of key-ratio indicators, trend analysis, drill-down analysis, problem monitoring, competitive analysis, and key performance indicators. Trend analysis works on time-sequenced measurements, drill-down analysis is the slicing and dicing of the data, and key performance indicators could be such things as cash on hand, customer pipeline, length of sales cycle, collection time, product channel, and competitive products.
To support such EIS operations, there must be a data infrastructure. In fact, it has been observed that for every $1 spent on EIS software, $9 is spent on data preparation. Furthermore, the executive's analysis shifts focus every now and then, exacerbating the requirements placed on that preparation.
This is mitigated with a data warehouse. The warehouse relieves the EIS analyst from searching for the definitive source of data, creating special extract programs from existing systems, dealing with unintegrated data, compiling and linking detailed and summary data, finding an appropriate time basis of data, and constantly chasing changing business needs.
The EIS analyst can go to various levels in the warehouse to find such information: for example, the departmental level of processing, the lightly summarized level of processing, or the archival or dormant level of data. Drill-down follows this same order of levels in the warehouse.
One specific technique in the data warehouse that is useful for EIS processing is event mapping. The simplest way to show event mapping is to use a trend line, for example revenues varying over the months. Superimposing trend lines also surfaces correlated information.
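As a rough illustration of event mapping, the sketch below attaches corporate events to points on a revenue trend line; the months, figures, and events are all made up for the example:

# Hypothetical monthly revenue trend with events mapped onto it.
revenue = {"Jan": 100, "Feb": 104, "Mar": 98, "Apr": 120, "May": 131}
events = {"Mar": "plant outage", "Apr": "spring promotion launched"}

for month, value in revenue.items():
    marker = f"  <-- {events[month]}" if month in events else ""
    print(f"{month}: {'*' * (value // 10)} {value}{marker}")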
Detailed data is a prerequisite for EIS/DSS processing, but just how much detail is needed has also been discussed in this book. One argument is that there is no limit to the amount of detail desired: since the requirements can vary, and vary frequently, being prepared with as much detail as possible is important. However, there is another approach that says to keep just as much as the processing requires, based on Zeno's paradox, in which it is suggested that a rabbit cannot outrun a turtle as long as the turtle has a head start on the rabbit. The underlying considerations in these arguments are storage and processing costs, volumes of data that present a challenge to usability, and the difficulty of reusing previous analysis.
We mentioned the types of distributed data warehouses in the earlier post. We will look into them now. First, the data warehouse may be affected by many different development efforts, and these call for the data architect to manage the changes. For example, the development efforts could be based on different lines of products that are unintegrated, or the same warehouse could be distributed based on geography such as north, south, east, or west. There could also be different levels of data within the same data warehouse, such as lightly summarized data, detailed data, and OLAP, that are built by different groups. Even the detailed, non-distributed part of the same data warehouse could be built by different groups. Unintegrated lines of business pose little or no conflict, but that situation is rare in data warehouses. The second case, multiple data warehouses across locations, is more common. Different groups want to isolate their development effort, often at the risk of repeating effort across groups. The third case, building multiple levels of data simultaneously, is much easier to manage than either of the two earlier cases; however, due to the difference in levels, there are different uses and expectations. The fourth case requires the most attention from the data warehouse architect, because the non-distributed warehouse is being built like slices of a pie that all need to come together. The architect is responsible for consistency and growth of the data warehouse. Another approach, different from the above, is to develop completely independent warehouses integrated by a minimal common corporate warehouse. The corporate warehouse also requires metadata just like the others, but it is simply not tied to the others since there is no business integration.
Now we will look at building the warehouse in multiple levels. Here the main tool for the architect is the set of definitions for interconnectivity between the levels. The primary data model is driven by the group that is building the current level of detail, because they get to use the data warehouse data model. Interconnectivity addresses compatibility of access at the call level, and the current level of detail must be sufficient to generate the lightly summarized data. The coordination proceeds with agreements, either informal or formal, that are time-sequenced so that no team is waiting for data that has not been made available yet.
In the case where there are multiple groups forming the current level of detail, if the data is mutually exclusive between the groups, then little or no coordination is needed. Otherwise the overlap creates redundancy and cost. Further, this redundancy introduces inconsistencies and the so-called spider web into the DSS environment. Therefore, a common data model needs to be established. With the separation into a common data model versus local data models, there is no redundancy across the local groups.
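As a loose sketch of that separation, with field names invented purely for illustration, the common data model holds the attributes every group agrees on while each local group extends it with its own attributes instead of redefining the shared ones:

from dataclasses import dataclass

@dataclass
class CommonCustomer:
    # Attributes agreed on corporately and defined once in the common data model.
    customer_id: str
    name: str
    effective_date: str

@dataclass
class EastRegionCustomer(CommonCustomer):
    # Local attribute owned by one group; it extends, not duplicates, the common model.
    regional_sales_rep: str

@dataclass
class WestRegionCustomer(CommonCustomer):
    distribution_center: str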
The same technology can be used for all the data models. Another strategy is to use different platforms for the different types of data found at the detailed level. In that case there are data transfers between platforms, and the boundary between technologies may be crossed many times. In all these cases, metadata sits on top of all this data.

Tuesday, December 24, 2013

After the data warehouse is built, the focus shifts from building the warehouse to day-to-day operations. It turns out that the cost of operating and maintaining a warehouse is high, and the volume of data keeps increasing. Usage by DSS applications may cause contention for resources, but the biggest expense comes from the periodic refreshes of legacy data.
In order to refresh the data, it must be read from old legacy databases. Repeated and direct reads of old legacy databases are costly. First, the legacy DBMS must be active and online during the read process, which must occur within a window of time, and stretching that window is not welcome. Moreover, the refreshment scan must process the entire legacy file when only a small percentage of it is required.
A mitigation to such repeated reads is to trap the data as it is being updated. Now the trapped data can be processed offline.
Data can be trapped either by data replication or by change data capture, and both techniques are effective. The first requires the data of interest to be identified prior to update; a trigger then causes the update activity to be trapped. The content and format of the trapped data are clean.
The second approach is change data capture. This approach captures all the changes that have occurred, say, in a day by reading the log or journal tape. While that can be cumbersome, it happens offline and captures all updates.
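A minimal sketch of the two trapping techniques; the account table, the journal file, and its line format are all assumptions made up for illustration:

import json

# Technique 1: data replication -- trap the change at update time, trigger-style.
trapped_changes = []

def update_account(accounts, account_id, new_balance):
    # The update path itself records the before/after image for the offline refresh.
    old_balance = accounts[account_id]
    trapped_changes.append({"id": account_id, "old": old_balance, "new": new_balance})
    accounts[account_id] = new_balance

# Technique 2: change data capture -- read the day's journal offline and replay it.
def changes_from_journal(journal_path):
    with open(journal_path) as journal:
        for line in journal:
            record = json.loads(line)        # one logged update per line (assumed format)
            if record.get("op") == "update":
                yield record["id"], record["new"]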
As with any production database environment, data warehouse testing seems to call for two parallel environments to be set up, one for staging and one for production. The staging environment is where programmers test out new programs.
However, due to the size of the warehouse, providing resources for even one of these environments is hard at times. Moreover, programs are run against the warehouse in a heuristic manner, not in the repetitive manner of conventional databases. If there is an issue with the data in the warehouse, the program is simply redone.
As a summary, data warehouses must support:
1) robust language interface
2) compound keys and variable length data
3) manage large amounts of data
4) easily index and monitor data
5) Interface with a wide number of technologies
6) allow the data to be placed directly on disk
7) store and access data in parallel
8) have metadata control in the warehouse
9) efficiently load the warehouse
10) efficiently use the indexes
11) store data in a compact way
12) support compound keys
13) selectively turn off the lock manager
14) do index-only processing
15) quickly restore from bulk-storage

We also discussed that multidimensional OLAP technology is suited for data mart processing rather than data warehouse processing, and that metadata plays a different role in the data warehouse.
Most organizations build a single centralized data warehouse environment.

Data warehouses are not always centralized, however. Businesses are distributed geographically or over multiple, differing product lines.
Distributed data warehouses are of three different types:
1) local and global data warehouses, where the former represents data and processing at a remote site and the latter represents the integration across the business
2) the technologically distributed data warehouse, where the same single logical data warehouse is spread out over different physical data warehouses
3) the independently evolving distributed data warehouse, where one warehouse comes up and then another, in an uncoordinated manner.
 

Monday, December 23, 2013

The data warehouse has special needs, so DBMSs have also evolved to include data-warehouse-specific features. Instead of transaction processing, the data warehouse calls for load-and-access processing. Data is integrated, transformed, and loaded into the warehouse from the operational environments and the ODS. The data warehouse does not have updates. Instead, data is stored in a series of snapshot records: when there is a change, a new snapshot record is added rather than an update being done.
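A minimal sketch of the snapshot idea, assuming a hypothetical customer-balance table keyed by customer and snapshot date; rather than updating a row in place, each change appends a new snapshot record:

from datetime import date

# Each change appends a snapshot record instead of updating an existing row.
balance_snapshots = []

def record_balance(customer_id, balance, as_of=None):
    balance_snapshots.append({
        "customer_id": customer_id,
        "snapshot_date": as_of or date.today(),
        "balance": balance,
    })

def balance_as_of(customer_id, as_of):
    # History is preserved, so the balance at any point in time can be reconstructed.
    rows = [r for r in balance_snapshots
            if r["customer_id"] == customer_id and r["snapshot_date"] <= as_of]
    return max(rows, key=lambda r: r["snapshot_date"])["balance"] if rows else None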
A data warehouse requires different storage media, often used in a dual environment: one processing environment is the DASD environment, where online interactive processing is done, and the other is the tape or mass store, where data is accumulated.
The DASD environment could use a different vendor from the mass store environment. It's also possible that the DASD environment is split over more than one vendor. As long as these splits are deliberate rather than political or historical, they should not pose issues.
When comparing the operational environment to the data warehouse environment, the role of metadata is also very different. In the operational environment, the audience for the metadata is the IT professional, who is usually savvy about computers, whereas in the data warehouse the metadata is used by the DSS analyst, who could do with as much help as possible, hence the need for metadata services.
In the data warehouse, the metadata tracks the significant transformations applied to the data as it passes from the operational environment to the data warehouse environment.
Moreover, data in a data warehouse exists for a lengthy span of time, and the warehouse changes its structure over that time. Keeping track of this change in structure is a natural task for the metadata. In the operational environment, by contrast, the metadata keeps track of one and only one correct definition of the structure of the data.
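A minimal sketch of metadata that tracks structure over time, with invented table and column names; each version of the structure carries the date range for which it was in effect:

# Hypothetical metadata entries recording how a table's structure changed over time.
structure_history = [
    {"table": "sales", "columns": ["order_id", "amount"],
     "effective_from": "1998-01-01", "effective_to": "2001-06-30"},
    {"table": "sales", "columns": ["order_id", "amount", "channel"],
     "effective_from": "2001-07-01", "effective_to": None},   # None means still current
]

def structure_as_of(table, as_of):
    # Return the column definition that was in effect on a given date (ISO date strings).
    for version in structure_history:
        if (version["table"] == table
                and version["effective_from"] <= as_of
                and (version["effective_to"] is None or as_of <= version["effective_to"])):
            return version["columns"]
    return None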
Also, operational systems focus on current data, as in the current balance or the current inventory; however, trends are not available from current data alone. They become apparent only when data is accumulated over time. For the span of time over which this data is accumulated, a new dimension called data context becomes important. Data context is all the contextual information, such as changes in the data source, that may explain discrepancies in trends computed from the data accumulated over time.
Three levels of context information must be maintained: simple, complex, and external.
Simple contextual information relates to the basic structure of the data itself and includes structure, encoding, naming conventions, and metrics. Complex contextual information includes such things as product definitions, marketing territories, pricing, and packaging. External contextual information includes such things as inflation, financial trends, taxation, and economic growth. Complex and external contextual information can vary a lot and are harder to keep in a structured form.
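A rough sketch of how the three levels of contextual information might be recorded alongside the accumulated data; the contents are invented purely for illustration:

# Hypothetical contextual information kept alongside accumulated warehouse data.
context = {
    "simple": {          # basic structure of the data itself
        "table": "monthly_sales",
        "encoding": {"currency": "USD"},
        "metrics": ["units_sold", "revenue"],
    },
    "complex": {         # business definitions in effect at the time
        "product_definition": "pre-2001 packaging, 12-unit case",
        "marketing_territory": "Northeast",
    },
    "external": {        # conditions outside the corporation
        "inflation_rate": 0.03,
        "economic_note": "mild recession in the comparison year",
    },
}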
In the past, collecting contextual information has been difficult because the audience was different, the interest was passive, and the resources had vanished; even when contextual information was collected, it was restricted to the simple kind.