Sunday, December 29, 2013

In this post, we will discuss the really large data warehouse. As we have seen, data warehouses grow at a rapid rate, from a few gigabytes to terabytes in a few years. This is primarily due to the following contributors:
1) history - nowhere else in the system do we keep historical data the way we do in the warehouse. Application developers are quick to remove historical data because it affects performance.
2) granularity - the data warehouse collects data at the most granular level. This lets developers look at the data like no one could before.
3) integration - the data warehouse brings together data from different operational systems. It is the database for the entire enterprise, with a common data model, common keys, common reference tables, and common definitions.
These are referred to in the book as:
Historical data x Diverse data x Detailed data = Lots of data
The impact of this data is threefold: 1) the associated costs, 2) the value in terms of usefulness of the data, and 3) the maintenance in terms of the rules of data management.
Basic data management activities include loading the data, creating indexes, and so on.
The real costs of storage include the costs of the processor (DBMS plus operating system) software, the communications controller, the disk controller, and the data store itself.
As data grows, the percentage of data that is actually used decreases. For example, at 50 GB almost all the data is used; at 500 GB only 75% is used; at 2 TB, 35% is used; and at 10 TB, only 5% is used. This is important to understand because it enables efficient planning of storage technology.
To calculate this usage of data, we first determine the size of the data access. We calculate this as
Number of users x Queries per day x Bytes per query x 200 working days
We can also divide by an overlap factor that accounts for the same data being pulled more than once.
Typical numbers are 250 GB of data servicing 600 users.
Next, the usage of data is calculated as a ratio:
Usage ratio = Actual bytes used / Total data warehouse bytes
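A minimal sketch of this arithmetic (the query counts, bytes per query, and overlap factor below are illustrative assumptions, not figures from the book):

// Rough sketch of the data-access and usage-ratio arithmetic described above.
// All input figures are illustrative assumptions.
function estimateUsage(users, queriesPerDay, bytesPerQuery, workingDays, overlapFactor, totalWarehouseBytes) {
    // size of the data access over a year of working days
    var bytesAccessed = users * queriesPerDay * bytesPerQuery * workingDays;
    // the overlap factor accounts for the same data being pulled more than once
    var actualBytesUsed = bytesAccessed / overlapFactor;
    // usage ratio = actual bytes used / total data warehouse bytes
    return actualBytesUsed / totalWarehouseBytes;
}

// Example: 600 users, 5 queries per day, 1 MB per query, 200 working days,
// an overlap factor of 3, against a 250 GB warehouse.
var ratio = estimateUsage(600, 5, 1e6, 200, 3, 250e9);
console.log("usage ratio: " + ratio.toFixed(2));   // prints 0.80 for these assumed inputs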
Note that the separation of frequently used and infrequently used data differs from system to system. In an OLTP environment, access is random and the two are interspersed. In a DSS environment, there is a much sharper division between frequently and infrequently used data.


Saturday, December 28, 2013

We now look at unstructured data and the warehouse. Combining structured data and unstructured data is like integrating two different worlds. CRM is an example: in CRM, we have structured data corresponding to the customer demographics and unstructured data corresponding to the documents and communications from the customer.
Text is common to both worlds, and with it we match the communications to the customer. However, text is susceptible to misspellings, mismatched context, identical names, nicknames, incomplete names, and word stems.
Using stop words and stemming, we can do a probabilistic match where all the data that intersects is used to make a match. The strength of the match can be depicted on a numeric scale.
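A minimal sketch of such a probabilistic match, assuming a simple stop-word list and a crude suffix-stripping stemmer of my own (a real implementation would use a proper stemmer):

// Sketch: strip stop words, stem crudely, and score the overlap between
// a piece of structured text and a piece of unstructured text on a 0..1 scale.
var STOP_WORDS = { "the": true, "a": true, "an": true, "and": true, "of": true, "to": true };

function stem(word) {
    // crude suffix stripping, standing in for a real stemmer
    return word.toLowerCase().replace(/(ing|ed|es|s)$/, "");
}

function terms(text) {
    return text.split(/\W+/)
               .filter(function (w) { return w && !STOP_WORDS[w.toLowerCase()]; })
               .map(stem);
}

function matchStrength(structuredText, unstructuredText) {
    var a = terms(structuredText), b = terms(unstructuredText);
    var inB = {};
    b.forEach(function (t) { inB[t] = true; });
    var hits = a.filter(function (t) { return inB[t]; }).length;
    // strength of the match: shared terms divided by the terms in the shorter text
    return hits / Math.min(a.length, b.length);
}

// e.g. matchStrength("Bill Inmon, Denver", "letter from Bill Inmon of Denver") -> 1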
A themed match is another way to make a match, where the unstructured data is organized by themes such as sales, marketing, human resources, engineering, accounting, and distribution. Once the collection of industrially recognized themes has been gathered, the unstructured data is passed against all the themes for a tally.
Linkages between themes and theme words can be done via a raw match of the data.
Another way to link two documents is based on the metadata in the structured environment.
A two-tiered data warehouse is typically used when unstructured data is brought into the data warehouse:
one tier holds the structured data and the other holds the unstructured data.
The unstructured data can be further subdivided, for example into communications and documents.
The relationship between communications and documents is formed on the basis of identifiers.
Unstructured data visualization is very different from that for structured data, which is more the domain of business intelligence. For unstructured data, we create a self-organizing map in which clusters are depicted.
Volume of data is an issue with every warehouse, whether it holds unstructured or structured data.
The two environments are fit together based on identifiers. In the structured environment we keep both metadata and data pertaining to the unstructured environment. The metadata has information about the repository. The records have information about the data, the identifier, and the close identifier. Close identifiers are those for which there is a good indication that a match has been made.

When discussing the migration of data to the architected environment during the build phase of the warehouse, it is important to discuss the feedback loop. First the data warehouse is populated from existing environments. Then the DSS analyst uses it for analysis and gives new requirements to the architect. The architect works on the existing systems to meet the requirements from the warehouse. This is the feedback loop.
The loop requires some data to be available in the warehouse for the DSS analyst to come up with new requirements. The shorter the loop, the more successful the warehouse; the larger the volume, the longer the loop.
The architect owns the data model. All feedback is translated as changes to this data model which is then used together with the existing system of record to improve the warehouse.
The architect also has to keep track of the agents of change, namely the aging of systems, the aging of technology, organizational upheaval, and drastically changed requirements.
These agents of change let the architect coordinate the efforts to rebuild the operational systems. This works even better than changing the data model.
The steps taken to restructure the operational environment are as follows:
identify how the data model differs from the existing systems,
study the impact of each delta item,
estimate how much it will cost to fix each delta item,
and prepare a report to management on what needs to be fixed, the estimate of resources, the order of work, and the disruption analysis.
The methodology used for the data warehouse is called the spiral development methodology. It differs from the migration path described above: the migration path is about general activities, while the methodology describes the specific activities, their deliverables, and the order of the activities. Together, the methodology and the migration path cover all that is required for building the warehouse.
A word of caution about methodologies: there is an industry track record of even the most rationally appealing methodologies failing during implementation.
The reasons include the following: they prescribe a set of activities that often ignores iteration; they describe how to do something but not what to do; they ignore the scope or the detail of the work involved; they mix project management with development and design; they don't differentiate operational from DSS systems; and they don't include checkpoints and stops on failure.
In this sense the spiral methodology is different from the others in that it proceeds down the same path repeatedly, as opposed to, say, the waterfall methodology. Further, it is said to be data driven.

Friday, December 27, 2013

Since we have discussed building the data warehouse, we will now discuss the migration to this environment and the ease of use. We mentioned earlier that this is a step-by-step activity. Most successful implementations have been built one iteration at a time. The size and the speed of the iterative development are important, and the results must be delivered quickly.
But migration is different from iterative development. Many migrations have failed, and hence the book describes a generic migration plan that has worked in many cases.
This migration plan begins with a corporate data model. This data model includes major subjects, their definitions, relationships, and groupings.
Note that this model excludes derived data and DSS data because they are subject to change and are temporal.
After the corporate data model and the mid-level models are in place, the next activity is defining the system of record. The system of record is defined in terms of the corporation's existing systems. It represents the best data in terms of completeness, timeliness, and accuracy.
Next the challenges in bringing the data into the data warehouse are identified.
These could include the following:
if the data arrives in large volumes, then the arrival rate has to be accounted for in handling the influx of data;
data that is voluminous is also summarized, aggregated, or partitioned;
data may have to be extracted through interfaces, and special transformations that include integration, alteration of the time basis, condensation, and scanning may need to be done (a small sketch of one such transformation follows this list).
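A small sketch of one such transformation, altering the time basis by condensing hypothetical daily transaction records into monthly rows before loading (the record layout is invented for illustration):

// Sketch of a time-basis alteration: condense daily transaction records
// into one summarized row per account per month before loading.
// The record layout (acct, date, amount) is a hypothetical example.
function condenseToMonthly(transactions) {
    var snapshot = {};
    transactions.forEach(function (t) {
        var month = t.date.slice(0, 7);            // "YYYY-MM"
        var key = t.acct + "|" + month;
        if (!snapshot[key]) snapshot[key] = { acct: t.acct, month: month, total: 0, count: 0 };
        snapshot[key].total += t.amount;
        snapshot[key].count += 1;
    });
    return Object.keys(snapshot).map(function (k) { return snapshot[k]; });
}

// condenseToMonthly([{ acct: "A1", date: "2013-12-01", amount: 40 },
//                    { acct: "A1", date: "2013-12-15", amount: 60 }])
// -> [{ acct: "A1", month: "2013-12", total: 100, count: 2 }]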
If the data modeling activity has been done properly, the design of the data warehouse is fairly simple. When the design is complete, the warehouse is organized by subject area.
Once the interfaces have been built, the first subject area is populated. The data is captured and transported into the warehouse environment. Sometimes only a fraction of data is required.
End users work by discovering the interfaces and then using them to populate the data.
Often the first population of the data is incorrect, but it reveals many of the corrections needed to populate it correctly going forward.
The population and the feedback processes continue afterwards.
By adding a lag time, the data warehouse developer serves the DSS needs of the company. This lag time is usually 24 hours.
Sometimes there is a call for rapid population of the warehouse, which is referred to as active data warehousing since it supports a small amount of online access processing.
We continue our discussion on external data and the data warehouse. Today we look at the role of metadata for external data. The metadata includes the document ID, date of entry into the warehouse, description of the document, source, source date, classification, index words, purge date, physical location, length, and references. Associated with this data is another type of data called notification data.
This is merely a convenience that lets users subscribe to alerts when a certain type of external data becomes available.
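A hypothetical metadata entry built from the fields listed above might look like the following; the field names and values are my own illustration, not the book's layout:

// Hypothetical metadata entry for one piece of external data; the field
// names mirror the list above, the values are made up for illustration.
var externalDocEntry = {
    documentId: "EXT-2013-0417",
    dateOfEntry: "2013-12-27",           // when it entered the warehouse
    description: "Industry quarterly outlook report",
    source: "Trade journal",
    sourceDate: "2013-12-15",            // date the source published it
    classification: "market analysis",
    indexWords: ["outlook", "quarterly", "competitor"],
    purgeDate: "2015-12-27",             // end of its useful lifetime
    physicalLocation: "nearline://ext-store/2013/0417",  // the document itself lives outside the warehouse
    lengthInBytes: 482133,
    references: ["EXT-2013-0311"]
};

// Notification data: users who asked to be alerted when this class of
// external data arrives.
var notification = { classification: "market analysis", subscribers: ["dss_analyst_01"] };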
External data is not directly stored in the warehouse. Its entries are made in the metadata of the data warehouse and the external data resides elsewhere.
One of the important design considerations for external data is that it often contains many different components, some of which are more useful than others, so it has to be divided into manageable units.
Since external data cannot be reshaped, it has no place in the data model. The best that can be done here seems to be to keep references to the external data.
Summarization and trends of external data are often more useful than the raw data and are stored with it. Take the Dow Jones index, for example: the daily report is important, but the trend and the monthly average can be calculated and stored together with the external data.
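For instance, a simple trend such as an N-day moving average could be computed and kept with the external data; a minimal sketch with invented figures:

// Sketch: derive a simple trend (an N-day moving average) from daily index
// closes so it can be stored with the external data. Figures are invented,
// not real Dow Jones data.
function movingAverage(closes, window) {
    var out = [];
    for (var i = window - 1; i < closes.length; i++) {
        var sum = 0;
        for (var j = i - window + 1; j <= i; j++) sum += closes[j];
        out.push(sum / window);
    }
    return out;
}

// movingAverage([16000, 16010, 15990, 16020], 2) -> [16005, 16000, 16005]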
Every piece of information, external or otherwise, has a lifetime at the end of which it can be discarded or archived. On the other hand, the cost of the metadata is quite low, so it can be kept intact.
One of the advantages of external data is that it can be compared with internal data. The comparison gives management a unique perspective. This comparison is usually done on a common key, but that common key is sometimes difficult to find. For example, the key structure of the external data may need to be converted to that of the internal data, or the data may not be clean, requiring work such as segregating products so that apples are compared to apples and not to oranges.
In short, we have seen that the external data holds a lot of information but has many issues to deal with. Hence it is captured, processed, stored on different media, setup with notifications and referenced within the warehouse. The warehouse manages and tracks the external data without the latter pervasively invading the system.

---------------
In addition to our usual post today, I want to show sample code for using the UI automation library with iOS:
The following code tries to add a sentence six times to the Notes app.

var target = UIATarget.localTarget();
var app = target.frontMostApp();
var testName = 'Take Notes';
UIALogger.logStart(testName);
for (var i = 0; i < 6; i++)
{
    // The first pass creates a new note; later passes reopen the first note in the list.
    if (i == 0) app.navigationBar().buttons()["Add"].tap();
    else app.mainWindow().tableViews()[0].cells()[0].tap();
    var page = app.mainWindow().textFields()[0];
    // Append the test sentence to whatever the note already contains.
    var currentContent = page.getValue();
    currentContent += "The big brown bear ate the brownies in the big paper bag.";
    page.setValue(currentContent);
    UIALogger.logMessage("Added sentence : " + i);
    // Return to the list of notes.
    app.navigationBar().buttons()["Notes"].tap();
}
// Reopen the note and verify that the sentence was actually saved.
app.mainWindow().tableViews()[0].cells()[0].tap();
if (app.mainWindow().textFields()[0].getValue().match(/bear/g)) {
    UIALogger.logPass(testName);
} else {
    UIALogger.logFail(testName);
}

Thursday, December 26, 2013

We return to the book on building the data warehouse:
We will now look into external data for the data warehouse. Up to now, all the data we have discussed comes from internal systems, i.e., systems internal to the corporation, and it has already been processed into a regularly occurring format. There are several issues that come with external data, but first let's look at why the data warehouse is the right place to store it. The external data should not invade the system. It should not lose the information about its source, and this is best ensured in a single centralized data warehouse. Further, if it doesn't come to the warehouse, it could come in formats and through channels that are not managed. Lastly, we want tracking of external data, which is facilitated by the warehouse.
On the other hand, external data is difficult to deal with for the following reasons:
The external data doesn't have a regular frequency of availability; it's not within the corporation's control. It may even call for constant monitoring and alerts.
The other problem is that external data is totally undisciplined. Several reformattings and restructurings may be required, and they break ever so often.
The third factor that makes external data hard to capture is its unpredictability: external data may come from practically any source at any time.
External data can be of the following types:
Records of excellent data collected by some source
External data from random reports, articles and other sources.
There are several ways to capture and store external data. Near-line storage can help make the external data accessible when it would otherwise cost huge amounts of money to store. Indexes to the external data can be created and kept on disk to reduce the traffic to the external data itself.
Another technique for handling external data is to create two stores: one that holds all the data and another that holds only a subset.

I will take a short break from our readings from the book on building a data warehouse to discuss the implementation of the Matsuo algorithm. I've mentioned that the algorithm proceeds with the following steps: preprocessing, selection of frequent terms, clustering of frequent terms, calculation of the expected probability as the ratio of term co-occurrences with a cluster to the total number of terms, calculation of the chi-square value based on the differences between expected and observed frequencies, and output of the keywords.
In this post, I want to revisit the step involving the clustering of frequent terms. Note that we find the term frequencies, set the cutoff at the top 30% to select the frequent terms, and then build the co-occurrence matrix between all terms and the frequent terms, where a co-occurrence count is the number of sentences in which the two terms appear together.
Now we transpose the columns of this co-occurrence matrix and find the clusters taking two terms at a time. We take the mutual information measure between two terms x and y as log(P(x,y) / (P(x)P(y))), which is consistent with the log(2.0) threshold used below.
We can take P(x) as the expected probability, which is the number of terms in sentences where x appears divided by the total number of terms in the document.
We could use hierarchical clustering methods: we take terms pairwise and put them in a cluster if their mutual information is above a threshold, which we choose to be log(2.0). A term is said to co-occur with a cluster if it co-occurs with any term in the cluster.
Just as we populated the co-occurrence matrix, we take pairs of terms and cluster them. We maintain a term array containing cluster labels or numbers, starting with 1 for the first cluster and incrementing for every new cluster. If one of the terms is already in a cluster, the other term joins that cluster. If the other term belongs to another cluster, the two clusters are merged; the higher of the two cluster numbers is used for the merged cluster, and all members of both clusters are reassigned to it. If a term belongs to more than one cluster, it is merged with the nearest cluster. Eventually, every term is assigned to exactly one cluster, more than one term may be assigned to a single cluster, and all clusters have distinct labels or numbers.
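A compact sketch of this pairwise clustering step; the prob and pairProb callbacks are assumed to come from the expected-probability and co-occurrence calculations described earlier, and the names are my own:

// Sketch of the pairwise clustering step: terms x and y go into the same
// cluster when log(P(x,y) / (P(x) * P(y))) exceeds log(2.0).
// prob(x) and pairProb(x, y) are assumed to be derived from the
// co-occurrence matrix and expected probabilities discussed above.
function clusterTerms(terms, prob, pairProb) {
    var label = {};                      // term -> cluster number
    var next = 1;
    var threshold = Math.log(2.0);
    for (var i = 0; i < terms.length; i++) {
        for (var j = i + 1; j < terms.length; j++) {
            var x = terms[i], y = terms[j];
            var p = pairProb(x, y);
            if (p <= 0) continue;        // the pair never co-occurs
            var mi = Math.log(p / (prob(x) * prob(y)));
            if (mi <= threshold) continue;
            if (label[x] && label[y] && label[x] !== label[y]) {
                // merge: the higher cluster number wins, members of the other are reassigned
                var keep = Math.max(label[x], label[y]);
                var drop = Math.min(label[x], label[y]);
                for (var t in label) if (label[t] === drop) label[t] = keep;
            } else if (label[x] || label[y]) {
                // one term is already clustered: the other joins that cluster
                label[x] = label[y] = label[x] || label[y];
            } else {
                // neither term is clustered yet: start a new cluster
                label[x] = label[y] = next++;
            }
        }
    }
    return label;                        // each clustered term ends with exactly one label
}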