Thursday, December 19, 2013


Interview question : Find the min in a shifted (rotated) sorted array : {26, 33, 49, 63, 3, 7, 13, 18, 19, 23}

// linear scan: in a shifted (rotated) sorted array, the first element
// smaller than everything before it is the minimum
int Min(int[] numbers)
{
    if (numbers == null || numbers.Length == 0) throw new ArgumentException("numbers");
    int min = numbers[0];
    for (int i = 1; i < numbers.Length; i++)
    {
        if (numbers[i] < min)
        {
            min = numbers[i];
            break; // for a shifted-sorted array, the first drop is the minimum
        }
    }
    return min;
}

// binary chop
// binary chop: if the range is still rotated (first > last), recurse into
// the half that contains the rotation point; otherwise the range is already
// sorted and the first element is the minimum
int Min(int[] numbers, int start, int end)
{
    // validate parameters
    if (numbers == null || start < 0 || end >= numbers.Length || start > end) throw new ArgumentException();
    // termination condition
    if (start == end) return numbers[start];
    if (numbers[start] > numbers[end])
    {
        int mid = start + (end - start) / 2; // parenthesized to fix precedence, and avoids overflow
        if (numbers[start] > numbers[mid])
            return Min(numbers, start, mid);
        else
            return Min(numbers, mid + 1, end);
    }
    else
        return numbers[start];
}
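As a quick check, a hypothetical driver for the example array (it assumes the two methods above are in scope and that using System is available for Console):

int[] numbers = { 26, 33, 49, 63, 3, 7, 13, 18, 19, 23 };
Console.WriteLine(Min(numbers));                         // prints 3 (linear scan)
Console.WriteLine(Min(numbers, 0, numbers.Length - 1));  // prints 3 (binary chop)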




Wednesday, December 18, 2013

We continue our discussion of the data warehouse.
Extraction of data from one place to another seems deceptively simple but usually turns out to be much more complex and larger than initially assumed.
We see some of these transformations now:
1) The extraction of data from the operational, legacy environment to the data warehouse environment requires a change in technology, including operating systems, hardware, and data.
2) The selection of data from the operational environment may require reading several tables, following relational constraints, and being done during the online operational window.
3) Input keys from operational data have to be restructured and converted before they are written to the data warehouse (see the sketch after this list).
4) Non-key data is reformatted - for example, date-time formats may be changed to be consistent across the warehouse.
5) Data is also cleansed before passing to the warehouse with cross-record verification and domain checks.
6) Multiple sources of data exist and must pass into the data warehouse.  This means that we tap into each data source with a defined logic.
7) When there are multiple input files, key resolution has to be done before the files can be merged.
8) If there are multiple files, the sequence of files may not be compatible and this may involve record migrations as well.
9) The transformation logic that creates the data warehouse may produce different summarization levels for different data sources, and hence different outputs that need to be reconciled.
10) There have to be default values when there is no source of data.
11) The efficiency of the selection of data is often a performance consideration because only a few records from the original source may need to be pulled.
12) Data also needs to be summarized and combined into the profile record.
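A minimal sketch of the key restructuring and reformatting described in items 3 and 4 above. The record layouts, field names, and the target key format are hypothetical, only meant to illustrate the shape of such a transformation:

using System;
using System.Globalization;

// hypothetical operational and warehouse record layouts
class OperationalOrder
{
    public string BranchCode;    // e.g. "NY01"
    public int LocalOrderNo;     // key is only unique within a branch
    public string OrderDate;     // source system's local date format, e.g. "12/19/2013"
    public decimal Amount;
}

class WarehouseOrder
{
    public string OrderKey;      // restructured, globally unique key
    public DateTime OrderDate;   // consistent date representation across the warehouse
    public decimal Amount;
}

WarehouseOrder Transform(OperationalOrder src)
{
    return new WarehouseOrder
    {
        // item 3: restructure and convert the input key before writing to the warehouse
        OrderKey = src.BranchCode + "-" + src.LocalOrderNo.ToString("D8"),
        // item 4: reformat non-key data, e.g. normalize the date-time format
        OrderDate = DateTime.ParseExact(src.OrderDate, "MM/dd/yyyy", CultureInfo.InvariantCulture),
        Amount = src.Amount
    };
}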

Data warehouses are built for a wide variety of systems. However, each record in the warehouse follows a structure of data called a snapshot. This has four components: a key, a unit of time, primary data that relates only to the key, and secondary data captured as part of the snapshot process with no direct relationship to the primary data.
The key identifies the record and the primary data.
The unit of time refers to the moment when the event being described by the snapshot occurred.
The primary data is the non-key data that relates only to the key.
The secondary data is circumstantial data and is typically used with DSS processing.
A relationship between the primary and the secondary data can be inferred because they are adjacent in the snapshot. This is called an artifact.
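A rough sketch of such a snapshot record; only the four-part structure comes from the description above, while the field names and types are hypothetical:

using System;

// four-part snapshot structure: key, unit of time, primary data, secondary data
class Snapshot
{
    public string Key;              // identifies the record and the primary data
    public DateTime UnitOfTime;     // when the event described by the snapshot occurred
    public decimal AccountBalance;  // primary (non-key) data, related only to the key
    public string BranchWeather;    // secondary, circumstantial data captured with the snapshot
}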
We now discuss metadata. Metadata is very useful for describing the data in the warehouse. It acts like an index, and it stores the structure of the data as known to the programmer and the DSS analyst, a hint about the source data, the data model, and a history of extracts.
Reference data is another component commonly used with data warehouses. It keeps track of referential integrity. Reference data tends to be ignored during data accumulation; however, it should be made time-variant just like any other part of the warehouse.
There are two different designs for reference data.
The first approach takes a snapshot of the reference data every six months. This approach is quite simple, but it's not sufficient because it might miss the activity between snapshots.
The second approach keeps track of all the activities against the reference table.
The actual implementation might be something in between the first and the second approach.
There is another important factor pervasive in the data warehouse. This is the length of time a change of data in the operational environment takes to be reflected in the data warehouse and is referred to as the cyclicity of data.
The data warehouse contains historical information. The changes in the operational system should take at least 24 hours  to show up in the warehouse. This introduced delay is referred to as the wrinkle of time.
A wrinkle of time smaller than 24 hours is expensive to achieve, but the goal is to avoid doing operational processing in the warehouse and warehouse processing in the operational environment.

Tuesday, December 17, 2013

In today's blog post, we continue our discussion of the data warehouse. We mentioned that the warehouse is built iteratively. This has been proved correct by the industry track record, because the end user is unable to give all the requirements up front, management doesn't make a commitment until there are some results, and visible results must be seen quickly.
We will now see the role of the data model in iterative development. The data model serves as a roadmap for each of the development efforts and plays a critical role in them. First, the data model tells what needs to be done.
Second the data model tells how one development effort is to be integrated with the other.
The development efforts are like puzzle pieces and the data model makes sure the efforts fall in place.
If the data model were absent, many of the efforts would not fit well with others and there would be overlap and redundancies.
The output of a data model is a set of tables, in fact a lot of small tables each with little data. If a program has to go to several of these tables, it has to interconnect with each of them, and that costs I/O.
A better approach would be to physically merge some of the tables so that minimal I/O is consumed.
Merging tables is one approach. Another approach is to create an array of data. An array of data lets a sequence of values reside together so that they can be accessed directly; for example, creating an array by month is easy. It might be interesting to note that sometimes redundant data is inserted. For example, if a base parts table has a non-redundant field called description, then the other parts tables have to go to the base parts table for that field, which is expensive. Instead, if the description field is redundantly placed in the tables that need it, its access is local and less expensive. Note, however, that this trade-off concerns access to the tables.
Likewise, if there are other fields in a table that are used much less frequently (or much more frequently) than the rest, they can be stored in separate tables so that the most frequently used fields are kept by themselves.
Another technique is to use derived data in the physical database design. For example, if we calculate annual taxes at year-end every year, we can store the calculated data, and all subsequent accesses can use the calculated field.
Another technique is to use a creative index or a creative profile. This type of index is created as data is passed from the operational environment to the warehouse environment. Since the data is being accessed unit by unit anyway, the index creation is easy, inexpensive, and based on the users' interests.
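As a rough illustration of the creative index idea, a sketch that builds a small index of the records of most interest (here, the largest orders) while the data is being passed from the operational environment into the warehouse; the record type and the criterion are hypothetical:

using System.Collections.Generic;

// hypothetical record flowing from the operational environment into the warehouse
class OrderRecord
{
    public string OrderKey;
    public decimal Amount;
}

class CreativeIndex
{
    private readonly List<OrderRecord> topOrders = new List<OrderRecord>();
    private const int Capacity = 10;

    // called once per record during the extract/load pass, so building the
    // index costs no extra scan of the data
    public void Observe(OrderRecord record)
    {
        topOrders.Add(record);
        if (topOrders.Count > Capacity)
        {
            // keep only the largest orders seen so far
            topOrders.Sort((a, b) => b.Amount.CompareTo(a.Amount));
            topOrders.RemoveAt(topOrders.Count - 1);
        }
    }

    public IEnumerable<OrderRecord> TopOrders { get { return topOrders; } }
}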
Finally, referential integrity can be used to establish relationships between databases. Each solution can be tested with a few use cases, such as name and dept.
We continue our discussion of the data warehouse. We look into the "design" of the interface from operational systems. The design is actually heuristic: the requirements for a warehouse are not known until it is partially populated and in use, so a phase-by-phase approach is used instead. One portion of data is populated. It is then used and scrutinized by the DSS analyst. Feedback from the user leads to data modifications, and then another portion of the data warehouse is built. This feedback loop continues throughout the life of the warehouse.
On the other hand, anticipating requirements is still important, so there is a trade-off.
Design usually begins with operational data, the source for the warehouse. However, it's not a simple extract-and-load operation; there are several transformations involved. Operational data comes from different applications, and those applications are often not integrated. Data, as it makes its way into the warehouse, has to be integrated across applications. Integration means standardizing the conventions across systems, keeping the syntax and semantics the same on merging, and correctly mapping source fields to destination fields.
Three types of loads are made into the data warehouse from the operational environments:
archival data
data currently in the operational environment
periodic refreshes from updates to the data in the operational environment
Loading archival data is often skipped, or it is a one-time effort.
Loading current data is not disruptive because the data is loaded only once and usually via downloaded files.
Loading periodic refreshes, on the other hand, is a major concern because we have to find what has already been loaded and what has not. Some common techniques used to limit the scanning of data are as follows (a sketch of the first technique appears after this list):
1) scan the time-stamped data of the source and exclude the records that are not in range; however, the source may not have timestamps.
2) scan the delta file that contains only the changes made to the application as a result of the transactions.
3) scan the log file or the audit file that is generated as a by-product of transaction processing.
4) modify the application code to manage the amount of data scanned; however, application code can be legacy, old, and fragile.
5) the last option is literally a last resort: take before and after images of the operational file and compare them, which means dedicating a large amount of resources.
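A minimal sketch of technique 1 above, assuming a hypothetical source record that does carry a timestamp; the refresh-window filter is the point, everything else is made up:

using System;
using System.Collections.Generic;
using System.Linq;

class SourceRecord
{
    public string Key;
    public DateTime LastUpdated;   // assumes the source actually carries a timestamp
}

// select only the records touched since the previous refresh,
// so the periodic load does not rescan the whole source
IEnumerable<SourceRecord> SelectForRefresh(IEnumerable<SourceRecord> source, DateTime lastRefresh)
{
    return source.Where(r => r.LastUpdated > lastRefresh);
}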
 

Monday, December 16, 2013

In data warehouses, exploration and data mining are used to analyze masses of historical data and to discover patterns of unknown business activity. The data warehouse contains cleansed, integrated and organized data.
A data warehouse can be designed in a hybrid form to create what is known as a living sample database, which is especially useful when the data has grown very large. The data in this database could be true archival data or lightly summarized data. Since it's a subset of the original data warehouse, it needs to be periodically refreshed. This is also not a general-purpose database; it is useful only for analysis or for finding trends. Queries that cannot practically be run over the full database can be run on this smaller one.
The selection of data for this database is usually random but in some cases, a "judgement sample" is taken in which the records meet a certain criteria.
This translates to improved productivity for a DSS analyst, with reduced turnaround time.
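A rough sketch of the selection just described: a random sample plus a "judgement sample" filter. The record type, the sampling rate, and the criterion are all hypothetical:

using System;
using System.Collections.Generic;
using System.Linq;

class WarehouseRecord
{
    public string Key;
    public decimal Revenue;
}

// random sample: keep roughly one record in a hundred
IEnumerable<WarehouseRecord> RandomSample(IEnumerable<WarehouseRecord> warehouse, Random rng)
{
    return warehouse.Where(r => rng.Next(100) == 0);
}

// judgement sample: keep only records that meet a chosen criterion
IEnumerable<WarehouseRecord> JudgementSample(IEnumerable<WarehouseRecord> warehouse)
{
    return warehouse.Where(r => r.Revenue > 10000m);
}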
A second major issue for the data warehouse is partitioning. Partitioning of data refers to the breakup of data into separate physical units that can be handled independently.
Proper partitioning helps the warehouse in the following ways:
Loading data
Accessing data
Archiving data
Deleting data
Monitoring data
and Storing data
Independently managed partitions of data are portable for different processing environments. When data is monolithic, it becomes harder to restructure, index, sequentially scan, reorganize, recover and monitor.
Flexible access of data is a key design goal, and partitions help with that goal. Data can be partitioned in many different ways, such as by date, by line of business, by geography, by organizational unit, or by all of the above. The choice of how to partition the data is up to the developer.
Partitioning can be done in many ways, at the system level or at the application level. If the partitioning is at the system level, the DBMS requires that there be a single definition of the data. If the partitioning is at the application level, the data can be moved around between environments. One test for a good partitioning of data is to see whether an index can be added without major restructuring or hampering of operations.
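As an illustration of application-level partitioning by date, a small sketch that routes each record to a per-year physical unit; the record type and the naming convention are made up:

using System;

class ClaimRecord
{
    public string Key;
    public DateTime ClaimDate;
}

// application-level partitioning: the application, not the DBMS, decides
// which physical unit (here, a per-year table or file) a record belongs to
string PartitionFor(ClaimRecord record)
{
    return "claims_" + record.ClaimDate.Year;   // e.g. "claims_2013"
}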

Sunday, December 15, 2013

We describe a simple tweak to Matsuo's paper on keyword extraction from a single document, where we replace the unconditional probability of a term in the document with the prior Bayesian probability from a corpus. We could pre-compute and store the Bayesian probabilities of all terms in a table. We include all words from an English dictionary and assign a default value to those terms not found in the corpus. We use the nltk.NaiveBayesClassifier to populate this table beforehand. Since the original approach finds the probabilities of only the top 30% of the frequent terms, the number of documents whose frequent terms won't be found in the corpus should be fairly small.
Said another way, we don't just set the weights on the terms based on their frequency in the document, but take into account the overall likelihood of that term appearing in any document. In Matsuo's approach, this substitution does not alter the meaning of the expected probability too much. If anything, it improves on the probability by drawing from a wider source than the single document.
We also aim to differentiate terms that occur a similar number of times in a document. This is especially useful for short text, where the frequency may not give as much information as in a larger text. Further, the original approach was to find the term weights based on the document itself and avoid the corpus. In this case too, we avoid the corpus at extraction time by using a pre-computed prior probability table, which is useful in cases where we need to discriminate between the terms.
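A minimal sketch of that pre-computed prior table, assuming the per-term probabilities have already been produced offline (for instance from the corpus with nltk.NaiveBayesClassifier, as mentioned above) and loaded into a dictionary; the default value for unseen terms is a placeholder:

using System.Collections.Generic;

class PriorTable
{
    // term -> prior probability, pre-computed offline from a corpus
    private readonly Dictionary<string, double> priors;
    private readonly double defaultPrior;

    public PriorTable(Dictionary<string, double> priors, double defaultPrior = 1e-6)
    {
        this.priors = priors;
        this.defaultPrior = defaultPrior;
    }

    // dictionary words that never occurred in the corpus get the default value
    public double Prior(string term)
    {
        double p;
        return priors.TryGetValue(term, out p) ? p : defaultPrior;
    }
}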
Note that this tweak is more suitable to the approach we modify and may not apply in all cases. For example, in the KLD distance calculation, KLD distance is about relative probability distributions and is a measure of how different one is from the other. In that case, we want the probabilities to come from the same sample space.
Here we cluster sentences based on the KLD similarity between pairs of sentences, where each sentence is represented by its term vector of probabilities. We compute the probability that a term occurs in a sentence as the ratio of the term frequency of the term in the sentence to the total term frequency across all the sentences. Depending on the scope, we could work with paragraphs instead of sentences, or treat them as short text documents. By clustering, we categorize the text and therefore select the keywords that represent the categories. In both symmetric and pairwise clustering, we use co-occurrence of terms. The difference is that symmetric clustering groups phrases while pairwise clustering groups relevant terms with the original approach.
KLD equations come in many forms. All of them use the same probability distributions.
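As a sketch, the standard form KL(P||Q) = sum over terms of p * log(p / q), computed between two sentences' term-probability vectors; the epsilon used for terms missing from one distribution is an arbitrary smoothing choice, and the symmetric variant shown is just one of the many forms mentioned above:

using System;
using System.Collections.Generic;

// Kullback-Leibler divergence between two term-probability distributions,
// KL(P||Q) = sum_t p(t) * log(p(t) / q(t)); a small epsilon stands in for
// terms missing from q so the logarithm stays defined
double KLDivergence(Dictionary<string, double> p, Dictionary<string, double> q, double epsilon = 1e-9)
{
    double kld = 0.0;
    foreach (var term in p.Keys)
    {
        double pt = p[term];
        double qt = q.ContainsKey(term) ? q[term] : epsilon;
        if (pt > 0.0)
            kld += pt * Math.Log(pt / qt);
    }
    return kld;
}

// a symmetric variant sometimes used as a similarity measure between sentences
double SymmetricKL(Dictionary<string, double> p, Dictionary<string, double> q)
{
    return KLDivergence(p, q) + KLDivergence(q, p);
}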