Saturday, January 25, 2014

In this post, we complete the discussion on Teradata and its SQL, beginning with a brief look at stored procedures. Teradata follows standard SQL conventions, so writing them is similar to writing them elsewhere, but it also provides a comprehensive stored procedure language. Stored procedures are stored in either DATABASE or USER PERM space and are invoked with the CALL statement. A procedure may return one or more values to the client as parameters, and IN, OUT, and INOUT parameters can all be used with a stored procedure.
Triggers can be authored similarly. A trigger is an event-driven maintenance operation: when a table is modified, the trigger is executed. The original data modification, the trigger, and all subsequent triggers are all part of the same transaction. Triggers can be specified as FOR EACH STATEMENT as well as FOR EACH ROW of the data modification.
The hashing functions on Teradata are also very popular. The HASHROW function produces the 32-bit binary row hash that is stored as part of the data row, and it can be executed on data column values. When the count of distinct HASHROW values equals the count of distinct column values (a ratio of 1), it is a great sign that the data is perfectly distributed or even unique.
The HASHBUCKET function produces the binary hash bucket (16 bits on older releases, 20 bits on newer ones) that is used with the hash map to determine the AMP that should store and retrieve the data row; a 20-bit bucket can return just over 1,000,000 unique values. The HASHAMP function returns the identification number of the primary AMP for any hash bucket number, and the HASHBAKAMP function returns the identification number of the fallback AMP for any hash bucket number. Together they are a great way to see the distribution of primary and fallback rows.
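As a rough conceptual sketch of that flow (this is not Teradata's actual hashing algorithm; the hash function, bucket width, and AMP count below are illustrative), a primary index value is hashed to a 32-bit row hash, the high-order bits give the bucket, and a hash map sends the bucket to an AMP:

import zlib

NUM_AMPS = 4          # illustrative AMP count
BUCKET_BITS = 16      # 16-bit buckets on older releases, 20-bit on newer ones
HASH_MAP = {b: b % NUM_AMPS for b in range(2 ** BUCKET_BITS)}  # bucket -> AMP

def row_hash(pi_value):
    # stand-in 32-bit row hash of a primary index value
    return zlib.crc32(str(pi_value).encode()) & 0xFFFFFFFF

def hash_bucket(rh):
    # the bucket comes from the high-order bits of the row hash
    return rh >> (32 - BUCKET_BITS)

def amp_for(pi_value):
    # the hash map determines which AMP stores and retrieves the row
    return HASH_MAP[hash_bucket(row_hash(pi_value))]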
The Batch Teradata Query tool (BTEQ) has been around for some time. It is a report writer that imports and exports data to and from a Teradata server one row at a time. Queries can be run interactively or in a batch script. After you log on, you can execute queries directly in interactive mode. The WITH and WITH ... BY commands are available for totaling and sub-totaling. Scripts can be batched and saved in a text file. This concludes our reading of the book on Teradata.

 
Having looked at occurrence- and co-occurrence-based keyword selection, let's recap: both Zipf's law of word occurrences and the refined studies of Booth demonstrate that mid-frequency terms are closely related to the conceptual content of a document. It is therefore possible to hypothesize that terms whose frequency is close to the transition point (TP) can be used as index terms, where TP = (-1 + sqrt(8 * I + 1)) / 2 and I is the number of words with frequency 1. Alternatively, TP can be found as the first frequency that is not repeated when the vocabulary is sorted by non-increasing frequency.
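A minimal sketch of both ways of finding TP, assuming the input is simply a list of tokens (the function name is illustrative):

from collections import Counter
from math import sqrt

def transition_point(tokens):
    freqs = Counter(tokens)
    # closed form: I is the number of words that occur exactly once
    I = sum(1 for f in freqs.values() if f == 1)
    tp_formula = (-1 + sqrt(8 * I + 1)) / 2
    # alternative: the first frequency that is not repeated when the
    # vocabulary is sorted by non-increasing frequency
    sorted_freqs = sorted(freqs.values(), reverse=True)
    occurrences = Counter(sorted_freqs)
    tp_scan = next((f for f in sorted_freqs if occurrences[f] == 1), tp_formula)
    return tp_formula, tp_scan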
We will now attempt to explain a simple clustering-based method for co-occurrence-based keyword selection; a sketch of the procedure follows the formula below.
We start with a co-occurrence matrix in which we count the occurrences of pairwise terms. The matrix is N * M, where N is the total number of terms and M is the number of words selected as keywords with the TP criterion above, with M << N. We ignore the diagonal of this matrix. The row-wise total for a term is the sum of all its co-occurrences with the selected set.
If terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. Similarity is measured with the Kullback-Leibler divergence, KLD(p || q) = sum over all terms of p * log(p / q), where we take p and q to be the attribute-wise co-occurrence distributions from the matrix we populated. We then use the standard k-means clustering method: the initial cluster centroids are chosen far from each other, terms are assigned to the nearest cluster, the centroids are recomputed, and the cycle is repeated until the cluster centers stabilize.
The distribution of co-occurring terms in particular is given by the weighted average of the attribute-wise distributions P_z(t) = n(t, z) / n_z:

P(t) = Sum over z of (n_z / n) * (n(t, z) / n_z)
     = (1/n) * Sum over z of n(t, z)

where n(t, z) is the number of co-occurrences of term t with attribute z, n_z is the total for attribute z, and n is the overall total.
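A minimal sketch of this procedure, assuming tokenized sentences and an already selected keyword set; names, the smoothing constant, and the simple random centroid initialization (rather than the far-apart seeding described above) are illustrative:

import numpy as np

def cooccurrence_matrix(sentences, terms, keywords):
    # rows are all N terms, columns are the M selected keywords; the diagonal is ignored
    t_idx = {t: i for i, t in enumerate(terms)}
    k_idx = {k: j for j, k in enumerate(keywords)}
    C = np.zeros((len(terms), len(keywords)))
    for sentence in sentences:
        present = set(sentence)
        for t in present:
            for k in present:
                if t != k and t in t_idx and k in k_idx:
                    C[t_idx[t], k_idx[k]] += 1
    return C

def kld(p, q, eps=1e-9):
    # Kullback-Leibler divergence between two co-occurrence distributions
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def kmeans_kld(C, k, iters=20, seed=0):
    # k-means over the row distributions, using KLD as the distance
    rng = np.random.default_rng(seed)
    rows = C / C.sum(axis=1, keepdims=True).clip(min=1)
    centroids = rows[rng.choice(len(rows), size=k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmin([kld(r, c) for c in centroids]) for r in rows])
        centroids = np.array([rows[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels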

 

Friday, January 24, 2014


We continue with the Vector Space Model next. In this model, we represent the documents of a collection by vectors with weights according to the terms appearing in each document. The similarity between two documents d1 and d2, or between a document and a query q, is measured as the cosine of the angle between their vectors. A document is said to be closer to the query when the cosine is larger, that is, when the angle between them is smaller. Each query is modeled as a vector in the same attribute space as the documents.
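A minimal sketch of this ranking, assuming the documents and the query are already encoded as weight vectors over the same attribute space:

import numpy as np

def cosine(doc_vec, query_vec):
    # cosine of the angle between a document vector and the query vector
    denom = np.linalg.norm(doc_vec) * np.linalg.norm(query_vec)
    return float(doc_vec @ query_vec / denom) if denom else 0.0

def rank_documents(doc_vectors, query_vec):
    # larger cosine means a smaller angle, i.e. a document closer to the query
    scores = [cosine(d, query_vec) for d in doc_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)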
Groups of documents that are similar with respect to the users' information needs can be obtained by clustering, with the documents represented as a collection of points in a multidimensional space whose dimensions are given by the selected features. There are many different clustering algorithms, and almost any of them will produce clusters; their fitness or quality must then be measured.
Sometimes it is better to reduce the number of dimensions without significant loss of information. This transformation to a subspace helps with clustering and reduces the computation. Another technique uses a matrix for principal component analysis: the document and query vectors are projected onto a different subspace spanned by k principal components, and the matrix, for example a covariance matrix, defines the principal values.
Notice that reducing the number of dimensions does not shift the origin, only the number of dimensions, so the vectors remain spaced apart the same way. Principal component analysis, on the other hand, shifts the origin to the center of the data before projecting onto the subspace, so the documents are better differentiated.

import numpy as np

def LSI(vectors, k):
    # project documents onto the k largest singular directions
    return vectors @ np.linalg.svd(vectors, full_matrices=False)[2][:k].T

def COV(vectors, k):
    # PCA: center the documents, then project onto the top k principal components
    centered = vectors - vectors.mean(axis=0)
    return centered @ np.linalg.svd(centered, full_matrices=False)[2][:k].T
In this post, we will begin with a way to differentiate between LSI and PLSI. Latent Semantic Indexing is an approach where we try to associate terms with the same main hidden semantic component, as found from co-occurrences in similar contexts, by decomposing the co-occurrence data with linear algebra. In PLSI, we use the same occurrence and co-occurrence of terms but with non-negative weights whose sum is 1, that is, conditional probabilities. Comparing and clustering these co-occurrence distributions gives us the latent components, represented by the centers of the clusters.
We use similar clustering methods for document classification as well. Formally, we use the vector space model, which represents the documents of a collection by vectors with weights according to the terms appearing in each document.
The vector space model comes in both boolean and term-weighted variants. In the boolean model, the presence or absence of each term is given a boolean value. In the term-weighted model, the same term may be assigned different weights by different weighting methods, such as tf-idf or a frequency ratio. When the weights add up to one they are said to be normalized; typically probabilities or conditional probabilities that sum to one are used.
One of the advantages of the vector space model is that it enables relevance ranking of documents of heterogeneous formats with respect to user queries, as long as the attributes are well-defined characteristics of the documents.
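A small sketch of the two weighting schemes, assuming tokenized documents and a fixed vocabulary; this uses one common tf-idf variant, and others differ in smoothing and normalization:

import math
from collections import Counter

def boolean_vector(doc, vocab):
    # 1 if the term occurs in the document, 0 otherwise
    present = set(doc)
    return [1 if t in present else 0 for t in vocab]

def tfidf_vector(doc, docs, vocab):
    # term frequency scaled by inverse document frequency
    tf = Counter(doc)
    N = len(docs)
    df = {t: sum(t in set(d) for d in docs) for t in vocab}
    return [(tf[t] / len(doc)) * math.log(N / (1 + df[t])) for t in vocab]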

Thursday, January 23, 2014

In continuation of our discussion on keywords and documents:
We looked at a few key metrics for selecting keywords based on content, frequency and distribution.
However, keywords can also be selected based on co-occurrence of similar terms in the same context,
i.e., words with similar meanings will occur with similar neighbors.

This model proposes that the distribution of a word together with other words and phrases is highly indicative of its meaning. The method represents the meaning of a word with a vector where each feature corresponds to a context and its value to the frequency with which the word occurs in that context.
This vector is referred to as the term profile. The profile P(t) of a term t is the set of terms from the list T that co-occur in sentences together with t, that is, P(t) = {t' : t' belongs to s(t)}, where s(t) denotes the sentences containing t.
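A minimal sketch of building these profiles, assuming the sentences are already tokenized and T is given as a list of terms:

from collections import defaultdict, Counter

def term_profiles(sentences, term_list):
    # profile of term t: frequencies of terms from term_list that co-occur
    # with t in the same sentence
    allowed = set(term_list)
    profiles = defaultdict(Counter)
    for sentence in sentences:
        present = [t for t in set(sentence) if t in allowed]
        for t in present:
            for other in present:
                if other != t:
                    profiles[t][other] += 1
    return profiles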

Once such a representation is built, machine learning techniques can be used to automatically classify or cluster words according to their meaning. However, these are often noisy representations, due for example to polysemy and synonymy. Synonymy refers to the existence of several similar terms that can be used to express an idea or object in most languages. Polysemy refers to the fact that some words have multiple unrelated meanings. If we ignore synonymy, we create many small disjoint clusters when some of them could have been clustered together. If we ignore polysemy, unrelated documents can end up clustered together.

Another challenge in applying this technique to massive databases is the efficient and accurate computation of the largest few singular values and their vectors in a highly sparse matrix. In fact, scalability beyond a few hundred thousand documents is difficult.

Note that the technique above inherently tries to find keywords based on latent semantics, and hence is referred to as latent semantic indexing. The counterpart to clustering in this approach is to find conditional probability distributions such that the observed word distributions can be decomposed into a few latent distributions plus a noisy remainder. The latent components are expected to be the same for co-occurring words.

Next, we will also look at a popular way to represent documents in a collection, which is the Vector Space Model.

Material read from the Survey of Text Mining, papers by Pinto, and papers by Wartena.

Today we will recap some keyword selection techniques:

In this article, we look at different feature selection algorithms. Feature selection metrics are based on content, frequency and distribution.

Content-based feature selection can be based on, say, mutual information or information gain.

MI measures how much information a feature contains about a class, and it is given by log(P(f, c) / (P(f)P(c))).

Information Gain measures the number of bits of information obtained about the presence and absence of a class by knowing the presence or absence of a feature.

It is given by the sum over class membership c' (in the class or not) and feature presence f' (present or absent) of P(f', c') * log(P(f', c') / (P(f')P(c'))).
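A small sketch of both metrics estimated from document counts; the contingency-table representation and the names are illustrative:

import math

def mutual_information(n_fc, n_f, n_c, n):
    # MI(f, c) = log( P(f, c) / (P(f) P(c)) ), estimated from document counts
    return math.log((n_fc / n) / ((n_f / n) * (n_c / n)))

def information_gain(counts, n):
    # counts[(has_feature, in_class)] = number of documents in that cell
    # IG = sum over feature presence/absence and class membership of
    #      P(f', c') * log( P(f', c') / (P(f') P(c')) )
    gain = 0.0
    for (has_f, in_c), n_fc in counts.items():
        p_fc = n_fc / n
        p_f = sum(v for (hf, _), v in counts.items() if hf == has_f) / n
        p_c = sum(v for (_, ic), v in counts.items() if ic == in_c) / n
        if p_fc > 0:
            gain += p_fc * math.log(p_fc / (p_f * p_c))
    return gain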

Gain Ratio improves IG by including features with low entropy that may be strongly correlated with a class. It does this by normalizing IG with the entropy of the class.

Term Strength is based on the idea that the most valuable features are shared by related documents.
The weight of a feature is defined as the probability of finding it in some document d given that it also appears in a document d' similar to d.
To calculate TS for a feature f, we need a threshold on the similarity measure used to judge whether two documents are sufficiently related. This is done by first deciding how many documents can be related to a given one and then finding the average minimum similarity for this number of neighbors over all documents in the collection.
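A minimal sketch of estimating TS, assuming documents are represented as term sets and the related pairs have already been selected with such a threshold:

from collections import Counter

def term_strength(docs, related_pairs):
    # docs: list of term sets; related_pairs: (i, j) index pairs of documents
    # judged sufficiently similar
    seen_in_first, seen_in_both = Counter(), Counter()
    for i, j in related_pairs:
        for t in docs[i]:
            seen_in_first[t] += 1
            if t in docs[j]:
                seen_in_both[t] += 1
    return {t: seen_in_both[t] / seen_in_first[t] for t in seen_in_first}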

Frequency based keyword selection metrics include tf-idf and Transition point.

tf-idf is the term-frequency and inverse document frequency.  This is quite a simple and effective measure for keywords.

Transition Point is based on the idea that medium-frequency terms are closely related to the conceptual content of a document. A higher weight can be given to a term the closer its frequency is to a frequency named the transition point. This point is determined by inspecting the vocabulary frequencies of each text and, scanning down from the highest frequency, identifying the first (lowest) frequency that is not repeated; repeated frequencies are more common among low-frequency words.
Distribution-wise, it is debatable whether a term spread throughout the document should be included or excluded: on one hand it is a good candidate for an index term, and on the other it can be considered overly general.

Wednesday, January 22, 2014

In this post, we return to the discussion on Teradata:
We will cover the HELP and SHOW commands. HELP is a command to list database objects; for example, HELP DATABASE, HELP USER, HELP TABLE, HELP VOLATILE TABLE, HELP VIEW, HELP MACRO, HELP PROCEDURE, HELP TRIGGER, HELP COLUMN, etc. display information about the corresponding database objects. The different database object kinds are aggregate function (A), combined/analytical function (B), external stored procedure (E), function (F), trigger (G), method (H), join index (I), journal table (J), macro (M), hash index (N), procedure (P), queue table (Q), table function (R), ordered analytical function (S), table (T), UDT (U), view (V), and authorization (X). There is also corresponding help available for index, statistics, constraint, session, SQL, SPL, etc.
Comments can also be added to database objects; these are displayed by HELP.
Runtime information such as sessions is also displayed. The parsing engine assigns the session.
The SHOW command displays the SQL used to create the corresponding database object. For example, SHOW TABLE <table-name> returns the CREATE TABLE statement needed to recreate the table. Tables, views, macros, triggers, and procedures can all be shown.
Aggregates such as COUNT ignore null values when used with a column specifier; when used with a star to denote all columns, they include rows with null values. The common aggregates are MIN, MAX, AVG, SUM, and COUNT. GROUP BY delivers one row per group. Rows can be restricted with a predicate, which improves performance; the predicate clause also helps skip calculations in the aggregation. The HAVING clause acts as a filter on the aggregates.
GROUPING SETS, ROLLUP, and CUBE are three grouping options.
The Nexus tool can be used for all of these groupings.
Joins can be INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN. A product join is a Cartesian product and is often not intended.