Wednesday, June 26, 2013

Comparison of decision tree versus vector space model for document databases
A decision tree operates well on structured data where each tuple is defined by a set of attribute-value pairs. The analysis decides which features are most discriminating and builds the tree top down. Once the tree is constructed, it is applied to all records one by one. Alternatively, the rules can be derived from the tree and ordered in a flat list based on their discriminative power and occurrence frequency. Either way, both the attribute-value pairs and the order of evaluation are fixed before any new data is evaluated. This is efficient for large data sets, where each data object can be evaluated against the tree to find the label to assign.
But for document databases, this is not the case. The set of attributes or dimensions is not fixed, and some attributes may be more relevant than others. The vector space model works well in such a high-dimensional space since it uses a mapping function from the term space to a quantitative variable. However, the vector space model may assign improper weights to rare items, since it does not take into account the distribution characteristics of other data points. This can be addressed with a preprocessing step called feature selection, which removes the terms in the training documents that are statistically uncorrelated with the class labels.
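As a rough sketch of that preprocessing step, the code below scores each term of a toy labeled corpus with a chi-square statistic against the class labels and keeps the top-scoring terms. The documents, labels, and cutoff are invented for the example; a real system would compute the statistic over the actual training set.

# A minimal chi-square feature selection sketch over a toy labeled corpus.
# Documents are bags of words; terms weakly correlated with the labels score low.
docs = [(['ball', 'goal', 'team'], 'sports'),
        (['goal', 'score', 'team'], 'sports'),
        (['vote', 'ballot', 'team'], 'politics'),
        (['vote', 'election', 'poll'], 'politics')]

def chi_square(term, label):
    # 2x2 contingency table: term presence versus document label
    a = sum(1 for words, l in docs if term in words and l == label)
    b = sum(1 for words, l in docs if term in words and l != label)
    c = sum(1 for words, l in docs if term not in words and l == label)
    d = sum(1 for words, l in docs if term not in words and l != label)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

vocabulary = set(w for words, _ in docs for w in words)
labels = set(l for _, l in docs)
scores = {t: max(chi_square(t, l) for l in labels) for t in vocabulary}

# keep the terms most correlated with some class label (cutoff chosen arbitrarily)
selected = [t for t, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]]
print(selected)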
To recap, here are the advantages and disadvantages of using decision trees versus vector space model.
 
Decision tree

Advantages:
- Useful for structured data.
- Applicable to hierarchical categorical distinctions.
- Automated tree induction is straightforward.
- Accuracy at each level can be determined before growing the tree.
- The tree can be printed out and interpreted.

Disadvantages:
- The amount of training data is reduced as we descend the decision stumps.
- Features are forced to be checked in a specific order, even if they act relatively independently.
- Doesn't work well with weak predictors of the correct label.

Vector space model

Advantages:
- Works with variable-length sets of attribute-value pairs.
- Allows a quantitative representation of data points.
- Allows all features to act in parallel.
- No ordering of features is required.
- Overcomes weak predictors with feature selection.

Disadvantages:
- Disregards class distributions.
- May assign large weights to rare items.
- Can be extremely slow.

For example, the following code attempts to find the topic of the provided text as a single keyword.

import nltk
import nltk.corpus
from nltk.text import TextCollection
from nltk import cluster
from numpy import array

# method to get the topic of a given text as a single keyword
def getTopic(text):
    # clean input: lowercase, tokenize, strip punctuation and drop stop words
    stop = set(open('stopwords.txt').read().split())
    src = [w.strip(" .,?!") for w in nltk.word_tokenize(text.lower()) if w not in stop]

    # candidate keywords: the ten most frequent terms longer than three characters
    freq = nltk.FreqDist(w for w in src if len(w) > 3)
    candidates = [w for w, _ in freq.most_common(10)]

    # score each candidate by its tf-idf, with the Brown corpus providing the idf statistics
    brown = TextCollection(nltk.corpus.brown)
    scores = [(w, brown.tf_idf(w, src)) for w in candidates]
    vectors = [array([s]) for _, s in scores]

    # initialize the clusterer; a small number of means keeps the "largest cluster" meaningful
    clusterer = cluster.KMeansClusterer(2, cluster.euclidean_distance, avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, True)

    # pick the candidate closest to the center of the largest cluster
    largest = max(set(assignments), key=assignments.count)
    centroid = clusterer.means()[largest]
    members = [i for i in range(len(scores)) if assignments[i] == largest]
    best = min(members, key=lambda i: abs(vectors[i][0] - centroid[0]))
    return scores[best][0]
 

Tuesday, June 25, 2013

Keyword-based association analysis: Such analysis collects sets of keywords or terms that occur together and finds correlations between them. Each document is considered a transaction, and the set of keywords in that document is considered the set of items in the transaction. The association mining process can detect compound associations, such as domain-dependent terms or phrases, and non-compound associations, such as units of measure. These are called term-level association mining, as opposed to mining on individual words, and consist of bigrams and trigrams. Terms and phrases are automatically tagged, and the number of meaningless results is greatly reduced. With such term and phrase recognition, standard association mining or max pattern mining can be invoked.
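As a small illustration of the transaction view described above, the sketch below counts keyword pairs that co-occur within documents and keeps the pairs that meet a minimum support. The documents and the support threshold are made up for the example.

from itertools import combinations
from collections import Counter

# Toy "transactions": each document reduced to its set of keywords.
documents = [
    {'stanford', 'university', 'palo', 'alto'},
    {'stanford', 'university', 'admission'},
    {'palo', 'alto', 'housing'},
    {'stanford', 'admission', 'deadline'},
]

min_support = 2  # a pair must occur in at least this many documents

# count co-occurring keyword pairs across all documents (term-level bigrams)
pair_counts = Counter()
for keywords in documents:
    pair_counts.update(combinations(sorted(keywords), 2))

associations = {pair: count for pair, count in pair_counts.items()
                if count >= min_support}
print(associations)   # e.g. ('alto', 'palo') and ('stanford', 'university') with support 2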
Document classification analysis: Automated document classification assigns labels to documents for easy lookup and retrieval. It is used for tagging topics, constructing topic directories, identifying document writing styles, and grouping the hyperlinks associated with a set of documents. Document classification is automated in the following way: first, a set of pre-classified documents is used as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process, after which it can be used for classifying other documents. This is different from classifying relational data, which is structured: each tuple in the relational data is defined by a set of attribute-value pairs, and the classification analysis decides which attribute-value pair is most discriminating. Document databases, on the other hand, contain a set of keywords per document and have no fixed set of attributes, and treating each keyword as a dimension results in a very large dimension set. Decision tree analysis is therefore not well suited to document classification.
Techniques commonly used for document classification include nearest neighbor classification, feature selection methods, Bayesian classification, support vector machines, and association based classification.
The k-nearest neighbor classifier assumes that similar documents have similar document vectors and are therefore expected to be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. For a given test document, we retrieve the k most similar training documents and use their labels. To refine the label selection, we can tune k or use weights associated with the documents. Furthermore, a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
Bayesian classification is another popular technique that models the statistical distribution of documents in specific classes. The classifier first trains a model by estimating, for each class c, the distribution of documents d belonging to that class, and then tests which class is most likely to have generated the test document.
Another classification method is the support vector machine. Here the classes are represented by numbers and a direct mapping function from the term space to the class variable is constructed. The least-squares linear regression method can be used to discriminate this classification.
Association-based classification classifies documents based on a set of associated, frequently occurring text patterns. However, frequent terms are usually poor discriminators, and the not-so-frequent ones have better discriminative power. Such an association-based classification method proceeds as follows. First, we extract keywords and terms using the information retrieval and simple association analysis techniques mentioned above. Second, we use a concept hierarchy with term classes, such as WordNet, or expert knowledge or keyword classification systems; documents in the training set can also be classified into class hierarchies. Term association mining can then be applied to discover sets of associated terms that maximally distinguish one class of documents from another. This derives a set of association rules associated with each document class, which can be ordered by their discriminative power and occurrence frequency and then used to classify new documents.
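A minimal sketch of the k-nearest neighbor approach follows, assuming a tiny labeled training set and plain term-frequency vectors with cosine similarity; a real system would use tf-idf weights and an index rather than a linear scan.

from collections import Counter
import math

# Toy training set: (bag of words, class label); a real system would index these.
training = [
    (['goal', 'team', 'score', 'match'], 'sports'),
    (['team', 'coach', 'season'], 'sports'),
    (['election', 'vote', 'ballot'], 'politics'),
    (['senate', 'vote', 'bill'], 'politics'),
]

def tf_vector(words):
    # plain term-frequency vector as a sparse dict
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(words, k=3):
    query = tf_vector(words)
    neighbors = sorted(training,
                       key=lambda item: cosine(query, tf_vector(item[0])),
                       reverse=True)[:k]
    # majority label among the k most similar training documents
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_classify(['vote', 'ballot', 'recount']))   # expected: politics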

Monday, June 24, 2013

Steps for processing text:
Text cleaning: clean the data, remove stop words, resolve inconsistencies, and stem the terms.
Keywords are extracted based on their conditional probability distributions.
Text is usually analyzed for topics by clustering with k-means using a cosine-based distance measure, as sketched below.
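A minimal sketch of that clustering step, using NLTK's k-means clusterer with cosine distance over toy term-count vectors; the vectors and the number of clusters are arbitrary choices for the example.

from nltk.cluster import KMeansClusterer, cosine_distance
from numpy import array

# Toy document vectors over the term order [goal, team, vote, ballot];
# a real pipeline would build tf-idf vectors from the cleaned, stemmed text.
vectors = [
    array([3.0, 2.0, 0.0, 0.0]),
    array([2.0, 3.0, 1.0, 0.0]),
    array([0.0, 0.0, 3.0, 2.0]),
    array([0.0, 1.0, 2.0, 3.0]),
]

# two topics, cosine distance, repeated restarts for a more stable result
clusterer = KMeansClusterer(2, cosine_distance, repeats=5, avoid_empty_clusters=True)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)   # e.g. [0, 0, 1, 1]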
Another data mining tool is clustering for high-dimensional data. For example, text documents may contain thousands of terms or keywords as features. Document classification differs from topic analysis in that documents are high-dimensional: if each term is treated as a dimension, there are as many features as keywords. Text documents are therefore clustered on the frequent terms they contain. Terms are extracted and stemmed to reduce each term to its basic stem, which reduces a document to a bag of words. With a frequent-term-based analysis, a well selected subset of the frequent term sets can define the clustering: a frequent-term-based cluster is the set of documents containing all of the terms of a frequent term set, and a good subset of the frequent term sets is one that covers all of the documents. Cluster overlap is measured by the distribution of documents supporting some clusters over the remaining cluster candidates. Clusters are automatically described by their frequent term sets.
Information retrieval is concerned with the organization and retrieval of information from a large number of documents, usually with measures such as term frequency-inverse document frequency (tf-idf) and Shannon information. Two documents are considered similar if they have a high cosine measure. Unlike database systems, which address concurrency control, recovery, transaction management, and updates, information retrieval is concerned with keywords and the notion of relevance.
Popular text indexing techniques include inverted indices and signature files. An inverted index comprises two indexed tables (hash or B+-tree): a document_table and a term_table. The document table contains, for each document, the document_id and its list of terms, also known as the posting list. The term table contains, for each term, the term_id and the list of document identifiers in which the term appears. The list of terms can be quite large, so a hashing technique and a superimposed coding technique can be used to encode the list of terms into a bit representation.
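A toy in-memory version of such an inverted index, with plain dictionaries standing in for the hash or B+-tree indexed tables:

# A minimal inverted index sketch: a document table mapping each document id to
# its posting list of terms, and a term table mapping each term to the ids of
# the documents that contain it.
documents = {
    1: "data mining uncovers patterns in large data sets",
    2: "information retrieval ranks documents by relevance",
    3: "text mining applies data mining to documents",
}

document_table = {}   # document_id -> posting list of terms
term_table = {}       # term -> list of document_ids containing the term

for doc_id, text in documents.items():
    terms = text.lower().split()
    document_table[doc_id] = terms
    for term in set(terms):
        term_table.setdefault(term, []).append(doc_id)

print(term_table['mining'])      # documents containing "mining": [1, 3]
print(document_table[2])         # posting list for document 2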
A signature file stores a signature record for each document in the database. Signatures are usually b bits long, initialized to zero, and a bit is set to 1 if the corresponding term appears in the document.
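A toy signature file along the same lines, where each term hashes to a few bit positions of a b-bit signature; the width and the number of bits per term are arbitrary for the example, and matches may include false positives that must be verified against the document.

# A toy b-bit signature file: each document gets a bit vector, and each term
# hashes to a few bit positions that get set to 1. A query term can then be
# checked against a signature with a bitwise AND.
b = 64            # signature width in bits (illustrative)
k = 3             # bits set per term (illustrative)

def term_mask(term):
    mask = 0
    for i in range(k):
        mask |= 1 << (hash((term, i)) % b)
    return mask

def signature(text):
    sig = 0
    for term in text.lower().split():
        sig |= term_mask(term)
    return sig

docs = {1: "data mining and warehousing", 2: "information retrieval systems"}
signatures = {doc_id: signature(text) for doc_id, text in docs.items()}

query = term_mask("mining")
matches = [doc_id for doc_id, sig in signatures.items() if sig & query == query]
print(matches)    # candidate documents that may contain "mining"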
To reduce the high dimensionality of documents, reduction techniques such as latent semantic indexing, probabilistic latent semantic indexing, and locality preserving indexing can be used.
Latent semantic indexing (LSI) decomposes the term-document matrix using singular value decomposition, i.e. it extracts the most representative features while minimizing the reconstruction error. If the rank of the term-document matrix X is r, then LSI decomposes X as X = U Σ V^T, where Σ is the diagonal matrix of the singular values of X. LSI uses the first k singular vectors to embed the documents in a k-dimensional subspace.
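A minimal LSI sketch with numpy, assuming a small hand-made term-document matrix and keeping the first k singular vectors:

import numpy as np

# Toy term-document matrix X (rows: terms, columns: documents).
X = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 2.],
              [0., 1., 0., 1.]])

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U * diag(sigma) * Vt

k = 2
# each column is a document represented in the k-dimensional latent subspace
doc_embeddings = np.diag(sigma[:k]) @ Vt[:k, :]
print(doc_embeddings)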
Locality preserving indexing (LPI) extracts the most discriminative features by preserving the locality of documents in the reduced-dimensionality space. Locality implies semantic closeness. LPI uses a minimization function with constraints.
While LSI seeks to uncover the most representative features, LPI aims to discover the geometrical structure of the document space.
Probabilistic latent semantic indexing (pLSI) is similar to LSI but achieves the reduction through a probabilistic mixture model. Specifically, there are k latent common themes in the document collection, and each is characterized by a multinomial word distribution.
There are other approaches to text mining in addition to the keyword-based approach: 1) the tagging approach and 2) the information extraction approach. Tagging may be manual or done by an automated categorization algorithm where the tag set is small and predefined. Information extraction goes deeper in that it requires semantic analysis of the text by natural language understanding and machine learning methods. Other text mining tasks include document clustering, classification, information extraction, association analysis, and trend analysis.
A self-organizing feature map (SOM) is a neural-network method for cluster analysis. A neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks are popular for clustering because 1) they are inherently parallel and distributed processing architectures, 2) they learn by adjusting their interconnection weights so as to best fit the data, and in doing so they normalize the patterns and act as feature extractors for the various clusters, and 3) they process numerical vectors and require object patterns to be represented by quantitative features.
Each cluster is represented by an exemplar, i.e. a prototype that does not have to match an actual data example. Data points are assigned to the cluster whose exemplar is most similar to them, based on a distance measure. Their attributes can then be predicted from the attributes of the exemplar.
Self-organizing feature maps represent all points in a high-dimensional source space by points in a 2-D or 3-D target space, such that distance and proximity relationships are maintained as much as possible. The method is useful when the problem is inherently non-linear.
SOMs can be viewed as a constrained version of k-means clustering in which the cluster centers lie in a low-dimensional space.
Clustering is performed with several units competing for the current object: the unit whose weight vector is closest to the current object becomes the winning unit, and the weights of this unit and of its neighbors are adjusted to move closer to the input object. The assumption is that there is some topology or ordering in the input space and that the units will eventually take on this structure, called a feature map. This kind of processing has been applied to web mining, but it is costly for large databases.
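A minimal sketch of one such competitive update step, assuming a small 2-D grid of units with randomly initialized weights; the grid size, learning rate, and neighborhood radius are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)

grid_w, grid_h, dim = 5, 5, 3            # 5x5 map of units, 3-dimensional inputs
weights = rng.random((grid_w, grid_h, dim))

def som_update(x, learning_rate=0.5, radius=1.0):
    # winning unit: the one whose weight vector is closest to the input object
    distances = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(distances), distances.shape)

    # pull the winner and its grid neighbors toward the input object
    for i in range(grid_w):
        for j in range(grid_h):
            grid_dist = np.hypot(i - wi, j - wj)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights[i, j] += learning_rate * influence * (x - weights[i, j])

# one training step with a random input vector
som_update(rng.random(3))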

Sunday, June 23, 2013

A closer look at decision tree induction
A decision tree can be built from training data using the following kind of algorithm. The non-leaf nodes denote tests on attributes and the leaf nodes denote class labels. The attribute values of each tuple are evaluated against these tests before a class label is assigned to it. Decision trees can be applied to high-dimensional data because multiple attributes can be added to the tree.
Tuples in the training data are class-labeled, and these are used to build the decision tree. Attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. The data partition, the set of candidate attributes, and the attribute selection method are given as input.
Generate_Decision_Tree(D, attribute_list):
First we create a node N.
If the tuples in D are all of the same class C, then return N as a leaf node labeled with class C.
If attribute_list is empty, then return N as a leaf node labeled with the majority class in D.
Apply attribute_selection_method(D, attribute_list) to find the best splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the best way to partition the tuples, and which branches to grow from node N with respect to the outcomes of the chosen test. The partitions are kept as pure as possible; a partition is pure if all of the tuples in it belong to the same class.
Label node N with the splitting criterion.
A branch is grown from node N for each outcome of the splitting criterion, and the tuples in D are partitioned accordingly.
If the splitting attribute is discrete-valued and multiway splits are allowed, then remove the splitting attribute from attribute_list.
foreach outcome j of the splitting criterion
  let Dj be the set of data tuples in D satisfying outcome j
  if Dj is empty, then attach a leaf labeled with the majority class in D to node N
  else attach the node returned by Generate_Decision_Tree(Dj, attribute_list) to node N
end for
return N
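A minimal Python sketch of this induction algorithm, assuming training tuples are dictionaries of attribute-value pairs plus a 'label' key and using information gain (via weighted entropy) as the attribute selection method; the toy data at the end is invented for the example.

from collections import Counter
from math import log2

def entropy(tuples):
    counts = Counter(t['label'] for t in tuples)
    total = len(tuples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def best_split(tuples, attributes):
    # choose the attribute whose split yields the lowest weighted entropy
    def weighted_entropy(attr):
        values = set(t[attr] for t in tuples)
        return sum(entropy([t for t in tuples if t[attr] == v]) *
                   len([t for t in tuples if t[attr] == v]) / len(tuples)
                   for v in values)
    return min(attributes, key=weighted_entropy)

def generate_decision_tree(tuples, attributes):
    labels = [t['label'] for t in tuples]
    if len(set(labels)) == 1:                      # all tuples share one class
        return labels[0]
    if not attributes:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_split(tuples, attributes)
    node = {'attribute': attr, 'branches': {}}
    remaining = [a for a in attributes if a != attr]
    for value in set(t[attr] for t in tuples):
        dj = [t for t in tuples if t[attr] == value]
        node['branches'][value] = generate_decision_tree(dj, remaining)
    return node

# usage: a toy training set (names are illustrative only)
data = [
    {'outlook': 'sunny', 'windy': 'no', 'label': 'play'},
    {'outlook': 'sunny', 'windy': 'yes', 'label': 'stay'},
    {'outlook': 'rainy', 'windy': 'yes', 'label': 'stay'},
    {'outlook': 'rainy', 'windy': 'no', 'label': 'play'},
]
print(generate_decision_tree(data, ['outlook', 'windy']))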
 
Microsoft OLE DB for data mining :
OLE DB for Data Mining standardized data mining language primitives and became an industry standard. Prior to it, it was difficult to integrate data mining products: if one product was written using decision tree classifiers and another was written with support vector machines and they did not share a common interface, the application had to be rebuilt from scratch. Furthermore, the data that these products analyzed was not always in a relational database, which required data porting and transformation operations.
OLE DB for DM consolidates all of this. It was designed to allow data mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL.
The OLE DB for Data Mining stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to different DM providers. DMX statements can be used to create, modify, and work with data mining models, and DMX also contains several functions for retrieving statistical information. Furthermore, the data, and not just the interface, is unified: OLE DB lets the data mining providers work against data stores such as a cube, a relational database, or miscellaneous other data sources.
The three main operations performed are model creation, model training and model prediction and browsing.
Model creation: A data mining model object is created just like a relational table. The model has a few input columns and one or more predictable columns, and names the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: The data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age, and purchases. The purchases are associations between item_name and item_quantity, and a customer may make more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev, and support.
Model training involves loading the data into the model. The OPENROWSET statement supports querying data from a data source through an OLE DB provider, and the SHAPE command enables loading of nested data.
 

Saturday, June 22, 2013

Interview questions on SQL Server:
1) What are the different isolation levels
2) What is the difference between repeatable read and read committed ?
3) What does the statement select from update do ?
4) What is the difference between Where and Having clauses ?
5) What are the differences between delete and truncate ?
6) What is key lookup in query plan ?
7) What is the statement to get query execution plan ?
8) What are the three parameters used to find a bad query ?
9) What is the difference between clustered index and non-clustered index
10) What is SET NOCOUNT ON ? What is the count after three DML statements ?
11) What is collation and case sensitivity ?
12) How do you handle errors in stored procedures ?
13) What is the statement to create a table by copying the schema of  another table ?
Interview questions on WCF, ASP.Net, NHibernate or MVC:
1) What is the difference between NHibernate and EntityFramework ?
2) If hbm files are not loaded, how do you include them ?
3) How do you define transactions for service calls ?
4) What is the transaction scoped to ?
5) What is MVC and how is it used ?
6) How are ASP.Net pages different from MVC pages ?
7) What is the difference between the postback call in ASP.Net and MVC ?
8) What are the other characteristics of ASP.Net ?