Monday, June 24, 2013

Steps for processing text:
Text cleaning: clean the data, remove stop words, resolve inconsistencies, and stem the terms to their root forms.
Keyword extraction: keywords are extracted based on their conditional probability distributions.
Topic analysis: text is usually analyzed for topics by clustering, for example with k-means using a cosine-based distance measure (a minimal sketch follows).
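The sketch below is not from the post; it just illustrates the assignment step of k-means with cosine similarity, using made-up term-weight vectors and two hypothetical centroids.

using System;

class CosineAssignment
{
    // Cosine similarity between two term-weight vectors of equal length.
    static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    static void Main()
    {
        // Hypothetical document vector and two cluster centroids over the same terms.
        double[] doc = { 0.9, 0.1, 0.0, 0.4 };
        double[][] centroids =
        {
            new double[] { 1.0, 0.0, 0.1, 0.5 },
            new double[] { 0.0, 0.8, 0.9, 0.1 }
        };

        // k-means assignment step: pick the centroid with the highest cosine similarity
        // (equivalently, the smallest cosine distance 1 - similarity).
        int best = 0;
        double bestSim = double.MinValue;
        for (int k = 0; k < centroids.Length; k++)
        {
            double sim = Cosine(doc, centroids[k]);
            if (sim > bestSim) { bestSim = sim; best = k; }
        }
        Console.WriteLine($"Document assigned to cluster {best} (similarity {bestSim:F3})");
    }
}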
Another data mining technique is clustering of high-dimensional data. Text documents, for example, may contain thousands of terms or keywords as features. Document classification differs from topic analysis in that the document itself is high-dimensional, with as many features as keywords. Text documents are clustered on the frequent terms they contain. Terms are extracted and stemmed to reduce each term to its basic stem, which reduces the document to a bag of words; if each term is treated as a dimension, the document becomes high-dimensional. In frequent-term-based analysis, a well-selected subset of the frequent term sets defines the clustering: a frequent-term-based cluster is the set of documents that contain all of the terms of a given frequent term set. A good subset of all the frequent term sets is one that covers all of the documents. Cluster overlap is measured by the distribution of the documents supporting the already-chosen clusters over the remaining cluster candidates, and each cluster is automatically described by its frequent term set. A small sketch of finding the documents that support a frequent term set follows.
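This is a minimal sketch, not taken from the post: a hypothetical three-document corpus, naive whitespace tokenization and no real stemming. It builds a bag of words per document and then finds the documents supporting one candidate term set, i.e. the cluster that term set defines.

using System;
using System.Collections.Generic;
using System.Linq;

class FrequentTermCluster
{
    static void Main()
    {
        // Hypothetical documents; a real pipeline would clean, remove stop words and stem first.
        string[] docs =
        {
            "data mining finds patterns in data",
            "text mining clusters documents by frequent terms",
            "frequent term sets describe document clusters"
        };

        // Bag of words per document: term -> frequency.
        var bags = docs
            .Select(d => d.Split(' ')
                          .GroupBy(t => t)
                          .ToDictionary(g => g.Key, g => g.Count()))
            .ToArray();

        // A candidate frequent term set; the cluster it defines is the set of
        // documents containing every term in the set.
        var termSet = new[] { "frequent", "clusters" };
        var supportingDocs = Enumerable.Range(0, bags.Length)
            .Where(i => termSet.All(t => bags[i].ContainsKey(t)))
            .ToList();

        Console.WriteLine("Documents supporting {" + string.Join(", ", termSet) + "}: " +
                          string.Join(", ", supportingDocs));
    }
}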
Information retrieval is concerned with the organization and retrieval of information from a large number of documents, usually with metrics such as term frequency-inverse document frequency (TF-IDF) and Shannon information. Two documents are considered to be similar if they have a high cosine measure. Unlike database systems, which address concurrency control, recovery, transaction management and updates, information retrieval is concerned with keywords and the notion of relevance.
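The TF-IDF weight mentioned above rewards terms that are frequent in a document but rare across the corpus, e.g. tf(t, d) * log(N / df(t)). A minimal sketch with a hypothetical, already tokenized three-document corpus:

using System;
using System.Linq;

class TfIdf
{
    static void Main()
    {
        // Hypothetical corpus, already tokenized.
        string[][] docs =
        {
            new[] { "data", "mining", "data" },
            new[] { "text", "mining" },
            new[] { "data", "warehouse" }
        };
        int N = docs.Length;
        string term = "data";
        int docIndex = 0;

        // Term frequency of 'term' in docs[docIndex].
        int tf = docs[docIndex].Count(t => t == term);

        // Document frequency: number of documents containing the term.
        int df = docs.Count(d => d.Contains(term));

        // High when the term is frequent in this document but rare in the corpus.
        double tfidf = tf * Math.Log((double)N / df);
        Console.WriteLine($"tf-idf('{term}', doc {docIndex}) = {tfidf:F3}");
    }
}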
Popular text indexing techniques include inverted indices and signature files. An inverted index consists of two indexed tables (hash or B+-tree): a document_table and a term_table. The document_table holds each document_id and the list of terms for that document (its posting list). The term_table holds each term_id and the list of document identifiers in which the term appears. The list of terms can be quite large, so a hashing technique or a superimposed coding technique can be used to encode the list of terms into a bit representation.
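A minimal in-memory sketch of the two tables, with plain dictionaries standing in for the hashed or B+-tree indexes and a made-up two-document corpus:

using System;
using System.Collections.Generic;

class InvertedIndex
{
    static void Main()
    {
        // Hypothetical corpus: document_id -> raw text.
        var corpus = new Dictionary<int, string>
        {
            { 1, "data mining finds patterns" },
            { 2, "text mining indexes terms" }
        };

        // document_table: document_id -> list of terms it contains.
        var documentTable = new Dictionary<int, List<string>>();
        // term_table: term -> list of document_ids in which the term appears.
        var termTable = new Dictionary<string, List<int>>();

        foreach (var kv in corpus)
        {
            var terms = new List<string>(kv.Value.Split(' '));
            documentTable[kv.Key] = terms;
            foreach (var term in terms)
            {
                if (!termTable.TryGetValue(term, out var docIds))
                    termTable[term] = docIds = new List<int>();
                if (!docIds.Contains(kv.Key))
                    docIds.Add(kv.Key);
            }
        }

        // Lookup: which documents contain "mining"?
        Console.WriteLine(string.Join(", ", termTable["mining"]));  // 1, 2
    }
}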
A signature file stores a signature record for each document in the database. Signatures are usually b bits long and initialized to zero; a bit is set to 1 if the corresponding term appears in the document. A minimal sketch follows.
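This sketch is illustrative only: b = 64 and the one-bit-per-term hashing scheme are arbitrary choices (real schemes often set several bits per term), and false positives are possible because different terms may hash to the same bit.

using System;

class SignatureFile
{
    const int B = 64;  // signature length in bits, arbitrary for this sketch

    // Map a term to one of B bit positions.
    static int BitFor(string term) => (term.GetHashCode() & 0x7fffffff) % B;

    static ulong Signature(string[] terms)
    {
        ulong sig = 0;                    // all bits start at 0
        foreach (var t in terms)
            sig |= 1UL << BitFor(t);      // set the bit for each term in the document
        return sig;
    }

    static void Main()
    {
        ulong docSig = Signature(new[] { "data", "mining", "patterns" });

        // A document *may* contain the query term if the term's bit is set;
        // a 0 bit definitely rules it out.
        string query = "mining";
        bool candidate = (docSig & (1UL << BitFor(query))) != 0;
        Console.WriteLine($"'{query}' candidate match: {candidate}");
    }
}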
To reduce the high dimensionality of documents, reduction techniques such as latent semantic indexing (LSI), probabilistic latent semantic indexing and locality preserving indexing can be used.
Latent semantic indexing decomposes the term-document matrix using singular value decomposition, i.e. it extracts the most representative features while minimizing the reconstruction error. If the rank of the term-document matrix X is r, then LSI decomposes X as X = U Σ V^T, where Σ is the diagonal matrix of the singular values of X. LSI then uses the first k singular vectors to embed the documents in a k-dimensional subspace.
Locality preserving indexing (LPI) extracts the most discriminative features by preserving the locality of documents in the reduced-dimensionality space; locality here implies semantic closeness. LPI solves a minimization problem with constraints.
While LSI seeks to uncover the most representative features, LPI aims to discover the geometrical structure of the document space.
Probabilistic latent semantic indexing (PLSI) is similar to LSI but achieves the reduction through a probabilistic mixture model. Specifically, there are k latent common themes in the document collection, each characterized by a multinomial word distribution, and each document is modeled as a mixture of these themes: P(d, w) = P(d) * sum over themes z of P(z | d) P(w | z).
There are other approaches to text mining in addition to the keyword-based approach: 1) the tagging approach and 2) the information extraction approach. Tagging may be manual or done by an automated categorization algorithm when the tag set is small and predefined. Information extraction goes deeper in that it requires semantic analysis of text using natural language understanding and machine learning methods. Other text mining tasks include document clustering, classification, information extraction, association analysis, and trend analysis.
Self-organizing feature maps (SOMs) are a neural network method for cluster analysis. A neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks are popular for clustering because 1) they are inherently parallel and distributed processing architectures, 2) they learn by adjusting their interconnection weights to best fit the data, in the process normalizing the patterns and acting as feature extractors for the various clusters, and 3) they process numerical vectors and require object patterns to be represented by quantitative features.
Each cluster is represented by an exemplar, that is, a prototype that does not have to match an actual data example. Data points are assigned to the cluster whose exemplar is most similar according to a distance measure, and their attributes can then be predicted from the attributes of the exemplar.
Self-organizing feature maps represent all points in a high-dimensional source space by points in a 2-D or 3-D target space, such that distance and closeness relationships are preserved as far as possible. The method is useful when the problem is inherently non-linear.
A SOM can be viewed as a constrained version of k-means clustering in which the cluster centers lie in a low-dimensional space.
Clustering is performed by having several units compete for the current object: the unit whose weight vector is closest to the current object becomes the winning unit, and the weights of this unit and its neighbors are adjusted to move closer to the input object. The assumption is that there is some topology or ordering in the input and that the units eventually take on that shape, producing a feature map. Such processing has been applied to web mining but is costly for large databases. A minimal sketch of one update step follows.
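This sketch makes several simplifying assumptions not stated in the post: a tiny 1-D map of five units, a fixed learning rate and neighborhood radius (in practice both decay over time), and random initial weights. It shows one competition-and-adaptation step.

using System;

class SomStep
{
    static void Main()
    {
        var rnd = new Random(42);
        int units = 5, dim = 3;
        double learningRate = 0.3;   // fixed for this sketch; usually decays over time
        int radius = 1;              // neighborhood radius on the 1-D map

        // Weight vector per map unit, randomly initialized.
        var weights = new double[units][];
        for (int u = 0; u < units; u++)
        {
            weights[u] = new double[dim];
            for (int d = 0; d < dim; d++) weights[u][d] = rnd.NextDouble();
        }

        double[] input = { 0.2, 0.9, 0.1 };   // the current object

        // 1) Competition: the unit whose weight vector is closest wins.
        int winner = 0;
        double bestDist = double.MaxValue;
        for (int u = 0; u < units; u++)
        {
            double dist = 0;
            for (int d = 0; d < dim; d++)
                dist += (weights[u][d] - input[d]) * (weights[u][d] - input[d]);
            if (dist < bestDist) { bestDist = dist; winner = u; }
        }

        // 2) Adaptation: move the winner and its map neighbors toward the input.
        for (int u = Math.Max(0, winner - radius); u <= Math.Min(units - 1, winner + radius); u++)
            for (int d = 0; d < dim; d++)
                weights[u][d] += learningRate * (input[d] - weights[u][d]);

        Console.WriteLine($"Winning unit: {winner}");
    }
}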

Sunday, June 23, 2013

A closer look at decision tree induction
A decision tree can be built from training data using the following kind of algorithm. The non-leaf nodes denote a test on an attribute and the leaf nodes denote class labels. The attribute values of each tuple are evaluated against these tests before a class label is assigned to it. Decision trees can be applied to high-dimensional data because multiple attributes can participate in the tree.
Tuples from the training data are class-labeled, and these are used to build the decision tree. An attribute selection measure is used to select the attribute that best partitions the tuples into distinct classes. The inputs are the data partition D of class-labeled tuples, the set of candidate attributes, and the attribute selection method that best partitions the data tuples into individual classes.
Generate_Decision_Tree(D, attribute_list):
First we create a node N.
If the tuples in D are all of the same class C, then return N as a leaf node labeled with class C.
If attribute_list is empty, then return N as a leaf node labeled with the majority class in D.
Apply Attribute_selection_method(D, attribute_list) to find the best splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the best way to partition the tuples, and which branches to grow from node N with respect to the outcomes of the chosen test. The aim is to keep the partitions as pure as possible; a partition is pure if all of the tuples in it belong to the same class.
Label the node N with the splitting criterion.
A branch is grown from node N for each outcome of the splitting criterion, and the tuples in D are partitioned accordingly.
If the splitting attribute is discrete-valued and multiway splits are allowed, then remove the splitting attribute from attribute_list.
foreach outcome j of the splitting criterion
  let Dj be the set of data tuples in D satisfying outcome j
  if Dj is empty then attach a leaf labeled with the majority class in D to node N;
  else attach the node returned by Generate_Decision_Tree(Dj, attribute_list) to node N;
end for
return N
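One common Attribute_selection_method is information gain (used, for example, by ID3): choose the attribute whose split yields the largest drop in entropy. This is a minimal sketch over a hypothetical two-class data set, not the post's own code.

using System;
using System.Collections.Generic;
using System.Linq;

class InformationGain
{
    // Entropy of a class-label distribution: -sum p * log2(p).
    static double Entropy(IEnumerable<string> labels)
    {
        var groups = labels.GroupBy(l => l).ToList();
        double n = groups.Sum(g => g.Count());
        return -groups.Sum(g =>
        {
            double p = g.Count() / n;
            return p * Math.Log(p, 2);
        });
    }

    static void Main()
    {
        // Hypothetical tuples: (attribute value, class label).
        var tuples = new[]
        {
            ("youth", "no"), ("youth", "no"), ("middle", "yes"),
            ("senior", "yes"), ("senior", "no"), ("middle", "yes")
        };

        // Entropy of D before splitting.
        double baseEntropy = Entropy(tuples.Select(t => t.Item2));

        // Expected entropy after splitting on the attribute, weighted by partition size.
        double splitEntropy = tuples
            .GroupBy(t => t.Item1)
            .Sum(g => (double)g.Count() / tuples.Length * Entropy(g.Select(t => t.Item2)));

        Console.WriteLine($"Information gain = {baseEntropy - splitEntropy:F3}");
    }
}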
 
Microsoft OLE DB for data mining :
OLE DB for Data Mining standardized data mining language primitives and became an industry standard. Prior to it, it was difficult to integrate data mining products: if one product was written using decision tree classifiers and another with support vector machines and the two had no common interface, the application had to be rebuilt from scratch. Furthermore, the data these products analyzed was not always in a relational database, which required data porting and transformation operations.
OLE DB for DM consolidates all of this. It was designed to allow data mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL.
The OLE DB for Data Mining stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to the different DM providers. DMX statements can be used to create, modify and work with data mining models, and DMX also contains several functions for retrieving statistical information. Furthermore, not only the interface but also the data is unified: through OLE DB, the data mining providers can consume data from data stores such as a cube, a relational database, or miscellaneous other data sources.
The three main operations performed are model creation, model training and model prediction and browsing.
Model creation: a data mining model object is created just like a relational table. The model has a few input columns and one or more predictable columns, and names the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: the data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during creation to search for patterns; these patterns are the model content.
Model prediction and browsing: a SELECT statement is used to consult the data mining model content in order to make predictions and browse the statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age and purchases, where the purchases are associations between item_name and item_quantity and a customer may have made more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev and support.
Model training involves loading the data into the model. The OPENROWSET statement supports querying data from a data source through an OLE DB provider, and the SHAPE command enables loading of nested data. A hedged sketch of what issuing DMX from a client might look like follows.
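This is a sketch only, not the post's example: it assumes the Analysis Services OLE DB provider (MSOLAP) is available, and the connection string, model name and columns are hypothetical. It sends a DMX CREATE MINING MODEL statement, mirroring the nested-table customer/purchases model described above; training and prediction would be sent the same way as further command texts.

using System.Data.OleDb;

class DmxClientSketch
{
    static void Main()
    {
        // Hypothetical connection to an OLE DB for Data Mining provider.
        var connectionString = "Provider=MSOLAP;Data Source=localhost;Initial Catalog=MiningDB";

        // DMX: the model is declared much like a relational table, with a nested
        // Purchases table marked predictable and the algorithm to use when it is trained later.
        var createModel = @"
            CREATE MINING MODEL CustomerPurchases
            (
                CustomerId   LONG KEY,
                Gender       TEXT DISCRETE,
                Age          LONG CONTINUOUS,
                Purchases    TABLE PREDICT
                (
                    ItemName     TEXT KEY,
                    ItemQuantity LONG CONTINUOUS
                )
            )
            USING Microsoft_Decision_Trees";

        using (var connection = new OleDbConnection(connectionString))
        {
            connection.Open();
            using (var command = new OleDbCommand(createModel, connection))
            {
                command.ExecuteNonQuery();
            }
        }
        // Training and prediction would follow with INSERT INTO CustomerPurchases ... OPENROWSET/SHAPE
        // and SELECT ... FROM CustomerPurchases PREDICTION JOIN ... statements.
    }
}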
 

Saturday, June 22, 2013

Interview questions on SQL Server:
1) What are the different isolation levels ?
2) What is the difference between repeatable read and read committed ?
3) What does the statement select from update do ?
4) What is the difference between Where and Having clauses ?
5) What are the differences between delete and truncate ?
6) What is key lookup in query plan ?
7) What is the statement to get query execution plan ?
8) What are the three parameters used to find a bad query ?
9) What is the difference between a clustered index and a non-clustered index ?
10) What is SET NOCOUNT ON ? What is the count after three DML statements ?
11) What is collation and case sensitivity ?
12) How do you handle errors in stored procedures ?
13) What is the statement to create a table by copying the schema of  another table ?
Interview questions on WCF, ASP.Net, NHibernate or MVC:
1) What is the difference between NHibernate and EntityFramework ?
2) If hbm files are not loaded, how do you include them ?
3) How do you define transactions for service calls ?
4) What is the transaction scoped to ?
5) What is MVC and how is it used ?
6) How are ASP.Net pages different from MVC pages ?
7) What is the difference between the postback behavior in ASP.Net and MVC ?
8) What are the other characteristics of ASP.Net ?


 
Applications and Trends in data mining:
Data mining tools have tended to be domain specific, such as in the finance, telecommunications and retail industries. These tools integrate domain-specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface.
Visual data mining is done with a designer-style user interface that renders the data, results and process in a graphical and usually interactive presentation. Audio data mining uses audio signals to indicate patterns or features.
Data analysis with statistical methods involves regression, generalized linear models, analysis of variance, mixed-effect models, factor analysis, discriminant analysis, time-series analysis, survival analysis, and quality control.
Data mining or statistical techniques can also be applied to recommendations and opinions of customers to search for similarities and rank products. Such systems are called collaborative recommender systems.
Data mining may be ubiquitous: it is applied in the way we shop, work, search for information, use our leisure time, and maintain our health and well-being. It may not always be visible, and it may participate behind the scenes in managing our e-mail or in web search engines. Such usage sometimes raises questions around privacy and data security, which fair information practices governing the use of such data attempt to address. On the other hand, data mining can help with counterterrorism. Solutions that balance these trade-offs try to obtain mining results without exposing the underlying data, to preserve privacy, and to encrypt data to preserve its security. Recent trends include standardization of data mining languages, visualization methods, and new methods for handling complex data types.
 

Given a vector of integers, find the longest consecutive sub-sequence of increasing numbers. If two sub-sequences have the same length, use the one that occurs first. An increasing sub-sequence must have a length of 2 or greater to qualify.
Example input:
[1 0 1 2 3 0 4 5]
Result:
[0 1 2 3]

#include <assert.h>
#include <stdio.h>

typedef unsigned int uint;

void GetLongestRun ( int* A, uint N )
{
     assert ( A != NULL && N > 0 );

     uint start = 0;            // start index of current candidate sequence
     uint end = 0;              // end index of current candidate sequence
     uint globalStart = 0;      // start index of the best (longest, earliest) run seen so far
     uint globalEnd = 0;        // end index of the best run seen so far
     uint i = 1;                // A[0] begins the first candidate run

     while ( i < N )
     {
        if ( A[i] > A[i-1] )    // strictly increasing extends the current run
        {
              end++;
        }
        else                    // run broken: start a new candidate at i
        {
              start = i;
              end = i;
        }

        // strict '>' keeps the earliest run when two runs have the same length
        if (end - start > globalEnd - globalStart)
        {
            globalStart = start;
            globalEnd = end;
        }

        i++;
     }

     // a qualifying run must have a length of 2 or greater
     if (globalEnd == globalStart) return;

     for (uint j = globalStart; j <= globalEnd; j++)
         printf ("%d ", A[j]);
}
A tic-tac-toe board is represented by a two dimensional vector. X is represented by :x, O is represented by :o, and empty is represented by :e. A player wins by placing three Xs or three Os in a horizontal, vertical, or diagonal row. Write a function which analyzes a tic-tac-toe board and returns :x if X has won, :o if O has won, and nil if neither player has won.
Example input:
[[:x :e :o]
[:x :e :e]
[:x :e :o]]
Result:
:x
// Cells hold 'x', 'o', or 'e' for empty, mirroring :x, :o and :e above.
// Returns 'x' if X has won, 'o' if O has won, and '\0' (nil) otherwise.
public char GetWinner(char[,] Board)
{
     var ret = '\0';

     // check horizontals
     for (int i = 0; i < 3; i++)
     {
          if (Board[i,0] != 'e' && Board[i,0] == Board[i,1] && Board[i,1] == Board[i,2]) return Board[i,0];
     }

     // check verticals
     for (int j = 0; j < 3; j++)
     {
          if (Board[0,j] != 'e' && Board[0,j] == Board[1,j] && Board[1,j] == Board[2,j]) return Board[0,j];
     }

     // check diagonals
     if (Board[0,0] != 'e' && Board[0,0] == Board[1,1] && Board[1,1] == Board[2,2]) return Board[0,0];
     if (Board[0,2] != 'e' && Board[0,2] == Board[1,1] && Board[1,1] == Board[2,0]) return Board[0,2];

     return ret;
}