Monday, June 24, 2013

Self-organizing feature maps (SOMs) are a neural network method for cluster analysis. A neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks are popular for clustering because 1) they are inherently parallel and distributed processing architectures; 2) they learn by adjusting their interconnection weights to best fit the data, which lets them normalize the patterns and act as feature extractors for the various clusters; and 3) they process numerical vectors, which requires object patterns to be represented by quantitative features.
Each cluster is represented by an exemplar, i.e. a prototype, which does not have to match any actual data example. Each data point is assigned to the cluster whose exemplar is most similar to it under a distance measure, and its attributes can then be predicted from the attributes of that exemplar.
Self-organizing feature maps represent all points in a high-dimensional source space by points in a 2-D or 3-D target space such that distance and proximity relationships are preserved as far as possible. The method is useful when the problem is inherently nonlinear.
SOMs can be viewed as a constrained version of k-means clustering, with the cluster centers restricted to a low-dimensional map space.
Clustering is performed by having several units compete for the current object: the unit whose weight vector is closest to the current object becomes the winning unit, and the weights of this unit and its neighbors are adjusted toward the input object. The assumption is that there is some topology or ordering in the input and that the units will eventually take on that shape, forming a feature map. Such processing has been applied to web document mining, but it is costly for large databases.
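A single training step might look like the following sketch (a minimal sketch, assuming a rectangular 2-D grid of units, a Gaussian neighborhood, and a fixed learning rate and radius; in practice both typically decay over time, and all names here are illustrative):

using System;

static class SomSketch
{
    // weights[r, c] holds the weight vector of the unit at map position (r, c)
    public static void TrainStep(double[,][] weights, double[] x,
                                 double learningRate, double radius)
    {
        int rows = weights.GetLength(0), cols = weights.GetLength(1);

        // Competition: the unit whose weight vector is closest to the
        // input object x becomes the winning unit.
        int winR = 0, winC = 0;
        double best = double.MaxValue;
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                double d2 = 0;
                for (int k = 0; k < x.Length; k++)
                {
                    double diff = weights[r, c][k] - x[k];
                    d2 += diff * diff;
                }
                if (d2 < best) { best = d2; winR = r; winC = c; }
            }

        // Adaptation: the winner and its neighbors on the map move toward x;
        // the Gaussian neighborhood makes units near the winner move more.
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
            {
                double mapDist2 = (r - winR) * (r - winR) + (c - winC) * (c - winC);
                double h = Math.Exp(-mapDist2 / (2 * radius * radius));
                for (int k = 0; k < x.Length; k++)
                    weights[r, c][k] += learningRate * h * (x[k] - weights[r, c][k]);
            }
    }
}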

Sunday, June 23, 2013

A closer look at decision tree induction
A decision tree can be built from training data with the following kind of algorithm. The non-leaf nodes denote tests on attributes and the leaf nodes denote class labels. The attribute values of each tuple are evaluated against these tests before a class label is assigned to it. Decision trees can be applied to high-dimensional data because multiple attributes can be added to the tree.
The tuples of the training data D are class-labeled, and these are used to build the decision tree. Attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. The algorithm takes as input the data tuples D, the set of candidate attributes, and the attribute selection method that best partitions the tuples into individual classes.
Generate_Decision_Tree(D, attribute_list):
First we create a node N.
If the tuples in D are all of the same class C, then return N as a leaf node labeled with class C.
If attribute_list is empty, then return N as a leaf node labeled with the majority class in D.
Apply Attribute_selection_method(D, attribute_list) to find the best splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the best way to partition the tuples. It also tells us which branches to grow from node N with respect to the outcomes of the chosen test. The aim is to keep the partitions as pure as possible; a partition is pure if all of the tuples in it belong to the same class.
Label the node N with the splitting criterion.
A branch is grown from each of the outcomes of the splitting criterion, and the tuples in D are partitioned accordingly.
If the splitting attribute is discrete-valued and multiway splits are allowed, then remove the splitting attribute from the attribute list, since it need not be tested again further down the tree.
foreach outcome j of the splitting criterion
  let Dj be the set of data tuples in D satisfying outcome j
  if Dj is empty, then attach a leaf labeled with the majority class in D to node N;
  else attach the node returned by Generate_Decision_Tree(Dj, attribute_list) to node N;
end for
return N
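For example, a common Attribute_selection_method is information gain: the attribute that most reduces the entropy of the class labels after the split is chosen as the splitting attribute. A minimal sketch, assuming discrete-valued attributes stored as string arrays paired with a class label (names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

static class AttributeSelection
{
    // entropy of the class-label distribution over a set of tuples
    static double Entropy(IEnumerable<string> labels)
    {
        var counts = labels.GroupBy(l => l).Select(g => (double)g.Count()).ToList();
        double total = counts.Sum();
        return counts.Sum(n => -(n / total) * Math.Log(n / total, 2));
    }

    // information gain from splitting the tuples D on attribute index attr
    public static double InformationGain(List<(string[] Attrs, string Label)> D, int attr)
    {
        double before = Entropy(D.Select(t => t.Label));
        double after = D.GroupBy(t => t.Attrs[attr])
                        .Sum(g => (double)g.Count() / D.Count * Entropy(g.Select(t => t.Label)));
        return before - after;   // the best splitting attribute maximizes this
    }
}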
 
Microsoft OLE DB for data mining :
OLE DB standardized data mining language primitives  and became an industry standard. Prior to OLE DB it was difficult to integrate data mining products. If one product was written using decision tree classifiers and another was written with support vectors and they do not have a common interface, then the application had to be rebuilt from scratch.  Furthermore, the data that these products analyzed was not always in a relational database which required data porting and transformation operations.
OLEDB for DM consolidates all these. It was designed to allow data  mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL. 
The OLE DB for Data Mining stack uses a data mining extensions (DMX), a SQL like data mining query language to talk to different DM Providers. DMX statements can be used to create, modify and work with different data mining models. DMX also contains several functions that can be used to retrieve statistical information.  Furthermore, the data and not just the interface is also unified. The OLE DB integrates the data mining providers from the data stores such as a Cube, a relational database, or miscellaneous other data source can be used to retrieve and display statistical information.
The three main operations performed are model creation, model training, and model prediction and browsing.
Model creation: A data mining model object is created just like a relational table. The model has a few input columns and one or more predictable columns, and it names the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: The data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age, and purchases, where the purchases are associations between item_name and item_quantity; a customer can make more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev, and support.
Model training involves loading the data into the model. The OPENROWSET statement supports querying data from a data source through an OLE DB provider, and the SHAPE command enables loading of nested data.
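Put together, the three operations map onto DMX statements. Below is a minimal sketch that sends DMX through the .NET Adomd client; the connection strings, model name, and table names are hypothetical, and the exact DMX accepted varies by provider:

using System;
using Microsoft.AnalysisServices.AdomdClient;  // .NET client for DMX-speaking providers

class DmxSketch
{
    static void Main()
    {
        // hypothetical server and catalog names
        using (var conn = new AdomdConnection("Data Source=localhost;Catalog=DMDemo"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();

            // 1) Model creation: input columns, a predictable column (PREDICT),
            //    a nested table of purchases, and the algorithm to train with.
            cmd.CommandText = @"
                CREATE MINING MODEL CustomerPurchases
                (
                    CustomerId LONG KEY,
                    Gender     TEXT DISCRETE,
                    Age        LONG CONTINUOUS PREDICT,
                    Purchases  TABLE
                    (
                        ItemName     TEXT KEY,
                        ItemQuantity LONG CONTINUOUS
                    )
                )
                USING Microsoft_Decision_Trees";
            cmd.ExecuteNonQuery();

            // 2) Model training: OPENROWSET pulls rows through an OLE DB
            //    provider; SHAPE nests the purchase rows under each customer.
            cmd.CommandText = @"
                INSERT INTO CustomerPurchases
                    (CustomerId, Gender, Age, Purchases (SKIP, ItemName, ItemQuantity))
                SHAPE
                    { OPENROWSET('SQLOLEDB', '<connection>',
                        'SELECT CustomerId, Gender, Age FROM Customers ORDER BY CustomerId') }
                APPEND
                    ({ OPENROWSET('SQLOLEDB', '<connection>',
                        'SELECT CustomerId, ItemName, ItemQuantity FROM Purchases ORDER BY CustomerId') }
                     RELATE CustomerId TO CustomerId) AS Purchases";
            cmd.ExecuteNonQuery();

            // 3) Model prediction and browsing: join new cases to the model.
            cmd.CommandText = @"
                SELECT t.CustomerId, Predict(Age)
                FROM CustomerPurchases
                NATURAL PREDICTION JOIN
                    OPENROWSET('SQLOLEDB', '<connection>',
                        'SELECT CustomerId, Gender FROM NewCustomers') AS t";
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetValue(0), reader.GetValue(1));
        }
    }
}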
 

Saturday, June 22, 2013

Interview questions on SQL Server:
1) What are the different isolation levels?
2) What is the difference between repeatable read and read committed?
3) What does the statement select from update do?
4) What is the difference between the WHERE and HAVING clauses?
5) What are the differences between DELETE and TRUNCATE?
6) What is a key lookup in a query plan?
7) What is the statement to get the query execution plan?
8) What are the three parameters used to find a bad query?
9) What is the difference between a clustered index and a non-clustered index?
10) What is SET NOCOUNT ON? What is the count after three DML statements?
11) What are collation and case sensitivity?
12) How do you handle errors in stored procedures?
13) What is the statement to create a table by copying the schema of another table?
Interview questions on WCF, ASP.Net, NHibernate, and MVC:
1) What is the difference between NHibernate and Entity Framework?
2) If hbm files are not loaded, how do you include them?
3) How do you define transactions for service calls?
4) What is the transaction scoped to?
5) What is MVC and how is it used?
6) How are ASP.Net pages different from MVC pages?
7) What is the difference between the postback call in ASP.Net and in MVC?
8) What are the other characteristics of ASP.Net?


 
Applications and trends in data mining:
Data mining tools have tended to be domain-specific, as in the finance, telecommunications, and retail industries. These tools integrate domain-specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface.
Visual data mining is done with a designer-style user interface that renders data, results, and the mining process in graphical, usually interactive presentations. Audio data mining uses audio signals to indicate patterns or features.
Data analysis with statistical methods involves regression, generalized linear models, analysis of variance, mixed-effect models, factor analysis, discriminant analysis, time-series analysis, survival analysis, and quality control.
Data mining or statistical techniques can also be applied to the recommendations and opinions of customers in order to search for similarities and rank products. Such systems are called collaborative recommender systems.
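At the heart of such systems is a similarity measure between customers' rating vectors; cosine similarity is one common choice. A minimal sketch (a hypothetical helper, not tied to any particular product):

using System;

static class Recommender
{
    // cosine similarity between two customers' item-rating vectors;
    // values near 1.0 mean very similar tastes, near 0.0 mean no overlap
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}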
Data mining may be ubiquitous: it is applied in the way we shop, work, search for information, use our leisure time, and maintain our health and well-being. It is not always visible, and it may participate behind the scenes in managing our e-mail or in web search engines. Such usage raises questions around privacy and data security, which fair information practice regulations governing the usage of such data have attempted to address. On the other hand, data mining can help with counterterrorism. Solutions that balance these trade-offs try not to interpret individual data while obtaining mining results, so as to preserve privacy, and attempt to encrypt data to preserve its security. Recent trends include the standardization of data mining languages, visualization methods, and new methods for handling complex data types.
 

Given a vector of integers, find the longest consecutive sub-sequence of increasing numbers. If two sub-sequences have the same length, use the one that occurs first. An increasing sub-sequence must have a length of 2 or greater to qualify.
Example input:
[1 0 1 2 3 0 4 5]
Result:
[0 1 2 3]

#include <assert.h>
#include <stdio.h>

typedef unsigned int uint;

void GetLongestRun(int* A, uint N)
{
     assert(A != NULL && N > 0);

     uint start = 0;       // start index of current candidate sequence
     uint end = 0;         // end index of current candidate sequence
     uint globalStart = 0; // start index of best sequence seen so far
     uint globalEnd = 0;   // end index of best sequence seen so far

     for (uint i = 1; i < N; i++)
     {
        if (A[i] > A[i-1])   // strictly increasing: extend the current run
        {
              end = i;
        }
        else                 // run broken: restart at i
        {
              start = i;
              end = i;
        }

        // strict > keeps the earliest sequence when lengths tie
        if (end - start > globalEnd - globalStart)
        {
            globalStart = start;
            globalEnd = end;
        }
     }

     // a qualifying sequence must have a length of 2 or greater
     if (globalEnd == globalStart) return;

     for (uint j = globalStart; j <= globalEnd; j++)
         printf ("%d ", A[j]);
}

int main(void)
{
     int A[] = { 1, 0, 1, 2, 3, 0, 4, 5 };
     GetLongestRun(A, 8);   // prints: 0 1 2 3
     return 0;
}
A tic-tac-toe board is represented by a two dimensional vector. X is represented by :x, O is represented by :o, and empty is represented by :e. A player wins by placing three Xs or three Os in a horizontal, vertical, or diagonal row. Write a function which analyzes a tic-tac-toe board and returns :x if X has won, :o if O has won, and nil if neither player has won.
Example input:
[[:x :e :o]
[:x :e :e]
[:x :e :o]]
Result:
:x
public char GetWinner(char[,] Board)
{
     // board cells hold 'x', 'o', or 'e' (for :x, :o, :e);
     // '\0' stands in for nil, i.e. no winner
     var ret = '\0';

     // check horizontals
     for (int i = 0; i < 3; i++)
     {
          if (Board[i,0] != 'e' && Board[i,0] == Board[i,1] && Board[i,1] == Board[i,2]) return Board[i,0];
     }

     // check verticals
     for (int j = 0; j < 3; j++)
     {
          if (Board[0,j] != 'e' && Board[0,j] == Board[1,j] && Board[1,j] == Board[2,j]) return Board[0,j];
     }

     // check diagonals (both pass through the center cell)
     if (Board[1,1] != 'e' && Board[0,0] == Board[1,1] && Board[1,1] == Board[2,2]) return Board[0,0];
     if (Board[1,1] != 'e' && Board[0,2] == Board[1,1] && Board[1,1] == Board[2,0]) return Board[0,2];

     return ret;
}
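A quick check against the example board (illustrative usage, assuming the method above is in scope):

var board = new char[,]
{
    { 'x', 'e', 'o' },
    { 'x', 'e', 'e' },
    { 'x', 'e', 'o' }
};
Console.WriteLine(GetWinner(board));   // prints: x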

Friday, June 21, 2013

Mining object, spatial, multimedia, and text data
These are complex types of data. If the database is an object-relational or object-oriented database, then its complex objects can be mined by using generalization and assigning classes to these objects, including set-based, list-based, inheritance, and composition-based hierarchies. They can also be mined by visualizing them as object data cubes, and they lend themselves to generalization-based mining.
Spatial data mining finds interesting patterns in large geospatial databases. Spatial data cubes are constructed using spatial dimensions and measures, and these can be queried using spatial OLAP. Spatial mining includes mining spatial association and co-location patterns, clustering, classification, and spatial trend and outlier analysis.
Multimedia data mining finds interesting patterns in multimedia databases, which store audio data, image data, video data, sequence data, and hypertext data containing text, markups, and linkages. Mining involves finding patterns based on content and similarity measures, generalization, and multidimensional analysis. Mining also involves classification and prediction, mining associations, and audio and video data mining.
Text or document database mining uses precision, recall, and F-score to measure the effectiveness of mining. As discussed earlier, various text retrieval methods have been developed in which queries can specify constraints on the documents to select, or the documents carry a ranking that enables a selection. For example, if we use similarity measures between keywords, then documents can be ranked in order of relevance. Text with many attributes can be reduced with indexing methods such as Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), and probabilistic LSI. Text mining is not limited to keyword-based and similarity-based search; it can also involve keyword-based associations, document classification, and document clustering.
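For reference, precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and the F-score is their harmonic mean. A minimal sketch (helper names are illustrative):

static class RetrievalMetrics
{
    // relevantRetrieved: relevant documents actually returned by the query
    public static (double Precision, double Recall, double FScore)
        Score(int relevantRetrieved, int retrieved, int relevant)
    {
        double precision = (double)relevantRetrieved / retrieved;
        double recall = (double)relevantRetrieved / relevant;
        double f = 2 * precision * recall / (precision + recall); // harmonic mean
        return (precision, recall, f);
    }
}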
Web mining looks for web linkage structures, web content, and web access patterns. Web page layouts, web link structures, associated multimedia data, and the classification of web pages are all used in this mining.