Saturday, June 22, 2013

Applications and Trends in Data Mining:
Data mining tools have typically been domain specific, as in the finance, telecommunications, and retail industries. These tools integrate domain-specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface.
Visual data mining uses a designer-style user interface that renders the data, results, and mining process in a graphical, usually interactive, presentation. Audio data mining uses audio signals to indicate patterns or features.
Data analysis with statistical methods involves regression, generalized linear models, analysis of variance, mixed-effect models, factor analysis, discriminant analysis, time-series analysis, survival analysis, and quality control.
Data mining or statistical techniques can also be applied to recommendations and opinions of customers to search for similarities and rank products. Such systems are called collaborative recommender systems.
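As a minimal sketch of the idea (the ratings matrix and item names below are hypothetical), an item-based collaborative recommender can rank products by the cosine similarity of their user-rating vectors:

import math

# hypothetical ratings: item -> ratings from users 0..3 (0 = not rated);
# real systems handle sparsity and missing ratings more carefully
ratings = {
    "book_a": [5, 3, 0, 4],
    "book_b": [4, 3, 0, 5],
    "book_c": [1, 0, 5, 2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# rank every other item by similarity to book_a
target = ratings["book_a"]
scores = {name: cosine(target, vec) for name, vec in ratings.items() if name != "book_a"}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
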
Data mining may be ubiquitous: it shapes the way we shop, work, search for information, use our leisure time, and maintain our health and well-being. It is not always visible, and it may participate behind the scenes in managing our e-mail or in web search engines. Such usage raises questions about privacy and data security, which fair information practice principles governing the use of such data attempt to address. On the other hand, data mining can help with counterterrorism. Solutions that balance these trade-offs try not to interpret the data while obtaining mining results, to preserve privacy, and attempt to encrypt data to preserve security. Recent trends include the standardization of data mining languages, new visualization methods, and new methods for handling complex data types.
 

Given a vector of integers, find the longest consecutive sub-sequence of increasing numbers. If two sub-sequences have the same length, use the one that occurs first. An increasing sub-sequence must have a length of 2 or greater to qualify.
Example input:
[1 0 1 2 3 0 4 5]
Result:
[0 1 2 3]

#include <assert.h>
#include <stdio.h>

typedef unsigned int uint;

void GetLongestRun ( int* A, uint N )
{
     assert ( A != NULL && N > 0 );

     uint start = 0;       // start index of current candidate sequence
     uint end = 0;         // end index of current candidate sequence
     uint globalStart = 0; // start index of overall winner among candidates
     uint globalEnd = 0;   // end index of overall winner among candidates

     for ( uint i = 1; i < N; i++ )
     {
          if ( A[i] > A[i-1] )     // strictly increasing run continues
          {
               end = i;
          }
          else                     // run broken; restart the candidate at i
          {
               start = i;
               end = i;
          }

          // strict '>' keeps the earlier candidate on ties, as required
          if ( end - start > globalEnd - globalStart )
          {
               globalStart = start;
               globalEnd = end;
          }
     }

     // a qualifying sub-sequence must have length 2 or greater
     if ( globalEnd == globalStart ) return;

     for ( uint j = globalStart; j <= globalEnd; j++ )
          printf ( "%d ", A[j] );
}
A tic-tac-toe board is represented by a two dimensional vector. X is represented by :x, O is represented by :o, and empty is represented by :e. A player wins by placing three Xs or three Os in a horizontal, vertical, or diagonal row. Write a function which analyzes a tic-tac-toe board and returns :x if X has won, :o if O has won, and nil if neither player has won.
Example input:
[[:x :e :o]
 [:x :e :e]
 [:x :e :o]]
Result:
:x
public char GetWinner(char[,] Board)
{
     var ret = '\0';   // '\0' signals no winner (the "nil" case)

     // check horizontals
     for (int i = 0; i < 3; i++)
     {
          if (Board[i,0] != 'e' && Board[i,0] == Board[i,1] && Board[i,1] == Board[i,2]) return Board[i,0];
     }

     // check verticals
     for (int j = 0; j < 3; j++)
     {
          if (Board[0,j] != 'e' && Board[0,j] == Board[1,j] && Board[1,j] == Board[2,j]) return Board[0,j];
     }

     // check diagonals
     if (Board[0,0] != 'e' && Board[0,0] == Board[1,1] && Board[1,1] == Board[2,2]) return Board[0,0];
     if (Board[0,2] != 'e' && Board[0,2] == Board[1,1] && Board[1,1] == Board[2,0]) return Board[0,2];

     return ret;
}

Friday, June 21, 2013

Mining object, spatial, multimedia, and text data
These are complex types of data. If the database is an object-relational or object-oriented database, these complex objects can be mined by generalization, assigning them to classes organized in set-based, list-based, and inheritance- and composition-based hierarchies. They can also be mined by visualizing them as object data cubes. Finally, they can be mined with generalization-based methods.
Spatial data mining finds interesting patterns from large geospatial databases. Spatial data cubes are constructed using spatial dimensions and measures and can be queried using spatial OLAP. Spatial mining includes mining spatial association and collocation patterns, clustering, classification, and spatial trend and outlier analysis.
Multimedia data mining finds interesting patterns from multimedia databases, which store audio data, image data, video data, sequence data, and hypertext data containing text, markups, and linkages. Mining involves finding patterns based on content and similarity measures, generalization, and multidimensional analysis, as well as classification and prediction, mining associations, and audio and video data mining.
Text or document database mining uses precision, recall, and the F-score to measure effectiveness. As discussed earlier, various text retrieval methods have been developed in which queries can specify constraints on the documents to select, or in which the documents are ranked to enable a selection. For example, if we use similarity measures between keywords, documents can be ranked in order of relevance. Text with many attributes can be reduced with indexing techniques such as Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), and probabilistic LSI. Text mining is not limited to keyword-based and similarity-based search; it can also involve keyword-based associations, document classification, and document clustering.
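The effectiveness measures above are simple to compute; a minimal Python sketch with made-up retrieved and relevant document sets:

# retrieved: documents the system returned; relevant: the ground truth
retrieved = {"d1", "d2", "d3", "d5"}
relevant = {"d1", "d3", "d4"}

tp = len(retrieved & relevant)            # true positives
precision = tp / len(retrieved)           # fraction of retrieved docs that are relevant
recall = tp / len(relevant)               # fraction of relevant docs that were retrieved
f_score = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f_score, 3))   # 0.5 0.666... 0.571
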
Web mining looks for web linkage structures, web contents, and web access patterns. Web page layouts, web link structures, associated multimedia data, and the classification of web pages are all used in this mining.

Thursday, June 20, 2013

Graph Mining
This involves finding frequent sub-graph patterns and performing some of the mining operations discussed earlier, such as categorization, classification, and cluster analysis, over large graph data sets. These approaches are of two types: Apriori-based and pattern-growth-based. The Apriori-based approach generates candidates by extending sub-graphs one level at a time in a breadth-first manner. The pattern-growth approach is more flexible with respect to the search method. Graph mining uses efficient graph index structures and can search for similarity based on structure rather than values, which has notable benefits. Classification and cluster analysis of graph data can now be integrated with the pattern mining process.
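To make the Apriori-based approach concrete at its first level, here is a minimal sketch (the toy graphs are hypothetical) that finds frequent single-edge sub-graphs; the full algorithm would then join frequent size-k candidates into size-(k+1) candidates, level by level:

from collections import Counter

# hypothetical data set: each graph is a set of undirected edges,
# with each edge's endpoints stored in sorted order
graphs = [
    {("a", "b"), ("a", "c"), ("b", "c")},
    {("a", "b"), ("b", "c")},
    {("a", "c"), ("c", "d")},
]
min_support = 2

# level 1: a 1-edge sub-graph's support is the number of graphs containing it
support = Counter()
for g in graphs:
    for edge in g:
        support[edge] += 1

frequent_edges = {e for e, s in support.items() if s >= min_support}
print(frequent_edges)   # {('a', 'b'), ('a', 'c'), ('b', 'c')}
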
An example of a graph for mining is a social network: a heterogeneous, multirelational data set represented as a graph, which can be arbitrarily large. Such networks tend to follow the densification power law, which states that the network becomes increasingly dense over time. They also exhibit shrinking diameters: the effective diameter decreases as the network grows. Their out-degree and in-degree distributions are heavy-tailed, with a large number of outliers. The growth of these networks is often modeled with the analogy of a forest fire.
Link mining is an interdisciplinary pursuit spanning social networks, link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming. Tasks include classifying objects based on links; prediction based on object types, link types, and link existence; estimating link cardinality; predicting whether two objects are the same; and detecting cluster objects. Other tasks involve finding sub-graphs within networks and extracting metadata from unstructured data.
Link prediction uses measures of the proximity of network nodes and ranks links based on these measures. Examples include the shortest path between two nodes and the number of neighbors they share. This degree of connection is exploited in viral marketing, where more money is spent on the individuals who can advertise by positive word of mouth to the largest number of others. Similarly, in newsgroups, people respond to posts more when they disagree than when they agree, so graph partitioning can be used to mine newsgroups by grouping authors into opposite camps. Most community mining methods assume that there is only one kind of relation in the network and that the results are independent of the user's information needs; however, there may be multiple relations, and some relations matter more depending on those needs. Furthermore, a combination of such hidden relations might reveal a hidden community within the network.
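As a concrete sketch of the shared-neighbor measure (the friendship graph below is hypothetical), candidate links can be scored by how many neighbors their endpoints have in common:

from itertools import combinations

# hypothetical friendship graph as a symmetric adjacency map
neighbors = {
    "ann": {"bob", "carl", "dee"},
    "bob": {"ann", "carl"},
    "carl": {"ann", "bob", "dee"},
    "dee": {"ann", "carl", "ed"},
    "ed": {"dee"},
}

# score every non-adjacent pair by its number of shared neighbors
scores = {}
for u, v in combinations(neighbors, 2):
    if v not in neighbors[u]:
        scores[(u, v)] = len(neighbors[u] & neighbors[v])

# the highest-scoring pairs are predicted as the most likely future links
for pair, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, s)
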
 
Mining stream, time-series, and Sequence Data
Stream data can come from telecom, financial markets, and satellite feeds; it is typically continuous with a varying update rate, sequenced in time, fast changing, and massive. Summaries of stream data are called synopses, which help answer queries over the stream approximately. Synopsis techniques include random sampling, sliding windows (much like TCP's), histograms, multiresolution methods, sketches, and randomized algorithms.
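Random sampling is the simplest of these synopses; a minimal sketch of reservoir sampling, which maintains a uniform sample of fixed size k no matter how long the stream grows:

import random

def reservoir_sample(stream, k):
    # keep a uniform random sample of k items from a stream of unknown length
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, n)     # item survives with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
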
The tilted time frame model stores data at the finest granularity for the most recent time and at the coarsest granularity for the most distant time. A stream data cube can store compressed data by using the tilted time frame model on the time dimension, storing data only at critical layers, which reflect the levels of data of most interest to the analyst, and performing partial materialization based on "popular paths" through the critical layers.
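A minimal sketch of the idea, using the logarithmic variant of the tilted time frame (the capacities and merge rule here are simplifying assumptions): each level keeps at most two entries, and the two oldest entries of a level merge into one coarser entry at the next level up:

class TiltedTimeFrame:
    # logarithmic tilted time frame: fine granularity for recent time,
    # coarse (merged) granularity for distant time
    def __init__(self):
        self.levels = [[]]               # level i holds sums covering 2**i time units

    def add(self, value, level=0):
        if level == len(self.levels):
            self.levels.append([])
        buf = self.levels[level]
        buf.insert(0, value)             # newest first
        if len(buf) > 2:                 # overflow: merge the two oldest upward
            merged = buf.pop() + buf.pop()
            self.add(merged, level + 1)

frame = TiltedTimeFrame()
for count in [3, 1, 4, 1, 5, 9, 2, 6]:   # e.g. per-minute counts
    frame.add(count)
print(frame.levels)   # [[6, 2], [14], [9]]: recent counts intact, older ones merged
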
Stream-based operations are typically forward only. The methods discussed earlier for frequent itemset mining, classification, and clustering tend to scan the data multiple times; stream-based methods instead find approximate answers, as with the lossy counting algorithm and the CluStream algorithm for stream data clustering. A time-series database consists of sequences of values or events changing with time at regular intervals, as in weather forecasting. Such a database is studied with trend analysis, which includes long-term trend movements, cyclic movements, seasonal movements, and irregular movements. Subsequence matching is a form of similarity search that finds subsequences similar to a given query subsequence; such methods match subsequences with the same shape while accounting for gaps and differences in baseline and scale.
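A minimal sketch of the lossy counting algorithm mentioned above (due to Manku and Motwani); it keeps approximate item counts whose undercount is bounded by epsilon times the stream length:

import math

def lossy_count(stream, epsilon):
    width = math.ceil(1 / epsilon)            # bucket width
    counts, deltas = {}, {}
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1         # maximum possible undercount so far
        if n % width == 0:                    # prune infrequent items at bucket boundaries
            for x in list(counts):
                if counts[x] + deltas[x] <= bucket:
                    del counts[x], deltas[x]
    return counts

print(lossy_count("abracadabra" * 20, epsilon=0.05))
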
A sequence database consists of ordered elements not necessarily based on time, such as web clickstreams. Any sequence that satisfies a minimum support is frequent, and these patterns are queried from the database, for example, a customer who bought this also bought that. Constraints for mining sequential patterns are user defined and help narrow down the patterns being searched. They can be expressed in terms of the duration of the sequence, a window of time when events occur, and gaps between events. Analysis of recurring patterns is called periodicity analysis and may involve full or partial periodicity, or association rules between periods. Another example of sequence analysis is biological sequence analysis, which compares, aligns, indexes, and analyzes biological sequences such as sequences of amino acids. Alignment is of two types, pairwise and multiple sequence alignment, and usually involves dynamic programming. Common techniques to analyze biological sequences are Markov chains and hidden Markov models, which attempt to find the probability of a symbol x in the model given the sequence of symbols.
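As a minimal sketch of the Markov chain technique (the two-symbol model and its probabilities below are hypothetical), the probability of a sequence is the initial-state probability times a product of transition probabilities; logs avoid numerical underflow on long sequences:

import math

# hypothetical first-order Markov chain over two nucleotide symbols
initial = {"A": 0.6, "T": 0.4}
transition = {("A", "A"): 0.7, ("A", "T"): 0.3,
              ("T", "A"): 0.4, ("T", "T"): 0.6}

def log_prob(sequence):
    # log P(sequence) under the chain
    lp = math.log(initial[sequence[0]])
    for prev, cur in zip(sequence, sequence[1:]):
        lp += math.log(transition[(prev, cur)])
    return lp

print(log_prob("AATTA"))   # higher (less negative) means more probable
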

Wednesday, June 19, 2013

Cluster analysis:
A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in another cluster. The quality of a cluster can be measured based on the dissimilarity of its objects, which can be computed for different types of data such as interval-scaled, binary, categorical, ordinal, and ratio-scaled variables. Vector data can be measured with the cosine measure and the Tanimoto coefficient. Cluster analysis can be a stand-alone data mining tool. Clustering methods can be grouped as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern based methods), and constraint-based methods.
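The two vector measures named above are compact enough to show directly; a minimal sketch with hypothetical vectors:

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tanimoto(u, v):
    # reduces to the Jaccard coefficient when u and v are binary vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

u, v = [1, 0, 2, 3], [1, 1, 2, 2]
print(round(cosine(u, v), 3), round(tanimoto(u, v), 3))
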
A partitioning method creates an initial set of k partitions and assigns the data points to the nearest clusters; it then iteratively recomputes each cluster center and reassigns the data points. Examples are k-means, k-medoids, and CLARANS. A hierarchical method groups data points into a hierarchy by either bottom-up or top-down construction; iterative relocation can also be applied to the sub-clusters. A density-based approach clusters objects based on a density function or according to the density of neighboring objects; examples are DBSCAN, DENCLUE, and OPTICS. A grid-based method first allocates the objects into a finite number of cells that form a grid structure and then performs clustering on the grid structure; STING is an example of a grid-based method in which the cells contain statistical information. A model-based method hypothesizes a model for each cluster and finds the best fit of the data to the model; self-organizing feature maps are an example. Clustering high-dimensional data has applications to text documents; the three method families are dimension-growth subspace clustering, dimension-reduction projected clustering, and frequent-pattern-based clustering. A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Outlier detection and analysis are useful for applications such as fraud detection; outliers are detected based on statistical distribution, distance, density, or deviation.
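A minimal sketch of the k-means loop described above, on 1-D data (the points and k are hypothetical; practical implementations use better seeding and handle empty clusters):

import random

def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)           # naive random seeding
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]   # update step
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 9.0]
centers, clusters = kmeans(points, k=3)
print(sorted(centers))
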
Classification and prediction are forms of data analysis used to extract models. Classification predicts labels or data classes, while prediction models continuous-valued functions.
The same data preprocessing steps discussed earlier are applied.
ID3, C4.5, and CART are greedy algorithms for the induction of decision trees. Each algorithm uses an attribute selection measure to choose the attribute tested at each non-leaf node.
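The attribute selection measure used by ID3 is information gain; a minimal sketch of computing it for a single attribute over a tiny hypothetical training set:

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# hypothetical training tuples: (attribute value, class label)
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]

base = entropy([label for _, label in data])

# expected entropy after splitting on the attribute
split = 0.0
for value in set(v for v, _ in data):
    subset = [label for v, label in data if v == value]
    split += len(subset) / len(data) * entropy(subset)

print("information gain:", base - split)   # 1.0 - 1/3 = 0.666...
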
Naïve Bayes classification and Bayesian belief networks are based on Bayes' theorem. The former assumes that attributes are conditionally independent given the class, while the latter allows conditional independence to hold only among subsets of variables.
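A minimal naive Bayes sketch on categorical data (the training tuples are hypothetical, and Laplace smoothing is added to avoid zero probabilities):

from collections import Counter, defaultdict

# hypothetical training tuples: ((outlook, temperature), class label)
train = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
         (("rain", "mild"), "yes"), (("overcast", "hot"), "yes"),
         (("rain", "cool"), "yes")]

class_counts = Counter(label for _, label in train)
feature_counts = defaultdict(lambda: defaultdict(Counter))  # [class][position][value]
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[label][i][value] += 1

def predict(features):
    best, best_score = None, 0.0
    for label, n in class_counts.items():
        score = n / len(train)                      # prior P(class)
        for i, value in enumerate(features):
            # P(value | class) with Laplace smoothing; each attribute here has 3 values
            score *= (feature_counts[label][i][value] + 1) / (n + 3)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("rain", "hot")))   # "yes"
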
A rule based classifier uses a set of IF-THEN rules for classification. Rules can be generated directly from training data using sequential covering algorithms and associative classification algorithms.
Associative classification uses association mining techniques that search for frequently occurring patterns in large databases.
Classifiers that use training data to build a generalization model up front are called eager learners. Lazy learners, or instance-based learners, instead store the training tuples in pattern space and wait for test data before generalizing; they require efficient indexing techniques.
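k-nearest-neighbor classification is the canonical lazy learner; a minimal sketch (the points are hypothetical; real implementations index the pattern space, e.g. with k-d trees, rather than scanning it):

from collections import Counter

# stored training tuples: (2-D point, class label); no model is built up front
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.1), "b"), ((4.8, 5.3), "b")]

def classify(point, k=3):
    # generalization happens only now, when the test tuple arrives
    dist = lambda p: (p[0] - point[0]) ** 2 + (p[1] - point[1]) ** 2
    nearest = sorted(train, key=lambda t: dist(t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((1.1, 0.9)))   # "a"
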
Linear, non-linear, and generalized linear models of regression can be used for prediction. Some non-linear problems can be converted to linear ones by transforming the predictor variables.
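A minimal sketch of linear regression by least squares on a single predictor (the observations are hypothetical); the closed-form slope and intercept come from minimizing the squared error:

# hypothetical observations (x, y)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# least-squares estimates: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print("y = %.2f x + %.2f" % (slope, intercept))   # y = 1.95 x + 0.15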