Thursday, June 20, 2013

Mining Stream, Time-Series, and Sequence Data
Stream data can come from telecom, financial markets, and satellite feeds; it is typically continuous, sequenced in time, fast changing, massive, and arrives at varying rates. Summaries of stream data are called synopses. Synopses help answer queries on stream data approximately. They include random sampling, sliding windows (much like TCP), histograms, multiresolution methods, sketches, and randomized algorithms.
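As a concrete illustration of the random-sampling synopsis, a reservoir sample keeps a fixed-size uniform sample of an unbounded stream; the rough C# sketch below uses made-up names and a plain int stream, so treat it as illustration rather than anything from the book.

// Reservoir-sampling sketch (illustrative, not from the book): maintains a
// uniform random sample of size k over a stream of unknown length.
// Needs: using System;
int[] reservoir;          // the current sample
int seen = 0;             // stream elements observed so far
Random rng = new Random();

void InitReservoir(int k)
{
    reservoir = new int[k];
}

void Observe(int value)
{
    seen++;
    if (seen <= reservoir.Length)
    {
        reservoir[seen - 1] = value;                    // fill the reservoir first
    }
    else
    {
        int j = rng.Next(seen);                         // uniform in [0, seen)
        if (j < reservoir.Length) reservoir[j] = value; // keep with probability k/seen
    }
}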
The tilted time frame model allows data to be stored at the finest granularity for the most recent time and at the coarsest granularity for the most distant time. A stream data cube can store compressed data by using the tilted time frame model on the time dimension, storing data only at critical layers, which reflect the levels of data that are of most interest to the analyst, and performing partial materialization based on "popular paths" through the critical layers.
Stream-based operations are typically forward-only. The methods discussed earlier for frequent itemset mining, classification, and clustering tend to scan the data multiple times. Stream-based methods instead find approximate answers, as with the Lossy Counting algorithm for frequent items and the CluStream algorithm for stream data clustering. A time-series database consists of sequences of values or events changing with time, recorded at regular intervals, such as data for weather forecasting. Such a database is studied with trend analysis, which covers long-term trend movements, cyclic movements, seasonal movements, and irregular movements. Subsequence matching is a form of similarity search that finds subsequences similar to a given query subsequence. Such methods match subsequences that have the same shape while accounting for gaps and differences in baseline and scale.
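A rough C# sketch of the Lossy Counting idea follows; the field names and the use of string items are just for illustration, and the error bound epsilon is a made-up example value.

// Simplified Lossy Counting sketch (illustrative names). Counts stay within
// epsilon * N of the true frequency, where N is the number of items seen so far.
// Needs: using System; using System.Collections.Generic;
Dictionary<string, int> count = new Dictionary<string, int>();
Dictionary<string, int> maxError = new Dictionary<string, int>();
double epsilon = 0.01;     // allowed error
int totalSeen = 0;         // items observed so far

void AddToCounts(string item)
{
    totalSeen++;
    int width = (int)Math.Ceiling(1.0 / epsilon);             // bucket width
    int currentBucket = (int)Math.Ceiling((double)totalSeen / width);

    if (count.ContainsKey(item))
        count[item]++;
    else
    {
        count[item] = 1;
        maxError[item] = currentBucket - 1;                   // possible uncounted earlier occurrences
    }

    // At each bucket boundary, prune items that cannot be frequent.
    if (totalSeen % width == 0)
    {
        foreach (string key in new List<string>(count.Keys))
            if (count[key] + maxError[key] <= currentBucket)
            {
                count.Remove(key);
                maxError.Remove(key);
            }
    }
}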
A sequence database consists of ordered elements that are not necessarily based on time, such as web clickstreams. Any sequence that satisfies a minimum support is frequent, and these patterns are queried from the database, for example, customers who bought this also bought that. Constraints for constraint-based mining of sequential patterns are user defined and help to further narrow down the patterns being searched. They can be expressed in terms of the duration of a sequence, a window of time within which events must occur, and gaps between events. Analysis of recurring patterns is called periodicity analysis and may involve full or partial periods or association rules between periods. Another example of sequence analysis is biological sequence analysis, which compares, aligns, indexes, and analyzes biological sequences such as sequences of amino acids. Alignment comes in two forms, pairwise sequence alignment and multiple sequence alignment, and usually involves dynamic programming. Common techniques for analyzing biological sequences are Markov chains and hidden Markov models, which attempt to find the probability of a sequence of symbols given the model.
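For a plain first-order Markov chain, the probability of a whole sequence is the initial probability of the first symbol times the transition probabilities between successive symbols; a minimal sketch, assuming the probability tables have already been estimated elsewhere:

// Probability of a symbol sequence under a first-order Markov chain (sketch).
// 'initial' holds P(first symbol) and 'transition' holds P(next | current);
// symbols are encoded as integer indices into those tables.
double SequenceProbability(int[] sequence, double[] initial, double[,] transition)
{
    if (sequence.Length == 0) return 0.0;
    double p = initial[sequence[0]];
    for (int i = 1; i < sequence.Length; i++)
        p *= transition[sequence[i - 1], sequence[i]];
    return p;
}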

Wednesday, June 19, 2013

Cluster analysis:
A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. The quality of a cluster can be measured based on the dissimilarity of its objects, which can be computed for different types of data such as interval-scaled, binary, categorical, ordinal, and ratio-scaled variables. Vector data can be compared with the cosine measure and the Tanimoto coefficient. Cluster analysis can be used as a stand-alone data mining tool. Clustering methods can be grouped as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern based methods), and constraint-based methods.
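Both vector measures reduce to dot products over the two vectors; a small sketch:

// Cosine measure and Tanimoto coefficient for two equal-length numeric vectors (sketch).
// Needs: using System;
double Dot(double[] x, double[] y)
{
    double s = 0;
    for (int i = 0; i < x.Length; i++) s += x[i] * y[i];
    return s;
}

double Cosine(double[] x, double[] y)
{
    // dot(x, y) / (|x| * |y|)
    return Dot(x, y) / (Math.Sqrt(Dot(x, x)) * Math.Sqrt(Dot(y, y)));
}

double Tanimoto(double[] x, double[] y)
{
    // dot(x, y) / (|x|^2 + |y|^2 - dot(x, y))
    double d = Dot(x, y);
    return d / (Dot(x, x) + Dot(y, y) - d);
}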
A partitioning method creates an initial set of k partitions and assigns the data points to the nearest clusters. Then it iteratively re-computes the center of each cluster and reassigns the data points. Some examples are k-means, k-medoids, and CLARANS. A hierarchical method groups data points into a hierarchy by either bottom-up or top-down construction. Iterative relocation can also be applied to the sub-clusters. A density-based approach clusters objects based on a density function or according to the density of neighboring objects. Some examples are DBSCAN, DENCLUE, and OPTICS. A grid-based method first allocates the objects into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. STING is an example of a grid-based method in which the cells contain statistical information. A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to the model. Self-organizing feature maps are an example of a model-based method. Clustering high-dimensional data has applications to text documents. There are three approaches: dimension-growth subspace clustering, dimension-reduction projected clustering, and frequent pattern based clustering. A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Outlier detection and analysis can be useful for applications such as fraud detection. Outliers are detected based on statistical distribution, distance, density, or deviation.
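The reassign/recompute loop of a partitioning method such as k-means might look like the following simplified one-dimensional sketch; initialization and convergence testing are glossed over, and the method name is made up.

// Minimal k-means sketch on one-dimensional data (illustrative, simplified).
// Needs: using System;
double[] KMeans(double[] points, int k, int iterations)
{
    // Naive initialization: use the first k points as the initial centers.
    double[] centers = new double[k];
    for (int j = 0; j < k; j++) centers[j] = points[j];

    int[] assignment = new int[points.Length];
    for (int it = 0; it < iterations; it++)
    {
        // Assign each point to the nearest center.
        for (int i = 0; i < points.Length; i++)
        {
            int best = 0;
            for (int j = 1; j < k; j++)
                if (Math.Abs(points[i] - centers[j]) < Math.Abs(points[i] - centers[best]))
                    best = j;
            assignment[i] = best;
        }
        // Recompute each center as the mean of its assigned points.
        for (int j = 0; j < k; j++)
        {
            double sum = 0; int n = 0;
            for (int i = 0; i < points.Length; i++)
                if (assignment[i] == j) { sum += points[i]; n++; }
            if (n > 0) centers[j] = sum / n;
        }
    }
    return centers;
}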
Classification and prediction are forms of data analysis used to extract models. Classification predicts categorical labels or data classes, while prediction models continuous-valued functions.
The same data preprocessing steps discussed earlier apply here.
ID3, C4.5, and CART are greedy algorithms for the induction of decision trees. Each algorithm uses an attribute selection measure to choose the attribute tested at each non-leaf node.
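ID3's attribute selection measure, information gain, is built on entropy; a small sketch of the entropy part of that computation (the per-split comparison and the surrounding tree construction are omitted):

// Entropy of a class-label distribution, the quantity information gain is built from.
// Needs: using System;
double Entropy(int[] classCounts)
{
    int total = 0;
    foreach (int c in classCounts) total += c;
    double entropy = 0.0;
    foreach (int c in classCounts)
    {
        if (c == 0) continue;
        double p = (double)c / total;
        entropy -= p * Math.Log(p, 2);      // -sum p * log2(p)
    }
    return entropy;
}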
Naïve Bayes classification and Bayesian belief networks are based on Bayes' theorem. The former assumes that attributes are conditionally independent given the class, while the latter allows conditional independencies to be defined between subsets of variables.
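Under the naïve independence assumption the class posterior is proportional to the prior times the product of per-attribute conditional probabilities; a minimal scoring sketch, assuming those probabilities have already been estimated from training data (the parameter layout is made up for illustration):

// Naïve Bayes scoring sketch: P(C | x) is proportional to P(C) * product of P(x_i | C).
// 'priors[c]' is P(class c); 'conditional[c][i]' is the probability of the observed
// value of attribute i given class c, looked up beforehand by the caller.
// Needs: using System;
int PredictClass(double[] priors, double[][] conditional)
{
    int best = 0;
    double bestScore = double.NegativeInfinity;
    for (int c = 0; c < priors.Length; c++)
    {
        // Work in log space to avoid underflow when there are many attributes.
        double score = Math.Log(priors[c]);
        for (int i = 0; i < conditional[c].Length; i++)
            score += Math.Log(conditional[c][i]);
        if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;
}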
A rule based classifier uses a set of IF-THEN rules for classification. Rules can be generated directly from training data using sequential covering algorithms and associative classification algorithms.
Associative classification uses association mining techniques that search for frequently occurring patterns in large databases.
Classifiers that use training data to build a generalization model before seeing test data are called eager learners. This differs from lazy learners, or instance-based learners, which store the training tuples in pattern space and wait for test data before performing generalization. Lazy learners require efficient indexing techniques.
Linear, non-linear, and generalized linear models of regression can be used for prediction. Non-linear models can be converted to linear problems by transforming the predictor variables.
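For instance, fitting y = a + b*x by least squares is only a few lines, and a model such as y = a + b*x + c*x^2 can reuse the same machinery once x and x^2 are treated as separate predictors; a sketch of the straight-line case (the method name is made up):

// Least-squares fit of y = a + b*x (sketch).
void FitLine(double[] x, double[] y, out double a, out double b)
{
    int n = x.Length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++)
    {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx);   // slope
    a = (sy - b * sx) / n;                            // intercept
}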

Tuesday, June 18, 2013

Some of the common techniques for finding patterns with data mining include association rule mining, which consists of finding frequent itemsets from which strong association rules are generated. Associations can be further analyzed to uncover correlation rules, which convey statistical information.
Frequent pattern mining can be categorized based on the completeness of the patterns mined, the levels and dimensions of the data, the types of values handled, and the kinds of rules and patterns produced. It can also be classified into frequent itemset mining, sequential pattern mining, structured pattern mining, and so on. Algorithms for frequent itemset mining are of three types:
1) Apriori-like algorithms: the Apriori algorithm mines frequent itemsets for Boolean association rules. Based on the property that all non-empty subsets of a frequent itemset must also be frequent, the kth iteration forms candidate k-itemsets from the frequent (k-1)-itemsets (a sketch of this candidate-generation step appears after this list).
2) Frequent-pattern-growth algorithms: FP-growth does not generate any candidates; instead it constructs a highly compact data structure (the FP-tree) and mines frequent itemsets by fragment growth.
3) Algorithms that use the vertical data format transform a given data set of transactions from the horizontal data format of TID-itemset into the vertical data format of item-TID_set. They then mine frequent itemsets using the Apriori property together with additional optimization techniques such as diffsets.
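The candidate generation mentioned in (1) is a join of the frequent (k-1)-itemsets with themselves followed by pruning via the Apriori property; a rough C# sketch, with itemsets as sorted int arrays and helper names made up for illustration:

// Apriori join + prune sketch: builds candidate k-itemsets from frequent (k-1)-itemsets.
// Needs: using System; using System.Collections.Generic;
List<int[]> GenerateCandidates(List<int[]> frequent)
{
    var candidates = new List<int[]>();
    for (int a = 0; a < frequent.Count; a++)
    {
        for (int b = a + 1; b < frequent.Count; b++)
        {
            int[] p = frequent[a], q = frequent[b];
            // Join step: the two itemsets must agree on everything except the last item.
            bool joinable = p.Length == q.Length;
            for (int i = 0; joinable && i < p.Length - 1; i++)
                if (p[i] != q[i]) joinable = false;
            if (!joinable || p[p.Length - 1] == q[q.Length - 1]) continue;

            int[] candidate = new int[p.Length + 1];
            Array.Copy(p, candidate, p.Length - 1);                          // shared prefix
            candidate[p.Length - 1] = Math.Min(p[p.Length - 1], q[q.Length - 1]);
            candidate[p.Length] = Math.Max(p[p.Length - 1], q[q.Length - 1]);

            // Prune step: every (k-1)-subset of the candidate must itself be frequent.
            bool allSubsetsFrequent = true;
            for (int skip = 0; skip < candidate.Length && allSubsetsFrequent; skip++)
            {
                int[] subset = new int[candidate.Length - 1];
                for (int i = 0, j = 0; i < candidate.Length; i++)
                    if (i != skip) subset[j++] = candidate[i];
                allSubsetsFrequent = frequent.Exists(f => SameItems(f, subset));
            }
            if (allSubsetsFrequent) candidates.Add(candidate);
        }
    }
    return candidates;
}

bool SameItems(int[] x, int[] y)
{
    if (x.Length != y.Length) return false;
    for (int i = 0; i < x.Length; i++) if (x[i] != y[i]) return false;
    return true;
}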
These same methods can be extended for the mining of closed frequent itemsets from which the set of frequent itemsets can easily be derived. These include additional techniques such as item merging, sub-itemset pruning and item skipping.
These techniques can be extended to multilevel association rules and multidimensional association rules.
Techniques for mining multidimensional association rules can be categorized according to their treatment of quantitative attributes. For example, they can be discretized statically based on predefined concept hierarchies. Quantitative association rules can be mined where quantitative attributes are discretized dynamically based on binning/clustering.
Association rules should be augmented with a correlation measure such as lift, all_confidence and cosine.
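Lift, for example, is P(A and B) / (P(A) * P(B)), with values near 1 suggesting that A and B are independent; computed from simple transaction counts it is just:

// Lift of the rule A => B from transaction counts (sketch).
double Lift(int countA, int countB, int countAandB, int totalTransactions)
{
    double pA = (double)countA / totalTransactions;
    double pB = (double)countB / totalTransactions;
    double pAB = (double)countAandB / totalTransactions;
    return pAB / (pA * pB);
}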
Constraint-based rule mining refines the search for rules by letting users supply metarules and constraints, which can be antimonotonic, monotonic, succinct, convertible, or inconvertible. Association rules should not be used for prediction without training.
 

book review

Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber

This book presents the data preprocessing steps as descriptive data summarization, data cleaning, data integration and transformation, data reduction, and data discretization with automatic generation of concept hierarchies.
Descriptive data summarization provides the analytical foundation for data preprocessing. It uses statistical measures such as the mean, weighted mean, median, and mode for the center of a distribution; the range, quartiles, interquartile range, variance, and standard deviation for its dispersion; and histograms, boxplots, quantile plots, scatter plots, and scatter-plot matrices for visual representation.
Data cleaning routines fill in missing values, smooth out noise, identify outliers, and correct inconsistencies in the data.
Data integration combines data from multiple sources into a coherent data store, resolving data conflicts and semantic heterogeneity along the way.
Data transformation routines convert the data into forms appropriate for mining, involving steps such as normalization.
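Min-max normalization, one such step, rescales a numeric attribute to a target range; a small sketch (the method name is made up):

// Min-max normalization sketch: rescales values to [newMin, newMax].
double[] MinMaxNormalize(double[] values, double newMin, double newMax)
{
    double min = values[0], max = values[0];
    foreach (double v in values)
    {
        if (v < min) min = v;
        if (v > max) max = v;
    }
    double[] result = new double[values.Length];
    for (int i = 0; i < values.Length; i++)
        result[i] = max == min
            ? newMin                                   // constant attribute: map everything to newMin
            : (values[i] - min) / (max - min) * (newMax - newMin) + newMin;
    return result;
}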
Data reduction techniques such as data cube aggregation, dimensionality reduction, subset selection and discretization can be used to obtain a reduced representation of data.
Data discretization can involve techniques such as binning, histogram analysis, entropy based discretization, cluster analysis and intuitive partitioning. Data processing methods continue to evolve due to the size and complexity of the problem.

Data Mining is the discovery of knowledge by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. Data Warehousing helps by providing summarized data. A data warehouse is defined in this book as a subject-oriented, integrated, time-variant, and non-volatile collection of data organized for decision making. A multidimensional data model is used to design the data warehouse; it consists of a data cube with a large set of facts or measures and a number of dimensions. A data cube consists of a lattice of cuboids. Concept hierarchies organize values into levels of abstraction.
Data Mining can use OLAP queries as well as on-line analytical mining (OLAM).

Monday, June 17, 2013

Q: Find the second largest number in an integer array
A: int GetSecondLargest(int[] array, uint length)
{
    if (array == null || length < 1) throw new ArgumentException("array must have at least one element");

    int max = array[0];
    int secondmax = 0;
    bool found = false;               // becomes true once a second-largest candidate exists

    for (uint i = 1; i < length; i++)
    {
        if (array[i] > max)
        {
            // New maximum: the previous maximum becomes the second largest.
            secondmax = max;
            max = array[i];
            found = true;
        }
        else if (array[i] < max && (!found || array[i] > secondmax))
        {
            // Smaller than the maximum but larger than the current runner-up.
            secondmax = array[i];
            found = true;
        }
        // Values equal to the maximum are ignored so the second largest is distinct.
    }

    if (!found) throw new InvalidOperationException("no second largest element exists");
    return secondmax;
}
 
Question: Given an arbitrary two dimensional matrix of integers where the elements are sorted in increasing order along rows and columns, find a number in the matrix closest to and less than or equal to a given number.
Answer:
int GetElement(int[,] matrix, int startrow, int startcol, int endrow, int endcol, int number)
{
    // Returns the largest element <= number in matrix[startrow..endrow, startcol..endcol],
    // or int.MinValue if no such element exists. The initial call covers the whole matrix:
    // GetElement(matrix, 0, 0, matrix.GetLength(0) - 1, matrix.GetLength(1) - 1, number).
    if (startrow > endrow || startcol > endcol) return int.MinValue;    // empty region

    // Binary search along the diagonal for the last cell whose value is <= number.
    // Diagonal values never decrease because both rows and columns are sorted.
    int lo = 0;
    int hi = Math.Min(endrow - startrow, endcol - startcol);
    int lastLE = -1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (matrix[startrow + mid, startcol + mid] <= number)
        {
            lastLE = mid;
            lo = mid + 1;
        }
        else
        {
            hi = mid - 1;
        }
    }

    // Best candidate on the diagonal; everything above and to its left is no larger.
    int best = lastLE >= 0 ? matrix[startrow + lastLE, startcol + lastLE] : int.MinValue;

    // Larger values that are still <= number can only lie in the top-right and
    // bottom-left submatrices straddling the diagonal boundary; search them recursively.
    int split = lastLE + 1;
    int topright = GetElement(matrix, startrow, startcol + split, startrow + split - 1, endcol, number);
    int bottomleft = GetElement(matrix, startrow + split, startcol, endrow, startcol + split - 1, number);

    return Math.Max(best, Math.Max(topright, bottomleft));
}