Saturday, January 26, 2013

Data Mining (Continued)


Decision Tree

A useful data structure for mining the logical data model is the decision tree. Its structure is as follows: each interior node tests one of a set (A1, ..., An) of categorical attributes; each leaf holds a class label from domain(C); and each edge carries a value from domain(Ai), where Ai is the attribute associated with the parent node. The tree behaves like a search tree: it maps the tuples in R to leaves, i.e. to class labels. The advantages of decision trees are that they can work with heterogeneous data and that their decision boundaries are parallel to the axes.
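As a rough sketch of this structure (the Node, Leaf, and classify names are just illustrative, not any particular library's API), a decision tree and its search-tree traversal might look like the following in Python:

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                      # class label from domain(C)

@dataclass
class Node:
    attribute: str                  # categorical attribute Ai tested at this node
    children: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)
                                    # one edge per value in domain(Ai)

def classify(tree: Union[Node, Leaf], row: Dict[str, str]) -> str:
    """Walk the tree like a search tree: follow the edge whose value
    matches the tuple's value for the node's attribute."""
    while isinstance(tree, Node):
        tree = tree.children[row[tree.attribute]]
    return tree.label
```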

Decision Tree Mining
There are efficient algorithms to mine decision trees. Given a table R(A1, A2, ..., An, C), the goal is to find a decision tree that minimizes classification error. The constraints are to prefer a smaller tree and to avoid splitting on unique, key-like columns.
One algorithm is naive exhaustive enumeration. It involves the following steps: 1. Generate candidate trees. 2. Scan R to compute their classification error. 3. Select a tree with small classification error and small size. The downside is that its complexity is exponential in the number of columns of R.
A better algorithm is ID3, a greedy algorithm in which each node reduces entropy. It involves the following steps: 1. Select an attribute for the root node, choosing the attribute with the highest information gain, and partition the tuples over the domain of the chosen attribute. 2. Recurse on each child. 3. Stop recursing on nodes with few samples. The advantage of ID3 is that it is faster than exhaustive enumeration, but it is a heuristic, so there is no guarantee of solution quality.
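A minimal sketch of the ID3 recursion, assuming tuples are represented as dictionaries with a class-label column "C" and using entropy-based information gain (the function names and the min_samples stopping threshold are illustrative):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of the class-label column for a set of tuples."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, label_attr="C"):
    """Reduction in entropy obtained by partitioning `rows` on `attr`."""
    base = entropy([r[label_attr] for r in rows])
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[attr]].append(r[label_attr])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder

def id3(rows, attrs, label_attr="C", min_samples=5):
    if not rows:
        return None
    labels = [r[label_attr] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure node, no attributes left, or too few samples.
    if len(set(labels)) == 1 or not attrs or len(rows) < min_samples:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, a, label_attr))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best],
                                label_attr, min_samples)
    return tree
```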

Clustering

Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R such that the groups are cohesive and not coupled: tuples within a group are similar to each other, and tuples across groups are dissimilar. The constraints are that the number of clusters may be given and that the clusters should be significant.

The choices for similarity measures include distance functions such as Euclidean (L2) and Manhattan (L1) metrics, string edit distance, graph distance, etc.

The choices for group representation include a center (e.g. a mean, a medoid, or a mean with a standard deviation), a boolean expression on the attributes, or a named subset of the rows.

Cluster mining

There are efficient algorithms to mine cluster centers. Given a table R(A1, A2, ..., An), find K center-points that minimize the average distance from the rows of R to their nearest center-point, subject to a constraint on the number of clusters. The center-points themselves belong to R.

A naive algorithm is exhaustive enumeration. If the domains of the attributes are small, use the following steps: 1) enumerate the cross-product of the attribute domains, 2) enumerate size-K subsets of tuples, and 3) choose the size-K subset with the smallest average distance to R.

An alternative idea is the K-means algorithm, which uses a greedy strategy: each iteration reduces the average distance to R. 1. Select K points at random. 2. Assign each tuple of R to the nearest point. 3. Compute the center of each cluster. 4. Repeat steps 2 and 3 until a local minimum is reached. The advantage of K-means is that it is faster than exhaustive enumeration, but it is a heuristic, i.e. there is no guarantee of solution quality.
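A small sketch of this greedy loop, assuming rows are numeric tuples and using Euclidean distance (the kmeans name and its parameters are illustrative, not a particular library's API):

```python
import random

def kmeans(rows, k, num_iters=20):
    """rows: list of numeric tuples; returns k center points."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    centers = random.sample(rows, k)           # step 1: pick K points at random
    for _ in range(num_iters):
        clusters = [[] for _ in range(k)]
        for r in rows:                         # step 2: assign to nearest center
            i = min(range(k), key=lambda j: dist(r, centers[j]))
            clusters[i].append(r)
        new_centers = []
        for i, cluster in enumerate(clusters):  # step 3: recompute each center
            if cluster:
                new_centers.append(tuple(sum(col) / len(cluster)
                                         for col in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep old center for an empty cluster
        if new_centers == centers:              # step 4: stop at a local minimum
            break
        centers = new_centers
    return centers
```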

Outliers

Outliers are the rows that are most dissimilar from the rest. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function, with a constraint on the number of outliers, or on their significance if given.
The choices for similarity measures between rows include distance functions such as Euclidean (L2) and Manhattan metrics, string edit distance, graph distance, etc. The choices for aggregate dissimilarity measures include the distance to the K nearest neighbours, a neighborhood density outside the expected range, and attribute differences with nearby neighbours.

Outlier mining:

A naive algorithm computes the dissimilarity measure for each row in R against the rest of R, then chooses the top-K rows, or those with scores outside the expected range.

An alternative idea is cluster-based: use the clusters as a summary of the data to reduce computational cost. It involves the following steps: 1) cluster R, e.g. via K-means; 2) compute the distance of each tuple in R to its nearest cluster center; and 3) choose the top-K rows, or those with scores outside the expected range.
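A sketch of these three steps, reusing the kmeans sketch above and treating the distance to the nearest center as the outlier score (the function name and defaults are illustrative):

```python
def cluster_based_outliers(rows, k=5, top_n=10):
    """Use the K cluster centers as a summary of R, then rank tuples
    by distance to their nearest center and report the top-N."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    centers = kmeans(rows, k)                          # step 1: cluster R
    scored = [(min(dist(r, c) for c in centers), r)    # step 2: distance to nearest center
              for r in rows]
    scored.sort(reverse=True)                          # step 3: most dissimilar first
    return [r for _, r in scored[:top_n]]
```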

The advantage of the cluster-based approach is that it is faster, but the disadvantages are the same as with clustering: it is a heuristic (there is no guarantee of solution quality) and it may miss outliers.


 

Friday, January 25, 2013

Data Mining



Applications use data mining to answer market-analysis questions such as: Which items sell together (diapers and beer)? Who buys iPhones (customer segmentation)? Should this credit card application be approved?

The idea in data mining is to apply a data-driven, inductive, backward technique to identify a model: learn the model from training data, check it against test data, and refine it if there is a mismatch between prediction and reality. This is different from forward, deductive methods, which build a model first, deduce conclusions, and then match them against data, refining the model when its predictions and reality disagree.

Data mining started in statistics and machine learning, where the software is often main-memory based and uses patterns such as classification and clustering. Database researchers contributed to this field by adding scalability to large databases, i.e. data sets that do not fit in main memory. They also introduced a new pattern family called association rules, and integration with SQL, data warehouses, etc.

The conceptual data model involves tables for items (instance-id, item-type-id, attribute1, attribute2, ...), transactions (xid, subset of item-types, customer-id, timestamp), and item-types (item-type-id, type-name, ...).

With this data model, a family of patterns called associations can be established. They involve subsets of item-types that occur together, such as a customer buying item-type IT1 also buying item-type IT2. Another example is sequential associations, where customers buying item-type IT1 later buy item-type IT2.

Another pattern family is categorization / segmentation, which partitions item-types into groups. Unsupervised clustering is the case where the number of groups is not given, as in market segmentation. Supervised classification is the case where an optional table category (instance-id, class label) is given; deciding whether to grant or decline a credit card application is an example of supervised classification.

Here are a few more examples and their classification:

People buying book b1 at Amazon.com also bought book b2 => Association pattern

Smoking causes lung cancer => Sequential association pattern

Apparel is divided into t-shirts and pants => segmentation pattern

Customers who have a credit rating above 3.0 and a balance of over a hundred dollars => supervised classification

Associations have association rules. Let I be a set of items and T be a set of transactions. An association S1 is a subset of I occurring together in T. Support(S1) is the fraction of transactions in T containing S1. Let S1 and S2 be subsets of I; then the association rule S1 -> S2 has support(S1 -> S2) = support(S1 union S2) and confidence(S1 -> S2) = support(S1 union S2) / support(S1).
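As a small worked example of these definitions (the function names and the toy transaction set are made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions in T that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def rule_support(s1, s2, transactions):
    return support(set(s1) | set(s2), transactions)

def confidence(s1, s2, transactions):
    return rule_support(s1, s2, transactions) / support(s1, transactions)

# Example rule: {diaper} -> {beer}
T = [{"diaper", "beer", "milk"}, {"diaper", "beer"}, {"milk", "bread"}, {"diaper", "milk"}]
print(rule_support({"diaper"}, {"beer"}, T))   # 2/4 = 0.5
print(confidence({"diaper"}, {"beer"}, T))     # 0.5 / 0.75 = 0.666...
```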

There are variations on association rules, including hierarchical sets of items, negative associations, and continuous attributes. Association rules can also be related to statistical concepts such as conditional probability and correlation.

Association rules apply to the logical data model as well as the physical data model. The logical data model extends the relational model and SQL: patterns are specified over a relational schema, and one can pose a data mining request to extract the patterns that meet certain criteria. An example is the data-cube model for data warehouses. Because of its popularity, there are common data models for many families of patterns and a syntax for specifying patterns, and statistical data mining is supported in SQL.

The physical data model is used when efficient algorithms to mine association rules are required. As an example, given an item-set I, a transaction set T, and thresholds, find association rules S1 -> S2 with support above a given threshold. The naive algorithm is exhaustive enumeration: generate candidate subsets of I, scan T to compute their support, and select the subsets with adequate support; the downside is that its complexity is exponential in size(I). A better algorithm is the "Apriori" algorithm. This method generates candidates in order of size, combining subsets that differ only in the last item, which results in an efficient enumeration. The idea is to prune candidate families as soon as possible: support is monotonically non-increasing in the size of a candidate, so pruning can be applied in each iteration. Sorted candidate sets are efficient for the Apriori method, and a hash tree can be used to hold candidate sets of different sizes.
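A compact sketch of the Apriori idea, assuming transactions are sets of items; the candidate join here is a simplified version of the "combine subsets that differ in one item" step, not a tuned implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Generate frequent itemsets in order of size, pruning candidates
    whose support falls below the threshold; support is monotonically
    non-increasing in candidate size, so pruning applies each iteration."""
    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in items if support_count(s) / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into k-item candidates, then keep only
        # candidates whose every (k-1)-subset is frequent before scanning T.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support_count(c) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```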

The mining itself, whether of association rules, clusters, or decision trees, always operates on the physical data model, while the patterns (association rules, clusters, and decision trees) pertain to the logical data model.

 

Tuesday, January 22, 2013

Query Processor optimizations




The query optimizer is a software module that transforms the internal representation of a query into an efficient query plan for executing it. The query plan is a flow chart through which table data is piped, and the chart contains operators such as sort, scan, and join. These operations can be very tricky to perform when the table data is very large. Consider a join of two tables where the number of records in both the left and right tables is in the order of millions. A join is a match between the records of two tables based on a common satisfying condition, for example records in both tables that have the same value in one of their columns. The query optimizer must choose an appropriate plan from its plan space for finding all such matches: it could use the records of the left table as the basis for matching against those on the right, it could switch the tables so that the right table drives the match, or it could compute a Cartesian product of all the records from both tables.

A query optimizer could choose a "left-deep" query plan and "postpone Cartesian products", ensuring that Cartesian products appear only after all the joins in a dataflow; this was the common practice several years ago. Today, systems also use "bushy" trees (joins with nested right-hand inputs) and early Cartesian products, which are useful in some cases. Both may be used by a query optimizer, and the plan space is an important consideration for the optimizer.

Another factor the query optimizer considers is selectivity estimation. The cardinality of a table or index is the simplest selectivity estimation parameter, but most systems go beyond that: they summarize and analyze the distribution of values in attributes via histograms and other summary statistics. Since this involves visiting every value in each column, it is quite expensive and is typically done by sampling instead of a complete scan of all the data. Histograms also help with the selectivity estimation of a join, where the histograms, rather than the tables themselves, are compared. There are more sophisticated schemes as well, but their adoption was hampered because a query optimizer, along with the rest of the software modules in a database server, has to meet "performance benchmarks" (an accepted industry standard for high performance), and the tests in these benchmarks mostly used statistically independent generated data instead of "real" data. Another challenge for a query optimizer is that estimation errors made early in a plan tree cascade down the tree, leading to significant performance hits or mistakes in subsequent estimations.

While talking about factors affecting query optimization, it is also worthwhile to mention search algorithms. An optimizer could choose a strategy such as dynamic programming or "top-down" search, and each has a positive and a negative effect on the optimizer's use of memory or processing cycles; top-down search, for example, can lower the number of plans considered but increases memory consumption. Another factor for consideration is parallelism, where the workload is shared across one or more CPUs to reduce the time taken for the computations.
This is measured with a parameter called the degree of parallelism (DOP).
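As an illustration of histogram-based selectivity estimation (an equi-width histogram over a sampled column; the function names and bucket count are arbitrary), a range predicate such as col < v might be estimated like this:

```python
def build_histogram(values, num_buckets=10):
    """Equi-width histogram: one (low, high, count) entry per bucket,
    built from a sample of the column rather than a full scan."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def estimate_range_selectivity(hist, upper):
    """Estimated fraction of rows with value < upper, assuming values are
    spread uniformly within each bucket."""
    total = sum(c for _, _, c in hist)
    selected = 0.0
    for low, high, count in hist:
        if upper >= high:
            selected += count                                # whole bucket qualifies
        elif upper > low:
            selected += count * (upper - low) / (high - low) # partial bucket
    return selected / total
```

The same idea extends to joins: instead of comparing the tables themselves, the optimizer compares their histograms bucket by bucket.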
MVCC plays a role in boosting performance and concurrency for in-memory applications, as opposed to OCC. In MVCC, each transaction generates a unique start-timestamp from a monotonically increasing counter, and when it ends it gets a unique end-timestamp; the transaction commits only if there is no conflicting overlap with another start/end-transaction pair. This way locking is avoided and a version is generated for each change. The role of MVCC is different from OCC, even though both may be an add-on to two-phase locking (2PL) to ensure isolation. In OCC, multiple transactions are allowed to read and update an item without blocking; instead, transactions maintain histories of their reads and writes, which are checked for isolation conflicts before the transaction commits. If a conflict is detected, one of the conflicting transactions is rolled back, and the cost of rollbacks and retries increases with the rate of conflicts. 2PL refers to the method of locking where a shared lock is obtained on all items read and an exclusive lock is acquired on all items written; a transaction blocks on a wait queue while waiting to acquire a lock, and hierarchical locks may be provided to reduce locking overhead. Talking about isolation levels, here are the ones ANSI SQL defines:
1. READ UNCOMMITTED: Any version of the data can be read without taking locks. This is achieved in a locking implementation by read requests proceeding without acquiring any locks.
2. READ COMMITTED: A transaction may read any committed version of the data. Repeated reads of an object may result in different versions being read.
3. REPEATABLE READ: A transaction will read only one version of committed data. Once the transaction reads an object, it will always read the same version of that object until the end of the transaction.
4. SERIALIZABLE: All reads and writes are serialized.

The phantom problem is described as follows: repeated accesses by a transaction see tuples that were not seen on the first access. This happens because locks are taken on individual tuples and do not prevent the insertion of new tuples. Two-phase locking of whole tables prevents phantoms, but table-level locking can be restrictive in cases where a transaction accesses only a few tuples via an index.

Other than the ANSI SQL levels, these additional levels are also popular.
CURSOR STABILITY: This is used to fix the "lost update" anomaly of READ COMMITTED. A lost update occurs when the changes made by one transaction are lost because another transaction makes an update. It is solved by holding a lock on the item most recently read through a cursor; the lock is dropped when the cursor moves or the transaction terminates.
SNAPSHOT ISOLATION: This uses MVCC to version every change made by a transaction and to prevent conflicting overlap between any two start/end-transaction pairs.
READ CONSISTENCY: This is another MVCC-based scheme in which reads are versioned at the statement level, so each statement sees the most recently committed values as of its start and uses that logical version. A single transaction may therefore use many different logical versions. Only the most recent version is stored; older versions are reconstructed from undo log records. Modifications are protected by long-term write locks, and the first writer wins.
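A toy sketch of the timestamp bookkeeping described above (not any particular engine's implementation; the Transaction class and module-level state are made up for illustration): a transaction takes a start-timestamp from a monotonic counter, takes an end-timestamp at commit, and is validated against transactions that committed during its lifetime and wrote overlapping items.

```python
import itertools

_counter = itertools.count(1)
_committed = []   # (start_ts, end_ts, write_set) for committed transactions

class Transaction:
    def __init__(self):
        self.start_ts = next(_counter)   # unique start-timestamp
        self.write_set = set()

    def write(self, key, value):
        # In a real MVCC engine a new version of `key` would be created here;
        # no lock is taken and readers keep seeing the old version.
        self.write_set.add(key)

    def commit(self):
        end_ts = next(_counter)          # unique end-timestamp
        for _start, end, writes in _committed:
            # Another transaction committed while we were running and wrote a
            # key we also wrote: abort instead of blocking (first writer wins).
            if end > self.start_ts and writes & self.write_set:
                raise RuntimeError("validation failed; transaction aborted")
        _committed.append((self.start_ts, end_ts, self.write_set))
        return end_ts
```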
 

In-memory applications

Now back to the post on in-memory applications. Let's look at some data structures that lend themselves to concurrent programming better than traditional data-store structures. Skip lists and balanced search trees are suitable because of their per-node locks and the order in which the locks are acquired. Let's take a look at skip lists and see how they can be faster than, say, B+-trees. First, skip lists have a lock on each node for the purposes of localized updates. Second, locks are needed on only two or three nodes at a time for an insert or delete operation. Third, locks are acquired in the order of traversal, so readers and writers traversing the skip list will not pass each other, and locks taken in a consistent order will not cause deadlocks. In a balanced search tree such as a red-black tree, each node also has a lock, but an insert or delete does not acquire all the locks up to the root, as this would throttle concurrency; the rotations and recoloring required for rebalancing are performed while holding only three or four node locks at a time. This is a bit more involved than skip lists.
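For reference, a bare-bones (single-threaded) skip list sketch; the comments note where the per-node locks of a concurrent variant would be taken, but the locking itself is omitted to keep the example short:

```python
import random

class SkipNode:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level     # one forward pointer per level
        # A concurrent variant would also hold a lock per node; an insert
        # locks only the few predecessor nodes it relinks, in traversal order.

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipNode(None, None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):      # drop down level by level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node and node.key == key else None

    def insert(self, key, value):
        update = [self.head] * self.MAX_LEVEL        # predecessors per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node                         # locks would be taken here
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, value, lvl)
        for i in range(lvl):                         # splice in, bottom level upward
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```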
As a practical example, for search over text where the words have been stored as tuples of a word and the list of its occurrences, a skip list supports searching over a range of words by finding adjacent nodes, while still allowing fast search, insert, and delete operations. Data parallelism can also be used to improve performance. Concurrency is as much about hardware as it is about code: if the architecture requires memory synchronization between threads so that changes made by one are visible to another, there is a cost involved. In a shared-nothing model, this cost is removed; nodes can be added to or removed from a cluster without affecting other operations, and a system whose concurrency is distributed over shared-nothing servers can scale out to several thousand nodes. Concurrency, however, is not the only way to improve performance.

Sunday, January 20, 2013

Search over text is enabled by creating tuples of (word, documentID, position) and building a B+-tree index over the word column. Words may need to be canonicalized, and additional per-tuple attributes can aid in rank-ordering search results. For improved performance, one optimization is to have each word appear once per tuple with a list of its occurrences, as in (word, list<documentID, position>); this helps especially given the skewed distribution of words in documents. However, the text search implementation described above may run an order of magnitude slower than custom text indexing engines, both for full-text documents and for short textual attributes in tuples. In most cases, the full-text index is updated asynchronously ("crawled") rather than being maintained transactionally, and the semantics of relational queries need to be bridged with ranked document search results.
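A minimal sketch of building such per-word occurrence lists (the build_postings name and the crude lower-casing "canonicalization" are illustrative only):

```python
from collections import defaultdict

def build_postings(documents):
    """documents: {doc_id: text}. Returns one entry per word with the list
    of its occurrences, i.e. word -> [(doc_id, position), ...]."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):  # crude canonicalization
            postings[word].append((doc_id, position))
    return postings

docs = {1: "the quick brown fox", 2: "the lazy dog"}
index = build_postings(docs)
print(index["the"])    # [(1, 0), (2, 0)]
```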

Log Manager

This may not be relevant to the previous blog post, but it is one of the things I came across today, so I am blogging about it here. The log manager in a DBMS ensures that committed changes are durable, helps with the rollback of aborted changes, and supports recovery from system failure. The standard way to do this is called Write-Ahead Logging (WAL). The WAL protocol states three rules, sketched below:
1) Each change to the database should generate a log record.
2) Database log records should be flushed in order.
3) Log record must be flushed before the commit request for the change (or transaction) completes.
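A toy illustration of the three rules (not a real log manager; the record format and class name are made up): each change appends a record with a new LSN, records reach the single log file in order, and the log is flushed before the commit returns.

```python
import os

class WriteAheadLog:
    def __init__(self, path="wal.log"):
        self.f = open(path, "a", buffering=1)      # single append-only log file
        self.lsn = 0                               # monotonically increasing LSN

    def log_update(self, txn_id, page, before, after):
        self.lsn += 1                              # rule 1: one record per change
        self.f.write(f"{self.lsn} UPDATE txn={txn_id} page={page} "
                     f"before={before} after={after}\n")
        return self.lsn

    def commit(self, txn_id):
        self.lsn += 1
        self.f.write(f"{self.lsn} COMMIT txn={txn_id}\n")
        self.f.flush()
        os.fsync(self.f.fileno())                  # rule 3: flush before acknowledging
        # rule 2 (in-order flushing) follows from appending to a single file

wal = WriteAheadLog()
wal.log_update(txn_id=7, page=42, before="x=1", after="x=2")
wal.commit(txn_id=7)                               # durable once this returns
```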
These rules of Write-Ahead Logging are not restricted to databases; they are used by file systems as well. Logging faces challenges when high performance is desired. To enable a so-called fast path, database systems use a "direct, steal / not-force" mode, which states that
1) the data objects are updated in place
2) unpinned buffer pool frames can be "stolen"
3) buffer pool pages need not be forced (flushed) before a commit.
The log manager then has to handle both undoing the flushes of stolen pages from aborted transactions and redoing the changes to not-forced pages of committed transactions.
Another challenge is to keep the log records small. The log manager could log logical operations instead of physical ones to reduce log volume, but then redo and undo become onerous. In practice, physical operations are logged to support redo, while logical operations are used to support undo. Log records are written with increasing log sequence numbers (LSNs), which is how the state at the time of a crash is reconstructed during recovery and the transactions in flight at that point are rolled back.
To optimize crash recovery, not all of the history is replayed. The start point for recovery is chosen as the older of two log records: 1) the one describing the change to the oldest dirty page in the buffer pool, and 2) the one describing the start of the oldest transaction in the system. This start point is called the recovery LSN, and it is computed at checkpoints. Since recovery from an old recovery LSN could take a long time, reasonably frequent checkpoints are suggested. Another way to make this efficient is to write out buffer pool pages asynchronously instead of synchronously. Logging and recovery work not only on data pages but also on a variety of other information related to internal data structures.