Sequential access to data is at least an order of magnitude faster than random access, and the gap has been widening. As a result, a DBMS storage manager finds it critical to place blocks on disk so that queries can access them sequentially. Since the DBMS knows the workload access patterns better than the underlying system, it needs to exercise full control over the spatial positioning of database blocks on disk. The most direct way for the DBMS to enforce spatial locality is to store the data on the raw device. However, this uses up entire disk partitions and the interfaces are OS specific. Instead, developments like RAID and SAN have made virtual devices more appealing. Consequently, the storage manager typically allocates a single large file and places blocks directly within it, treating the file as a linear array of disk-resident pages. DBMS vendors also allow the database size to be customized to suit the workload.
In addition to where the data is written, when the data is written is equally important. A DBMS will try to postpone or reorder writes, and this can conflict with the OS read-ahead and write-behind behavior. Write-ahead logging is required by the DBMS to provide durability and correctness. Another source of conflict is the impact of OS buffering on DBMS I/O optimizations: the DBMS knows the access pattern from the query plan, whereas the OS only sees bytes. Consequently, the double buffering between the DBMS and the OS is redundant. The cost of the extra copying is often dismissed, yet the major bottleneck for throughput in a well-tuned transaction processing system is not I/O: with sufficient disks and ample RAM the processors can be kept supplied with data, so the copies matter. Operating systems now support options to turn off this additional buffering, which ensures that writes go to disk when requested.
To provide access to database pages, the DBMS implements a buffer pool. Along with the array of buffer frames, it maintains a hash table that maps the page numbers currently held in memory to their location in the frame table, the location of the page on disk, and metadata about the page such as a dirty bit and any information needed by the page replacement policy. Active pages are "pinned" in memory, and these are typically a small fraction of the overall buffer pool. Since the contents of a page can be diverse, attention focuses on the page replacement policy instead. With 64-bit addressing and falling memory prices, very large buffer pools are now possible. Flash memory is also making its impact, with a cost/performance tradeoff relative to disk and RAM. Another way to improve data residency has been compression, but this requires representations that are amenable to the data processing and query processing internals. Finally, column-oriented storage is being adopted in the non-traditional database market.
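To make the buffer-pool bookkeeping above concrete, here is a minimal Python sketch; the class name, the fixed frame count, and the LRU stand-in for the replacement policy are illustrative assumptions, not any particular DBMS's design.

from collections import OrderedDict

class BufferPool:
    # page_table maps page_no -> {"frame": page bytes, "dirty": bool, "pins": int};
    # the OrderedDict order doubles as an LRU approximation of the replacement policy.
    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.page_table = OrderedDict()

    def get_page(self, page_no, read_from_disk):
        if page_no in self.page_table:
            self.page_table.move_to_end(page_no)      # hit: refresh LRU position
        else:
            if len(self.page_table) >= self.num_frames:
                self._evict()
            self.page_table[page_no] = {"frame": read_from_disk(page_no),
                                        "dirty": False, "pins": 0}
        entry = self.page_table[page_no]
        entry["pins"] += 1                            # pin while the caller uses the page
        return entry["frame"]

    def unpin(self, page_no, dirtied=False):
        entry = self.page_table[page_no]
        entry["pins"] -= 1
        entry["dirty"] = entry["dirty"] or dirtied

    def _evict(self):
        for page_no, entry in list(self.page_table.items()):   # oldest unpinned page first
            if entry["pins"] == 0:
                # a real pool would first write a dirty page back (after its WAL record)
                del self.page_table[page_no]
                return
        raise RuntimeError("all pages are pinned")

pool = BufferPool()
data = pool.get_page(7, read_from_disk=lambda n: bytes(8192))
pool.unpin(7, dirtied=True)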
Saturday, February 2, 2013
Friday, February 1, 2013
Data Warehouses
Data warehouses are large historical databases that get loaded with data periodically. They are then used for decision support, which accounts for nearly one third of all DBMS activity, and they have come to require custom query optimization and execution engine support. The first difference from operational systems is that warehouses are not suited to online transaction processing (OLTP), where response time is critical; historical data, by contrast, is used for analysis such as deciding which items to put on promotion and what layout matters for the upcoming season, on the wisdom that historical information enables better stock management and more savings. The second difference is that such databases use a distinctive schema, such as a star, multi-level star or snowflake schema, in which a central fact table references dimensions such as customers, products, stores and times; many dimensions are naturally hierarchical, hence the multi-level star or snowflake. The third difference is that they require very different data structures from B+-trees, which are optimized for fast insertion, deletion and update of records; in contrast, a data warehouse performs an initial load and then the data is static for months. Bitmaps store this information in far fewer bits and also support sophisticated bitmap arithmetic, including conjunctive filters (a toy illustration follows this paragraph). The fourth difference is that these databases require fast periodic loads of large amounts of data, not only accumulated data but also data pulled from operational systems, say at night. Loads and queries can conflict, so techniques beyond update-in-place are needed to support historical queries: updates are timestamped, and a few vendors provide MVCC isolation. The fifth difference is that joins on the data are very costly to perform, so there is a need to capture a view and persist it until the next update. The sixth difference is that these databases support predictable queries consisting of various aggregates; these aggregates, often called data cubes, are a special class of materialized views that allow the user to navigate the data. Lastly, such systems benefit from special hardware, and some vendors like Teradata and Netezza provide proprietary hardware.
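As a toy illustration of the bitmap idea, with made-up column names and values, one bit per row is kept for each attribute value so that a conjunctive filter reduces to a bitwise AND:

rows = [
    {"region": "west", "promo": "yes"},
    {"region": "east", "promo": "yes"},
    {"region": "west", "promo": "no"},
    {"region": "west", "promo": "yes"},
]

def bitmap(column, value):
    bits = 0
    for i, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << i        # set bit i when row i has the value
    return bits

# region = 'west' AND promo = 'yes' selects rows 0 and 3
matches = bitmap("region", "west") & bitmap("promo", "yes")
print([i for i in range(len(rows)) if matches & (1 << i)])    # [0, 3]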
Data archival landscape
Data continues to explode, and companies are forced to retain it over ever longer periods of time. Consequently, companies look for more capable archival solutions, and there are many vendors and technologies in this space. Many have overlapping ideas or technology propositions, and the picture can become quite murky with the jargon each one uses. Here is a glossary of a few key concepts.
1. NAS: Network Attached Storage is probably the most common form of large-scale storage for files and clusters. These are shared-nothing commodity servers joined by a high-speed network so that they present a combined storage space for the data. As opposed to a SAN (Storage Area Network), a NAS uses an IP network to connect the servers; SANs are typically used for large databases.
2. Appliance: A storage appliance appears logically as a drive to the systems that interact with it. Physically, it may consist of a server chassis that allows disk drives to be added and removed. Appliances can grow very large in both drive count and capacity. Moreover, they can comprise complex organizations of servers and drives in a scale-out architecture to support generational data and archival. Their appeal includes low maintenance and little or no custom application integration.
3. Tiering: Tiering is another solution from the storage industry that lets policies be specified for the generational mark-down of data and its movement between tiers. This allows hardware to be differentiated to suit different kinds of storage traffic. By providing tiers, storage space is prioritized based on media cost and usage.
Thursday, January 31, 2013
Threads and Processes
"An operating system thread is a unit of control without additional private OS context and without a private address space. Each OS thread has full access to the memory of the other threads executing within the same multithreaded OS process. Thread execution is scheduled by the operating system kernel scheduler and these threads are often called "kernel threads" or k-threads.
A Lightweight Thread package is an application-level construct that supports multiple threads within a single OS process. Unlike OS threads scheduled by the OS, lightweight threads are scheduled by an application-level thread scheduler. The difference between a lightweight thread and a kernel thread is that a lightweight thread is scheduled in user-space without kernel scheduler involvement or knowledge. The combination of the user-space scheduler and all of its lightweight threads run within a single OS process and appears to the OS scheduler as a single thread of execution. Lightweight threads have the advantage of faster thread switches when compared to OS threads since there is no need to do an OS kernel mode switch to schedule the next thread. Lightweight threads have the disadvantage, however, that any blocking operation such as a synchronous I/O by any thread will block all threads in the process. This prevents any of the other threads from making progress while one thread is blocked waiting for an OS resource. Lightweight thread packages avoid this by (1) issuing only asynchronous (non-blocking) I/O requests and (2) not invoking any OS operations that could block. Generally, lightweight threads offer a more difficult programming model than writing software based on either OS processes or OS threads.
Some DBMSs implement their own lightweight thread (LWT) packages. These are a special case of general LWT packages. We refer to these threads as DBMS threads and simply threads when the distinction between DBMS, general LWT, and OS threads is unimportant to the discussion.
A DBMS client is the software component that implements the API used by application programs to communicate with a DBMS. Some example database access APIs are JDBC, ODBC, and OLE/DB. In addition, there are a wide variety of proprietary database access API sets. Some programs are written using embedded SQL, a technique of mixing programming language statements with database access statements. This was first delivered in IBM COBOL and PL/I and, much later, in SQL/J which implements embedded SQL for Java. Embedded SQL is processed by preprocessors that translate the embedded SQL statements into direct calls to data access APIs. Calls made to these APIs are marshaled by the DBMS client component and sent to the DBMS over some communications protocol. The protocols are usually proprietary and often undocumented. In the past, there have been several efforts to standardize client-to-database communication protocols, with Open Group DRDA being perhaps the best known, but none have achieved broad adoption.
A DBMS worker is the thread of execution in the DBMS that does work on behalf of a DBMS client. A 1:1 mapping exists between a DBMS worker and a DBMS client: the DBMS worker handles all SQL requests from a single DBMS client. The DBMS client sends SQL requests to the DBMS server. The worker executes each request and returns the result to the client."
Reference: Architecture of a database system - Hellerstein, Stonebraker, Hamilton
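A loose Python analogue of the lightweight-thread point above (not any DBMS's implementation): asyncio schedules many coroutines inside one OS thread, so a single blocking call stalls all of them, while the non-blocking equivalent does not.

import asyncio

async def worker(name):
    await asyncio.sleep(0.1)   # non-blocking: the other coroutines keep running
    # time.sleep(0.1)          # blocking: would freeze the whole single-threaded event loop
    print(name, "done")

async def main():
    await asyncio.gather(*(worker(f"w{i}") for i in range(3)))

asyncio.run(main())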
Wednesday, January 30, 2013
Python NLTK processing
NLTK, short for Natural Language Toolkit, is a library available in the Python programming language. It defines the following constructs (a short usage sketch follows the list):
concordance: this provides every occurrence of a word in the text along with its surrounding context.
similar: this provides the words that appear in the same range of contexts as the word given to concordance.
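A short usage sketch, assuming NLTK and its "book" sample corpora are installed (via nltk.download('book')):

from nltk.book import text1     # Moby Dick, one of the bundled sample texts

text1.concordance("monstrous")  # every occurrence of the word with surrounding context
text1.similar("monstrous")      # words that appear in a similar range of contexts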
Tuesday, January 29, 2013
Lock free
Lock primitives such as mutual exclusion locks are popular, but when they number more than a handful their benefits wane. Programmers try to maintain correctness in the face of conflicting operations, but in serializing those they often serialize non-conflicting operations as well. Similarly, programmers try to preserve liveness, yet locks end up held longer than would otherwise be necessary. Without scheduler support, programmers also need to be aware of priority inversion. Lastly, for high performance, programmers must balance the granularity at which locking operates against the time the application spends acquiring and releasing locks.
Hence, three different APIs are proposed by Fraser and Harris, each of which attempts to address the shortcomings mentioned above. Further, they are non-blocking and generally allow disjoint-access parallelism. They all follow a common optimistic style.
The first API provides multi-word compare-and-swap (MCAS), which generalizes the single-word CAS operation found on many processors. It atomically updates one or more memory locations from a set of expected values to a set of new values. This API comes with two operations, MCASRead and MCAS: the former allows the values to be read, while the latter performs the atomic multi-word update.
The second API provides word-based software transactional memory (WSTM), which avoids some of these problems by allowing a series of reads and writes to be grouped as a software transaction and applied to the heap atomically. This API comes with two operations, WSTMRead and WSTMWrite, one for reading and one for writing.
The third API provides an object-based software transactional memory (OSTM), which allows a thread to 'open' a set of objects for transactional access and, once more, to commit updates to them atomically. Each object is accessed through an OSTMHandle, which must be passed to an OSTMOpenForReading or OSTMOpenForWriting call in order to obtain access to the underlying data.
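To make the MCAS interface shape concrete, here is a hypothetical Python sketch; since Python exposes no hardware CAS, a lock serializes the operation purely to illustrate the semantics, whereas Fraser and Harris's actual designs are lock-free.

import threading

_mcas_lock = threading.Lock()   # stand-in for lock-free machinery, illustration only

def mcas(locations, expected, new):
    """Atomically move every cell in `locations` from `expected` to `new`,
    or change nothing and return False if any cell holds an unexpected value."""
    with _mcas_lock:
        if any(loc[0] != exp for loc, exp in zip(locations, expected)):
            return False
        for loc, val in zip(locations, new):
            loc[0] = val
        return True

# Each shared word is modeled as a one-element list.
a, b = [0], [10]
print(mcas([a, b], expected=[0, 10], new=[1, 11]))    # True: both words updated together
print(mcas([a, b], expected=[0, 10], new=[2, 12]))    # False: the values have moved on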
As a comparison between the three APIs, here are some criteria to use:
First, disjoint-access parallelism is provided by MCAS when accessing disjoint sets of words, and probabilistically by WSTM; OSTM provides it when accessing disjoint sets of objects.
Second, MCAS offers no read parallelism, whereas both WSTM and OSTM allow it.
Third, as a space cost, MCAS reserves two bits in each word, WSTM uses a fixed-size table of 65,536 double-word entries, and OSTM uses one word in each object handle.
Fourth, MCAS offers no composability, whereas both WSTM and OSTM are composable.
Saturday, January 26, 2013
Data Mining (Continued)
Decision Tree
A useful data structure for mining the logical data model is the decision tree.
The structure is as follows: interior nodes test one of a set (A1, ..., An) of categorical attributes, each leaf carries a class label from domain(C), and each edge carries a value from domain(Ai) for the attribute Ai associated with the parent node. The tree's property is that of a search tree: it associates the tuples of R with leaves, i.e. class labels. The advantages of a decision tree are that it can work with heterogeneous data and that the decision boundaries are parallel to the axes.
Decision Tree Mining
There are efficient algorithms to mine decision trees. Given a table R(A1, A2, ..., An, C), find a decision tree that minimizes classification error. The constraints are to prefer a smaller tree and to avoid unique columns.
One algorithm is naive exhaustive enumeration. It involves the following steps: 1. Generate candidate trees. 2. Scan R to compute their classification error. 3. Select a tree with small classification error and small size. The downside is that its complexity is exponential in the number of columns of R.
A better algorithm is ID3. ID3 is a greedy algorithm in which each node reduces entropy. It involves the following steps: 1. Select an attribute for the root node: choose the attribute with the highest information gain and partition the tuples over the domain of that attribute. 2. Recurse on each child. 3. The stopping criterion is to not recurse on nodes with few samples. The advantage of ID3 is that it is faster than exhaustive enumeration, but the disadvantage is that it is a heuristic, i.e. there is no guarantee of solution quality. A compact sketch follows.
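A compact sketch of these ID3 steps, assuming rows are dictionaries of categorical attributes with the class label stored under a made-up key "C":

import math
from collections import Counter

def entropy(rows, label="C"):
    counts = Counter(r[label] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, label="C"):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, label)
    return entropy(rows, label) - remainder

def id3(rows, attrs, label="C", min_samples=2):
    # stop on a pure node, when no attributes remain, or when there are few samples
    if len({r[label] for r in rows}) == 1 or not attrs or len(rows) < min_samples:
        return Counter(r[label] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, label))   # highest gain
    children = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        children[value] = id3(subset, [a for a in attrs if a != best], label, min_samples)
    return {"attr": best, "children": children}

rows = [
    {"outlook": "sunny", "windy": "no",  "C": "play"},
    {"outlook": "sunny", "windy": "yes", "C": "stay"},
    {"outlook": "rainy", "windy": "yes", "C": "stay"},
    {"outlook": "rainy", "windy": "no",  "C": "play"},
]
print(id3(rows, ["outlook", "windy"]))   # splits on "windy", which has the higher gain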
Clustering
Clustering is a technique for the categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find groups of rows in R such that the groups are cohesive and not coupled: tuples within a group are similar to each other, while tuples across groups are dissimilar. The constraints are that the number of clusters may be given and that the clusters should be significant.
The choices for similarity measures include distance functions such as Euclidean, Manhattan, string edits and graph distance, as well as L2 metrics.
The choices for group representations include a center, found via the mean, a medoid, or a (mean, standard deviation) pair, or a boolean expression on attributes, or named subsets of rows.
Cluster mining
There are efficient algorithms to mine cluster centers. Given a table R(A1, A2, ..., An), find clusters described by, say, center points, minimizing the average distance from the rows of R to their nearest center points, with a constraint on the number of clusters. The center points belong to R.
A naive algorithm is exhaustive enumeration. If the domains of the attributes are small, use the following steps: 1) enumerate the cross-products of the domains of the attributes, 2) enumerate size-K subsets of tuples, and 3) choose the size-K subset with the smallest average distance to R.
An alternative is the K-means algorithm, which uses a greedy strategy: each iteration reduces the average distance to R. 1. Select K points at random. 2. Assign the tuples of R to the nearest point. 3. Compute the center of each cluster. 4. Repeat steps 2 and 3 until a local minimum is reached. The advantage of K-means is that it is faster than exhaustive enumeration, but it is a heuristic, i.e. there is no guarantee of solution quality. A sketch follows.
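A minimal K-means sketch over numeric rows (lists of floats), following the greedy steps above; K, the iteration count and the seed are illustrative parameters.

import random

def kmeans(rows, k, iterations=10, seed=0):
    random.seed(seed)
    centers = [list(c) for c in random.sample(rows, k)]   # 1. select K points at random
    for _ in range(iterations):                           # 4. repeat steps 2 and 3
        clusters = [[] for _ in range(k)]
        for row in rows:                                  # 2. assign each tuple to its nearest center
            distances = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            clusters[distances.index(min(distances))].append(row)
        for i, cluster in enumerate(clusters):            # 3. recompute the center of each cluster
            if cluster:
                centers[i] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centers

rows = [[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]]
print(kmeans(rows, k=2))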
Outliers
Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows of R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function, with a constraint on the number of outliers, or on their significance if given.
The choices for similarity measures between rows include distance functions such as Euclidean, Manhattan, string edits and graph distance, as well as L2 metrics. The choices for aggregate dissimilarity measures include the distance to the K nearest neighbours, the density of the neighborhood being outside the expected range, and attribute differences with nearby neighbours.
Outlier mining:
A naive algorithm computes the dissimilarity measure for each row in R against the rest of R and then chooses the top-K rows, or those with scores outside the expected range.
An alternative idea is clustering based, where the clusters serve as a summary of the data to reduce computational cost. It involves the following steps: 1. Cluster R, e.g. via K-means. 2. Compute the distance of each tuple in R to its nearest cluster center. 3. Choose the top-K rows, or those with scores outside the expected range.
The advantage of the cluster-based approach is that it is faster, but the disadvantages are the same as with clustering: it is a heuristic (there is no guarantee of solution quality) and it may miss outliers. A sketch of the cluster-based steps follows.
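A sketch of the clustering-based outlier steps: given cluster centers (here assumed to come from a K=2 clustering of made-up data), score each row by the distance to its nearest center and report the top-K rows.

def top_k_outliers(rows, centers, k):
    def nearest_center_distance(row):
        return min(sum((a - b) ** 2 for a, b in zip(row, c)) ** 0.5 for c in centers)
    return sorted(rows, key=nearest_center_distance, reverse=True)[:k]

rows = [[1.0, 1.0], [1.1, 0.9], [8.0, 8.0], [8.1, 7.9], [40.0, 40.0]]
centers = [[1.05, 0.95], [8.05, 7.95]]     # assumed output of a K=2 clustering
print(top_k_outliers(rows, centers, k=1))  # [[40.0, 40.0]]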