Wednesday, February 20, 2013

finding answers quickly

Often when we search, we are really interested in the first few results rather than in all of the results. There is an urgency for the user to get only the first few or the best few answers quickly, and rarely does the user go past the second or third page of results. If users do not find what they want, they usually refine the query and resubmit it. This is true not just for search but also for decision support. Another pattern is when the user wants to avoid writing a complex query and instead writes one that gives an approximate answer quickly, then refines the query continually until an exact answer is found.
Top N queries refer to the first of these patterns. It is very helpful for users to be able to indicate explicitly how many answers they want, since that makes it possible for the DBMS to optimize execution. As an example, one could request the top 10 products by sales.
Here the optimizer would add a predicate that checks for sales to be greater than a cutoff value. Histograms can be used to estimate a cutoff that reflects the top values. The cutoff value can then be refined if the qualifying tuples turn out to be fewer or more than the number the user asked for, but the approach is based on pre-computation.
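Here is a minimal LINQ sketch of that cutoff idea; the sales figures, the value of N, and the histogram-derived cutoff are all hypothetical stand-ins:

using System;
using System.Linq;

int[] sales = { 120, 45, 310, 87, 250, 500, 15, 220, 90, 410, 60, 330 };
int n = 10;
int cutoff = 50; // pretend a histogram estimated this lower bound for the top N

// The optimizer's trick: apply the cutoff predicate first, so only the
// survivors need to be sorted before taking the top N.
var topN = sales.Where(s => s > cutoff).OrderByDescending(s => s).Take(n).ToList();

// If fewer than n tuples survive, the cutoff was too aggressive and has
// to be lowered before rescanning; if far more survive, it can be raised.
Console.WriteLine(topN.Count < n ? "cutoff too high; refine and rescan" : string.Join(", ", topN));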
Another approach is online aggregation, where the quality of the answer is refined continually as the computation progresses. First, the aggregate for a group is reported as lying within a range at some confidence level; then user-input priorities determine the order in which tuples are processed; finally, non-blocking algorithms are used for the relational operators. The sort-merge join algorithm, for example, blocks because sorting requires all of its input tuples, so nested loops join and hash join are preferable to sort-merge join for online aggregation.
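A toy sketch of the reporting side of online aggregation: a running average over the tuples seen so far, with a rough 95% interval that narrows as more tuples arrive. The values and the normal-approximation interval are illustrative assumptions, not how any particular engine computes it:

using System;

double[] values = { 10, 12, 9, 14, 11, 13, 10, 15, 12, 11 }; // streamed tuples
double sum = 0, sumSq = 0;
int n = 0;

foreach (var v in values)
{
    n++; sum += v; sumSq += v * v;
    if (n < 2) continue;
    double mean = sum / n;
    // Sample variance and a rough 95% half-width (1.96 standard errors);
    // the interval shrinks as more tuples are consumed.
    double variance = (sumSq - n * mean * mean) / (n - 1);
    double halfWidth = 1.96 * Math.Sqrt(variance / n);
    Console.WriteLine($"after {n} tuples: avg = {mean:F2} +/- {halfWidth:F2}");
}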

Tuesday, February 19, 2013

Distributed Databases (2)

Points to review:
Parallel databases evaluate queries in parallel in order to improve performance. Parallel queries can be executed on shared-memory, shared-disk, and shared-nothing systems; in the shared-nothing system, near-linear speed-up and scale-up are possible.
In pipelined parallelism, one operator consumes the output of another operator.
In data-partitioned parallelism, the input data is partitioned and we work on each partition in parallel.
Data can be partitioned with round-robin partitioning, where, say, tuple i goes to processor i mod n; with hash partitioning, where a hash function distributes the tuples; or with range partitioning, where the tuples are distributed over n search-key ranges. Existing code can be parallelized by introducing split and merge operators.
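A minimal sketch of the three partitioning schemes; the number of processors and the range boundaries are arbitrary choices:

using System;

const int N = 4; // number of processors (hypothetical)

// Round-robin: tuple i goes to processor i mod N.
int RoundRobin(int tupleIndex) => tupleIndex % N;

// Hash partitioning: a hash of the partitioning key picks the processor.
int HashPartition(string key) => (key.GetHashCode() & int.MaxValue) % N;

// Range partitioning: the key falls into one of N search-key ranges.
int[] bounds = { 100, 200, 300 }; // hypothetical range boundaries
int RangePartition(int key)
{
    for (int i = 0; i < bounds.Length; i++)
        if (key < bounds[i]) return i;
    return bounds.Length;
}

Console.WriteLine(RoundRobin(7));        // 3
Console.WriteLine(HashPartition("abc")); // some processor in [0, N)
Console.WriteLine(RangePartition(250));  // 2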
Shared nothing architecture lends itself to data partitioning.
For query optimization, we exploit parallelism within each operator through data partitioning and use pipelined parallelism to exploit parallelism between operators.
In a distributed database, data is stored across several locations. When the user does not have to know the location of the data, we achieve distributed data independence. When there is no difference between distributed transactions and local transactions, we achieve distributed transaction atomicity. If a transaction involves activities at different sites, the activity at a given site is called a subtransaction.
In distributed databases, a single relation might be fragmented or replicated across several sites. Horizontal and vertical fragmentation may apply to  rows and columns respectively. Fragments need qualifying names. Distributed catalog management is used to keep track of what is stored where.
When processing queries in a distributed DBMS, the location of the partitions of the relations needs to be taken into account. Two relations that reside on different sites can be joined by sending one relation to the other site and performing the join locally. The choice of sites could be based on the number of tuples to select or join. Semijoins and Bloomjoins reduce the number of tuples sent across the network. In such query processing, communication costs and the autonomy of individual sites matter.
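A small sketch of the Bloomjoin idea: site 1 ships a bit vector built from its join keys instead of shipping tuples, and site 2 uses it to discard tuples that cannot join. The key values and the vector size are hypothetical:

using System;
using System.Collections;
using System.Linq;

const int Bits = 1024; // size of the bit vector
int Slot(int key) => (key.GetHashCode() & int.MaxValue) % Bits;

int[] site1Keys = { 3, 17, 42, 99 };      // join column at site 1
int[] site2Keys = { 5, 17, 42, 77, 123 }; // join column at site 2

// Site 1 hashes its join keys into a bit vector and ships only the vector.
var filter = new BitArray(Bits);
foreach (var k in site1Keys) filter[Slot(k)] = true;

// Site 2 drops tuples whose key cannot possibly match; the survivors
// (plus rare false positives) are sent back to complete the join.
var candidates = site2Keys.Where(k => filter[Slot(k)]).ToArray();
Console.WriteLine(string.Join(", ", candidates)); // 17, 42 (possibly extras)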
In replication, we store several copies of a relation or partition on different sites. In synchronous replication, all copies of a replicated relation are updated before the transaction commits. In asynchronous replication, copies are only updated periodically. Synchronous replication involves voting, where an update must write a majority of the copies and a read must access enough copies to be sure that one of them is current. Asynchronous replication can be peer-to-peer, where more than one copy is updatable and a conflict resolution strategy deals with conflicting changes. Alternatively, in primary-site replication there is one primary copy that is updatable and the secondary copies cannot be updated; changes on the primary are captured and then applied to the other sites.
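The voting rule is just quorum arithmetic: the read and write sets must overlap. A tiny check, with a hypothetical ten copies:

using System;

int copies = 10;     // total copies of the relation
int writeQuorum = 6; // an update must write a majority of the copies
int readQuorum = 5;  // a read must access this many copies

// Any read set of 5 overlaps any write set of 6, so at least one copy
// seen by the read is guaranteed to be current.
Console.WriteLine(writeQuorum + readQuorum > copies); // True
// Two write sets also overlap, so conflicting updates are serialized.
Console.WriteLine(2 * writeQuorum > copies);          // True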
Locking can be either central at a designated primary copy or fully distributed. Distributed deadlock detection is required.
Recovery is done with a commit protocol that coordinates activities at the different sites involved in the transaction. In two-phase commit, each transaction has a designated coordinator site, and subtransactions are executed at subordinate sites. A subordinate that has voted yes blocks until the coordinator site recovers.
Three-phase commit can avoid blocking even if the coordinator site fails. In 3PC, the coordinator site sends out prepare messages and receives yes votes from all subordinates. It then sends out pre-commit messages rather than commit messages. When a sufficient number of acknowledgements have been received, the coordinator force-writes a commit log record and sends a commit message to all the subordinates. The pre-commit message lets the sites communicate with each other without waiting for the coordinator to recover.
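A hypothetical single-process sketch of the coordinator's flow in 3PC; real messages, timeouts, and site recovery are omitted, and each subordinate is modeled as a function returning its vote:

using System;
using System.Linq;

string Coordinate(Func<bool>[] subordinates)
{
    // Phase 1: send PREPARE and collect yes/no votes.
    if (!subordinates.All(vote => vote())) return "ABORT";

    // Phase 2: send PRE-COMMIT rather than COMMIT, and count the acks.
    int acks = subordinates.Length; // assume every subordinate acknowledges

    // Phase 3: with enough acks, force-write the commit log record and
    // send COMMIT to all subordinates.
    return acks == subordinates.Length ? "COMMIT" : "ABORT";
}

var sites = new Func<bool>[] { () => true, () => true, () => true };
Console.WriteLine(Coordinate(sites)); // COMMIT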


Monday, February 18, 2013

Distributed databases

What features does the database server need for distributed databases?

Distributed databases face the challenge of dealing with heterogeneity and autonomy. Logically, these databases are connected via links. Physically, they share storage and query processing. In addition, recovery and concurrency control now span linked servers.

Distributed databases originated with internet and mobile computing. The motivation was to provide distributed computing over LANs, WANs, the internet, etc., to boost availability and reliability via redundancy, and performance via distribution of work.

The computing environment for distributed databases can vary from homogeneous nodes in a shared-nothing architecture to more centralized or truly distributed heterogeneous computers. In such cases, the transparency of computing, the networking costs, and the replication, fragmentation, and allocation of data become important considerations, and site and communication failures add to the issues with robustness.

In a conceptual distributed data model, heterogeneity and autonomy are the classification criteria.
The degree of homogeneity depends on the choices of data models, constraints, and query languages. It also depends on semantics: the attributes of an entity, attribute names, the meaning of an attribute, and isolation-level choices may all differ across sites.

The degree of local autonomy depends on design autonomy over the data model, communication autonomy, execution autonomy, and association autonomy.

The above define the classification criteria for conceptual data models. With this classification, we can consider the following categories of distributed databases.
1. Distribution-transparent category: homogeneous databases with no local autonomy that look like a single centralized DB to users.
2. Multidatabase category: databases with no common schema.
3. Federated category: databases with a common schema but local autonomy.
In addition, the hardware and software may be heterogeneous.

In federated databases, we can have a schema hierarchy in which the local component schema is translated to a canonical data model and then exported (as with a public view), federated (as with a global view for the DBAs), or made external (as with views for user groups).

A federated schema is usually built by consensus, and data interchange standards complement a federated schema. As an example, XML provides a syntax for both the federated schema and the data in it. This category has the support of major IT vendors.

Let's consider schema integration examples.

An engineering database may have a relational schema with the following entities:
Engineer (Engineer No, Name, Title, Salary)
Project (Project Number, Project Name, Budget, Location)
Client (Client Name, Address)
and the following relationships (N:1):
Engineer works in Project (Responsibility, Duration)
Project contracted by Client (Contract Date)

Also, an employee database may have a CODASYL schema:
Department (dept-name, budget, manager)
Employee (e#, name, address, title, salary)
Department employs Employee (1:N relationship)

Now we can integrate these schemas.
Here's one solution that uses an entity-relationship diagram:

Entities:
Department(dname, budget, manager)
Employee(E#, name, title, address, salary)
Engineer()
Project(PNo, Pname, Budget, Location)
Client(Client Name, Address)

Relationships:
Engineer is a subtype of Employee
Department employs Employee (1:N)
Engineer works in Project (M:N): (Responsibility, Duration)
Project contracted by Client (M:N): (Contract Date)

Linking of databases can be done with the following:
Global naming of data objects such as tables, so that we can have two-part names for tables in the form tablename@databasename, e.g., emp@sales.
Database links may be created for one-way communication, such as sales over us.america.
Database replication could be set up as read-only against a replicated master table, or as read-write snapshots with periodic refresh.

Physical storage of distributed databases involves the following. First, the tables could be fragmented, with horizontal division by rows, vertical division by columns, or both. Second, the degree of replication could be specified: each fragment could appear at one site only, which is called non-redundant replication, or all the sites could have all the data, which is called full replication. Alternatively, selective replication could be specified, where some fragments have a higher degree of replication than others.
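A small LINQ sketch of the two kinds of fragmentation over a toy Employee relation; the columns and rows are hypothetical:

using System;
using System.Linq;

var employees = new[]
{
    new { Id = 1, Name = "Ann", Region = "US", Salary = 100m },
    new { Id = 2, Name = "Raj", Region = "EU", Salary = 90m },
    new { Id = 3, Name = "Mei", Region = "US", Salary = 110m },
};

// Horizontal fragmentation: a subset of the rows goes to each site.
var usFragment = employees.Where(e => e.Region == "US").ToList();

// Vertical fragmentation: a subset of the columns, carrying the key
// so the fragments can be rejoined later.
var salaryFragment = employees.Select(e => new { e.Id, e.Salary }).ToList();

Console.WriteLine(usFragment.Count);     // 2 rows stored at the US site
Console.WriteLine(salaryFragment.Count); // 3 rows of a 2-column fragment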

Query processing also requires some extra features for distributed databases. Data transfer strategies now need to consider communication cost besides CPU cost and I/O cost, and query decomposition and allocation between databases must also be determined. Consider the case where the data is at site 1 and site 2 and the query is issued at site 3. We could send both tables to site 3, or send one table to site 2 and then the result to site 3, or send the other table to site 1 and then the result to site 3.
Another data transfer strategy is the semi-join strategy. Here site 2 takes the join columns of its table and sends them to site 1. Site 1 joins against that projection and sends the reduced table back to site 2. Site 2 then joins it with its own data and sends the result to site 3.
Semi-join strategies are usually cheaper than conventional strategies when the tables are large but the semi-join itself is cheap, as when there are lots of irrelevant columns and the join selectivity is very small.
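A LINQ sketch of the semi-join strategy, with hypothetical Orders and Customers relations standing in for the two sites' tables:

using System;
using System.Linq;

var ordersAtSite1 = new[] { new { OrderId = 1, CustId = 10 }, new { OrderId = 2, CustId = 30 } };
var customersAtSite2 = new[] { new { CustId = 10, Name = "Ann" }, new { CustId = 20, Name = "Raj" } };

// Step 1: site 2 projects only the join column and ships it to site 1.
var custIds = customersAtSite2.Select(c => c.CustId).Distinct().ToHashSet();

// Step 2: site 1 keeps just the orders that can join (the semi-join)
// and ships this reduced relation back to site 2.
var reducedOrders = ordersAtSite1.Where(o => custIds.Contains(o.CustId)).ToList();

// Step 3: site 2 completes the join and forwards the result to site 3.
var result = reducedOrders.Join(customersAtSite2, o => o.CustId, c => c.CustId,
                                (o, c) => new { o.OrderId, c.Name });
foreach (var r in result) Console.WriteLine($"{r.OrderId} {r.Name}"); // 1 Ann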

Thus distributed databases require a few more features.

Sunday, February 17, 2013

Data Structures using C#

Arrays serve as a popular homogeneous, uniform-sized, direct-access data structure. Not all array elements need be of uniform length, though: jagged arrays are arrays of arrays where the contained arrays can be of variable length. These are declared as follows:
int[][] jaggedArray = new int[3][];
jaggedArray[0] = new int[] { 1, 3, 5, 7, 9 };
jaggedArray[1] = new int[] { 0, 2, 4, 6 };
jaggedArray[2] = new int[] { 11, 22 };
A multidimensional array on the other hand is declared as follows:
int[,] multiArray = new int[,]
{ {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
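Element access differs between the two; assuming the declarations above:

// Jagged arrays take one bracket pair per dimension; multidimensional
// arrays take a single comma-separated index list.
Console.WriteLine(jaggedArray[0][2]); // 5
Console.WriteLine(multiArray[2, 1]);  // 8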

Saturday, February 16, 2013

Object Oriented databases


Object-oriented databases originated from well-known limitations of the relational DBMS, which was too simple for CAD/CAM and GIS/CASE applications, since datatypes were limited and there were no UDTs at the time. Among the other limitations: the RDBMS did not allow the concepts of generalization and aggregation, there is a mismatch between the DB language and application languages, and there was no support for design transactions, versions, and schema evolution. Hence the approach of extended relational DBs (the object-relational model), with support for user-defined data types, stored procedures or rules, transitive closure, and generalization.

Fundamentally, an object-oriented database implements a conceptual object data model with objects and their identifiers. This model is all about encapsulation, classes, inheritance, overloading and overriding, and complex values and types. It therefore differs from an entity-relationship model, where there are explicitly defined relations and schemas. The object model extends relationships to add support for subclasses and instances, composition hierarchies, checking out and checking in of objects, etc. Integrity constraints are also extended in object databases, with referential integrity constraints, uniqueness constraints, and sibling or sub-class relationships. Support for composite objects, with references to other objects as exclusive or shared parts, is also available.

Similarly, the query language and optimization have been extended. The queries look very similar to those in programming languages, and expressions with support for All or Any are also made available. The issues with query optimization in the object-relational model include user-defined methods (hard to optimize), class-hierarchy-based queries (requiring new strategies), multi-valued attributes (needing new cost models), and dual buffering (separately processing objects in the buffer and on the disk).

Locking is at the granularity of objects, with intention locks on large complex objects. Transactions share objects in memory buffers, and recovery needs to handle those memory buffers.

An object database provides a new model closer to application development and requires refinements of the relational model. When objects are flattened, they lend themselves to storage in a relational model.

Friday, February 15, 2013

Spatial databases

A spatial database stores information about objects in space, which we can express as geometry, such as points, lines, and polygons, or as geography. These are data types that are very different from the customary varchar, int, varbinary, etc. data types for databases, and spatial data is also queried very differently. The Open Geospatial Consortium defines the following query types:
Spatial measurements
Spatial functions
Spatial predicates
Constructor functions
Observer functions
NoSQL systems like MongoDB and CouchDB support only a limited set.
Also the index used for spatial data types are different too. Due to the nature of the spatial data types, some patterns are very common in spatial queries and regular indexes don't support these very well. For example, how far two points differ or if a point is within a spatial area of interest require algorithms and data structures that are different from B-Trees or bitmap indexes.
For example in Microsoft SQL Server, you can do the following:
CREATE TABLE SpatialTable
( id int IDENTITY (1,1),
GeomCol1 geometry,
GeomCol2 AS GeomCol1.STAsText() );
GO

INSERT into SpatialTable (GeomCol1)
VALUES (geometry::STGeomFromText('LINESTRING (20 180, 100 100, 180 180)', 0));

INSERT into SpatialTable(GeomCol1)
VALUES (geometry::STGeomFromText('POLYGON ((0 0, 150 0, 150 150, 0 150, 0 0))', 0));

DECLARE @geom1 geometry;
DECLARE @geom2 geometry;
DECLARE @result geometry;

SELECT @geom1 = GeomCol1 FROM SpatialTable WHERE id = 1;
SELECT @geom2 = GeomCol1 FROM SpatialTable WHERE id = 2;
SELECT @result = @geom1.STIntersection(@geom2);
SELECT @result.STAsText();

Incidentally, expressing spatial data types as .NET UDTs and performing spatial operations on them is practical, but not deep enough for the kinds of features and the size of data involved in, say, astronomical study.

recap

C# data structures
Here are some common data structures in C#:
ArrayList: This is a dynamically sized array of objects that implements the IList interface. It works by maintaining an internal array of objects that is replaced with a larger array when it reaches its capacity. It is very efficient at adding elements, since there is usually a free slot at the end, but inefficient at inserting. Searching can be efficient if the BinarySearch method can be used, but you must Sort the array first (see the sketch after this list).
BitArray: This is a dynamically sized array of Boolean values. It is more memory efficient than a simple array of bools because it uses only one bit for each value, whereas a bool array uses a whole byte for each value.
Hashtable class: A Hashtable class is a standard dictionary (key/value) data structure that also uses a hashing algorithm to store and index values efficiently. The GetHashCode method is used for hashing. Any type used as a key in the hashtable should return a good hash of the object’s internal value.
Queue class: A Queue is a standard first in first out (FIFO) data structure providing simple operations to enqueue, dequeue, peek etc.
SortedList class: A SortedList is a standard dictionary data structure that uses a binary-chop search to index efficiently. SortedList implements IDictionary.
Stack class: A stack class is a standard Last-in first-out (LIFO) data structure.
StringCollection: A StringCollection is a standard collection data structure for storing strings.
HashSet: A HashSet is a generic collection that contains no duplicate elements and where the elements are in no particular order. The capacity of the HashSet is the number of elements the object can hold.  
LinkedList: A LinkedList is a general-purpose linked list. It supports enumerators and implements the ICollection interface, consistent with other collection classes in the .NET Framework.
SortedDictionary: A SortedDictionary is a binary search tree and is very similar to the SortedList, but differs in execution speed and memory use.
ConcurrentQueue: This generic collection is used for thread-safe queue operations and can be used safely by multiple threads. You could invoke multiple threads with the Parallel library, i.e., Parallel.Invoke(action, action);
Similarly, ConcurrentBag, ConcurrentDictionary, and ConcurrentStack are also available.
Partitioner and OrderablePartitioner provide a way of partitioning data into multiple partitions.
BlockingCollection: A BlockingCollection provides blocking and bounding capabilities for thread-safe collections that implement IProducerConsumerCollection<T>.
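A minimal producer/consumer sketch of the BlockingCollection behavior just described; the bound and the item count are arbitrary choices:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Bounded to 2 items: Add blocks when the collection is full, and
// GetConsumingEnumerable blocks when it is empty.
using var queue = new BlockingCollection<int>(boundedCapacity: 2);

var producer = Task.Run(() =>
{
    for (int i = 0; i < 5; i++) queue.Add(i);
    queue.CompleteAdding(); // tells consumers no more items will arrive
});

var consumer = Task.Run(() =>
{
    foreach (var item in queue.GetConsumingEnumerable())
        Console.WriteLine(item);
});

Task.WaitAll(producer, consumer);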
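Separately, to illustrate the ArrayList entry earlier in this list, BinarySearch is only valid after a Sort (the values here are hypothetical):

using System;
using System.Collections;

var list = new ArrayList { 5, 1, 4, 2 };
list.Sort();                      // BinarySearch requires sorted input
int index = list.BinarySearch(4); // returns 2; a negative result means not found
Console.WriteLine(index);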
BinaryTree and BinarySearchTree can be implemented with a generic Node<T> class.
public class Node<T>
{
    public Node<T> Left { get; set; }
    public Node<T> Right { get; set; }
    public T Value { get; set; }
}
and traversals can be written in a recursive way:
public void PreorderTraversal<T>(Node<T> current)
{
    if (current != null)
    {
        Console.WriteLine(current.Value);
        PreorderTraversal(current.Left);
        PreorderTraversal(current.Right);
    }
}
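A quick usage sketch, assuming Node<T> and PreorderTraversal above are reachable from the calling code:

// Build a three-node tree and print it in preorder: 2, 1, 3.
var root = new Node<int> { Value = 2 };
root.Left = new Node<int> { Value = 1 };
root.Right = new Node<int> { Value = 3 };
PreorderTraversal(root);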