Structured text is valuable for identifying topics or keywords in a text. Word documents provide markup for such information, and this can be useful for finding topics. A Word document can be parsed to retrieve its table of contents or overall structure, and this can be used to divide the text into sections that are then treated as unstructured. Non-text content that carries a title or caption should be treated the same way as headings are for section text; these are also candidates for document indexing. Thus one improvement to indexing unstructured text is to add logic that extracts document structure and exploits the layout of the information as the author presented it. The data structure for capturing this layout could be a list, populated during document parsing, whose elements hold references to sections along with their type and location. Each element of this list is then treated the same way as unstructured text of any length.
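A minimal sketch of such a layout list, assuming a hypothetical parser that yields (kind, title, offset) triples for each structural element it finds:

```python
from dataclasses import dataclass

@dataclass
class SectionRef:
    kind: str      # e.g. "heading", "caption", "toc-entry"
    title: str     # the heading or caption text
    offset: int    # character position of the section in the document

def build_layout_index(parsed_elements):
    """Collect structural elements into a flat list for indexing.

    `parsed_elements` is assumed to be an iterable of (kind, title, offset)
    tuples produced by a document parser (hypothetical here)."""
    return [SectionRef(kind, title, offset) for kind, title, offset in parsed_elements]

# Each SectionRef delimits a span of text that can then be indexed
# exactly like any other unstructured text.
elements = [("heading", "Introduction", 0), ("caption", "Figure 1: Pipeline", 1200)]
index = build_layout_index(elements)
print([s.title for s in index])
```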
Saturday, February 23, 2013
Friday, February 22, 2013
schema design
Schema refinement in database design:
Good schema design improves decomposition. One way to do this is to eliminate redundancy. Redundancy causes several problems: the same data is stored repeatedly, updates may not change all copies consistently, insertions may depend on unrelated data, and deletions may not be possible without losing other information as well.
Several normal forms have been proposed for relations and if a schema conforms to one of these, it can avoid certain kinds of problems. A property of decomposition is lossless-join which enables us to recover any instance of the decomposed relation from corresponding instances of the smaller relations. Another property is dependency-preservation which enables us to enforce any constraint on the original relation by simply enforcing some constraints on each of the smaller relations.
A good designer will ask whether a relation is in a normal form and whether the decomposition is dependency-preserving. Relations will often have functional dependencies. A functional dependency X --> Y holds if, for every pair of tuples t1 and t2, whenever t1.X = t2.X, then t1.Y = t2.Y.
A primary key constraint is a special case of an FD.
Constraints can be defined on an Entity set. For example, the SSN can be a key to the tuples with attributes name, lot, rating, hourly_wages, hours_worked etc.
Constraints can also be defined on a relationship set, and these constraints can eliminate redundancies that ER design may not catch. For example, if a contract with a contract id involves parts, suppliers, and departments, it may be better to split the relation into CQSD and SDP.
Functional Dependencies also help with identifying attributes of entities and particularly find attributes that are on the wrong entity set. For example, employees can work in at most one department. So we can decompose this into two entities Workers and Departments with attributes as workers(ssn, name, did, since) and Departments(did, dname, budget, lot).
The closure of a set of functional dependencies is defined as the set of FDs implied by a given set F of FDs, and is denoted F+. Armstrong's axioms help find the closure of a set of FDs. The axioms are:
1. Reflexivity: if Y is a subset of X, then X --> Y
2. Augmentation if X-->Y, then XZ-->YZ for any Z
3. Transitivity if X-->Y and Y-->Z, then X-->Z
It is convenient to use additional rules while reasoning about F+:
Union if X -->Y and X-->Z, then X-->YZ
Decomposition if X-->YZ, then X-->Y and X-->Z
These additional rules are not essential; their soundness can be proved using Armstrong's axioms.
The attribute closure of an attribute set X, denoted X+, is the set of attributes A such that X --> A can be inferred using Armstrong's axioms.
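The attribute-closure computation can be sketched as follows (a minimal sketch; FDs are represented here as (lhs, rhs) pairs of attribute sets):

```python
def attribute_closure(attrs, fds):
    """Compute X+ for attribute set `attrs` under the FDs in `fds`.

    `fds` is a list of (lhs, rhs) pairs, each a set of attributes.
    Repeatedly apply every FD whose left side is contained in the
    closure until no more attributes can be added."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# Example: with A -> B and B -> C, the closure of {A} is {A, B, C},
# so A is a key of R(A, B, C).
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(sorted(attribute_closure({"A"}, fds)))  # ['A', 'B', 'C']
```

Checking whether X is a key of a relation then amounts to checking whether X+ contains every attribute of the relation.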
Thursday, February 21, 2013
Skip list
Skip-list insertion and deletion have to keep track of the references at every level of the node being inserted or deleted. So the first step for an insertion or deletion is, at each level from top to bottom, to find the nodes whose references will need to change. These can be collected on the stack via recursion or on the heap. For an insertion, those references are then updated to point to the new node, and the new node points to what they pointed to before. For a deletion, they are updated to point to whatever the target node points to, and the target node is spliced out. In all cases, the level-zero iteration of nodes spans the entire list.
Insertion should perform the updates to the references as it walks down the levels. Deletion could perform the updates as it returns up the levels.
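A minimal Python sketch of this scheme (here the per-level "update" references are collected iteratively rather than via recursion):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one pointer per level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _find_updates(self, key):
        """For each level, top to bottom, find the last node whose
        forward reference must change for `key`."""
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for lvl in range(self.level, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        return update

    def insert(self, key):
        update = self._find_updates(key)
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1                              # coin-flip tower height
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl + 1):
            node.forward[i] = update[i].forward[i]  # new node points at successor
            update[i].forward[i] = node             # predecessor points at new node

    def delete(self, key):
        update = self._find_updates(key)
        target = update[0].forward[0]
        if target and target.key == key:
            for i in range(len(target.forward)):
                if update[i].forward[i] is target:
                    update[i].forward[i] = target.forward[i]  # splice out

    def to_list(self):
        out, node = [], self.head.forward[0]
        while node:                               # level zero spans every node
            out.append(node.key)
            node = node.forward[0]
        return out
```

A quick exercise: inserting 3, 1, 2 and then deleting 2 leaves the level-zero list as [1, 3].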
Wednesday, February 20, 2013
finding answers quickly
Many times when we search, we really are interested in the first few results rather than all of the results we get. There is an urgency for the user to get only the first few or the best few answers quickly. Rarely does the user go past the second or third page of the results. If the user does not find what they want, they usually refine the query and resubmit it. This is true not just for searches but also for decision support. Another pattern is when the user may want to avoid writing a complex query and instead write one that gives an approximate answer quickly, then refine the query continually until an exact answer is found.
Top N queries refer to the first of these kinds of querying. It is very helpful for users to be able to explicitly indicate how many answers they want, since this makes it possible for the DBMS to optimize execution. As an example, one could request the top 10 products by sales.
So here the optimizer would add a predicate that checks for sales greater than a cut-off value. Histograms can be used to estimate a cut-off that reflects the top values. The cut-off can be refined if the number of tuples returned is fewer or more than the number the user specified, but this approach is based on pre-computation.
Another approach is online aggregation, where the quality of the answer is continually refined as the computation progresses. First, the value of the grouping column is estimated to lie within a range at some confidence level; then user-specified priorities determine the order in which tuples are processed; and finally, non-blocking algorithms are used for the relational operators. For example, the sort-merge join algorithm blocks because sorting requires all of its input tuples; nested-loops join and hash join are therefore preferable to sort-merge join for online aggregation.
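The cut-off idea can be illustrated with a small sketch, where a bounded min-heap plays the role of the optimizer's threshold (illustrative only; a real optimizer pushes the cut-off predicate into the plan):

```python
import heapq

def top_n(stream, n):
    """Return the n largest values seen in `stream`.

    The min-heap's smallest element acts as the running cut-off:
    a new value is only admitted if it beats that threshold."""
    heap = []
    for value in stream:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:          # beats the current cut-off
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)

sales = [120, 45, 300, 78, 220, 90, 310]
print(top_n(sales, 3))  # [310, 300, 220]
```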
Tuesday, February 19, 2013
Distributed Databases (2)
Points to review:
Parallel databases evaluate queries in parallel in order to improve performance. Parallel queries can be executed on shared-memory, shared-disk, and shared-nothing systems. In the shared-nothing system, linear speed-up and scale-up are possible.
In pipelined parallelism, one operator consumes the output of another operator.
In data-partitioned parallelism, the input data is partitioned and we work on each partition in parallel.
Data can be partitioned with round-robin partitioning, where, say, tuple i goes to processor i mod n; with hash partitioning, where a hash function distributes the tuples; or with range partitioning, where the tuples are distributed over n search-key ranges. Existing code can be parallelized by introducing split and merge operators.
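The three partitioning schemes can be sketched as follows (a toy illustration; the range boundaries are assumed to be given):

```python
import bisect

def round_robin(i, n):
    """Tuple i goes to processor i mod n."""
    return i % n

def hash_partition(key, n):
    """A hash function distributes tuples over n processors."""
    return hash(key) % n

def range_partition(key, boundaries):
    """Tuples are distributed over len(boundaries)+1 search-key ranges;
    `boundaries` is a sorted list of split points."""
    return bisect.bisect_right(boundaries, key)

print(round_robin(7, 3))                  # 1
print(range_partition(45, [10, 50, 90]))  # 1  (falls in the 10..50 range)
```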
Shared nothing architecture lends itself to data partitioning.
For query optimizations, we take parallelism within each operator and use pipelined parallelism to exploit parallelism between operators.
In a distributed database, data is stored across several locations. When the user does not have to know the location of the data, we achieve distributed data independence. When there is no difference between distributed transactions and local transactions, we achieve distributed transaction atomicity. If a transaction involves activities at different sites, the activity at a given site is called a subtransaction.
In distributed databases, a single relation might be fragmented or replicated across several sites. Horizontal and vertical fragmentation may apply to rows and columns respectively. Fragments need qualifying names. Distributed catalog management is used to keep track of what is stored where.
When processing queries in a distributed DBMS, the location of the partitions of the relations need to be taken into account. Two relations that reside on different sites can be joined by sending one relation to the other site and performing the join locally. Choice of sites could be based on the number of tuples to select or join. Semijoins and Bloomjoins reduce the number of tuples sent across the network. In such query processing, communication costs and autonomy of individual sites matter.
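A Bloomjoin can be sketched as follows (a simplified single-hash Bloom filter; real implementations use several hash functions, and the final join still filters exactly because false positives are possible):

```python
def bloom_bits(keys, m):
    """Build an m-bit Bloom filter of join-key values (one hash function)."""
    bits = [False] * m
    for k in keys:
        bits[hash(k) % m] = True
    return bits

def bloom_join_reduce(tuples, bits, keycol, m=64):
    """At the remote site, keep only the tuples whose join key might match;
    only these candidates are shipped across the network."""
    return [t for t in tuples if bits[hash(t[keycol]) % m]]

# Site 1 builds a filter over its join keys and sends the small bit
# vector instead of the relation; Site 2 sends back only candidates.
site1_keys = [1, 4, 9]
bits = bloom_bits(site1_keys, 64)
site2_tuples = [(1, "a"), (2, "b"), (9, "c")]
candidates = bloom_join_reduce(site2_tuples, bits, 0)
print(candidates)  # [(1, 'a'), (9, 'c')]
```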
In replication, we store several copies of the relation or partition on different sites. In synchronous replication, all copies of a replicated relation are updated before the transaction commits. In asynchronous replication, copies are only updated periodically. Synchronous replication involves voting, where an update must write a majority of copies, and a read must access at least enough copies to make sure that one of them is current. Asynchronous replication includes peer-to-peer replication, where more than one copy can be updatable and a conflict-resolution strategy deals with conflicting changes. Alternatively, in primary-site replication, there is one primary copy that is updatable; the secondary copies cannot be updated. Changes on the primary are captured and then applied to the other sites.
Locking can be either central at a designated primary copy or fully distributed. Distributed deadlock detection is required.
Recovery is done with a commit protocol that co-ordinates activities at the different sites involved in the transaction. In two-phase commit, each transaction has a designated co-ordinator site, and subtransactions are executed at subordinate sites. If the co-ordinator fails, subordinate sites block until it recovers.
Three-phase commit can avoid blocking even if the co-ordinator site fails. In 3PC, the co-ordinator site sends out prepare messages and receives yes votes from all subordinates. Then it sends out pre-commit messages rather than commit messages. When a sufficient number of acknowledgements have been received, the co-ordinator force-writes a commit log record and sends a commit message to all the subordinates. The pre-commit message lets the sites communicate with each other without waiting for the co-ordinator to recover.
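The two rounds of two-phase commit can be sketched as an in-memory simulation (real systems force-write log records at each step, which this sketch omits):

```python
def two_phase_commit(subordinates):
    """Phase 1: the co-ordinator sends PREPARE and collects votes.
    Phase 2: COMMIT only if every subordinate voted yes, else ABORT.

    `subordinates` is a list of callables that return True (yes vote)
    or False (no vote) when asked to prepare."""
    votes = [sub() for sub in subordinates]          # phase 1: prepare
    decision = "COMMIT" if all(votes) else "ABORT"
    # phase 2: the decision would be force-logged, then sent to all
    # subordinates, who apply or roll back their subtransactions.
    return decision

print(two_phase_commit([lambda: True, lambda: True]))   # COMMIT
print(two_phase_commit([lambda: True, lambda: False]))  # ABORT
```

The blocking problem the text mentions arises between the two phases: a subordinate that has voted yes cannot unilaterally decide until it hears the outcome.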
Monday, February 18, 2013
Distributed databases
What are the features in the database server for distributed databases ?
Distributed databases are faced with the challenge of dealing with heterogeneity and autonomy. Logically, these databases are connected via links. Physically, they share storage and query processing. In addition, recovery and concurrency control now span linked servers.
Distributed databases originated with internet and mobile computing. The motivation was to provide distributed computing over LANs, WANs, the internet, etc., boosting availability and reliability via redundancy and performance via distribution of work.
The computing environment for distributed databases can vary from homogeneous nodes in a shared-nothing architecture to more centralized or truly distributed heterogeneous computers. In such cases, the transparency of computing, the networking costs, and the replication, fragmentation, and allocation of data become important considerations. And site and communication failures add to issues with robustness.
In a conceptual classification of distributed data management, heterogeneity and autonomy are the criteria considered.
The degree of homogeneity depends on the choices of data models, constraints, and query languages, as well as on the semantic attributes of an entity, attribute names, the meaning of each attribute, and isolation-level choices.
The degree of local autonomy depends on the data model design autonomy, communication autonomy, execution autonomy and association autonomy.
The above define the classification criteria for conceptual DMs. With this classification, we can consider the following categories of distributed databases.
1. Distribution-transparent category: these are homogeneous databases with no local autonomy that look like a single centralized DB to users.
2. Multidatabase category: These have no common schema.
3. Federated category: these have a common schema, but local autonomy.
In addition, hardware and software may be heterogeneous.
In federated databases, we can have a schema hierarchy in which the local component schema is translated to a canonical data model and then exported (as with a public view), federated (as with a global view for the DBAs), or made external (as with views for user groups).
A federated schema is usually built by consensus, and data interchange standards complement it. As an example, XML provides a syntax for both the federated schema and the data in it. This category has the support of major IT vendors.
Let's consider schema integration examples.
An engineering database may have relational schema such as with the following entities
Engineer ( Engineer no, name, title, salary)
Project (Project Number, Project Name, budget, location)
Client (Client name, address)
and with following relationships (N:1)
Engineer works in Project (Responsibility, Duration)
Project Contracted by Client (Contract Date)
Also, an Employee database may have a CODASYL schema
Department (dept-name, budget, manager)
Employee (e#, name, address, title, salary)
Department employs Employee (1:N relationship)
Now we can integrate these schemas.
Here's one solution that uses an Entity-Relationship diagram:
Entities:
Department(dname, budget, manager)
Employee(E#, name, title, address, salary)
Engineer()
Project(PNo, Pname, Budget, Location)
Client(Client Name, Address)
Relationships
Engineer() is a subtype of Employee
Department employs Employee(1:N)
Engineer works in project (M:N) : (Responsibility, Duration)
Project Contracted by Clients(M:N) (Contract Date)
Linking of databases can be done with the following:
Global naming of data objects such as tables. So we can have two-part names for tables of the form tablename@databasename, e.g., emp@sales.
Database links may be created for one-way communication, such as sales over us.america.
Database replication could be set up as read-only (a replicated master table) or read-write (a snapshot with periodic refresh).
Physical storage of distributed databases involves the following: Firstly, the tables could be fragmented with horizontal division of rows or vertical division by columns or both such as with division by both rows and columns. Secondly, the degree of replication could be specified. So each fragment could appear at one site only and this is called non-redundant replication or all the sites could have all the data and this is called full replication. Alternatively, selective replication could also be specified where some fragments could have higher degree of replication.
Query processing also requires some features for distributed databases. Data transfer strategies now need to consider communication cost besides CPU cost and I/O cost. Query decomposition and allocation between databases must also be determined. Consider the case where the data is at sites 1 and 2 and the query is at site 3. We could send both tables to site 3, or send one table to site 2 and then the result to site 3, or send the other table to site 1 and then the result to site 3.
Another data transfer strategy is the semi-join strategy. Here site 2 could take some of the columns of one table and send them to site 1. Site 1 could then join against the projection and send the result back to site 2. Site 2 could then join it with its own data and send the result to site 3.
Semi-join strategies are usually cheaper than conventional strategies when the tables are large but the semi-joins themselves are cheap, or when there are many irrelevant columns and very small join selectivity.
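The trade-off can be illustrated with back-of-the-envelope transfer costs (the row counts, widths, and selectivity below are illustrative numbers only):

```python
def full_ship_cost(rows, row_bytes):
    """Bytes transferred when shipping the whole relation to the joining site."""
    return rows * row_bytes

def semijoin_cost(rows, key_bytes, matching_rows, row_bytes):
    """Bytes transferred when shipping only the join-key column one way,
    then the matching rows back."""
    return rows * key_bytes + matching_rows * row_bytes

# A wide 1M-row table with 8-byte keys and 1% join selectivity:
full = full_ship_cost(1_000_000, 200)            # 200 MB over the network
semi = semijoin_cost(1_000_000, 8, 10_000, 200)  # 8 MB of keys + 2 MB of rows
print(full, semi)
```

With wide rows and low selectivity the semi-join moves a small fraction of the bytes; with narrow rows or high selectivity the two costs converge and the conventional strategy can win.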
Thus distributed databases require a few more features.
Sunday, February 17, 2013
Data Structures using C#
Arrays are a popular homogeneous, uniform-sized, direct-access data structure. However, not all contained elements need be of uniform length: jagged arrays are arrays of arrays, where the contained arrays can vary in length. They are declared as follows:
int[][] jaggedArray = new int[3][];
jaggedArray[0] = new int[] { 1, 3, 5, 7, 9 };
jaggedArray[1] = new int[] { 0, 2, 4, 6 };
jaggedArray[2] = new int[] { 11, 22 };
A multidimensional array on the other hand is declared as follows:
int[,] multiArray = new int[,]
{ {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };