Cluster computing: Distributed databases

What are the features in the database server for distributed databases ?

Distributed databases are faced with the challenge of dealing with heterogenity and autonomy. Logically, these databases are connected via links. Physically, they share storage and query processing. In addition, reovery and concurrency control now spans linked servers.

Distributed databases originated from internet and mobile computing. The motivation was to provide distributed computing over LAN, WAN, internet etc to boost availability and reliability via redundancy and performance via distribution of work.

The computing environment for distributed databases can vary from homogenous nodes in a shared nothing architecture to a more centralized or truly distributed heterogenous computers. In such cases, the transparency of computing, the networking costs, the replication, fragmentation and allocation become important considerations. And site and communcation failures add to issues with robustness.

In a conceptual distributed management, heterogenity and autonomy classification criteria are considered.
The degree of homogenity depends on the choices of data models, constraints and query languages. Semantic attributes of an entity, attribute names, meaning of an attributes, isolation level choices.

The degree of local autonomy depends on the data model design autonomy, communication autonomy, execution autonomy and association autonomy.

The above define the classification criteria for conceptual DMs. With this classification, we can consider the following categories of distributed databases.
1. Distribution transparent category: These are homogeneous, no local autonomy databases that looks like a single centralized DB to users.
2. Multidatabase category: These have no common schema.
3. and Federated category: These have common schema, but local autonomy.
In addition, hardware and software may be heterogeneous

In the Federated databases, we can have a schema hierarchy where in the local component is translated to canonical data model and exported such as with public view or federated such as with a global view to the DBAs or made external such as with views to user groups.

Fedearated schema is usually built with consensus. And data interchange standards complement federated schema. As an example, XML is a syntax for federated schema and data in federated schema. This category has support of major IT vendors.

Let's consider schema integration examples.

An engineering database may have relational schema such as with the following entities
Engineer ( Engineer no, name, title, salary)
Project (Project Number, Project Name, budget, location)
Client (Client name, address)
and with following relationships (N:1)
Engineer works in Project (Responsibility, Duration)
Projject Contract by Client (Contract Date)

Also, an Employee database may have a CODASYL schema
Department (dept-name, budget, manager)
Employee( e# name, address, title, salary)
Department employs Employee (1:N relationship)

Now we can integrate this schema.
Here's one solution that uses Entity relationship Diagram)

Entities:
Department(dname, budget, manager)
Employee(E#, name, title, address, salary)
Engineer()
Project(PNo, Pname, Budget, Location)
Client(Client Name, Address)

Relationships
Engineer() is a subtype of employee
Department employs Employee(1:N)
Engineer works in project (M:N) : (Responsibility, Duration)
Project Contracted by Clients(M:N) (Contract Date)

Linking of databases can be done with the following;
Global naming of data objects such as tables. So we can have two part names for tables in the form tablename@databasename eg emp@sales
Database Links may be created for one way communication such as sales over us.america
Database replication could be setup with read-only or replicated master table and Read-Write or snapshot with periodic refresh.

Physical storage of distributed databases involves the following: Firstly, the tables could be fragmented with horizontal division of rows or vertical division by columns or both such as with division by both rows and columns. Secondly, the degree of replication could be specified. So each fragment could appear at one site only and this is called non-redundant replication or all the sites could have all the data and this is called full replication. Alternatively, selective replication could also be specified where some fragments could have higher degree of replication.

Query processing also requires some features for distributed databases. Data transfer strategies now need to consider communications cost besides CPU cost and I/O cost. Query decomposition and allocation between databases should also be determined. Consider the case when data is at site 1 and site 2 and query is at site 3. In such cases, we could send both table to site 3 or send one table to Site 2 and then the result to site 3 or send the other table to site 1 and then the result to site 3.
Another data transfer strategy is semi-join stragety. Here Site 2 could take some of the colums of one table and send to site 1. Site1 could then join the projection and send it back to site 2. Site 2 could then join it with its data and then send the result to site 3.
Semi-join strategies are usually cheaper than conventional strategies when the tables are large but semi-joins are cheap or when there are lots of irrelevant columns and very small join selectivity.

Thus distributed databases require a few more features.

Cluster computing

Monday, February 18, 2013

Distributed databases

No comments:

Post a Comment