Cluster computing

Sunday, February 24, 2013

Integrated access to multiple data sources

Large organizations typically have several databases, and users may want to access data from more than one source. For example, an organization can have one data store for the product catalog (also called master data), another for billing and payments, and yet another for reporting. These databases may contain some common information, but determining the exact relationship between tables in different databases can be difficult. For example, prices in one database might be in dollars per dozen items and in another in dollars per item. XML DTDs offer the promise that such semantic mismatches can be avoided if all parties conform to a single standard DTD. However, there are many legacy databases, and most domains may not yet have an agreed-upon DTD. Semantic mismatches can instead be resolved and hidden from users by defining relational views over the tables from the two databases. Defining a collection of views that gives users a uniform presentation of relevant data from multiple databases is called semantic integration. Defining these views can be challenging when there is little or no documentation for the existing databases.
If the underlying databases are managed by different DBMSs, some kind of middleware may be used to evaluate queries over the integrating views, retrieving data from the sources at query execution time. Alternatively, the integrating views can be materialized and stored in a data warehouse. Queries are then run over the warehoused data instead of against the source DBMSs at run time.
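As a rough illustration of an integrating view, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and prices are made up for the example, and both tables live in one in-memory database as a stand-in for two separate sources. The view converts the per-dozen prices so that users see everything in dollars per item.

```python
import sqlite3

# Hypothetical tables standing in for two separate source databases.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_prices (sku TEXT PRIMARY KEY, dollars_per_dozen REAL);
    CREATE TABLE billing_prices (sku TEXT PRIMARY KEY, dollars_per_item REAL);
    INSERT INTO catalog_prices VALUES ('A100', 24.0), ('B200', 6.0);
    INSERT INTO billing_prices VALUES ('C300', 1.5);

    -- The integrating view hides the unit mismatch from users.
    CREATE VIEW unified_prices AS
        SELECT sku, dollars_per_dozen / 12.0 AS dollars_per_item, 'catalog' AS source
        FROM catalog_prices
        UNION ALL
        SELECT sku, dollars_per_item, 'billing' AS source
        FROM billing_prices;
""")

# Users query the view and never see dollars per dozen.
rows = conn.execute(
    "SELECT sku, dollars_per_item, source FROM unified_prices ORDER BY sku"
).fetchall()
print(rows)

# Materializing the view corresponds to the data warehouse alternative.
conn.execute("CREATE TABLE warehouse_prices AS SELECT * FROM unified_prices")
warehouse_count = conn.execute("SELECT COUNT(*) FROM warehouse_prices").fetchone()[0]
```

With real sources managed by different DBMSs, the view would be evaluated by middleware rather than by a single engine. The sketch only shows how a view can absorb the unit conversion.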

new venture

http://indexer.cloudapp.net

schema design 2

Normal forms are guidance for avoiding known problems during schema design. Given a schema, whether we decompose it into smaller schemas is decided based on these normal forms. The normal forms are based on functional dependencies (FDs) and are first normal form (1NF), second normal form (2NF), third normal form (3NF) and Boyce-Codd normal form (BCNF). Each of these forms has increasingly restrictive requirements. A relation is in first normal form if every field contains only atomic values, not lists or sets. 2NF is mainly of historical interest. 3NF and BCNF are important from a design standpoint. A relation R is in Boyce-Codd normal form if, for every FD X->A that holds over R, one of the following statements is true:
A belongs to X : that is, it is a trivial FD or
X is a superkey
A relation R is in third normal form if, for every FD X->A that holds over R, one of the following statements is true:
A belongs to X : that is, it is a trivial FD or
X is a superkey or
A is part of some key for R
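The two definitions can be turned into a small checker. The sketch below is only an illustration: the function names and the sample relation are made up, candidate keys are found by brute force, and only the FDs listed in the input are tested.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(relation, fds):
    """Brute-force search for minimal attribute sets whose closure is the relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # a superset of a known key is not minimal
            if closure(candidate, fds) == relation:
                keys.append(candidate)
    return keys

def violations(relation, fds, form="BCNF"):
    """List the given FDs, split per right-hand attribute, that break the normal form."""
    prime = set().union(*candidate_keys(relation, fds)) if form == "3NF" else set()
    bad = []
    for lhs, rhs in fds:
        for a in sorted(rhs):
            if a in lhs:
                continue  # A belongs to X: trivial FD
            if closure(lhs, fds) == relation:
                continue  # X is a superkey
            if a in prime:
                continue  # A is part of some key (checked for 3NF only)
            bad.append((lhs, a))
    return bad

# Hypothetical relation R(A, B, C) with AB -> C and C -> B.
relation = {"A", "B", "C"}
fds = [({"A", "B"}, {"C"}), ({"C"}, {"B"})]
print(violations(relation, fds, "BCNF"))  # C -> B is reported: C is not a superkey
print(violations(relation, fds, "3NF"))   # nothing is reported: B is part of the key AB
```

The sample relation is in 3NF but not in BCNF, which shows that the third condition is the only difference between the two forms.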

Saturday, February 23, 2013

Setting up a website

If you want to create a website for yourself, here are some of the things you need to do:
1) Your company must have a name and logo. This is central to the theme on any page that displays information about your company.
2) Your company website must explain the company's value proposition to the user in a simple and clear manner, starting from the home page. A picture, a slogan or a paragraph that conveys this should be on the front page of your website.
3) Your company website must elicit user attention with achievements, partners or better yet examples of real life usage.
4) Your company website must have business contact information if users are expected to contact you.
5) Your company must hold the copyright to all information on the website.

Document parsing

Structured text is very valuable for identifying topics or keywords in a document. Word documents provide markup for such structure, and this can be used to find topics. A Word document can be parsed to retrieve the table of contents or the heading structure, which can then be used to divide the text into sections that are each treated as unstructured text. Content that is not text but has a title or caption should have that title or caption treated the same way as a section heading. These titles are also candidates for document indexing. Thus an improvement to indexing unstructured text is to add logic that extracts document structure and uses the layout of the information as presented by the user. The data structure that captures this layout for indexing could hold elements with references to sections, their type and their location; these elements form a list populated during document parsing. Each element of this list is then treated the same way as unstructured text of any length.
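One possible shape for that list of elements is sketched below. Everything here is an assumption made for illustration: the LayoutElement fields, the function names, and the input format of (style, text) pairs with style names such as "Heading 1" and "Caption", which a Word parser would have to supply.

```python
from dataclasses import dataclass

@dataclass
class LayoutElement:
    kind: str    # "heading" or "caption"
    title: str   # heading or caption text, itself a candidate index term
    start: int   # index of the first body paragraph of the section
    end: int     # index one past the last body paragraph of the section

def build_layout(paragraphs):
    """Build the layout list from (style, text) pairs produced by a parser."""
    elements = []
    for i, (style, text) in enumerate(paragraphs):
        if style.startswith("Heading"):
            kind = "heading"
        elif style == "Caption":
            kind = "caption"
        else:
            continue  # body paragraph, belongs to the current section
        if elements:
            elements[-1].end = i  # the previous section ends where this one starts
        elements.append(LayoutElement(kind, text, i + 1, len(paragraphs)))
    return elements

def section_text(paragraphs, element):
    """Join a section's body paragraphs so it can be indexed as unstructured text."""
    return " ".join(text for _, text in paragraphs[element.start:element.end])

# Made-up parser output for a short document.
paragraphs = [
    ("Heading 1", "Skip lists"),
    ("Normal", "A skip list keeps several levels."),
    ("Normal", "Each level skips nodes."),
    ("Caption", "Figure 1: levels"),
    ("Heading 1", "Deletion"),
    ("Normal", "Splice the node out."),
]
layout = build_layout(paragraphs)
for element in layout:
    print(element.kind, element.title, repr(section_text(paragraphs, element)))
```

Each element's title and section text can then be handed to the same indexing logic that is used for plain unstructured text.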

Friday, February 22, 2013

schema design

Schema refinement in database design:
Schema refinement improves a design through decomposition, chiefly by eliminating redundancy. Redundancy causes several problems: the same information is stored repeatedly, an update may change one copy but not the others and leave the data inconsistent, inserting some information may depend on other information being stored as well, and deleting some information may not be possible without losing other information too.
Several normal forms have been proposed for relations, and if a schema conforms to one of these, it avoids certain kinds of problems. One desirable property of a decomposition is lossless-join, which enables us to recover any instance of the decomposed relation from the corresponding instances of the smaller relations. Another is dependency-preservation, which enables us to enforce any constraint on the original relation by simply enforcing some constraints on each of the smaller relations.
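The lossless-join property can be seen on a small instance by projecting a relation onto two schemas and joining the projections back. The sketch below uses made-up helper names and made-up rows. It checks a single instance only, so it illustrates the property rather than proving it for a schema.

```python
def project(rows, attrs):
    """Projection with duplicate elimination."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(seen)]

def natural_join(left, right):
    """Join rows that agree on every attribute the two sides share."""
    joined = []
    for left_row in left:
        for right_row in right:
            common = left_row.keys() & right_row.keys()
            if all(left_row[a] == right_row[a] for a in common):
                joined.append({**left_row, **right_row})
    return joined

def same_instance(a, b):
    """Compare two lists of rows as sets of tuples."""
    canon = lambda rows: {tuple(sorted(row.items())) for row in rows}
    return canon(a) == canon(b)

# Hypothetical instance in which did determines lot.
workers = [
    {"ssn": 1, "did": 10, "lot": 5},
    {"ssn": 2, "did": 20, "lot": 5},
    {"ssn": 3, "did": 30, "lot": 7},
]

# Splitting on did recovers the original rows.
good = natural_join(project(workers, ["ssn", "did"]), project(workers, ["did", "lot"]))
print(same_instance(good, workers))

# Splitting on lot produces spurious rows because two departments share a lot.
bad = natural_join(project(workers, ["ssn", "lot"]), project(workers, ["did", "lot"]))
print(same_instance(bad, workers), len(bad))
```

The second decomposition pairs each employee on lot 5 with both departments on that lot, so the join returns more rows than the original relation had.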
A good designer will ask whether a relation is in a normal form and whether a decomposition is dependency-preserving. Relations will often have functional dependencies. A functional dependency X --> Y holds over a relation if, for every pair of tuples t1 and t2 in any legal instance, t1.X = t2.X implies t1.Y = t2.Y.
A primary key constraint is a special case of an FD.
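The definition can be checked directly against an instance. In the sketch below the function name and the sample rows are made up. Note that an instance can only show that an FD is violated; an FD that happens to hold on one instance is not thereby guaranteed by the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs --> rhs holds on this instance (a list of dicts)."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False  # two tuples agree on X but differ on Y
    return True

# Hypothetical employee rows.
rows = [
    {"ssn": "111", "rating": 8, "hourly_wages": 10},
    {"ssn": "222", "rating": 8, "hourly_wages": 10},
    {"ssn": "333", "rating": 5, "hourly_wages": 7},
]

print(fd_holds(rows, ["rating"], ["hourly_wages"]))         # holds on these rows
print(fd_holds(rows, ["hourly_wages"], ["ssn"]))            # violated: wage 10 has two ssn values
print(fd_holds(rows, ["ssn"], ["rating", "hourly_wages"]))  # the key determines every attribute
```

The last call shows the key constraint as an FD: since no two rows share an ssn, ssn determines all the other attributes.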
Constraints can be defined on an Entity set. For example, the SSN can be a key to the tuples with attributes name, lot, rating, hourly_wages, hours_worked etc.
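Constraints can also be defined on a relationship set. These constraints can eliminate redundancies that ER design may not catch. For example, suppose a contract with contract id C specifies that supplier S will supply quantity Q of part P to department D, giving a relation with attributes CQPSD. If a department buys at most one part from any given supplier, the FD DS --> P holds, and it may be better to split the relation into CQSD and SDP.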
Functional dependencies also help with identifying attributes of entities, and in particular with finding attributes that are on the wrong entity set. For example, suppose employees can work in at most one department and an employee's lot is determined by the department (did --> lot). Then lot belongs with the department rather than with the employee, and we can decompose into two entity sets: Workers(ssn, name, did, since) and Departments(did, dname, budget, lot).
The closure of a set of functional dependencies F is the set of all FDs implied by F and is denoted F+. Armstrong's axioms help find the closure of a set of FDs. The axioms are:
1. Reflexivity if Y is a subset of X, then X --> Y
2. Augmentation if X-->Y, then XZ-->YZ for any Z
3. Transitivity if X-->Y and Y-->Z, then X-->Z
It is convenient to use additional rules while reasoning about F+
Union if X -->Y and X-->Z, then X-->YZ
Decomposition if X-->YZ, then X-->Y and X-->Z
These additional rules are not essential; their soundness can be proved using Armstrong's axioms.
The attribute closure of an attribute set X with respect to F is the set of attributes A such that X --> A can be inferred using Armstrong's axioms.
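Attribute closure is usually computed by repeatedly applying the FDs until nothing new is added, which also gives a cheap test for whether a particular FD is in F+ without enumerating F+. The sketch below is one way to write it; the function names and the sample FDs are made up for the example.

```python
def attribute_closure(attrs, fds):
    """Return the attributes determined by attrs under the FDs in fds.

    fds is a list of (lhs, rhs) pairs of attribute sets.
    """
    closure = set(attrs)  # reflexivity: X determines each of its own attributes
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # X --> lhs and lhs --> rhs give X --> rhs by transitivity
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

def implies(fds, lhs, rhs):
    """True if lhs --> rhs is in F+."""
    return set(rhs) <= attribute_closure(lhs, fds)

# Hypothetical F = {A --> B, B --> C, CD --> E}.
F = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]

print(sorted(attribute_closure({"A"}, F)))   # ['A', 'B', 'C']
print(implies(F, {"A"}, {"C"}))              # True, by transitivity
print(implies(F, {"A"}, {"E"}))              # False, D is not determined by A
print(implies(F, {"A", "D"}, {"B", "E"}))    # True
```

The same closure test tells whether X is a superkey: X is one exactly when its closure contains every attribute of the relation.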

Thursday, February 21, 2013

Skip list

Skip list insertion and deletion have to keep track of the references at every level of the node being inserted or deleted. So the first step for either operation is to find, for each level from top to bottom, the node whose reference must change. These nodes can be held on the stack via recursion or in a structure on the heap. For insertion, each of these nodes is updated to point to the new node, and the new node points to what that node previously pointed to. For deletion, each of these nodes is updated to point to what the target node points to, which splices the target node out. In all cases, iterating the nodes at level zero should span the entire list.

Insertion should perform the updates to the references as it walks down the levels. Deletion could perform the updates as it returns up the levels.
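A minimal sketch of both operations follows. It takes the heap variant: the predecessor at each level is first collected into a list and the references are updated afterwards, rather than during the walk or on the return from recursion. The class names, the level cap of 8 and the coin-flip level choice are assumptions made for the example.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node at level i

class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=None):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel present on every level
        self.rng = random.Random(seed)

    def _predecessors(self, key):
        """Walk top to bottom, recording the last node before key on each level."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for level in reversed(range(self.MAX_LEVEL)):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        return update

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        update = self._predecessors(key)
        node = Node(key, self._random_level())
        for level in range(len(node.forward)):
            node.forward[level] = update[level].forward[level]  # new node points at the old successor
            update[level].forward[level] = node                 # predecessor points at the new node

    def delete(self, key):
        update = self._predecessors(key)
        target = update[0].forward[0]
        if target is None or target.key != key:
            return False
        for level in range(len(target.forward)):
            if update[level].forward[level] is target:
                update[level].forward[level] = target.forward[level]  # splice the target out
        return True

    def keys(self):
        """Level zero links every node, so walking it spans the entire list."""
        out, node = [], self.head.forward[0]
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out

sl = SkipList(seed=1)
for k in [5, 1, 9, 3, 7]:
    sl.insert(k)
print(sl.keys())
removed = sl.delete(5)
missing = sl.delete(42)
print(sl.keys())
```

Collecting the predecessors first keeps insert and delete symmetric: both reuse the same top-to-bottom search and differ only in how the recorded references are rewritten.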