Monday, July 22, 2013

Barry Wise on five progressive steps to database normalization
We start out with an example where we store a user's name, company, company address, and personal URLs - say url1 and url2.
The zero form (unnormalized form) is when all of this sits in a single table and no normalization has occurred.
The first normal form is achieved by
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of data
3) identifying each set of related data with a primary key
This yields a table where the user information is repeated for each URL, which removes the limitation on the number of url fields.
The Second normal form is achieved by
1) Creating separate tables for sets of values that apply to multiple records
2) Relating these tables with a foreign key
Basically, we break the URL values into a separate table so we can add more in the future.
The third normal form is achieved by
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher forms depend on data relationships involving one-to-one, one-to-many and many-to-many.
The Fourth normal form is
1) In a many-to-many relationship, independent entities cannot be stored in the same table.
To relate many users to many URLs, we define a url_relations table where the user id and url id are paired.
The next normal form is the Fifth normal form, which suggests that
1) The original table must be able to be reconstructed from the tables into which it has been broken down. This is a way to check that no new columns have been added. A sketch of the decomposed tables follows.
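Here is a minimal sketch of the resulting tables expressed as C++ structs, with hypothetical field names; each struct stands for one table and integer surrogate keys play the role of primary and foreign keys:

// A minimal sketch (hypothetical names): each struct stands for one table after
// normalization, with integer surrogate keys acting as primary and foreign keys.
#include <string>
using std::string;

struct Company     { int company_id; string name; string address; }; // 3NF: company data split off from the user
struct User        { int user_id; string name; int company_id; };    // company_id is a foreign key to Company
struct Url         { int url_id; string url; };                      // 2NF: urls moved to their own table
struct UrlRelation { int user_id; int url_id; };                     // 4NF: many-to-many bridge between users and urls
// 5NF check: joining User, Company, Url and UrlRelation back together on these
// keys should reproduce the original wide table, with no columns lost or invented.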
As always, remember that denormalization has its benefits as well.
Litt's tips additionally mention the following:
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier 

Sunday, July 21, 2013

Database normalization and denormalization rules are discussed here.
Codd described the objectives of normalization as follows:
1) To free the collection of relations from undesirable insertion, update and deletion dependencies.
2) To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the lifespan of application programs.
3) To make the relational model more informative to users
4) To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
The undesired side effects of insert, update or delete may include the following:
- Multiple rows for the same information which are updated independently and get out of sync
- Missing rows, such as when a new user is added but cannot be recorded because the user is not yet assigned to anything
- Inconsistent deletes, such as where a deletion in one table implies and requires a deletion in a completely different table
If the addition of new data requires changes to the existing structure, then such changes can cause regressions.
Tables, when normalized, correspond directly to real-world concepts and their relationships.
The normalized tables are suited for general querying across any set of tables.
Some common normalization terms:
Functional dependency (FD): X -> Y, where attribute Y is functionally dependent on a set of attributes X if and only if each X value is associated with one and only one Y value - for example, ssn -> name (see the sketch after this list of terms).
Trivial functional dependency: an FD of an attribute on a superset of itself.
Full functional dependency: an attribute is functionally dependent on X but not on any proper subset of X.
Transitive dependency: an indirect dependency, where X -> Y and Y -> Z together imply X -> Z.
Multivalued dependency: the presence of certain rows implies the presence of certain other rows.
Join dependency: the table can be recreated by joining its projections.
Superkey: a combination of attributes that can be used to uniquely identify a database record.
Candidate key: a minimal superkey, such as the Social Security Number.
Non-prime attribute: one that does not occur in any candidate key. A prime attribute is one that occurs in some candidate key.
A candidate key may be designated as the primary key, but normalization theory usually does not single it out from the other candidate keys.
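To make the functional dependency definition concrete, here is a small sketch over hypothetical ssn/name rows that checks whether X -> Y holds, i.e. whether each X value maps to exactly one Y value:

// A minimal sketch (hypothetical data): verify the functional dependency ssn -> name,
// i.e. that every ssn value is associated with one and only one name value.
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

struct Row { string ssn; string name; };

bool HoldsFD(const vector<Row>& rows) {
    map<string, string> seen;                          // ssn value -> the name seen with it
    for (const Row& r : rows) {
        map<string, string>::const_iterator it = seen.find(r.ssn);
        if (it == seen.end()) seen[r.ssn] = r.name;
        else if (it->second != r.name) return false;   // same ssn, two different names: FD violated
    }
    return true;
}

int main() {
    vector<Row> rows = { {"111", "Alice"}, {"222", "Bob"}, {"111", "Alice"} };
    cout << (HoldsFD(rows) ? "ssn -> name holds" : "ssn -> name violated") << endl;
    return 0;
}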
Normal forms include the following:
1) First normal form - the table faithfully represents a relation: it has at least one candidate key and no repeating groups
2) Second - no non-prime attribute is functionally dependent on a proper subset of any candidate key
3) Third - every non-prime attribute is non-transitively dependent on every candidate key in the table (no transitive dependency is allowed)
4) EKNF - every non-trivial FD is either the dependency of an elementary key attribute or a dependency on a superkey
5) BCNF - every non-trivial FD in the table is a dependency on a superkey
6) Fourth - every non-trivial multivalued dependency is a dependency on a superkey
7) Fifth - every non-trivial join dependency in the table is implied by the superkeys of the table
8) Domain/key - every constraint is a logical consequence of the table's domain constraints and key constraints
9) Sixth - the table has no non-trivial join dependencies at all
Denormalization - OLAP systems are denormalized as compared to OLTP systems. The redundant data is carefully controlled during ETL, and a normalized snowflake schema becomes a denormalized star schema. A non-first normal form is formed by nesting 1NF tables, and the reverse is unnesting.
The KMP algorithm for string pattern matching proceeds like this:

#include <iostream>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;
// Build the KMP failure table: next[i] is the length of the longest proper
// prefix of pattern[0..i-1] that is also a suffix of it.
int* PreProcess(string pattern) {
    int patternLength = pattern.length();
    if (patternLength == 0) return 0;
    int* next = new int[patternLength + 1];
    if (next == 0) return 0;
    next[0] = -1;  // set up for the loop below; unused by KMP itself

    int i = 0;
    while (i < patternLength) {
        next[i + 1] = next[i] + 1;
        // Fall back to shorter matching prefixes until one can be extended by pattern[i].
        while (next[i + 1] > 0 &&
               pattern[i] != pattern[next[i + 1] - 1])
            next[i + 1] = next[next[i + 1] - 1] + 1;
        i++;
    }
    return next;
}
// Search for every occurrence of pattern in text, appending each match's
// starting index to positions.
void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next = PreProcess(pattern);
    if (next == 0) return;
    int i = 0;  // number of pattern characters matched so far (current state)
    int j = 0;  // current position in the text
    while (j < textLength)
    {
        while (true)
            if (text[j] == pattern[i])   // matches
            {
                i++;                     // yes, move on to the next state
                if (i == patternLength)  // maybe that was the last state
                {
                    // found a match; record where it started
                    positions->push_back(j - (i - 1));
                    i = next[i];
                }
                break;
            }
            else if (i == 0) break;      // no match in state 0, give up on this character
            else i = next[i];            // fall back to the longest matching prefix
        j++;
    }
    delete[] next;  // PreProcess allocated the table with new[]
}
// Map each character to a class representative so the pattern can match any
// lowercase letter ('a'), uppercase letter ('A'), digit ('0') or punctuation ('.').
char CharCode(char chr) {
    if ( 'a' <= chr && chr <= 'z' )
        return 'a';
    if ( 'A' <= chr && chr <= 'Z' )
        return 'A';
    if ( '0' <= chr && chr <= '9' )
        return '0';
    if ( chr == '.' || chr == '?' || chr == '!' || chr == ',' || chr == ':' || chr == ';' || chr == '-' )
        return '.';
    return chr;  // leave any other character (e.g. whitespace) unchanged
}
// Apply CharCode to every character of the text.
string CodeText(string text) {
    string ret = text;
    for (size_t i = 0; i < text.length(); ++i) {
        ret[i] = CharCode(text[i]);
    }
    return ret;
}
// Print the matched positions, then show each occurrence of the pattern
// aligned under the processed text.
void FancyOutput(string pattern, string code, vector<int> *positions) {
    if (positions->empty()) return;
    cout << "Matched positions: ";
    for (size_t i = 0; i + 1 < positions->size(); ++i)
        cout << (*positions)[i] + 1 << ", ";
    cout << (*positions)[positions->size() - 1] + 1 << "." << endl;

    cout << "Text: " << code << endl;
    for (size_t i = 0; i < positions->size(); ++i) {
        printf("%5d ", (int)(i + 1));
        for (int j = 0; j < (*positions)[i]; ++j) cout << " ";
        cout << pattern << endl;
    }
}
int main()
{
    string pattern, text, code;
    cout << "Input pattern:" << endl;
    getline(cin, pattern);

    cout << "Input text:" << endl;
    getline(cin, text);
    cout << endl;

    code = CodeText(text);
    cout << "Processed text:" << endl;
    cout << code << endl;
    cout << endl;

    vector<int> positions;
    KMP(pattern, code, &positions);

    if (positions.size())
        cout << "Y" << endl;
    else
        cout << "N" << endl;

    if (positions.size())
        FancyOutput(pattern, code, &positions);
    return 0;
}
 

Saturday, July 20, 2013

Kimball architecture

The Kimball architecture is based on dimensional modeling. The following rules of thumb are observed in this modeling:
1) Load detailed atomic data into dimensional structures, i.e. do not load summarized data into the dimensional tables.
2) Structure dimensional models around the business processes. These business processes have performance metrics that often translate into facts or dimensions, and combined metrics could also become additional dimensions.
3) Ensure that every fact table has an associated date dimension table. The business processes and performance metrics mentioned above are associated with measurement events, which are usually periodic, so they can join to a calendar date dimension carrying attributes such as holiday indicators.
4) Ensure that all facts in a single fact table are at the same grain or level of detail, whether transactional, periodic snapshot or accumulating snapshot.
5) Resolve many-to-many relationships in fact tables. The events stored in a fact table are inherently associated with many places on many days, and these foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, in which case a many-to-many, dual-keyed bridge table is used in conjunction with the fact table.
6) Resolve many-to-one relationships in dimension tables. Hierarchical, fixed-depth many-to-one relationships are typically collapsed into a flattened dimension table. Do not normalize or snowflake an M:1 relationship; denormalize the dimensions instead.
7) Store report labels and filter domain values in dimension tables. The codes, decodes and descriptors used for labeling and querying should be captured in dimension tables. Again, such attributes should have no nulls.
8) Make certain that dimension tables use a surrogate key. Meaningless, sequentially assigned surrogate keys help deliver smaller fact tables, smaller indexes and improved performance (see the sketch after this list).
9) Create conformed dimensions to integrate data across the enterprise. Conformed dimensions, also referred to as common, master, standard or reference dimensions, are defined once in the ETL system. They deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes.
10) Continuously balance requirements and realities to deliver a DW/BI solution that is accepted by business users and that supports their decision making. User requirements and the underlying realities of the associated source data need to be reconciled.
Dimensional modeling, project strategy, technical ETL/BI architectures and deployment/maintenance all require such balancing acts.
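As a rough illustration of rules 3, 5 and 8 above, here is a minimal sketch, with hypothetical table and column names, of a star-schema fragment expressed as C++ structs: dimension tables keyed by surrogate keys and a single-grain fact table whose foreign keys reference them.

// A minimal sketch (hypothetical names) of a star-schema fragment:
// dimension tables keyed by meaningless surrogate keys, and a fact table
// at transactional grain whose foreign keys reference them and are never null.
#include <string>
using std::string;

struct DateDim     { int date_key; string calendar_date; bool holiday_indicator; };
struct ProductDim  { int product_key; string product_name; string category; };   // flattened M:1 hierarchy
struct CustomerDim { int customer_key; string customer_name; string region; };

struct SalesFact {            // one row per sales transaction (a single grain)
    int date_key;             // foreign key to DateDim
    int product_key;          // foreign key to ProductDim
    int customer_key;         // foreign key to CustomerDim
    double sales_amount;      // the measured facts
    int quantity;
};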

Friday, July 19, 2013

Corporate Information Factory Architecture

CIF operational systems topology:
The CIF architecture comprises the data warehouse and the operational data store as the core of the architecture. This core is surrounded by a metadata management plane where the operational systems reside. Operational systems can include external, distribution, product, account and customer operational systems. These operational systems add data into the data management core through transformations and integrations. The exploration warehouse, datamarts and data delivery also extract information from the data management core. The decision support interface may use these data marts, while the transaction interface may use the operational data store directly. CIF consumers acquire the information produced via data delivery, manipulate it in datamarts, and assimilate it in their own environments. Outside this metadata management plane are the Information Services as well as Operations and administration.
The Information Services comprise groups of items such as the library and toolbox as well as the workbench. Operations and administration involve systems management, data acquisition management, service management and change management.
Producers are the first link in the information food chain. They synthesize the data into raw information and make it available for consumption across the enterprise.
Operational systems are the core of the day-to-day operations and are organized by the product they support. However, businesses are more customer-oriented than product-oriented so that they can differentiate their offerings. CIF provides facilities to define how this data relates to a customer using rules that form the metadata.
Operational data usually stands alone and has not been integrated. CIF synthesizes, cleans and integrates the data before it is usable. CIF also acts as a history store for the enterprise.
Next, integration and transformation consist of the processes to capture, integrate, transform, cleanse, re-engineer and load the source data into the data warehouse. Typically data is pulled from the operational systems rather than pushed. Configuration management and scheduling play a large role here. These packages can usually be written with knowledge of the source data and, once written, don't change much, so they are good candidates for scheduled jobs.
Next, the data warehouse plays a big role in CIF. It is often described as a "subject-oriented, integrated, time-variant (temporal) and non-volatile collection of data". Sizes of up to terabytes of data are not uncommon for a data warehouse.
Next, data management extends the data warehouse with archival/restoration, partitioning, and movement of data based on triggering and aggregation. These are often in-house development tasks.
Finally, the consumers extract information from this collection of data.  The data-marts support these analytic requirements and they are made available via the decision support interface.
Also, the metadata enables metrics for measuring support.
Courtesy: Imhoff
OAuth

OAuth is a feature by which users can sign in with a third-party intermediary authorization. For example, if you were at an online printing store and wanted to print photos from an album on another website, the printing store can retrieve the photos from that website on your behalf, and you can continue with your print order, provided you sign in to that website. It differs from a regular login in that you don't have to repeatedly provide a username and password for different websites; the sign-in on one website can be reused for others.
The way this works is by providing users access tokens based on a third-party referral. This token grants a user access to resources. The token may expire every few minutes and may need to be refreshed, so a refresh token can be requested. Admins of the OAuth provider can see which users have registered from which clients. They can also revoke access for users and clients.
OAuth has four different workflows for granting access. These are:
Implicit Grant – such as when the mobile application follows a redirect to the client application.
Credentials Grant – such as when the user provides username and password for a token
Client Credentials Grant – such as when admin applications from secured kiosks provide context regardless of the user
Refresh Grant – a client can retrieve an access token by exchanging a previous refresh token.
An access token request consists of the following:
- a grant type, e.g. authorization code
- code
- redirect URI
- client Id
while the response consists of
- a return type ( data or redirect )
- Access Token
- token type
- refresh type
- expires in how many seconds
- refresh token
- scope
The authorization request consists of response type, client id, scope and state. The response consists of code and state. A sketch of a token request follows.
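Here is a minimal sketch, with hypothetical endpoint, code and client id values, of how the URL-encoded body of such an access token request could be assembled; in practice this body would be sent as an HTTP POST to the provider's token endpoint, and the response would carry the access token, token type, expiry, refresh token and scope described above.

// A minimal sketch (hypothetical endpoint, code and client id) that builds the
// URL-encoded body of an access token request as described above.
#include <iostream>
#include <string>
using namespace std;

int main() {
    string tokenEndpoint = "https://provider.example.com/oauth/token";   // hypothetical endpoint
    string body =
        string("grant_type=authorization_code") +                 // the grant type
        "&code=SplxlOBeZQQYbYS6WxSbIA" +                           // code returned by the authorization request
        "&redirect_uri=https%3A%2F%2Fclient.example.com%2Fcb" +    // redirect URI registered by the client
        "&client_id=s6BhdRkqt3";                                   // client id issued at registration
    cout << "POST " << tokenEndpoint << endl;
    cout << body << endl;
    return 0;
}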

Thursday, July 18, 2013


This post talks about some business intelligence practices:
OLTP and OLAP processing run on different systems due to several differences:
The first difference is that online transaction processing is a workload in which response time is critical and transactions touch existing data. Online analytical processing, on the other hand, is based on data accumulated over time, such as for decision support.
The second difference is that the latter requires a schema with a star, multi-level star or snowflake-like design.
The third difference is that the latter requires very different data structures, such as bitmaps and conjunctive filters, while the former uses B+ trees.
The fourth difference is that the databases in the latter require fast loads of large amounts of data periodically.
The fifth difference is that joins on the data are very costly to perform, so views need to be captured and persisted until the next update.

OLAP systems could be implemented with a CIF architecture or a Star-Kimball model: 
The CIF architecture uses entity relationships and database normalization rules to build a normalized data model in its data warehouse. In the Star-Kimball model, a dimensional model is used where the transactions are split into facts and dimensions and arranged in a star-like schema, or, if they are hierarchical, in a multi-level star-like schema.
The advantage of the dimensional model is that it is easy to understand and use. The joins are simpler, and the users don't need to know the source of the data or the data structure but can work off materialized views, although the operational systems have to handle the complex transformations to maintain the dimensions.
The advantage with the normalization model is that it holds a lot of information and is flexible to changing business needs. 

A day-to-day task may involve preventing orphaned data. We prevent this by adding constraints and checks to the data. For example, in a table that maintains employees and their managers, the column denoting the manager points back to another employee in the table. The top of the organization usually has no manager, and we represent this by making that person their own manager. This is useful because none of the employees can be orphaned when all the data in the manager column is checked for consistency, as shown in the sketch below. In general, we use constraints, checks, primary and foreign keys, and surrogate and natural keys to enforce the integrity of the data.
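Here is a minimal sketch, over hypothetical employee data, of that consistency check: every manager value must point at an existing employee, and the head of the organization is stored as their own manager so the check still passes.

// A minimal sketch (hypothetical data): verify that no employee row is orphaned,
// i.e. every manager_id refers to an existing employee_id.
#include <iostream>
#include <set>
#include <vector>
using namespace std;

struct Employee { int employee_id; int manager_id; };

bool NoOrphans(const vector<Employee>& employees) {
    set<int> ids;
    for (const Employee& e : employees) ids.insert(e.employee_id);
    for (const Employee& e : employees)
        if (ids.find(e.manager_id) == ids.end()) return false;  // dangling manager reference
    return true;
}

int main() {
    vector<Employee> employees = { {1, 1},           // the head of the organization is their own manager
                                   {2, 1}, {3, 2} };
    cout << (NoOrphans(employees) ? "consistent" : "orphaned rows found") << endl;
    return 0;
}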