Wednesday, July 24, 2013

Technical overview of OneFS continued

In a file write operation on a three-node cluster, each node participates with a two-layer stack: the initiator and the participant. The node that the client connects to acts as the captain. In an Isilon cluster, data, parity, metadata and inodes are all distributed across multiple nodes. Reed-Solomon erasure coding is used to protect data, and it is generally more efficient (up to 80%) on raw disk with five nodes or more. The stripe width of any given file is determined by the number of nodes in the cluster, the size of the file, and the protection setting such as N+2.
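As a rough, hypothetical illustration (not the actual OneFS layout algorithm), the raw-disk efficiency of an N+M erasure-coded stripe can be estimated from the stripe width and the number of protection units:

#include <iostream>

// Hypothetical sketch: raw-disk efficiency of an N+M protected stripe.
// 'width' is the number of nodes the stripe spans and 'm' is the number
// of erasure-code (parity) units, e.g. m = 2 for an N+2 setting.
double StripeEfficiency(int width, int m) {
    if (width <= m) return 0.0;    // not enough nodes to hold any data
    return static_cast<double>(width - m) / width;
}

int main() {
    // Five nodes with single-unit protection gives the 80% figure
    // mentioned above; N+2 on the same five nodes gives 60%.
    std::cout << StripeEfficiency(5, 1) << std::endl;   // 0.8
    std::cout << StripeEfficiency(5, 2) << std::endl;   // 0.6
    return 0;
}
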
OneFS uses an InfiniBand back-end network for fast internal network access. Data is written in atomic units called protection groups. If every protection group is safe, the entire file is safe. For files protected by erasure codes, a protection group consists of a series of data blocks as well as a set of erasure codes. The type of protection group can be switched dynamically: when node failures prevent the desired number of erasure codes from being used, OneFS temporarily relies on mirroring and then reverts to erasure-coded protection groups without admin intervention.
The OneFS file system block size is 8 KB, and a billion such small files can be written at high performance because the on-disk structures can scale to that size. For larger files, multiple contiguous 8 KB blocks, up to sixteen, can be striped onto a single node's disk. For even larger files, OneFS maximizes sequential performance by writing in stripe units of sixteen contiguous blocks. An AutoBalance service reallocates and rebalances data to make storage space more efficient and usable.
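A minimal sketch of the block arithmetic this implies, assuming only the 8 KB block size and sixteen-block stripe units mentioned above (the function names are illustrative):

#include <iostream>

// Hypothetical sketch: how many 8 KB blocks and sixteen-block (128 KB)
// stripe units a file of a given size would occupy.
const long long kBlockSize = 8 * 1024;        // OneFS block size
const long long kBlocksPerStripeUnit = 16;    // contiguous blocks per stripe unit

long long BlockCount(long long fileSize) {
    return (fileSize + kBlockSize - 1) / kBlockSize;               // round up
}

long long StripeUnitCount(long long fileSize) {
    long long blocks = BlockCount(fileSize);
    return (blocks + kBlocksPerStripeUnit - 1) / kBlocksPerStripeUnit;
}

int main() {
    long long fileSize = 1024 * 1024;         // a 1 MB file
    std::cout << BlockCount(fileSize) << " blocks, "
              << StripeUnitCount(fileSize) << " stripe units" << std::endl;   // 128 blocks, 8 stripe units
    return 0;
}
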
The captain node uses a two-phase commit transaction to safely distribute writes to multiple NVRAMs across the cluster. The mechanism relies on NVRAM for journaling all transactions on every node in the cluster. Using the NVRAMs in parallel allows high-throughput writes and safety against failures. When a node returns from failure, the only action required is to replay its journal from NVRAM and, occasionally, for AutoBalance to rebalance files that were involved in the transaction. No resynchronization event ever needs to take place. Access patterns can be chosen at a per-file or per-directory level from the following (a minimal sketch follows this list):
Concurrency - optimizes for current load on the cluster, featuring many simultaneous clients.
Streaming - optimizes for high speed streaming of a single file, for example to enable very fast reading within a single client.
Random - optimizes for unpredictable access to the file, by adjusting striping and disabling the use of any pre-fetch cache.
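A minimal sketch of how such a per-file or per-directory hint might be represented; the type and function names here are hypothetical, not the OneFS API:

#include <map>
#include <string>

// Illustrative only: the three access-pattern hints described above,
// attached to a file or directory path.
enum AccessPattern { kConcurrency, kStreaming, kRandom };

std::map<std::string, AccessPattern> g_patternByPath;

void SetAccessPattern(const std::string& path, AccessPattern pattern) {
    g_patternByPath[path] = pattern;   // applies to a file or a directory
}

int main() {
    SetAccessPattern("/ifs/media", kStreaming);      // fast single-stream reads
    SetAccessPattern("/ifs/home", kConcurrency);     // many simultaneous clients
    return 0;
}
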
Caching is used to accelerate the process of writing data into an Isilon cluster. Data is written into the NVRAM-based cache of the initiator node before being written to disk, then batched up and flushed to disks later at a more convenient time. In the event of a cluster split or unexpected node outage, uncommitted cached writes are fully protected.
Caching operates as follows (a structural sketch in code follows these steps):
1) An NFS client sends node 1 a write request for a file with N+2 protection.
2) Node 1 accepts the writes into its NVRAM write cache (fast path) and then mirrors the writes to the participant nodes' logfiles.
3) Write acknowledgements are returned to the NFS client immediately, so write-to-disk latency is avoided.
4) As node 1's write cache fills, it is flushed via the two-phase commit protocol with the appropriate parity protection.
5) The write cache and participant node logfiles are cleared and made available to accept new writes.
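The five steps above, including the two-phase commit flush, can be sketched structurally as follows; every name is hypothetical and the bodies are placeholders, not the real implementation:

#include <string>
#include <vector>

// Structural sketch of the cached write path described above.
struct WriteRequest { std::string path; std::vector<char> data; };

void AppendToNvramCache(const WriteRequest&) { /* step 2: fast-path cache */ }
void MirrorToParticipantLogs(const WriteRequest&) { /* step 2: mirror to logfiles */ }
void AcknowledgeClient(const WriteRequest&) { /* step 3: early ack, no disk latency */ }
bool AllParticipantsPrepared() { return true; }   // placeholder phase-1 vote

// Step 4: flush the cache with a two-phase commit so that either every
// participant applies the parity-protected stripes or none of them do.
void FlushWriteCache() {
    if (AllParticipantsPrepared()) {
        // phase 2: commit - write stripes plus parity, then clear the
        // write cache and participant logfiles (step 5)
    } else {
        // abort: participants roll back using their NVRAM journals
    }
}

// Step 1: the initiator node receives the client write request.
void HandleNfsWrite(const WriteRequest& w) {
    AppendToNvramCache(w);
    MirrorToParticipantLogs(w);
    AcknowledgeClient(w);
}

int main() {
    WriteRequest w;
    w.path = "/ifs/data/file.txt";
    HandleNfsWrite(w);
    FlushWriteCache();
    return 0;
}
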

Tuesday, July 23, 2013

Technical overview of OneFS file system continued

Each cluster creates a single namespace and file system. This means that the file system is distributed across all nodes. There is no partitioning and no need for volume creation. The cluster can be accessed through any node. Data and metadata are striped across the nodes for redundancy and availability. The single file tree can grow with node additions without any interruption to users. Services take care of all the associated management routines such as tiering, replication, etc. Files are transferred in parallel without regard to the depth and breadth of the tree.
This design is different from having separate namespaces or volumes and aggregating them so that they appear as a single namespace. In that case the administrator has to lay out the tree, move files between volumes, and account for purchase cost, power and cooling, etc.
OneFS uses physical pointers and extents for metadata and stores file and directory metadata in inodes. B-trees are used extensively in the file system. Data and metadata are redundant, and the amount of redundancy can be configured by the administrator. Metadata takes up only about 1% of the system. Metadata access and locking are collectively managed in a peer-to-peer architecture.
OneFS blocks are accessed by multiple devices simultaneously, so the addressing scheme is indexed by a tuple of {node, drive, offset}. Data is protected by erasure coding and is labeled by the number of simultaneous failures it can tolerate, such as "N+2", which can withstand two failures while the metadata is mirrored 3x.
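A sketch of what such a block address tuple might look like; the struct and field types are illustrative, not the on-disk format:

#include <iostream>

// Sketch of the {node, drive, offset} addressing tuple described above.
struct BlockAddress {
    unsigned int node;          // which node in the cluster
    unsigned int drive;         // which drive within that node
    unsigned long long offset;  // byte offset of the 8 KB block on that drive
};

int main() {
    BlockAddress addr = {3, 7, 1048576};   // node 3, drive 7, 1 MB into the drive
    std::cout << addr.node << ":" << addr.drive << ":" << addr.offset << std::endl;
    return 0;
}
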
OneFS controls the placement of files directly, down to the sector level on any drive. This allows for optimized data placement and I/O patterns and avoids unnecessary read-modify-write operations. As opposed to dedicating entire RAID volumes to a particular performance type and protection setting, the file system is homogeneous, and spindle usage can be optimized with configuration changes made at any time, online.
Every node that participates in I/O has a stack comprising two layers. The top half is the initiator, which serves client-side protocols such as NFS, CIFS, iSCSI, HTTP and FTP, and the bottom half is the participant, which comprises the NVRAM. When a client connects to a node to write a file, it connects to the top half, where the file is broken into smaller logical chunks called stripes before being written. Failure-safe buffering using a write coalescer is used to ensure that the writes are efficient.
OneFS stripes the data across all nodes and not simply across all disks.
OneFS file system overview:
The EMC OneFS file system is a distributed file system that runs on a storage cluster. It combines three layers of traditional storage architecture, namely a file system, a volume manager, and data protection.
It scales out because it relies on intelligent software, commodity hardware, and a distributed architecture.
OneFS works with a cluster that consists of multiple nodes and starts out with as few as three nodes. In the cluster, nodes can provide different ratios of throughput and input/output operations per second (IOPS). OneFS combines these into a whole: RAM is grouped into a single coherent cache, NVRAM is grouped together for high-throughput writes, and spindles and CPUs are combined to increase throughput, capacity and IOPS.
There are two types of networks associated with a cluster: internal and external. Node-to-node communication within a cluster is performed using a proprietary unicast protocol over an IP-over-InfiniBand network with redundant switches.
Clients connect to the cluster using Ethernet connections, and each node provides its own ports. File system protocols such as NFS, CIFS, HTTP, iSCSI, FTP and HDFS are supported.
OneFS runs on a BSD-based operating system that supports both Windows and Linux/Unix semantics such as hard links, delete-on-close, atomic rename, ACLs and extended attributes.
To communicate with clients, the variety of protocols mentioned above is supported. The I/O subsystem is split into two halves: the top half is the initiator and the bottom half is the participant. Any node that the client connects to acts as the initiator.
Cluster operations involve checking and maintaining the health of the cluster. These jobs are run through a job engine and have priorities associated with them. Jobs include balancing free space in the cluster, scanning for viruses, reclaiming disk space, associating a path and its contents with a domain, rebuilding and re-protecting the file system, gathering file system information, scrubbing disks for media-level errors, and reverting an entire snapshot to disk.
The granularity of the jobs ensures that OneFS performs adequately and appropriately for every impact interval in the customer's environment.
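A minimal sketch of a priority-driven job engine of this kind; the job names, fields and scheduling policy here are hypothetical, not the OneFS job engine itself:

#include <queue>
#include <string>
#include <vector>

// Illustrative sketch of a job engine that runs maintenance jobs in priority order.
struct Job {
    std::string name;   // e.g. a rebuild, rebalance or disk-scrub job
    int priority;       // lower value = more urgent
    int impact;         // allowed impact on client I/O in the current interval
};

struct ByPriority {
    bool operator()(const Job& a, const Job& b) const {
        return a.priority > b.priority;   // smallest priority value on top
    }
};

int main() {
    std::priority_queue<Job, std::vector<Job>, ByPriority> jobs;
    Job reprotect = {"Reprotect", 1, 3};   // rebuild and re-protect first
    Job rebalance = {"Rebalance", 3, 2};   // balance free space
    Job scrub     = {"DiskScrub", 5, 1};   // scrub disks for media-level errors
    jobs.push(reprotect);
    jobs.push(rebalance);
    jobs.push(scrub);
    while (!jobs.empty()) {
        // run jobs.top() within its impact budget for this interval, then:
        jobs.pop();
    }
    return 0;
}
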


Monday, July 22, 2013

Barry Wise on five progressive steps to database normalization
We start out with an example where we store a user's name, company, company address, and personal URLs - say url1 and url2.
The zero form is when all of this is in a single table and no normalization has occurred.
The first normal form is achieved by:
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of related data
3) identifying each set of related data with a primary key
This yields a table where the user information is repeated for each URL, which removes the limitation of a fixed number of url fields.
The second normal form is achieved by:
1) creating separate tables for sets of values that apply to multiple records
2) relating these tables with a foreign key
Basically, we break the URL values into a separate table so we can add more in the future.
The third normal form is achieved by:
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher normal forms depend on data relationships involving one-to-one, one-to-many and many-to-many relationships.
The fourth normal form requires that:
1) in a many-to-many relationship, independent entities are not stored in the same table.
To relate many users to many URLs, we define a url_relations table where the user id and url id are paired.
The next normal form is the fifth normal form, which suggests that:
1) the original table must be reconstructible from the tables into which it has been broken down. This is a way to check that no new columns have been added. (The final set of tables is sketched below.)
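To make the decomposition concrete, here is the final set of tables from this walk-through rendered as plain C++ records; the field names are illustrative:

#include <string>
#include <vector>

// Illustrative rendering of the normalized tables as plain records.
struct Company     { int company_id; std::string name; std::string address; };
struct User        { int user_id; std::string name; int company_id; };  // FK to Company
struct Url         { int url_id; std::string url; };
struct UrlRelation { int user_id; int url_id; };   // many-to-many bridge table

int main() {
    std::vector<Company> companies;
    std::vector<User> users;
    std::vector<Url> urls;
    std::vector<UrlRelation> url_relations;

    Company acme = {1, "Acme", "12 Main St"};       companies.push_back(acme);
    User alice   = {1, "Alice", acme.company_id};   users.push_back(alice);
    Url home     = {1, "http://example.com"};       urls.push_back(home);
    UrlRelation rel = {alice.user_id, home.url_id}; url_relations.push_back(rel);
    return 0;
}
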
As always, remember that denormalization has its benefits as well.
Also, Litt's tips mention the following:
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier 

Sunday, July 21, 2013

Database normalization and denormalization rules are discussed here.
Codd described the objectives of normalization as follows:
1) To free the collection of relations from undesirable insertion, update and deletion dependencies.
2) To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the lifespan of application programs.
3) To make the relational model more informative to users.
4) To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
The undesired side effects of insert, update or delete may include the following:
- multiple rows for the same information, which are updated independently and get out of sync
- missing rows, such as when a new user is added but not yet assigned to anything
- inconsistent deletes, such as where a deletion in one table implies and requires a deletion in a completely different table.
If the addition of new data requires changes to the existing structure, such changes can cause regressions.
Tables, when normalized, correspond immediately to real-world concepts and their relationships.
The normalized tables are suited for general querying across any set of tables.
Some common normalization terms:
Functional dependency (FD): X -> Y, where attribute Y has a functional dependency on a set of attributes X if and only if each X value is associated with one and only one Y value (a small code sketch follows these definitions).
Trivial functional dependency: an FD of an attribute on a superset of itself.
Full functional dependency: an attribute is fully FD on X when it is FD on X but not on any proper subset of X.
Transitive dependency: X -> Y and Y -> Z imply X -> Z.
Multivalued dependency: the presence of some rows implies the presence of certain other rows.
Join dependency: the table can be recreated by joining the tables it decomposes into.
Superkey: a combination of attributes that can be used to uniquely identify a database record.
Candidate key: a minimal superkey, such as the Social Security Number.
A non-prime attribute is one that does not occur in any candidate key; a prime attribute is one that occurs in some candidate key.
A candidate key may be designated as the primary key, but it is usually not treated differently from the other candidate keys.
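A small sketch of checking a functional dependency X -> Y over a set of in-memory rows: the dependency holds when no X value is associated with two different Y values.

#include <map>
#include <string>
#include <vector>

// Sketch: verify the functional dependency X -> Y over in-memory rows.
struct Row { std::string x; std::string y; };

bool HoldsFunctionalDependency(const std::vector<Row>& rows) {
    std::map<std::string, std::string> seen;   // X value -> first Y value seen
    for (size_t k = 0; k < rows.size(); ++k) {
        std::map<std::string, std::string>::iterator it = seen.find(rows[k].x);
        if (it == seen.end()) seen[rows[k].x] = rows[k].y;
        else if (it->second != rows[k].y) return false;   // two Y values for one X
    }
    return true;
}

int main() {
    std::vector<Row> rows;
    Row r1 = {"SSN-1", "Alice"}; rows.push_back(r1);
    Row r2 = {"SSN-2", "Bob"};   rows.push_back(r2);
    Row r3 = {"SSN-1", "Alice"}; rows.push_back(r3);
    return HoldsFunctionalDependency(rows) ? 0 : 1;   // the FD holds here
}
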
Normal forms include the following:
1) First normal form - the table faithfully represents a relation and has no repeating groups
2) Second - no non-prime attribute is FD on a proper subset of any candidate key
3) Third - every non-prime attribute is non-transitively dependent on every candidate key in the table (no transitive dependencies are allowed)
4) EKNF - every non-trivial FD is either the dependency of an elementary key attribute or a dependency on a superkey
5) BCNF - every non-trivial FD in the table is a dependency on a superkey
6) Fourth - every non-trivial multivalued dependency in the table is a dependency on a superkey
7) Fifth - every non-trivial join dependency in the table is implied by the superkeys of the table
8) Domain/key - every constraint on the table is a logical consequence of the table's domain constraints and key constraints
9) Sixth - the table has no non-trivial join dependencies at all
Denormalization: OLAP systems are denormalized compared to OLTP systems. The redundant data is carefully controlled during ETL. A normalized snowflake schema becomes a denormalized star schema. Non-first normal form is formed by nesting 1NF tables, and the reverse is unnesting.
The KMP algorithm for string pattern matching proceeds like this:

#include "stdafx.h"
#include <iostream>
#include <cstdio>
#include <cstring>
#include <vector>
using namespace std;
int* PreProcess(string pattern) {
    int patternLength = pattern.length();
    if (patternLength == 0) return 0;
    int* next = new int[patternLength + 1];
    if (next == 0) return 0;
    next[0] = -1;  // set up for the loop below; unused by KMP itself

    // next[k] is the length of the longest proper prefix of the pattern
    // that is also a suffix of pattern[0..k-1], i.e. the state to fall
    // back to after matching k characters and then failing.
    int i = 0;
    while (i < patternLength) {
        next[i + 1] = next[i] + 1;
        while (next[i + 1] > 0 &&
               pattern[i] != pattern[next[i + 1] - 1])
            next[i + 1] = next[next[i + 1] - 1] + 1;
        i++;
    }
    return next;
}
// Search 'text' for 'pattern' and append the start index of each match
// (overlaps included) to 'positions'.
void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next = PreProcess(pattern);
    if (next == 0) return;
    int i = 0;                               // number of pattern chars matched
    int j = 0;                               // current position in the text
    while (j < textLength) {
        while (true) {
            if (text[j] == pattern[i]) {     // matches
                i++;                         // yes, move on to the next state
                if (i == patternLength) {    // maybe that was the last state
                    positions->push_back(j - (i - 1));  // record match start
                    i = next[i];             // fall back to allow overlapping matches
                }
                break;
            }
            else if (i == 0) break;          // no match in state 0, give up
            else i = next[i];                // fall back to a shorter border
        }
        j++;
    }
    delete[] next;                           // release the table from PreProcess
}
// Map each character to a class representative: lower-case letters to 'a',
// upper-case letters to 'A', digits to '0' and common punctuation to '.'.
// Other characters (including spaces) are returned unchanged so that the
// function always returns a value.
char CharCode(char chr) {
    if ( 'a' <= chr && chr <= 'z' )
        return 'a';
    if ( 'A' <= chr && chr <= 'Z' )
        return 'A';
    if ( '0' <= chr && chr <= '9' )
        return '0';
    if ( chr == '.' || chr == '?' || chr == '!' || chr == ',' || chr == ':' || chr == ';' || chr == '-' )
        return '.';
    return chr;
}
string CodeText(string text) {
    string ret = text;
    for (int i = 0; i < text.length(); ++i) {
        ret[i] = CharCode(text[i]);
    }
    return ret;
}
void FancyOutput(string pattern, string code, vector<int> *positions) {
    cout << "Matched positions: ";
    for (int i = 0; i < positions->size()-1; ++i)
        cout << (*positions)[i] + 1 << ", ";
    cout << (*positions)[positions->size()-1] + 1 << "." << endl;

    std::cerr << "Text: " << code.c_str() << endl;
    for (int i = 0; i < positions->size(); ++i) {
        printf("%5d ", i+1);
        for (int j = 0; j < (*positions)[i]; ++j) cout << " ";
        cout << pattern.c_str() << endl;
    }
}
int _tmain(int argc, _TCHAR* argv[])
{
    string pattern, text, code;
 char pattext[1024];
 char txttext[1024];
    cout << "Input pattern:" << endl;
    cin.getline(pattext, 1024, '\n');
 pattern.assign(pattext);

    cout << "Input text:" << endl;
    cin.getline(txttext, 1024, '\n');
 text.assign(txttext);
    cout << endl;

    code = CodeText(text);
    cout << "Processed text:" << endl;
    cout << code.c_str() << endl;
    cout << endl;

    vector<int> *positions = new vector<int>();
    KMP(pattern, code, positions);

    if ( positions->size() )
        cout << "Y" << endl;
    else
        cout << "N" << endl;

    if ( positions->size() )
        FancyOutput(pattern, code, positions);
    delete positions;   // allocated with new above
    return 0;
}
 

Saturday, July 20, 2013

Kimball architecture

The Kimball architecture is based on dimensional modeling. The following rules of thumb are observed in this modeling:
1) Load detailed atomic data into dimensional structures, i.e. do not load summarized data into the dimensional tables.
2) Structure dimensional models around business processes. These business processes have performance metrics that often translate into facts, and combined metrics can become additional facts.
3) Ensure that every fact table has an associated date dimension table. The business processes and performance metrics mentioned above are associated with measurement events, which are usually tied to dates carrying attributes such as holiday indicators.
4) Ensure that all facts in a single fact table are at the same grain or level of detail, whether transactional, periodic snapshot or accumulating snapshot.
5) Resolve many-to-many relationships in fact tables. The events stored in a fact table are inherently associated with many places on many days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event; in that case a many-to-many, dual-keyed bridge table is used in conjunction with the fact table.
6) Resolve many-to-one relationships in dimension tables. Hierarchical, fixed-depth, many-to-one relationships are typically collapsed into a flattened dimension table. Do not normalize or snowflake an M:1 relationship; denormalize it into the dimension instead.
7) Store report labels and filter domain values in dimension tables. The codes, decodes and descriptors used for labeling and querying should be captured in dimension tables. Again, such attributes should have no nulls.
8) Make certain that dimension tables use a surrogate key. Meaningless, sequentially assigned surrogate keys help make fact tables and indexes smaller and improve performance (see the sketch at the end of this post).
9) Create conformed dimensions to integrate data across the enterprise. Conformed dimensions, also referred to as common, master, standard or reference dimensions, are defined once in the ETL system; they deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes.
10) Continuously balance requirements and realities to deliver a DW/BI solution that is accepted by business users and supports their decision making. User requirements and the underlying realities of the associated source data need to be reconciled.
Dimensional modeling, project strategy, technical ETL/BI architectures and deployment/maintenance all require balancing acts.
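To illustrate rules 3 and 8 above, here is a minimal sketch of a fact table keyed entirely by meaningless surrogate keys, one of which points at a date dimension; all names are hypothetical.

#include <string>

// Illustrative sketch of a date dimension, a second dimension and a fact
// table at transactional grain, all joined on surrogate keys.
struct DateDim {
    int date_key;               // surrogate key
    std::string calendar_date;  // e.g. "2013-07-20"
    bool is_holiday;            // holiday indicator
};

struct ProductDim {
    int product_key;            // surrogate key
    std::string description;    // report label / filter value
};

struct SalesFact {              // transactional grain: one row per sale line
    int date_key;               // FK to DateDim, never null
    int product_key;            // FK to ProductDim, never null
    double sales_amount;        // additive measure
};

int main() {
    DateDim d = {20130720, "2013-07-20", false};
    ProductDim p = {42, "Widget"};
    SalesFact f = {d.date_key, p.product_key, 19.99};
    return f.sales_amount > 0 ? 0 : 1;
}
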