Sunday, May 4, 2014

Revisit of some TSQL
Create Table example:
CREATE TABLE Demographics(
id INT IDENTITY (1,1),
name VARCHAR(15) NOT NULL,
country VARCHAR(2) NOT NULL DEFAULT 'US'
        CONSTRAINT country_check
        CHECK (country IN ('CA', 'US')),
indexed_name VARCHAR(15),
CONSTRAINT demographics_pk
PRIMARY KEY (id),
CONSTRAINT demographics_fk01
FOREIGN KEY (name, country)
REFERENCES parent_example (name, country) );
Note that T-SQL does not allow naming a NOT NULL constraint, and the foreign key requires a unique or primary key on parent_example (name, country).

ALTER TABLE Demographics ADD
age INT,
CONSTRAINT age_check CHECK (age > 0 AND age < 150);

LIKE patterns (not full regular expressions) can be specified with one or more of
%, _, [...] and [^...]
Common formatting constructs include CASE, COALESCE() and ISNULL()

Hierarchical queries can be written with common table expressions.
To walk the management hierarchy level by level, we write
WITH directreports
( manager_id, employee_id, employee_name, level)
AS
(SELECT e.manager_id, e.employee_id, e.employee_name, 0 AS level
 FROM dbo.Employee e
 WHERE e.manager_id IS NULL
 UNION ALL
 SELECT e.manager_id, e.employee_id, e.employee_name, d.level + 1
 FROM dbo.Employee e
 INNER JOIN directreports d
 ON e.manager_id = d.employee_id)
SELECT manager_id, employee_id, employee_name, level FROM directreports

SELECT d.name AS department_name,
       COUNT(*) AS employee_count
FROM employee e INNER JOIN department d
  ON e.department_id = d.id
GROUP BY d.name WITH ROLLUP
-- generates a summary row

WITH CUBE generates summaries for all possible combinations of the grouping columns plus a grand total.
GROUPING(group_by_col_name) returns 1 when the NULL in that column of the result set was produced by ROLLUP or CUBE, and 0 otherwise.

For string functions use CHARINDEX, SUBSTRING, LEN or concatenation with +.
DATETIME functions include GETUTCDATE(), DATEADD(datepart, number, date) and DATEDIFF(datepart, startdate, enddate), e.g. DATEADD(day, 1, GETDATE());

Use CONVERT with a style code to format a datetime, e.g. CONVERT(VARCHAR(10), GETDATE(), 101)

Next BCL coverage of .Net

FileSystem Interface:
Classes:
File : static methods Create, Delete, Open, ReadAllText, WriteAllText, GetLastWriteTime
Directory: static methods: CreateDirectory, Delete, Exists, EnumerateFiles, GetDirectories, GetFiles, GetLastWriteTime
FileInfo
DirectoryInfo
FileStream
Path  - static methods.
FileSystemWatcher

System.Messaging has the following classes and methods:
The MessageQueue class has methods such as Send, Receive, Peek, GetPublicQueuesByMachine, GetPublicQueuesByLabel and GetPrivateQueuesByMachine.

MessageQueueCriteria, MessageQueueEnumerator, MessageQueuePermission and MessageQueueTransaction are other classes.





We write a few more methods. (Tree_Minimum comes first so it is declared before Tree_Successor uses it.)
Node* Tree_Minimum(Node* root)
{
    Node* x = root;
    while (x && x->left)
        x = x->left;
    return x;
}

Node* Tree_Successor(Node* root)
{
    assert(root != nullptr);
    if (root->right)
        return Tree_Minimum(root->right);
    Node* x = root;
    // find the lowest ancestor of x whose left child is also an ancestor of x
    Node* y = root->parent;
    while (y && x == y->right)
    {
        x = y;
        y = y->parent;
    }
    return y;
}

Node* Tree_Maximum(Node* root)
{
    Node* x = root;
    while (x && x->right)
        x = x->right;
    return x;
}

Node* Tree_Predecessor(Node* root)
{
    assert(root != nullptr);
    if (root->left)
        return Tree_Maximum(root->left);
    Node* x = root;
    // find the lowest ancestor of x whose right child is also an ancestor of x
    Node* y = root->parent;
    while (y && x == y->left)
    {
        x = y;
        y = y->parent;
    }
    return y;
}

File - System organization
There are primarily two design considerations:
1) exposure of the file system to the user via definitions of files, attributes and operations
2) overlay of the logical file system on top of the physical file system via data structures and algorithms
The file system stack consists of
application programs
logical file system
file-organization module
basic file system
I/O control
devices
A typical open-file table will have
an index, file name, permissions, access dates, and a pointer to a disk block.
A file is represented by an inode (short for index node) in Unix. The inode contains the user and group identifiers of the file, the times of last file modification and access, a count of the number of hard links, and the type of the file. In addition, the inode contains 15 pointers to the disk blocks containing the data contents of the file: most point directly to data blocks, and the rest point to single indirect, double indirect and triple indirect blocks. File structures are inherited by the child process after a fork, so several processes may share the same offset location for a file.
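The direct/indirect pointer scheme above can be sketched as a classifier that decides, for a file's logical block number, which level of the inode's pointer hierarchy holds it. This is a sketch, not real Unix source; the constant values (12 direct pointers, 1024 pointers per indirect block) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: 12 direct pointers, then one single-, one double-,
// and one triple-indirect pointer. PTRS_PER_BLOCK is an assumed value;
// real file systems derive it from block size / pointer size.
constexpr uint64_t DIRECT = 12;
constexpr uint64_t PTRS_PER_BLOCK = 1024;   // e.g. 4096-byte block / 4-byte pointer

enum class Level { Direct, Single, Double, Triple, OutOfRange };

// Which level of the pointer hierarchy covers logical block number `block`?
Level classify(uint64_t block)
{
    if (block < DIRECT) return Level::Direct;
    block -= DIRECT;
    if (block < PTRS_PER_BLOCK) return Level::Single;
    block -= PTRS_PER_BLOCK;
    if (block < PTRS_PER_BLOCK * PTRS_PER_BLOCK) return Level::Double;
    block -= PTRS_PER_BLOCK * PTRS_PER_BLOCK;
    if (block < PTRS_PER_BLOCK * PTRS_PER_BLOCK * PTRS_PER_BLOCK) return Level::Triple;
    return Level::OutOfRange;
}
```

Note how each level multiplies the reachable file size by PTRS_PER_BLOCK, which is why a handful of indirect pointers suffices for very large files.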

Saturday, May 3, 2014

Some notes on the book on C++.
A user-defined allocator can have chunks of memory to allocate from and deallocate to. The class called Pool has methods for alloc and free. Its destructor deletes all the chunks. The grow method allocates a new chunk and organizes it as a linked list of elements.
auto_ptr supports the resource-acquisition-is-initialization (RAII) technique. An auto_ptr is initialized with a pointer and can be dereferenced the way a pointer can. The object pointed to is implicitly deleted at the end of the auto_ptr's scope.
auto_ptr<Shape> p (new Rectangle(p1, p2));
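A minimal sketch of the Pool allocator described above, assuming a fixed element size per pool and an 8 KB chunk size (both arbitrary choices for illustration; elements are assumed to need no stricter alignment than the chunk start provides):

```cpp
#include <cstddef>

// Fixed-size pool: memory is grabbed in chunks, each chunk is carved
// into elements, and freed elements go back onto an intrusive free list.
class Pool {
    struct Link { Link* next; };
    struct Chunk {
        static constexpr size_t size = 8 * 1024;   // assumed chunk size
        char mem[size];
        Chunk* next;
    };
    Chunk* chunks = nullptr;
    Link* head = nullptr;        // free list of available elements
    const size_t esize;          // element size

    void grow()                  // allocate a new chunk, chain its elements
    {
        Chunk* c = new Chunk;
        c->next = chunks;
        chunks = c;
        const size_t n = Chunk::size / esize;
        for (size_t i = 0; i < n; ++i) {
            Link* p = reinterpret_cast<Link*>(c->mem + i * esize);
            p->next = head;
            head = p;
        }
    }
public:
    explicit Pool(size_t n) : esize(n < sizeof(Link) ? sizeof(Link) : n) {}
    ~Pool() {                    // the destructor deletes all the chunks
        while (chunks) { Chunk* c = chunks; chunks = chunks->next; delete c; }
    }
    void* alloc()  { if (!head) grow(); Link* p = head; head = p->next; return p; }
    void free(void* b) { Link* p = static_cast<Link*>(b); p->next = head; head = p; }
};
```

Because free() just pushes the element back on the free list, alloc()/free() are both O(1) and a just-freed element is handed out again first.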
Copy constructor and assignment operator can be kept private for the most part:
class X {
private:
    X(const X&);               // copy constructor
    X& operator=(const X&);    // copy assignment
public:
    X(SomeType);
    ~X();
};

class X {
    // members, with implicit this pointer
    X* operator&();       // unary address-of
    X operator&(X);       // binary bitwise AND (one explicit operand)
    X operator++(int);    // postfix increment
    X operator/(X);       // binary division
};

Template specializations: lookup prefers the most specialized version whose argument list matches.
The order from general to most specialized is as follows:
template <class T> class Vector; // general
template <class T> class Vector<T*>; // partial specialization for any pointer
template <> class Vector<void*>; // complete specialization for void*
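The selection order can be demonstrated with a compilable toy version of the declarations above (Vector here is a stand-in class with a kind() tag, not the standard vector):

```cpp
#include <string>

// The most specialized version that matches the argument is chosen.
template <class T> struct Vector {            // general
    static std::string kind() { return "general"; }
};
template <class T> struct Vector<T*> {        // partial: any pointer type
    static std::string kind() { return "pointer"; }
};
template <> struct Vector<void*> {            // complete: void* only
    static std::string kind() { return "void*"; }
};
```

In practice this pattern lets all pointer instantiations share one implementation (often layered over Vector<void*>) to reduce code bloat.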

Function templates
template<class T> void sort(vector<T> & v);

Sequences and containers
list<int>::iterator p = find(li.begin(), li.end(), 42);

Standard containers :
<vector> single-dimensional array : methods size(), operator[], at(), front(), back()
<list> doubly linked list : iterator, reverse_iterator; methods insert(), erase(), clear()
<deque> double-ended queue : methods push_back(), push_front(), pop_back(), pop_front()
<queue> FIFO adapter : methods empty(), front(), back(), push(), pop()
<stack> LIFO adapter : methods push(), pop(), top() (implemented via push_back()/pop_back() on the underlying container)
<map> associative array of T
<set> set of T
<bitset> array of booleans
size(), capacity() and reserve() - the latter two are for vectors
push_back(), pop_back(), push_front() and pop_front()
priority_queue : methods push(), pop(), top()
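A short compilable illustration of the container methods listed above:

```cpp
#include <cassert>
#include <vector>
#include <deque>
#include <stack>

int container_demo()
{
    std::vector<int> v;
    v.reserve(10);                 // reserve() raises capacity, not size
    v.push_back(1);
    v.push_back(2);
    assert(v.size() == 2 && v.capacity() >= 10);
    assert(v.front() == 1 && v.back() == 2 && v.at(1) == 2);

    std::deque<int> d;             // deque supports pushes at both ends
    d.push_back(3);
    d.push_front(2);
    assert(d.front() == 2 && d.back() == 3);

    std::stack<int> s;             // stack exposes push()/pop()/top()
    s.push(7);
    s.push(8);
    s.pop();
    return s.top();
}
```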

Hashing
// a simple byte-wise hash; only valid for trivially copyable key types,
// since it reads the raw object representation
template<class T> struct Hash {
    size_t operator()(const T& key) const
    {
        size_t res = 0;
        size_t len = sizeof(T);
        const char* p = reinterpret_cast<const char*>(&key);
        while (len--) res = (res << 1) ^ *p++;
        return res;
    }
};

We now look at streams and strings.
Streams include fstream and stringstream, derived from iostream; iostream derives from istream and ostream. Stream methods include open and close.
string methods include compare, insert, append, concatenation (+), find, replace, and substr(). size and capacity behave as they do for vector.

This weekend I'm going to cover some basics in computer programming. They will be from the Algorithms and Data Structures book.

Merge-Sort(A,p,r)
if (p < r)
  then q <- floor((p + r) / 2)
         Merge-Sort(A, p, q)
         Merge-Sort(A, q+1, r)
         Merge(A, p, q, r)

Quick-Sort(A,p,r)
if p < r
    then q <- Partition(A, p, r)
            Quick-sort(A, p,  q - 1)
            Quick-sort(A, q+1, r)

Partition (A, p, r)
x <- A[r]
i <- p - 1
for j <- p to r - 1
    do if A[j] <= x
         then i <- i + 1
                 exchange A[i] <-> A[j]
exchange A[i + 1] <-> A[r]
return i + 1
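The Partition pseudocode above is the Lomuto scheme: the last element is the pivot and i tracks the boundary of the "<= pivot" region. A direct C++ translation:

```cpp
#include <utility>
#include <vector>

int partition(std::vector<int>& a, int p, int r)
{
    int x = a[r];                  // pivot is the last element
    int i = p - 1;
    for (int j = p; j < r; ++j)
        if (a[j] <= x)
            std::swap(a[++i], a[j]);   // grow the "<= pivot" region
    std::swap(a[i + 1], a[r]);         // put the pivot in its final spot
    return i + 1;
}

void quick_sort(std::vector<int>& a, int p, int r)
{
    if (p < r) {
        int q = partition(a, p, r);
        quick_sort(a, p, q - 1);
        quick_sort(a, q + 1, r);
    }
}
```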

Bubble-Sort(A)
for i <- 1 to length[A]
    do for j <- length[A] downto i + 1
          do if A[j] < A[j - 1]
               then exchange A[j] <-> A[j - 1]

Tree-Insert(T, z)
y <- NIL
x <- root[T]
while x != NIL
     do y <- x
         if key[z] < key[x]
            then x <- left[x]
            else x <- right[x]
p[z] <- y
if y = NIL
   then root[T] <- z
   else if key[z] < key[y]
          then left[y] <- z
          else right[y] <- z

Tree-Delete(T,z)
// Case 1: z has no children
// Case 2: z has one child => splice out
// Case 3: z has two children => substitute z with the successor, which has no left child
if left[z] = NIL or right[z] = NIL
     then y <- z
     else y <- Tree-Successor(z)
if left[y] != NIL
     then x <- left[y]
     else x <- right[y]
if x != NIL
    then p[x] <- p[y]
if p[y] = NIL
    then root[T] <- x
    else if y = left[p[y]]
           then left[p[y]] <- x
           else right[p[y]] <- x
if y != z
    then key[z] <- key[y]
            copy y's satellite data into z
return y
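Tree-Insert above can be translated into compilable C++ using the same Node shape (left/right/parent pointers) as the successor and predecessor routines earlier in this post; the struct definition here is a self-contained stand-in:

```cpp
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
    explicit Node(int k) : key(k) {}
};

void tree_insert(Node*& root, Node* z)
{
    Node* y = nullptr;
    Node* x = root;
    while (x) {                       // walk down to the insertion point,
        y = x;                        // remembering the parent in y
        x = (z->key < x->key) ? x->left : x->right;
    }
    z->parent = y;
    if (!y)                           // tree was empty
        root = z;
    else if (z->key < y->key)
        y->left = z;
    else
        y->right = z;
}

Node* tree_minimum(Node* x)           // leftmost node holds the minimum key
{
    while (x && x->left) x = x->left;
    return x;
}
```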
       

Friday, May 2, 2014

In this post, we return to designing a scalable queue alerts module. We describe a technique that involves getting all the messages from the queues and then filtering them for the actions involved. In this case, we add metadata to the messages, particularly the queue to which each belongs. When queues get added or deleted, we simply start or stop seeing messages from those queues, so this does not affect the message processor in the alerts module. However, the rules may or may not be in sync with the queues. To change the rules, the alerts module must be reloaded with the new set of rules, so it provides an interface to reconfigure them. These rules could still be invalid: if the queues mentioned in the rules don't exist, or if errors are encountered during the evaluation of the rules, the result is a no-operation. The messages are evaluated only against the rules that are valid. For matching messages the actions are invoked by workers from a pool, which helps with message processing so that messages are not queued behind a single worker. The rules could also specify the action in a way that it serves as an alert.
The implementation has two layers: the message manager and the message worker. The worker receives the incoming messages and evaluates each against a rule to determine the action to be taken. The message manager does bookkeeping of the actions and the messages for a particular application. The manager can work with workers across applications and rules, and can also start and stop a worker. By separating the management of the messages, queues, rules configuration and actions, we have a separation of concerns, and the layer above can assume that the messages are all read and acted on.
I will describe the API next that outlines the interaction between the alerts module and the user.
When discussing alerts and triggers in our previous post, one application of triggers comes to mind, and that is replication. We talk about different replication techniques, including trigger-based replication, in this post.
With replication one can hope to have a slightly out-of-date "warm standby" as a substitute for the main system. There are several hardware and software techniques. We enumerate just a few.
Physical replication : The simplest scheme is to physically duplicate the entire set of data every replication period. This scheme does not scale to large data sets because of the expensive bandwidth for shipping the data and the cost of reinstalling it at the remote site.
Because of these performance costs, this scheme is used only by clients at the low end. Moreover, guaranteeing a transaction-consistent snapshot is tricky. This can be avoided with the next technique.
Trigger-based replication : In this scheme, triggers are placed on the data-set containers so that any insert, update or delete operation produces a 'difference' record, which is installed in a special replication table. This replication table is shipped to the remote site and the modifications are replayed there. Since the triggers fire within transactions, we know that the differences we get are transaction-consistent. However, this technique carries a higher performance penalty.
Log-based replication: This scheme is the replication solution of choice when feasible. Here a log-sniffer process intercepts the log writes and delivers them to the remote system. This is implemented using two broad techniques:
1) read the log and prepare the data changes to be replayed on the target system, or
2) read the logs and ship them to the target system.
The target system continually replays log records as they arrive.
Often both mechanisms are used. This scheme overcomes the problems of the previous alternatives.

Thursday, May 1, 2014

Today we discuss some more on the queue alerts. We look into polling as a way to read the messages in a non-invasive manner, instead of instrumenting the queue processor for events. In polling, we continuously seek to read the queues: we read from all available queues, and the moment we have read from a queue we post another read. There may be multiplexing involved, and we may have several workers participating in the I/O. When we have the messages, we evaluate them against the rules. Reading from the queues in this manner is akin to promiscuous-mode listening on a network.
We now evaluate whether we need to trigger alerts, or whether we can proceed to evaluating the rules on the messages. When we raise alerts on the messages received, they do not correspond to any of the events on the queue processor. The queue processor can tell us when a message arrived, when it was processed and when it left the queue; it can also tell us about any exceptions, retries, or the destination the message is sent to. This information is not available to the poller. Moreover, the poller can get messages in any order. Only with the help of timestamps can the poller make any guess at the sequence of messages processed by the processors. Two or more messages may be executed simultaneously by different processors, in which case they have the same timestamp; these are serialized into the incoming message queue by the processors. Since the poller cannot infer any events other than the arrival time of the messages, unless there is some metadata associated with the message, it is only able to generate the arrival event, and even that is of little consequence since the application can already look at the timestamp on the messages. If the poller were to execute the actions specified against the conditions in the rules, it wouldn't need to raise events for itself. But let us look at what metadata could potentially be useful. The queue from which the message was read, and information about that queue, is a useful addition. With it we can trace the path of a message as it moves across queues, and we can easily filter out the messages that arrived on a particular queue. Thus a little more information collected by the poller, such as the id of the queue from which the message was read, is very helpful for evaluating the rules the user specifies to the queue alerts module.
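The polling scheme above can be sketched single-threaded: drain every queue, tagging each message with the id of the queue it came from so the rules can filter and trace by queue. The types and names here (Message, TaggedMessage, poll_all) are illustrative stand-ins, not a real queueing API.

```cpp
#include <queue>
#include <string>
#include <vector>

struct Message { std::string body; };
struct TaggedMessage { int queue_id; std::string body; };   // metadata: source queue

// Read everything currently available from all queues; the caller would
// then evaluate rules on the returned batch and post further reads.
std::vector<TaggedMessage> poll_all(std::vector<std::queue<Message>>& queues)
{
    std::vector<TaggedMessage> batch;
    for (size_t q = 0; q < queues.size(); ++q) {
        while (!queues[q].empty()) {               // drain this queue
            batch.push_back({static_cast<int>(q), queues[q].front().body});
            queues[q].pop();
        }
    }
    return batch;
}
```

Because the queue id travels with each message, rule evaluation can filter by source queue without the poller raising any events of its own.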