Saturday, February 16, 2013

Object Oriented databases


Object oriented databases originated from well-known limitations of relational DBMSs, which were too simple for CAD/CAM, GIS and CASE applications: at the time there was only a limited set of datatypes and no user-defined types (UDTs). Among the other limitations, an RDBMS does not support the concepts of generalization and aggregation, and there is a mismatch between the database language and application languages. There was also no support for design transactions, versions and schema evolution. Hence the approach of extended relational databases (the object-relational model), with support for user-defined data types, stored procedures or rules, transitive closures, and generalization.

Fundamentally, an object oriented database implements a conceptual object data model with objects and their identifiers. This model is all about encapsulation, classes, inheritance, overloading and overriding, complex values and types. Therefore it differs from an entity relationship model, where relations and schemas are defined explicitly. The object model extends relationships with support for subclasses and instances, composition hierarchies, and checking out / checking in of objects. Integrity constraints are also extended in object databases with referential integrity constraints, uniqueness constraints and sibling or sub-class relationships. Support for composite objects, with references to other objects as exclusive or shared parts, is also available.

Similarly, the query language and optimization have been extended. Queries look very similar to those in programming languages, and expressions with All or Any quantifiers are available. The issues with query optimization in the object-relational model include user-defined methods (hard to optimize), class-hierarchy-based queries (requiring new strategies), multi-valued attributes (needing new cost models) and dual buffering (objects in the buffer and on the disk are processed separately).

Locking is at the granularity of objects (intention locks) on large complex objects. Transactions share objects in memory buffers and recovery needs to handle memory buffers.

An object database provides a new model closer to application development and requires refinements on the relational model. When objects are flattened, they lend themselves to storage in a relational model.

Friday, February 15, 2013

Spatial databases

A spatial database stores information about objects in space, which we can express in the form of geometry (such as points, lines and polygons) or geography. These data types are very different from the customary varchar, int, varbinary etc. database data types, and spatial data is also queried very differently. The Open Geospatial Consortium defines the following query types:
Spatial measurements
Spatial functions
Spatial predicates
Constructor functions
Observer functions
NoSQL systems like MongoDB and CouchDB support only a limited set.
Also, the indexes used for spatial data types are different. Due to the nature of spatial data, some patterns are very common in spatial queries, and regular indexes don't support them very well. For example, finding how far apart two points are, or whether a point lies within a spatial area of interest, requires algorithms and data structures different from B-Trees or bitmap indexes.
For example in Microsoft SQL Server, you can do the following:
CREATE TABLE SpatialTable
( id int IDENTITY (1,1),
GeomCol1 geometry,
GeomCol2 AS GeomCol1.STAsText() );
GO

INSERT into SpatialTable (GeomCol1)
VALUES (geometry::STGeomFromText('LINESTRING (20 180, 100 100, 180 180)', 0));

INSERT into SpatialTable(GeomCol1)
VALUES (geometry::STGeomFromText('POLYGON ((0 0, 150 0, 150 150, 0 150, 0 0))', 0));

DECLARE @geom1 geometry;
DECLARE @geom2 geometry;
DECLARE @result geometry;

SELECT @geom1 = GeomCol1 FROM SpatialTable Where id = 1;
SELECT @geom2 = GeomCol1 FROM SpatialTable Where id = 2;
SELECT @result = @geom1.STIntersection(@geom2);
SELECT @result.STAsText();

Incidentally, expressing spatial data types as .NET UDTs and performing spatial operations on them is practical, but not deep enough for the kinds of features and the sizes of data involved in, say, astronomical studies.
 

recap

C sharp data structures
Here are some common data structures in C#
ArrayList: This is a dynamically sized array of objects that implements the IList interface. It works by maintaining an internal array of objects that is replaced with a larger array when it reaches its capacity. It is very efficient at appending elements, since there is usually a free slot at the end, but inefficient at inserting in the middle. Searching can be efficient if the BinarySearch method can be used, but you must Sort the array first.
BitArray: This is a dynamically sized array of Boolean values. It is more memory efficient than a simple array of bools because it uses only one bit for each value, whereas a bool array uses a whole byte for each value.
Hashtable class: A Hashtable class is a standard dictionary (key/value) data structure that also uses a hashing algorithm to store and index values efficiently. The GetHashCode method is used for hashing. Any type used as a key in the hashtable should return a good hash of the object’s internal value.
Queue class: A Queue is a standard first in first out (FIFO) data structure providing simple operations to enqueue, dequeue, peek etc.
SortedList class: A SortedList is a standard dictionary data structure that uses a binary-chop search to index efficiently. SortedList implements IDictionary.
Stack class: A stack class is a standard Last-in first-out (LIFO) data structure.
 StringCollection: A StringCollection is a standard collection data structure for storing strings.
HashSet: A HashSet is a generic collection that contains no duplicate elements and where the elements are in no particular order. The capacity of the HashSet is the number of elements the object can hold.  
LinkedList: A LinkedList is a general purpose linked list. It supports enumerators and implements the ICollection interface, consistent with other collection classes in the .Net framework.
 SortedDictionary: A sorted dictionary is a binary search tree and is very similar to the SortedList but differs in execution speed and memory use.
ConcurrentQueue generic is used for thread safe queue operations and can be used safely by multiple threads. You could invoke multiple threads with the Parallel library i.e. Parallel.Invoke(action, action);
Similarly ConcurrentBag, ConcurrentDictionary  and ConcurrentStack are also available.
Partitioner and OrderablePartitioner provides a manner of partitioning data into multiple partitions.
A BlockingCollection provides blocking and bounding capabilities for thread-safe collections that implement IProducerConsumerCollection<T>.
BinaryTree and BinarySearchTree can be implemented with a generic Node<T> class:
public class Node<T>
{
    public Node<T> Left { get; set; }
    public Node<T> Right { get; set; }
    public T Data { get; set; }
}
and traversals can be written in a recursive way:
public void PreorderTraversal(Node<T> current)
{
    if (current != null)
    {
        Console.WriteLine(current.Data);
        PreorderTraversal(current.Left);
        PreorderTraversal(current.Right);
    }
}

Dynamics CRM Q & A

Role based security is different from object based security. An access right is a right granted to a user on an object. A privilege is a right granted to a user on a class of objects. Access rights apply only after privileges have taken effect. Policies and roles make it easy to grant privileges. Role based security uses the authentication and authorization mechanisms of the application server.

Find performs a search on an attribute for which it is defined. Advanced Find performs a search on one or more conditions and attributes that the user may choose.

CRM discovery service is used for ORG related activities while CRM metadata service is used for CRM record activities.

A plugin is different from a workflow. A plugin is a logic that can be executed repeatedly for different events or entities, even when offline, with or without impersonation and usually runs quickly. A workflow is governed by an end-user, can have sub-processes triggered and can take a long time.

The MSCRM_Config and MSCRM_orgname databases are created out of the box by the MSCRM install.

The main challenge with CRM is to keep the focus and center on the customer record while scaling or using better top level tools and custom dashboards.

What is the typical cost of using CRM ? It varies between $20 and $350 per user per month.

What should CRM do ? CRM should help monitor, deliver and measure my effectiveness.

What do you see as new and emerging in CRM ? Mobile applications supporting CRM is a new and emerging field.

Do you have a few key best practices someone considering CRM can use ?
Some suggestions are: first, consider your future needs; second, take the opportunity to clean up your data now; and third, communicate throughout the process to get early buy-in.

What advantage might CRM have for specific verticals ? If you are a services company with a long sales cycle, project terms, and frequent interactions and touch points, CRM will be exponentially more valuable to you.

Reference: Q & A slideshare.net/warsha.agarwal

Thursday, February 14, 2013

PHP, Heroku, MongoDB

MongoDB is a key-value store built for high performance, and it is schema-free. It is also a document-oriented database rather than a relational store; hence it falls under NoSQL (Not Only SQL). It's used by GitHub and SourceForge. You can try it yourself at http://try.mongodb.org/ and you can use it from most languages, such as C, C++, C#, PHP, Python, node.js, Haskell, Scala, Java etc. It uses terms similar to other databases: you type show dbs, use admin, show collections etc. A collection is like a table and each document is like a row. You can type
db.accounts.find().forEach(function(doc) { print (tojson(doc));});
You can display each document in JSON format this way.  The database is also easy to understand with similar functionality.
Query
You use dynamic queries. You can insert JSON like
db.colors.insert({name: 'red', primary: true})
> var cursor = db.colors.find()
> cursor.next()
and the SQL SELECT * from colors where name = 'green' corresponds to db.colors.find({name: 'green'})
or db.colors.find({primary: true}).sort({name: 1}).limit(1)
db.people.find({age: {$lte: 27}})
Operators
You can also use $gt, $lt, $ne, $in, $nin, $mod, $all, $size, $exists, $type, $elemMatch, $not and $where.
Index
You can also use indexes: db.colors.ensureIndex({name: 1}) for a single ascending index (or -1 for descending order), {unique: true} for a unique index, and {background: true} for a non-blocking index built in the background.
You can also combine fields: db.colors.ensureIndex({name: 1, created_at: -1})
Aggregation:
You can use db.colors.distinct('name')
You can also use items group
db.items.group({key:{template:true}, initial:{count:0}, reduce:function(obj, prev) { prev.count += 1}});
Data Types:
Array, Binary, Boolean, DateTime, DB reference, Embedded object, Integer, Null, ObjectId, RegExp, String, Symbol, Timestamp
Relationships are handled similarly, either normalized or embedded (embedding is pre-joining); embed when the document always appears with its parent. There is a 4MB document size limit (16MB in later releases).
And there's support via e-mail, IRC, Web, book, conference and training.
reference : slides from jnunemaker

PHP:
a web programming language that runs on the web server. It's a free, cross-platform, interpreted language targeted towards web programming. The PHP interpreter runs the commands between <?php ... ?> tags; a string syntax called a here document is handy for tasks like displaying information about a database, and a PHP form processes submitted data via $_POST.
PHP works on pieces of text called strings. Strings can be of three types: enclosed in single quotes, enclosed in double quotes, or a here document. Strings can even contain binary data such as an image or sound file. Strings support variable interpolation, i.e. variables in double quotes are substituted with their values. Joining an array into a string is called implode and splitting is called explode; you can randomly shuffle with str_shuffle(). Variables are Perl-like, beginning with $.
Pass by reference and pass by value are both allowed.

Wednesday, February 13, 2013

Interview Preparation


 

Insertion into a binary search tree involves the following steps:

If the root is null, make this the root.

If the root is a single node with no children and its value is greater than the input, insert the input as the left child of the root. Otherwise insert it as the right child.

If the root has any children, traverse right if the current node is smaller than the input, otherwise traverse left. In each case, append the input to the tree as a leaf; the leaf attaches to its parent just as in the case of the root above.
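The insertion steps above can be sketched in Python (a minimal sketch; the Node class and field names are illustrative, and duplicates are sent to the right):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def bst_insert(root, value):
    """Insert value into the tree rooted at root; return the (possibly new) root."""
    if root is None:
        return Node(value)                          # empty tree: the input becomes the root
    if value < root.value:
        root.left = bst_insert(root.left, value)    # current node greater than input: go left
    else:
        root.right = bst_insert(root.right, value)  # otherwise: go right
    return root
```

Inserting 8, 3, 10, 1, 6 in that order produces 8 at the root, 3 and 10 as its children, and 1 and 6 under 3.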

Deletion from a binary search tree involves the following steps:

Deletion is tricky because deleting a node may require several other nodes to be rewritten, since the node may need to be spliced out. Deleting a leaf is trivial, because a left or right child leaf can be removed without affecting the rest of the tree.
There are three cases for the node to be deleted, depending on how many children it has:
1) the first case is when there are no children: we just remove the node.
2) the second case is when there is only one child: we splice out the node.
3) the third case is when there are two children: we splice out the successor, which has at most one child, and replace the deleted node with it.
 
So the steps are as follows:
Walk down the tree to find the node to be deleted: traverse right if the node to be deleted has a value greater than the current node's, otherwise traverse left. Arrive at the node to be deleted, or at null if the node is not found.
If the node to be deleted is the root and the root has two children, the tree successor has to be found, spliced out and made the root. Otherwise, if there is at most one child, that child is chosen to replace the node. We call the node to be deleted z and the one to be spliced out y; y = z when z has at most one child.

If y has a left child, then we will need to update that child's parent, so we call the child x.
Otherwise we choose x as the right child.
If x is not null, we update its parent to point to y's parent.
If y's parent is null, then y is the root, hence x has to be made the root.
Otherwise, if y was the left child of its parent, then that parent's left pointer should now point to x; otherwise the right pointer.

If y != z, then we transfer y's data to z.
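The three deletion cases above can be sketched in Python (a minimal sketch; rather than splicing pointers, it copies the successor's value into the node, which is equivalent for this summary):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def bst_min(node):
    """Leftmost node, i.e. the minimum of the subtree."""
    while node.left is not None:
        node = node.left
    return node

def bst_delete(root, value):
    """Delete value from the tree rooted at root; return the new root."""
    if root is None:
        return None                                 # value not found
    if value < root.value:
        root.left = bst_delete(root.left, value)
    elif value > root.value:
        root.right = bst_delete(root.right, value)
    else:
        if root.left is None and root.right is None:
            return None                             # case 1: no children, just remove
        if root.left is None:
            return root.right                       # case 2: one child, splice out
        if root.right is None:
            return root.left
        succ = bst_min(root.right)                  # case 3: two children, take successor
        root.value = succ.value                     # transfer the successor's data
        root.right = bst_delete(root.right, succ.value)
    return root
```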

Red Black Tree

The Red black tree must satisfy the following properties:
1. Every node is either red or black.
2. The root is black
3. Every leaf is black
4. If a node is red, then its children are black
5. For each node, all simple paths from the node down to its descendant leaves contain the same number of black nodes.
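These properties can be checked mechanically. A minimal sketch (the RBNode class is illustrative; None stands in for the black sentinel leaves):

```python
class RBNode:
    def __init__(self, value, color, left=None, right=None):
        self.value, self.color = value, color
        self.left, self.right = left, right

def black_height(node):
    """Return the black-height of the subtree if it satisfies the
    red-black properties, otherwise raise ValueError."""
    if node is None:
        return 1                                    # null leaves count as black
    if node.color not in ('red', 'black'):
        raise ValueError('every node is either red or black')
    if node.color == 'red':
        for child in (node.left, node.right):
            if child is not None and child.color == 'red':
                raise ValueError('a red node must have black children')
    lh, rh = black_height(node.left), black_height(node.right)
    if lh != rh:
        raise ValueError('all paths must contain the same number of black nodes')
    return lh + (1 if node.color == 'black' else 0)
```

A black root with two red children passes with black-height 2; a red node with a red child raises.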

Tree Insertions or deletions require rotations.

1. Tree insertion is very much like the ordinary tree-insert procedure but involves recoloring and rotation at the end to satisfy the red-black properties. There are three cases to be considered, depending on whether the uncle (y) of the node just inserted (z) is red or black and whether z is a right or left child.
To enumerate these cases:
Case 1: z's uncle y is red : Action - recolor the parent, the uncle and the grandparent, and move z up to the grandparent
Case 2: z's uncle y is black and z is a right child : Action - perform a left rotation to reduce to case 3
Case 3: z's uncle y is black and z is a left child : Action - recolor the parent and grandparent and perform a right rotation
The root is recolored black at the end of the procedure.

2. Tree deletion is very much like the ordinary tree-delete procedure but also requires recoloring and rotations. There are four cases to be considered, depending on whether the sibling (w) of the node to be fixed up (x) is red or black and whether w's children are red or black.
To enumerate these cases:
Case 1: x's sibling w is red : Action - exchange the color of the parent and w and perform a left rotation
Case 2: x's sibling w is black, and both of w's children are black : Action - color the sibling to red and move x up the tree
Case 3: x's sibling w is black, w's left child is red, and w's right child is black : Action - exchange the color of w and its left child and perform a right rotation.
Case 4: x's sibling w is black and w's right child is red : Action - remove extra black and perform left rotation.

Graph: this can have nodes similar to an n-ary tree. A graph can also be represented by its edges and vertices.
Breadth First Traversal:
Use a queue and insert the starting node. When a node is removed, its unvisited neighbors are inserted. Each pop operation yields one step of the breadth first traversal. This continues until the queue is empty.

Depth first traversal is easy to write as a recursive function: check the parameter, define the terminal condition, and recurse down the adjacent nodes one by one.
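Both traversals can be sketched in Python over an adjacency-list graph (the dict representation and names are illustrative):

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal: pop from a queue, enqueue unvisited neighbors."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()           # each pop is one step of the traversal
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:  # insert the unvisited neighbors
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(graph, node, visited=None, order=None):
    """Depth-first traversal written recursively."""
    if visited is None:
        visited, order = set(), []
    if node in visited:                  # terminal condition
        return order
    visited.add(node)
    order.append(node)
    for neighbor in graph[node]:         # recurse down the adjacent nodes
        dfs(graph, neighbor, visited, order)
    return order
```

For the diamond graph a -> b, a -> c, b -> d, c -> d, BFS visits a, b, c, d while DFS visits a, b, d, c.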

Dijkstra's algorithm is useful to find single-source shortest paths on a weighted, directed graph. You pick the minimum-distance vertex among those reachable and then relax the edges to each vertex adjacent to the pick.
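A hedged sketch of that pick-and-relax loop using a binary heap (the graph encoding as a dict of (neighbor, weight) lists is an assumption):

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths on a weighted, directed graph.
    graph maps each vertex to a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]                 # (distance, vertex) priority queue
    while heap:
        d, u = heapq.heappop(heap)       # pick the minimum among the reachable
        if d > dist.get(u, float('inf')):
            continue                     # stale heap entry, skip it
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float('inf')):
                dist[v] = d + w          # relax the edge (u, v)
                heapq.heappush(heap, (dist[v], v))
    return dist
```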

The B-tree is also popular. Each node x stores the number of keys currently held, n[x]; the keys themselves, in non-decreasing order; and a boolean value that says whether x is a leaf or an internal node. An internal node also holds n[x] + 1 child pointers.
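A sketch of that node layout with the standard descending search (class and function names are illustrative):

```python
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []        # the n keys, kept in non-decreasing order
        self.children = []    # n + 1 child pointers when internal
        self.leaf = leaf      # whether this is a leaf or an internal node

def btree_search(node, key):
    """Return (node, index) holding key, or None if it is absent."""
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1                                      # find the first key >= key
    if i < len(node.keys) and node.keys[i] == key:
        return node, i
    if node.leaf:
        return None                                 # nowhere left to descend
    return btree_search(node.children[i], key)      # descend into child i
```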

Techniques to use various data structures

Divide and conquer: this performs faster by collapsing linear searches into log n searches and then combining the solutions to the sub-problems. It is great for sub-problems that are independent.

Dynamic Programming: This also combines the solutions to sub-problems except that the sub-problems overlap.

Greedy Algorithms: these make the choice that looks best at the moment, i.e. always the locally optimal choice.
There are top-down as well as bottom-up approaches, where problems are decomposed top-down and solutions are built bottom-up.


Some examples:

1. Longest Common Subsequence:
Step 1: characterize an LCS with the following criteria. Say Z (of length k) is an LCS of X (of length m) and Y (of length n):
1. if Xm == Yn, then Zk = Xm = Yn and Zk-1 is an LCS of Xm-1 and Yn-1
2. if Xm != Yn, then Zk != Xm implies Z is an LCS of Xm-1 and Y
3. if Xm != Yn, then Zk != Yn implies Z is an LCS of X and Yn-1

Step 2: Define the recursion
c[i, j]  =  0                              if i = 0 or j = 0
         =  c[i-1, j-1] + 1                if i, j > 0 and xi = yj
         =  max(c[i, j-1], c[i-1, j])      if i, j > 0 and xi != yj

Step 3: Computing the length of an LCS
LCS-Length(X, Y): use the recursion from step 2, computed bottom-up, to fill the table and read off the length.
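Step 3 can be sketched bottom-up directly from the recursion in step 2 (a minimal sketch):

```python
def lcs_length(X, Y):
    """Bottom-up LCS length using the step 2 recursion."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]        # c[i][j] = 0 when i = 0 or j = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1            # xi = yj
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])  # xi != yj
    return c[m][n]
```

For example, lcs_length('ABCBDAB', 'BDCABA') is 4 (one LCS is BCBA).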

2. String pattern matching :
Finite state Automata can be used for the following:
Define a finite automaton matcher with right-going edges forming the spine of the automaton, numbered to match the positions of the pattern. A directed edge shows a state transition. Pattern match length and offset are both captured.
The C# language provides regex support with a good design and implementation.
Here it is for  a quick recap.

Regex r = new Regex(pattern, options);
int[] gnums = r.GetGroupNumbers();
Match m = r.Match(text);
while (m.Success)
{
    for (int i = 0; i < gnums.Length; i++)
    {
        Group g = m.Groups[gnums[i]];
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("\t\tCapture{0} = [{1}] Index = {2} and Length = {3}", j, c, c.Index, c.Length);
        }
    }
    m = m.NextMatch();
}

Moving to the language based interview questions on C#, here is a quick recap of some questions asked in interviews:

Some questions :
1. [TCS] Describe the garbage collector in .NET. It's a compacting, generational, mark-and-sweep collector.
2. [TCS] What is reflection ? How's it used ?  Suggested answer : Reflection is a way to manipulate the types based on their metadata. Some uses of reflection are late binding, serialization , remoting, attributes etc. 
3. [Starbucks] How do you wrap your MVC methods for error handling. Answer [HandleError] attribute
4. [Starbucks] How do you define your serialization attributes. Why is order useful?
5. [Starbucks] How do you use WCF with SOAP ? What additional items do you specify in the web.config ?
6. [T-Mobile] What attributes are used to define REST APIs in Controller ? HttpGet, HttpPost
7. [Infosys] With use of a single sign on software  such as siteminder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with javascript, services, siteminder or network.
8. [Infosys] How do you prioritize, keep the team moving, meet deadlines, and deliver with high quality ?
9. [Ontela] How would you design a file system ? What are the layers and interfaces ?

There are a lot of considerations in the system programming questions. For example, file system design has to consider organizing files in block-level storage, locking, logging, file access security and remoting. The layers involve file organization, file management, the file allocation strategy, file buffering and the disk management interface.
File allocation strategies have to consider allocation in large blocks or on small file-system devices. They could store a file as a linked list of large blocks, consecutively in a single extent, or in a bunch of small blocks tracked by an index. If random access is important, the linked list might be a poor choice. If fast reads are important, ascertaining the physical location of data takes constant time for consecutive or indexed storage. If internal and external fragmentation is a concern, the allocation size will need to be adjusted. If free space needs to be tracked, a linked list or a bit vector could be used. Operating systems like Linux allow file systems to be extended or added as plugins through their implementation of the Virtual File System.
Virtual File System is an indirection layer which handles all the file system calls and calls the necessary file system functions in the physical file system code to do the I/O. The physical file system may employ one or more buffer cache to interact with device drivers and in turn with device controllers.
File systems can be mounted, and this populates the file descriptors, which contain several kinds of information, including information common to every file system type.

Talking about files, let's take a look at newer dbs now.

libraries for data mining

Some of the data mining packages available on Google Code are nifty utilities to mine 2D data. They come with implementations of various algorithms and techniques that can be used directly with different kinds of data. One such approach involves computing cluster centroids and points, then finding the Euclidean distance between the points and the centroids. The Euclidean distance is the square root of the sum of squares of the x and y offsets from the centroid. In the case of finding keywords in a text, the WordNet similarity distance between the selected candidates could be considered, and a 2D array (points x clusters) of similarities populated, each entry ranging between 0 and 1. We could iterate through all these points to compute the sum of the distances from a point to all of the clusters. Also, for the same point, we could compute the inverse of the fraction of the distance to each cluster over the sum just computed. Then we could normalize the matrix to assign each data point a membership in the different clusters. Once we know the membership of each word in the different clusters, we could populate the index with words that maximize cluster membership.
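A hedged sketch of the distance and membership computation described above (function names are illustrative, and the inverse-distance weighting is a simplified reading of it):

```python
import math

def euclidean(p, c):
    """Square root of the sum of squared x and y offsets from the centroid."""
    return math.sqrt((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)

def memberships(points, centroids):
    """For each point, turn inverse distances to the centroids into
    membership weights normalized to sum to 1."""
    rows = []
    for p in points:
        inv = [1.0 / max(euclidean(p, c), 1e-9) for c in centroids]  # clamp to avoid division by zero
        total = sum(inv)
        rows.append([w / total for w in inv])                        # normalize each row
    return rows
```

A point close to one centroid ends up with most of its membership weight on that cluster, and each row of the matrix sums to 1.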