Cluster computing

Friday, February 15, 2013

Dynamics CRM Q & A

Role based security is different from object based security. An access right is a right granted to a user of an object. A privilege is a right granted to user on a class of objects. Access right applies only after privileges have taken effect. Policies and roles makes it easy to grant privileges. Role based security uses authentication and authorization mechanism of the application server.

Find performs a search on an attribute for which it is defined. Advanced Find performs search on the conditions and the attributes that user may choose one or more.

CRM discovery service is used for ORG related activities while CRM metadata service is used for CRM record activities.

A plugin is different from a workflow. A plugin is a logic that can be executed repeatedly for different events or entities, even when offline, with or without impersonation and usually runs quickly. A workflow is governed by an end-user, can have sub-processes triggered and can take a long time.

MSCRM_Config and MSCRM_orgname databases have been out of the box on MSCRM install.

The main challenge with CRM is to keep the focus and center on the customer record while scaling or using better top level tools and custom dashboards.

What is the typical cost of using CRM ? It varies between $20 to $350 per month per user.

What should CRM do ? CRM should help monitor, deliver and measure my effectiveness.

What do you see as new and emerging in CRM ? Mobile applications supporting CRM is a new and emerging field.

Do you have a few key best practices someone considering CRM can use ?
Some suggestions are: First consider your future needs Second take the opportunity to clean up your data now and Third communicate throughout the process to get early buy-in.

What advantage might CRM have for specific veriticals ? If you are a services company with long sales cycle, project terms, and frequent interactions and touch points, CRM will be exponentially more valuable to you

Reference: Q & A slideshare.net/warsha.agarwal

Thursday, February 14, 2013

PHP, Heroku, MongoDB

MongoDB is a key value store. Its built for high-performance and is schemafree. It is also a document oriented database and a relational store. Hence it supports NOSQL (Not Only SQL). It's used by GitHub and sourceforge. You can try it yourself at http://try.mongodb.org/ And you can use it in most languages such as C, C++, C#, php, python, node.js, HASKELL, SCALA, Java etc. It uses the terms similar to other databases. you type show dbs, use admin, show collections etc. A collection is a table. Each document is a row. You can type
db.accounts.find().forEach(function(doc) { print (tojson(doc));});
You can display each document in JSON format this way. The database is also easy to understand with similar functionality.
Query
You use dynamic queries. You can insert JSON like
db.colors.insert({name:'red', primary:'true'})
> var cursor = db.colors.find()
> cursors.next()
and of course SELECT * from colors where name = 'green'
or db.colors.find({primary: true}).sort({name: 1}).limit(1)
db.people.find({age: {lte:27}})
Operators
You can also use $gt, $lt, $ne, $in, $nin, $mod, $all, $size, $exists, $type, $elemMatch, $not and $where.
Index
You can also use Indexes. db.colors.ensureIndex({name: 1}) for single ascending or -1 for single descending order or {unique:true} for unique and {background:true} for non-blocking index in background
You can also mix with db.colors.ensureIndex({name: 1}, created_at: -1})
Aggregation:
You can use db.colors.distinct('name')
You can also use items group
db.items.group({key:{template:true}, initial:{count:0}, reduce:function(obj, prev) { prev.count += 1}});
Data Types:
Array, Binary, Boolean, DateTime, DB reference, Embedded object, Integer, Null, ObjectId, RegExp, String, Symbol, Timestamp
Relationships are used similarly normalized, embedded ( Embedding is pre-joining) etc. Embed when documents appears with parent. 4MB document size limit.
And there's support via e-mail, IRC, Web, book, conference and training.
reference : slides from jnunemaker

PHP:
web programming language, runs on the web server, It's free, cross-platform interpreted language and is targeted towards web programming. The PHP interpreter runs the command between php tags, string syntax called a here document, displaying information about a database. PHP form runs the command between <?php? > tags and processes form data via $_POST.
PHP worsks on pieces of text called strings. Strings can be of three types - enclosed in single quotes, enclosed in double quotes and a here document. Strings can even contain binary files such as image or sound. Strings can support variable interpolation i.e. the variables in double quotes will be substituted with their value. Concatenation is called implode and splitting is called explode, randomly shuffle with str_shuffle(). Perl like variables beginning with $.
Pass by reference and value are both allowed.

Wednesday, February 13, 2013

Interview Preparation

Insertion of a binary tree involves the following steps:

If the root is null, make this the root.

If the root is a single node with no children and it’s value is greater than the input, insert the input as the left child of the root. Otherwise insert as the right child.

If the root has any siblings, traverse right if the root is smaller than the input, otherwise traverse left. In each case, append the tree with the input as a leaf. The leaf attaches to the sentinel node just as in the case of the root above.

Deletion from a binary tree involves the following steps:

Deletion is tricky because the delete of the node requires several of the following nodes to be written and this is because splicing the node may be required. The case of deleting the nodes is trivial because the left child leaf or the right child leaf can be deleted without affecting the rest of the tree.
There are three cases for the node to be deleted. These cases depends on how many siblings the node has for it to be deleted.
1) the first case is when there are no children , we just remove it.
2) the second case is when there is only one child, we splice out the node.
3) the third case is when there are two children, we splice out the successor which has at most one child and replace z with y.

So the steps are as follows:

Walk down the tree to find the node to be deleted. Traverse right if the node to be deleted has a value greater than the root. Arrive at the node to be deleted or null if the node is not found .

If the node to be deleted is the root and the root has two siblings, the tree successor has to be found, spliced and made the root. Otherwise if there is only one sibling, then that sibling is chosen to replace the node. We can call the node to be deleted as z and the one to be spliced as y. y = z when there is only one sibling.

If y has a left sibling, then we will need to update its parent and hence we will call it x.
Otherwise we will choose x as the right sibling.
If x is not null, we update it's parent to point to y's parent.
If y's parent is null, then y is the root, hence x has to be made the root.
Otherwise. if y was the left sibling of its parent then that parent's left sibling should now point to x, otherwise we choose the right sibling.

if y != z, then we transfer y's data to z.

Red Black Tree

The Red black tree must satisfy the following properties:
1. Every node is either red or black.
2. The root is black
3. Every leaf is black
4. If a node is red, then its children are black
5. For each node, all paths to the node contain the same number of black nodes.

Tree Insertions or deletions require rotations.

1. Tree Insertion is very much like the tree insert procedure but involves recoloring and rotation at the end to satisfy the red-black properties. There are three cases to be considered depending on whether the uncle (y) of the node just inserted (z) is red or black and z is a right or left child.
To enumerate these cases:
Case 1:  z's uncle y is red : Action - recolor the parent and the uncle, z moved up
Case 2: z's uncle y is black and z is a right child : Action - perform a left rotation
Case 3: z's uncle y is black and z is a left child : Action - perfrom a right rotation
Color of root and leafs to be checked at the end of the procedure :

2. Tree deletions is very much like the tree delete procedure also requires recoloring and rotations. There are four cases to be considered and these are the dependent on whether the node to be fixed (x) has a red or black sibling and whether that node's sibling (w) has red or black children.
To enumerate these cases:
Case 1: x's sibling w is red : Action - exchange the color of the parent and w and perform a left rotation
Case 2: x's sibling w is black, and both of w's children are black : Action - color the sibling to red and move x up the tree
Case 3: x's sibling w is black, w's left child is red, and w's right child is black : Action - exchange the color of w and its left child and perform a right rotation.
Case 4: x's sibling w is black and w's right child is red : Action - remove extra black and perform left rotation.

Graph : This can have nodes similar to a n-ary tree. Graph can also be represented by edges and vertices.
Breadth First Traveral:
Use a queue and insert a node. When the node is removed, it siblings are inserted. Each pop operation results in a step of the breadth first traversal. This continues until the queue is empty.

Depth first traversal is easy to write with a recursive function. Check the parameter, define the terminal condition, recurse down the adjacent nodes, one by one.

Djikstra's algorithm is useful to find a single-source shortest path on a weighted, directed graph. You pick the minimum vertex among the adjacent and then relax the edges for each vertex adjacent to the pick.

B-tree is also popular. Each node has the number of keys currently stored, the node n[x] keys themselves, stored in non-decreasing order and a boolean value to say if it is a leaf or an internal node.

Techniques to use various data structures

Divide and conquer. This performs faster by collapsing linear searches to log n search. Then combining the solutions to sub-problems. Its great for sub problems that are independent

Dynamic Programming: This also combines the solutions to sub-problems except that the sub-problems overlap.

Greedy Algorithms : This makes the choice that looks best at the moment. It will always make the locally optimal choice.
There's top down as well as bottom up approach where the problems are decomposed top down and solutions are built bottom up.

Some examples:

1. Longest Common Subsequence :
Step 1 : Characterize a LCS with the following criteria: say Z is an LCS of X and Y
1. if Xm == Yn, then Zk = Xm = Yn and Zk-1 is an LCS of X and Y
2. If Xm != Yn, then Zk != Xm => Z is an LCS of Xm-1 and Y
3. If Xm != Yn, then Zk != Yn => Z is an LCS of X and Yn-1

Step 2: Define the recursion
C[i, j] =    0                                         if i = 0, or j = 0
           =   c[i -1, j - 1] + 1                   if i, j > 0 and xi = yi
           =   max(c[i, j - 1], c[i -1, j ])    if i, j > 0 and xi != yj

Step 3: Computing the length of an LCS
LCS-Length(X, Y) : Use the step 2 and set up a recursion to compute the length.

2. String pattern matching :
Finite state Automata can be used for the following:
Define a finite automaton matcher with right going edges forming the spine of the automaton and numbering as many as the pattern specified. A directed edge shows some state transition. Pattern match length and offset are both captured
In C# language provisions for performing a regex, we have some good design and implementation.
Here it is for a quick recap.

Regex r = new Regex( pattern, options);
int[] gnums = r.GetGroupNumber();
Match m = r.Match(text);
while (m.Success)
{
for ( int i = 0; i < gnums.Length; i++)
{
Group g = m.Groups[gnums[i]];
CaptureCollection cc = g.Captures;
for ( int j = 0; j < cc.Count; j++)
{
     capture c = cc[j];
     Console.WriteLine( "\t\tCapture{0} = [{1}] Index = {2} and Length = {3}", j, c, c.Index, c.Lenght);
}
}
m = m.NextMatch();
}

Moving to the language based interview questions on C#, here is a quick recap of some questions asked in interviews:

Some questions :
1. [TCS] Desribe the garbage collector in .Net. It's compact generational markdown compactor.
2. [TCS] What is reflection ? How's it used ? Suggested answer : Reflection is a way to manipulate the types based on their metadata. Some uses of reflection are late binding, serialization , remoting, attributes etc.
3. [Starbucks] How do you wrap your MVC methods for error handling. Answer [HandleError] attribute
4. [Starbucks] How do you define your serialization attributes. Why is order useful?
5. [Starbucks] How do you use WCF with SOAP ? What additional items do you specify in the web.config ?
6. [T-Mobile] What attributes are used to define REST APIs in Controller ? HttpGet, HttpPost
7. [Infosys] With use of a single sign on software such as siteminder across various internal applications, a user session is lost between navigations. How do you diagnose and fix the problem ? This could be an issue with javascript, services, siteminder or network.
8. [Infosys] How do you prioritze, keep the team moving, meet the deadlines, and deliver with high quality
9. [Ontela] How would you design a file system ? What are the layers and interface.

There is a lot of considerations for the system programming questions.  For example, File System design has to consider organizing files and block level storage, locking, logging, file access security and remoting The layers involve File Organization, File Management. file allocation strategy, file buffering and disk manangement interface.
File allocation strategies have to consider file allocation in large blocks or small file system devices. File Allocation strategies could consider storage in a linked list of alarge blocks, consecutive storage in a single blocks, and storage in a bunch of smal blocks tracked by an index. If random access is important, the linked list might be a poor choice. If fast reads are important, ascertaining the physical location of data would be in constant time for consecutive or indexed storage. If internal and external fragmentation is to be considered, then allocation size will need to be adjusted. If free space needs to be tracked, linked list or bit vector could be used. Operating System like Linux make files systems to be extended or added as plugins with their implementation of Virtual File System.
Virtual File System is an indirection layer which handles all the file system calls and calls the necessary file system functions in the physical file system code to do the I/O. The physical file system may employ one or more buffer cache to interact with device drivers and in turn with device controllers.
File Systems can be mounted and this populates the file descriptors which contains several kinds of information. Information that are common to every file system type.

Talking about files, let's take a look at newer dbs now.

libraries for data mining

Some of the data mining packages available on googlecode are nifty utilities to mine 2D data. They come with various alogorithms and technique implementations. This can directly be used with different kinds of data. One such logic involves computing cluster centroids and points, then finding euler distance between the points and the centroids. Euler distance is the square root of the sum of squares of the x and y offsets from the centroid. In the case of finding keywords in a text, wordnet similarity distance between the selected candidates could be considered and a 2D array (points, clusters) of similarity populated each ranging between 0 and 1. We could iterate through all these points to compute sum of euler distance for a point from all of the clusters. Also, for the same point we could compute the inverse of the fraction of the distance to each cluster to the sum just computed. Then we could normalize the matrix to tag each data point to different clusters. Once we know the membership of each words to different clusters, we could populate index with words that maximize cluster membership.

Monday, February 11, 2013

Glossary

Glossary of terms in computing

Automation : This refers to the set of tasks that have been specified in a way that they can run without manual intervention. This comes in very useful in test setup, execution, tear down, results store, and reports making. Frequently such tasks are delegated to a test harness.

Balancing : This refers to sharing the load between one or more processors or units of control such that there is a distribution of work resulting in more concurrency and faster execution as well as fault tolerance.

Cloud Computing : This refers to the use of one or more virtual machines in a hosted environment where the provider of VM and its network, the blades and hardware are taking away the onus of maintaining the application on a proprietary system while assuring high availability and geographic reachability.

Degree of Parallelism: This refers to the number of concurrent unit of contol invoked in the execution of a task usually by a partitioning of data.

Execution This refers to the actions needed to complete a task. The actions are usually outlined ahead together with the context and data with which to operate on and in a tightly controlled environment.

Flow diagrams : This refers to either the control or the data flows in a system where the system interaction and vulnerabilities can be studied by seeing how the data changes and with what controls as it goes through the system. This is one way to white box a system in that its internal workings are made clear with examples of data and control. Such diagrams are useful for threat analysis.

GigaByte: This is a unit of storage equal to 10 power 9 bytes and represents a way to describe the size of say files or databases.

Hashing : This is a technique to make lookups faster by keeping track of where insertions were made for entries in a collection so that they can be retrieved with a fixed cost. Typically, a range of hash values is used and a hash is computed on the object given and then put into the category. The hashing is very effective when a good hash function is used otherwise chaining and linear search may result.

Instruction : This is a specific task specified to the processor and can be considered ‘bare metal’ or ‘native’ in that it is independent of any programming languages and very much tied to the hardware for which it is specified.

Jump : This is a very specific instruction given to the processor to move from the next instruction in a sequence to something outside the sequence.

Keys, public and private : This refers to cryptographic key pairs usually used for encryption such that only the sender and receiver can interpret the encrypted traffic. The private key is also called the secret and is used for decrypting while the public key can be shared.

Locking: This refers to the locking of a set of instructions or resources such that they provide mutual exclusion so as to prevent clobbering by interfering reads or writes. Interestingly, Lock free implies the guarantee of atomic reads and writes without conflicts by tightly controlling the sequence in which they appear but without the use of locks.

Monitoring: This is ensuring continuous availability of say a web service by frequently checking for some heartbeat events or directly calling the service periodically.

Nightly run: This refers to running large analytical workloads such as OLAP on system when there is less chance for conflicts with update statements.

Proxy: This is a software entity that sits between a sender and receiver and usually passes the traffic . Proxy can be either transparent or opaque and are commonly used with http and https.

Query: This is the language of retrieving data from a data store and can be both data definition as well as data manipulation in nature. They can be as granular as a batch or statement in T-SQL

Resource: A resource is known by its cost and is helpful with management. It can be used to refer to memory or cpu required fo r execution and can loosely be used for humans assigned to different tasks.

Scale out/up: A scale out is used to refer to additional servers added horizontally to say a cluster where as a scale up tries to increase the capacity of a single server in terms of memory or processors

Ticket: A ticket is a snapshot that captures enough information to identify an incident and for tracking. It can be associated with a user on whose behalf it is created and tracked as it is worked upon and closed.

User: A user to a computer system knows only the functionality and not the mechanics of the system. In other words, the user should not be expected to figure out whether a system failed to complete a specified action or what the remedial actions are. The user only sees the system as a tool for solving a processing.

Violation, access: Access violation is a bad address specified to the processor and most commonly is a null reference.

Zero: A zero is a state of a bit in which there is no signal and is a convenient test for the Boolean state false. It provides a comparison against a signal.

Friday, February 8, 2013

Abstract Data Types

Abstract Data Types (ADTs) can be used directly in the relational model. They don't change the algebra, just the expressions over attribute values. However deeper support for non-relational structured types is provided via XML with languages like XPath and XQuery. The following three approaches enumerate some of the handling of structured types like XML. The first is to build a custom database system that operates on data with structured types but this perhaps more laborious and intensive than supporting XML in relational model. The second approach is to treat the complex type as an ADT where say a column type of XML is used to store one XML document per row. All executions are handled outside the query optimizer. A third approach is for the DBMS to normalize the nested structure into a set of relations upon insertion and this is often called shredding. With shredding, the sub-objects are mapped with foreign keys and ordering information between elements of the same level is removed which improves performance.

Thursday, February 7, 2013

a new way to tag words

When you want to select keywords in a text, there are several techniques such as linguistic search etc available as mentioned earlier. Usually words are tagged as various parts of speech (POS) available from say WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric that is based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead I'm proposing to differentiate between frequently occuring not so common words in a text block.

The metric could be a simple Poisson Distribution where the formula is

and K is the number of occurrences of the word, k! is the factorial of k and

is a positive real number