Monday, October 26, 2015

This paper introduces a technique for evaluating the transformation rules that drive query optimization in a relational database server such as Microsoft SQL Server. It studies the usage of transformation rules by the query optimizer with the help of profiling tools. Profiling in this case means finding the set of rules that are needed for a particular query to obtain an optimal plan. This is referred to as an essential set of rules for that query. There may be more than one essential set for a query if there are different ways to generate an optimal plan, and consequently there may be some overlap of rules across these essential sets. A rule that is included in every essential set for a query is deemed a relevant transformation rule for that query. The paper empirically studies these metrics for a variety of workloads. The plan chosen by the optimizer and its cost are considered functions of the input query.
While this paper succeeds in articulating metrics that yield fewer transformation rules for the purpose of studying different workloads on the same query optimizer, it could have considered alternate metrics such as precision and recall. Precision in this case measures how many of the selected items are relevant: it is the ratio of the true positives to everything the query optimizer selected for this query. A true positive is a rule that improves the query execution plan; a false positive shows up in the plan but does not improve it. Recall, on the other hand, measures how many of the relevant items are selected: it is the ratio of the true positives to all the rules that would have improved the query execution plan, drawn from the global set of transformation rules including those the optimizer did not select. Together, precision and recall yield a further metric called the F-score, which gives the effectiveness of retrieval with respect to a given query, as sketched below.
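To make the proposed metric concrete, here is a minimal C# sketch (not from the paper) of how precision, recall, and the F-score could be computed for a single query, assuming two hypothetical inputs: the set of rules the optimizer selected and a ground-truth set of rules known to improve the plan.

using System.Collections.Generic;
using System.Linq;

static class RuleMetrics
{
    // selected: rules the optimizer applied for this query (hypothetical input)
    // relevant: rules known to improve the execution plan (hypothetical ground truth)
    public static double FScore(ISet<string> selected, ISet<string> relevant)
    {
        int truePositives = selected.Intersect(relevant).Count();

        // precision = TP / all selected: how many selected rules are relevant
        double precision = selected.Count == 0 ? 0.0 : (double)truePositives / selected.Count;

        // recall = TP / all relevant: how many relevant rules were selected
        double recall = relevant.Count == 0 ? 0.0 : (double)truePositives / relevant.Count;

        // F-score: harmonic mean of precision and recall
        return (precision + recall) == 0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }
}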
The advantages of the paper's metrics are not lost by adopting the proposed metric; on the contrary, it comes closer to the evaluation the paper set out to do. The paper suggests that fewer than twenty transformation rules fall under its “relevant” definition. The inclusion of such a rule in every essential set merely strengthens the case that it is “relevant”. This equates to identifying the common rules among the “true positives” of every essential set, but their inclusion alone is no guarantee that they are indeed improving the plan by bringing the cost down. At the same time, the demarcation of true versus false positives is much cleaner within what the optimizer has selected and is indeed relevant to the optimization of the plan. The essential set and the combination of true and false positives are similar in nature, although it can be argued that, without any comparison across optimization plans for the same query, the true and false positives give a better indicator within the same chosen set for an execution plan. Furthermore, while it can be argued that most if not all the transformation rules may be true positives for a highly effective query optimizer, the differentiation from another choice of query plan cannot be as well quantified by the paper's “relevant” definition alone as it can with precision and recall. Even the differentiation in terms of the false positives is brought to better light with the use of precision and recall.
#coding exercise
Check whether a given linked list represents a palindrome.
The challenge here is sequential access of elements for comparison, rather than index-based access.
Solution 1
Determine the length of the linked list in a first pass.
Traverse exactly half, then check that the latter half is the reverse of the first half, as sketched below.
Linear time complexity.
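A minimal C# sketch of Solution 1, assuming a simple hypothetical Node class for the singly linked list; the first half is stacked during the traversal and compared against the latter half:

using System.Collections.Generic;

class Node { public int Value; public Node Next; }

static bool IsPalindrome(Node head)
{
    // first pass: determine the length
    int length = 0;
    for (var cur = head; cur != null; cur = cur.Next) length++;

    // traverse exactly half, stacking the values seen
    var firstHalf = new Stack<int>();
    var node = head;
    for (int i = 0; i < length / 2; i++)
    {
        firstHalf.Push(node.Value);
        node = node.Next;
    }

    // skip the middle element when the length is odd
    if (length % 2 == 1) node = node.Next;

    // the latter half must be the reverse of the first half
    while (node != null)
    {
        if (firstHalf.Pop() != node.Value) return false;
        node = node.Next;
    }
    return true;
}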


Sunday, October 25, 2015


Mobile databases

Handheld and mobile devices are becoming ubiquitous and ever more powerful, and their software supports a large ecosystem of applications that vie for the limited resources on the device. As a textbook observes, some of the software problems involving data management, transaction management, and database recovery have their origins in distributed data systems, and in mobile computing these problems become more difficult to solve, mainly because of the nature of the bandwidth of wireless communication channels, the relatively short active life of the power supply, and the changing location of the required information (sometimes in the cache, sometimes in the air, sometimes at the server).

The mobile computing architecture is also slightly different: mobile units and base stations communicate through wireless channels whose bandwidths are significantly lower than those of a wired network. The mobile units operate within a geographic mobility domain that is divided into smaller domains called cells. The mobile discipline requires that the movement of mobile units be unrestricted within the geographic mobility domain, while maintaining contiguous information access during movement. The average duration of a user's stay in a cell, referred to as the residence latency, is a parameter that is computed and continuously adjusted.
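The textbook does not prescribe how the residence latency is adjusted; as a purely illustrative sketch, it could be maintained as an exponential moving average over completed visits (UpdateResidenceLatency and alpha are hypothetical names, not from the text):

// Updates the residence latency estimate for a cell after a completed visit.
// alpha near 1 tracks recent visits aggressively; near 0 it changes slowly.
static double UpdateResidenceLatency(double currentEstimateSeconds,
                                     double observedStaySeconds,
                                     double alpha = 0.2)
{
    return alpha * observedStaySeconds + (1 - alpha) * currentEstimateSeconds;
}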

In this world, the mobile units suffer from the drawbacks that database connectivity may be intermittent and the power source is limited. Clients slip into doze mode to conserve energy; they are woken up when the server wants to communicate or when the user keeps them active.

Moreover, the type of data and data access is different for mobile applications. Data may be classified into 1) private data, restricted to the user, 2) public data, which can be accessed by anyone who can read it, and 3) shared data, which is accessed both in read and write modes by groups of users. Mobile databases can be distributed in one of two ways: the entire database may be distributed mainly among the wired components, possibly with full or partial replication, or the database may be distributed among both wired and wireless components. If the database is local to the device, it may need to be synchronized with the remote one. This environment, referred to as the intermittently synchronized database environment, has the following characteristics: the communication between client and server is unicast or multicast; a server cannot connect to a client at will; flakiness of the connection is immaterial; a client is free to manage its own data and transactions; and a client has multiple ways of connecting to the server.

While today's applications and even mobile database providers increasingly rely on Wi-Fi hotspots, where internet ubiquity, connection speed, and bandwidth have given rise to a near-wired connection environment, the classical problems of mobile databases still hold potential in the untapped market of areas without Wi-Fi coverage.

Snapshots can provide the last saved version, enabling both recovery and file synchronization services. Since snapshots can be taken at any granularity, the low disk and memory footprint required for mobile device usage can be maintained. This kind of service remains external to the database.

#codingexercise

Given a number, find whether it is colorful. A number is colorful when all the combinations of its digits result in different products when multiplied.
Solution: first we enumerate the combinations, then we look for duplicates among the resulting products.
void Combine(string digits, StringBuilder b, int start, int level, List<int> products)
{
    for (int i = start; i < digits.Length; i++)
    {
        b.Append(digits[i]);                                // include digits[i] in the current combination
        products.Add(CalculateProduct(b.ToString()));       // record this combination's product
        if (i < digits.Length - 1)
            Combine(digits, b, i + 1, level + 1, products); // extend with digits to the right
        b.Remove(b.Length - 1, 1);                          // backtrack
    }
}

// multiplies together the digits of the given string
int CalculateProduct(string s)
{
    int product = 1;
    foreach (char c in s)
        product *= c - '0';
    return product;
}

bool IsColorful(List<int> products)
{
    return products.Count == products.Distinct().Count(); // requires System.Linq
}
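Tying the two together, a quick usage sketch (a hypothetical driver for the Combine and IsColorful above):

static bool IsColorfulNumber(int n)
{
    string digits = n.ToString();
    var b = new System.Text.StringBuilder();
    var products = new System.Collections.Generic.List<int>();
    Combine(digits, b, 0, 0, products); // collect the product of every combination
    return IsColorful(products);        // colorful iff all products are distinct
}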

Observation: a colorful number cannot contain the digit 1, since the single digit d and the pair 1 × d repeat the same product.
Observation: a colorful number cannot contain duplicate digits, since two copies of a digit d each give the product d.

Saturday, October 24, 2015

#coding interview question
Given an unsorted array in which every pair of adjacent elements differs by either +1 or -1,
write efficient code to find a given element x.
Solution 1:
Scan from the left, skipping ahead: since neighbors differ by exactly 1, x cannot occur within fewer than |A[i] - x| positions of index i, so jump from i to i + |A[i] - x| and repeat, as sketched below.
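A minimal C# sketch of this skip-ahead scan (a common approach for this problem, relying on adjacent elements differing by exactly 1); it returns the first index of x, or -1 if x is absent:

static int FindInStepArray(int[] A, int x)
{
    int i = 0;
    while (i < A.Length)
    {
        if (A[i] == x) return i;
        // values change by exactly 1 per step, so x cannot occur
        // any closer than |A[i] - x| positions from i
        i += System.Math.Abs(A[i] - x);
    }
    return -1;
}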

Solution 2:
(approach when many x are involved)
Perform an insertion sort followed by a binary search.
Insertion sort is usually O(n^2), but in this case adjacent values differ by only +-1; after sorting, a binary search finds the cells containing x, and neighboring cells cover however many duplicates there may be.
void InsertionSort(int A[], int length)
{
    for (int i = 1; i < length; i++) {
        int j = i;
        while (j > 0 && A[j-1] > A[j]) {
            /* swap A[j] and A[j-1] */
            int temp = A[j];
            A[j] = A[j-1];
            A[j-1] = temp;
            j = j - 1;
        }
    }
}

int Search(int A[], int length, int x)
{
    InsertionSort(A, length);
    /* exists is the binary search helper defined in the October 23 entry below */
    return exists(A, length, x, 0, length - 1);
}

Friday, October 23, 2015


Q: In the plane we consider rectangles whose sides are parallel to the coordinate axes and have positive length. Such a rectangle will be called a box. Two boxes intersect if they have a common point in their interior or on their boundary. Find the largest n for which there exist n boxes B1, . . ., Bn such that Bi and Bj intersect if and only if i ≢ j ± 1 (mod n).

A: The condition given by the problem is that Bi and Bj intersect exactly when they are not adjacent. Boxes are called adjacent if their indices differ by 1 modulo n (so Bn and B1 are also adjacent). Stated otherwise, non-adjacent boxes must overlap and have a common point.

In other words, the pairs (B1, B2), (B2, B3), ..., (Bn-1, Bn), (Bn, B1) are adjacent and cannot overlap, but any other pair of boxes must overlap.

Let us take a look at what overlap means. When two boxes overlap as per the problem description, their projections onto the X-axis and onto the Y-axis each share a common point.
Take the projections of box Bk onto the X-axis and the Y-axis to be the intervals Ik and Jk respectively, where k varies from 1 to n.
Adjacent boxes do not intersect, so for each k at least one of the pairs (Ik, Ik+1) or (Jk, Jk+1) is disjoint. The adjacencies (I1, I2), ..., (In-1, In), (In, I1) and (J1, J2), ..., (Jn-1, Jn), (Jn, J1) therefore contribute n pairs of disjoint intervals in total, split between the two axes.
Since non-adjacent boxes necessarily intersect, their projections on both axes must intersect. However many pairwise-disjoint interval pairs can occur on one axis, at most that many can occur on the other axis as well; the largest n is therefore at most double the number of disjoint pairs achievable on a single axis.

Writing the intervals as L1, L2, ..., Ln with endpoints (a1, b1), (a2, b2), ..., (an, bn), we can choose two points alpha and beta such that alpha is the rightmost among the left endpoints and beta is the leftmost among the right endpoints.

If alpha is less than beta, then ai <= alpha < beta <= bi for all i; every interval contains the segment from alpha to beta, so there cannot be any disjoint pair.
If alpha is greater than beta, assume beta = bi for some i and alpha = a2. Then ai <= bi = beta < alpha = a2 <= b2, which means L2 and Li are disjoint. Now L2 intersects all the remaining intervals other than L1 and L3, so L2 and Li can be disjoint only if i = 1 or i = 3. Also, alpha belongs to the intersection of each of the remaining intervals, which is therefore not empty. Similarly the intervals L5, ..., Ln and L1 all intersect L3, and consequently they have a non-empty overlap with it that includes beta. This leaves (L1, L2), (L2, L3), and (L1, L3) as the only candidates for disjoint pairs on an axis, giving at most three per axis, and therefore the answer is n = 6.

It was left as an exercise that alpha belongs to the intersection of each of the remaining intervals and that alpha can be assumed to be a2. This is because we chose alpha as the rightmost among the left endpoints.
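Restating the counting step in compact form (LaTeX notation):

\#\{\text{disjoint pairs among the } I_k\} \le 3, \qquad \#\{\text{disjoint pairs among the } J_k\} \le 3 \;\Longrightarrow\; n \le 3 + 3 = 6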



Write a C program that, given an array A[] of n numbers and another number x, determines whether or not there exist two elements in A whose sum is exactly x.

#include <stdlib.h>

/* comparator for qsort */
static int compare(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* iterative binary search for val in the sorted range A[start..stop] */
int exists(int A[], int length, int val, int start, int stop)
{
    while (start <= stop) {
        int mid = start + (stop - start) / 2;
        if (A[mid] == val)
            return 1;
        if (A[mid] > val)
            stop = mid - 1;
        else
            start = mid + 1;
    }
    return 0;
}

int HasTwoNumbersWithSum(int A[], int length, int x)
{
    qsort(A, (size_t)length, sizeof(int), compare); /* sort first */
    for (int i = 0; i < length; i++) {
        /* search for the complement x - A[i] among the later elements */
        if (exists(A, length, x - A[i], i + 1, length - 1))
            return 1;
    }
    return 0;
}

Thursday, October 22, 2015

For any integer n >= 2, we compute the integer h(n) by applying the following procedure to its
decimal representation. Let r be the rightmost digit of n.
(1) If r = 0, then the decimal representation of h(n) results from the decimal representation
of n by removing this rightmost digit 0.
(2) If 1 <= r <= 9 we split the decimal representation of n into a maximal right part R that
solely consists of digits not less than r and into a left part L that either is empty or ends
with a digit strictly smaller than r. Then the decimal representation of h(n) consists of the
decimal representation of L, followed by two copies of the decimal representation of R-1.
For instance, for the number n = 17,151,345,543, we will have L = 17,151, R = 345,543
and h(n) = 17,151,345,542,345,542.
Prove that, starting with an arbitrary integer n >=  2, iterated application of h produces the
integer 1 after finitely many steps.
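Before turning to the proof, here is a small C# sketch of h as defined above, which can be handy for experimenting with the iteration (this is just one reading of the definition, with BigInteger used because h grows numbers quickly):

using System.Numerics;

static BigInteger H(BigInteger n)
{
    string s = n.ToString();
    char r = s[s.Length - 1];                 // rightmost digit
    if (r == '0')                             // case (1): remove the trailing zero
        return BigInteger.Parse(s.Substring(0, s.Length - 1));

    // case (2): find the maximal right part R whose digits are all >= r
    int i = s.Length - 1;
    while (i > 0 && s[i - 1] >= r) i--;
    string left = s.Substring(0, i);          // possibly empty
    string rMinus1 = (BigInteger.Parse(s.Substring(i)) - 1).ToString();

    // L followed by two copies of R - 1; e.g. H(17151345543) = 17151345542345542
    return BigInteger.Parse(left + rMinus1 + rMinus1);
}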

We identify integers n >= 2 with the digit strings of their decimal representations and extend the definition of h to all non-empty strings with digits from 0 to 9. We recursively define ten functions f9, f8, ..., f1, f0 that map certain strings to integers. The function f9 is only defined on strings that consist entirely of nines: if x consists of m nines, then f9(x) = m + 1, for m = 0, 1, ....
For k <= 8, the domain of fk is the set of all strings consisting only of digits that are >= k. We write such a string x in the form x = x0 k x1 k x2 k ... xm-1 k xm, where the substrings xs consist only of digits >= k + 1. The substrings may be empty, and m = 0 is allowed, meaning the digit k need not appear at all. We define
f_k(x) = \sum_{s=0}^{m} 4^{f_{k+1}(x_s)}
We now use the following facts:
Fact 1: If x does not contain digits smaller than k, then f_i(x) = 4^{f_{i+1}(x)} for all i = 0, ..., k - 1.
In particular, f_i(empty string) = 4^(9-i) for all i = 0, 1, ..., 9.
We establish the facts for k = 9, 8, ..., 0 by downward induction.
Fact 2: If the non-empty string x does not contain digits smaller than k, then f_i(x) > f_i(empty string) for all i = 0, ..., k.
Now fact 3 reduces the given problem to one of mere transformations:
Fact 3: f0(n) > f0(h(n)).
Since the values f0(n) form a strictly decreasing sequence of non-negative integers under iteration, the empty string is reached after a finite number of applications of h. But starting from a string without leading zeros, the empty string can only be reached via the sequence 1 -> 00 -> 0 -> empty string. By that transformation sequence, the number 1 appears after a finite number of applications of h.
The intuition behind the solution is that the problem reduces to transformations on digit strings, driven by the criterion in the question: the split at digits smaller than r.

Wednesday, October 21, 2015

Faking it Right by Ross McCammon 
Ross says the professional world’s a stage and we are all actors. According to him, the question is how much do the roles in which you cast yourself differ from who you actually are ?  If they are too different, it might cause problems. However, if they are not that much, they may help in letting your negative feelings pass. 
Ross does not mince words when he says that we lie. There are plenty of examples where we lie. However, there's honor in those lies because you are trying.
So let’s take a few examples of how to act: 
Fake confidence at a meeting:  do it at the very beginning when impressions are made and not later.  First smile. Second lock in to the “gaze”.  Third sit up straight. And fourth raise your eyebrows every now and then. 
Seem happy when the promotion goes to someone else: Dig deep, because you have to go through a lot of resentment and envy to get to the good stuff. It's more than just pretending, says Sean Kavanagh, CEO of the Ariel Group; you have to authentically take on the role. Take a few deep breaths into the diaphragm and not into the shoulders. Focus on the intention and the need of the audience. Empathize with the audience and remember the purpose of your soliloquy.
How to ‘can’ when you ‘can’t even’ : There are times when you can’t and there are times when you can’t even. If you can’t even and you have to seem like you can, here’s what to do : 
Focus on any object at eye level and raise the corners of your mouth. If your jaw has dropped, try to hold it up so that your lips close. This might even make it seem like you can.
How to smile when you do not feel like smiling: Note that unhappy smiling is worse than not smiling at all. Researchers who studied a group of bus drivers found that those who engaged in surface acting ended up in a worse mood, while those who faked happiness by smiling and thinking positive thoughts (deep acting) found themselves in a better mood.
If you want to seem happy, get happy, real happy. 
Lastly, the author provides a few key technical matters: 
Think of your reactions as “notches”. Determine at which notch your feelings sit. Call your feelings “it”.
Take “it” down at least one notch. Five is too many. Pay attention to body language. Make eye contact. Sit up straight. When in doubt: Act.  Stifle anger. Stifle irritation. Do not stifle a sneeze. And stifle feeling flabbergasted after a sneeze. Gasted is fine. Flabber is way too much. 
#coding exercise
For a given starting number k and a number of iterations n, print the count-and-say (look-and-say) sequence. For example, starting with 2:
2
12
1112
3112
132112
void PrintSequence(int k, int n)
{
    var lastseq = new List<int>() { k };
    Console.WriteLine(string.Join("", lastseq));
    for (int i = 0; i < n; i++)
    {
        var next = new List<int>();
        int j = 0;
        while (j < lastseq.Count)
        {
            // count the run of equal digits starting at position j
            int count = 1;
            while (j + 1 < lastseq.Count && lastseq[j + 1] == lastseq[j])
            {
                count++;
                j++;
            }
            next.Add(count);        // how many
            next.Add(lastseq[j]);   // of which digit
            j++;
        }
        lastseq = next;
        Console.WriteLine(string.Join("", lastseq));
    }
}

Tuesday, October 20, 2015

We continue the discussion on the question below:

Let n and k be fixed positive integers of the same parity, k >= n. We are given 2n lamps
numbered 1 through 2n; each of them can be on or off. At the beginning all lamps are off. We
consider sequences of k steps. At each step one of the lamps is switched (from off to on or from
on to off).
Let N be the number of k-step sequences ending in the state: lamps 1, . . . , n on, lamps
n+1, . . . , 2n off.
Let M be the number of k-step sequences leading to the same state and not touching lamps
n+1, . . . , 2n at all.
Find the ratio N/M.

We showed that every restricted admissible process can be modified in 2^(k-n) ways, which results in as many distinct admissible processes.
What remains to be shown is that every admissible process can be achieved in that way. And sure enough, we can replace each switch of a lamp with label l > n that occurs in q by a switch of the lamp with label l - n, and the resulting process p does not involve lamps n+1, ..., 2n.

Moreover, the switches of those lamps were even in number, and when the same switches are performed on the corresponding lamps with labels <= n, the resulting process is admissible and restricted.

Therefore the correspondence between admissible and restricted processes is reversible. This completes the proof.

Note that the state of each lamp and its operations are independent of the other lamps, so the number of ways to operate them is a simple multiplication. And since the state of each lamp is binary, it becomes intuitive that the ratio of admissible to restricted processes has to be a power of two.
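In compact form, the count established by this correspondence (LaTeX notation):

\frac{N}{M} = 2^{k-n}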

Finding Interest in the mundane 
Emails can be boring, especially when we are inundated by tens if not hundreds every day. We are trained to filter what's relevant and critical to us. We can even recall threads and conversations that matter to a specific topic of discussion. We spend some time scanning the incoming mail, more time filtering, and the most time responding or preparing a response to what matters. Consequently, the tools of our trade are the search options of the email software we use, which is Outlook. A cursory look at these search options shows that the queries we write come from what we remember, such as the subject containing a buzzword or the sender or recipient being someone specific. However, there are a few problems with this approach.
First, we speedboat from the get-go. We want to hammer home the wins from each incoming and outgoing response. There is hardly any technology other than search that we use in our aim to get where we want to, and if we have spent a lot of time digging around e-mails, we either request a resend or move on. There may be some tagging and organization involved, but these mostly reflect what we expect the incoming mails to conform to. Besides, with the explosion of containers, the search becomes narrowed, leading to more misses or incorrect assumptions.

Secondly, we don't smell the roses. In our daily routine, there is hardly any learning from the conversations that flow to our inbox unless we have scanned and read each one of them. Many of these conversations may not even be directed at the reader per se, and even less so at the tangent topics that may boomerang back to us. For example, there is no text analytics or content intelligence that we build from this vast amount of semi-structured data.
Thirdly, we don't have appealing visualizations of the information that we could have drones retrieve for us from the daily grind. Even if we realize a highly efficient text analytics platform over the emails and manage to build a user interface that improves our productivity, we may limit ourselves to ultimately showing the same search results as we used to see earlier, leaving the burden on the user to refine the results or scan them for items of interest. Yes, precision and recall can be improved, and search can even change from being header- or metadata-based to full text, but statistical and semantic analysis would not display their immense potential if they presented their results in the same way. What can we then use to render such information? Take the following visualization as an example:

Courtesy: http://vallandingham.me/msd/vis/
as an example of how we could organize the results in Outlook. This sort of imagery and interactivity is far more helpful to the user than static, endless results. Besides, this is complementary to, not a substitute for, the traditional search.
Fourth, even the search tools we use return data, not information. Most of the results are the individual emails that meet the query or search criteria, whereas the results we are looking for are in terms of concepts or tags that are abstract and may not even appear literally in the email. If we look at the following menu from the Outlook software:

We can see that the query builder window offers nothing like the clustering, categorization, and machine learning options we are talking about. There is no way even to build models for search here.
Finally, we are not talking about business intelligence. There is a lot of overlap between building logic for search and reports to display the results in a business intelligence add-on and in a content intelligence and text analytics platform, but the key difference here is that BI is targeted at some specific domain or kind of information, whereas we are talking about widening the panorama of information retrieval tools to glean what passes us by on a day-to-day basis. Time series data and Big Data techniques have made some of these approaches commercially viable at large scale, but scoping them down to the personal level and to this repository is altogether different.
Thus we can glean more from what we receive each day.