Friday, October 30, 2015

A diagonal of a regular 2006-gon is called odd if its endpoints divide the boundary into
two parts, each composed of an odd number of sides. Sides are also regarded as odd diagonals.
Suppose the 2006-gon has been dissected into triangles by 2003 non-intersecting diagonals.
Find the maximum possible number of isosceles triangles with two odd sides.

An isosceles triangle is called odd if it has two odd sides. A triangle in the dissection above that is odd and isosceles is called iso-odd.
We make an observation as follows:
Let AB be one of the dissecting diagonals and let Lambda be the shorter of the two parts of the boundary of the 2006-gon with endpoints A and B. Suppose that Lambda consists of n segments. Then the number of iso-odd triangles with all vertices on Lambda does not exceed n/2.
We can prove this by induction on n. In the base case n = 2 there can be only one such triangle, so the claim holds. Assume the observation holds for every Lambda of length less than n, and consider a Lambda consisting of n segments. Let PQ be the longest diagonal that is a side of an iso-odd triangle PQS with all vertices on Lambda. If there is no such triangle, then there are no iso-odd triangles within Lambda and the observation holds trivially. If, on the other hand, there is such a triangle, then every iso-odd triangle with vertices on Lambda lies within the largest such triangle PQS with summit at S. We may then assume the five points A, P, S, Q, B lie on Lambda in this order, partitioning it into the four pieces AP, PS, SQ and QB, where possibly the outer two segments reduce to a point. Since PQ is the longest such diagonal and PQS is an iso-odd triangle, no iso-odd triangle can have vertices on both of the outer segments AP and QB. Therefore every iso-odd triangle lying within Lambda has its vertices on just one of the four pieces. Applying the induction hypothesis to each piece and adding the four inequalities, we get the required total, which does not exceed n/2.

The intuition behind this approach is that an iso-odd triangle is determined by just three vertices, which can lie anywhere along the boundary; we therefore categorize the forms the triangle can take by where those vertices fall. Once we have a lemma bounding the number of such triangles, we can easily study it in each of the categories.

Thursday, October 29, 2015

Today we review the book The 5 Choices by Kory Kogon, Adam Merrill, and Leena Rinne.
This book is about the path to extraordinary productivity at a time when the workplace is getting overwhelming and exhausting. Texts, emails, interruptions, meetings and the like inundate us and distract us from thinking clearly, making good decisions and accomplishing what matters. The book therefore shows how to manage time and energy through five fundamental choices.
The book promises to increase our productivity at the individual, team and organization levels, and to help us feel engaged, accomplished and satisfied.
It shares new ideas on how to be productive, how to get clear and focus on the things that matter to us, how to increase our capability to make decisions, and how to recognize that we have the ability to do extraordinary work. The five choices are enumerated as follows:
Choice 1 is to act on the important and not to react to the urgent. The authors say that we either react or we think. Reacting is the lower level, where we process our feelings and emotions for a response but don't necessarily think it out. Thinking, on the other hand, is the upper level, where we make conscious and intentional decisions. The longer we stay on the upper plane, the better the outcome. We have some tools to do this: FranklinCovey's Time Matrix provides the framework and Pause-Clarify-Decide provides the process. The time matrix is a four-quadrant visualizing tool that helps us categorize activities. The top left quadrant is the quadrant of necessity and contains things that are both urgent and important. The lower left quadrant is the quadrant of distraction; this is the one that gives a negative return on your attention and energy. The lower right quadrant is the quadrant of waste; we should try not to be here at all. The upper right quadrant is the one for extraordinary productivity. To spend more time in this quadrant, we should pause to think which item belongs in which quadrant and, once it's clear, decide accordingly.
Choice 2 is to go for the extraordinary and not to settle for the ordinary. For each item that we place in Quadrant 2, we can attach a statement or story, as we do in our scrum meetings. We anchor it to a purpose and a passion, which helps us tap into our motivation and perform at a higher level.
Choice 3 is to schedule the big rocks and not to sort gravel. Quantifying our work items is akin to attaching story points to our stories. Consequently, we don't spend time on the little things that clutter up our day and prevent us from executing on the big items. A master task list is probably one of the best visualizing tools for this. Since the list is the master and it contains the items that we want to execute, we have all the items under consideration and ranked appropriately.
Choice 4 is to rule our technology and not let it rule us. Very often we spend more time learning to use a tool or to work with a tool on the assumption that the tool will save us, but as a Japanese swordsman once said, being independent of your tools and being present is the first step. You can manage the information you need to process by organizing it into appointments, tasks, contacts and notes/documents.
Choice 4 also includes a set of master moves that make our execution efficient. The first master move is to win without fighting: here we automate much of our decision-making process. The second master move is to eliminate any additional quadrant 3 and 4 items from our inbox. The third master move is to enable more preparedness, for instance by including a link to locate what we need, so we spend hardly any time in Q1.
Choice 5 is to fuel our fire and not to burn it out. The five drivers of mental and physical energy are moving, eating, sleeping, relaxing and connecting. Each of them is powerful in making us more productive.

Wednesday, October 28, 2015

Q: An equilateral triangle is partitioned into n^2 congruent equilateral triangles by n-1 equidistant parallels to each of its three sides. Two chess-like bishops placed at any two vertices of the small triangles are said to menace one another if they lie on the same parallel. The problem is to determine the largest number of bishops that can be placed so that none menaces another.

A: A bishop may be assigned three coordinates a, b, c, namely the numbers of rows of small triangles separating it from each of the three sides of the big triangle.

In other words, the non-menacing condition is that if i != j, then ai != aj, bi != bj and ci != cj.
If any pair of these coordinates were the same, the two bishops would lie on the same parallel and hence would menace each other.


Since each vertex of the grid satisfies a + b + c = n, the problem reduces to finding the maximal number N of such coordinate triples with pairwise distinct a's, b's and c's.
The distinctness condition gives, for each coordinate, a sum of at least 0 + 1 + ... + (N-1) = N(N-1)/2.
Summing all three coordinates over all bishops, we have 3N(N-1)/2 <= sum over all i of (ai + bi + ci) = nN,
so N <= floor(2n/3) + 1, and this bound is attained.

By solving this problem, we have obtained the same bound as in the triples problem below, and again shown that the limiting case of equality is also attained as a solution.

#codingexercise
Determine the type of a triangle given its edge lengths
void printTriangle(int a, int b, int c)
{
    if (a == b && b == c)
        printf("equilateral");
    else if (a == b || b == c || c == a)
        printf("isosceles");
    else
        printf("scalene");
}

Tuesday, October 27, 2015

#coding question.

Each row of a binary matrix has all its 1's followed by all its 0's. Find the row with the maximum number of 1's.

int GetRowWithMost1(int[,] matrix, int row, int col)
{
    int maxCount = 0;
    int ret = 0;
    for (int i = 0; i < row; i++)
    {
        int count = 0;
        for (int j = 0; j < col; j++)
        {
            if (matrix[i, j] == 1) count++;
        }
        if (count > maxCount)
        {
            maxCount = count;
            ret = i;
        }
    }
    return ret;
}
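Since each row has all its 1's before its 0's, the count of 1's in a row is just the index of its first 0, so a staircase-style scan can do better than the O(row × col) double loop above. A minimal C sketch of this variant (the flattened-array layout and function name are our own):

```c
/* Each row has all its 1s before its 0s, so the count of 1s in a row is
   the index of its first 0.  The boundary column only ever moves right,
   giving O(rows + cols) total work instead of scanning every cell. */
int GetRowWithMost1s(const int *matrix, int rows, int cols)
{
    int best = 0, c = 0;   /* c = 1-count of the best row seen so far */
    for (int r = 0; r < rows; r++) {
        while (c < cols && matrix[r * cols + c] == 1) {
            c++;           /* row r has more 1s than any row before it */
            best = r;
        }
    }
    return best;
}
```

Because the boundary index c never moves left, the whole matrix costs O(rows + cols) comparisons rather than O(rows × cols).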

Q: For any integer n >= 2, let N(n) be the maximal number of triples (ai, bi, ci), i = 1, ...., N(n), consisting of nonnegative integers ai; bi and ci such that the following two conditions are satisfied:
(1) ai + bi + ci = n for all i = 1, ...., N(n);
(2) If i != j; then ai != aj ; bi != bj and ci != cj .


Determine N(n) for all n >= 2.

A: The second condition states that the a-coordinates of all the triples are pairwise distinct, and the same holds for the b- and c-coordinates.
If we sum the a-coordinates, pairwise distinctness means the sum is at least 0 + 1 + ... + (N-1), the smallest possible sum of N distinct nonnegative integers.
Therefore the sum of the a-coordinates is at least N(N-1)/2.
The same inequality holds for the b- and c-coordinates. Consequently, 3N(N-1)/2 <= sum of all coordinates over the N triples = nN.
Therefore 3(N-1)/2 <= n,
and N has to be less than or equal to floor(2n/3) + 1.
By working through examples we can see that N can equal floor(2n/3) + 1, and this is the value we are required to determine.

There are three different cases for n that we can take. These are
n=3k-1
n=3k
n=3k+1
and for each we can exhibit examples with floor(2n/3) + 1 triples satisfying the two conditions above. Hence the limiting case floor(2n/3) + 1 is attained, and the determination N(n) = floor(2n/3) + 1 is correct.
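For small n, the claim N(n) = floor(2n/3) + 1 can be sanity-checked mechanically. The following C sketch brute-forces the maximal number of triples with pairwise distinct coordinates (the function names are ours; the search is exponential and only meant for tiny n):

```c
/* Brute-force N(n): the largest set of triples (a,b,c) of nonnegative
   integers with a+b+c = n and pairwise distinct a's, b's and c's.
   Triples are enumerated in increasing (a,b) order so each subset is
   visited once.  Only suitable for very small n. */
static int best;
static int usedA[32], usedB[32], usedC[32];

static void search(int n, int a, int b, int count)
{
    if (count > best) best = count;
    for (; a <= n; a++, b = 0)
        for (; a + b <= n; b++) {
            int c = n - a - b;
            if (!usedA[a] && !usedB[b] && !usedC[c]) {
                usedA[a] = usedB[b] = usedC[c] = 1;   /* take this triple */
                search(n, a, b + 1, count + 1);
                usedA[a] = usedB[b] = usedC[c] = 0;   /* backtrack */
            }
        }
}

int maxTriples(int n)
{
    best = 0;
    for (int i = 0; i <= n; i++) usedA[i] = usedB[i] = usedC[i] = 0;
    search(n, 0, 0, 0);
    return best;
}
```

For n = 2, 3, 4, 5 this returns 2, 3, 3, 4, agreeing with floor(2n/3) + 1.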





Monday, October 26, 2015

This paper introduces a technique for evaluating the transformation rules that drive query optimization in a relational database server such as Microsoft SQL Server. It studies the usage of transformation rules by the query optimizer with the help of profiling tools. Profiling in this case means finding the set of rules that are needed for a particular query to obtain an optimal plan. This is referred to as an essential set of rules for that query. There may be more than one essential set for a query if there are different ways to generate an optimal plan for the query, and consequently there may be some overlap of rules between the sets. A rule that is included in every essential set for a query is deemed a relevant transformation rule for that query. The paper empirically studies these metrics for a variety of workloads. The plan and cost of an execution depend on what is chosen by the optimizer and are considered functions of the input query.
While this paper succeeds in articulating metrics that yield fewer transformation rules for the purpose of studying different workloads by the same query optimizer, it could have considered alternate metrics such as precision and recall. Precision in this case is the measure of how many of the selected items are relevant: the ratio of the true positives to all rules selected by the query optimizer for this query. A true positive is a rule that improves the query execution plan; a false positive doesn't, but shows up in the query plan anyway. Recall, on the other hand, is a metric that determines how many of the relevant items are selected: the ratio of the true positives to all rules that would have improved the query execution plan, drawn from the global set of transformation rules including those the optimizer did not select. Together precision and recall yield a further metric, the F-score, which gives the effectiveness of retrieval with respect to a given query.
The advantages of the metrics from the paper are not lost with the proposed metric; on the contrary, it comes closer to the evaluation the paper set out to perform. The paper suggests that there are fewer than twenty transformation rules that fall under its "relevant" definition. The inclusion of such a rule in every essential set merely strengthens the criterion that it is "relevant". This equates to identifying the common rules among the "true positives" of every essential set, but their inclusion alone is not a guarantee that they are indeed improving the plan by bringing the cost down. At the same time, the demarcation of true versus false positives is much cleaner in what the optimizer has selected and is indeed relevant to the optimization of the plan. The essential set and the combination of true and false positives are similar in nature, although it can be argued that, without any comparison across optimization plans for the same query, the true and false positives give a better indicator within the same chosen set for an execution plan. Furthermore, while it can be argued that most if not all the transformation rules may be true positives for a highly effective query optimizer, the differentiation from another choice of query plan cannot be as well quantified by the paper's "relevant" definition alone as with precision and recall. Even the differentiation in terms of false positives is brought to better light with the use of precision and recall.
#coding exercise
Check whether a given linked list represents a palindrome.
The challenge here is sequential access of elements for comparison rather than index-based access.
Solution 1
Determine length of linked list in first pass.
Traverse exactly half and then check the latter half to be the reverse of the preceding half.
Linear time complexity.
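A C sketch of Solution 1's idea, using the common fast/slow-pointer variant that reverses the first half while locating the middle, so the length pass is folded into the traversal (the node type and list-building helper are illustrative, and the list is modified in place):

```c
#include <stdlib.h>

struct Node { int val; struct Node *next; };

/* Builds a list from an array; for illustration only. */
struct Node *FromArray(const int *a, int n)
{
    struct Node *head = NULL;
    for (int i = n - 1; i >= 0; i--) {
        struct Node *node = malloc(sizeof *node);
        node->val = a[i];
        node->next = head;
        head = node;
    }
    return head;
}

/* Reverses the first half in place while the fast pointer finds the
   middle, then compares the two halves.  O(n) time, O(1) extra space. */
int IsPalindrome(struct Node *head)
{
    struct Node *slow = head, *fast = head, *prev = NULL;
    while (fast && fast->next) {
        fast = fast->next->next;
        struct Node *next = slow->next;  /* reverse first half as we go */
        slow->next = prev;
        prev = slow;
        slow = next;
    }
    /* odd length: skip the middle element */
    struct Node *second = fast ? slow->next : slow;
    for (struct Node *first = prev; first;
         first = first->next, second = second->next)
        if (first->val != second->val)
            return 0;
    return 1;
}
```

The trade-off against the two-pass version is that the input list is left half-reversed; a second pass could restore it if the caller needs the original order.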


Sunday, October 25, 2015


Mobile databases

Handheld phones and mobile devices are becoming ubiquitous and more and more powerful, and their software can support a large ecosystem of applications. The applications vie for the limited resources on the mobile devices. As a textbook says, some of the software problems involving data management, transaction management and database recovery have their origin in distributed data systems, and in mobile computing these problems become more difficult to solve, mainly because of the nature of the bandwidth of wireless communication channels, the relatively short active life of the power supply, and the changing locations of the required information (sometimes in cache, sometimes in the air, sometimes at the server).

The mobile computing architecture is also slightly different. It is one where mobile units and base stations communicate through wireless channels having bandwidths significantly lower than those of a wired network.  The mobile units operate within geographic mobility domain that is divided into smaller domains called cells. The mobile discipline requires that the movement of mobile units be unrestricted within the geographic mobility domain, while having information access contiguity during movement. The average duration of a user’s stay in the cell, referred to as the residence latency, is a parameter that is computed and continuously adjusted.

In this world, the mobile units suffer from the drawbacks that the database connectivity may be intermittent and may have limited power source. Clients slip into doze mode to conserve energy. They can be woken up when the server wants to communicate or the user keeps it active.

Moreover, the type of data and data access is different for mobile applications. Data may be classified into 1) private data, restricted to the user; 2) public data, which can be accessed by anyone who can read it; and 3) shared data, which is accessed both in read and write modes by groups of users. Mobile databases can be distributed in one of two ways: the entire database may be distributed mainly among the wired components, possibly with full or partial replication, or the database may be distributed among wired and wireless components. If the database is local to the device, it may need to be synchronized with the remote copy. This environment, referred to as the intermittently synchronized database environment, has the following characteristics: the communication between client and server is unicast or multicast; a server cannot connect to a client at will; flakiness of the connection is immaterial; a client is free to manage its own data and transactions; and a client has multiple ways of connecting to the server.

While today’s application and even mobile database providers are increasingly relying on wifi hot spots where internet ubiquity and connection speed and bandwidth have given rise to near wired connection environment, the classical problem of mobile databases has potential for untapped market of non-wifi coverages.

Snapshots can provide the last saved version enabling both recovery and file synchronization services. Since snapshots can be taken at any granularity, the low disk and memory footprint for mobile device usage can be maintained. This kind of service becomes external to the database.

#codingexercise

If you are given a number, find whether it is colorful. A number is colorful when all the combinations of its digits result in different products when multiplied.
Solution: first we enumerate the combinations, then we find the duplicates among the products resulting from the combinations.
void Combine(string digits, StringBuilder b, int start, List<int> products)
{
    for (int i = start; i < digits.Length; i++)
    {
        b.Append(digits[i]);
        products.Add(CalculateProduct(b.ToString()));
        Combine(digits, b, i + 1, products);   // extend the current combination
        b.Remove(b.Length - 1, 1);             // backtrack
    }
}

bool IsColorful(List<int> products)
{
    return products.Count == products.Distinct().Count();
}

Observation: the given number should not contain the digit 1, since a combination including 1 has the same product as the same combination without it.
Observation: the given number should not contain duplicate digits.
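A self-contained C sketch of the same check, following the post's combination (subset) definition of colorful; it assumes the number has at most 10 digits so that subsets fit in a bitmask:

```c
/* Follows the post's definition: a number is colorful when every
   non-empty combination (subset) of its digits has a distinct product.
   Digits are given as d[0..n-1]; assumes n <= 10 so the bitmask
   enumeration and the products buffer stay small. */
int IsColorful(const int d[], int n)
{
    long products[1 << 10];   /* products of the subsets seen so far */
    int count = 0;
    for (unsigned mask = 1; mask < (1u << n); mask++) {
        long p = 1;
        for (int i = 0; i < n; i++)
            if (mask & (1u << i)) p *= d[i];
        for (int j = 0; j < count; j++)
            if (products[j] == p) return 0;   /* duplicate product */
        products[count++] = p;
    }
    return 1;
}
```

Both observations above fall out of this check: a digit 1 or a repeated digit immediately produces a duplicate product.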

Saturday, October 24, 2015

#coding interview question
Given an unsorted array in which every pair of adjacent elements differs by either +1 or -1,
write efficient code to find a given element x.
Solution 1:
Binary Chop the elements till we encounter either x-1, x, or x+1 then search nearby for x.

Solution 2:
(approach when many x are involved)
Perform an insertion sort followed by a binary search.
Insertion sort is usually O(n^2); after sorting we look at the cells that contain x-1, x or x+1, and however many duplicates there may be.
void InsertionSort(int A[], int length)
{
    for (int i = 1; i < length; i++)
    {
        int j = i;
        while (j > 0 && A[j-1] > A[j])
        {
            int temp = A[j];   /* swap A[j] and A[j-1] */
            A[j] = A[j-1];
            A[j-1] = temp;
            j = j - 1;
        }
    }
}

int Search(int A[], int length, int x)
{
    InsertionSort(A, length);
    return exists(A, length, x, 0, length - 1);
}
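For completeness, the +1/-1 property also admits a linear-time search without any sorting: if A[i] != x, then x cannot occur before index i + |x - A[i]|, because values change by at most 1 per step. A hedged C sketch of this jump approach:

```c
#include <stdlib.h>

/* Adjacent elements differ by exactly 1, so if A[i] != x the value x
   cannot appear before index i + |x - A[i]|; jumping that far each time
   finds an occurrence of x (or reports absence) in O(n). */
int FindByJumping(const int A[], int length, int x)
{
    int i = 0;
    while (i < length) {
        if (A[i] == x) return i;
        i += abs(x - A[i]);   /* safe skip: values move by at most 1 per step */
    }
    return -1;                /* x not present */
}
```

On the array {5,4,5,6,5,4,3,2,3} a search for 3 inspects only indices 0, 2, 4 and 6 before succeeding.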

Friday, October 23, 2015


Q: In the plane we consider rectangles whose sides are parallel to the coordinate axes and

have positive length. Such a rectangle will be called a box. Two boxes intersect if they have a
common point in their interior or on their boundary.
Find the largest n for which there exist n boxes B1, . . . , Bn such that Bi and Bj intersect if
and only if i is not congruent to j ± 1 (mod n).

A: The problem calls two rectangles Bi and Bj adjacent if their indices differ by exactly 1 (mod n). The condition states that adjacent rectangles do not intersect, while non-adjacent rectangles must overlap and have a common point.

In other words, the pairs (B1, B2), (B2, B3), ..., (Bn-1, Bn), (Bn, B1) are adjacent and cannot overlap, but any other pairing of the Bi must overlap.

Let us take a look at what overlap means. When two rectangles overlap as per the problem description, their projections on the X-axis and on the Y-axis each have a common point.
Take the sets of line segments given by the projections of the rectangles onto the X-axis and the Y-axis as Ik and Jk respectively, where k varies from 1 to n.
Adjacent rectangles don't intersect, so for each k at least one of the pairs (Ik, Ik+1) or (Jk, Jk+1) is disjoint. The n adjacent pairs, (I1,I2), ..., (In-1,In), (In,I1) on one axis and (J1,J2), ..., (Jn-1,Jn), (Jn,J1) on the other, therefore contribute n disjoint pairs in total.
And since non-adjacent rectangles necessarily intersect, their projections on both axes must intersect. However many pairs of disjoint intervals we can find on the x-axis, the same bound applies on the y-axis, so the largest n is at most double the number of disjoint pairs obtainable among intervals on a single line.

Writing the line segments as L1, L2, ..., Ln with endpoints (a1,b1), (a2,b2), ..., (an,bn), we can choose two points alpha and beta such that alpha is the rightmost among the left endpoints and beta is the leftmost among the right endpoints.

If alpha is less than beta, then ai <= alpha < beta <= bi for every i, so every segment contains alpha and no two can be disjoint.
If alpha is greater than beta, assume beta = b1 and alpha = a2. Then a1 <= b1 = beta < alpha = a2 <= b2, which means L1 and L2 are disjoint. Now L2 intersects all the remaining intervals other than L1 and L3, so L2 and Li can be disjoint only if i = 1 or i = 3; also, alpha belongs to the intersection of each of the remaining intervals, which is therefore non-empty. Similarly L5 through Ln and L1 intersect with L3, and consequently they have a non-empty overlap that includes beta. This leaves (L1,L2), (L2,L3) and (L1,L3) as the only candidates for disjoint pairs on one axis. With at most three disjoint pairs per axis, the answer is n = 6.

It was left as an exercise that alpha lies in the intersection of each of the remaining intervals and that alpha can be assumed to be a2; this follows from choosing alpha as the rightmost left endpoint.



Write a C program that, given an array A[] of n numbers and another number x, determines whether or not there exist two elements in A whose sum is exactly x.

int compare(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int exists(int A[], int val, int start, int stop)
{
    while (start <= stop)
    {
        int mid = (start + stop) / 2;
        if (A[mid] == val) return 1;
        if (A[mid] > val)
            stop = mid - 1;
        else
            start = mid + 1;
    }
    return 0;
}

int HasTwoNumbersWithSum(int A[], int length, int x)
{
    qsort(A, length, sizeof(int), compare);
    for (int i = 0; i < length; i++)
    {
        /* search both sides of i so an element is not paired with itself */
        if (exists(A, x - A[i], 0, i - 1) ||
            exists(A, x - A[i], i + 1, length - 1))
            return 1;
    }
    return 0;
}

Thursday, October 22, 2015

For any integer n >= 2, we compute the integer h(n) by applying the following procedure to its
decimal representation. Let r be the rightmost digit of n.
(1) If r = 0, then the decimal representation of h(n) results from the decimal representation
of n by removing this rightmost digit 0.
(2) If 1 <= r <= 9 we split the decimal representation of n into a maximal right part R that
solely consists of digits not less than r and into a left part L that either is empty or ends
with a digit strictly smaller than r. Then the decimal representation of h(n) consists of the
decimal representation of L, followed by two copies of the decimal representation of R-1.
For instance, for the number n = 17,151,345,543, we will have L = 17,151, R = 345,543
and h(n) = 17,151,345,542,345,542.
Prove that, starting with an arbitrary integer n >=  2, iterated application of h produces the
integer 1 after finitely many steps.

We identify integers n >= 2 with the digit strings of their decimal representations and extend the definition of h to all non-empty strings with digits from 0 to 9. We recursively define ten functions f0 to f9 that map certain strings to positive integers, working down from k = 9 to k = 0. The function f9 is defined only on strings x consisting entirely of nines; if x consists of m nines, then f9(x) = m + 1, for m = 0, 1, ....
For k <= 8 the domain of fk is the set of all strings consisting only of digits >= k. We write such a string in the form x = x0 k x1 k x2 k ... x(m-1) k xm, where each substring xs consists only of digits >= k + 1 (any xs may be empty, and m = 0 means the digit k does not appear). We define
fk(x) = sum over s = 0 to m of 4 ^ (f(k+1)(xs))
We now use the following facts, established for k = 9, 8, ..., 0 by induction:
Fact 1: If x does not contain digits smaller than k, then fi(x) = 4 ^ (f(i+1)(x)) for all i = 0, ..., k - 1.
Fact 2: If the non-empty string x does not contain digits smaller than k, then fi(x) > fi(empty string) for all i = 0, ..., k.
Now the fact that reduces the given problem to one of mere transformations:
Fact 3: f0(n) > f0(h(n)).
Since f0 strictly decreases with every application of h, the empty string will be reached after a finite number of applications. But starting from a string without leading zeros, the empty string can only be reached via the strings 1 -> 00 -> 0 -> empty string. By that transformation sequence, the number 1 will appear after a finite number of applications of h.
The intuition behind the solution is to reduce the given problem to a sequence of string transformations measured by a strictly decreasing quantity, here f0, built from the problem's own criterion of splitting at the digits smaller than r.
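The transformation h itself is straightforward to implement on digit strings, which lets us check the worked example from the statement. A C sketch (memory management is kept minimal for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Decrement a positive decimal string in place. */
static void dec(char *s)
{
    int i = (int)strlen(s) - 1;
    while (s[i] == '0') s[i--] = '9';
    s[i]--;
}

/* One application of h, following rules (1) and (2) from the problem.
   Returns a freshly allocated string. */
char *h(const char *n)
{
    size_t len = strlen(n);
    char r = n[len - 1];
    if (r == '0') {                        /* rule (1): drop the final 0 */
        char *out = malloc(len);
        memcpy(out, n, len - 1);
        out[len - 1] = '\0';
        return out;
    }
    size_t split = len;                    /* rule (2): maximal right part R */
    while (split > 0 && n[split - 1] >= r) split--;
    size_t rlen = len - split;
    char *R = malloc(rlen + 1);
    memcpy(R, n + split, rlen);
    R[rlen] = '\0';
    dec(R);                                /* R - 1 as a decimal string */
    char *out = malloc(split + 2 * rlen + 1);
    memcpy(out, n, split);                 /* L ... */
    strcpy(out + split, R);                /* ... followed by two copies */
    strcat(out, R);                        /* ... of R - 1 */
    free(R);
    return out;
}
```

For instance, h("17151345543") yields "17151345542345542", matching the example in the statement.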

Wednesday, October 21, 2015

Faking it Right by Ross McCammon 
Ross says the professional world’s a stage and we are all actors. According to him, the question is how much do the roles in which you cast yourself differ from who you actually are ?  If they are too different, it might cause problems. However, if they are not that much, they may help in letting your negative feelings pass. 
Ross does not mince words when he says that we lie. There are plenty of examples of when we lie. However, there's honor in those lies because you are trying.
So let’s take a few examples of how to act: 
Fake confidence at a meeting:  do it at the very beginning when impressions are made and not later.  First smile. Second lock in to the “gaze”.  Third sit up straight. And fourth raise your eyebrows every now and then. 
Seem happy when the promotion goes to someone else: Dig deep, because you have to go through a lot of resentment and envy to get through to the good stuff. It's more than just pretending, says Sean Kavanagh, CEO of the Ariel Group. You have to authentically take on that role. Take a few deep breaths into the diaphragm and not into the shoulders. Focus on the intention and the need of the audience. Empathize with the audience and remember the purpose of your soliloquy.
How to ‘can’ when you ‘can’t even’ : There are times when you can’t and there are times when you can’t even. If you can’t even and you have to seem like you can, here’s what to do : 
Focus on any object at eye level and raise the corners of your mouth. If your jaw has dropped, try to hold it up so that your lips close. This might even seem like you can.
How to smile when you do not feel like smiling: Note that unhappy smiling is worse than not smiling at all. Researchers who studied a group of bus drivers found that those who engaged in surface acting ended up in a worse mood, while those who faked happiness by smiling and thinking positive thoughts (deep acting) found themselves in a better mood.
If you want to seem happy, get happy, real happy. 
Lastly, the author provides a few key technical matters: 
Think of your reactions as “notches”. Determine which notch your feelings are. Call your feelings “it”. 
Take “it” down at least one notch. Five is too many. Pay attention to body language. Make eye contact. Sit up straight. When in doubt: Act.  Stifle anger. Stifle irritation. Do not stifle a sneeze. And stifle feeling flabbergasted after a sneeze. Gasted is fine. Flabber is way too much. 
#coding exercise
For a given starting number k and a number of iterations n, print the sequence in which each term describes the previous one as counts followed by values. For example,
2
12
1112
3112
132112
void PrintSequence(int k, int n)
{
    var lastseq = new List<int> { k };
    for (int i = 0; i < n; i++)
    {
        Console.WriteLine(string.Join("", lastseq));
        var next = new List<int>();
        int j = 0;
        while (j < lastseq.Count)
        {
            int count = 1;
            while (j + 1 < lastseq.Count && lastseq[j + 1] == lastseq[j])
            {
                j++;
                count++;
            }
            next.Add(count);       // how many times the value repeats
            next.Add(lastseq[j]);  // the repeated value itself
            j++;
        }
        lastseq = next;
    }
    Console.WriteLine(string.Join("", lastseq));
}

Tuesday, October 20, 2015

We continue the discussion on the question below:

Let n and k be fixed positive integers of the same parity, k >= n. We are given 2n lamps
numbered 1 through 2n; each of them can be on or off. At the beginning all lamps are off. We
consider sequences of k steps. At each step one of the lamps is switched (from off to on or from
on to off).
Let N be the number of k-step sequences ending in the state: lamps 1, . . . , n on, lamps
n+1, . . . , 2n off.
Let M be the number of k-step sequences leading to the same state and not touching lamps
n+1, . . . , 2n at all.
Find the ratio N/M.

We showed that every restricted admissible process can be modified in 2^(k-n) ways, each modification resulting in a distinct admissible process.
What remains to be shown is that every admissible process can be achieved in that way. And surely enough, we can replace each switch of a lamp with label l > n occurring in an admissible process q by a switch of the lamp with label l - n, so that the resulting process p does not involve lamps n+1, ..., 2n.

Moreover, the switches of those lamps were even in number, so when the same switches are performed on the lamps with labels <= n, the resulting process is admissible and restricted.

Therefore the map from admissible to restricted processes is reversible, with each restricted process arising from exactly 2^(k-n) admissible ones. This completes the proof, giving N/M = 2^(k-n).

Note that the state of each lamp and its operations are independent of the other lamps, so the number of ways to operate them is a simple product. And since the states of the lamps are binary, it becomes intuitive that the ratio of admissible to restricted processes has to be a power of two.
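For small parameters, the ratio N/M = 2^(k-n) can be confirmed by exhaustive enumeration. A C sketch (the recursive counter is ours; lamp states are bitmasks, and the restricted count is obtained by enumerating over only the first n lamps):

```c
/* Counts k-step switching sequences over `lamps` lamps that transform
   state 0 into `target`; each step toggles exactly one lamp.  Exponential
   in k, so only for tiny cases. */
long countSeq(int lamps, int k, unsigned state, unsigned target)
{
    if (k == 0) return state == target;
    long total = 0;
    for (int l = 0; l < lamps; l++)
        total += countSeq(lamps, k - 1, state ^ (1u << l), target);
    return total;
}
```

With target = (1 << n) - 1 (lamps 1..n on, the rest off), counting over all 2n lamps gives N, while counting over only the first n lamps gives M; for n = 2, k = 4 the two counts differ by exactly the factor 2^(4-2) = 4.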

Finding Interest in the mundane 
Emails can be boring especially when we are inundated by the tens if not hundreds every day. We are trained to filter what’s relevant and critical to us. We can even recall threads and conversations that may matter to a specific topic of discussion. We spend some time scanning the incoming mails, some more time filtering and most time responding or preparing a response to what matters. Consequently the tools of our trade are the search options on the Email software we use which is Outlook. A cursory look at this search options shows that the query we write are from what we remember such as the subject containing a buzz word or sender or recipient being someone specific. However, there are a few problems with this approach.  
First, we speedboat from the get-go. We want to hammer home the wins from each incoming and outgoing response. There is hardly any technology other than search that we use in our aim to get where we want to, and if we have spent a lot of time digging around e-mails, we either request a resend or move on. There may be some tagging and organization involved, but these mostly reflect what we expect the incoming mails to conform to. Besides, with the explosion of containers, the search becomes narrowed, leading to higher misses or incorrect assumptions.
Secondly, we don't smell the roses. In our daily routine, there is hardly any learning from the conversations that flow to our inbox unless we have scanned and read each one of them. Many of these conversations may not even be directed at the reader per se, and even less so at the tangent topics that may boomerang back to us. For example, there is no text analytics or content intelligence that we build from this vast amount of semi-structured data.
Thirdly, we don’t have appealing visualization of the information that we can have drones retrieve us from the daily grind. Even if we do realize a highly efficient text analytics platform over the emails and manage to build a user interface that can improve our productivity, we may limit ourselves to ultimately showing them as the same search results as we used to see earlier leaving the burden on the user to refine the search results or scan them for items of interest. Yes precision and recall can be improved and even change from being header or metadata based search to full-text but statistical and semantic analysis would not display their immense potential if they presented their results in the same way. What can we then use to render such information ? Take the following visualization as an example: 

 Courtesy: http://vallandingham.me/msd/vis/ 
as an example of how we could organize the results in Outlook. This sort of imagery and interactivity is far more helpful to the user than static endless results. Besides, this is complementary to, not a substitute for, the traditional search.
Fourth, even the search tools we use return data, not information. Most of the results are the individual emails that meet the query or search criteria, whereas the results we are looking for are in terms of concepts or tags that are abstract and may not even appear literally in the email. If we look at the following menu from our Outlook software:

We can see that the query builder window is not the same as the kind of clustering, categorizing and machine learning techniques options we are talking about. There is no way to even build models for search here.  
Finally, we are not talking business intelligence. There is a lot of overlap, in building logic for search and reports to display the results, between business intelligence add-ons and a content intelligence and text analytics platform, but the key difference here is that BI is targeted at a specific domain or kind of information, whereas we are talking about widening the panorama of information retrieval tools to glean what passes us by on a day-to-day basis. Time series data and Big Data techniques have made some of these approaches commercially viable at large scale, but scoping them down to the personal level and to this repository is altogether different.
Thus we can explore more from what we get each day.