Thursday, April 7, 2016

#puzzle
You are given an array x of n positive numbers. You start at point (0,0) and move x[0] metres to the north, then x[1] metres to the west, x[2] metres to the south, x[3] metres to the east, and so on. In other words, after each move your direction changes counter-clockwise. Find out if the path crosses itself.
Solution
Keep track of the axis-aligned segments from a few steps back.
If the current segment is perpendicular to an earlier segment (the directions repeat in a cycle of four, so orientations alternate) and the two segments share a point (either the x or the y coordinates line up, and we know which, because vertical and horizontal segments alternate), then the path crosses.
// Both segments are axis-aligned: the first is (x1,y1)-(x2,y2),
// the second (x3,y3)-(x4,y4). A perpendicular pair crosses only if
// each segment's fixed coordinate falls within the other's range.
bool axis_segments_cross(int x1, int y1, int x2, int y2,
                         int x3, int y3, int x4, int y4)
{
    // normalize each segment so that its endpoints ascend
    if (x1 > x2) { var t = x1; x1 = x2; x2 = t; }
    if (y1 > y2) { var t = y1; y1 = y2; y2 = t; }
    if (x3 > x4) { var t = x3; x3 = x4; x4 = t; }
    if (y3 > y4) { var t = y3; y3 = y4; y4 = t; }
    if (x1 == x2 && y3 == y4)  // first vertical, second horizontal
        return y1 <= y3 && y3 <= y2 && x3 <= x1 && x1 <= x4;
    if (y1 == y2 && x3 == x4)  // first horizontal, second vertical
        return x1 <= x3 && x3 <= x2 && y3 <= y1 && y1 <= y4;
    return false;  // parallel segments are not treated as crossing here
}

bool update_axis_segments(List<int> x)
{
    var points = new List<Point> { new Point(0, 0) };
    for (int i = 0; i < x.Count; i++)
    {
        Point curr = points[points.Count - 1];
        Point next = curr;
        // update: directions cycle North, West, South, East
        switch (i % 4)
        {
            case 0: next.Y += x[i]; break; // North
            case 1: next.X -= x[i]; break; // West
            case 2: next.Y -= x[i]; break; // South
            case 3: next.X += x[i]; break; // East
        }
        points.Add(next);
        // check against the nearest perpendicular, non-adjacent segment,
        // which starts three moves back at points[i-3]
        if (i >= 3 &&
            axis_segments_cross(curr.X, curr.Y, next.X, next.Y,
                                points[i - 3].X, points[i - 3].Y,
                                points[i - 2].X, points[i - 2].Y))
            return true;
    }
    return false;
}
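For reference, the whole check can also be compressed into a short, self-contained sketch. This Python variant uses the known result for this puzzle that a new segment can only intersect the segment three, four or five moves back, so the coordinates never need to be materialized; it is a sketch of that standard case analysis, not the segment-tracking code above.

```python
def is_self_crossing(x):
    """x[i] is the i-th move length; directions cycle N, W, S, E.
    Returns True if the path crosses (or touches) itself."""
    n = len(x)
    for i in range(3, n):
        # current segment crosses the one three moves back
        if x[i] >= x[i - 2] and x[i - 1] <= x[i - 3]:
            return True
        # current segment meets the one four moves back (collinear touch)
        if i >= 4 and x[i - 1] == x[i - 3] and x[i] + x[i - 4] >= x[i - 2]:
            return True
        # current segment crosses the one five moves back
        if i >= 5 and x[i - 2] >= x[i - 4] and x[i] + x[i - 4] >= x[i - 2] \
                and x[i - 1] <= x[i - 3] and x[i - 1] + x[i - 5] >= x[i - 3]:
            return True
    return False
```

An outward spiral such as [1, 2, 3, 4] never crosses, while [2, 1, 1, 2] folds back onto its first segment.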

Wednesday, April 6, 2016

Today we start reading the paper Pelican: a building block for exascale cold data storage by Balakrishnan et al. Cold data is data that is rarely accessed. Pelican is a rack-scale, hard-disk based storage unit designed as the basic building block for exabyte-scale cold data storage. A rack-scale set of computers is a class of hardware with pooled storage, compute and network, such that thousands of cores are interconnected at high speed. The rack is increasingly replacing the individual server as the basic building block of modern data centers. At Microsoft, the Rack-scale Computing project aims to redesign the hardware, storage, networking and OS stacks in a cross-layer, co-designed manner so as to improve efficiency. Pelican right-provisions a building block of this size for cold data workloads; the provisioning is considered just right when the total cost of ownership falls compared to traditional disk-based storage. For example, only 8% of the drives can be spinning concurrently in this building block. This introduces complex resource management to be handled by Pelican.
This is met in the following way. Resource restrictions are expressed as constraints over the hard drives. The data layout and IO scheduling ensure that these constraints are not violated. Pelican and conventional storage are compared side by side using a cross-validated simulator. Pelican demonstrates high throughput with acceptable latency for cold workloads.
The cost of storing data spans a sliding scale; storage in Amazon S3, for example, is near the high end. But cold data is also growing phenomenally. Amazon Glacier and Facebook's cold data storage are just a few examples where S3 doesn't suffice and cloud-scale tiers need to be provisioned. Moreover, these systems have to minimize the up-front capital cost of buying the storage as well as the cost of running it. By co-designing the hardware and software stacks, Pelican aims to allow the entire rack's resources to be carefully balanced to support cold data workloads.
#coding exercise
Given a sorted positive integer array nums and an integer n, add/patch elements to the array such that any number in range [1, n] inclusive can be formed by the sum of some elements in the array. Return the minimum number of patches required.
This has a greedy approach.
If the range of sums that can be generated using the first k numbers in nums is [0, r], and the next number has value x, then either:
• x <= r + 1, and it is possible to generate any sum in [0, r + x] using the first k+1 numbers, or
• x > r + 1, and it is not possible to generate the sum r + 1, since x and all the following numbers have greater values.
We keep patching with x = r + 1 and counting such additions until r meets or exceeds n.

int GetMinPatches(List<int> nums, int n)
{
    int count = 0;
    int index = 0;
    long r = 0;  // every sum in [0, r] can already be formed
    while (r < n)
    {
        if (index < nums.Count && nums[index] <= r + 1)
        {
            // the next array element extends the reachable range
            r += nums[index];
            index++;
        }
        else
        {
            // patch with r + 1, doubling the reachable range to 2r + 1
            r += r + 1;
            count++;
        }
    }
    return count;
}
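The same greedy argument translates directly to a short Python version, which is convenient for checking a few cases by hand:

```python
def get_min_patches(nums, n):
    """Minimum numbers to add to sorted nums so that every sum in
    [1, n] can be formed from some subset of the array."""
    count, index, reach = 0, 0, 0   # all sums in [0, reach] are formable
    while reach < n:
        if index < len(nums) and nums[index] <= reach + 1:
            reach += nums[index]    # existing element extends the range
            index += 1
        else:
            reach += reach + 1      # patch with reach + 1
            count += 1
    return count
```

For nums = [1, 3] and n = 6, a single patch (the number 2) suffices; for nums = [1, 5, 10] and n = 20, two patches are needed.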

Tuesday, April 5, 2016

Today we continue to discuss software testing effectiveness measures: the P-measure, E-measure and F-measure. We had read that the E-measure helps make the P-measure more useful; their relative values change similarly. Moreover, the E-measure can differentiate the detection of more than one failure, while the P-measure regards one testing strategy as good as another if both detect at least one failure. The E-measure favors homogeneous subdomains, whereas to maximize the P-measure it suffices that a single subdomain holds all the failure-causing inputs.
Moreover, it has been difficult to generalize such conditions for overlapping subdomains. But in this case too the E-measure is able to identify some characteristics of overlapping subdomain testing. For example, T.Y. Chen and Y.T. Yu found that the relative performance of subdomain testing to random testing depends on the aggregate of the differences between each subdomain's failure rate and the overall failure rate, weighted by the number of test cases selected from that subdomain. In the overlapping case, the subdomain failure rates can be higher than the overall rate, which tips the score in favor of subdomain testing over random testing.
Lastly, if the E-measure says that a subdomain testing strategy is better than random testing, then it must be so with the P-measure also. This leads to characterizations using the P-measure. This does not mean that the same program gets the same ranking from both measures. That said, if the failure rates are small, the E-measure serves as a good approximation to the P-measure. More analysis with the E-measure can also be helpful where subdomain testing strategies are feasible.

Monday, April 4, 2016

Today we continue to discuss software testing effectiveness measures: the P-measure, E-measure and F-measure. We were discussing random testing. The P-measure in this case is 1 minus the success rate raised to the power of the inverse of the failure rate p. The E-measure is the product of the number of test cases and the failure rate. The F-measure is the ratio of the size of the input domain to the size of the failure region. Random black-box sampling of the input domain gives the worst case. That said, randomness may still be required because the failure patterns are not always known. This leads to a technique called adaptive random testing, which distributes test cases evenly within the input space without losing all randomness. The fixed-size candidate set variant of this technique involves the following steps: first, randomly generate a set of candidate test cases in each iteration of test case generation. For each candidate, calculate its distance from the set of previously executed tests. Lastly, execute the candidate with the greatest distance and add it to the set of executed tests. This technique has achieved results near the upper bound for effectiveness.
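The fixed-size candidate set steps above can be sketched in a few lines. This is a minimal one-dimensional illustration; `run_test`, the interval domain and the candidate count of 10 are assumptions for the sketch, not values from the literature:

```python
import random

def fscs_art(run_test, domain, candidates=10, max_tests=100):
    """Fixed-size-candidate-set adaptive random testing over an
    interval domain = (lo, hi). run_test(x) returns True on failure.
    Returns the first failing input, or None if none is found."""
    lo, hi = domain
    executed = [random.uniform(lo, hi)]        # first test is purely random
    if run_test(executed[0]):
        return executed[0]
    for _ in range(max_tests - 1):
        pool = [random.uniform(lo, hi) for _ in range(candidates)]
        # pick the candidate farthest from every previously executed test
        best = max(pool, key=lambda c: min(abs(c - e) for e in executed))
        if run_test(best):
            return best                        # first failure detected
        executed.append(best)
    return None
```

Because each new test is pushed away from its predecessors, a contiguous failure region (a "strip" or "block" pattern) tends to be hit with fewer tests than with plain random testing.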
In subdomain testing, the effectiveness of testing is measured in terms of the faults detected and the maximum value of E-measure gives such a case. In contrast, to maximize P-measure, it is only necessary to have one subdomain full of failure causing inputs, even though all other subdomains may have low failure rates. Therefore the E-measure has more merit for this purpose. Besides, it can distinguish the capability of detecting more than one failure while the P-measure treats them the same if they result in at least one failure.  Furthermore, it has been difficult to extend the formal analysis to more general and more common case of overlapping subdomains. By using E-measure, T.Y.Chen and Y.T.Yu say that new properties of P-measure can be found rather than with P-measure alone. For example when comparing subdomain testing with random testing, a greater value or equal value of  E-measure assures a greater value of P-measure. Thus the relative value of E-measure also gives a good indication of P-measure.
There is no single measure that serves as the best measure.

Sunday, April 3, 2016

Today we discuss software testing effectiveness measures: the P-measure, E-measure and F-measure. The E-measure is the expected number of failures detected by a set of test cases. The P-measure is the probability of detecting at least one failure from the set of test cases. Both the P-measure and the E-measure have been used widely. The F-measure is the number of tests required to detect the first failure. Also, faults are different from failures. A failure is any manifestation of deviation from the correctness of a program execution. One or more faults may be the reasons for a failure. Until a failure occurs, a fault remains unknown. A fault is typically found by debugging; a failure is seen by observation. The F-measure therefore concerns the first failure. A smaller F-measure means fewer test cases and a reduced cost of overall testing. Even for the same number of test cases, the order of the test cases matters.
The detection of more failures does not always help us identify more faults. That said it is still better to reveal as many failures as possible. Intuitively the more failures the testing reveals the more information for debugging faults and the higher the chance of detecting more faults. In this sense, a measure can be to detect as many failures as possible. This is what we attempt with maximum value of E-measure.(courtesy T.Y. Chen and Y.T. Yu)
Failures occur for a subset of the possible input domain of the program. Generally, inputs that produce failures will cluster in regions of  the input domain. Contiguous regions form patterns in the input domain. For two dimensions, there are (a) block pattern, (b) strip pattern and (c) point pattern.
The assumptions made in finding patterns include the following:
(a) the failure region's size, shape and orientation are known, and
(b) the region's location in the input domain is not known. (courtesy: cs.umd.edu CMSC737)
#codingexercise
double GetPMeasure(double p)
{
    // probability of detecting at least one failure, taking the
    // number of test cases to be 1/p
    return 1 - Math.Pow(1 - p, 1 / p);
}
double GetEMeasure(int n, double p)
{
    // expected number of failures detected by n test cases
    return n * p;
}
int GetFMeasure(int sizeOfInputDomain, int sizeOfFailureRegion)
{
    // expected number of tests to the first failure under random testing
    return sizeOfInputDomain / sizeOfFailureRegion;
}

Saturday, April 2, 2016


Background Tasks



Problem statement:


We have a self-service file share provisioning portal which creates and maintains file shares on remote storage clusters that are named by their location. Each cluster has a driver that we call the connector, and it exposes a set of SSH commands for the said operations. An API layer translates the user portal request into a command for a target connector. The API layer is reached through a gateway. Frequently we have seen timeouts from different sources: sometimes the gateway times out, sometimes the connector takes too long, sometimes the SSH connection to a remote cluster takes too long. So far we have been working with an upper bound on the timeout. In this post, we try to find an alternative solution using an asynchronous mechanism.


Current Interaction diagram looks like this:



 

In this interaction, API is on hold until the resource is created.

Instead if a promise could be made that can be fulfilled later, then the API can be relieved right away.

This introduces two challenges:

1)      Earlier, the object was created synchronously: its properties could not be changed and no operations could be made on the object until it had been created and marked active. Moreover, we send the customer an email when the share has been created.

2)      Earlier, the object creation was scoped within a transaction, so we had a clean state to begin or end with.

We now overcome these by introducing an interim state before the object is marked active. We call this state pending. Since each resource is identified by its location and file share name, the API can screen duplicate requests for the same object by checking for an existing record in the database and making sure it is not pending. If the object is pending, the API can return an error such as bad request or method not allowed. If the object has already been created, a bad request error is returned with a different error message. No other operations are permitted unless the object is active. Therefore the create call is idempotent.

In order to receive a notification, the API layer can spawn workers to separate the send and receive operations on different tasks. This lets the send operation relieve the caller immediately. Moreover the receive operation can be made a background task. On completion of this task, the user can be emailed that the share is ready. If the share creation fails, the share entry is removed from the current database so that it is in a clean state and the create operation can be retried.

This background task runs as a message queue processor. During the send operation the database record is marked pending and a message is placed on the message queue. The queue processor reads the message, issues a command to the connector and waits on the response. When the connector responds, the processor marks the record active or deletes the entry as appropriate. Only the processor can delete the entry. It notifies the API, the portal or the user via HTTP or email. Both the processor and the API check the entries in the database. There is an understanding between the API and the processor that redundant create messages may appear, but there can only be one file share and one corresponding entry in the database.
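The pending/active handshake between the API and the processor can be sketched as follows. This is a single-process illustration only: the record store, the queue and `create_share_on_cluster` are hypothetical stand-ins for the provisioning database, the message queue and the connector call.

```python
import queue
import threading

records = {}          # (location, share_name) -> "pending" or "active"
work = queue.Queue()  # stand-in for the message queue

def api_create_share(location, share_name):
    """Idempotent create: reject duplicates, enqueue the real work."""
    key = (location, share_name)
    if key in records:
        return "error: share is %s" % records[key]  # pending or active
    records[key] = "pending"
    work.put(key)
    return "accepted"

def processor(create_share_on_cluster):
    """Background worker: fulfil each request and flip the record state."""
    while True:
        key = work.get()
        if key is None:                 # sentinel to stop the worker
            break
        if create_share_on_cluster(*key):
            records[key] = "active"     # success: share is usable now
        else:
            del records[key]            # failure: clean state, retry allowed
        work.task_done()
```

The API call returns as soon as the message is enqueued, so the caller is relieved immediately, while a second create for the same (location, name) pair is rejected until the first one settles.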

The background processor or spool :




 

 


The interaction diagram now looks like this:



 

 

Choice of a background task processor versus concurrent APIs:

In the above description, a background task processor has been proposed. The same task separation could have been achieved with the concurrency features of the language. However, the background task processor decouples the caller from the recipient of the results. Moreover, it can handle different types of tasks. A message queue also allows retries on failure and can scale out as a cluster.

Conclusion:


APIs can become more responsive with a background task processor and a new state for the resource.


Friday, April 1, 2016

Today we continue to review the paper Queues are Databases by Jim Gray.
He talks about message-oriented middleware (MOM) and makes the observation that queues are better inside the database for transactions. When queues are outside the database, they pose performance and concurrency control problems. The mechanisms for concurrency control within a database include Read_Past, Read_Through and Notify.
Read_Past allows skipping over dirty records to find the first committed record and is used in dequeue(). Read_Through allows a program to examine records that have not yet been committed. Notify allows a program to wait for a state change in a lock.
Queue managers are simple TP-monitors.
Queues may have worker pools associated with them to service the entries in the queue. TP-monitors configure, manage and load-balance these pools. Queued processing has many variants, including periodic, event-driven and batch. Queues may also have a scheduling policy.
When the queues are implemented as part of the database, triggers may be invoked to give a simple queued transaction processing system.
Queue managers can be built in an object-relational database system.
#coding exercise
Implement promises on message queues.
var p = new Promise(function (resolve, reject) {
    // send a message to the request queue here; a listener on the
    // reply queue calls resolve() or reject() when the answer arrives
});
The constructor sends a message to the queue.
The callback receives the reply from the queue and settles the promise.
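The same idea can be sketched in Python, using a Future as the promise and a worker thread as the queue processor. The queue, the worker and the uppercasing "work" are all illustrative assumptions, not part of the paper:

```python
import queue
import threading
from concurrent.futures import Future

requests = queue.Queue()  # stand-in for the request message queue

def promise_send(payload):
    """Enqueue a request and return a promise for its eventual result."""
    fut = Future()
    requests.put((payload, fut))
    return fut

def worker():
    """Queue processor: settle each promise as its message is handled."""
    while True:
        payload, fut = requests.get()
        if payload is None:                  # sentinel to stop the worker
            break
        try:
            fut.set_result(payload.upper())  # stand-in for the real work
        except Exception as e:
            fut.set_exception(e)

threading.Thread(target=worker, daemon=True).start()
```

A caller does p = promise_send("create share") and later blocks on p.result(), so the send and the receive are fully decoupled, just as in the API/processor split above.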