Tuesday, April 29, 2014

The design criteria for the queue alert module mentioned in the previous post include:
Support for retries by the queue processor: The design of the queue alert module must account for retries of the same message by the queue processor. All exception paths, including dead-letter queues, should receive the same treatment.
Non-invasive: When possible, we should consider a non-invasive approach that doesn’t require instrumentation of the queues. In other words, it can work for any version of the queues and doesn’t affect the queue processing. It could work by sniffing the data changes or the logs.
Polling: Any polling approach must be robust and must avoid causing high CPU usage during its processing.
Support for transactional as well as non-transactional messages: The alerts module must work for both kinds of messages so that the user can specify the criteria without being limited to only a subset of messages. Concurrent processing of both kinds of messages must be supported.
Support for distributed transactions: When transactions involve messages across queues, the alert module should be able to evaluate those messages as part of the transaction, or at least log the transaction and the actions taken within it, so that the final state can be determined by the alerts module.
Support for clusters: The queues may not all be local to a machine; they could be distributed across different nodes in a cluster, or they may all be in a failover cluster. The alert module should be able to target messages, queues, and even machines.
Scoping of alerts: Alerts need not be registered at the message level. They could be registered at the queue level or at the machine level. Whichever level of the hierarchy is chosen, the outer scope covers all the alerts at the inner scopes. This means that CRUD on the alerts at queue scope automatically performs the same operations at message scope (see the sketch after this list).
Changes to the rules or registration of alerts: Alerts registered with the alerts module will not take effect until the system reconfigures. This enables the changes to the alerts to be picked up for processing by the module and gives time for setup and cleanup operations by the module.
Deployment: The alerts module should come in a standalone service or executable so that it can be an add-on to existing queue processing. The module itself could be deployable by copying or via an installer.
Automatic administration of rules, actions, messages and queues could be performed where possible.
The use of message format: When interacting with the queues to read the messages, the alerts module will evaluate the fields of the messages against the criteria specified in the rules by the user. The message format should not be opaque; as in the case of MSMQ, it should expose known fields for evaluation against the rules.
Control of concurrency: The alerts module could make the invocation of the actions registered with the rules concurrent, so that the evaluation of an action for one message does not block others.
Full-text or key-value search over message body: The expressions to search over the text of the messages could be resource intensive and optionally enabled. Rules to perform such search could be outside the alerts mechanism and done with the help of an indexer. As such this may not be in scope for the alerts module.
Text messages versus binary messages: The alerts module should support both formats. The module should rely on the fields versus the contents. Subsequent processing of say JSON vs. XML text could be offloaded to other systems.
Asynchronous communication mechanism: This could be enabled between producers and consumers so that they don’t block each other.

Performance: Volumes of a hundred thousand transactions per submission, reaching millions of transactions per day and involving several messages across different queues, should be targeted. Working on a small set of messages, queues, rules, or alerts at a time could enable this.
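To make the scoping and registration criteria above concrete, here is a minimal sketch, added for illustration only; the names AlertScope, AlertRule, and AlertRegistry and their members are hypothetical and merely suggest the shape such a module could take, not an actual implementation.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scope hierarchy: machine > queue > message.
public enum AlertScope { Machine, Queue, Message }

public class AlertRule
{
    public string Name { get; set; }
    public AlertScope Scope { get; set; }
    public string Target { get; set; }  // machine name, queue path, or message id
    public Func<IDictionary<string, object>, bool> Condition { get; set; }
    public Action<IDictionary<string, object>> OnMatch { get; set; }
}

public class AlertRegistry
{
    private readonly List<AlertRule> rules = new List<AlertRule>();

    // CRUD at an outer scope implicitly covers the inner scopes:
    // a rule registered for a queue applies to every message of that queue.
    public void Register(AlertRule rule) { rules.Add(rule); }
    public void Unregister(string name) { rules.RemoveAll(r => r.Name == name); }

    public IEnumerable<AlertRule> RulesFor(string machine, string queue)
    {
        return rules.Where(r =>
            (r.Scope == AlertScope.Machine && r.Target == machine) ||
            (r.Scope == AlertScope.Queue && r.Target == queue) ||
            r.Scope == AlertScope.Message);  // message-scoped rules filter further in their Condition
    }
}

In this sketch a rule registered at queue scope applies to every message of that queue, so adding or removing it at the outer scope has the same effect as doing so for each message at the inner scope.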

Monday, April 28, 2014

Today we will look at a scalable queue alert design. I describe a method to evaluate multiple queues for processing. Queues can have multiple messages, and messages can have different attributes. The user would like to author rules for actions on queues based on the attributes of both queues and messages. When a message arrives in a queue, it is evaluated against all the rules authored by the users so that the corresponding action can be taken. Rules comprise conditions and actions. Conditions are expressions based on attributes and logical operators. An action can be any one of a set of predetermined actions, such as running a script or logging. The rules are specified in a user-defined function, which helps the user manage them. The rules are all evaluated against each message of each queue. This means that the attributes have to be deterministic, free of side effects, and easy to look up.
When we scale out the queues, we evaluate these rules against each of the queues. When we process the messages, we may do so one after the other across queues. This means the user-defined rules work the same way across queues and messages.
Rule evaluation for any message in any queue results in one or more of the actions. The default action is a no-op, which is not specified explicitly. No-op here means that no additional actions are triggered beyond the default message processing by the queue. The alerting mechanism is independent of the queue processing and is checked right after the message is processed. This could be done right before message processing, but only once the message has been processed do we know that the current message has been handled.
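A minimal sketch of this evaluation step follows, added for illustration only; the Rule and AlertCheck names are hypothetical. The check is invoked right after the processor has handled a message, and rules whose conditions do not match contribute nothing, which is the no-op default.

using System;
using System.Collections.Generic;

// Hypothetical rule shape: a condition over message attributes plus an action.
public class Rule
{
    public Func<IDictionary<string, object>, bool> Condition { get; set; }
    public Action<IDictionary<string, object>> Action { get; set; }
}

public static class AlertCheck
{
    // Called right after the queue processor has handled a message.
    // Rules that do not match trigger nothing, i.e. the default is a no-op.
    public static void Evaluate(IDictionary<string, object> messageAttributes, IEnumerable<Rule> rules)
    {
        foreach (var rule in rules)
        {
            if (rule.Condition(messageAttributes))
            {
                rule.Action(messageAttributes);  // e.g. log the message or launch a script
            }
        }
    }
}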
The queue alert mechanism can live outside the queue processing service. This implies that the queue alert mechanism can be used for journaling in a non-invasive manner. The action corresponding to the queue processing could be to log the messages.
Another use of the queue alert mechanism is to enable different actions to be specified for these messages. For example, the action could be to launch a script for selected messages instead of all messages. Scripts could trigger additional workflows.
The trigger mechanism needs to stay up to date with the queues. If queues are added or deleted, then the rules may need to be redefined. Evaluation of stale rules should default to a no-op. This ensures that the messages are still processed.

Sunday, April 27, 2014

Today we look at some more uses of random variables. We mentioned earlier that random variables can be combined. We know that random variables can be independent, and for the different values a random variable can take, the average value gives a good summary. Let us take an interesting application of this technique in the hiring problem. Say you want to hire an office assistant; we can use indicator random variables here. Let us first describe the problem. When you hire an office assistant, you may have to interview several candidates. You want to hire the most suitable candidate, but actually hiring somebody has additional costs: you have to fire the current assistant, and you must pay a large hiring fee to the employment agency that is sending the candidates. You are interested in estimating this cost. This is the hiring problem.
This is written as follows:
int HireAssistant(int[] candidates)
{
    int best = -1;                      // score of a least-qualified dummy candidate
    int hired = -1;                     // no one hired yet
    foreach (var candidate in candidates)
    {
        int score = interview(candidate);
        if (score > best)               // candidate is better than the best seen so far
        {
            best = score;
            hired = candidate;
            hire(candidate);            // fire the current assistant and hire this one
        }
    }
    return hired;
}
We now use probability to analyze this problem. In order to do that, we must make assumptions about the distribution of inputs. Then we analyze our algorithm and compute an expected running time.
Since we take the distribution over the inputs, we are averaging the running time over all possible inputs.
We use probabilistic analysis when we can make assumptions about the distribution of inputs, i.e., when we can assume something about the set of all possible inputs, both for designing an efficient algorithm and as a means of gaining insight into the hiring problem. For the hiring problem, we can assume that the candidates arrive in random order. This means we can compare any two candidates and decide which one is better; in fact, we can use this to establish a distinct ranking of the candidates.
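To illustrate what averaging over random input orders means here, the following is a small simulation sketch, added for illustration and not part of the original text. It assumes the interview score is simply the candidate's rank and counts how many times a better candidate appears, i.e. how many times a new hire is made and the hiring fee is paid, averaged over many random orders.

using System;
using System.Linq;

public static class HiringSimulation
{
    // Estimates the expected number of hires when n ranked candidates
    // arrive in a uniformly random order.
    public static double EstimateExpectedHires(int n, int trials)
    {
        var random = new Random();
        long totalHires = 0;

        for (int t = 0; t < trials; t++)
        {
            // A random ordering of candidate ranks 1..n (higher is better).
            int[] order = Enumerable.Range(1, n).OrderBy(_ => random.Next()).ToArray();

            int best = 0, hires = 0;
            foreach (var rank in order)
            {
                if (rank > best)  // a better candidate appears: hire and pay the fee
                {
                    best = rank;
                    hires++;
                }
            }
            totalHires += hires;
        }
        return (double)totalHires / trials;
    }

    public static void Main()
    {
        // The average number of hires grows only slowly with n.
        Console.WriteLine(EstimateExpectedHires(100, 10000));
    }
}

The point of the sketch is only that the expectation is taken over the distribution of inputs, not over any single input.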
An indicator random variable associated with an event A is defined as 1 if the event occurs and 0 otherwise.
Let us determine the expected number of successes for the interviews. Our sample space is {Success, Failure}, and we define a random variable that takes one of the two values, Success or Failure, with equal probability. We can then define an indicator random variable Y associated with the event that the outcome is Success. The expected number of successes obtained in one interview is simply the expected value of our indicator variable: E[Y] = 1 * Pr[Success] + 0 * Pr[Failure] = 1/2.

Saturday, April 26, 2014

In today's post we discuss discrete random variables from the textbook we have been referring to. A random variable X is a function from a finite or countably infinite sample space S to the real numbers. It associates a real number with each possible outcome of an experiment, which allows us to work with the probability distribution induced on the resulting set of numbers. Random variables can also be defined for uncountably infinite sample spaces, but we will only look at random variables that are discrete.
For a random variable X and a real number x, we define the event X = x to be {s in S : X(s) = x}; thus Pr[X = x] = Sum over {s in S : X(s) = x} of Pr[s].
The function f(x) = Pr[X = x] is the probability density function of the random variable X.
Per the definitions of probabilities, we know that Pr[X = x] >= 0
and that the sum of Pr[X = x] over all x is equal to 1.
If we take the example of a pair of dice with six possible outcomes each, and we define a random variable X to be the maximum of the two values showing on the dice, then we have
Pr[X = 3] = 5/36
because there are 36 equally likely outcomes when we take the values in pairs,
and X takes the value 3
for exactly 5 of those outcomes: (1,3), (2,3), (3,3), (3,2), (3,1).
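As a quick sanity check of this count, here is a small sketch, added for illustration and not part of the original text, that enumerates all 36 outcomes and counts those whose maximum is 3.

using System;

public static class DiceCheck
{
    public static void Main()
    {
        int total = 0, favorable = 0;
        for (int first = 1; first <= 6; first++)
        {
            for (int second = 1; second <= 6; second++)
            {
                total++;
                if (Math.Max(first, second) == 3)  // the event X = 3
                {
                    favorable++;
                }
            }
        }
        // Prints 5 / 36
        Console.WriteLine($"{favorable} / {total}");
    }
}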
It is common for several random variables to be defined on the same sample space.
If there are two random variables defined on the same sample space, say X and Y,
then their co-occurrence has a probability distribution
Pr[X = x and Y = y], which is the joint probability distribution.
If we fix one of the values, we can vary the other and sum over it.
For a fixed value y, Pr[Y = y] = Sum over all x of Pr[X = x and Y = y].
The same goes for a fixed value of x, where we vary and sum over y.
We can extend this to conditional probabilities as well. For example,
Pr[X = x | Y = y]  = Pr [ X = x and Y = y] / Pr [Y = y]
We can say that two random variables X and Y are independent if, for all x and y,
the events X = x and Y = y are independent, which we can express as
Pr[X = x and Y = y] = Pr[X = x] * Pr[Y = y].
The simplest summary of the distribution of a random variable is the average of the values it takes, weighted by their probabilities, that is, its expected value.
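To make this concrete with the earlier dice example, here is a small sketch added for illustration (not part of the original text). The average of the values of X, the maximum of the two dice, weighted by their probabilities is Sum over x of x * Pr[X = x] = (1*1 + 2*3 + 3*5 + 4*7 + 5*9 + 6*11) / 36 = 161/36, which is about 4.47; the sketch computes the same number by enumerating the 36 equally likely outcomes.

using System;

public static class DiceExpectation
{
    public static void Main()
    {
        double expected = 0.0;
        for (int first = 1; first <= 6; first++)
        {
            for (int second = 1; second <= 6; second++)
            {
                // Each of the 36 outcomes is equally likely; X is the maximum.
                expected += Math.Max(first, second) / 36.0;
            }
        }
        // Prints 4.4722..., i.e. 161/36
        Console.WriteLine(expected);
    }
}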

Friday, April 25, 2014

Today we look at some counting theory:
Counting theory tries to answer the question of how many, without actually enumerating all the items. This is very helpful not only when it is daunting to count a set of items but also when it is difficult to construct the set itself.
Consider, for example: how many different ways can we arrange n distinct elements?
We review some of the elements of counting theory.
A set of items that we wish to count can sometimes be expressed as a union of disjoint sets or as a Cartesian product of sets.
The rule of sum says that the number of ways to choose an element from one of two disjoints sets is the sum of the cardinalities of the sets.
The rule of product says that the number of ways to choose an ordered pair is the number of ways to choose the first element times the number of ways to choose the second element.
We look at them in detail now.
If A and B are two finite sets with no members in common, then the number of ways to choose an item from one of the sets is the sum of the counts of items in the two sets. For example, each position of a license plate may hold either a letter or a digit. Since there are 26 letters and 10 digits, there are 26 + 10 = 36 ways to pick the character in each position. We can extend this to sets that have duplicates, and the answer does not change, because it depends only on the cardinalities.
If we use the same sets A and B, the number of ways to choose an ordered pair is the number of ways to choose the first element times the number of ways to choose the second. For example, an ice cream parlor with 28 flavors and 4 toppings can mix and match them to give 28 * 4 = 112 different ice creams.

We sometimes need to bound the size of a binomial coefficient. For 1 <= k <= n, we have the lower bound
(n choose k) = n(n-1)...(n-k+1) / (k(k-1)...1) >= (n/k)^k,
since each factor (n-i)/(k-i) in the product is at least n/k.
Using the inequality k! >= (k/e)^k, derived from Stirling's approximation, we obtain the upper bounds
(n choose k) = n(n-1)...(n-k+1) / (k(k-1)...1) <= n^k / k! <= (en/k)^k.
For all 0 <= k <= n, we can use mathematical induction to prove the bound
(n choose k) <= n^n / (k^k (n-k)^(n-k)),
where for convenience we assume that 0^0 = 1.
For the trivial case k = 0, the bound holds because (n choose 0) = 1 and the right-hand side is n^n / (0^0 * n^n) = 1.
Assuming the bound holds for k, the inductive step establishes
(n choose k+1) <= n^n / ((k+1)^(k+1) (n-k-1)^(n-k-1)),
so the bound holds for k + 1 as well.
For k = lambda*n, where 0 <= lambda <= 1, this bound can be rewritten as
(n choose lambda*n) <= n^n / ((lambda*n)^(lambda*n) ((1-lambda)*n)^((1-lambda)*n)).
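As a quick numerical sanity check of the lower and upper bounds (n/k)^k <= (n choose k) <= (en/k)^k, here is a small sketch added for illustration (not part of the original text).

using System;

public static class BinomialBounds
{
    // Exact binomial coefficient, computed iteratively to avoid huge factorials.
    static double Choose(int n, int k)
    {
        double result = 1.0;
        for (int i = 1; i <= k; i++)
        {
            result *= (n - k + i) / (double)i;
        }
        return result;
    }

    public static void Main()
    {
        int n = 20;
        for (int k = 1; k <= n; k++)
        {
            double lower = Math.Pow((double)n / k, k);   // (n/k)^k
            double exact = Choose(n, k);
            double upper = Math.Pow(Math.E * n / k, k);  // (en/k)^k
            Console.WriteLine($"k={k}: {lower:G6} <= {exact:G6} <= {upper:G6}");
        }
    }
}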


Thursday, April 24, 2014

We look at approximation by integrals today. When a summation is expressed in terms of a function f(k) that is monotonically increasing, we can approximate it by integrals, because the summation corresponds to the area under the curve taken in unit-width slices. By comparing the sum of the rectangles with the area under the curve, we can get a pretty good idea of the bounds of the summation. Moreover, this applies to both monotonically increasing and monotonically decreasing curves.
In the case of a monotonically increasing function plotted from left to right, the area of the slice to the left of a chosen slice will be less than or equal to that of the chosen slice, and the slice to its right will be greater than or equal to it. That is, we have a generic partition into three consecutive slices, and this comparison holds wherever the three slices are taken.
The integral approximation gives a tight estimate for the nth harmonic number. For a lower bound, we obtain
Sum of k = 1 to n of (1/k) >= Integral from 1 to n+1 of (1/x) dx = ln(n + 1), because each slice is of unit width.
For the upper bound we derive the inequality
Sum of k = 2 to n of (1/k) <= Integral from 1 to n of (1/x) dx = ln(n), again based on unit-width slices.
Together, after adding back the k = 1 term, this yields the bound on the harmonic series: Sum of k = 1 to n of (1/k) <= ln(n) + 1.
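As a quick numerical check of the two bounds ln(n + 1) <= Sum of k = 1 to n of (1/k) <= ln(n) + 1, here is a small sketch added for illustration (not part of the original text).

using System;

public static class HarmonicBounds
{
    public static void Main()
    {
        double harmonic = 0.0;
        for (int k = 1; k <= 1000; k++)
        {
            harmonic += 1.0 / k;
            if (k % 100 == 0)
            {
                // ln(k + 1) <= H_k <= ln(k) + 1 should hold at every k.
                Console.WriteLine($"n={k}: {Math.Log(k + 1):F4} <= {harmonic:F4} <= {Math.Log(k) + 1:F4}");
            }
        }
    }
}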
We note that the total area of the rectangles is the value of the summation, while the integral is the area under the curve. By comparing the areas for the lower and upper bounds, we see that the rectangles extend slightly above the curve in the case of the lower bound and lie slightly below the curve in the case of the upper bound. Comparing the areas gives
Integral from m-1 to n of f(x) dx <= Sum from k = m to n of f(k), and by shifting the rectangles one unit to the right we establish Sum from k = m to n of f(k) <= Integral from m to n+1 of f(x) dx.