Wednesday, March 5, 2014

We look at the challenges of numerical algorithms, i.e., we show that there are limitations to algorithmic power.

Numerical algorithms are those that compute solutions to problems posed in a continuous mathematical world.
The first obstacle, therefore, is that we can only work with finite approximations of what are really infinite objects.
This results in truncation errors.
The second class of errors is round-off errors: real numbers cannot be represented exactly in finite precision, so we settle for the nearest representable value, and the true answer lies within a small range around it.
To deal with the first limitation, we use what is called the Taylor series expansion.

e^x = 1 + x + x^2/2! + x^3/3! + ... + x^n/n! + ...

We stop the expansion at a fixed n. Everything beyond that fixed n is the truncation error.

The second class is evident when we take the graph of a curve and want to compute the area under it. In this case we slice the area into geometric shapes, trapezoids to be precise, and this is called the trapezoidal rule, by which we approximately compute the area under the curve.

One way to get an upper bound on the truncation error is
to use the inequality:
e^x - (1 + x + x^2/2! + ... + x^n/n!) <= M/(n+1)! * |x|^(n+1), where M = max e^y for 0 < y < x
i.e., by choosing an upper bound M on e^y for any y between 0 and x, we establish an upper bound on the truncation error.

Let's take an example to show how we come up with M.
Say we want to compute e^0.5 with an error of at most 0.0001.
Since 0 < y < x = 0.5 and e^y is increasing, e^y has to be less than e^0.5. Since e is between 2 and 3, we can safely take M = 4^0.5 = 2.
Now the truncation error is bounded by M/(n+1)! * |x|^(n+1), which with our chosen values becomes 2/(n+1)! * 0.5^(n+1). Requiring this to be less than 0.0001 lets us find n; in this case n = 5.
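A minimal sketch of that calculation (assuming Python, with M fixed at 2 as above):

from math import factorial

def truncation_bound(x, n, M):
    # upper bound on the remainder of the degree-n Taylor polynomial of e^x
    return M / factorial(n + 1) * abs(x) ** (n + 1)

x, M, target = 0.5, 2, 0.0001
n = 0
while truncation_bound(x, n, M) > target:
    n += 1
print(n)  # smallest n whose truncation error bound is below 0.0001 -> 5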


We can similarly write a formula for the upper bound on the error of the trapezoidal rule. Here the question is how thin, and therefore how many, trapezoids to place between the lower and upper limits of integration.
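As a sketch (assuming Python, and using f(x) = e^x on [0, 1] purely as a made-up example), the composite trapezoidal rule with n slices looks like this; the standard error bound (b - a)^3/(12 n^2) * max|f''| then tells us how many slices we need for a given tolerance:

from math import e, exp

def trapezoidal(f, a, b, n):
    # approximate the area under f on [a, b] with n trapezoids of equal width
    h = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# error bound: (b - a)^3 / (12 n^2) * max|f''| on [a, b]; for e^x on [0, 1] that max is e
n = 100
approx = trapezoidal(exp, 0, 1, n)
bound = (1 - 0) ** 3 / (12 * n ** 2) * e
print(approx, e - 1, bound)  # exact area is e - 1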
In today's post we also look at the limitations of algorithm power, specifically lower bound arguments.
A lower bound t(n) for a problem P means that no algorithm can solve P in less time than t(n).
Knowing the lower bound is advantageous. For example, we know that some sorting algorithms take O(n^2) in the worst case and others take O(n log n). Searching a sorted array by brute force takes O(n), while binary search takes O(log n), an improvement. A minimum spanning tree of a connected graph is found by Prim's algorithm in O(m log n), and single-source shortest paths by Dijkstra's algorithm in O(m log n). How far can we go in each of these cases? This is answered by a lower bound.
Some problems are genuinely hard to solve; factoring a number into its prime factors is one example. Others require a lot of processing. The RSA cryptographic scheme, as used say by a bank to secure its clients' messages, generates two large primes p and q and takes their product n = p.q. It then creates an encryption/decryption pair (e, d) with d.e = 1 mod (p-1)(q-1), where (e, n) is the public key and d is the private key. When a client sends a message m, it encrypts it as m^e mod n. When the bank receives it, it decrypts by computing (m^e)^d mod n, which equals m. Since factoring n into p.q is hard, this is a good security scheme.
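A toy sketch of that arithmetic (assuming Python 3.8+ for the modular inverse; the primes here are deliberately tiny and purely illustrative, real RSA uses primes hundreds of digits long):

# toy textbook RSA with tiny made-up primes, only to show the arithmetic
p, q = 61, 53
n = p * q                      # 3233
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent: d*e = 1 mod phi (here 2753)
m = 65                         # message
c = pow(m, e, n)               # encrypt: m^e mod n
print(pow(c, d, n) == m)       # decrypt: (m^e)^d mod n recovers m -> True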
A trivial lower bound can be established by counting the items that must be produced or examined. This is the case for generating all permutations of n elements, since there are n! of them. Similarly, a search in an unsorted array has to cover all n elements, so Omega(n) becomes the lower bound.
Another way to establish a lower bound is information-theoretic: we count the number of yes/no questions that must be answered before the problem can be solved. This is the argument behind the lower bound that binary search achieves.
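For instance, searching a sorted array of n elements must distinguish n + 1 possible outcomes, so any comparison-based algorithm needs at least ceil(log2(n + 1)) questions, which is exactly what binary search achieves in the worst case. A quick check (a sketch, assuming Python):

from math import ceil, log2

def lower_bound_comparisons(n):
    # each yes/no comparison at best halves the set of possible answers
    return ceil(log2(n + 1))

def binary_search_comparisons(n):
    # worst-case comparisons binary search makes on a sorted array of n elements
    count, lo, hi = 0, 0, n - 1
    while lo <= hi:
        count += 1
        lo = (lo + hi) // 2 + 1   # always follow the larger (right) half: the longest path
    return count

print(lower_bound_comparisons(1000), binary_search_comparisons(1000))  # 10 10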
Another way to establish a lower bound is with an adversary argument. In this case the full input is not shown; an adversary answers the questions the algorithm asks, on behalf of the input, in whichever way keeps the problem open the longest.
As an example, take the problem of merging two sorted lists of n elements each. The adversary can answer the comparisons so that the two lists interleave perfectly, which forces the algorithm to ask 2n-1 questions: the first element of one list against the first of the other, then the winner's successor against the loser, and so on. If the algorithm skipped one of these comparisons, the adversary could swap the unexamined pair and a different output would also be consistent with the answers given, which cannot happen for a correct algorithm on the same input. Thus this is another way of establishing a lower bound.
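A small sketch (assuming Python) of a standard merge that counts its comparisons; on an adversarial input where the two lists interleave perfectly, it is forced to ask all 2n - 1 questions:

def merge(a, b):
    # standard merge of two sorted lists, counting element comparisons
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out, comparisons

# perfectly interleaved lists of length n force 2n - 1 comparisons
merged, comparisons = merge([1, 3, 5, 7], [2, 4, 6, 8])
print(merged, comparisons)  # 7 comparisons for n = 4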
The final way of establishing a lower bound is a class of techniques called reduction. If P reduces to Q in constant time, then any lower bound for P holds for Q as well: if Q were solvable any faster, P could be solved faster too by first reducing it to Q, contradicting the lower bound we started with. Hence reduction can also be used to establish a lower bound.
Courtesy: @mecr.org

Tuesday, March 4, 2014

In today's post we discuss the Rabin-Karp algorithm for string matching, based on mecr lectures. We match a pattern P of length m in a target string T of length n. The brute-force approach of checking the pattern character by character at every position of the target takes O(mn) time. This can be improved with two good ideas:
One is that on a mismatch we try to skip ahead, and we can skip up to m characters. This takes O(m) extra space and gives O(m + n) worst-case time, which is already an improvement.
The second is that we do approximate matches: if the approximate check says mismatch, we know it for sure, but if it says match, it may or may not be a true match. This way we can skip ahead past definitive mismatches.
We implement the approximation with a hash match. At each index i from 0 to n - m we hash the next m characters and see whether the hash matches that of the pattern.
If a portion of the string matches the pattern, then the hashes must match. We use the contrapositive here: if the hashes differ, we know the substring is not the pattern and we can skip ahead. We do a little more work in hashing, and consequently our worst case is worse than the first approach, but in most cases the hash function works very well. We just need a hash function that matches our needs, which are:
1) we should be able to pre-compute the hash of the pattern
2) we compute the hash of the substring at index i to i+m-1 for different i
3) the hash function should be such that
h(T[i, i+m-1]) = g(T[i]) "+" g(T[i+1]) "+" ... "+" g(T[i+m-1])
where the "+" operation allows us to compute
h(T[i+1, i+m]) = g(T[i+1]) "+" g(T[i+2]) "+" ... "+" g(T[i+m])
With these properties it takes constant time to compute
h(T[i+1, i+m]) from h(T[i, i+m-1])
In addition, if the hash function has very few collisions, we get a fast running time of O(n + m).
As an example, if a string is made up of the DNA letters A, C, G, T, we can take the hash of a string followed by a letter as
h(string.letter) = 4.h(string) + g(letter)
so h(ACT) = 4.h(AC) + g(T), and so on.
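Here is a sketch of the full matcher (assuming Python, the DNA alphabet above, and a hash taken mod a large prime to keep the numbers small; the prime and the sample strings are made up for illustration):

def rabin_karp(T, P):
    # map each DNA letter to a digit and hash a window as a base-4 number mod a prime
    g = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    q = 1_000_000_007              # large prime keeps hashes small and collisions rare
    n, m = len(T), len(P)
    if m > n:
        return []
    high = pow(4, m - 1, q)        # weight of the character that slides out of the window
    hp = ht = 0
    for i in range(m):
        hp = (4 * hp + g[P[i]]) % q
        ht = (4 * ht + g[T[i]]) % q
    matches = []
    for i in range(n - m + 1):
        # hashes differ -> definitely not a match; equal -> verify to rule out collisions
        if ht == hp and T[i:i + m] == P:
            matches.append(i)
        if i < n - m:
            # constant-time rolling update: drop T[i], bring in T[i+m]
            ht = (4 * (ht - g[T[i]] * high) + g[T[i + m]]) % q
    return matches

print(rabin_karp("ACGTACGTGACT", "ACGT"))  # [0, 4]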


Monday, March 3, 2014

Today I looked at a few course materials available on mecr.org. One of the topics was evaluating complexity in terms of big O, Omega, and Theta, and I want to briefly mention them here.
If we take two functions f and g where f is O(g(n)), i.e. f(n) <= c.g(n) for all n > n0, we can show that the constants c and n0 that witness this are not unique, e.g. many choices with c > 1 and n0 > 1 work, and many different forms of polynomials can satisfy the relationship between f and g.
In calculating complexity, typically only the highest-order term matters. This is especially useful for polynomials, as it simplifies the discussion of complexity.
While O is typically stated for a single algorithm, Omega is often stated for a whole class of algorithms. For example, we say that the lower bound for comparison-based sorting is Omega(n log n), and we see algorithms such as mergesort matching it with a complexity of O(n log n).
Theta, on the other hand, is used for a tight range. If f(n) lies between c1.g(n) and c2.g(n), i.e.
c1.g(n) <= f(n) <= c2.g(n) for all n > n0 and some constants c1, c2, n0, then we say f(n) is Theta(g(n)).
The two functions are then of the same order.
Big O, Omega, and Theta are all related. For example, f(n) is O(g(n)) exactly when g(n) is Omega(f(n)): from f(n) <= c.g(n) we get (1/c).f(n) <= g(n), so the coefficients c and 1/c witness the two statements. And f(n) is Theta(g(n)) exactly when f(n) is O(g(n)) and also f(n) is Omega(g(n)).
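As a quick sanity check (a made-up example, assuming Python), f(n) = 3n^2 + 2n is Theta(n^2) with, say, c1 = 3, c2 = 5 and n0 = 1:

f = lambda n: 3 * n * n + 2 * n
g = lambda n: n * n
# c1 = 3, c2 = 5, n0 = 1: c1*g(n) <= f(n) <= c2*g(n) for all n > n0
print(all(3 * g(n) <= f(n) <= 5 * g(n) for n in range(2, 1000)))  # True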
There is also an interesting observation:
if f1 is O(g1(n))
and f2 is O(g2(n))
then f1 + f2 is O(max(g1, g2))
This is true even if we have more than two such functions. It's easiest to visualize this as applying to a sequence of code blocks in a program, where the overall complexity is that of the most expensive block.
Also, we can show this with a choice of constants: if n3 is the max of n1, n2 and c3 is the max of c1, c2, then for n > n3 we have f1 + f2 <= c3.g1(n) + c3.g2(n) <= 2.c3.max(g1, g2), and that leads to
f1 + f2 is O(max(g1, g2))
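As a concrete, made-up illustration (assuming Python), a program with an O(n) block followed by an O(n^2) block is O(max(n, n^2)) = O(n^2) overall:

def process(items):
    n = len(items)
    total = 0
    for x in items:            # first block: O(n)
        total += x
    pairs = 0
    for i in range(n):         # second block: O(n^2)
        for j in range(n):
            pairs += 1
    # overall running time is O(n) + O(n^2) = O(max(n, n^2)) = O(n^2)
    return total, pairs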

Sunday, March 2, 2014

In today's post I want to cover writing fast low-level logs. There are many logging providers, and I've covered some of them earlier; examples include log4net and log4cpp. Kernel-mode tracing is often considered an example of low-level logging. Database systems also have their own logging.
There are a few salient features of most logging methods.
First, they support formatting of the text to be logged, typically through variable-length argument lists and format specifiers.
Second, they support different levels of logging, meaning we control how much or how little is written. The components that produce the log strings can keep logging at different levels, and the end user decides how much to see. This switch is independent of the components, and each log provider is expected to honor the levels.
Components are expected to separate their logging with labels or identifying strings, typically registered at initialization and released at teardown.
Another feature is that these loggers are based on file appenders, and log entries are written at the end of the file.
Aside from the debug levels, there may be other metadata for the log.
If we look at the implementation of a typical logger, it uses neither an event loop nor overlapped I/O, although in principle both could be used.
A logger has many clients that send messages, and the messages are serialized onto the end of the file. The file offset is incremented in a thread-safe manner, as opposed to locking the whole data structure or putting the entire log-entry code inside a critical section. Writing the log amounts to filling out buffers that are then flushed to disk. Typically this buffering is left to the file system rather than maintaining a separate one; some applications instead implement a round-robin circular list of buffers that are flushed to disk. No matter what the buffering is, there can be many producers, one consumer, and a serialized queue, which is why this resembles overlapped I/O.
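A minimal sketch of such a logger (assuming Python; the class and method names are hypothetical, not any particular library's API): many producer threads format and enqueue messages, and a single consumer thread appends them to the end of the file, so writes are serialized without putting every caller inside a critical section.

import queue, threading

class SimpleLogger:
    def __init__(self, path):
        self._queue = queue.Queue()                 # many producers, one consumer
        self._file = open(path, "a")                # file appender: always writes at the end
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, level, component, fmt, *args):
        # producers only format and enqueue; they never touch the file directly
        self._queue.put("%s [%s] %s\n" % (level, component, fmt % args))

    def _drain(self):
        # single consumer serializes all writes onto the end of the log file
        # (a real logger would also drain the queue on shutdown)
        while True:
            self._file.write(self._queue.get())
            self._file.flush()                      # buffering could also be left to the file system

logger = SimpleLogger("app.log")
logger.log("INFO", "startup", "listening on port %d", 8080)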

Saturday, March 1, 2014

In today's blog post we review CLI commands for Splunk. The command line is complementary to the user interface and works well for management operations and scriptability.
The command line option to start Splunk in debug mode is
splunk start splunkd --debug
If we want to clean all data, we can say
splunk clean all
or splunk clean [option], where eventdata, globaldata, userdata, or inputdata can be specified as the option.
In addition we can disable one or more components such as
splunk disable app, boot-start, deploy-client, deploy-server, dist-search, index, listen, local-index, perfmon, web-server, web-ssi.
and toggle them back on with
splunk enable app, boot-start, deploy-client, deploy-server, dist-search, index, listen, local-index, perfmon, web-server, web-ssi.
The splunk display command will show the status of each of these.
splunk list is different from splunk display: the list command lists all configurations, settings, or collections, while the display option shows only the state of a feature.
The CLI commands provide options for working with data.  These include:
splunk import userdata
and splunk export userdata, eventdata
If we made a change to how the data is filtered, such as through a conf file, we can granularly reload the affected components with the following:
splunk reload ad, auth, deploy-server, index, monitor, registry-script, tcp, udp, perfmon, wmi
For most of these components, when we enable or disable them we can check the logs to see that they have indeed been enabled or disabled.
The CLI commands also provide a way to find the logs. We do this with
splunk find log
When commands need to be applied to many machines in a cluster, the CLI provides a way to do that as well.
For example, we can type splunk apply cluster-bundle to apply (make active) a cluster bundle on all the peers in the cluster. To check the status on all the peers, the splunk show cluster-bundle-status command can be used at the master. For a silent apply, we can say ./splunk apply cluster-bundle --skip-validation --answer-yes. The CLI list command also provides other cluster-related options, such as splunk list cluster-config, cluster-generation, cluster-peers, and cluster-buckets. The splunk rtsearch and splunk search commands are also available for searching. Note that splunk rtsearch is used for searching events before they are indexed and for previewing reports as the events stream in. The arguments are similar between the traditional search and rtsearch commands:
app is used to specify an app context to run the search
batch specifies how to handle updates in preview mode
detach triggers an asynchronous search and displays the job id and ttl for the search
header indicates whether to display a header in the table output mode
max_time indicates the length of time in seconds that a search job runs before it is finalized
maxout indicates the maximum number of events to return or send to stdout
output indicates how to display the job, such as rawdata, table, csv, and auto
timeout denotes the length of time in seconds that a search job is allowed to live after running; it defaults to 0, which means the job is canceled immediately after it is run
wrap indicates whether to wrap lines that exceed the terminal width
The splunk commands provide output that can be used to view the success or failure of your command. Typically an error message is provided when it fails.

 
The event loop discussed in the previous post runs continuously and uses a timer-heap data structure, which we discuss today. We maintain the timeouts on a heap, the pending heap. When we want to run them, we move them to the doubly linked list mentioned earlier. The timeouts are kept in a heap because it is an efficient way to organize them in order of expiry. In the implementation I'm looking at, the left and right children point to events with timeouts greater than the parent's; the left differs from the right only by the timeout count: if the count is even, the new node is steered to the right.
Whenever new elements are inserted, they are bubbled up or down the tree as necessary; after every add or delete, the tree is fixed up.
Timeouts have a link pointer that indicates which list they belong to. Note that when a timeout is on the list, the storage for its left and right pointers is reused as the forward and backward pointers of the doubly linked list.
As a test of the timeout heap, we could make sure that events are added once and in the correct position on the pending heap, that they are removed from pending when they are placed on the expired list to be run, and that they are added back to the pending heap in the correct position for re-triggering. Their position as left or right children should remain unchanged if no nodes were added or deleted.
By running the test on a controlled heap we know that the events are processed in sequence. In the real world, events are guaranteed to be executed once their timeout expires, but their order could be arbitrary. The heaps are expected to stay small, and none of them should grow without bound.
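A simplified sketch of the pending heap (assuming Python's heapq instead of the hand-rolled left/right tree described above): timeouts are kept in a min-heap ordered by expiry, expired ones are moved off to be run, and then re-armed back onto the pending heap.

import heapq, itertools, time

class TimerHeap:
    def __init__(self):
        self._pending = []                       # min-heap ordered by expiry time
        self._counter = itertools.count()        # tie-breaker so callbacks are never compared

    def add(self, interval, callback):
        expiry = time.time() + interval
        heapq.heappush(self._pending, (expiry, next(self._counter), interval, callback))

    def run_expired(self):
        # move expired timeouts off the pending heap, run them, then re-arm them
        now, expired = time.time(), []
        while self._pending and self._pending[0][0] <= now:
            expired.append(heapq.heappop(self._pending))
        for _, _, interval, callback in expired:
            callback()
            self.add(interval, callback)         # re-trigger: back onto the pending heap

timers = TimerHeap()
timers.add(0.0, lambda: print("tick"))
timers.run_expired()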