Thursday, November 17, 2016

Yesterday we were discussing the Metropolis-Hastings algorithm. It helps with drawing samples from a high-dimensional distribution that is difficult to sample from directly. The algorithm is iterative: each iteration produces one more sample, and the next sample depends only on the current one, in a Markov chain like manner, so the number of samples is proportional to the number of iterations. In each iteration the algorithm proposes a candidate based on the current value, then measures how plausible that candidate is under the desired distribution. If the candidate is accepted, it becomes the new current value; if it is rejected, the current value is reused. The method therefore uses two steps to generate the next entry in the sequence: propose, then accept or reject.
The parameters to the algorithm are the number of samples to draw, denoted by n, and the standard deviation of the proposal distribution, denoted by sigma. When sigma is too small, the proposal distribution is too narrow: almost every candidate is accepted, but the chain moves in tiny steps and explores the target slowly. When sigma is too large, the proposal distribution is too wide: the acceptance rate is low, the chain stays put for long stretches, and the resulting sample is highly irregular.
In addition, there are two parameters that help improve the sampling. The first is the burn-in, by which we delay the actual collection until the algorithm has run for a while and the chain has settled into its stationary regime. The other is the lag, by which we let several iterations of the sampler elapse between successive collected samples, reducing the correlation between them.
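To make these parameters concrete, here is a minimal sketch of a random-walk sampler in C++. The standard normal target density, the seed, and the parameter values in main are illustrative assumptions, not from the original discussion:

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Minimal random-walk Metropolis sketch over an unnormalized target density.
std::vector<double> metropolis(int n, double sigma, int burnin, int lag)
{
    std::mt19937 gen(42);
    std::normal_distribution<double> step(0.0, sigma);       // proposal width = sigma
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    auto target = [](double x) { return std::exp(-0.5 * x * x); }; // assumed: standard normal, unnormalized

    std::vector<double> samples;
    double current = 0.0;
    for (int iter = 1; (int)samples.size() < n; ++iter) {
        double candidate = current + step(gen);              // propose near the current value
        double ratio = target(candidate) / target(current);
        if (unif(gen) < ratio)                               // accept with probability min(1, ratio)
            current = candidate;                             // otherwise the current value is reused
        if (iter > burnin && (iter - burnin) % lag == 0)     // drop burn-in, thin by lag
            samples.push_back(current);
    }
    return samples;
}

int main()
{
    auto s = metropolis(1000, 1.0, 500, 5);  // n, sigma, burn-in, lag
    double mean = 0.0;
    for (double x : s) mean += x;
    std::printf("sample mean ~= %f (target mean is 0)\n", mean / s.size());
    return 0;
}

Note how the two tuning parameters show up: raising burnin discards more of the chain's transient, and raising lag spaces the collected samples further apart.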
We were looking at a few interesting applications of this algorithm. One such application is mapping symbols of unknown text to labels of a known alphabet so as to decipher the unknown text, as in breaking a substitution cipher. Say x ranges over the symbols of the unknown text and y over the labels of the known alphabet, and a mapping f assigns a label to each symbol. We take a large, well-known reference text and count consecutive-letter occurrences in a matrix M(x, y) of transitions. Then we define a plausibility function for a mapping as the product of M over consecutive decoded characters, treating the text as a Markov chain. We compute this function for a starting guess, change to a candidate mapping by randomly swapping two of the values f assigns, and then perform the two steps of the algorithm: first compute the value of the function for the candidate, then accept or reject it. Accordingly we either move to the candidate or keep the current mapping and perform another iteration. This way we move from an unknown space to a sequence in a known space that we can read. A sketch of the scoring and stepping logic follows.
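The names logPlausibility and metropolisStep below are illustrative, the matrix M is assumed to hold bigram counts from a reference text, and the text is assumed to be lowercase a-z only:

#include <array>
#include <cmath>
#include <random>
#include <string>
#include <utility>

// Log-plausibility of a mapping f: sum over consecutive decoded letters of
// log M(x, y), where M holds bigram transition counts from a reference text.
double logPlausibility(const std::string& cipherText,
                       const std::array<char, 26>& f,
                       const std::array<std::array<double, 26>, 26>& M)
{
    double score = 0.0;
    for (std::size_t i = 0; i + 1 < cipherText.size(); ++i) {
        int a = f[cipherText[i] - 'a'] - 'a';
        int b = f[cipherText[i + 1] - 'a'] - 'a';
        score += std::log(M[a][b] + 1.0);   // +1 smoothing for unseen pairs
    }
    return score;
}

// One iteration: propose a candidate by swapping two of the values f assigns,
// then accept or reject it with the usual Metropolis ratio.
void metropolisStep(const std::string& cipherText,
                    std::array<char, 26>& f,
                    const std::array<std::array<double, 26>, 26>& M,
                    std::mt19937& gen)
{
    std::uniform_int_distribution<int> pick(0, 25);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    auto candidate = f;
    std::swap(candidate[pick(gen)], candidate[pick(gen)]);
    double ratio = std::exp(logPlausibility(cipherText, candidate, M) -
                            logPlausibility(cipherText, f, M));
    if (unif(gen) < ratio)   // accept and move to the candidate,
        f = candidate;       // otherwise keep the current mapping
}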
Another application of this algorithm is "Nested sampling for general Bayesian computation" as described by Skilling. Nested sampling estimates the evidence, the normalizing constant of the posterior, by relating the likelihood to the amount of prior mass enclosed within each likelihood contour. It can be used to compare two models for the same data set by comparing the evidence assigned to each after the relevant observations are taken into account.
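Nested sampling is more involved than a few lines, but the core evidence estimate Z ~ sum of L_i (X_{i-1} - X_i), with the prior mass shrinking as X_i ~ exp(-i/N), can be sketched for a toy problem. The uniform prior, Gaussian likelihood, and rejection-based replacement here are simplifying assumptions that only work at toy scale:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main()
{
    std::mt19937 gen(7);
    std::uniform_real_distribution<double> prior(0.0, 1.0); // assumed: uniform prior on [0,1]
    auto L = [](double t) {                                 // assumed: Gaussian likelihood
        return std::exp(-0.5 * std::pow((t - 0.5) / 0.1, 2.0));
    };

    const int N = 100;       // live points
    const int iters = 600;   // nested sampling iterations
    std::vector<double> live(N);
    for (double& t : live) t = prior(gen);

    double Z = 0.0, Xprev = 1.0;
    for (int i = 1; i <= iters; ++i) {
        // discard the live point with the lowest likelihood
        auto worst = std::min_element(live.begin(), live.end(),
            [&](double a, double b) { return L(a) < L(b); });
        double Lmin = L(*worst);
        double X = std::exp(-double(i) / N);   // expected shrinkage of prior mass
        Z += Lmin * (Xprev - X);               // accumulate the evidence integral
        Xprev = X;
        // draw a replacement from the prior subject to L > Lmin;
        // plain rejection like this is only viable for toy problems
        double t;
        do { t = prior(gen); } while (L(t) <= Lmin);
        *worst = t;
    }
    for (double t : live) Z += L(t) * Xprev / N;   // remaining live-point mass
    std::printf("Z ~= %f (analytic value ~ 0.2507)\n", Z);
    return 0;
}

Comparing the Z values that two candidate models assign to the same data gives the Bayes factor used for model comparison.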
#codingexercise
We were discussing several methods to find the maximum non-repeated substring in a given string, i.e. the longest substring with no repeating characters. We mentioned an approach that lets us merge two subranges by counting the desired length in both directions from the join. Additionally, we know the maximum length of the desired substring in each subrange, call them left and right. Consequently the maximum on merge is max(left, right, length across the join).
As we compute bottom up, we learn left and right progressively, since we start from single-element subranges and keep merging until we cover the original sequence. Since a join forms at every element boundary, the maximum propagates to the top.
int GetMergeBottomUpMax(const string& input, int start, int end)
{
    if (start > end) return 0;    // empty range
    if (start == end) return 1;   // a single character is trivially non-repeating
    int mid = (start + end) / 2;
    int left = GetMergeBottomUpMax(input, start, mid);      // best within left half
    int right = GetMergeBottomUpMax(input, mid + 1, end);   // best within right half
    // combine with the best duplicate-free run that crosses the join
    return merge_with_init(input, start, mid, end, left, right);
}
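The merge_with_init helper is referenced above without being defined. Here is one possible sketch, following the description of counting in both directions from the join; the greedy expansion is a heuristic reading of that description, and an exact merge would have to consider every window straddling the join:

#include <algorithm>
#include <string>
#include <unordered_set>
using namespace std;

// Grow a window outward from the join at mid while all characters remain
// distinct, then take the best of the three candidates.
int merge_with_init(const string& input, int start, int mid, int end,
                    int left, int right)
{
    unordered_set<char> seen;
    int i = mid, j = mid + 1;
    while (i >= start && seen.insert(input[i]).second) --i;   // extend left until a repeat
    while (j <= end && seen.insert(input[j]).second) ++j;     // extend right until a repeat
    int crossing = j - i - 1;   // duplicate-free window length across the join
    return max({left, right, crossing});
}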
