Wednesday, November 16, 2016

Yesterday we were discussing Metropolis Hastings algorithm. It helps with drawing independent samples from a high dimensional distribution. This algorithm is iterative. In each iteration, more and more samples are produced where the next sample depends only on the current in a Markov chain like manner.The number of samples is proportional to the number of iterations. In each iteration it picks a sample based on the current value. then it measures how close this sample is to the desired distribution. if the sample is accepted, then the new sample is used as the current value.if it is rejected, the current value is reused. The method uses two steps to generate the next in the sequence.
The parameters to the algorithm are the number of samples to draw denoted by n and the standard deviation denoted by sigma. When the standard deviation is too small, the proposal distribution is set too narrow. When the standard deviation is very large, the proposal distribution is set too wide, the acceptance rate is low and the distribution is highly irregular.
In addition, there are two parameters that help with improving the sampling. The first is called the burnin by which we delay the actual collection until the algorithm has run for a while. The other parameter is called the lag by which we allow several iterations of the sampler to elapse in between successive samples

One naiive and direct way to apply this algorithm in text analysis is when we project the words of a text into an alterbate word space. let us say we dont know what the words in a text mean but we know that they are mapped uniquely to some labels or topics that is in our case predetermined.then we apply this algorithm to establish the topics.

let us say we have a sequence in unknown text x that maps to a sequence of predefined labels y
we find the proportion of consecutive text symbols from x to y. This gives a matrix M(x, y) of transitions. One may then associate a plausibility to f as a function that depends on previous and current state, 
then  we do the following :
 Compute Pl(f).
• Change to fdash by making a random transposition of the values f assigns to two symbols.
• Compute Pl(fdash); if this is larger than Pl(f), accept fdash.
• If not, flip a Pl(fdash)/Pl(f) coin; if it comes up heads, accept fdash.
• If the coin toss comes up tails, stay at f.


#codingexercise
we were discussing several methods to find the maximum non-repeated substring in a given string.
When they are O(N^2) it may take a long time to get to the desired substring if it is skewed towards one end.
Instead, in one way, we can distribute the search in a scatter-gather mode where we pick different starting points throughout the length of the string for a faster search. This starting point is denoted with the parameter mid below:

        static int merge(string input, int start, int mid, int end, int left, int right)
        {
            if (mid == start || mid == end) return 0;
            var ret = new List<int>(){left,right};
            var h = new List<int>(256);
            h.AddRange(Enumerable.Repeat(0, 256));
            h[input[mid]]++;
            int lcount = 1;
            for (int i = mid - 1; i >= start; i--)
            {
                if (h[input[i]] != 0)
                {
                    break;
                }
                else
                {
                    h[input[i]]++;
                    lcount++;
                }
            }
            var t = new List<int>(256);
            t.AddRange(Enumerable.Repeat(0, 256));
            t[input[mid]]++;
            int rcount = 1;
            for (int j = mid + 1; j <= end; j++)
            {
                if (t[input[j]] != 0)
                {
                    break;
                }
                else
                {
                    t[input[j]]++;
                    rcount++;
                }
            }
            int ccount = lcount;
            for (int k = mid + 1; k <= end; k++)
            {
                if (h[input[k]] != 0)
                {
                    break;
                }
                else
                {
                    h[input[k]]++;
                    ccount++;
                }
            }
            ret.Add(lcount);
            ret.Add(rcount);
            ret.Add(ccount);
            return ret.Max();
        }

the above implementation is only to show that the we could start for all points between start and end as mid and because we could go about that in any order we get an early convergence towards max.

No comments:

Post a Comment