We continue discussing the paper "Nested sampling  for general Bayesian computation" by Skilling
We were looking at the transformation of computing the evidence based on the prior mass instead of the parameters. Today we look at the integration performed.
The integration is straightforward online because the integration is performed by computing the area in strips under the curve. This area of the strips is the weighted sum of Likelihood values for each of the strips. We know the curve is non-increasing between 0 and 1. Therefore each strip is bounded below by any value larger than prior mass X where the strips are arranged from right to left in m strips. Also the lower bound is given by Xm+1 = 0 and the area of the curve has to be greater than or equal to the sum of weighted Likelihood values where the weights are X intervals. There is also an upper bound with the the integration not exceeding the calculated area (weighted sum) plus the last prior mass times max likelihood value. This follows from the trapezoidal rule where the difference between the area covered by the strips and the area under the curve are triangular whose sum cannot exceed the last strip with Lmax likelihood
We already know the weighted sum is online. When more and more likelihood values are encountered we add the corresponding weighted sum in the proportion of their classes thereby extending the current to the next evidence. But the more interesting thing here is that the likelihood values are all sorted. Do we need these values to be sorted for the computation of the evidence ? The answer is no as we will shortly see. Before that let us discuss the sampling. The integral for the evidence is dominated by wherever the bulk of the posterior mass is found. It occupies a small fraction of the prior. To see the width of the posterior in terms of X just like we did with prior, let us say the likelihood function is a multivariate in rank C where the coefficients are normalized. The likelihood can then be written as proportional to the exponential in terms of radius squared. The radius is in C dimensions and with the likelihood function in terms of exponential of r squared, we get the area of the curve as proportional to r raised to the power C because there are C dimensions taken together and the integral translates to log. In other words, the posterior is distributed over a range in log X. In order to cover such a range, the sampling should be linear in log X rather than in X, and then we set X1 = t1, X2 = t1t2, ... Xi = t1t2..ti, ... Xm = t1t2t3...tm where each ti lies between 0 and 1. By setting values of successive t, we can then write the evidence in terms of weighted sum again but the weights being represented in terms of t and the likelihood values for each interval.
In practice finding the exact value of t is difficult but we can certainly do with a statistical value. We could do this by finding a new point Xi from the prior subject to the condition that it is smaller than the previous prior and the initial prior being equal to 1. Then our knowledge of the new point would be specified by Xi = ti. Xi-1. To find such a point, we could sample uniformly from the points in the restricted range (0,Xi-1) and then investigate from the original sorted likelihood values what its parameter would have been. But in that parameter space, we would be guided by the likelihood value subject to the constraint that the likelihood value for that chosen parameter is greater than the likelihood of the previous interval with the initial likelihood being equal to zero and in proportion to the prior density. Either way we find a random point and with distribution just the same. Therefore we need not choose the latter method which requires sorting based on the likelihood value and instead perform the sampling based on a random value of t. Thus we avoid sorting. And if we set t to be 0.99, then we can reach the bulk of the posterior in 100 steps of width that is proportional to the fraction of prior mass that contains it instead of taking thousand such steps as is generally the case.
Successive data points in sampling the prior within the box defined by the likelihood greater than the current is done by some Markov chain monte carlo approximation starting at some candidate that obeys the constraints if available or at the worst found from the previous iteration that lies on the previous likelihood contour in the parameter space. This paper essentially says don't navigate the parameter space. It is sufficient to explore a likelihood weighted space.
The nested sampling procedure is therefore :
Choose the number of classes as N and the number of iterations as j
Record the lowest of the current likelihood values
In each iteration
set the initial prior to the exponential of a value that depends on iteration sequence
set the weight to be the difference in previous and the current prior
increment the evidence based on the strip area
then replace the point of the lowest likelihood by new one drawn from within the likelihood
Increment the evidence by filling in the missing band with weight w using the chosen point
Notice that this greatly simplified procedure invovles only exponentials and weighted sum.
#codingexercise
Yesterday we discussed
int F(List<int>V, int i, int j)
{
if (i > j || i >= V.Count || j >= V.Count) return 0;
if (i==j) return V[i];
if (i ==j + 1) return max(V[i], V[j]);
return Max(V[i] + min (F(V, i+2,j), F(V, i+1, j-1)), V[j] + min (F(V, i+1, j-1), F(V, i, j-2)));
}
Today we write down a memoized way
static int FMemoized(List<int> V, int n)
{
var table = new int[n,n];
for (int k = 0; k < n; k++)
{
for (int i = 0, j=k; j <n; i++,j++) // the simultaneous increment is due to the fact that we use each axes for each component of the max function and update the diagonal as the matrix grows in size.
// also we initialize one of the axis differently because we can skip what we have already computed.
This causes the lower left of the matrix to remain stationary while the upper right to update
{
int x = ((i+2) <= j) ? table[i+2,j] : 0;
int y = ((i+1) <= (j - 1)) ? table[i+1, j-1] : 0;
int z = (i <= (j-2)) ? table[i,j-2] : 0;
table[i,j] = Math.Max(V[i] + Math.Min(x,y), V[j] + Math.Min(y,z));
}
}
return table[0,n-1];
}
We were looking at the transformation of computing the evidence based on the prior mass instead of the parameters. Today we look at the integration performed.
The integration is straightforward online because the integration is performed by computing the area in strips under the curve. This area of the strips is the weighted sum of Likelihood values for each of the strips. We know the curve is non-increasing between 0 and 1. Therefore each strip is bounded below by any value larger than prior mass X where the strips are arranged from right to left in m strips. Also the lower bound is given by Xm+1 = 0 and the area of the curve has to be greater than or equal to the sum of weighted Likelihood values where the weights are X intervals. There is also an upper bound with the the integration not exceeding the calculated area (weighted sum) plus the last prior mass times max likelihood value. This follows from the trapezoidal rule where the difference between the area covered by the strips and the area under the curve are triangular whose sum cannot exceed the last strip with Lmax likelihood
We already know the weighted sum is online. When more and more likelihood values are encountered we add the corresponding weighted sum in the proportion of their classes thereby extending the current to the next evidence. But the more interesting thing here is that the likelihood values are all sorted. Do we need these values to be sorted for the computation of the evidence ? The answer is no as we will shortly see. Before that let us discuss the sampling. The integral for the evidence is dominated by wherever the bulk of the posterior mass is found. It occupies a small fraction of the prior. To see the width of the posterior in terms of X just like we did with prior, let us say the likelihood function is a multivariate in rank C where the coefficients are normalized. The likelihood can then be written as proportional to the exponential in terms of radius squared. The radius is in C dimensions and with the likelihood function in terms of exponential of r squared, we get the area of the curve as proportional to r raised to the power C because there are C dimensions taken together and the integral translates to log. In other words, the posterior is distributed over a range in log X. In order to cover such a range, the sampling should be linear in log X rather than in X, and then we set X1 = t1, X2 = t1t2, ... Xi = t1t2..ti, ... Xm = t1t2t3...tm where each ti lies between 0 and 1. By setting values of successive t, we can then write the evidence in terms of weighted sum again but the weights being represented in terms of t and the likelihood values for each interval.
In practice finding the exact value of t is difficult but we can certainly do with a statistical value. We could do this by finding a new point Xi from the prior subject to the condition that it is smaller than the previous prior and the initial prior being equal to 1. Then our knowledge of the new point would be specified by Xi = ti. Xi-1. To find such a point, we could sample uniformly from the points in the restricted range (0,Xi-1) and then investigate from the original sorted likelihood values what its parameter would have been. But in that parameter space, we would be guided by the likelihood value subject to the constraint that the likelihood value for that chosen parameter is greater than the likelihood of the previous interval with the initial likelihood being equal to zero and in proportion to the prior density. Either way we find a random point and with distribution just the same. Therefore we need not choose the latter method which requires sorting based on the likelihood value and instead perform the sampling based on a random value of t. Thus we avoid sorting. And if we set t to be 0.99, then we can reach the bulk of the posterior in 100 steps of width that is proportional to the fraction of prior mass that contains it instead of taking thousand such steps as is generally the case.
Successive data points in sampling the prior within the box defined by the likelihood greater than the current is done by some Markov chain monte carlo approximation starting at some candidate that obeys the constraints if available or at the worst found from the previous iteration that lies on the previous likelihood contour in the parameter space. This paper essentially says don't navigate the parameter space. It is sufficient to explore a likelihood weighted space.
The nested sampling procedure is therefore :
Choose the number of classes as N and the number of iterations as j
Record the lowest of the current likelihood values
In each iteration
set the initial prior to the exponential of a value that depends on iteration sequence
set the weight to be the difference in previous and the current prior
increment the evidence based on the strip area
then replace the point of the lowest likelihood by new one drawn from within the likelihood
Increment the evidence by filling in the missing band with weight w using the chosen point
Notice that this greatly simplified procedure invovles only exponentials and weighted sum.
#codingexercise
Yesterday we discussed
int F(List<int>V, int i, int j)
{
if (i > j || i >= V.Count || j >= V.Count) return 0;
if (i==j) return V[i];
if (i ==j + 1) return max(V[i], V[j]);
return Max(V[i] + min (F(V, i+2,j), F(V, i+1, j-1)), V[j] + min (F(V, i+1, j-1), F(V, i, j-2)));
}
Today we write down a memoized way
static int FMemoized(List<int> V, int n)
{
var table = new int[n,n];
for (int k = 0; k < n; k++)
{
for (int i = 0, j=k; j <n; i++,j++) // the simultaneous increment is due to the fact that we use each axes for each component of the max function and update the diagonal as the matrix grows in size.
// also we initialize one of the axis differently because we can skip what we have already computed.
This causes the lower left of the matrix to remain stationary while the upper right to update
{
int x = ((i+2) <= j) ? table[i+2,j] : 0;
int y = ((i+1) <= (j - 1)) ? table[i+1, j-1] : 0;
int z = (i <= (j-2)) ? table[i,j-2] : 0;
table[i,j] = Math.Max(V[i] + Math.Min(x,y), V[j] + Math.Min(y,z));
}
}
return table[0,n-1];
}
No comments:
Post a Comment