Wednesday, September 10, 2014

We will be resuming the discussion of the Microsoft time series algorithm. We discussed the decision tree algorithm in the previous post, but here we will cover how it is applied. The decision tree algorithm can predict discrete data. For continuous data, it builds a tree where each node contains a regression formula. At the point where a regression formula shows non-linearity, the algorithm adds a split. For example, if the scatter plot of the data can be fitted with trends represented by two lines joining at a vertex, the corresponding decision tree would have the equations of the two lines as the leaf nodes from a split of all the data.
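As a rough sketch (not the algorithm's actual split search), the Python below fits one least-squares line on each side of each candidate split and keeps the split with the smallest total squared error; the data are synthetic, with the true vertex near x = 5.

import numpy as np

# Synthetic data: two linear trends that meet at x = 5.
x = np.linspace(0, 10, 50)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + np.random.normal(0, 0.3, x.size)

def line_sse(xs, ys):
    # squared error of the least-squares line through (xs, ys)
    coeffs = np.polyfit(xs, ys, 1)
    return np.sum((np.polyval(coeffs, xs) - ys) ** 2)

best_split, best_err = None, np.inf
for s in x[5:-5]:                        # keep a few points on each side
    left, right = x < s, x >= s
    err = line_sse(x[left], y[left]) + line_sse(x[right], y[right])
    if err < best_err:
        best_split, best_err = s, err

print(best_split)                        # recovers a split near the vertex at x = 5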

The time series algorithm, on the other hand, uses the random variables of a Markov chain to predict the immediate next value given the history. It is called an autoregressive (AR) model because the value at a given time has a mean that depends autoregressively on the last p values in the series.
An autoregressive tree (ART) model has an AR model at every leaf, and this model represents the time periods in a series.
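Concretely, an AR model of length p says the next value has mean m + a1*x(t-1) + ... + ap*x(t-p) with normally distributed noise. A minimal one-step sketch in Python, where the mean, coefficients, and history are all made-up values:

import numpy as np

# One-step AR(p) prediction: the mean of the next value is a linear
# function of the last p observations (illustrative parameters).
m = 0.5
a = np.array([0.6, 0.3])            # p = 2 coefficients, most recent first
history = np.array([2.0, 4.0])      # last p observations, most recent first
next_mean = m + a @ history         # mean of the predictive normal distribution
print(next_mean)                    # point forecast; the variance gives the spread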
Today we will continue our study of the decision tree and time series algorithms.
When we fit a regressor line to the scatter plot of the data, we can use it to describe the trend for upcoming events. However, this is often too general. Instead, if we split the data into regions and fit a regressor line in each region, we get a better approximation. This is what the splitting does. To learn the decision tree for the time series, we apply a split-leaf operator. This operator takes a predictor variable and a split value. The resulting left and right child nodes correspond to the two resulting regions.
To apply the split for a leaf, we consider, for each predictor variable, the seven values that form the boundaries of eight equiprobable contiguous regions of a normal distribution estimated from the restricted data set at that leaf. We choose the split that results in the largest increase in model score, and we stop splitting the tree when no split yields a higher score.
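Those seven values are just the 1/8, 2/8, ..., 7/8 quantiles of the normal distribution estimated at the leaf. A minimal sketch, assuming the restricted data set at the leaf is a NumPy array:

import numpy as np
from scipy.stats import norm

leaf_values = np.array([1.0, 3.0, 2.0, 4.0, 2.5, 3.5])   # made-up data at the leaf
mu, sigma = leaf_values.mean(), leaf_values.std()

# Boundaries of eight equiprobable regions: the 1/8 ... 7/8 quantiles.
candidates = [norm.ppf(k / 8.0, loc=mu, scale=sigma) for k in range(1, 8)]
print(candidates)                    # seven candidate split values for this predictor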
If we do not know the set of potential split variables, we iteratively apply the above method to get an ART for each candidate variable and choose the one with the highest Bayesian score.
Decision tree building is thus a recursive procedure; a compact sketch of the greedy build follows.
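The sketch below is self-contained and only illustrative: it scores a candidate split with a crude penalized squared error in place of the Bayesian score described above, and the variable names and thresholds are assumptions, not the product's implementation.

import numpy as np
from scipy.stats import norm

def fit_score(X, y):
    # Stand-in score: negative squared error of a least-squares fit,
    # penalized by the number of parameters (a crude proxy for the Bayesian score).
    design = np.c_[np.ones(len(y)), X]
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((design @ beta - y) ** 2)
    return -sse - (X.shape[1] + 1)

def build(X, y, min_cases=8):
    leaf_score = fit_score(X, y)
    best = None
    for j in range(X.shape[1]):                       # each predictor variable
        mu, sigma = X[:, j].mean(), X[:, j].std()
        for k in range(1, 8):                         # seven candidate boundaries
            v = norm.ppf(k / 8.0, loc=mu, scale=sigma)
            left, right = X[:, j] < v, X[:, j] >= v
            if left.sum() < min_cases or right.sum() < min_cases:
                continue
            s = fit_score(X[left], y[left]) + fit_score(X[right], y[right])
            if s > leaf_score and (best is None or s > best[0]):
                best = (s, j, v)
    if best is None:
        return ("leaf", X, y)                         # no split improves the score
    _, j, v = best
    left, right = X[:, j] < v, X[:, j] >= v
    return ("split", j, v, build(X[left], y[left]), build(X[right], y[right]))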


We will now see how to use this to predict the next-step data. Note that the leaves of the decision tree form a piecewise trend predictor. To forecast, we can do one-step forecasting or multi-step forecasting. The former is more accurate than the latter, and we are only interested in the former. To predict the distribution of the random variable for the next step, we only need one of the leaf nodes. Each leaf of the tree has a conditional distribution for this variable, because the path from the root to the leaf we are interested in is a conjunction of all the split conditions along it. Remember that our generic autoregressive model of length p is a linear regression modeled with a normal distribution whose model parameters are the mean, the variance, and the coefficients for each of the p past observations. The autoregressive tree model is merely a piecewise linear AR model. Its structure is chosen based on the highest posterior probability p(s|d) for the structure s given the data d, and this probability can be expanded as the combination of a prior for the structure and the marginal likelihood, where the likelihood of the data is averaged over the uncertainty in the model parameters. To predict the data, we use this normal distribution at the maximum a posteriori value.
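To make this concrete, here is a minimal sketch with a made-up two-leaf ART: the most recent observations select a leaf, and the forecast is the mean of that leaf's predictive normal distribution. All numbers and names here are assumptions for illustration.

import numpy as np

# Illustrative ART with two leaves, split on the most recent value x(t-1).
# Each leaf holds its own AR(2) parameters: mean, coefficients, variance.
leaves = {
    "low":  {"m": 0.2, "a": np.array([0.7, 0.2]), "var": 0.4},
    "high": {"m": 1.0, "a": np.array([0.5, 0.1]), "var": 0.9},
}

def forecast(history, split_value=3.0):
    # history holds the last p = 2 observations, most recent first
    leaf = leaves["low"] if history[0] < split_value else leaves["high"]
    mean = leaf["m"] + leaf["a"] @ history      # mean of the predictive normal
    return mean, leaf["var"]

print(forecast(np.array([2.0, 1.5])))           # falls in the "low" leaf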

The likelihood could be measured with a t-distribution, but we use the normal distribution. For the leaf we are interested in, we find the transition probability to the next one-step value, and we can find this because the series is a stationary Markov chain.
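That transition probability is just the density of the selected leaf's predictive normal at a candidate next value. A small illustration with assumed numbers:

from scipy.stats import norm

# Predictive normal from the selected leaf (illustrative mean and variance).
mean, var = 2.3, 0.4
candidate_next = 2.0
print(norm.pdf(candidate_next, loc=mean, scale=var ** 0.5))   # one-step transition density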


By that I mean we use the equation at the leaf, which gives us the one-step transition probability from the distribution we have. To find this equation, we estimate the coefficients, mean, and variance for the last p observations using a normal distribution estimated from the data set. We transform the data set by windowing: the length-k transformation of a time series y = (y1, ..., yT) consists of the cases x(i) = (x1, x2, ..., xk) for i from 1 to T-k+1, where element xj of a case is the original event y(i+j-1). For example, if we have events y = (1, 3, 2, 4), then the length-2 transformation is x(1) = (1, 3), x(2) = (3, 2) and x(3) = (2, 4), and the length-3 transformation is (1, 3, 2) and (3, 2, 4). With the cases we get from this approach, we can now apply the decision tree technique mentioned earlier. For an AR model of length p we use the length-(p+1) transformation: we create a single leaf with a normal distribution fitted to these cases, which fits a line to the first p elements with the (p+1)-st element as the target. Then we split it at one of the seven values corresponding to the eight equiprobable contiguous regions of the normal distribution for a predictor variable. We continue to split until splitting no longer improves the model score. The model score is a Bayesian score whose first term is the probability of the structure and whose second term is the marginal likelihood. The second term balances the fit to the data against the complexity of the model, and it can be approximated as fit penalized by complexity to ease the calculation of the score for the model.
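The windowing itself is a one-liner; this sketch reproduces the example above:

def windowing(y, k):
    # length-k transformation: all contiguous windows of k consecutive events
    return [tuple(y[i:i + k]) for i in range(len(y) - k + 1)]

y = (1, 3, 2, 4)
print(windowing(y, 2))   # [(1, 3), (3, 2), (2, 4)]
print(windowing(y, 3))   # [(1, 3, 2), (3, 2, 4)]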

I noticed today that the PageRank algorithm is also based on a stationary Markov chain.



