Saturday, September 13, 2014

In the previous post, we discussed the time series algorithm. We saw how the ART model is constructed and how we apply it to predict the next one-step transition. Note that the ART tree is created on the values of the random variable. To fit the linear regression for a restricted data set, we determine the values of the random variable from the length-p transformations of the time series data set.
For a given time series data set, nine corresponding data sets of length-p transformations are created, with p varying from zero to eight. Each of these transformed data sets is centered and standardized before modeling; that is, for each variable we subtract the mean value and divide by the standard deviation. Then we divide the data set into a training set used as input to the learning method and a holdout set used to evaluate the model. The holdout set contains the cases corresponding to the last five observations in the sequence.
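As a concrete illustration, here is a minimal Java sketch of the centering and standardizing step; the method name and array layout are my own choices for illustration:

public static double[] standardize(double[] column)
{
    // Compute the mean of the variable.
    double mean = 0;
    for (double v : column) mean += v;
    mean /= column.length;

    // Compute the standard deviation.
    double variance = 0;
    for (double v : column) variance += (v - mean) * (v - mean);
    double sd = Math.sqrt(variance / column.length);

    // Subtract the mean and divide by the standard deviation.
    double[] standardized = new double[column.length];
    for (int i = 0; i < column.length; i++)
        standardized[i] = (column[i] - mean) / sd;
    return standardized;
}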

Central to the step of fitting the linear regression is the notion of covariance stationarity.
By that we mean:
the mean E(Yt) does not depend on t
the standard deviation of Yt does not depend on t
the covariance Cov(Yt, Yt-j) exists, is finite, and does not depend on t
This last quantity is called the jth order autocovariance.
The jth order autocorrelation is defined as the autocovariance divided by the variance (the square of the standard deviation).

The autocovariance measures the direction of the linear dependence between Yt and Yt-j,
while the autocorrelation measures both the direction and the strength of the linear dependence between Yt and Yt-j.
An autoregressive process is defined as one in which the time dependence in the process decays to zero as the random variables in the process get farther and farther apart. A first-order autoregressive process has the following properties:
E(Yt) = mean
Var(Yt) = sigma squared
Cov(Yt, Yt-1) = sigma squared * phi
Cor(Yt, Yt-1) = phi
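Here is a minimal Java sketch for computing the sample versions of these quantities; the method names are mine, and a 1/T normalization is assumed:

public static double autocovariance(double[] y, int j)
{
    double mean = 0;
    for (double v : y) mean += v;
    mean /= y.length;

    // gamma_j = (1/T) * sum over t of (y[t] - mean) * (y[t-j] - mean)
    double sum = 0;
    for (int t = j; t < y.length; t++)
        sum += (y[t] - mean) * (y[t - j] - mean);
    return sum / y.length;
}

public static double autocorrelation(double[] y, int j)
{
    // rho_j = gamma_j / gamma_0: the autocovariance divided by the variance
    return autocovariance(y, j) / autocovariance(y, 0);
}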


Courtesy: Time Series Concepts UW.
There are other curve-fitting techniques, notably non-linear curve fitting, that could also help in predicting the one-step transition. Such techniques could involve evaluating the goodness of fit via the R value and/or the Chi-square value. Another numerical approach to predicting the time series could be to use a Fast Fourier Transform.

By the way, somebody surprised me with a question to print the Fibonacci series. I hadn't heard it in a long while, and yes, there is a recursive solution, but this iterative version works too:

int Fib(int index)
{
    if (index < 1) throw new Exception();
    int lastToLast = 1;
    int last = 1;
    if (index == 1) return lastToLast;
    if (index == 2) return last;
    int sum = 0;
    // start at 3: indexes 1 and 2 are already covered above
    for (int i = 3; i <= index; i++)
    {
        sum = last + lastToLast;
        lastToLast = last;
        last = sum;
    }
    return sum;
}

int Fib(int index)
{
    Debug.Assert(index > 0); // requires System.Diagnostics
    if (index == 1) return 1;
    if (index == 2) return 1;
    return Fib(index - 1) + Fib(index - 2);
}
We will next discuss OpenStack and try out a sample with Rackspace.

Wednesday, September 10, 2014

We will be resuming the discussion on the Microsoft time series algorithm. We discussed the decision tree algorithm in the previous post, but here we will cover how it is applied. The decision tree algorithm can predict discrete data. For continuous data, the decision tree algorithm builds a tree where each node contains a regression formula. At the point where a regression formula shows non-linearity, the decision tree algorithm adds a split. For example, if the scatter plot of the data could be fitted with trends represented by two lines joining at a vertex, the corresponding decision tree would have the equations for the two lines as the leaf nodes from a split of all the data.

The time series algorithm, on the other hand, uses the random variables of a Markov chain to predict the immediate next value given the history. It is called an autoregressive model because the value at a time has a mean that depends linearly on the last p values in the series.
An autoregressive tree (ART) model has an AR model at every leaf. This model represents the time periods in a series.
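Concretely, an order-p autoregressive model can be written as
Y(t) = m + b1 * Y(t-1) + b2 * Y(t-2) + ... + bp * Y(t-p) + e(t)
where e(t) is normally distributed noise with mean zero, and m, the coefficients b1 through bp, and the noise variance are the model parameters.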
Today we will continue our study of the decision tree and time series algorithms.
When we fit a regression line to the scatter plot of the data, we can use it to describe the trend for the upcoming events. However, this is often too general. Instead, if we split the data into regions and fit separate regression lines, we get a better approximation. This is what the splitting does. To learn the decision tree for the time series, we apply a split-leaf operator. This operator takes a predictor variable and a value. The resulting left and right nodes are for the two resulting regions.
To apply the split for a leaf, we use the seven values of each predictor variable that come from the boundaries of eight equiprobable contiguous regions of a normal distribution estimated from the restricted data set at the leaf for that predictor variable (a sketch of computing these candidate values follows below). We choose the split that results in the largest increase in model score, and we stop splitting the tree when no split yields a higher score.
If we do not know the set of potential split variables, we iteratively apply the above method to get an ART for each and choose the one with the highest Bayesian score.
Decision tree building is thus a recursive approach.
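Here is a minimal Java sketch of how the seven candidate split values could be computed; the quantile constants are those of the standard normal at probabilities 1/8 through 7/8, and the method name is mine:

public static double[] candidateSplits(double[] restrictedData)
{
    // Estimate the mean and standard deviation of the predictor at this leaf.
    double mean = 0;
    for (double v : restrictedData) mean += v;
    mean /= restrictedData.length;
    double variance = 0;
    for (double v : restrictedData) variance += (v - mean) * (v - mean);
    double sd = Math.sqrt(variance / restrictedData.length);

    // Standard normal quantiles at 1/8, 2/8, ..., 7/8: the boundaries of
    // eight equiprobable contiguous regions under the fitted normal.
    double[] z = { -1.1503, -0.6745, -0.3186, 0.0, 0.3186, 0.6745, 1.1503 };

    double[] splits = new double[z.length];
    for (int i = 0; i < z.length; i++)
        splits[i] = mean + sd * z[i];
    return splits;
}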


We will now see how to use this to predict the next-step data. Note that the leaves of the decision tree form a piecewise trend predictor. To forecast, we can do one-step forecasting or multi-step forecasting. The former is more accurate than the latter, and we are only interested in the former. To predict the distribution of the random variable for the next step, we only need one of the leaf nodes. Each leaf of the tree has a conditional distribution for this variable because the path from the root to the leaf we are interested in is a conjunction of all the split conditions. Remember that our generic autoregressive model of length p is a linear regression modeled using a normal distribution with model parameters as the mean, variance, and the coefficients for each of the p observations. The autoregressive tree model is merely a piecewise linear AR model. Its structure is chosen based on the highest posterior probability p(s|d) for the structure s given the data d. This probability can be expanded as the product of the prior probability of the structure and the marginal likelihood, where the likelihood of the data is averaged over the uncertainty in the model parameters. To predict the data, we use this normal distribution at the maximum a posteriori value.
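As a sketch, the one-step MAP prediction at a leaf is just the mean of the leaf's normal distribution, a linear combination of the last p observations; the intercept and coefficients here are assumed to come from the regression fitted at that leaf:

public static double predictNext(double[] lastP, double intercept, double[] coefficients)
{
    // Mean of the leaf's normal model:
    // y(t+1) = m + b1 * y(t) + b2 * y(t-1) + ... + bp * y(t-p+1)
    double prediction = intercept;
    for (int j = 0; j < coefficients.length; j++)
        prediction += coefficients[j] * lastP[lastP.length - 1 - j];
    return prediction;
}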

The likelihood could be measured with a t-distribution, but we use the normal distribution, and for the leaf we are interested in, we find the transition probability to the next one-step, which we can do given that it is a stationary Markov chain.


By that I mean we use the equation for the leaf that gives us the one-step transition probability from the distribution we have. To find the equation, we find the coefficients, mean, and variance from the last p observations using a normal distribution estimated from the data set. We transform the data set into the length-p transformation of the time series data set, where each transformation is a window x = (x1, x2, ..., xp+1) of p predictors plus one target, calculated for each i in 1 to T-p; the jth element xj of the window starting at i is the original event (i+j-1). For example, if we have events y = (1, 3, 2, 4), then the length-1 transformation is x1 = (1,3), x2 = (3,2), and x3 = (2,4), and the length-2 transformation is (1,3,2) and (3,2,4). This is called windowing, and with the transformations that we get with this approach, we can now apply the decision tree technique mentioned earlier (see the sketch below).
We create a single leaf with the normal distribution from this transformation data set, which fits a line to the transformations with the (p+1)th element as the target. Then we decide to split it at one of the seven values corresponding to the boundaries of the eight equiprobable contiguous regions of the normal distribution for the predictor variable. We continue to split until a split no longer improves the model score. The model score is a Bayesian score whose first term is the probability of the structure and whose second term is the marginal likelihood. The second term balances the fit of the model against its complexity, and it can be approximated as a fit penalized by complexity to ease the calculation of the score for the model.
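A minimal Java sketch of the windowing step (the method name is mine):

public static double[][] lengthPTransform(double[] y, int p)
{
    // Each case holds p predictor values followed by one target: p + 1 values.
    int count = y.length - p;               // i ranges over 1..T-p
    double[][] cases = new double[count][p + 1];
    for (int i = 0; i < count; i++)
        for (int j = 0; j <= p; j++)
            cases[i][j] = y[i + j];         // element j of case i is event (i + j - 1), zero-based here
    return cases;
}

For example, lengthPTransform(new double[] {1, 3, 2, 4}, 1) yields the cases (1,3), (3,2), and (2,4).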

I noticed today that the PageRank algorithm is also based on a stationary Markov chain.




Tuesday, September 9, 2014

A quick review of the Java programming language keywords:
final : a final class cannot be extended;
          a final method cannot be overridden;
          a final variable can be assigned only once: the reference cannot be changed, but the referenced object's contents can still be updated.
default :
      lets you add new methods with default implementations to existing interfaces
native :
      marks a method implemented in platform-dependent code (generally in another language)
strictfp :
      makes floating point operations produce predictable results across platforms
super :
      if a method overrides a superclass method, the overridden method can be invoked via super
synchronized :
      a method marked synchronized acquires a monitor for mutual exclusion before execution
transient :
      variables marked transient are not part of the persistent (serialized) state of the object
volatile :
      guarantees that updates to a shared variable are visible across threads
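A small, contrived Java sketch tying several of these keywords together:

// final class: cannot be extended.
public final class Counter implements java.io.Serializable
{
    // volatile: updates to this flag are visible across threads.
    private volatile boolean running = true;

    // transient: excluded from the serialized state of the object.
    private transient long lastAccessTime;

    // final variable: the reference cannot be reassigned,
    // but the array's contents can still be updated.
    private final int[] counts = new int[10];

    // synchronized: acquires this object's monitor before executing.
    public synchronized void increment(int slot)
    {
        counts[slot]++;
        lastAccessTime = System.nanoTime();
    }

    public boolean isRunning() { return running; }

    public void stop() { running = false; }
}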


public Entry<E> removeAll(Entry<E> head, int val) throws RuntimeException
{
    if (head == null) throw new RuntimeException();

    Entry<E> prev = null;
    Entry<E> current = head;

    while (current != null)
    {
        Entry<E> next = current.next;
        if (current.val == val)
        {
            if (prev != null)
                prev.next = next;
            else
                head = next; // removing the head node itself
        }
        else
        {
            prev = current;
        }
        current = next;
    }
    // return the (possibly new) head so removals at the front take effect
    return head;
}
In this post, we briefly review the paper Autoregressive Tree Models for Time-Series Analysis by Meek, Chickering, and Heckerman. This paper identifies models for continuous time series data which the authors call ART models. These models are built this way: first, the time series data is transformed into a set of cases suitable for a regression analysis; the predictor and target variables are the past and the current values. This transformed data set is used to learn a decision tree for the target variable. The decision tree has linear regressions at its leaves, thus producing a piecewise-linear autoregressive model. A Bayesian technique is used to learn the structure and parameters of the decision tree.
The time series data can be viewed as a p-order Markov chain, where p is the number of previous observations taken into consideration. From our previous posts on Markov chains, we know that the current event is then conditionally independent of all observations before the previous p. Moreover, the dependence of the current event on the preceding regressor variables does not change with time. The models are therefore linear predictors, and being simple and windowing-based, they are easy to interpret. The structure and parameters of the decision tree are learned using Bayes' rule, which places probability distributions on the structure and the parameters; we choose the structure s that has the highest probability given the data d and make predictions based on that structure. The models described in this paper yield accurate forecasts.
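In symbols, the p-order Markov assumption above reads:
P(y(t) | y(1), ..., y(t-1)) = P(y(t) | y(t-p), ..., y(t-1))
that is, conditioning on the entire history is the same as conditioning on just the previous p observations.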

Monday, September 8, 2014

In this post we quickly enumerate a few jQuery plugins:
Alertify : raises alerts and messages
iCheck : form controls with beautiful flat-style skins
LongPress : a jQuery plugin that eases the writing of accented or rare characters
File Upload : a widget that lets you upload multiple files
Complexify : a widget that checks password strength
Pickadate : a widget for choosing a date
Chosen : converts a select input list into a user-friendly dropdown
FancyInput : a jQuery plugin that makes entering or deleting text easy
Typeahead : an autocomplete library by Twitter
Parsley : an unobtrusive form validation library
Windows : lets you take up the whole screen for your page
Sticky : lets your elements stay on the page while scrolling
Scrollorama : cool scroll animations
Bacon : lets you wrap text around a curved line
Also, a quick recap of jQuery utilities:
jQuery.contains() checks to see if a DOM element is contained in another DOM element
.data() associates arbitrary data with the specified element and/or returns the value that was set
.each() a generic iterator
.extend() merges the contents of two or more objects; used often
fn.extend() merges the contents of an object into the jQuery prototype to provide new instance methods
globalEval() executes some JavaScript globally
grep() finds the elements of an array that satisfy a filter condition
isEmptyObject() checks to see if an object is empty
isFunction() determines if the argument passed is a JavaScript function object
isNumeric() determines whether the argument is a number
isPlainObject() checks if an object is a plain object
isXMLDoc() checks if a DOM node is within an XML document
        public static int IndexOf(int[] sorted, int start, int end, int val)
        {
            if (end < start) return -1;
            if (sorted[start] == val) return start;
            if (sorted[end] == val) return end;
            int mid = start + (end - start) / 2; // avoids overflow
            if (sorted[mid] == val) return mid;
            // mid has been ruled out, so exclude it from the next range
            if (sorted[mid] < val) start = mid + 1;
            else end = mid - 1;
            return IndexOf(sorted, start, end, val);
        }
We implement a decision tree as a recursive function (in pseudocode):
tree CreateDecisionTree(data, attributes, target_attr, fitness_func)
   best_attribute = feature_select(data, attributes, target_attr, fitness_func)
   tree = {best_attribute}
   for val in get_values(data, best_attribute)
         subtree = CreateDecisionTree(
                                   get_examples(data, best_attribute, val),
                                   attributes other than best_attribute,
                                   target_attr,
                                   fitness_func)
         tree[best_attribute][val] = subtree
   return tree

Sunday, September 7, 2014

Let us quickly visit the data-mining algorithms we mentioned in the previous post:
The Microsoft Decision Tree algorithm can be used to predict both discrete and continuous attributes. For discrete attributes, the algorithm makes predictions based on the input column or columns in a dataset. For example, if eight out of ten customers who bought chips were kids, the algorithm decides that age is a factor for the split. Here the predictable column would be chip purchases, viewed against the buyers' age, and the mining model would make a decision split into high and low age. Keeping this predictable column along with patterns and statistics such as counts of the data helps with subsequent queries. For continuous attributes, the algorithm uses linear regression to determine where the decision tree splits.
The Microsoft Naive Bayes algorithm uses Bayesian techniques while assuming the factors involved are independent. The algorithm calculates the probability of every state of each input column given each possible state of the predictable column. For example, the age of the chips buyers is broken down into age groups. Then, for each possible outcome of high or low age groups, it calculates the probability distribution of those age groups. The patterns of the data and the probability, score, and support are used for subsequent queries.
The Microsoft Clustering algorithm identifies the relationships in a dataset and then generates clusters. The model specified by the user identifies the relationship, so no predictable column is required. Taking the example of chips buyers, we can see that the kids form a separate cluster from the others. Further splitting the clusters into age groups yields smaller clusters.
The Microsoft Sequence Clustering algorithm is similar to the clustering algorithm mentioned above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. The sequence is a series of events or transitions between states in a dataset, as in a Markov chain. Think of the sequences as IDs of any sortable data maintained in a separate table. The sequence for each data item is analyzed to form groups.
The Microsoft Neural Network algorithm combines each possible state of the input attribute with each possible state of the predictable attribute. The input attribute values and their probabilities are used to assign weights, which then affect the outcome or predictable value. Generally this needs a large amount of training data.
The Microsoft Association algorithm is an association-rules algorithm that provides recommendations such as a market basket analysis. For example, if customers bought chips, then they also bought dip. The support parameter here is the number of cases that contain both chips and dip. The association rules are the output of the algorithm.
The Microsoft Time Series algorithm uses the historical information in the data to make predictions about the future. This can also be used for cross-prediction: if we train the algorithm with two separate but related series, the resulting model can be used to predict one series based on the other. To predict the time series, the method uses a windowing transformation of the dataset into a series suitable for regression analysis, where the past and the present are used as the predictor and target variables respectively.
Let us look at the implementation for the time series algorithm next.
We will specifically look at the ARTXP and ARIMA algorithms mentioned here.
These are for short-term and long-term predictions respectively; ARTXP is the autoregressive tree approach described above. In mixed mode, both algorithms are used, and a parameter biases the blend entirely toward one algorithm at 0, entirely toward the other at 1, and mixes the two for intermediate values. A toy illustration of such a blend follows.
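This is a simplification of my own; the actual mixing rule weights the two algorithms differently across the prediction horizon:

public static double blendedForecast(double artxpForecast, double arimaForecast, double w)
{
    // w = 0 relies entirely on the short-term algorithm,
    // w = 1 relies entirely on the long-term algorithm,
    // intermediate values mix the two forecasts.
    return (1 - w) * artxpForecast + w * arimaForecast;
}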