Tuesday, September 9, 2014

A quick review of the Java programming language keywords:
final : a final class cannot be derived from
          a final method cannot be overridden
          a final variable can be assigned only once; for a final reference, the reference cannot be changed but the object it points to can still be mutated
default :
      lets an interface provide a default method implementation, so new methods can be added to existing interfaces without breaking implementers
native :
      marks a method implemented in platform-dependent code (generally in another language)
strictfp :
      makes floating-point operations predictable across platforms
super :
      if a method overrides a superclass method, the superclass version can be invoked via super
synchronized :
      a method marked synchronized acquires a monitor for mutual exclusion before execution
transient :
      variables marked transient are not part of the persistent state of the object and are skipped during serialization
volatile :
      guarantees that updates to a shared variable are visible to other threads (visibility, not atomicity)


public Entry<E> removeAll(Entry<E> head, int val)
{
    if (head == null) throw new RuntimeException("empty list");

    Entry<E> prev = null;
    Entry<E> current = head;

    while (current != null)
    {
        Entry<E> next = current.next;
        if (current.val == val)
        {
            if (prev != null)
                prev.next = next;   // unlink current from the middle of the list
            else
                head = next;        // removing the head itself: advance head
            current = next;
        }
        else
        {
            prev = current;
            current = next;
        }
    }
    return head;  // the caller must use the returned head, since the old head may have been removed
}
In this post, we briefly review the paper Autoregressive Tree Models for Time-Series Analysis by Meek, Chickering and Heckerman. This paper identifies models for continuous time-series data which the authors call autoregressive tree (ART) models. These models are built as follows. First, the time-series data is transformed into a set of cases suitable for regression analysis: the predictor variables are the past values and the target variable is the current value. This transformed data set is then used to learn a decision tree for the target variable. The decision tree has linear regressions at its leaves, thus producing a piecewise-linear autoregression model. A Bayesian technique is used to learn the structure and parameters of the decision tree.
The time-series data can be viewed as a p-order Markov chain, where p is the number of previous observations taken into consideration. From our previous posts on Markov chains, we know that the current value depends only on the preceding p observations and is independent of anything earlier. Moreover, the dependence of the current value on the preceding regressor variables does not change with time. The models are therefore linear predictors over a sliding window, which makes them simple and easy to interpret. The structure and parameters of the decision tree are learned with Bayesian model selection: probability distributions are placed on structures and parameters, the structure s with the highest posterior probability given the data d is chosen, and predictions are made with that structure. The models described in this paper yield accurate forecasts.
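Returning to the windowing transformation mentioned above, here is a minimal C# sketch of turning a series into regression cases of order p. The method name and layout are mine, not from the paper.

using System;
using System.Collections.Generic;

static List<double[]> ToRegressionCases(double[] series, int p)
{
    var cases = new List<double[]>();
    for (int t = p; t < series.Length; t++)
    {
        var row = new double[p + 1];
        Array.Copy(series, t - p, row, 0, p); // predictors: y(t-p) .. y(t-1)
        row[p] = series[t];                   // target: y(t)
        cases.Add(row);
    }
    return cases;
}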

Monday, September 8, 2014

In this post we quickly enumerate a few jQuery plugins:
Alertify : raises alerts and messages
iCheck : form controls with beautiful flat-style skins
LongPress : eases the typing of accented or rare characters
File Upload : a widget that supports uploading multiple files
Complexify : a widget that checks password strength
Pickadate : a widget for choosing dates
Chosen : converts a select input list into a user-friendly dropdown
FancyInput : makes entering or deleting text easy
Typeahead : an autocomplete library by Twitter
Parsley : an unobtrusive form-validation library
Windows : lets your page take up the whole screen
Sticky : keeps elements on the page while scrolling
Scrollorama : cool scroll animations
Bacon : wraps text around a curved line
Also, a quick recap of jQuery utilities:
$.contains() checks whether a DOM element is contained within another DOM element
.data() stores arbitrary data associated with the specified element and/or returns the value that was set
$.each() a generic iterator
$.extend() merges the contents of two or more objects; used often
$.fn.extend() merges the contents of an object into the jQuery prototype to provide new instance methods
$.globalEval() executes some JavaScript globally
$.grep() finds the elements of an array that satisfy a filter condition
$.isEmptyObject() checks whether an object is empty
$.isFunction() determines whether the argument is a JavaScript function object
$.isNumeric() determines whether the argument is a number
$.isPlainObject() checks whether an object is a plain object (created with {} or new Object)
$.isXMLDoc() checks whether a DOM node is within an XML document
        public static int IndexOf(int[] sorted, int start, int end, int val)
        {
            if (end < start) return -1;
            if (sorted[start] == val) return start;
            if (sorted[end] == val) return end;
            int mid = start + (end - start) / 2;    // avoids overflow of (start + end)
            if (sorted[mid] == val) return mid;
            if (sorted[mid] < val) start = mid + 1; // exclude mid so the range always shrinks
            else end = mid - 1;
            return IndexOf(sorted, start, end, val);
        }
We implement a decision tree as a recursive function (in pseudocode):
tree CreateDecisionTree(data, attributes, target_attr, fitness_func)
   best = feature_select(attributes, fitness_func)
   tree = {best}
   for val in get_values(data, best)
         subtree = CreateDecisionTree(
                                   get_examples(data, best, val),
                                   attributes - {best},
                                   target_attr,
                                   fitness_func)
         tree[best][val] = subtree
    return tree

Sunday, September 7, 2014

Let us quickly revisit the data-mining algorithms we mentioned in the previous post:
The Microsoft Decision Trees algorithm can be used to predict both discrete and continuous attributes. For discrete attributes, the algorithm makes predictions based on the input column or columns in a dataset. For example, if eight out of ten customers who bought chips were kids, the algorithm decides that age is a factor for the split. Here the predictable column is whether a customer buys chips, considered against age, and the mining model makes a decision split at high versus low age. The predictable column, together with patterns and statistics such as counts, is kept with the model to answer subsequent queries. For continuous attributes, the algorithm uses linear regression to determine where the decision tree splits.
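As a rough illustration only (the product's actual split criterion is not public), here is a C# sketch that picks a split point on a numeric attribute by maximizing the difference in buy rates on either side; all names are hypothetical.

using System;

static int BestSplit(int[] ages, bool[] boughtChips)
{
    int bestAge = ages[0];
    double bestScore = double.MinValue;
    foreach (int candidate in ages)
    {
        int lowBuy = 0, low = 0, highBuy = 0, high = 0;
        for (int i = 0; i < ages.Length; i++)
        {
            if (ages[i] <= candidate) { low++; if (boughtChips[i]) lowBuy++; }
            else { high++; if (boughtChips[i]) highBuy++; }
        }
        if (low == 0 || high == 0) continue;  // degenerate split, skip
        // favor splits whose two sides have very different buy rates
        double score = Math.Abs((double)lowBuy / low - (double)highBuy / high);
        if (score > bestScore) { bestScore = score; bestAge = candidate; }
    }
    return bestAge;
}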
The Microsoft Naive Bayes algorithm uses Bayesian techniques, assuming the input factors are mutually independent. The algorithm calculates the probability of every state of each input column given each possible state of the predictable column. For example, the ages of the chip buyers are broken down into age groups; then, for each possible outcome of the predictable column, it calculates the probability distribution over those age groups. The patterns in the data together with the probability, score and support are used for subsequent queries.
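A toy sketch of the underlying computation: the posterior for a class is proportional to its prior times the product of per-input likelihoods, under the independence assumption. The dictionaries stand in for tables estimated from training counts and are assumptions of this sketch.

using System.Collections.Generic;

static double Posterior(string cls, string[] inputs,
                        Dictionary<string, double> prior,
                        Dictionary<(string cls, string input), double> likelihood)
{
    double p = prior[cls];                // P(class)
    foreach (var input in inputs)
        p *= likelihood[(cls, input)];    // P(input value | class), independence assumed
    return p;                             // unnormalized; pick the class with the largest value
}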
The Microsoft Clustering algorithm identifies the relationships in a dataset and generates clusters. The user-specified model identifies the relationships, so no predictable column is required. Taking the example of chip buyers, we can see that the kids form a cluster separate from the others. Splitting the clusters further into age groups yields smaller clusters.
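A toy distance-based assignment step, sketched in one dimension for the age example; the real algorithm (k-means or EM style) alternates this assignment with re-estimating the centers.

using System;

static int NearestCluster(double age, double[] centers)
{
    int best = 0;
    for (int j = 1; j < centers.Length; j++)
        if (Math.Abs(age - centers[j]) < Math.Abs(age - centers[best]))
            best = j;
    return best; // assign the case to the closest cluster center
}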
The Microsoft Sequence Clustering algorithm is similar to the clustering algorithm above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events or transitions between states in a dataset, as in a Markov chain. Think of the sequences as IDs of any sortable data maintained in a separate table. The sequence for each data item is analyzed to form groups.
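To make the Markov-chain view concrete, here is a sketch that estimates transition probabilities between states from observed sequences; sequence clustering then groups sequences whose transition patterns are similar. This is my illustration, not the product's implementation.

using System.Collections.Generic;

static Dictionary<(string, string), double> TransitionProbabilities(List<string[]> sequences)
{
    var counts = new Dictionary<(string, string), int>();
    var totals = new Dictionary<string, int>();
    foreach (var seq in sequences)
        for (int i = 0; i + 1 < seq.Length; i++)
        {
            var key = (seq[i], seq[i + 1]);
            counts[key] = counts.GetValueOrDefault(key) + 1;         // count state -> state transitions
            totals[seq[i]] = totals.GetValueOrDefault(seq[i]) + 1;   // count outgoing transitions per state
        }
    var probs = new Dictionary<(string, string), double>();
    foreach (var kv in counts)
        probs[kv.Key] = (double)kv.Value / totals[kv.Key.Item1];     // normalize by outgoing count
    return probs;
}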
The Microsoft Neural Network algorithm combines each possible state of the input attributes with each possible state of the predictable attribute. The input attribute values and their probabilities are used to assign weights, which then affect the outcome or predicted value. Generally this needs a large amount of training data.
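A toy neuron to illustrate the weighting: a weighted sum of input activations passed through a squashing function. In the actual algorithm the weights are learned from the training data; here they are given.

using System;

static double Neuron(double[] inputs, double[] weights, double bias)
{
    double sum = bias;
    for (int i = 0; i < inputs.Length; i++)
        sum += inputs[i] * weights[i];    // weighted combination of input states
    return 1.0 / (1.0 + Math.Exp(-sum));  // sigmoid activation squashes to (0, 1)
}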
The Microsoft Association algorithm provides recommendations, as in a market-basket analysis. For example, if customers bought chips, then they also bought dips. The support parameter here is the number of cases that contain both chips and dips. The association rules are the output of the algorithm.
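Counting support is straightforward; here is a minimal sketch for the rule {chips} -> {dips}, where each basket is the set of items in one case.

using System.Collections.Generic;

static int Support(List<HashSet<string>> baskets, string a, string b)
{
    int count = 0;
    foreach (var basket in baskets)
        if (basket.Contains(a) && basket.Contains(b))
            count++;                // the case contains both items
    return count;
}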
The Microsoft Time Series algorithm uses historical information in the data to make predictions about the future. It can also be used for cross-prediction: if we train the algorithm with two separate but related series, the resulting model can be used to predict one series based on the other. To predict the time series, the method uses a windowing transformation of the dataset into cases suitable for regression analysis, where the past and present values serve as the predictor and target variables respectively.
Let us look at the implementation of the time series algorithm next.
We will specifically look at the ARTXP and ARIMA algorithms mentioned here.
These are optimized for short-term and long-term predictions respectively; ARTXP is the decision-tree-based autoregressive model described above. In mixed mode, both algorithms are used, and a parameter biases the result entirely toward one algorithm at 0 and the other at 1, with a weighted blend in between.
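A minimal sketch of the mixed-mode blend, assuming a weight w in [0, 1] where 0 yields the short-term (ARTXP) forecast and 1 the long-term (ARIMA) forecast; the method and parameter names are mine, not the product's API.

static double BlendedForecast(double artxpForecast, double arimaForecast, double w)
{
    // linear interpolation between the two models' forecasts
    return (1 - w) * artxpForecast + w * arimaForecast;
}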


Saturday, September 6, 2014

In the post before last, I mentioned applying centrality measures to keyword extraction. There we said we could find the edges in the graph based on mutual information. Mutual information is based on co-occurrence, which helps when the words are expressed in the same form. But writers seldom repeat their words; they express their import with synonyms and varied language. In such cases, the function of the keywords is as important as their form. Therefore, we could use alternative metrics for pairwise relationships. One such metric could be based on an ontology: we look up two words to see whether they are similar. With WordNet, for example, we get a distance metric. Such a metric adds more relevance to the edges of the graph. Synonyms and antonyms could have a constant measure different from the default for unknown words. By adding a metric for import in addition to co-occurrence, we improve the ranking we get from centrality.
SemanticSimilarity semsim = new SemanticSimilarity();
float score = semsim.GetScore(word1, word2);

Here is a link for the same: http://www.codeproject.com/Articles/11835/WordNet-based-semantic-similarity-measurement
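Building on that, a sketch of how the similarity score could weight graph edges; the default weight is an assumption, and SemanticSimilarity is the class from the linked article.

static float EdgeWeight(string word1, string word2, SemanticSimilarity semsim)
{
    const float defaultWeight = 0.1f;      // assumed default for unknown word pairs
    float sim = semsim.GetScore(word1, word2);
    return sim > 0f ? sim : defaultWeight; // known pairs keep their similarity score
}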

I want to mention a caveat: this is not a substitute for clustering or mining algorithms. For more on mining algorithms, we can refer here: http://msdn.microsoft.com/en-us/library/ms175595.aspx

and they include the following algorithms:
Classification algorithms
Regression algorithms
Segmentation algorithms
Association algorithms
Sequence analysis algorithms

Friday, September 5, 2014

// find anagrams
using System;

class MyClass {
    public void check_anagrams(string[] firstWords, string[] secondWords) {
        if (firstWords == null ||
            secondWords == null ||
            firstWords.Length != secondWords.Length ||
            firstWords.Length == 0) return;
        for (int i = 0; i < firstWords.Length; i++)
        {
            char[] left = firstWords[i].ToCharArray();
            char[] right = secondWords[i].ToCharArray();
            Array.Sort<char>(left);
            Array.Sort<char>(right);
            // two words are anagrams exactly when their sorted characters match
            if (new string(left) == new string(right))
                Console.WriteLine("1");
            else
                Console.WriteLine("0");
        }
    }
}

// find matching braces pairs
using System;
using System.Collections;

class MyClass {
    public void check_braces(string[] expressions) {
        foreach (var expression in expressions)
        {
            Stack s = new Stack();
            int output = 1;
            foreach (var token in expression)
            {
                switch (token)
                {
                    case '(':
                    case '{':
                    case '[':
                        s.Push(token);
                        break;
                    case ')':
                        // a closer must match the most recently pushed opener
                        if (s.Count == 0 || (char)s.Pop() != '(') output = 0;
                        break;
                    case '}':
                        if (s.Count == 0 || (char)s.Pop() != '{') output = 0;
                        break;
                    case ']':
                        if (s.Count == 0 || (char)s.Pop() != '[') output = 0;
                        break;
                }
            }
            if (s.Count > 0) output = 0; // unmatched opening braces remain
            Console.WriteLine("{0}", output);
        }
    }
}

Find the minimum integer X for a given unordered integer array such that adjusting each element by at most X makes the array sorted.

        static void Main(string[] args)
        {
            int[] v = new int[] { 5, 4, 1, 2, 3, 8 };
            hill(v);
        }
        static public void hill(int[] v)
        {
            if (v == null || v.Length == 0) return;
            int max = v[0];   // running maximum of the prefix seen so far
            int diff = 0;     // largest drop from a prefix maximum to a later element
            for (int i = 1; i < v.Length; i++)
            {
                if (v[i] > max) max = v[i];
                if (max - v[i] > diff) diff = max - v[i];
            }
            // each element may be moved up or down by at most X,
            // so the largest drop can be repaired when 2 * X >= diff
            int x = (diff + 1) / 2;
            Console.WriteLine("{0}", x);  // prints 2 for the example above
        }