Sunday, August 31, 2014

In the Kullback-Leibler divergence that we mentioned in earlier posts, we saw that the divergence was measured word by word as the average probability of that word against the background distribution. We calculate nw as the number of times w appears, and pw as that count divided by the total number of terms in the document, and we measured P(tk|q) as nw / sum over all terms x in q of nx. When the divergence was greater than a threshold, we selected the word as a keyword. It is not necessary to measure the divergence of each term one by one against the background distribution of a document, because the metric holds for any two distributions P(x) and Q(x): their divergence is measured as the sum over x of P(x) log(P(x)/Q(x)). The term distribution of a sample document is then compared with the distribution over the C categories. The probability distribution of a term tk in a document dj is the ratio of the term's frequency in that document to its overall frequency across all documents when the term appears in the document, and zero otherwise. The term probability distribution across categories is normalized to 1.
The equation we refer to for the divergence comes in many forms.
Most formulations use a default value for when a term doesn't appear in either P or Q, because zero values skew the computation (a zero probability makes the log term undefined). This small probability, epsilon, corresponds to an unknown word.
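To make this concrete, here is a minimal C# sketch of that per-term scoring with epsilon smoothing. The names (ScoreTerms, docCounts, backgroundCounts, epsilon) are hypothetical and this is not the implementation referenced below; it is just one way to compute nw / sum(nx) for the document and for the background and keep the terms whose contribution exceeds a threshold.

using System;
using System.Collections.Generic;
using System.Linq;

static class KlKeywords
{
    // Hypothetical sketch: score each term by its pointwise contribution to
    // KL(document || background), using epsilon for terms missing from the background.
    public static IEnumerable<string> ScoreTerms(
        Dictionary<string, int> docCounts,
        Dictionary<string, int> backgroundCounts,
        double threshold,
        double epsilon = 1e-6)
    {
        double docTotal = docCounts.Values.Sum();
        double bgTotal = backgroundCounts.Values.Sum();

        foreach (var kv in docCounts)
        {
            double p = kv.Value / docTotal;                       // P(term | document)
            double q = backgroundCounts.TryGetValue(kv.Key, out var n)
                           ? n / bgTotal : epsilon;               // Q(term | background), epsilon if unseen
            double contribution = p * Math.Log(p / q);            // pointwise KL contribution
            if (contribution > threshold)
                yield return kv.Key;                              // keep as keyword
        }
    }
}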
Today I will resume the discussion on keyword extraction.
We discussed co-occurrence of terms as an indicator of the keywords. This has traditionally meant clustering keywords based on similarity. Similarity is often measured based on Jensen-Shannon divergence or Kullback-Leibler divergence. However similarity doesn't give an indication of relevance. Pair-wise co-occurrence or mutual information gives some indication of relevance.
 Sometimes we need to use both or prefer one over the other based on chi-square goodness of fit.
In our case, co-occurrence of a term and a cluster means co-occurrence of the term and any term in the cluster although we could use nearest, farthest terms or the average from the cluster.
What we did was populate a co-occurrence matrix from the top occurring terms and their counts. For each term, we count its co-occurrences with the frequent terms, which are selected based on a frequency threshold.
When we classify, we take two terms and find the clusters they belong to. Words don't belong to any cluster initially. They are put in the same cluster based on mutual information, calculated as the ratio of the probability of the terms co-occurring to the product of their individual probabilities. We translate this into counts and calculate each probability from the co-occurrence matrix, as shown below.
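In symbols, with N the total number of co-occurrence observations, n(w) the count of term w, and n(w1, w2) the co-occurrence count read from the matrix, the mutual information score (notation mine) is

\mathrm{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = \log \frac{N \cdot n(w_1, w_2)}{n(w_1)\, n(w_2)}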
We measure the cluster quality by calculating the chi-square statistic. We do this by summing the chi-square components measured for each word among the frequent terms. Each component is the square of the difference between the observed co-occurrence frequency and the expected frequency, divided by the expected frequency of co-occurrence. The expected frequency is in turn the product of pg, the expected probability of the frequent term g, and nw, the total number of co-occurrences of the term w with the frequent terms, as sketched in code below.
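A small C# sketch of that chi-square score follows. The inputs are assumptions on my part (a co-occurrence row for w against the frequent terms g, the expected probabilities pg, and nw derived as the row total); this is not the implementation linked below.

using System;
using System.Linq;

static class ChiSquareScore
{
    // Hypothetical sketch: chi-square of term w against the frequent terms.
    // cooccurrence[g] = observed co-occurrence count of w with frequent term g,
    // expectedProb[g] = p_g, the expected probability of frequent term g.
    public static double Compute(int[] cooccurrence, double[] expectedProb)
    {
        double nw = cooccurrence.Sum();          // total co-occurrences of w with the frequent terms
        double chi = 0.0;
        for (int g = 0; g < cooccurrence.Length; g++)
        {
            double expected = nw * expectedProb[g];              // expected frequency n_w * p_g
            double diff = cooccurrence[g] - expected;
            if (expected > 0)
                chi += diff * diff / expected;                   // (observed - expected)^2 / expected
        }
        return chi;                                              // larger => w is more salient
    }
}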
If a term has a large chi-square value, it is relatively more important; if it has a low chi-square value, it is relatively trivial. Chi-square gives a notion of deviation from the expected distribution, indicating the contribution each cluster makes and hence its likelihood of bringing out the salient keywords. For a look at the implementation, here it is. We attempted Kullback-Leibler divergence as a method for keyword extraction as well; there we used one of the forms of the divergence formula.

Saturday, August 30, 2014


Trying out Node.js, WebMatrix and MongoDB.

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;

MongoClient.connect('mongodb://127.0.0.1:27017/test', function(err, db) {
    if(err) throw err;

    var collection = db.collection('test_events');
    collection.insert(   [
     { Timestamp: (new Date()).toString(), host: "local", source: "source1", sourcetype: "sourcetype1" },
     { Timestamp: (new Date()).toString(), host: "local", source: "source2", sourcetype: "sourcetype2" },
   ],{ ordered: true }, function(err, docs) {
        collection.count(function(err, count) {
            console.log(format("err = %s", err));
            console.log(format("count = %s", count));
            db.close();
        });
    });
});

C:\Users\Admin>node connect.js
err = null
count = 2

> db.test_events.find()
{ "_id" : ObjectId("54025fbb9c212420085d82fb"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source1
", "sourcetype" : "sourcetype1" }
{ "_id" : ObjectId("54025fbb9c212420085d82fc"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source2
", "sourcetype" : "sourcetype2" }
I came across an interesting topic about how to store key values in relational tables and whether we should move to NoSQL just for the sake of storing key values. The trouble with storing key values in relational tables is that the same key can have multiple values. If we keep one record per key-value pair, we soon flood the table, but more importantly this looks like a collection rather than an entity, which is fundamental to the relational model. We could alleviate the storage concerns and model them as different entities by, say, calling one table Keys and another table Values and adding a relation between them.
That said, probably the easiest approach to implement is to store the pairs as XML in a single column. This avoids having two columns, one for the field value and the other for the field value type. Moreover, the column can be indexed and queried with XPath, which is why XML is preferred here over JSON.
Another approach is to use the Entity-Attribute-Value (EAV) model. Here the attributes are available as columns, and they can be numerous, with only a few columns holding values at any time. The result resembles a sparse matrix.
The thing to note here is that the tendency to add custom properties is not restricted to a single table and can spread through the system. That is why the data model may need to be redesigned, or at least extended, if such requirements crop up. Normalization matters just as much as the convenience of key-values.
Key-value pairs are generally stored using a hashing function because they are essentially collections. Hashing buckets the keys, and collisions are resolved by overflow lookups.
NoSQL stores such as MongoDB serve more purposes as well. They generally fall into the following categories (MongoDB itself is a document store):
column stores
key value stores
document stores
graph stores
If we look at the XML-column approach in SQL Server, for example:

use EventsDB
go

CREATE TABLE dbo.Event
(
    ID int IDENTITY NOT NULL,
    Timestamp datetime NOT NULL,
    Host nvarchar(4000) NOT NULL,
    Source nvarchar(100) NOT NULL,
    SourceType nvarchar(100) NOT NULL,
    FieldMap xml NULL,
    CONSTRAINT IX_Event_Timestamp PRIMARY KEY CLUSTERED (Timestamp, ID) WITH (IGNORE_DUP_KEY = OFF)
);

INSERT INTO dbo.Event (Timestamp, Host, Source, SourceType, FieldMap)
VALUES (GETDATE(), HOST_NAME(), N'Source1', N'SourceType1', NULL);

INSERT INTO dbo.Event (Timestamp, Host, Source, SourceType, FieldMap)
VALUES (DATEADD(DD, 1, GETDATE()), HOST_NAME(), N'Source2', N'SourceType2', NULL);

UPDATE dbo.Event SET FieldMap = '<xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>'
WHERE Source = N'Source1';

UPDATE dbo.Event SET FieldMap = '<xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>'
WHERE Source = N'Source2';

SELECT * FROM dbo.Event
go

ID Timestamp Host Source SourceType FieldMap
1 2014-08-31 09:22:28.193 ADMIN-PC Source1 SourceType1 <xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>
2 2014-09-01 09:22:28.193 ADMIN-PC Source2 SourceType2 <xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>


SELECT ID, FieldMap.query('data(/xml/Source)') as Value
FROM dbo.Event
WHERE FieldMap.exist('/xml/SourceType') = 1

Friday, August 29, 2014

Today we will be quickly reviewing SEO - search engine optimization from the SEO guide.
Search engines have two major functions - 1. crawling and building an index 2. providing answers by calculating relevancy and serving results.

The web is connected by links between pages. Crawlers index these pages for fast lookups. Search engines have to provide answers in fractions of a second because users lose focus after 2-3 seconds; relevancy and importance must be determined from voluminous data within that time. SEO targets both of these metrics. Relevance is a way of ranking what pertains to the user query. Importance is somewhat independent of the user query and focuses on the popularity of the site. There are several algorithms used to determine relevance and importance, and each search engine may have its own.
Google recommends  the following to get better ranking:
  • don't cloak your pages differently to search engines from what they appear to users
  • organize your site with each page reachable by a static link
  • pages are cohesive with respect to the content they provide
  • use redirects and rel="canonical" for duplicate content.
Bing suggests the following:
  • construct clean URLs with keywords
  • provide keyword-rich content and refresh content
  • don't hide the text to be indexed in resources 
There are three types of queries search users perform. These are:
"Do" transactional queries, such as buying a plane ticket
"Know" informational queries, such as finding the name of a restaurant
"Go" navigational queries, such as going to a specific site like LinkedIn

In spite of the efforts of the search engine, the user's attention is not drawn to the results as a whole but to the words appearing in bold, titles and brief descriptions - things that are explicitly targeted in paid search listings.
 To get your page noticed, you could:
Wrap your images, plugins, video and audio content with a text description of the content.
Search boxes can be supplemented with navigable and crawlable links, and a sitemap can be included. Crawlable links mean that the webpages are connected and the hrefs in the HTML point to, say, static links. If you want to hide content from the search engine, you can add rel="nofollow" to your hrefs, add a meta robots tag, or use robots.txt. When using keywords, use specific ones and don't stuff them into the content; keyword density does not help. Keywords in titles help, but don't make a title longer than 65-75 characters. Always improve readability and emotional impact.

Search engines have millions of smaller databases that have keyword based indices. This makes it much faster for the engines to retrieve the data. Keywords thus play an important role in the search engine's ranking algorithm. Keywords appearing in title, text, image alt attributes and metadata promote a page's relevance. When keywords appear in the title, be mindful of the length, order and leverage branding.
Meta robots tags such as 'index/noindex', 'follow/nofollow', 'noarchive', 'nosnippet' and 'noodp/noydir' restrict spider activity and should be used judiciously. The meta description tag is what appears together with a search listing.
URLs should be kept short, use keywords, separate keywords with hyphens and be static. Use canonicalization so that every unique piece of content has one and only one URL.

Thursday, August 28, 2014

We will be reviewing some simple random walk theorems. We will look at the amount of time that a SSRW is positive. If the random walk is at zero, it takes an even number of steps in either direction to return to zero.
We will look at the distribution of the time when the random walk is positive. We can take an example with coin tosses. When we toss a fair coin 2n times, the probability that the number of heads equals k is maximized when k = n, that is, when the number of heads and the number of tails are equal; pushing the count for one outcome toward 0 or 2n makes the outcome less likely. Since we mentioned earlier that the probability a random walk reaches a certain level can be derived from the probabilities of the outcomes of the individual steps, we can express the random walk probabilities in terms of the counts of the probabilities for the individual steps.
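In symbols, the number of heads in 2n fair tosses is binomial, and the central term is the largest:

P(\text{heads} = k \text{ in } 2n \text{ tosses}) = \binom{2n}{k}\, 2^{-2n}, \qquad \text{maximized at } k = n.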

Tuesday, August 26, 2014

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public class Test
{
    public static void Main()
    {
        char[,] wordSearch = new char [3,3] { {'B', 'O', 'B'}, {'B', 'O', 'B'}, {'B', 'O', 'B'} };
        var wordList = new List<string>() { "BOB", "BOX", "COB" };
        printWordsInMaze(wordSearch, 3, 3, wordList);
    }
   
    public static void printWordsInMaze(char[,] wordSearch, int row, int col, List<string> wordList)
    {
        // For every cell, walk in each of the eight directions, build the pattern of
        // letters along that direction and report any word that matches its prefix.
        string[] directions = { "LR", "RL", "U", "D", "DUL", "DUR", "DDL", "DDR" };
        for (int i = 0; i < row; i++)
            for (int j = 0; j < col; j++)
            {
                foreach (var direction in directions)
                {
                    int r = i;
                    int c = j;
                    string pattern = String.Empty;
                    switch (direction)
                    {
                        case "LR":   // left to right
                            while (c < col) { pattern += wordSearch[r, c]; c++; }
                            break;
                        case "RL":   // right to left
                            while (c >= 0) { pattern += wordSearch[r, c]; c--; }
                            break;
                        case "U":    // upwards
                            while (r >= 0) { pattern += wordSearch[r, c]; r--; }
                            break;
                        case "D":    // downwards
                            while (r < row) { pattern += wordSearch[r, c]; r++; }
                            break;
                        case "DUL":  // diagonal up-left
                            while (r >= 0 && c >= 0) { pattern += wordSearch[r, c]; r--; c--; }
                            break;
                        case "DUR":  // diagonal up-right
                            while (r >= 0 && c < col) { pattern += wordSearch[r, c]; r--; c++; }
                            break;
                        case "DDL":  // diagonal down-left
                            while (r < row && c >= 0) { pattern += wordSearch[r, c]; r++; c--; }
                            break;
                        case "DDR":  // diagonal down-right
                            while (r < row && c < col) { pattern += wordSearch[r, c]; r++; c++; }
                            break;
                    }
                    printMatchingWord(i, j, pattern, wordList, direction);
                }
            }
    }

    private static void printMatchingWord(int i, int j, string pattern, List<string> candidates, string direction)
    {
        foreach (var word in candidates)
        {
            Console.WriteLine("Pattern={0},word={1}", pattern, word);
            if (pattern.StartsWith(word))
                Console.WriteLine("Word = {0} at row = {1} and col = {2} and direction = {3}", word, i, j, direction);
        }
    }

}

  • Success time: 0.06 memory: 33936 signal:0
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 0 and direction = DDR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OOO,word=BOB
    Pattern=OOO,word=BOX
    Pattern=OOO,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 2 and direction = DDL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 1 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OO,word=BOB
    Pattern=OO,word=BOX
    Pattern=OO,word=COB
    Pattern=OO,word=BOB
    Pattern=OO,word=BOX
    Pattern=OO,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 1 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 0 and direction = DUR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OOO,word=BOB
    Pattern=OOO,word=BOX
    Pattern=OOO,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 2 and direction = DUL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    
Today we will be reviewing the Atom web standards. There is the Atom Publishing Protocol and the Atom Syndication Format. The latter is the XML language used to describe Atom feeds; the former is the HTTP-based protocol for creating and updating web resources. Web feeds follow a publisher-subscriber model. A web feed provider is usually a site owner that publishes resources in, say, XML format. Subscribers then use feed readers to view the content. Browsers typically support feed readers. One usage is when bloggers keep track of each other's posts via Atom feeds.
A feed can contain both data and metadata. It has labeled content. It is usually timestamped to communicate when it was created or updated. Each text element can be internationalized using xml:lang. The Atom format came after the RSS format, which was struggling with backward compatibility. While RSS is still used by Apple and iTunes, the Atom format has been adopted widely by Google.
To provide a link to an Atom feed, we use the following in HTML:
<link href="atom.xml" type="application/atom+xml" rel="alternate" title="feedtitle" />
The feed is specified as <feed xmlns="http://www.w3.org/2005/Atom"><metadata/><data/></feed>
.NET MVC web servers allow an RSS ActionResult to be cached with an expiration time because it is just a file.
The iCalendar format differs from Atom in that it specifies meeting requests and tasks for sharing via a file with the .ics extension. The content is specified in begin and end tags, with the top level being VCALENDAR. The items are listed as key:value pairs and form a succinct description of the meeting. It has syntax to specify event, journal and free/busy time.
Flair is a JavaScript-based application framework for Adobe's AIR HTML+Ajax SDK; it is an acronym for Framework Leveraged AIR. It can be used to specify a list of items, for example based on the source IP of a request as gathered from the request headers HTTP_X_FORWARDED_FOR and REMOTE_ADDR.

Monday, August 25, 2014

In this post, we describe an application that does a time-series inverted search on computer logs described by raw text, timestamp, host, source and source type. We call these events, and we keep them in a table in a SQL database with a clustered index on the timestamp. The application is a .NET Web API application that connects to the table over Entity Framework or any object-relational mapping library. The Web API application exposes one API to post events into the server and another to search over them. When events are posted, they are stored sorted by their time. When a search is executed, the time range is used to select events, which are returned to the user as search results. Code is being written at http://github.com/ravibeta/csharpexamples/SplunkLite.Net
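As a rough sketch of the shape of such an API (not the code in the repository above; the types, routes and in-memory store here are hypothetical stand-ins for the Entity Framework-backed table):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web.Http;   // ASP.NET Web API

public class Event
{
    public int Id { get; set; }
    public DateTime Timestamp { get; set; }
    public string Host { get; set; }
    public string Source { get; set; }
    public string SourceType { get; set; }
    public string Raw { get; set; }
}

public class EventsController : ApiController
{
    // Stand-in for the ORM-backed table; a real version would use Entity Framework.
    private static readonly List<Event> Store = new List<Event>();

    // POST api/events : ingest an event.
    public void Post(Event e) { Store.Add(e); }

    // GET api/events?from=...&to=... : time-range search over the timestamp order.
    public IEnumerable<Event> Get(DateTime from, DateTime to)
    {
        return Store.Where(e => e.Timestamp >= from && e.Timestamp <= to)
                    .OrderBy(e => e.Timestamp);
    }
}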

Sunday, August 24, 2014

In the previous post, we mentioned Markov chains and closed communicating classes. If we have a Markov chain with transition matrix P and fix a number of steps n, then the Markov chain with transition matrix P^n has exactly the same closed communicating classes. Recall that, by definition, a closed communicating class contains every state that its members lead to. A general Markov chain can have an infinite number of communicating classes. This is something we see from sets and relations. When we visualize the topological structure of a Markov chain, we will see closed communicating classes with no outgoing arrows, and arrows between non-closed communicating classes pointing in one direction. If we remove the non-closed communicating classes, we are left with disjoint closed communicating classes.
If we have an irreducible chain, where all the states form a single closed communicating class, we can further describe the behavior of the chain by noticing that if the chain is in one set of states it can move only to certain other sets. For example, if we take a chain with four such sets where each set transitions to the diagonally opposite one, the chain has period two; a triangular arrangement of three sets has period three. The numbers two and three are divisors of the possible return lengths. If an integer n divides m and a return to the state is possible in n steps, then it is possible in m steps by repeating the loop. So whether it is possible to return to i in m steps is decided by the integer divisors of m. The period of a state is defined as the greatest common divisor of all natural numbers n such that it is possible to return to i in n steps. A state i is called aperiodic if its period is 1.
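For reference, the period of a state i can be written as

d(i) = \gcd\{\, n \ge 1 : p^{(n)}_{ii} > 0 \,\}, \qquad i \text{ is aperiodic iff } d(i) = 1.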

Saturday, August 23, 2014

The topological structure of a Markov chain can be represented as a digraph and denoted by a pair (V,E ) with V being the vertices and E being the edges of the graph. The graph of a chain refers to the structure of the process that does not depend on the exact values of the transition probabilities but only on which of them are positive. We say that a state i leads to state j if, starting from i the chain will visit j at some finite time.
We denote this by i ~ j, and it is equivalent to saying that the probability of reaching state j in some finite number of steps n, starting from i, is greater than zero. This relation is reflexive: in zero steps, the chain starting at i is at i with probability one.
This relation is also transitive. If we can reach state j from i and k from j then it implies that there is a chain possible from i to k .
If the relationship also holds symmetrically, that is, there is a path from i to j and from j to i, then we say that i communicates with j.
The communication relationship is an equivalence relation which means it is symmetric, reflexive and transitive. Equivalence classes also called as communicating classes partition the space. The communicating class corresponding to state i is, by definition the set of all states that communicate with i.
Two communicating classes are either identical or completely disjoint.
A communicating class is said to be closed when no state outside the class can be reached from inside it; in other words, if i belongs to the class and i leads to j, then j also belongs to the class. Closed communicating classes are particularly important because they decompose the chain into smaller, more manageable parts. A state i is essential if it belongs to a closed communicating class; otherwise the state is inessential. If all the states communicate with one another, then the chain is called irreducible.
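In symbols, with p^{(n)}_{ij} the n-step transition probability, the relations used in this post are

i \rightsquigarrow j \iff \exists\, n \ge 0 : p^{(n)}_{ij} > 0, \qquad i \leftrightarrow j \iff i \rightsquigarrow j \text{ and } j \rightsquigarrow i,

and a communicating class C is closed iff i \in C and i \rightsquigarrow j imply j \in C.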

Friday, August 22, 2014

Today we will continue to discuss the Skorokhod embedding theorem. Recall that the theorem produces a stopping time at which the walk has a prescribed distribution. The simplest case was the symmetric two-point distribution, but the Skorokhod embedding works for a general zero-mean distribution. Moreover, we use it by describing the random walk stopped at the first time it leaves the interval between the values taken by two auxiliary random variables A and B.
Today we will continue to discuss random walks. In the previous post, we were proving the Skorokhod embedding theorem. We introduced a random variable Z for a simple random walk: Z is the value of the walk when it first hits either the non-positive value taken by the random variable A or the non-negative value taken by the random variable B, with the walk taken to be independent of both A and B. The claim was that this new random variable takes each value k with the probability pk from the given distribution, which leads to the theorem that the stopped walk has that distribution. To verify the claim, we first looked at k = 0: Z = 0 if and only if A = 0 and B = 0, and this has probability p0 by definition. Then we take k > 0 and work out the probability that Z takes the value k. We expand this probability as the sum, over all i < 0, of the joint probability that A takes the value i and B takes the value k (B must equal k for Z to equal k), times the probability that the walk reaches k before i. By the gambler's ruin calculation, the probability of reaching level b > x before level a < x starting from x is (x - a)/(b - a); starting from 0 with a = i and b = k this equals -i/(k - i). Substituting this together with the joint probability, which is the normalization factor times (k - i) pi pk, the factors (k - i) cancel, and using the zero-mean condition the sum simplifies to pk, proving the claim.
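Written out, with c the normalization factor and using the gambler's ruin probability -i/(k - i), the computation for k > 0 is

P(Z = k) = \sum_{i<0} P(A = i, B = k)\,\frac{-i}{k-i} = \sum_{i<0} c\,(k-i)\,p_i\,p_k\,\frac{-i}{k-i} = c\,p_k \sum_{i<0} (-i)\,p_i = p_k,

since the zero-mean condition gives \sum_{i<0} (-i)\,p_i = \sum_{j>0} j\,p_j = c^{-1}.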

Thursday, August 21, 2014

In the previous post, we mentioned the Skorokhod embedding theorem. This theorem produces a stopping time T at which the walk has the given distribution. In this post, we try to prove it. We define a pair of random variables (A, B), with A taking non-positive and B non-negative integer values, with the following distribution:
the initial state, A = 0 and B = 0, has probability p0, and
any other state, A = i and B = j with i < 0 < j, has probability equal to a normalization factor times (j - i) times the product of the probabilities pi and pj from the distribution (written out below).
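In symbols, the distribution described in the two items above is, with c the normalization factor,

P(A = 0, B = 0) = p_0, \qquad P(A = i, B = j) = c\,(j - i)\,p_i\,p_j \quad \text{for } i < 0 < j.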
Summing over all the other states must give probability 1 - p0. We can split the (j - i) factor into two separate terms.
Using the zero-mean equation in that sum, taken over all i < 0 < j, we get that the normalization factor is the inverse of the common value: the sum of i times pi over all i > 0.
Taking (A, B) to be independent of the walk (Sn), we define the stopping time as the first time the walk hits either A or B, and we take the new random variable Z to be the value of the walk at that time.
The claim is that the probability for this random variable to take the value k is the corresponding probability pk from the distribution. For k = 0, Z = 0 has probability p0. For k > 0, we can use the gambler's ruin probability to compute the probability that Z equals k, and we see that it has the value pk.

Wednesday, August 20, 2014

Today we will look at the Skorokhod embedding theorem. We consider a simple symmetric random walk where each step is +1 with probability 1/2 and -1 with probability 1/2. Starting from x, the probability that the gambler's fortune will reach level b > x before reaching level a < x is given by (x - a)/(b - a). When we start the random walk from S0 = 0, the probability that it will reach level b > 0 before level a < 0 equals -a/(b - a), and the probability that it will reach level a before b is b/(b - a).
These same probabilities can be read as the exit probabilities of the simple symmetric random walk from the interval [a, b], at b and at a respectively.
We view the random walk this way. Given a 2 point probability distribution (p, 1-p) such that p is rational, we can always find integers a,b a < 0 < b such that pa + (1-p)b  = 0.
This means we can simulate the distribution by generating a simple symmetric random walk starting from 0 and watching it until the first time it exits the interval [a, b]. The probability that it exits from the left endpoint is p and the probability that it exits from the right endpoint is 1 - p. Having introduced the idea, let's look at a more complicated probability distribution.
If the probabilities were to be many in a distribution, can we find a stopping time T such that the random walk has the given distribution ? This is where Skorokhod embedding theorem helps.
This theorem states that for a simple symmetric random walk and a given probability distribution (pi, i in Z) on Z with zero mean, that is, satisfying the equation sum of i·pi = 0, there exists a stopping time T such that the probability that the walk is at i at time T equals pi.
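Stated compactly, with (S_n) the simple symmetric random walk started at 0:

\sum_{i \in \mathbb{Z}} i\,p_i = 0 \;\Longrightarrow\; \exists \text{ a stopping time } T \text{ with } P(S_T = i) = p_i \text{ for all } i \in \mathbb{Z}.

In the two-point case above, p\,a + (1-p)\,b = 0 gives p = b/(b - a), which is exactly the gambler's ruin probability of exiting the interval [a, b] at a.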

Monday, August 18, 2014

Today we will continue to discuss random walks. We have discussed one-dimensional and two-dimensional random walks so far. We saw that the two-dimensional random walk can be expressed in terms of two independent one-dimensional random walks. We proved this by enumerating the probabilities of the transitions and considering them relative to the origin. We now look at another lemma, which states that the probability for the two-dimensional random walk to return to zero after 2n steps is the square of the probability for the simple symmetric random walk (SSRW) to return to zero after 2n steps. This follows from the first argument, where we expressed the two-dimensional random walk in terms of two independent SSRWs; both SSRWs have the same probability for the final state we are interested in, so the probabilities multiply. The probability for an SSRW to return to zero after 2n steps is the number of paths from (0,0) to (2n,0) divided by the total number of paths, which works out to 2n Choose n times 2^-2n. The probability for the two-dimensional random walk to return to the origin is the square of that.
We next look at recurrence
Recall that if a Markov chain has finitely many states, at least one state must be visited infinitely many times. Perhaps some states will be visited again and again infinitely often. Such a state is called recurrent. If the chain returns to a state i infinitely many times starting from i, we say that Pi(Xn = i for infinitely many n) = 1.
For the two-dimensional symmetric random walk, using the probability we worked out above, approximating it by Stirling's formula as roughly c/n for some constant c, and summing over n, the expected number of returns to the origin diverges, so recurrence holds for the two-dimensional random walk.
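In symbols, with R the two-dimensional walk and S a one-dimensional SSRW,

P_0(R_{2n} = (0,0)) = \big(P_0(S_{2n} = 0)\big)^2 = \left(\binom{2n}{n} 2^{-2n}\right)^2 \sim \frac{1}{\pi n}, \qquad \sum_{n \ge 1} \frac{1}{\pi n} = \infty,

so the expected number of returns to the origin is infinite and the two-dimensional walk is recurrent.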

Sunday, August 17, 2014


In the previous post, we considered a simple symmetric random walk in two dimensions. We said that when we change the coordinates, by rotating the axes 45 degrees and scaling, the random walks along the new single dimensions are independent. To prove this, we start with the random walk in two dimensions and express it as a sequence of vectors, where each vector is an incremental displacement along the rotated and scaled axes in Cartesian coordinate pairs. This sequence is also i.i.d. because each element is a fixed function of the corresponding increment along the original axes. When we look at this function, we see that the rotation and scaling amount to taking the sum and the difference of the incremental displacements along the original axes. The possible values of each coordinate of the rotated increment are +1 or -1, and we can write the probabilities for the rotated increments as follows:
The rotated increment is (1, 1) exactly when the original increment is (1, 0), so its probability is 1/4.
The rotated increment is (-1, -1) exactly when the original increment is (-1, 0), so its probability is 1/4.
The rotated increment is (1, -1) exactly when the original increment is (0, 1), so its probability is 1/4.
The rotated increment is (-1, 1) exactly when the original increment is (0, -1), so its probability is 1/4.
This implies that each incremental displacement along a rotated axis is +1 with probability 1/2 and -1 with probability 1/2. So we can decompose the probability that the two-dimensional walk ends at the pair (n1 at some value e1, n2 at some value e2) into the product of the probability that the first coordinate ends at e1 and the probability that the second coordinate ends at e2, for all choices of e1 and e2. Thus the two coordinates are independent.
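The marginal and joint probabilities that make the independence explicit, with \xi = (\xi^1, \xi^2) the rotated increment, are

P(\xi^1 = 1) = P(\xi = (1,1)) + P(\xi = (1,-1)) = \tfrac14 + \tfrac14 = \tfrac12, \qquad P(\xi = (1,1)) = \tfrac14 = P(\xi^1 = 1)\,P(\xi^2 = 1),

and similarly for the other three sign combinations, so the two rotated coordinates are independent \pm 1 steps.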
We will next look at a few more lemmas in this case.

Thursday, August 14, 2014

Today I will discuss simple symmetric random walk in two dimensions. We can consider the same drunkard problem in two dimensions. Consider a drunkard in a city whose roads are arranged in a rectangular grid. The corners have coordinates (x,y) and the origin is (0,0). He moves east, west, north or south with equal probability of 1/4. When he reaches a corner, he moves again in the same manner.
This random walk takes values in Z^2 = Z x Z. Let E1, E2, ... be i.i.d. random vectors, each taking one of the four unit displacements with probability 1/4.
Because the relative displacements are all the same, we move the start of all random walks to be the origin.
We further describe the horizontal and the vertical displacements with separate notations say x1 and x2 for horizontal and vertical respectively such that they belong to the grid and the final state is the sum of the incrementals in both directions taken together.
One of the lemmas now is that the horizontal and the vertical displacement processes are both random walks, but the two are not independent.
Since the increments E1, E2, ... are i.i.d., their x-projections form an i.i.d. sequence (and likewise the y-projections), each identically distributed with probability 1/2 of being zero and probability 1/4 each of being +1 or -1.
So the horizontal positions are a sum of i.i.d. random variables and hence a random walk. But it is not a simple random walk, because a step of zero has positive probability.
The same argument applies to vertical states.
However, the two random walks are not independent, as the one-step check below shows. Further, if we change the coordinates by rotating the axes by 45 degrees and scaling, so that the new coordinates are the sum and the difference of the horizontal and vertical positions, then the random walks along those axes are independent.
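The failure of independence before the rotation is immediate from a single step, since each step moves in exactly one coordinate:

P(\text{x-increment} = 0,\ \text{y-increment} = 0) = 0 \;\neq\; \tfrac12 \cdot \tfrac12 = P(\text{x-increment} = 0)\,P(\text{y-increment} = 0).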

Wednesday, August 13, 2014

Having discussed the Splunk database in previous posts, I want to outline and discuss one of its key concepts and its implementation in this post. This is the time-series partitioning of data into buckets and the inverted time search on those buckets. There are other very powerful concepts, such as key-value map and reduce, that enable powerful analytics. We suggested, and will follow up on, whether we could externalize this to NoSQL databases. In the meantime, I want to focus on just the time-series data, its persistence and search - essentially whether we can avoid buckets.
Events that flow into Splunk are marked with timestamp. The timestamp gives an order to the event. In the absence of any other search query parameters, the timestamp helps partition the data correctly.
The events need to be sorted by timestamp. However, events can arrive out of order. It's helpful to think of these events as precious data: we should not drop events if possible. So a sufficient window could be maintained to let the out-of-order events arrive. When the events arrive continuously, they can be streamed to a file or a table. There is no need to split the file or the table logically. If we are concerned about reliability and fault tolerance, we can handle it externally. If we are concerned about a single point of maintenance, we can distribute it. If we are concerned about ACID properties, we can use a traditional relational database. Outwardly, there is nothing preventing us from viewing it as a singular data store.
The ability to put an order on late-arriving events and then to search back in time for query matching is the consideration for this post. One way to do this is with a sliding window. While we cannot solve the problem of unlimited sequence numbers, we can hope to index and maintain events that are delayed by a reasonable amount. If we cannot control that, we specify an auxiliary store for such overly delayed events, which we can later sequence asynchronously. We could do this with fixed sequence numbers. For every event sequenced, the sliding window moves forward by one event. Sequencing one event involves inserting it in the right place. Typically we avoid a full lookup and insertion by copying the minimum timestamp to the start position of the sliding window and exchanging it with the element at the start; we do this only if the element at the start is not already the minimum, and we also check that the element before the start is earlier than the element at the start. Then we advance the sliding window forward by one event. Since we check only the elements within the window and the element prior to the start, we can wrap the index around for a fixed-size sequence number space.
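Here is a minimal C# sketch of such a bounded reordering buffer. A sorted buffer stands in for the exchange-with-start step described above, and the class names and window size are hypothetical.

using System;
using System.Collections.Generic;

public class Event
{
    public DateTime Timestamp { get; set; }
    public string Raw { get; set; }
}

// Holds up to 'window' events and releases them in timestamp order,
// tolerating out-of-order arrival within the window.
public class ReorderingBuffer
{
    private readonly int window;
    private readonly SortedList<DateTime, Queue<Event>> buffer = new SortedList<DateTime, Queue<Event>>();
    private int count;

    public ReorderingBuffer(int window) { this.window = window; }

    // Add an event; once the buffer exceeds the window, the earliest events are released.
    // Note: the returned sequence is lazy, so enumerate it to drain the buffer.
    public IEnumerable<Event> Add(Event e)
    {
        if (!buffer.TryGetValue(e.Timestamp, out var q))
            buffer[e.Timestamp] = q = new Queue<Event>();
        q.Enqueue(e);
        count++;

        while (count > window)
        {
            var earliest = buffer.Keys[0];          // minimum timestamp currently buffered
            var head = buffer[earliest];
            yield return head.Dequeue();            // release in timestamp order
            count--;
            if (head.Count == 0) buffer.Remove(earliest);
        }
    }
}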

Tuesday, August 12, 2014

We return to the discussion of the Ballot theorem. We said that if the urn contains a azure balls and b = n - a black balls, and the sampling is without replacement, then the probability that the azure balls stay ahead throughout the count is (a - b)/n, as long as a >= b. We prove this by drawing a parallel to the random walk: drawing an azure ball counts as a positive step and drawing a black ball as a negative step. The event in question is precisely the event that S1, S2, ..., Sn are each positive (where Sn is the state of the walk at time n), given that the walk ends at y = a - b, and we know from our earlier random walk equations that this probability is y/n.
Next we consider the asymmetric simple random walk with values in Z, which is not necessarily symmetric: the probability of a positive step is p and that of a negative step is q = 1 - p.
The probability that the walk started at S0 = 0 is at position k after n steps is given by
C(n, (n + k)/2) · p^((n + k)/2) · q^((n - k)/2)
Here we are counting paths: C(n, (n + k)/2) is the number of paths from (0, 0) to (n, k), and each such path has (n + k)/2 positive steps and (n - k)/2 negative steps, hence probability p^((n + k)/2) · q^((n - k)/2).
The initial state S0 is independent of the increments that lead to Sn.
We can now look at the hitting time -
We will now look at probabilistic and analytical methods to compute the distribution of the hitting time. The hitting time Tx is the first n >= 0 at which the state Sn equals x in Z, and it takes values 0, 1, 2, ..., infinity.
For given x, y in Z, the probability that the hitting time of y starting from x equals t is the same as the probability that the hitting time of y - x starting from 0 equals t, that is, Px(Ty = t) = P0(Ty-x = t). In other words, the space is homogeneous: relative transitions between any two points, if similar, result in similar hitting times. We use this to our advantage by considering all such random walks to start from zero without losing generality.