Sunday, August 31, 2014

In Kullback-Leibler divergence that we mentioned in earlier posts, we saw that the divergence was measured word by word as the average probability of that word against the distribution. We calculate nw as the total number of terms where w appears and pw as the total number of terms where w appears divided by the total number of terms in the document and we measured the P(tk/q) as [nw / Sum-for-all-terms-x-in-q (nw)] . When the divergence was greater than a threshold, we selected it as a keyword. It is not necessary to measure the divergence of the term one by one against the background distribution in a document because the metric hold for any two distributions such as P(x) and Q(x) and their divergence is measured as P(x)-Q(x)logP(x)/Q(x). The term distribution of  sample document is the compared with the distribution of categories that number C. The probability distribution of a term tk in a document dj is the ration of the term frequencies in that document compared to the overall term frequency across all documents in the case that the terms appear in the document otherwise zero. The term probability distribution across categories is normalized to 1.
The equation we refer to for the divergence comes in many forms.
Most equations use a default value for when a term doesn't appear in either P or Q. This is because the zero values skews the equation. The probability epsilon corresponds to an unknown word.
Today I will resume some discussion on Keyword extraction.
We discussed co-occurrence of terms as an indicator of the keywords. This has traditionally meant clustering keywords based on similarity. Similarity is often measured based on Jensen-Shannon divergence or Kullback-Leibler divergence. However similarity doesn't give an indication of relevance. Pair-wise co-occurrence or mutual information gives some indication of relevance.
 Sometimes we need to use both or prefer one over the other based on chi-square goodness of fit.
In our case, co-occurrence of a term and a cluster means co-occurrence of the term and any term in the cluster although we could use nearest, farthest terms or the average from the cluster.
What we did was we populated co-occurrence matrix from the top occurring terms and their counts. For each of the terms, we count the co-occurrences with the frequent terms that we have selected. These frequent terms are based on a threshold we select.
When we classify, we take two terms and find the clusters they belong to. Words don't belong to any cluster initially. They are  put in the same cluster based on the mutual information which is calculated as the ratio of the probability of co-occurring terms to the individual probabilities of the terms. We translate this to the counts and calculate each probability in terms of counts from the co-occurrence matrix.
We measure the cluster quality by calculating the chi-square. This we do by summing over all the components of the chi-square as measured for each word in the frequent terms. Each component is the square of the difference between the observed co-occurrence frequency and the expected frequency and divided by the expected frequency of co-occurrence. The expected frequency is calculated in turn as the combination of the expected probability pg of that word g from the frequent terms and the co-occurrence of the term w with frequent terms denoted by nw.
If the terms have a large chi-square value, then they are relatively more important. If the terms have  a low chi-square value then they are relatively trivial. Chi-square gives a notion of the deviation from the mean indicating the contribution each cluster makes and hence its likelihood to bring out the salient keywords. For a look at the implementation, here it is. We attempted Kullback-Leibler divergence as a method for keyword extraction as well. Here we used one of the formula for the divergence.

Saturday, August 30, 2014


Trying out Node.js, webmatrix and MongoDB.

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;

MongoClient.connect('mongodb://127.0.0.1:27017/test', function(err, db) {
    if(err) throw err;

    var collection = db.collection('test_events');
    collection.insert(   [
     { Timestamp: (new Date()).toString(), host: "local", source: "source1", sourcetype: "sourcetype1" },
     { Timestamp: (new Date()).toString(), host: "local", source: "source2", sourcetype: "sourcetype2" },
   ],{ ordered: true }, function(err, docs) {
        collection.count(function(err, count) {
            console.log(format("err = %s", err));
            console.log(format("count = %s", count));
            db.close();
        });
    });
});

C:\Users\Admin>node connect.js
err = null
count = 2

> db.test_events.find()
{ "_id" : ObjectId("54025fbb9c212420085d82fb"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source1
", "sourcetype" : "sourcetype1" }
{ "_id" : ObjectId("54025fbb9c212420085d82fc"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source2
", "sourcetype" : "sourcetype2" }
I came across an interesting topic about how to store key values in relational tables and whether we should move to NoSQL just for the sake of storing key values. The trouble with storing key values in relational tables is that the same key has multiple values. If we keep each record for a key value pair soon we have a flooding of the table but more importantly this just seems like  a collection and not an entity which is fundamental to the database model. We could alleviate the storage concerns and call them different entities by say calling one table as keys and another table as values and adding a relation between them.
That said, probably an easy to implement way is to store it as an XML in a single column. This alleviates the problem of having two columns where one is the fieldvalue and the other is the fieldvaluetype. Moreover, the column can be indexed and queried with XPath. This is why it is preferred over JSON.
Another approach is to use the EntityAttributeValue model also called the EAV model. Here the attributes are available as columns and they can be numerous with only a few columns holding values at some time. This is also called sparse matrix.
The thing to note here is that the tendency to add custom properties is not restricted to a single table and can become an epidemic in the system. That is why the data model may need to be redesigned or at least extended if such things crop up. The normalization is important just as much as the convenience of using key-values.
Key Value are generally stored using a hashing function because they are essentially collections. The hashing allows to bucket the keys and collisions are resolved by overflow lookups.
The NoSQL stores such as MongoDB serve more purposes as well. They are better suited for the following use cases:
column stores
key value stores
document stores
graph stores
If we look at the

use EventsDB

go
 

CREATE TABLE dbo.Event

( ID int identity not null,

Timestamp datetime not null

CONSTRAINT IX_Event_Timestamp

PRIMARY KEY CLUSTERED (Timestamp, ID)

WITH (IGNORE_DUP_KEY = OFF),

Host nvarchar(4000) not null,

Source nvarchar(100) not null,

SourceType nvarchar(100) not null,

FieldMap xml null,

);

INSERT INTO dbo.Event VALUES (GETDATE(), HOST_NAME(), N'Source1', N'SourceType1', NULL);

INSERT INTO dbo.Event VALUES (DATEADD(DD, 1, GETDATE()), HOST_NAME(), N'Source2', N'SourceType2', NULL);

UPDATE dbo.Event set FieldMap = '<xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>'

WHERE SOURCE=N'Source1'

 
 
UPDATE dbo.Event set FieldMap = '<xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>'

WHERE SOURCE=N'Source2'

SELECT * FROM dbo.Event

go

ID Timestamp Host Source SourceType FieldMap
1 2014-08-31 09:22:28.193 ADMIN-PC Source1 SourceType1 <xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>
2 2014-09-01 09:22:28.193 ADMIN-PC Source2 SourceType2 <xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>


SELECT ID, FieldMap.query('data(/xml/Source)') as Value

FROM dbo.Event

WHERE FieldMap.exist('/xml/SourceType') = 1

Friday, August 29, 2014

Today we will be quickly reviewing SEO - search engine optimization from the SEO guide.
Search engines have two major functions - 1. crawling and building an index 2. providing answers by calculating relevancy and serving results.

The web is connected by links between pages. Crawlers index these pages for fast lookups. Search engines have to provide answers in fractions of seconds because users lose focus after 2-3 seconds. Relevancy and importance from such voluminous data is determined in this time. SEO targets both these metrics. Relevance is a way of ranking what pertains to the user query. Importance is somewhat irrespective of the user query and focuses on the popularity of the site. There are several algorithms used to determine relevance and importance. Each Search engine may have their own.
Google recommends  the following to get better ranking:
  • don't cloak your pages differently to search engines from what they appear to users
  • organize your site with each page reachable by a static link
  • pages are cohesive with respect to the content they provide
  • use redirects and rel="canonical" for duplicate content.
Bing suggests the following:
  • construct clean URLs with keywords
  • provide keyword-rich content and refresh content
  • don't hide the text to be indexed in resources 
 There are three types of queries search users perform. These are :
"Do" transactional queries such as to buy a plane ticket
"Know" informational queries such as to find the name of a restaurant
 "Go" such as site specific queries such as LinkedIn

In spite of the efforts of the search engine, the users attention is not drawn to the results but to the words appearing in bold or titles or brief descriptions - things that are explicitly targeted in paid search listings.
 To get your page noticed, you could:
wrap your images, plugins, video and audio content with text description of the  content.
Search boxes can be supplemented with navigable and crawlable links and include a sitemap. Crawlable links means that the webpages are connected and the hrefs in the html point to say static links. If you want to hide content from the search engine, you could add rel="nofollow" for your hrefs or add meta robots tag or the robots.txt. When using keywords, use specific ones and don't stuff it on the content. keyword density does not help. Keywords in the titles help but don't make it longer than 65-75 characters. Always improve readability and emotional impact.

Search engines have millions of smaller databases that have keyword based indices. This makes it much faster for the engines to retrieve the data. Keywords thus play an important role in the search engine's ranking algorithm. Keywords appearing in title, text, image alt attributes and metadata promote a page's relevance. When keywords appear in the title, be mindful of the length, order and leverage branding.
Meta robot tags such as 'index/noindex' 'follow/nofollow' 'noarchive' 'nosnippet' 'noodp/noydir' restrict spider activity. They could be used judiciously.  The meta description tag is what appears together with a search listing.
URLs could be made in a way such that they are shorter, use keywords, use hyphens to separate keywords and are static.  Use canonicalization so that every unique piece of content has one and only one URL.

Thursday, August 28, 2014

We will be reviewing some simple random walk theorems. We will look at the amount of time that a SSRW is positive. If the random walk is at zero, it takes an even number of steps in either direction to return to zero.
We will look at the distribution when the random walk is positive. We can take an example with coin tosses. When we toss a coin, the probability that the number of heads equals k in 2n coin tosses is maximum when k = n . What we are saying is that we know for sure we can exceed the number of tails when the number of heads and the number of tails are equal. If we wanted to maximize the number for one outcome, we would have to choose the value 0 or 2n for the remainder.  Since we mentioned earlier that the probability a random walk reaches a certain count can be derived from the probabilities from the outcomes of the states from the distribution, we can express the random walk probabilities in terms of  the counts of the probabilities for the individual states. 

Tuesday, August 26, 2014

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public class Test
{
    public static void Main()
    {
        char[,] wordSearch = new char [3,3] { {'B', 'O', 'B'}, {'B', 'O', 'B'}, {'B', 'O', 'B'} };
        var wordList = new List<string>() { "BOB", "BOX", "COB" };
        printWordsInMaze(wordSearch, 3, 3, wordList);
    }
   
    public static void printWordsInMaze(char[,] wordSearch, int row, int col, List<string> wordList)
    {
        string[ ] directions = {"LR", "RL", "U", "D", "DUL", "DUR", "DDL","DDR"};
    for (int i = 0; i < col; i++)
    for (int j = 0; j < row; j++)
    {
        char letter = wordSearch[i,j];
          string pattern;
           List<string> candidates = wordList;
      
          int start = i;
          int end = j;
          foreach (var direction in directions)
          {
              int r = i;
              int c = j;
              pattern = String.Empty;
              switch (direction)
             {
                            case "LR":
                              while (c < col) { pattern += wordSearch[r,c]; c++;}
                              printMatchingWord(start, end, pattern, candidates, direction);
                            break;

                            case "RL":
                              while (c >= 0) { pattern+= wordSearch[r,c]; c--;}
                              printMatchingWord(start, end,pattern, candidates, direction);
                            break;

                            case "U":
                              while (r >= 0) { pattern += wordSearch[r,c]; r--;}
                              printMatchingWord(start, end,pattern, candidates, direction);
                            break;

                            case "D":
                              while (r < col) { pattern += wordSearch[r,c]; r++;}
                              printMatchingWord(start, end, pattern, candidates, direction);
                            break;

                            case "DUL" :
                              while (r >= 0 && c >= 0) { pattern += wordSearch[r,c]; r--; c--;}
                              printMatchingWord(start, end, pattern, candidates, direction);
                            break;

                            case "DUR" :
                              while (r >= 0 && c < col) { pattern += wordSearch[r,c]; r--; c++;}
                              printMatchingWord(start, end, pattern, candidates, direction);
                            break;

                            case "DDL" :
                              while (c >= 0 && r < col) { pattern += wordSearch[r,c]; c--; r++;}
                              printMatchingWord(start, end,pattern, candidates, direction);
                            break;

                            case "DDR" :
                              while (r < col && c < row) { pattern += wordSearch[r,c]; r++; c++;}
                              printMatchingWord(start, end, pattern, candidates, direction);
                            break;
               }
               //i = start;
               //j = end;
              }
         }
    }
   
private static void printMatchingWord(int i, int j, string pattern, List<string> candidates, string direction)
{
                                  foreach (var word in candidates)
                                  {
                                      Console.WriteLine("Pattern={0},word={1}", pattern, word);
                                       if (pattern.StartsWith(word))
                                        Console.WriteLine ("Word = {0} at row = {1} and col = {2} and direction = {3}", word, i,j, direction);
                                }
}

}

  • Success time: 0.06 memory: 33936 signal:0
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 0 and direction = DDR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OOO,word=BOB
    Pattern=OOO,word=BOX
    Pattern=OOO,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 0 and col = 2 and direction = DDL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 1 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OO,word=BOB
    Pattern=OO,word=BOX
    Pattern=OO,word=COB
    Pattern=OO,word=BOB
    Pattern=OO,word=BOX
    Pattern=OO,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 1 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BB,word=BOB
    Pattern=BB,word=BOX
    Pattern=BB,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BO,word=BOB
    Pattern=BO,word=BOX
    Pattern=BO,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 0 and direction = LR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 0 and direction = DUR
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OOO,word=BOB
    Pattern=OOO,word=BOX
    Pattern=OOO,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=OB,word=BOB
    Pattern=OB,word=BOX
    Pattern=OB,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=O,word=BOB
    Pattern=O,word=BOX
    Pattern=O,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 2 and direction = RL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=BBB,word=BOB
    Pattern=BBB,word=BOX
    Pattern=BBB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=BOB,word=BOB
    Word = BOB at row = 2 and col = 2 and direction = DUL
    Pattern=BOB,word=BOX
    Pattern=BOB,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB
    Pattern=B,word=BOB
    Pattern=B,word=BOX
    Pattern=B,word=COB