Tuesday, September 2, 2014

Today I want to discuss a few topics from algorithms and data structures. We will quickly review tree and graph algorithms.
To delete a node from a binary search tree, we have to consider three cases:
If the current node has no right child, then its left child becomes the node pointed to by the parent.
If the current node's right child has no left child, then the right child replaces the current node in the tree.
If the current node's right child has a left child, then the current node is replaced by the right child's left-most node (the in-order successor).
To insert a node into a binary search tree, walk down from the root comparing keys and attach the node as a leaf under the appropriate parent.
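As a quick illustration, here is a minimal Python sketch of these cases; the Node class and helper names are mine, chosen only to mirror the description above.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    # Walk down comparing keys and attach the new node as a leaf.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root

def bst_delete(root, key):
    # Delete key from the subtree rooted at root and return the new subtree root.
    if root is None:
        return None
    if key < root.key:
        root.left = bst_delete(root.left, key)
    elif key > root.key:
        root.right = bst_delete(root.right, key)
    else:
        # Case 1: no right child -- the left child takes this node's place.
        if root.right is None:
            return root.left
        # Case 2: the right child has no left child -- it replaces this node.
        if root.right.left is None:
            root.right.left = root.left
            return root.right
        # Case 3: splice out the left-most node of the right subtree
        # (the in-order successor) and move its key here.
        parent, succ = root.right, root.right.left
        while succ.left is not None:
            parent, succ = succ, succ.left
        parent.left = succ.right
        root.key = succ.key
    return root
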
An AVL tree is a self-balancing binary search tree. AVL trees maintain the property that the heights of the left and right subtrees of every node differ by at most 1.
If a subtree is missing, its height is taken to be -1 for the purpose of this comparison.
To insert a node into an AVL tree, insert the node just as in a BST. Then, walking back up the parent chain, update heights and check for a violation; where one is detected, rotate the tree to restore the balance.
If the node with the violation is A, then:
If the new node was inserted into the left subtree of A's left child, a single (right) rotation fixes it.
If the new node was inserted into the right subtree of A's right child, a single (left) rotation fixes it.
If the new node was inserted into the left subtree of A's right child, a double rotation is needed.
If the new node was inserted into the right subtree of A's left child, a double rotation is needed.
Deleting a node from an AVL tree is more involved than a single rotation because each node on the parent chain needs to be checked for violations, since the imbalance may propagate upwards.
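The following is a rough Python sketch of the rotations and of rebalancing on the way back up after an insert, assuming an AvlNode with a cached height field; all names are illustrative.

class AvlNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 0

def height(node):
    return node.height if node else -1      # a missing subtree has height -1

def update(node):
    node.height = 1 + max(height(node.left), height(node.right))

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y

def rebalance(node):
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                                      # left-heavy
        if height(node.left.left) < height(node.left.right):
            node.left = rotate_left(node.left)           # left-right: double rotation
        return rotate_right(node)                        # left-left: single rotation
    if balance < -1:                                     # right-heavy (mirror cases)
        if height(node.right.right) < height(node.right.left):
            node.right = rotate_right(node.right)        # right-left: double rotation
        return rotate_left(node)                         # right-right: single rotation
    return node

def avl_insert(node, key):
    # Insert as in a BST, then rebalance every node on the way back up.
    if node is None:
        return AvlNode(key)
    if key < node.key:
        node.left = avl_insert(node.left, key)
    elif key > node.key:
        node.right = avl_insert(node.right, key)
    return rebalance(node)
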
A related structure, the splay tree, is useful when subsequent operations tend to access recently added or recently accessed nodes. Splaying means moving the node to the root by rotations; the tree remains roughly balanced in an amortized sense.
A red-black tree is a binary search tree that satisfies the additional properties that:
Every node is either red or black.
The root and the leaves (the nil nodes) are black.
If a node is red, then its children are black.
For each node, all simple paths from the node to the descendant leaves contain the same number of black nodes.
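As a small illustration, here is a Python sketch that checks these properties on a tree of nodes; the node shape (key, color, left, right) is an assumption for the sketch, with None standing in for the black nil leaves.

RED, BLACK = "red", "black"

class RbNode:
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color = key, color
        self.left, self.right = left, right

def black_height(node):
    # Returns the black height of the subtree if it is valid, otherwise None.
    if node is None:                       # the nil leaves are black
        return 1
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                return None                # a red node must have black children
    lh = black_height(node.left)
    rh = black_height(node.right)
    if lh is None or rh is None or lh != rh:
        return None                        # every path must carry the same black count
    return lh + (1 if node.color == BLACK else 0)

def is_red_black(root):
    return (root is None or root.color == BLACK) and black_height(root) is not None
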
To insert a node into a red-black tree, insert as in a BST and then fix up the colors with three cases:
Case 1 recolors the nodes.
Case 2 performs a left rotation.
Case 3 performs a right rotation.
To delete a node from a red-black tree, delete as in a BST and then fix up the colors with four cases,
one of which involves a double rotation. The left and right operations are mirrored for the other subtree. During insertion we look at the color of the parent's sibling (the uncle), while during deletion we look at the colors of the node's sibling and that sibling's children.
Graph algorithms are briefly enumerated below.
Breadth first search traverses reachable vertices in order of the smallest number of edges from the source. Two algorithms with a similar structure, using a priority queue in place of the plain queue, are:
Dijkstra's single-source shortest path algorithm and
Prim's minimum spanning tree algorithm.
Dijkstra's algorithm repeatedly extracts the vertex with the minimum shortest-path estimate and adds it to the set of finalized vertices. It then relaxes all the edges outbound from that vertex.
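A minimal Python sketch of this extract-min and relax loop, assuming the graph is given as a dict from each vertex to a list of (neighbor, weight) pairs:

import heapq

def dijkstra(graph, source):
    dist = {source: 0}
    done = set()                          # vertices whose shortest path is final
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)        # extract the minimum estimate
        if u in done:
            continue
        done.add(u)
        for v, w in graph.get(u, []):     # relax every edge leaving u
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Example: dijkstra({"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}, "a")
# returns {"a": 0, "b": 1, "c": 3}.
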
Prim's algorithm builds a minimum spanning tree by repeatedly adding the cheapest safe edge that connects the tree to a vertex not yet in it.
Depth first search explores edges outbound from the most recently discovered vertex. The nodes are initially white; we color a node gray when we discover it and, after all of its edges have been explored, we color it black. The discovery and finishing times thus form a parenthesis structure: the intervals of any two nodes are either disjoint or nested one within the other.
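A sketch of this coloring with discovery and finishing times, again assuming an adjacency-list dict, might look like this:

def dfs(graph):
    # Assumes every vertex appears as a key in the adjacency dict.
    color = {u: "white" for u in graph}
    discover, finish = {}, {}
    time = 0

    def visit(u):
        nonlocal time
        time += 1
        discover[u] = time                # the node turns gray on discovery
        color[u] = "gray"
        for v in graph[u]:
            if color[v] == "white":
                visit(v)
        time += 1
        finish[u] = time                  # black once all its edges are explored
        color[u] = "black"

    for u in graph:
        if color[u] == "white":
            visit(u)
    # For any two vertices, the [discover, finish] intervals are either
    # disjoint or nested one within the other, never partially overlapping.
    return discover, finish
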
We now look at some data structures.
A hash_map is an associative container that indexes elements based on their hash, as opposed to a map, which orders the contained elements with the less-than operation. Collisions are resolved by chaining elements into buckets.
A multimap is like a map except that it allows duplicate keys. The range of entries sharing a key can be looked up with equal_range. A multiset is a set that allows duplicate keys. valarray and bitset are specialized containers.
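As a rough analogue only (not the C++ containers themselves), the duplicate-key idea can be mimicked in Python with a dict of lists:

from collections import defaultdict

multimap = defaultdict(list)
multimap["color"].append("red")
multimap["color"].append("blue")          # duplicate keys are allowed

print(multimap["color"])                  # ['red', 'blue'] -- the "equal range" for the key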

Monday, September 1, 2014

Some interview questions I came across online from Hanselman's blog; I am posting quick (and probably not sufficient) answers below:
Q: From constructor to destructor (taking into consideration Dispose() and the concept of non-deterministic finalization), what are the events fired as part of the ASP.NET System.Web.UI.Page lifecycle. Why are they important? What interesting things can you do at each?
A: The page goes through Request -> Start -> Initialization -> Load -> Validation -> Postback event handling -> Rendering -> Unload.
The events fired are:
PreInit - check the IsPostBack property to determine whether the page is being processed for the first time.
Init - raised after all controls have been initialized; used to initialize control properties.
InitComplete - raised by the Page object when initialization is complete.
PreLoad - after this the page loads view state for itself and all controls.
Load - corresponding to OnLoad, the page recursively loads all controls. This is often used to establish database connections.
Control events - e.g. TextChanged or a Button's Click event.
LoadComplete - used for anything else that needs to be loaded.
PreRender - used to make final changes to the contents of the page or its controls.
SaveStateComplete - at this point view state has been saved.
Render - the page calls this method on each control to produce its markup.
Unload - cleanup happens in the reverse sequence: controls and control-specific database connections, then the page itself, then logging and other request-specific tasks.

Q: What are ASHX files?  What are HttpHandlers?  Where can they be configured?
.ashx files are generic handlers, .ascx files are user controls, .asax files handle application-level events (Global.asax), .asmx is for web services, and .aspx is for pages.
Built-in HTTP handlers target extensions such as .ashx, .aspx, .asmx and trace.axd.
Handlers are configured in IIS using Add/Edit Application Extension Mapping and in the httpHandlers section of web.config.
An HttpHandler's ProcessRequest method is invoked to produce the response.


Q: What is needed to configure a new extension for use in ASP.NET? For example, what if I wanted my system to serve ASPX files with a *.jsp extension?
First you need to map the extension in IIS, and then map a handler to that extension in the application's configuration. The same handler can be reused for different extension mappings.

Q: What events fire when binding data to a data grid? What are they good for?
The Page.DataBinding and Control.DataBinding events fire when DataBind is called; for a data grid, ItemCreated and ItemDataBound fire for each row and are useful for customizing items as they are bound.

Q: Explain how PostBacks work, on both the client-side and server-side. How do I chain my own JavaScript into the client side without losing PostBack functionality?
PostBacks post information back to the server, such as login credentials or a selection from a control, in order to retrieve the display to be shown. Checking IsPostBack helps keep the code efficient. The client side posts back at the page level as well as the control level, configured by the AutoPostBack property of the controls. Custom JavaScript can be chained on the client side as long as it invokes, rather than replaces, the generated postback function.

Q: How does ViewState work and why is it either useful or evil?
ViewState is used to persist state across postbacks and is generally used to store any programmatic changes to the page's state. It can be useful for storing information, but it can also be misused to stuff in anything, which bloats the page. The StateBag behaves like a Hashtable.

Q: What is the OO relationship between an ASPX page and its CS/VB code behind file in ASP.NET 1.1? in 2.0?
The code-behind follows the Page Controller pattern. In ASP.NET 1.1 the .aspx page is compiled into a class that inherits from the code-behind class, which in turn inherits from Page; in 2.0 the two are partial classes merged at compile time. You specify the handlers in the code-behind for the declarations made in the page.

Q: What happens from the point an HTTP request is received on a TCP/IP port up until the Page fires the On_Load event?
The request goes through the HTTP pipeline (IIS and ASP.NET), is passed through the HTTP modules and then dispatched to the page handler, which runs the page lifecycle up to the Load event before the response is returned.

Q: How does IIS communicate at runtime with ASP.NET?  Where is ASP.NET at runtime in IIS5? IIS6?
The request is dispatched to the ASP.NET engine via the aspnet_isapi.dll ISAPI extension. In IIS5 the work is handed off to the aspnet_wp.exe worker process, while in IIS6 ASP.NET runs inside the w3wp.exe application-pool worker process.

Q: What is an assembly binding redirect? Where are the places an administrator or developer can affect how assembly binding policy is applied?
An assembly binding redirect tells the loader to bind to a different version of an assembly than the one the caller was compiled against. Binding policy can be affected at the application level (the application's config file), through publisher policy assemblies, and at the machine level (machine.config); a host can also influence binding for an AppDomain in code, while the rest are set through config files.

Q: Compare and contrast LoadLibrary(), CoCreateInstance(), CreateObject() and Assembly.Load().
LoadLibrary loads a native DLL, CoCreateInstance creates a COM object from a registered coclass, CreateObject is the late-bound (ProgID-based) way of doing the same from VB or script, and Assembly.Load loads a managed .NET assembly into the current AppDomain.

Sunday, August 31, 2014

In the Kullback-Leibler divergence that we mentioned in earlier posts, we saw that the divergence was measured word by word as the probability of that word against the background distribution. We calculate nw as the number of occurrences of the term w, and pw as that count divided by the total number of terms in the document, and we measured P(tk|q) as nw / (sum over all terms x in q of nx). When the divergence was greater than a threshold, we selected the term as a keyword. It is not necessary to measure the divergence of the terms one by one against the background distribution in a document, because the metric holds for any two distributions P(x) and Q(x); their divergence is measured as the sum over x of P(x) log(P(x)/Q(x)), or in the symmetrized form as the sum over x of (P(x) - Q(x)) log(P(x)/Q(x)). The term distribution of a sample document is then compared with the distribution over the categories, which number C. The probability distribution of a term tk in a document dj is the ratio of the term's frequency in that document to the overall term frequency across all documents when the term appears in the document, and zero otherwise. The term probability distribution across categories is normalized to 1.
The equation we refer to for the divergence comes in many forms.
Most formulations use a default value for when a term doesn't appear in either P or Q, because zero values skew the equation. This small probability epsilon corresponds to an unknown word.
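A small Python sketch of this word-by-word test follows; the epsilon and the threshold values are illustrative and not taken from the posts above.

from collections import Counter
from math import log

def distribution(terms):
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_divergence(p, q, eps=1e-6):
    # D(P||Q) = sum over x of P(x) log(P(x)/Q(x)), with eps for missing terms.
    terms = set(p) | set(q)
    return sum(p.get(t, eps) * log(p.get(t, eps) / q.get(t, eps)) for t in terms)

def keywords(doc_terms, background_terms, threshold=0.001, eps=1e-6):
    # Keep terms whose per-term contribution to the divergence exceeds the threshold.
    p, q = distribution(doc_terms), distribution(background_terms)
    return [t for t in p if p[t] * log(p[t] / q.get(t, eps)) > threshold]
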
Today I will resume some discussion on Keyword extraction.
We discussed co-occurrence of terms as an indicator of the keywords. This has traditionally meant clustering keywords based on similarity. Similarity is often measured based on Jensen-Shannon divergence or Kullback-Leibler divergence. However similarity doesn't give an indication of relevance. Pair-wise co-occurrence or mutual information gives some indication of relevance.
 Sometimes we need to use both or prefer one over the other based on chi-square goodness of fit.
In our case, co-occurrence of a term and a cluster means co-occurrence of the term and any term in the cluster although we could use nearest, farthest terms or the average from the cluster.
What we did was we populated co-occurrence matrix from the top occurring terms and their counts. For each of the terms, we count the co-occurrences with the frequent terms that we have selected. These frequent terms are based on a threshold we select.
When we classify, we take two terms and find the clusters they belong to. Words don't belong to any cluster initially. They are put in the same cluster based on mutual information, which is calculated as the ratio of the probability of the terms co-occurring to the product of their individual probabilities. We translate this to counts and calculate each probability from the counts in the co-occurrence matrix.
We measure the cluster quality by calculating the chi-square. This we do by summing the chi-square components measured for each word in the frequent terms. Each component is the square of the difference between the observed co-occurrence frequency and the expected frequency, divided by the expected frequency of co-occurrence. The expected frequency is calculated in turn as the product of the expected probability pg of that frequent word g and the total co-occurrence count nw of the term w with the frequent terms.
If a term has a large chi-square value, then it is relatively more important; if it has a low chi-square value, then it is relatively trivial. Chi-square gives a notion of the deviation from the expected counts, indicating the contribution each cluster makes and hence its likelihood to bring out the salient keywords. A condensed sketch of this computation appears below. We attempted Kullback-Leibler divergence as a method for keyword extraction as well, using one of the forms of the divergence given above.
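This sketch, in Python, assumes the sentences are already tokenized into lists of terms; the choice of thirty frequent terms and the variable names pg and nw follow the description above, but the details are simplified.

from collections import Counter

def chi_square_scores(sentences, num_frequent=30):
    term_freq = Counter(t for s in sentences for t in s)
    frequent = set(t for t, _ in term_freq.most_common(num_frequent))

    # co[w][g]: how often term w co-occurs with frequent term g in a sentence
    co = {}
    for s in sentences:
        s_set = set(s)
        for w in s_set:
            row = co.setdefault(w, Counter())
            for g in s_set & frequent:
                if g != w:
                    row[g] += 1

    # pg: the share of all co-occurrence counts attracted by each frequent term
    total_co = sum(sum(row.values()) for row in co.values()) or 1
    pg = {g: sum(co[w][g] for w in co) / total_co for g in frequent}

    scores = {}
    for w, row in co.items():
        nw = sum(row.values())            # total co-occurrence of w with the frequent terms
        if nw == 0:
            continue
        # sum of (observed - expected)^2 / expected, with expected = nw * pg[g]
        scores[w] = sum((row[g] - nw * pg[g]) ** 2 / (nw * pg[g])
                        for g in frequent if pg[g] > 0)
    return scores                         # a larger chi-square marks a more salient term
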

Saturday, August 30, 2014


Trying out Node.js, WebMatrix and MongoDB.

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;

// Connect to the local mongod instance and the 'test' database.
MongoClient.connect('mongodb://127.0.0.1:27017/test', function(err, db) {
    if (err) throw err;

    var collection = db.collection('test_events');
    // Insert two sample event documents, then log the error and the resulting count.
    collection.insert([
        { Timestamp: (new Date()).toString(), host: "local", source: "source1", sourcetype: "sourcetype1" },
        { Timestamp: (new Date()).toString(), host: "local", source: "source2", sourcetype: "sourcetype2" }
    ], { ordered: true }, function(err, docs) {
        collection.count(function(err, count) {
            console.log(format("err = %s", err));
            console.log(format("count = %s", count));
            db.close();
        });
    });
});

C:\Users\Admin>node connect.js
err = null
count = 2

> db.test_events.find()
{ "_id" : ObjectId("54025fbb9c212420085d82fb"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source1
", "sourcetype" : "sourcetype1" }
{ "_id" : ObjectId("54025fbb9c212420085d82fc"), "Timestamp" : "Sat Aug 30 2014 1
6:35:23 GMT-0700 (Pacific Daylight Time)", "host" : "local", "source" : "source2
", "sourcetype" : "sourcetype2" }
I came across an interesting topic about how to store key-values in relational tables and whether we should move to NoSQL just for the sake of storing key-values. The trouble with storing key-values in relational tables is that the same key can have multiple values. If we keep one record per key-value pair, we soon flood the table, but more importantly this looks like a collection rather than an entity, which is fundamental to the database model. We could alleviate the storage concern and model them as separate entities, say a keys table and a values table with a relation between them.
That said, probably the easiest approach to implement is to store the key-values as XML in a single column. This avoids having two columns, one for the field value and the other for the field value type. Moreover, the XML column can be indexed and queried with XPath, which is why it is preferred here over JSON.
Another approach is to use the Entity-Attribute-Value model, also called the EAV model. Here the attributes are available as columns; they can be numerous, with only a few holding values at any time. This is also called a sparse matrix.
The thing to note here is that the tendency to add custom properties is not restricted to a single table and can spread through the system. That is why the data model may need to be redesigned, or at least extended, if such requirements crop up. Normalization matters just as much as the convenience of key-values.
Key-values are generally stored using a hashing function because they are essentially collections. Hashing buckets the keys, and collisions are resolved by looking through the overflow entries in the bucket.
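A toy Python sketch of that bucketing idea, with collisions chained into per-bucket lists:

class BucketStore:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite an existing key
                return
        bucket.append((key, value))           # collision: chain into the bucket list

    def get(self, key, default=None):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return default
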
The NoSQL stores such as MongoDB serve more purposes as well. They are better suited for the following use cases:
column stores
key value stores
document stores
graph stores
If we look at the relational approach with an XML column, the schema and queries below illustrate it:

use EventsDB
go

CREATE TABLE dbo.Event
(
ID int identity not null,
Timestamp datetime not null,
Host nvarchar(4000) not null,
Source nvarchar(100) not null,
SourceType nvarchar(100) not null,
FieldMap xml null,
CONSTRAINT IX_Event_Timestamp PRIMARY KEY CLUSTERED (Timestamp, ID) WITH (IGNORE_DUP_KEY = OFF)
);

INSERT INTO dbo.Event VALUES (GETDATE(), HOST_NAME(), N'Source1', N'SourceType1', NULL);
INSERT INTO dbo.Event VALUES (DATEADD(DD, 1, GETDATE()), HOST_NAME(), N'Source2', N'SourceType2', NULL);

UPDATE dbo.Event SET FieldMap = '<xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>'
WHERE Source = N'Source1';

UPDATE dbo.Event SET FieldMap = '<xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>'
WHERE Source = N'Source2';

SELECT * FROM dbo.Event
go

ID Timestamp Host Source SourceType FieldMap
1 2014-08-31 09:22:28.193 ADMIN-PC Source1 SourceType1 <xml><Source>Source1</Source><SourceType>SourceType1</SourceType></xml>
2 2014-09-01 09:22:28.193 ADMIN-PC Source2 SourceType2 <xml><Source>Source2</Source><SourceType>SourceType2</SourceType></xml>


SELECT ID, FieldMap.query('data(/xml/Source)') as Value
FROM dbo.Event
WHERE FieldMap.exist('/xml/SourceType') = 1

Friday, August 29, 2014

Today we will be quickly reviewing SEO - search engine optimization from the SEO guide.
Search engines have two major functions - 1. crawling and building an index 2. providing answers by calculating relevancy and serving results.

The web is connected by links between pages. Crawlers index these pages for fast lookups. Search engines have to provide answers in fractions of a second because users lose focus after 2-3 seconds, so relevancy and importance must be determined over this voluminous data in that time. SEO targets both of these metrics. Relevance is a way of ranking what pertains to the user's query. Importance is largely independent of the query and focuses on the popularity of the site. There are several algorithms used to determine relevance and importance, and each search engine may have its own.
Google recommends  the following to get better ranking:
  • don't cloak: show search engines the same pages that users see
  • organize your site so that each page is reachable by a static link
  • keep pages cohesive with respect to the content they provide
  • use redirects and rel="canonical" for duplicate content.
Bing suggests the following:
  • construct clean URLs with keywords
  • provide keyword-rich content and refresh content
  • don't hide the text to be indexed in resources 
There are three types of queries search users perform. These are:
"Do" transactional queries, such as buying a plane ticket
"Know" informational queries, such as finding the name of a restaurant
"Go" navigational queries, such as going directly to a specific site like LinkedIn

In spite of the search engine's efforts, the user's attention is drawn not to the results as a whole but to the words that appear in bold, the titles and the brief descriptions - the same things that paid search listings explicitly target.
 To get your page noticed, you could:
wrap your images, plugins, video and audio content with a text description of the content.
Supplement search boxes with navigable, crawlable links and include a sitemap. Crawlable links mean that the webpages are connected and the hrefs in the HTML point to, say, static links. If you want to hide content from the search engine, you can add rel="nofollow" to your hrefs, add a meta robots tag, or use robots.txt. When using keywords, use specific ones and don't stuff them into the content; keyword density does not help. Keywords in the title help, but don't make the title longer than 65-75 characters. Always improve readability and emotional impact.

Search engines have millions of smaller databases with keyword-based indices. This makes it much faster for the engines to retrieve data, so keywords play an important role in a search engine's ranking algorithm. Keywords appearing in the title, text, image alt attributes and metadata promote a page's relevance. When keywords appear in the title, be mindful of the length and order, and leverage branding.
Meta robots tags such as index/noindex, follow/nofollow, noarchive, nosnippet and noodp/noydir restrict spider activity; they should be used judiciously. The meta description tag is what appears together with a search listing.
URLs should be kept short, use keywords, use hyphens to separate the keywords, and be static. Use canonicalization so that every unique piece of content has one and only one URL.