Thursday, November 28, 2013

Today we talk about the DBSCAN algorithm. It is a simple and effective density-based clustering algorithm that illustrates a number of important concepts. We first consider the notion of density. There are several methods to measure density, and they are distinct from measuring similarity. In the center-based approach, density is estimated for a particular point by counting the number of points within a specified radius, including the point itself. This method is simple, but the density measured this way depends on the radius. If the radius is large enough, then every point will have a density of m, the number of points in the dataset. If the radius is too small, then every point will have a density of 1.
With this approach, a point either lies in the interior of a density-based cluster, lies on the border, that is, within the neighborhood of one or more core points without being a core point itself, or is an outlier that is neither part of a cluster nor in the neighborhood of one. Thus points are classified as core points, border points, and noise points.
The algorithm proceeds like this. Any two core points that are close enough, i.e., within the radius of one another, are put in the same cluster. Any border point that is close enough to a core point is put in the same cluster as that core point. If a border point is close to core points from two or more clusters, the tie is resolved by choosing the closest one. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label all points as core, border, or noise points.
2. Eliminate the noise points.
3. Put an edge between all pairs of core points that are within the specified distance of one another.
4. Make a separate cluster out of each connected group of core points.
5. Assign each border point to one of the clusters of its associated core points.
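These steps can be sketched in C# as follows. This is a minimal illustration, assuming Euclidean distance over double[] points; the names (Cluster, RegionQuery, eps, minPts) are illustrative rather than taken from any particular library, and the brute-force neighborhood query is what gives the quadratic cost discussed below.

using System;
using System.Collections.Generic;
using System.Linq;

static class Dbscan
{
    // Returns a cluster id per point; -1 marks noise. eps is the radius,
    // minPts the minimum neighborhood size for a core point.
    public static int[] Cluster(double[][] points, double eps, int minPts)
    {
        var clusterIds = Enumerable.Repeat(-2, points.Length).ToArray(); // -2 = unvisited
        int clusterId = 0;

        for (int i = 0; i < points.Length; i++)
        {
            if (clusterIds[i] != -2) continue;

            var neighbors = RegionQuery(points, i, eps);
            if (neighbors.Count < minPts)
            {
                clusterIds[i] = -1;   // noise for now; it may later become a border point
                continue;
            }

            // i is a core point: grow a new cluster from it.
            clusterIds[i] = clusterId;
            var seeds = new Queue<int>(neighbors);
            while (seeds.Count > 0)
            {
                int j = seeds.Dequeue();
                if (clusterIds[j] == -1) clusterIds[j] = clusterId;   // border point
                if (clusterIds[j] != -2) continue;
                clusterIds[j] = clusterId;

                var jNeighbors = RegionQuery(points, j, eps);
                if (jNeighbors.Count >= minPts)                       // j is also a core point
                    foreach (var k in jNeighbors) seeds.Enqueue(k);
            }
            clusterId++;
        }
        return clusterIds;
    }

    // Brute-force neighborhood query: O(m) per point, O(m^2) overall.
    static List<int> RegionQuery(double[][] points, int index, double eps)
    {
        var result = new List<int>();
        for (int i = 0; i < points.Length; i++)
            if (Distance(points[index], points[i]) <= eps) result.Add(i);
        return result;
    }

    static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int d = 0; d < a.Length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.Sqrt(sum);
    }
}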
For each point we find the points that lie within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can reduce the cost of retrieving the points adjacent to a given center, bringing the overall complexity down to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border, or noise point.
The parameters for this algorithm are the radius and the minimum number of points that must fall within that radius, which together distinguish core points from border points.
This algorithm has trouble when the densities of the clusters vary widely. It also has difficulty with high-dimensional data, but it remains one of the simplest ways of clustering.
In order to work with a plugin for an editor such as MS Word, as mentioned in the previous post, we need to visit the object model exposed by that application. The Word object model comprises the following five top-level objects:
Application object - represents the Word application itself
Document object - represents a document and all of its content
Selection object - represents the area that is currently selected
Range object - represents a contiguous area in a document with a start and end character position. It exists only while the document is open and is not visible to the user.
Bookmark object - represents a contiguous area in a document with a start and end position, and it is persisted with the document.
As you may see, they closely represent the hierarchy seen in the user interface.
At the top of this hierarchy is the Application object. The Application object contains Document, Selection, Bookmark and Range objects. Both the Document object and the Selection object can contain Bookmark and Range objects.
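For instance, the hierarchy can be walked roughly like this from inside a VSTO document-level customization; this is a sketch, and the bookmark name "Intro" is purely illustrative.

// Inside the VSTO ThisDocument class: walk from the Application object down.
Word.Application app = this.Application;                 // top-level Application object
Word.Document doc = app.ActiveDocument;                  // the Document currently open
Word.Selection sel = app.Selection;                      // what the user has selected
Word.Range firstParagraph = doc.Paragraphs[1].Range;     // a Range inside the Document
Word.Bookmark mark = doc.Bookmarks.Add("Intro", firstParagraph);  // a Bookmark that persists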
Word also has content control objects. These let us control the input and the presentation of text and other types of content, such as a date picker or a combo box. They help you do the following:
prevent users from editing or deleting parts of a document, and
bind parts of a document or template to data.
One of the content controls is the rich text content control. A rich text content control contains text or other items such as tables, pictures or other controls.
A rich text content control has a DefaultTextStyle and a PlaceholderText member. The former gets the name of the character style used to format the text, and the latter gets or sets the text that is displayed in the rich text content control.
A document also has XMLNode controls to directly manipulate markup and its hierarchy. This lets you define XML content and a schema that can be queried with XPath and displayed with content control objects. This is done by adding a custom XML part with the data from the XML and then binding the controls to the custom XML part. XMLNodes can have a SmartTag object that represents the tag associated with the node.
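A sketch of that flow from a VSTO document-level project is below; treat the exact members (CustomXMLParts, AddRichTextContentControl, XMLMapping.SetMapping) and the sample XML as illustrative assumptions rather than a definitive recipe.

// 'Word' = Microsoft.Office.Interop.Word, 'Office' = Microsoft.Office.Core (default VSTO aliases).
// Add the data as a custom XML part, then bind a rich text content control to one of its nodes.
string xml = "<book xmlns='urn:demo'><title>Working title</title></book>";
Office.CustomXMLPart part = this.CustomXMLParts.Add(xml);

Microsoft.Office.Tools.Word.RichTextContentControl titleControl =
    this.Controls.AddRichTextContentControl(this.Paragraphs[1].Range, "titleControl");
titleControl.PlaceholderText = "Enter the title here";
titleControl.XMLMapping.SetMapping("/ns:book/ns:title", "xmlns:ns='urn:demo'", part);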
The Range object can be used to format text in a Word document:
// Format the whole document content; this code runs inside the VSTO ThisDocument class.
object start = this.Content.Start;   // e.g. this.Sentences[0].Start
object end = this.Content.End;
Word.Range rng = this.Range(ref start, ref end);   // or this.Paragraphs[1].Range
rng.Font.Size = 14;
rng.Font.Name = "Arial";
rng.ParagraphFormat.Alignment = Word.WdParagraphAlignment.wdAlignParagraphCenter;   // any WdParagraphAlignment value works here
object indentStyle = "Normal Indent";
rng.set_Style(ref indentStyle);
rng.Select();

// Find "find me" within the range and replace every occurrence with "Found".
rng.Find.ClearFormatting();
rng.Find.Text = "find me";
rng.Replacement.ClearFormatting();
rng.Replacement.Text = "Found";
object replaceAll = Word.WdReplace.wdReplaceAll;
rng.Find.Execute(Replace: replaceAll);   // C# 4.0 lets us omit the remaining optional ref arguments on COM calls

Wednesday, November 27, 2013

In the post about keyword extraction by the Kullback-Leibler technique, we talked about having a sliding scale of threshold for selecting keywords. We also talked about a viewer that lets users see the keywords highlighted in place, using fonts bigger than the rest of the text. As the threshold on the sliding scale is increased, more keywords appear in the viewer.
In today's post, I'm not only going to cover how to render the text but also how to make the rendering independent of the actual text.
We will use offsets, or word counts, to find the relative positions of keywords from the start of the text.
With the relative positions, we can find the keywords dynamically. We use pagination for blocks of text, where each page can have a varying word count depending on what fits the page view, which is determined independently. We represent blocks of text with an abstraction we will refer to as pages, and we represent what the user sees with a page view that can change to accommodate text spanning one or more pages. Whenever that happens, re-pagination occurs and each page view is updated with its corresponding page. Thus there is a one-to-one relationship between a page and a page view. The page behind each page view is expected to change as the number of keywords in the text increases and the space they occupy grows. By decoupling the view from the page, we let both vary independently. The views also depend on the available space on the user's screen, the overall font size, and any stretching or skewing the user does to the window in which the text displays. The views could also change with changes to the font size of the text. The page, on the other hand, changes whenever text is added to or removed from the overall document. The viewer could be built with tools for flipping through the pages and for going to a specific place in the document via bookmarks. Delegating the viewer altogether to programs that are dedicated editors is possible by simply marking the keywords in the text with special markup that those editors recognize. It's also possible that those editors support custom rendering in case they don't already give keywords special treatment.
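A minimal sketch of that decoupling in C# is shown below; the Page and PageView types and their members are hypothetical, not from any existing library. Keyword positions are carried as word offsets from the start of the document so they remain valid across re-pagination.

using System.Collections.Generic;

// A Page is a block of text plus the keyword offsets (word counts from the start
// of the document) that fall inside it. It knows nothing about rendering.
class Page
{
    public int StartWordOffset;                           // offset of the page's first word
    public List<string> Words = new List<string>();
    public List<int> KeywordOffsets = new List<int>();    // document-relative offsets
}

// A PageView renders exactly one Page for the current screen size and font.
// When the view size or font changes, re-pagination swaps in a new Page.
class PageView
{
    public Page Page;                    // one-to-one with a Page
    public double FontSize = 12;
    public double KeywordFontSize = 16;  // keywords are drawn larger than the rest

    public IEnumerable<(string word, bool isKeyword)> Layout()
    {
        for (int i = 0; i < Page.Words.Count; i++)
        {
            int offset = Page.StartWordOffset + i;
            yield return (Page.Words[i], Page.KeywordOffsets.Contains(offset));
        }
    }
}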
Whether a viewer is written as part of the program that selects the keywords or as a plugin for existing editors, the viewer is concerned only with drawing the user's attention to keywords. Selecting the keywords need not happen on every rendering of the document; it can be done beforehand or as part of the document life cycle. That is, the scope in which the keyword extraction runs is determined by what is convenient for the tool. As an example, the text of each document could be sent to a centralized database where the documents are processed and keywords are selected. The viewers residing on users' devices could choose to download the processed document whenever necessary. The uploading and downloading of the text need only be done when the contents change; otherwise the processed document can simply be cached. This eliminates the time it would otherwise take to serve the document to the user.
The page and the page view are artifacts of the viewer and have no place in the processed text that we keep in such a proposed centralized repository. The repository and the processing see only unpaginated, structured text.
There are some interesting properties of Pearson's hashing function that we brought up in the previous post. We will now discuss some variations:
1) Compare it with a hash that is updated at each step as h[i] = (h[i-1] + c[i]) mod table_size. Because addition is commutative, that additive function does not separate anagrams, and it demonstrates a significant deviation from uniformity.
2) That said, the non-uniform distribution of the additive hashing function above does not necessarily confer a large performance penalty.
3) If the range of the input characters can be limited, a smaller auxiliary table may be used, e.g., for alphanumeric input.
4) In some cases, the hash values may be required to be larger than an eight-bit representation. To extend the result to sixteen bits, we could do the following:
apply h to the string and call the result H1,
add 1, modulo 256, to the first character of the string,
apply h to the modified string to get H2,
concatenate H1 and H2 to get a 16-bit index.
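A sketch of that extension in C# follows. The eight-bit hash here is the one described in the Pearson post below, and the table T is just one fixed permutation of 0..255 standing in for the randomish bytes; treat the class and member names as illustrative.

using System;
using System.Linq;
using System.Text;

static class Pearson16
{
    // A fixed permutation of 0..255 stands in for the table of 256 randomish bytes.
    static readonly byte[] T =
        Enumerable.Range(0, 256).Select(i => (byte)((i * 167 + 13) & 0xFF)).ToArray();

    // The original eight-bit Pearson hash.
    static byte Hash8(byte[] data)
    {
        byte h = 0;
        foreach (byte c in data) h = T[h ^ c];
        return h;
    }

    // Sixteen-bit variant: hash the string, bump the first character by one
    // (mod 256), hash again, and concatenate the two eight-bit results.
    public static ushort Hash16(string s)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(s);
        byte h1 = Hash8(bytes);
        if (bytes.Length > 0) bytes[0] = (byte)((bytes[0] + 1) & 0xFF);
        byte h2 = Hash8(bytes);
        return (ushort)((h1 << 8) | h2);
    }
}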
Note that when this hashing was applied to the same spell-checker words as the original hash function, there was only a marginal improvement in collisions, indicating that both perform just about the same.
5) Collision handling can be improved by using a succession of hash indices, and the original hashing function is well suited for this. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible values, since strings differing in one character cannot produce the same hash value.
6) This hashing algorithm lends itself to perfect and minimal hashing. Recall that perfect hashing is one in which there are no collisions, and minimal hashing is one in which the integers onto which the particular list of words is mapped form a contiguous set. Minimal perfect hashing is useful when a predetermined set of high-frequency words is expected as input and the resulting hash has to be used as an index into an array relating those words. We can use this hashing algorithm to take a set of 31 common English words and map them onto unique integers between 1 and 31, in alphabetical order, by using a special case of the table T.
7) The advantages of Pearson's hashing function include the following:
there is no restriction on the length of the string
the length does not need to be computed beforehand
there is very little computation for each character of the input being hashed
similar strings are not likely to collide
The disadvantages are that the output value ranges for the hash come in powers of 2 and getting other ranges is more complicated, and that more auxiliary memory is required for the table.

Tuesday, November 26, 2013

Pearson's function for fast hashing of variable-length text strings is an interesting hashing technique.
It proposes a hashing function that takes a word W as input, consisting of n characters C1, C2, ..., Cn, each character being represented by one byte. The function returns an integer in the range 0 to 255. A table T of 256 randomish bytes is used in the process for an indexed memory read and to perform an exclusive-or operation against the input. The length of the string does not need to be known up front because the null terminator at the end of the string can be used to stop the loop instead.
The steps for the hashing are described as follows:
h[0] = 0;
for (int i = 1; i <= n; i++)
  h[i] = T[ h[i-1] xor C[i] ];
return h[n];
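The same steps as runnable C# are sketched below; the table T is just one fixed permutation of 0..255 standing in for the 256 randomish bytes, and the class name is illustrative.

using System.Linq;
using System.Text;

static class PearsonHash
{
    // A fixed permutation of 0..255 stands in for the table of randomish bytes.
    static readonly byte[] T =
        Enumerable.Range(0, 256).Select(i => (byte)((i * 167 + 13) & 0xFF)).ToArray();

    // h[0] = 0; h[i] = T[h[i-1] xor C[i]]; return h[n]
    public static byte Hash(string word)
    {
        byte h = 0;
        foreach (byte c in Encoding.ASCII.GetBytes(word))
            h = T[h ^ c];
        return h;
    }
}

For example, PearsonHash.Hash("hello") returns a value between 0 and 255, and changing any single character of the input scrambles the result.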
There are two desirable properties of this algorithm, which derive from cryptographic checksums and message authentication codes. First, small changes to the data result in large and nearly random changes to the checksum. Second, the effect of changing one part of the data must not be cancelable by an easily computed change to some other part, and this too is preserved by the algorithm.
The downside is that it is length-reducing and based on exclusive-or of substrings. There is no undesirable effect from using one particular T over another, as long as it is random. On the other hand, manipulating T to intentionally disperse the results for short, similar strings turned out to be a poor choice.
h can be evaluated on the following:
h[n] is random
if two input strings differ by a single bit, they will never collide
h was applied to a spelling checker dictionary with 26,662 words and the resulting output values were not very uneven.
We may recall the textbook example for hashing:
template<class T> size_t Hash<T>::operator()(const T& key) const
{
   size_t res = 0;
   size_t len = sizeof(T);
   const char* p = reinterpret_cast<const char*>(&key);
   while (len--) res = (res << 1) ^ *p++;   // shift-and-xor over the raw bytes of the key
   return res;
}
Let us take a closer look at the insertion and deletion of nodes in B-Trees. To ensure that very few disk accesses are needed to locate a key, we have to maintain a balanced tree during insertion and deletion. During insertion, we check whether the child node is able to hold an additional key. If not, a new sibling node is added to the tree and the child's keys are redistributed to make room. During redistribution, the middle key of the overfilled child node is sent up to the parent and the node is split at that key. If we descend the tree during insertion and the root itself is full, the root is split as well and some of its keys are given to a new child. This change in the level of the tree keeps the tree balanced. During deletion, we check whether the parent can absorb the child nodes; the key to be removed is replaced with the largest key of the left subtree or the smallest key of the right subtree, and this replacement is guaranteed to come from a leaf. If a node has too few keys, less than half full, after a deletion, it looks to a sibling for keys. If such a sibling exists, the node takes a key from the parent and the parent takes the extra key from the sibling. If the sibling also has too few keys, then the two nodes are joined to form one full node.
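A standard textbook sketch of that insertion path in C# follows (split before descending, with a minimum degree t so a node holds at most 2t-1 keys); it is illustrative only and keeps everything in memory rather than on disk.

using System.Collections.Generic;

// A minimal B-tree node; with minimum degree t a node holds at most 2t-1 keys.
class BTreeNode
{
    public List<int> Keys = new List<int>();
    public List<BTreeNode> Children = new List<BTreeNode>();
    public bool IsLeaf => Children.Count == 0;
    public bool IsFull(int t) => Keys.Count == 2 * t - 1;
}

class BTree
{
    readonly int t;                       // minimum degree
    BTreeNode root = new BTreeNode();
    public BTree(int minDegree) { t = minDegree; }

    public void Insert(int key)
    {
        // If the root is full, split it first: the middle key moves into a new
        // root and the tree grows one level while staying balanced.
        if (root.IsFull(t))
        {
            var newRoot = new BTreeNode();
            newRoot.Children.Add(root);
            SplitChild(newRoot, 0);
            root = newRoot;
        }
        InsertNonFull(root, key);
    }

    // Split the full child at index i of parent: the middle key is sent up to
    // the parent and the child's remaining keys are redistributed over two siblings.
    void SplitChild(BTreeNode parent, int i)
    {
        var child = parent.Children[i];
        var sibling = new BTreeNode();

        parent.Keys.Insert(i, child.Keys[t - 1]);                 // middle key goes to the parent
        parent.Children.Insert(i + 1, sibling);

        sibling.Keys.AddRange(child.Keys.GetRange(t, t - 1));     // upper keys move to the sibling
        child.Keys.RemoveRange(t - 1, t);
        if (!child.IsLeaf)
        {
            sibling.Children.AddRange(child.Children.GetRange(t, t));
            child.Children.RemoveRange(t, t);
        }
    }

    void InsertNonFull(BTreeNode node, int key)
    {
        int i = node.Keys.Count - 1;
        if (node.IsLeaf)
        {
            while (i >= 0 && key < node.Keys[i]) i--;
            node.Keys.Insert(i + 1, key);
            return;
        }
        while (i >= 0 && key < node.Keys[i]) i--;
        i++;
        if (node.Children[i].IsFull(t))
        {
            SplitChild(node, i);             // split before descending
            if (key > node.Keys[i]) i++;     // decide which of the two halves to enter
        }
        InsertNonFull(node.Children[i], key);
    }
}

Deletion would follow the mirror image of this policy described above: borrow a key from a sibling through the parent, or merge two underfull siblings into one node.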
B* trees are similar. The only difference is that the nodes are kept 2/3 full. This results in better utilization of space in the tree and slightly better performance.
B+ trees store all the keys at the leaf level, with their associated data values. The parent nodes keep duplicates of the keys to guide the search. During insertion and deletion, the parent nodes have to be updated correctly. For example, when modifying the first key in a leaf, the last right pointer found while descending the nodes will require modification to reflect the new key value. Also, since all the keys are at the leaf level, the leaves may be linked for sequential access.
In the B++ tree, the keys are again all in the leaves. Each node can hold k keys and the root node can hold 3k keys. During insertion, we check whether a child node is full before we descend a level in the tree. If it is, we take the child node and its two adjacent siblings, and merge and redistribute them. If the two adjacent nodes are also full, then another node is added, resulting in four nodes, each three-quarters full. Similarly, during deletion, we check whether the child node is half full before we descend a level. If it is, the keys in the child node and its two adjacent siblings are all merged and redistributed. If the two adjacent nodes are also half full, then they are merged into two nodes, each three-quarters full.
We increase the level of the tree when the root is full during insertion, in which case we distribute its keys to four new nodes, each three-quarters full. During deletion, we check the child nodes, and if there are only three and they are all half full, they are merged into the root, reducing the level of the tree.

Monday, November 25, 2013

We mentioned external sorting and dictionaries in the previous post on sorting very large data. Dictionaries are very large files too, and they are implemented as an index to the actual file, containing the key and the record address of the data. A dictionary could be implemented with a red-black tree, replacing pointers with offsets from the beginning of the index file and using random access to reference nodes of the tree. However, every transition on a link could imply a disk access and would be prohibitively expensive. Since disk access is by sectors, it is better to group several keys together in each node to minimize the number of I/O operations. This is what B-Trees do.
B-Trees arrange keys surrounded by pointers or offsets that lead to keys less than or greater than the key value. For example, all keys less than the key value are to the left and all keys greater than the key value are to the right. We can locate any key in this tree with far fewer disk accesses than in a regular binary tree: if we were to group 100 keys per node, we could search over 1,000,000 keys with just three node accesses. To ensure this property holds, we must maintain a balanced tree during insertion and deletion.
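A sketch of that lookup in C# is below; the Node type is hypothetical and the keys are plain ints. With about 100 keys per node the branching factor is roughly 101, so three levels already cover about a million keys, and each level costs one node (disk) access.

using System.Collections.Generic;

class Node
{
    public List<int> Keys = new List<int>();        // sorted keys in this node
    public List<Node> Children = new List<Node>();  // Keys.Count + 1 children when internal
}

static class BTreeSearch
{
    // Each recursive step reads one node; keys less than Keys[i] are reached
    // through Children[i], keys greater through Children[i+1].
    public static bool Contains(Node node, int key)
    {
        int i = 0;
        while (i < node.Keys.Count && key > node.Keys[i]) i++;
        if (i < node.Keys.Count && key == node.Keys[i]) return true;
        if (node.Children.Count == 0) return false;   // reached a leaf: key not present
        return Contains(node.Children[i], key);
    }
}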
B-Trees come in many flavors. For example, we have the B-Tree, the B*-Tree and the B+-Tree. The standard B-Tree stores keys and data in both internal and leaf nodes. B* trees are similar, except that the nodes are kept 2/3 full for better utilization of space in the tree and better performance.
In a B+ tree, all keys are stored at the leaf level with their associated data values. Duplicates of the keys appear in internal parent nodes to guide the search. Also, the pointers have a slightly different meaning: the left pointer designates all keys less than the value, while the right pointer designates all keys greater than or equal to the value. The author also mentions a B++-Tree that is similar to the B+ tree except for the split/join strategy, i.e., keys are scattered to and gathered from other nodes during insertion and deletion.