Wednesday, November 27, 2013

In the post about Keyword extraction by the Kullback-Leibler technique, we talked about having a sliding scale of thresholds for selecting keywords. We also talked about a viewer that lets users see the keywords highlighted in place, using fonts bigger than the rest of the text. And we talked about more keywords appearing in the viewer as the threshold on the sliding scale is increased.
In today's post, I'm going to cover not only how to render the text but also how to make the rendering independent of the actual text.
We will use offsets or word counts to find the relative positions of keywords from the start of the document.
With the relative positions, we are able to find the keywords dynamically. We use pagination for blocks of text, where each page can hold a varying number of words, as many as fit the page view, which is determined independently. We represent the text with an abstraction that lets us keep track of these blocks, which we will refer to as pages. We represent what the user sees with a page view that can change to accommodate text spanning one or more of the pages. Whenever that happens, re-pagination is involved and the page view is updated with the corresponding page. Thus there is a one-to-one relationship between a page and its page view. The page for each page view is expected to change as the number of keywords in the text increases and the space they occupy grows. By decoupling the view from the page, we let both of them vary independently.
The views also depend on the available space on the user's screen, the overall font size, and the stretching or skewing that users do to the window in which the text displays. The views could also change with changes to the font size of the text. The page, on the other hand, could change whenever text is added to or removed from the overall document. The viewer could be built with tools to flip through the pages and to go to a specific location in the document via bookmarks. Decoupling the viewer altogether, in favor of programs that are dedicated editors, is possible by just marking the keywords in the text with special markup that those editors recognize. It is also possible that those editors support custom rendering in case they don't already give special treatment to keywords.
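As a minimal sketch of this decoupling (the class and member names here are hypothetical, not taken from any particular viewer), the page and the page view could be kept as separate abstractions:

#include <cstddef>
#include <string>
#include <vector>

// A Page is a block of text identified by its word offset and word count in
// the overall document; it knows nothing about rendering.
struct Page {
    std::size_t startWordOffset;   // relative position of the first word
    std::size_t wordCount;         // number of words, set during pagination
};

// A PageView renders exactly one Page and can change independently when the
// font size or the window dimensions change; re-pagination swaps its Page.
class PageView {
public:
    explicit PageView(const Page& page) : page_(page) {}
    void setPage(const Page& page) { page_ = page; }        // after re-pagination
    void setFontSize(int points)   { fontSize_ = points; }  // view-only change
    // Drawing is elided in this sketch; a real view would render the words of
    // page_, using a larger font for any word whose document-relative offset
    // appears in keywordOffsets.
    void render(const std::vector<std::string>& words,
                const std::vector<std::size_t>& keywordOffsets) const {}
private:
    Page page_;
    int fontSize_ = 12;
};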
Whether a viewer is written as part of the program that selects the keywords or as a plugin for existing editors, the viewer is concerned only with drawing the user's attention to the keywords. The keyword selection need not happen for each rendering of the document; it can be done beforehand or as part of the document life cycle. That is, the scope in which the keyword extraction runs is determined by whatever is convenient for the tool. As an example, the text of each document could be sent to a centralized database where the documents are processed and keywords are selected. The viewers that reside on the devices used by the reader could then download the processed document whenever necessary. The uploading and downloading of the text needs to happen only when the contents change; otherwise the processed document can simply be cached. This reduces the time it takes to serve the document to the user.
The page and the page views are artifacts of the viewer and have no place in the processed text that we keep in such a proposed centralized repository. The repository and the processing see only unpaginated, structured text.
There are some interesting properties of Pearson's hashing function, which we brought up in the previous post. We will discuss some variations now:
1) Compared with a hash that is updated in each step as h[i] = (h[i-1] + c[i]) mod table_size, Pearson's function separates anagrams; the additive function does not, and it shows a significant deviation from uniformity.
2) That said, the non-uniform distribution of the additive hashing function above does not necessarily impose a large performance penalty.
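As a small aside, the additive hash used for the comparison can be sketched as follows (the function name is just for illustration); because addition is commutative, anagrams such as "listen" and "silent" always collide:

#include <cstdint>
#include <string>

// Additive hash: sums the character codes modulo the table size, so the
// order of the characters does not affect the result.
std::uint8_t additiveHash(const std::string& s, unsigned tableSize = 256)
{
    unsigned h = 0;
    for (unsigned char c : s)
        h = (h + c) % tableSize;
    return static_cast<std::uint8_t>(h);
}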
3) If the range of the input characters can be limited, a smaller auxiliary table may be used, e.g. for alphanumeric input.
4) In some cases, hash values wider than the eight-bit representation may be required. To extend the result to a sixteen-bit value, we could do the following:
apply h to the string and call the result H1;
add 1, modulo 256, to the first character of the string;
apply h to the modified string to get H2;
concatenate H1 and H2 to get a 16-bit index.
Note that when this hashing is applied to the same spell-checker words as the original hash function, there is only a marginal improvement in collisions, indicating that both perform about the same.
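A rough sketch of this sixteen-bit extension, assuming a pearson8 routine that implements the basic eight-bit function described in the previous post:

#include <cstdint>
#include <string>

std::uint8_t pearson8(const std::string& s);  // the basic 8-bit Pearson hash (assumed)

// Hash the string twice: once as-is and once with its first character
// incremented modulo 256, then concatenate the two 8-bit results.
std::uint16_t pearson16(std::string s)
{
    std::uint8_t h1 = pearson8(s);
    if (!s.empty())
        s[0] = static_cast<char>((static_cast<std::uint8_t>(s[0]) + 1) % 256);
    std::uint8_t h2 = pearson8(s);
    return static_cast<std::uint16_t>((h1 << 8) | h2);
}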
5) Collision handling can be improved by using a succession of hash indices, and the original hashing function is well suited for this. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible values, since strings differing in exactly one character cannot produce the same hash value.
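A sketch of such a probe sequence, again using the hypothetical pearson8 helper:

#include <cstdint>
#include <string>

std::uint8_t pearson8(const std::string& s);  // the basic 8-bit Pearson hash (assumed)

// The n-th probe index for a key (n in 0..255): increment the first character
// by n, modulo 256, and hash. Since strings differing in exactly one character
// never collide, the 256 probes are all distinct and cover every slot.
std::uint8_t nthProbe(std::string key, int n)
{
    if (!key.empty())
        key[0] = static_cast<char>((static_cast<std::uint8_t>(key[0]) + n) % 256);
    return pearson8(key);
}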
6) This hashing algorithm lends itself to perfect and minimal hashing. Recall that a perfect hashing is one in which there are no collisions, and a minimal hashing is one in which the integers onto which a particular list of words is mapped form a contiguous set. Minimal perfect hashing is useful when a predetermined set of high-frequency words is expected as input and the resulting hash has to be used as an index into an array of records for those words. With a special case of the table T, this hashing algorithm can take a set of 31 common English words and map them onto unique integers between 1 and 31, in alphabetical order.
7) The advantages of Pearson's hashing function include the following:
there is no restriction on the length of the string;
the length does not need to be computed beforehand;
there is very little computation for each character of the input being hashed;
similar strings are not likely to collide.
The disadvantages are that the output range of the hash is a power of 2 (getting other ranges is more complicated) and that additional auxiliary memory is required for the table.

Tuesday, November 26, 2013

Pearson's function for fast hashing of variable-length text strings is an interesting hashing technique.
It proposes a hashing function that takes a word W as input, consisting of n characters C1, C2, ..., Cn, each character being represented by one byte, and returns an integer in the range 0 to 255. A table T of 256 random-ish bytes is used in the process for an indexed memory read and for an exclusive-or operation against the input. The length of the string does not need to be known in advance, because the null terminator at the end of the string can be used instead.
The steps for the hashing are described as follows:
h[0] = 0;
for (int i = 1; i <= n; i++)
  h[i] = T[ h[i-1] ^ C[i] ];
return h[n];
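Putting the steps together, a self-contained sketch in C++ could look like the following; initializing T as a shuffled identity permutation is just one reasonable way to get 256 distinct, randomly ordered bytes:

#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>
#include <random>
#include <string>

// T: a table of 256 distinct, pseudo-randomly ordered bytes.
static const std::array<std::uint8_t, 256> T = [] {
    std::array<std::uint8_t, 256> t{};
    std::iota(t.begin(), t.end(), 0);                  // 0, 1, ..., 255
    std::shuffle(t.begin(), t.end(), std::mt19937{42});
    return t;
}();

// Pearson's hash: one table lookup and one XOR per input character.
std::uint8_t pearsonHash(const std::string& word)
{
    std::uint8_t h = 0;
    for (unsigned char c : word)
        h = T[h ^ c];
    return h;
}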
There are two desirable properties of this algorithm that derive from cryptographic checksums and message authentication codes. First, small changes to the data result in large and nearly random changes to the checksum. Second, the effect of changing one part of the data must not be cancelable by an easily computed change to some other part; this property is preserved by the algorithm.
The downside is that it is length-reducing on the input and is based on exclusive-ORs of substrings. There is no undesirable effect from using one particular random T rather than another. On the other hand, manipulating T to intentionally disperse the results for short, similar strings turned out to be a poor choice.
h can be evaluated on the following:
h[n] is random;
if two input strings differ by a single bit, they will never collide;
when h was applied to a spelling-checker dictionary with 26,662 words, the resulting output values were reasonably uniform.
We may recall the textbook example for hashing:
template<class T> size_t Hash<T>::operator()(const T& key) const
{
   size_t res = 0;
   size_t len = sizeof(T);
   const char* p = reinterpret_cast<const char*>(&key);
   // shift-and-XOR over the raw bytes of the key
   while (len--) res = (res << 1) ^ *p++;
   return res;
}
Let us take a closer look at the insertion and deletion of nodes in B-Trees. To ensure that there are very few disk accesses to locate a key, we have to maintain a balanced tree during insertion and deletion. During insertion, we check whether the child node is able to hold an additional key. If not, a new sibling node is added to the tree and the child's keys are redistributed to make room. During the redistribution, the middle key of the overfilled child is sent up to the parent and the child is split at that key. If the root is full when we descend the tree during insertion, it is split as well and some of its keys are given to a new child; this change in the level of the tree is what keeps it balanced. During deletion, we check as we descend whether the root can absorb its child nodes. The key to be removed is replaced with the largest key of the left subtree or the smallest key of the right subtree; this replacement is guaranteed to come from a leaf. If deletion leaves a node with too few keys, say less than half full, it looks for keys from a sibling. If such a sibling exists, the node takes a key from the parent and the parent takes the extra key from the sibling. If the sibling also has too few keys to spare, then the two nodes are joined to form one full node.
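A rough sketch of the split step described above; the node layout is hypothetical and the surrounding insert logic (finding the child, descending, handling the root) is omitted:

#include <cstddef>
#include <vector>

struct BTreeNode {
    std::vector<int> keys;                 // kept sorted
    std::vector<BTreeNode*> children;      // keys.size() + 1 children when internal
    bool isLeaf() const { return children.empty(); }
};

// Split an overfilled child of 'parent': the middle key moves up into the
// parent and the keys to its right move into a new sibling node.
void splitChild(BTreeNode* parent, std::size_t childIndex)
{
    BTreeNode* child = parent->children[childIndex];
    std::size_t mid = child->keys.size() / 2;

    BTreeNode* sibling = new BTreeNode();
    sibling->keys.assign(child->keys.begin() + mid + 1, child->keys.end());
    if (!child->isLeaf())
        sibling->children.assign(child->children.begin() + mid + 1,
                                 child->children.end());

    // The middle key goes up to the parent; the new sibling sits to its right.
    parent->keys.insert(parent->keys.begin() + childIndex, child->keys[mid]);
    parent->children.insert(parent->children.begin() + childIndex + 1, sibling);

    // Shrink the original child to the keys left of the middle.
    if (!child->isLeaf())
        child->children.resize(mid + 1);
    child->keys.resize(mid);
}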
B* trees are similar. The only difference is that the nodes are kept 2/3 full. This results in better utilization of space in the tree and slightly better performance.
B+ trees store all the keys at the leaf level, with their associated data values. The parent nodes keep the duplicates of the keys to guide the search. During insertion and deletion, the parent nodes have to be updated correctly. For example, when modifying the first key in a leaf, the last right pointer found while descending the nodes will require modification to reflect the new key value. Also, since all the keys are in the leaf level, they may be linked for sequential access.
In the B++ tree, the keys are again all in the leaves. Each node can hold k keys and the root node can hold 3k keys. During insertion, we check to see if a child node is full before we descend a level in the tree. If it is, we take the child node and the two adjacent nodes and merge and redistribute them. If the two adjacent nodes are also full, then another node is added, resulting in four nodes, each three-quarters full. Similarly, during deletion, we check to see if the child node is half full before we descend a level. If it is, the keys in the child node and the two adjacent nodes are all merged and redistributed. If the two adjacent nodes are also half full, then they are merged into two nodes, each three-quarters full.
We increase the level of the tree when the root is full during insertion, in which case we distribute its keys to four new nodes, each three-quarters full. During deletion, we check the child nodes, and if there are only three that are all half full, they are merged into the root, reducing the level of the tree.

Monday, November 25, 2013

We mentioned external sorting and dictionaries in the previous post for sorting very large data sets. Dictionaries are very large files too; they are implemented as an index to the actual file and contain the key and record address of the data. A dictionary could be implemented with a red-black tree, replacing pointers with offsets from the beginning of the index file and using random access to reference nodes of the tree. However, every transition over a link could imply a disk access, which would be prohibitively expensive. Since disk access is by sectors, it's better to group several keys together in each node to minimize the number of I/O operations. This is what B-Trees do.
B-Trees arrange keys surrounded by pointers or offsets to keys that are less than or greater than the key value. For example, all keys less than the key value are to the left and all keys greater than the key value are to the right. We can locate any key in this tree with fewer disk accesses than in a regular binary tree because, if we group 100 keys per node, we can search over 1,000,000 keys with just three node accesses. To ensure this property holds, we must maintain a balanced tree during insertion and deletion.
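As a sketch of how such a search proceeds (readNode here is a hypothetical routine that fetches one node from disk, so each call stands for one disk access):

#include <algorithm>
#include <vector>

struct Node {
    std::vector<int> keys;           // roughly 100 sorted keys per node
    std::vector<long> childOffsets;  // keys.size() + 1 entries; empty for leaves
};

Node readNode(long offset);          // one disk access (assumed)

// With ~100 keys per node, a tree of height three covers on the order of
// 100^3 = 1,000,000 keys, so only three readNode calls are needed.
bool search(long rootOffset, int key)
{
    long offset = rootOffset;
    while (true) {
        Node node = readNode(offset);
        auto it = std::lower_bound(node.keys.begin(), node.keys.end(), key);
        if (it != node.keys.end() && *it == key)
            return true;
        if (node.childOffsets.empty())
            return false;            // reached a leaf without finding the key
        offset = node.childOffsets[it - node.keys.begin()];
    }
}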
B-Trees come in many flavors: for example, the B-Tree, the B*-Tree and the B+-Tree. The standard B-Tree stores keys and data in both internal and leaf nodes. B* trees are similar, except that the nodes are kept 2/3 full for better utilization of space in the tree and slightly better performance.
In a B+ tree, all keys are stored at the leaf level with their associated data values. Duplicates of the keys appear in internal parent nodes to guide the search. Also, the pointers have a slightly different meaning: the left pointer designates all keys less than the value, while the right pointer designates all keys greater than or equal to the value. The author also mentions a B++-Tree that is similar to the B+ tree except for the split/join strategy, i.e. the keys are scattered to and gathered from neighboring nodes during insertion and deletion.
In this post, I want to take a break to talk about sorting data from very large files. For data sets that fit in memory, we can use one of several sorting algorithms. For data that is bigger than what can reside in memory, we will look at techniques for sorting externally and implementing dictionaries with B-Trees.
In external sorting, we use what is known as replacement selection to establish initial runs, followed by a poly-phase merge sort to merge the runs into one sorted file. A run is a sequence of numbers that is already sorted.
To create the initial runs, a buffer is allocated in memory to act as a holding area for several records. Initially the buffer is filled; then the following steps are repeated until the input is exhausted:
Select the record with the smallest key that is >= the key of the last record written.
If all the keys are smaller than the key of the last record written, then we have reached the end of the run; select the record with the smallest key as the first record of the next run.
Write the selected record.
Replace the selected record with a new record from the input.
If we take a two-record buffer, this is easy to illustrate. We load the buffer and write the record with the smallest key to the destination. We replace this record with the next one from the input and repeat the process. This continues until all the keys in the buffer are smaller than the last key written. At that point, we terminate the run and start another.
This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for output.
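A minimal sketch of replacement selection, with records simplified to integers and the buffer realized as a pair of heaps (one for the current run, one holding records that must wait for the next run):

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

std::vector<std::vector<int>> makeRuns(const std::vector<int>& input, std::size_t bufferSize)
{
    std::priority_queue<int, std::vector<int>, std::greater<int>> current, next;
    std::vector<std::vector<int>> runs(1);
    std::size_t pos = 0;

    // Fill the buffer.
    while (pos < input.size() && current.size() < bufferSize)
        current.push(input[pos++]);

    while (!current.empty()) {
        int smallest = current.top();        // smallest key >= last record written
        current.pop();
        runs.back().push_back(smallest);     // write the selected record

        // Replace the selected record with a new record from the input.
        if (pos < input.size()) {
            if (input[pos] >= smallest)
                current.push(input[pos]);    // can still extend the current run
            else
                next.push(input[pos]);       // held back for the next run
            ++pos;
        }

        if (current.empty() && !next.empty()) {
            runs.emplace_back();             // end of this run; start another
            std::swap(current, next);
        }
    }
    return runs;
}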
Initially all the data is in one place. The source is read and the runs are distributed to the other places. After the initial runs are created, they are merged. We distribute a Fibonacci number of runs to each input so that the number of passes can be minimized. Fibonacci numbers have the property that each number is the sum of the two preceding numbers. When we take the runs on each input and merge them onto the third, we are left with run counts that walk down this sequence; each merge takes the numbers down by a notch. As mentioned, this reduces the number of passes over the data and is very helpful when we are considering sequential access. We want optimized access because we want to be efficient over the large data set.
In a poly-phase merge, we work with frames of data at a time: in each phase we take runs that are sequential on both sources and merge them into a longer run on a third storage. Then we relabel the storage so that the merged data acts as a source, and merge it with the remainder of one of the original sources. Finally we write the result back to the destination.
The Fibonacci run counts let us merge perfectly, in the sense that there are no leftover runs on any source. That is why the step of creating the initial runs with the right counts helps.
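A toy sketch of how the run counts walk down the Fibonacci sequence, assuming an initial distribution of 13 and 8 runs on the two inputs (only the counts are tracked, not the actual records):

#include <cstdio>

int main()
{
    int a = 13, b = 8;   // runs on the two input tapes, a >= b, both Fibonacci
    while (b > 0) {
        // Merge b pairs of runs onto the output tape; a - b runs stay behind.
        std::printf("merge %d runs; remaining counts: %d and %d\n", b, a - b, b);
        int leftover = a - b;
        a = b;           // the merged output becomes the larger input next phase
        b = leftover;
    }
    std::printf("finished with a single sorted run\n");
    return 0;
}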
courtesy: Niemann
 

Sunday, November 24, 2013

Johnson's clustering scheme was first written up during the summer of 1965, when he was working at Bell Labs during a break from his PhD. It introduced cluster hierarchies and came with a program written in FORTRAN. Agglomerative hierarchical clustering places each object into its own cluster and gradually merges these atomic clusters into larger and larger clusters until there is only one. Divisive hierarchical clustering reverses this process by starting with all objects in one cluster and then subdividing it into smaller clusters. Hierarchical classification is a special sequence of partitional classifications.
Serial clustering handles the classification one pattern at a time, whereas simultaneous clustering works with the entire set of patterns at the same time. A clustering algorithm can be expressed mathematically either by graph theory or by matrix algebra: the former is used to describe notions such as connectedness and completeness, and the latter is used to describe, say, mean-square error.
For the purposes of clustering vectors based on KLD distances, we will use a simple partition scheme like the K-means we talked about. This is because we already know the source categories when we pick the sample documents. Besides, when we do keyword extraction, we will build a vocabulary from these documents. We may come across a large number of lemmas, and we planned on adding to an initially empty document those terms that keep the KLD distance above a threshold, i.e. D(avg-pt/q) > 1.
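One way to read that selection step is sketched below; the probability maps and the threshold value are assumptions for illustration, with p holding a term's probability in the average (sample) distribution and q its probability in the background corpus:

#include <cmath>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> selectKeywords(const std::map<std::string, double>& p,
                                        const std::map<std::string, double>& q,
                                        double threshold)
{
    std::vector<std::string> keywords;
    for (const auto& entry : p) {
        const std::string& term = entry.first;
        double pt = entry.second;
        auto it = q.find(term);
        if (pt <= 0.0 || it == q.end() || it->second <= 0.0)
            continue;                                           // no usable background estimate
        double contribution = pt * std::log(pt / it->second);   // this term's KLD contribution
        if (contribution > threshold)
            keywords.push_back(term);                           // keeps the distance above the threshold
    }
    return keywords;
}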
In addition, we mentioned varying the threshold on a sliding scale to retrieve more and more of the terms as needed. We can set this based on the samples at hand so that it gives reasonable results. There is no single value of the threshold that is universally right; however, larger and larger values of the threshold give increasingly discriminative keywords. This can be built into a visual tool that renders the keywords in a slightly bigger font than the rest, so that the keywords attract attention in place alongside the rest of the text. Together with a control for the sliding scale, we can increase the number of terms that get this bigger font, making it easier for the reader to find the keywords and to see the changes in their number. Building this visual tool should be fairly straightforward in that it renders the same text in different forms. The bigger font or magnifying-glass display for keywords can be enabled by setting the font and size markup to be different from the rest.