Tuesday, October 14, 2014

Today we look at inverted lists. In an Apache Lucene index, a document consists of multiple fields and a field consists of multiple terms. How a document is indexed is controlled by labeling each field as to whether it needs to be analyzed (preprocessed), indexed, or stored (ready to be retrieved). The inverted index keeps a list of fields, their labels, and their offsets into the postings list. The postings entry for a field holds a term, its frequency, and an offset into the frequency list. The frequency list gives the documents, sorted by their id, that contain the term, each with offsets (say, in word counts) to the positions where the term appears. The whole index, also called the vocabulary, is kept sorted by field, then by term, and then by document. Thus it is a hierarchy of lists.
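As a rough illustration of this hierarchy of lists, here is a minimal in-memory sketch in C. The struct names and fields are assumptions made for illustration only and do not reflect Lucene's actual file format.

#include <stddef.h>

/* One entry in a term's postings: a document plus where the term occurs in it. */
typedef struct {
    int  doc_id;          /* documents are kept sorted by id                    */
    int  term_freq;       /* how often the term occurs in this document         */
    int *positions;       /* word offsets at which the term appears             */
} Posting;

/* A term in a field's dictionary, pointing at its postings. */
typedef struct {
    const char *term;
    int         doc_freq;      /* number of documents containing the term       */
    Posting    *postings;      /* sorted by doc_id                              */
    size_t      num_postings;
} TermEntry;

/* A field: its labels (analyzed / indexed / stored) plus its sorted terms. */
typedef struct {
    const char *name;
    int         analyzed, indexed, stored;
    TermEntry  *terms;         /* sorted lexicographically                      */
    size_t      num_terms;
} FieldIndex;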

We now look at the document retrieval step, the second of the two steps listed earlier. Here we are interested in finding the top-ranking matching documents. Each document is checked for a match and retrieved one at a time. The candidate documents are found from the postings lists of the terms in the query, and those postings lists can be traversed concurrently. The TF-IDF score of each document is calculated to determine how well it matches. Apache Lucene calculates a static and a dynamic score, based on the document alone and on the dot product of the query and document vectors respectively, and it also applies a boost factor to the term frequency. The reason we bring up the static score is that the documents could additionally be indexed in static-score order. However, we will just use the notion of a score to rank the documents by inserting them into a priority queue data structure, as sketched below.
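To make the ranking concrete, here is a minimal sketch in C that scores a candidate document with a plain tf-idf weight and keeps the top k candidates in a fixed-size min-heap. The helper names (tfidf, top_k_offer) are hypothetical, and this is a simplification, not Lucene's actual scorer.

#include <math.h>
#include <stddef.h>

typedef struct { int doc_id; double score; } Scored;

/* Plain tf-idf: term frequency in the document times log(N / document frequency). */
static double tfidf(int tf, int df, int num_docs)
{
    return tf * log((double)num_docs / (double)df);
}

/* Min-heap of at most k entries: the root is the weakest of the current top k,
 * so a new candidate is admitted only if it beats the root.                    */
static void sift_down(Scored *heap, size_t size, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, smallest = i;
        if (l < size && heap[l].score < heap[smallest].score) smallest = l;
        if (r < size && heap[r].score < heap[smallest].score) smallest = r;
        if (smallest == i) return;
        Scored tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

static void top_k_offer(Scored *heap, size_t *size, size_t k, Scored cand)
{
    if (*size < k) {                           /* heap not full yet: sift up */
        size_t i = (*size)++;
        heap[i] = cand;
        while (i > 0 && heap[(i - 1) / 2].score > heap[i].score) {
            Scored tmp = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = tmp;
            i = (i - 1) / 2;
        }
    } else if (cand.score > heap[0].score) {   /* replace the weakest entry */
        heap[0] = cand;
        sift_down(heap, *size, 0);
    }
}

Each document encountered while walking the postings lists would be scored with tfidf and offered to the heap; when the traversal ends, the heap holds the top k documents.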

For large collections, the index can be distributed across multiple machines. The index is partitioned either by term or by document. Usually the partitioning scheme is determined outside the system, because the caller knows best how to partition. In ElasticSearch, for example, the machines are organized in rows and columns: each column represents a partition of the documents, while each row represents a replica of the corpus.
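A small sketch, under assumed names, of how such a row-and-column layout might route requests: a document's column (partition) is chosen from its id, and any row of that column can serve as a replica. This is for illustration only and is not ElasticSearch's actual routing code.

#include <stddef.h>

typedef struct {
    const char *node_address;   /* machine holding one (row, column) cell      */
} Cell;

typedef struct {
    Cell   *cells;              /* num_rows * num_columns machines             */
    size_t  num_rows;           /* rows    = replicas of the corpus            */
    size_t  num_columns;        /* columns = partitions of the documents       */
} Cluster;

/* Pick the column (document partition) for a document id. */
static size_t column_for_doc(unsigned long doc_id, size_t num_columns)
{
    return (size_t)(doc_id % num_columns);
}

/* Any row can answer for a column; a query fans out to all columns of one row. */
static const Cell *replica_for(const Cluster *c, size_t row, size_t column)
{
    return &c->cells[row * c->num_columns + column];
}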

Note that for the score we use, the IDF is calculated over the whole corpus, not just a partition. In this case, either we treat the partitions as uniform fractions of the corpus, or we poll the partitions and collect the statistics from each.
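Both options can be sketched directly. In the first, each of the partitions is assumed to be a uniform fraction of the corpus, so one partition's counts are simply scaled up; in the second, every partition is polled for its document frequency and document count and the IDF is computed from the sums. The function names are assumptions for illustration.

#include <math.h>
#include <stddef.h>

/* Option 1: treat each of num_partitions partitions as a uniform fraction,
 * scaling one partition's local counts up to the whole corpus.              */
static double idf_uniform(int local_df, int local_docs, int num_partitions)
{
    double df = (double)local_df   * num_partitions;
    double n  = (double)local_docs * num_partitions;
    return log(n / df);
}

/* Option 2: poll every partition, sum the per-partition document frequencies
 * and document counts, and compute the IDF over the whole corpus.            */
static double idf_polled(const int *dfs, const int *docs, size_t num_partitions)
{
    double df = 0.0, n = 0.0;
    for (size_t i = 0; i < num_partitions; ++i) {
        df += dfs[i];
        n  += docs[i];
    }
    return log(n / df);
}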


#codingexercise
Implement strtok
#include <stddef.h>

char *strtok(char *s, const char *delim)
{
    static char *last;
    char *spanp, *tok;
    int c, sc;

    if (s == NULL && (s = last) == NULL)
        return NULL;

    /* Skip (span) leading delimiters. */
cont:
    c = *s++;
    for (spanp = (char *)delim; (sc = *spanp++) != 0;) {
        if (c == sc)
            goto cont;
    }

    if (c == 0) {            /* no non-delimiter characters left */
        last = NULL;
        return NULL;
    }
    tok = s - 1;

    /* Scan the token; stop at the next delimiter or at the terminating NUL.
     * Note that delim must have one NUL; we stop if we see that, too.       */
    for (;;) {
        c = *s++;
        spanp = (char *)delim;
        do {
            if ((sc = *spanp++) == c) {
                if (c == 0)
                    s = NULL;
                else
                    s[-1] = '\0';
                last = s;
                return tok;
            }
        } while (sc != 0);
    }
    /* NOTREACHED */
}

Courtesy: Apple
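A quick way to exercise the implementation above, assuming it is compiled into the same program:

#include <stdio.h>

int main(void)
{
    char line[] = "lucene,index,,postings";
    for (char *tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ","))
        printf("[%s]\n", tok);   /* prints [lucene] [index] [postings]; empty fields are skipped */
    return 0;
}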


My back has been hurting for some time now.
