Friday, June 28, 2013

A frequently occurring interview question is to print the nodes of a tree using breadth first traversal. Breadth first traversal visits the nodes level by level, from left to right, starting at the root and descending to the leaves. Within a level, nodes appear in the same order as their parents. The standard way to order the visits is to maintain a queue. The root is added to the queue; there are no more nodes at its level. When the root is dequeued, its children, if any, are added to the queue. When one of those children is dequeued, its own children are enqueued in turn. When the last node at any level is dequeued and its children added, the next node to be dequeued is the first node of the next level. We repeat this until the queue is empty, which happens when every node has been added and removed exactly once. Since each node is enqueued and dequeued once, the time complexity is linear. 

        static void PrintBreadthNodes(BinaryTreeNode root)
        {
            if (root == null) return;
            // a null entry in the queue marks the end of a level
            var queue = new Queue<BinaryTreeNode>();
            queue.Enqueue(root);
            queue.Enqueue(null);
            while (queue.Count > 0)
            {
                var item = queue.Dequeue();
                if (item != null)
                {
                    Console.Write(item); // assumes BinaryTreeNode.ToString() prints the node's value
                    // enqueue the children so they are printed on the next level
                    if (item.left != null) queue.Enqueue(item.left);
                    if (item.right != null) queue.Enqueue(item.right);
                }
                else
                {
                    // finished a level; stop if nothing remains, otherwise
                    // re-insert the marker and start a new output line
                    if (queue.Count == 0) return;
                    queue.Enqueue(null);
                    Console.WriteLine();
                }
            }
        }

Thursday, June 27, 2013

Mining multimedia data on the web
Multimedia can be embedded in web pages and associated with link information. The surrounding text and link information can be regarded as features. Using web page layout mining techniques, a web page can be partitioned into a set of semantic blocks, and each block that contains multimedia data can be treated as a whole. Searching and organizing multimedia data then amounts to searching and organizing these blocks. VIPS, discussed earlier, can help identify the surrounding text, and this text can be used to build an image index. Image search is then partially completed using traditional text search techniques. To construct a web image search, a block-to-image relation is used in addition to the block-to-page and page-to-block relations.
For the automatic classification of web documents, different vendors maintain their own taxonomy of web documents, and new documents are classified against this taxonomy. Otherwise, the procedures described earlier for keyword-based document classification and keyword-based association analysis also apply here.
We discussed web content and web structure mining in the previous post; we now discuss web usage mining. Here, web log records are used to discover who accesses which pages and with what access patterns. This can give clues about who the potential customers are, enhance the quality and delivery of internet information services to end users, and improve server performance. Web log records grow quickly and can amount to a huge data size. There is a lot of information that can be mined, but making sure it is valid and reliable is equally important. So we begin with cleaning, condensing, and transforming as preprocessing steps. Next, with the available URL, time, IP address, and web page content information, a weblog database is populated. This is used for a fairly typical data-warehouse-based multidimensional OLAP analysis. Then we use association mining, as discussed earlier, to find patterns and trends of web access. For example, user browsing sequences of web pages can be helpful to improve their web experience. Systems can be improved with better web caching, web page prefetching, web page swapping, and an understanding of the nature of web traffic and of user reaction and motivation. Web sites that improve themselves based on observed access patterns are referred to as adaptive sites.
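As a rough illustration of the preprocessing step, the following Python sketch parses web log records into (IP, time, URL) tuples and tallies page and visitor frequencies; the Apache common log format and the file name access.log are assumptions made for the example.

import re
from collections import Counter

# assumed Apache common log format; the file name 'access.log' is hypothetical
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+')

def parse_log(path):
    # cleaning and condensing: keep only well-formed, successful page requests
    records = []
    with open(path) as f:
        for line in f:
            m = LOG_PATTERN.match(line)
            if m and m.group('status') == '200':
                records.append((m.group('ip'), m.group('time'), m.group('url')))
    return records

def access_patterns(records):
    # who accesses what pages: these counts feed the weblog database and OLAP analysis
    pages = Counter(url for _, _, url in records)
    visitors = Counter(ip for ip, _, _ in records)
    return pages.most_common(10), visitors.most_common(10)

# usage: top_pages, top_visitors = access_patterns(parse_log('access.log'))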
Web usage, together with web content and web structure, helps with web page ranking. Hence the quality of search improves, because search becomes contextualized and personalized.
Having talked about the VIPS algorithm and structural relationships, it may be interesting to note that we can extract the page-to-block and block-to-page relationships to construct a graph model with a page graph and a block graph. Based on this graph model, new link analysis algorithms can be designed that are capable of discovering the intrinsic semantic structure of the web. The block-to-page (link structure) and page-to-block (page layout) relationships are used in block-level link analysis. The block-to-page relationship is obtained from link analysis and represented as a matrix, which gives a more accurate and robust representation of the link structure of the web. The page-to-block relationship is obtained from the page layout analysis that segments each page into blocks. From these we construct the block graph and, in turn, the image graph. This image graph better reflects the semantic relationships between the images. 
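As a toy sketch of this construction, the following numpy snippet builds a page graph and a block graph from a page-to-block matrix (from layout analysis) and a block-to-page matrix (from link analysis); the small matrices and their weights are made up purely for illustration.

import numpy as np

# toy example: 3 pages, 4 blocks (all numbers are made up)
# X[p, b] : page-to-block weight from page layout analysis (importance of block b on page p)
X = np.array([[0.7, 0.3, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.4, 0.6]])

# Z[b, p] : block-to-page weight from link analysis (block b links to page p)
Z = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

# page graph and block graph induced by combining the two relations
page_graph = X @ Z    # page-to-page transitions through the blocks they contain
block_graph = Z @ X   # block-to-block transitions through the pages they link to

print(page_graph)
print(block_graph)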

Wednesday, June 26, 2013

Mining the world wide web
The world wide web serves as a huge, widely distributed, global information service center for a variety of markets such as finance, consumer, education, government, e-commerce, and many others. In addition to the content on the web pages, web page access, usage information, and hyperlinks provide additional sources for data mining. The challenges are: 1) The web is too huge for effective data warehousing and data mining. 2) Web pages are complex, lack a unifying structure, and vary quite a bit. 3) The web is a highly dynamic information source. 4) The web serves a broad diversity of user communities. 5) The bulk of the information is irrelevant to any given query.
Web search engines serve up resources on the internet, and they usually index the web pages and store huge keyword-based indices. However, this approach has limitations. First, a topic can contain hundreds of thousands of documents, which can lead to a huge number of document entries returned by a search engine. Second, relevant documents may not be retrieved because they don't contain the keywords. Since keyword-based search is not sufficient for web resource discovery, web mining has to address search on web structures, rank the contents, discover patterns, and so on. It is typically categorized as web content mining, web structure mining, and web usage mining.
Web pages have a DOM tree structure and a layout that is generally shared by many pages, so vision-based page segmentation is a common technique. The page layout is taken and the blocks are extracted while finding the separators, so that the web page can be represented as a semantic tree.
Search engines also automatically identify authoritative web pages. This is done based on the collective endorsement of web pages by way of hyperlinks pointing from one page to another; these hyperlinks convey a notion of authority. Also, a so-called hub provides a collection of links to authorities. Hub pages may not be prominent, and there may exist few links pointing to them, but they can be mined to find authoritative pages. The HITS algorithm (Hyperlink-Induced Topic Search) uses the query terms to collect a starting set, also called the root set. Since many of these pages are presumably relevant, the root set is expanded to a base set by including additional pages up to a threshold, and a weight propagation method is initiated. Finally, HITS outputs a short list of pages with large hub weights and a short list with large authority weights. To work around the size of the hub pages, the weights of the links are adjusted and the hub pages are broken down into smaller units. Google's PageRank uses something similar. Other newer algorithms include block-level link analysis and the block-to-page relationship. Next we will look at mining multimedia data on the web.
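The weight propagation at the heart of HITS can be sketched in a few lines; the toy adjacency matrix below is made up, and a real implementation would build it from the base set of pages retrieved for the query.

import numpy as np

def hits(adjacency, iterations=50):
    # minimal HITS weight propagation; adjacency[i, j] = 1 if page i links to page j
    n = adjacency.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iterations):
        auths = adjacency.T @ hubs   # a page is a good authority if good hubs point to it
        auths /= np.linalg.norm(auths)
        hubs = adjacency @ auths     # a page is a good hub if it points to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# toy base set of five pages (links made up for illustration)
A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 0, 0]])
hubs, auths = hits(A)
print("hubs:", hubs.round(3))
print("authorities:", auths.round(3))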
Comparison of decision tree versus vector space model for document databases
A decision tree operates well on structured data where each tuple is defined by a set of attribute-value pairs. The analysis decides which features are most discriminating and builds the tree top down. Once the tree is constructed, it is applied to the records one by one. Alternatively, the rules can be derived from the tree and ordered in a flat list based on their discriminative power and occurrence frequency. Either way, both the attribute-value pairs and the order of evaluation are set prior to evaluating new data. This is efficient for large data sets where each data object can be evaluated against the tree to find the label to assign.
But for document databases, this is not the case. The set of attributes or dimensions is not fixed, and furthermore, some attributes may be more relevant than others. Vector space models work well in such a high-dimensional space since they use a mapping function from the term space to a quantitative variable. However, a vector space model may assign improper weights to rare items since it does not take into account the distribution characteristics of other data points. This can be addressed with a preprocessing step called feature selection, which removes the terms in the training documents that are statistically uncorrelated with the class labels.
To recap, here are the advantages and disadvantages of using decision trees versus vector space model.
 
Decision Tree
Advantages:
·         Useful for structured data
·         Applicable to hierarchical categorical distinctions
·         Automated tree induction is straightforward
·         Accuracy at each level can be determined before growing the tree
·         Can be printed out and interpreted
·         Amount of training data reduced as we descend the decision stumps
Disadvantages:
·         Force features to be checked in a specific order even if the features may act relatively independent
·         Doesn’t work well with weak predictors of correct label

Vector space model
Advantages:
·         Works with varied length attribute value pairs
·         Allows quantitative representation of data points
·         Allows all features to act in parallel
·         Ordering is not required
·         Overcomes weak predictors with feature selection
Disadvantages:
·         Disregards class distributions
·         May assign large weights to rare items
·         Can be extremely slow

For example,  the following code attempts to find the topic as a single keyword in the provided text.

import nltk
import nltk.corpus
from nltk.text import TextCollection
from nltk.cluster import KMeansClusterer, euclidean_distance
from numpy import array

# method to get the topic of a given text
def getTopic(text):
    # clean input: lowercase, strip punctuation, drop stop words
    stop = set(open('stopwords.txt').read().split())
    src = [w.strip(" .,?!") for w in nltk.word_tokenize(text.lower()) if w not in stop]

    # candidate keywords: the ten most frequent words longer than three characters
    fdist = nltk.FreqDist(w for w in src if len(w) > 3)
    candidates = [w for w, _ in fdist.most_common(10)]

    # score each candidate by its tf-idf weight against the Brown corpus collection
    brown = TextCollection(nltk.corpus.brown)
    scores = [brown.tf_idf(w, src) for w in candidates]
    vectors = [array([s]) for s in scores]

    # cluster the one-dimensional tf-idf vectors
    clusterer = KMeansClusterer(3, euclidean_distance, avoid_empty_clusters=True)
    assignments = clusterer.cluster(vectors, assign_clusters=True)

    # pick the candidate closest to the center of the largest cluster
    largest = max(set(assignments), key=assignments.count)
    center = clusterer.means()[largest][0]
    _, topic = min((abs(s - center), w)
                   for w, s, a in zip(candidates, scores, assignments) if a == largest)
    print(topic)
 

Tuesday, June 25, 2013

Keyword-based association analysis: Such analysis collects sets of keywords or terms that occur together and finds correlations between them. Each document is considered a transaction, and the set of keywords in that document is considered the set of items in the transaction. The association mining process can detect compound associations, such as domain-dependent terms or phrases, and non-compound associations, such as units of measure. Mining based on such terms and phrases, typically bigrams and trigrams, is called term-level association mining, as opposed to mining on individual words. Terms and phrases are automatically tagged, and the number of meaningless results is greatly reduced. With such term and phrase recognition, standard association mining or max pattern mining may be invoked.
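A minimal sketch of this transaction view: each document's keyword set is treated as an itemset, and keyword pairs that co-occur in at least min_support documents are kept. The documents and the threshold below are made up for illustration.

from itertools import combinations
from collections import Counter

# each document is a transaction; its keyword set is the itemset (documents are made up)
docs = [
    {"data", "mining", "association", "rules"},
    {"data", "mining", "cluster", "analysis"},
    {"association", "rules", "market", "basket"},
    {"data", "mining", "association", "analysis"},
]

min_support = 2   # minimum number of documents a pair must appear in

pair_counts = Counter()
for keywords in docs:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # e.g. ('data', 'mining') and ('association', 'rules') co-occur often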
Document classification analysis: Automated document classification assigns labels to documents for easy lookup and retrieval. This is used for tagging topics, constructing topic directories, identifying document writing styles, and grouping hyperlinks associated with a set of documents. Document classification is automated in the following way: First, a set of pre-classified documents is used as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process, after which it can be used for classifying other documents. This is different from classifying relational data, which is structured: each tuple in relational data is defined by a set of attribute-value pairs, and the classification analysis decides which attribute-value pair is the most discriminating. Document databases, on the other hand, contain a set of keywords per document and don't have any fixed set of attributes, and treating each keyword as a dimension results in a very large dimension set. Plain decision tree analysis is therefore not suitable for document classification.
Techniques commonly used for document classification include nearest neighbor classification, feature selection methods, Bayesian classification, support vector machines, and association based classification.
The k-nearest neighbor classifier assumes that similar documents have similar document vectors and are expected to be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. For a given test document, we retrieve the k most similar training documents and use their labels. To refine the label selection, we can tune k or use weights associated with documents. Furthermore, a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels. Bayesian classification is another popular technique that involves the statistical distribution of documents in specific classes. This classifier first trains the model by estimating, for each class c, a distribution over documents d, and then tests which class is most likely to have generated the test document. Another classification method is the support vector machine. Here the classes are represented by numbers and a direct mapping function from the term space to the class variable is constructed. The least squares linear regression method can be used to discriminate this classification. Association-based classification classifies documents based on a set of associated, frequently occurring text patterns. However, frequent terms are usually poor discriminators, and the not-so-frequent ones have better discriminative power. Such an association-based classification method proceeds as follows: First, we extract keywords and terms by the information retrieval and simple association analysis techniques mentioned above. Second, we use a concept hierarchy with term classes, such as WordNet, or expert knowledge, or a keyword classification system. Documents in the training set can also be classified into class hierarchies. Term association mining can then be applied to discover sets of associated terms that can be used to maximally distinguish one class of documents from another. This derives a set of association rules associated with each document class, which can be ordered based on their discriminative power and occurrence frequency and then used to classify new documents.
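The k-nearest neighbor approach can be sketched with plain term-frequency vectors and cosine similarity; the tiny training set, the labels, and the choice of k below are made up for illustration.

import math
from collections import Counter

# a minimal k-nearest-neighbor document classifier over term-frequency vectors
def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query, training, k=3):
    # training is a list of (term-frequency Counter, label) pairs
    neighbors = sorted(training, key=lambda d: cosine(query, d[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    (Counter("stocks shares market exchange".split()), "finance"),
    (Counter("shares dividend market price".split()), "finance"),
    (Counter("match goal team league".split()), "sports"),
    (Counter("team player score season".split()), "sports"),
]
print(knn_classify(Counter("market price of shares".split()), training))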

Monday, June 24, 2013

Steps for processing text:
Text cleaning: clean the data, remove stop words, resolve inconsistencies, and stem the terms.
Keywords are extracted based on their conditional probability distributions.
Text is usually analyzed for topics by clustering with K-means using a cosine-based distance measure.
Another data mining tool is clustering for high-dimensional data. For example, text documents may contain thousands of terms or keywords as features. Document clustering differs from topic analysis in that the documents are high-dimensional, with as many features as there are keywords. Text documents are clustered on the frequent terms they contain. Terms are extracted and stemmed to reduce each term to its basic stem, which reduces the document to a bag of words. If each term is treated as a dimension, the document becomes high-dimensional. By using a frequent-term-based analysis, a well-selected subset of all documents can be considered as a clustering: a frequent-term-based cluster consists of the set of documents containing all of the terms of a frequent term set. A good subset of all the frequent term sets is selected, where a good subset is one that covers all of the documents. Cluster overlap is measured by the distribution of documents supporting some clusters over the remaining cluster candidates. Clusters are automatically described by their frequent term sets.
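A minimal sketch of frequent-term-based clustering under these assumptions: candidate clusters are the sets of documents containing every term of a frequent term set, and a covering subset of candidates is chosen greedily; the documents and the support threshold are made up.

from itertools import combinations

# toy documents reduced to bags (sets) of stemmed words
docs = {
    0: {"data", "mining", "cluster"},
    1: {"data", "mining", "warehouse"},
    2: {"web", "mining", "link"},
    3: {"web", "page", "link"},
}
min_support = 2

# frequent term sets of size one and two, each with its supporting document set
terms = {t for d in docs.values() for t in d}
candidates = [frozenset([t]) for t in terms] + [frozenset(p) for p in combinations(terms, 2)]
clusters = {}
for ts in candidates:
    cover = {d for d, words in docs.items() if ts <= words}
    if len(cover) >= min_support:
        clusters[ts] = cover

# greedily pick frequent term sets until every document is covered
uncovered, chosen = set(docs), []
while uncovered:
    best = max(clusters, key=lambda ts: len(clusters[ts] & uncovered))
    if not (clusters[best] & uncovered):
        break   # remaining documents contain no frequent term set
    chosen.append((set(best), clusters[best]))
    uncovered -= clusters[best]
print(chosen)   # each cluster is described by its frequent term set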
Information retrieval is concerned with the organization and retrieval of information from a large number of documents, usually with measures such as term frequency-inverse document frequency (TF-IDF) and Shannon information. Two documents are considered similar if they have a high cosine measure. Unlike database systems, which address concurrency control, recovery, transaction management, and updates, information retrieval is concerned with keywords and the notion of relevance.
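For instance, TF-IDF weighting and the cosine measure can be computed directly; the three short documents below are made up, and this sketch uses a simple log-based IDF rather than any particular library's variant.

import math
from collections import Counter

corpus = [
    "data mining discovers patterns in large data sets".split(),
    "web mining applies data mining to web content".split(),
    "information retrieval ranks documents by relevance".split(),
]

def tf_idf(term, doc, corpus):
    # term frequency in this document times log inverse document frequency
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

def vectorize(doc, corpus):
    return {t: tf_idf(t, doc, corpus) for t in set(doc)}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(a) * norm(b)) if norm(a) and norm(b) else 0.0

v0, v1 = vectorize(corpus[0], corpus), vectorize(corpus[1], corpus)
print(cosine(v0, v1))   # higher cosine means the two documents are more similar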
Popular text indexing techniques include inverted indices and signature files. An inverted index comprises two indexed tables (hash or B+-tree): a document_table and a term_table. The document table contains the document_id and a list of terms for each document, also known as its posting list. The term table contains the term_id and the list of document identifiers in which the term appears. The list of terms can be quite large, so a hashing technique and a superimposed coding technique can be used to encode a list of terms into a bit representation.
A signature file stores a signature record for each document in the database. Signatures are usually b bits long, initialized to zero, and a bit is set to 1 if the corresponding term appears in the document. To reduce the high dimensionality of documents, reduction techniques such as latent semantic indexing, probabilistic latent semantic indexing, and locality preserving indexing can be used.
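The two indexing structures can be sketched as follows; the toy documents are made up, the signature length b is arbitrary, and Python's built-in hash stands in for the superimposed coding scheme.

from collections import defaultdict

docs = {
    1: "data mining finds patterns in data",
    2: "web mining analyzes web logs",
    3: "information retrieval ranks documents",
}

# inverted index: term_table maps each term to the documents that contain it,
# document_table maps each document to its posting list of terms
term_table = defaultdict(set)
document_table = {}
for doc_id, text in docs.items():
    terms = set(text.split())
    document_table[doc_id] = terms
    for t in terms:
        term_table[t].add(doc_id)

# signature file: each document gets a b-bit signature; each term sets one bit
b = 16
signatures = {}
for doc_id, terms in document_table.items():
    sig = 0
    for t in terms:
        sig |= 1 << (hash(t) % b)
    signatures[doc_id] = sig

def may_contain(doc_id, term):
    # signature test: no false negatives, but possible false positives
    return signatures[doc_id] & (1 << (hash(term) % b)) != 0

print(term_table["mining"])        # exact lookup via the inverted index -> {1, 2}
print(may_contain(1, "mining"))    # signature check -> True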
Latent semantic indexing decomposes the document matrix using singular value decomposition, i.e. it extracts the most representative features while minimizing the reconstruction error. If the rank of the term-document matrix X is r, then LSI decomposes X as X = U Σ V^T, where Σ is the diagonal matrix of the singular values of X. LSI uses the first k singular vectors to embed the documents in a k-dimensional subspace.
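A small numpy sketch of this decomposition; the term-document counts are made up, and k = 2 is an arbitrary choice of subspace dimension.

import numpy as np

# rows = terms, columns = documents (counts are made up for illustration)
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U Sigma V^T

k = 2
# documents embedded in the k-dimensional latent subspace (one column per document)
doc_embeddings = np.diag(sigma[:k]) @ Vt[:k, :]

# the rank-k reconstruction minimizes the reconstruction error among rank-k matrices
X_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
print(doc_embeddings.round(2))
print(np.linalg.norm(X - X_k).round(3))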
Locality preserving indexing extracts the most discriminative features by preserving the locality of documents in the reduced-dimensionality space. Locality implies semantic closeness. LPI uses a minimization function with constraints.
While LSI seeks to uncover the most representative features, LPI aims to discover the geometrical structure of the document space.
Probabilistic Latent Semantic indexing is similar to LSI but achieves reduction through a probabilistic mixture model. Specifically there are k latent common themes in the document collection and each is characterized by a multinomial word distribution.
There are other approaches to text mining in addition to the keyword-based approach: 1) the tagging approach and 2) the information extraction approach. Tagging may be manual or done by an automated categorization algorithm where the tag set is small and predefined. Information extraction goes deeper in that it requires semantic analysis of text by natural language understanding and machine learning methods. Other text mining tasks include document clustering, classification, information extraction, association analysis, and trend analysis.
Self-organizing feature maps are a neural network method for cluster analysis. A neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks are popular for clustering because 1) they are inherently parallel and distributed processing architectures, 2) they learn by adjusting their interconnection weights so as to best fit the data, normalizing the patterns and acting as feature extractors for the various clusters, and 3) they process numerical vectors and require object patterns to be represented by quantitative features.
Each cluster is represented by an exemplar, that is, a prototype, which does not have to correspond to an actual data example. Data points are assigned to the cluster whose exemplar is most similar, based on a distance measure. The attributes of a data point are then predicted from the attributes of its cluster's exemplar.
Self-organizing feature maps represent all points in a high-dimensional source space by points in a 2-D or 3-D target space such that distance and closeness relationships are maintained. The method is useful when the problem is inherently non-linear.
SOMs can be viewed as a constrained version of k-means clustering in which the cluster centers lie in a low-dimensional space.
Clustering is performed by having several units compete for the current object. The unit whose weight vector is closest to the current object becomes the winning unit. The weights of this unit and those of its neighbors are adjusted to move closer to the input object. The assumption is that there is some topology or ordering in the input space and that the units will eventually take on that shape, producing a feature map. Such processing has been applied to web mining but is costly for large databases.
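A minimal sketch of the competitive learning just described, assuming a small 5x5 grid, made-up random input vectors, and a simple Gaussian neighborhood; real SOM implementations typically also shrink the neighborhood radius over time.

import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 5, 5, 3                # 5x5 map of 3-dimensional weight vectors
weights = rng.random((grid_w, grid_h, dim))
data = rng.random((200, dim))                # made-up input vectors

def train(weights, data, epochs=20, lr=0.5, radius=2.0):
    for epoch in range(epochs):
        for x in data:
            # winning unit: the unit whose weight vector is closest to the input
            d = np.linalg.norm(weights - x, axis=2)
            wi, wj = np.unravel_index(np.argmin(d), d.shape)
            # move the winner and its grid neighbors toward the input
            for i in range(grid_w):
                for j in range(grid_h):
                    grid_dist = np.hypot(i - wi, j - wj)
                    if grid_dist <= radius:
                        influence = np.exp(-grid_dist ** 2 / (2 * radius ** 2))
                        weights[i, j] += lr * influence * (x - weights[i, j])
        lr *= 0.9                            # decay the learning rate each epoch
    return weights

weights = train(weights, data)
print(weights[0, 0].round(3))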