Wednesday, June 26, 2013

Mining the world wide web
The World Wide Web serves as a huge, widely distributed global information service center for a variety of markets such as finance, consumer services, education, government, and e-commerce, among many others. In addition to the content of web pages, page access and usage information and hyperlinks provide additional sources for data mining. The challenges are: 1) The web is too huge for effective data warehousing and data mining. 2) Web pages are complex, lack a unifying structure, and vary quite a bit. 3) The web is a highly dynamic information source. 4) The web serves a broad diversity of user communities. 5) The bulk of the information is irrelevant to any given user.
Web search engines serve up resources on the internet, and they usually index web pages and store huge keyword-based indices. However, this approach has limitations. First, a topic can contain hundreds of thousands of documents, which can lead to a very large number of entries returned by a search engine. Second, relevant documents may not be retrieved because they don't contain the keywords. Since a keyword-based search engine is not sufficient for web resource discovery, web mining has to address searching web structures, ranking content, discovering patterns, and so on. This work is typically categorized as web content mining, web structure mining, and web usage mining.
Web pages are supposed to have a DOM tree structure, and many pages share a similar layout. Hence vision-based page segmentation is a common technique: the page layout is analyzed, blocks are extracted, and the separators between them are found, so that the web page can be represented as a semantic tree.
Search engines also automatically identify authoritative web pages. This is done based on the collective endorsement of web pages by way of hyperlinks pointing from one page to another. These hyperlinks imply a notion of authority. In addition, a so-called hub is a page that provides a collection of links to authorities.
Hub pages may not be prominent, and few links may point to them, but they can be mined to find authoritative pages. The algorithm that does this, called HITS (Hyperlink-Induced Topic Search), uses the query terms to collect a starting set of pages, also called the root set. Since many of these pages are presumably relevant, the root set can be expanded into a base set by adding pages that link to or are linked from the root set, up to a threshold on the number of additional pages. A weight propagation method is then initiated: each page's authority weight is updated from the hub weights of the pages pointing to it, and each page's hub weight is updated from the authority weights of the pages it points to. Finally, the HITS algorithm outputs a short list of pages with large hub weights and a short list of pages with large authority weights. To work around the size of hub pages, the weights of the links are adjusted and the hub pages are broken down into smaller units. Google's PageRank uses a similar link-analysis idea. Other, newer algorithms include block-level link analysis and block-to-page relationships. Next we will look at mining multimedia data on the web.
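To make the weight propagation in HITS a little more concrete, here is a minimal sketch in Python. It assumes the base set has already been collected and is given as a dictionary mapping each page to the pages it links to. The function name `hits`, the `iterations` parameter, and the toy `example_graph` are illustrative choices of mine, not part of any particular library, and the fixed iteration count stands in for a proper convergence check.

```python
import math

def hits(graph, iterations=50):
    """Hypothetical helper: graph maps page -> list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub weights of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: sum the authority weights of pages that p points to.
        new_hub = {
            p: sum(new_auth[dst] for dst in graph.get(p, ()))
            for p in pages
        }

        # Normalize so the weights stay bounded (Euclidean norm of 1).
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth

# Toy base set (made up for illustration): three hub-like pages
# linking to two authority-like pages.
example_graph = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}

hub_weights, auth_weights = hits(example_graph)
top_authorities = sorted(auth_weights, key=auth_weights.get, reverse=True)[:2]
top_hubs = sorted(hub_weights, key=hub_weights.get, reverse=True)[:2]
```

In this toy graph the page with the most incoming links from good hubs ends up with the largest authority weight, and the pages pointing at both authorities end up with the largest hub weights. The adjustments mentioned above, such as reweighting links or splitting large hub pages, are not part of this sketch.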
