Thursday, June 27, 2013

Mining multimedia data on the web
Multimedia can be embedded in web pages and is usually associated with surrounding text and link information. This text and link information can be regarded as features. Using web page layout mining techniques, a web page can be partitioned into a set of semantic blocks, and the block that contains the multimedia data can then be treated as a whole. Searching and organizing multimedia data then amounts to searching and organizing these blocks. VIPS, discussed earlier, can help identify the text surrounding an image, and this text can be used to build an image index. Image search can then be partially carried out using traditional text search techniques. To construct a web image search, a block to image relation is used in addition to the block to page and page to block relations.
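To make the image index idea concrete, here is a minimal sketch. It assumes the page has already been segmented into blocks (by VIPS or something similar) and that each block is given as a dictionary with its text and the images it contains. The names build_image_index and search_images, the block layout and the sample data are all made up for illustration and are not part of VIPS.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_image_index(blocks):
    """Map each term in a block's text to the images found in that block.

    `blocks` is an assumed structure: a list of dicts with the keys
    "text" (surrounding text) and "images" (image identifiers).
    """
    index = defaultdict(set)
    for block in blocks:
        for term in tokenize(block["text"]):
            index[term].update(block["images"])
    return index


def search_images(index, query):
    """Return the images whose surrounding text contains every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical blocks, as a page segmenter might produce them.
blocks = [
    {"text": "Sunset over the Golden Gate bridge", "images": ["bridge.jpg"]},
    {"text": "A golden retriever playing fetch", "images": ["dog.jpg"]},
]
index = build_image_index(blocks)
print(search_images(index, "golden bridge"))
```

This is plain text search over the surrounding text, which is why image search is only partially solved this way: an image whose block has no descriptive text is never found.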
For the automatic classification of web documents, different vendors maintain their own taxonomy of web documents, and new documents are classified against this taxonomy. Beyond that, the procedures described earlier for keyword based document classification and keyword based association analysis also apply here.
Having discussed web content and web structure based mining in the previous post, we now turn to web usage mining. Here web log records are used to discover who accesses which pages and what their access patterns are. This can give clues as to who the potential customers are, enhance the quality and delivery of internet information services to end users, and improve performance. Web log records can quickly grow to a huge size. There is a lot of information that can be mined, but making sure it is valid and reliable is equally important. So we begin with cleaning, condensing and transforming as preprocessing methods. Next, with the available URL, time, IP address and web page content information, a weblog database is populated. This is then analyzed with typical data warehouse based multidimensional OLAP analysis. Then we use association mining, as discussed earlier, to find patterns and trends in web access. For example, knowing users' browsing sequences of web pages can help improve their web experience. Systems can be improved with better web caching, web page prefetching and web page swapping, and by understanding the nature of web traffic and user reaction and motivation. Web sites can also improve themselves based on access patterns; such sites are referred to as adaptive sites.
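The preprocessing and pattern finding steps above can be sketched roughly as follows. This is a simplified illustration, not a full association mining algorithm: it drops requests for static resources, groups the remaining requests into sessions per IP address, and counts how often one page is followed directly by another. The record layout (IP, timestamp in seconds, URL), the 30 minute session timeout and the list of static suffixes are assumptions made for the example.

```python
from collections import Counter, defaultdict

SESSION_GAP = 30 * 60  # seconds; an assumed session timeout
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")


def clean(records):
    """Cleaning step: drop requests for static resources."""
    return [r for r in records if not r[2].lower().endswith(STATIC_SUFFIXES)]


def sessionize(records):
    """Group (ip, timestamp, url) records into per-visitor page sequences."""
    by_ip = defaultdict(list)
    for ip, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_ip[ip].append((ts, url))
    sessions = []
    for visits in by_ip.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > SESSION_GAP:
                # A long pause ends the current session.
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions


def transition_counts(sessions):
    """Count how often page A is immediately followed by page B."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


# Hypothetical log records: (ip, seconds since start, url).
records = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 5, "/style.css"),
    ("10.0.0.1", 60, "/products"),
    ("10.0.0.1", 120, "/cart"),
    ("10.0.0.2", 10, "/home"),
    ("10.0.0.2", 90, "/products"),
    ("10.0.0.1", 4000, "/home"),
]
sessions = sessionize(clean(records))
counts = transition_counts(sessions)
print(counts.most_common(2))
```

Frequent transitions such as /home followed by /products are the kind of pattern that could drive prefetching or caching decisions. Note that using the IP address as the visitor identity is a simplification, since several users can share one address.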
Web usage mining, together with web content and web structure mining, helps with web page ranking. The quality of search is improved because search can be contextualized and personalized.
Having talked about the VIPS algorithm and structural relationships, it may be interesting to note that we extract the page to block and block to page relationships to construct a graph model with a page graph and a block graph. Based on this graph model, new link analysis algorithms can be developed that are capable of discovering the intrinsic semantic structure of the web. The block to page (link structure) and page to block (page layout) relationships are used in block level link analysis. The block to page relationship is obtained from link analysis, using a matrix for distances, which gives a more accurate and robust representation of the link structure of the web. The page to block relationship is obtained from page layout analysis, in which each page is segmented into blocks. We then construct the block graph and, from it, the image graph. This image graph better reflects the semantic relationships between the images.
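As a rough illustration of how these graphs can be built, the sketch below uses one common matrix formulation of block level link analysis. It is assumed here and not spelled out in the post: with X as the page to block matrix, Z as the block to page matrix and Y as the block to image matrix, the page graph is X·Z, the block graph is Z·X and the image graph is Yᵀ·(Z·X)·Y. The tiny matrices and their weights are invented for the example, and the entries of Z are treated as link probabilities.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Toy web: 2 pages, 3 blocks, 2 images (all values are assumptions).
# Page 0 holds blocks 0 and 1; page 1 holds block 2.

# X[p][b]: page to block, the weight of block b within page p (layout).
X = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0]]

# Z[b][p]: block to page, the chance that a link in block b leads to page p.
Z = [[0.0, 1.0],
     [0.0, 1.0],
     [1.0, 0.0]]

# Y[b][i]: block to image, 1 if block b contains image i.
Y = [[1, 0],
     [0, 0],
     [0, 1]]

page_graph = matmul(X, Z)    # pages x pages
block_graph = matmul(Z, X)   # blocks x blocks
image_graph = matmul(matmul(transpose(Y), block_graph), Y)  # images x images

for row in image_graph:
    print(row)
```

In this toy case image 0 sits in a block that links to the page holding image 1, so the image graph gets an edge from image 0 to image 1 even though the images themselves carry no links.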
