
Topic Distillation and Web Page Categorization

Presentation Transcript


  1. Topic Distillation and Web Page Categorization Prasanna K. Desikan (05/29/2002)

  2. Motivation • The web is a huge repository of information. • Categorizing web documents facilitates the search and retrieval of pages. • Topic distillation is the process of finding authoritative Web pages and comprehensive ‘hubs’ which reciprocally endorse each other and are relevant to a given query.

  3. Approaches for Categorization • Text-based categorization • Structure- or link-based categorization • Combination of link and text information

  4. Web Page Categorization Algorithms • Manual categorization by domain-specific experts. • Categorization would involve the analysis of the contents of the web page by a number of domain experts and classification based on the textual content, as done by Yahoo!. • Content-based categorization - based solely on document content, or on a combination of document content and META tags. • To classify a document, all the stop words are removed and the remaining keywords/phrases are represented in the form of a feature vector (a sketch follows).
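A minimal sketch of that content-based representation, assuming an illustrative stop-word list and a simple tokenizer rather than any particular system's preprocessing:

```python
import re
from collections import Counter

# Illustrative subset of stop words; real systems use much larger lists.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "about"}

def feature_vector(text, vocabulary=None):
    """Remove stop words and represent the remaining keywords as a term-frequency vector."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    if vocabulary is None:
        vocabulary = sorted(counts)            # fix a vocabulary order for the vector
    return vocabulary, [counts[term] for term in vocabulary]

vocab, vec = feature_vector("The web is a huge repository of information about the web")
print(vocab)  # ['huge', 'information', 'repository', 'web']
print(vec)    # [1, 1, 1, 2]
```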

  5. Web Page Categorization Algorithms • Link and Content Analysis. • Based on the fact that a web page that refers to a document must contain enough hints about its content to induce someone to read it. Such hints can be used to classify the document being referred to.

  6. Topic Distillation in a Hyperlinked Environment [1] • Aim: To find quality documents related to a query topic. • Problems encountered with the HITS approach: • Mutually reinforcing relationships between hosts. • Automatically generated links. • Non-relevant nodes (documents not relevant to the query topic).

  7. Topic Distillation in a Hyperlinked Environment [1] • Let the Web be represented as a graph with web pages as nodes and hyperlinks as edges. • Approaches: • If there are k edges (an edge here is a link) from documents on a first host to a single document on a second host, we give each edge an authority weight of 1/k (sketched below).
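A rough sketch of that host-based edge weighting, assuming pages are identified by their URLs; the graph representation and helper names are my own for illustration, not the paper's implementation:

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc

def authority_edge_weights(links):
    """links: list of (source_url, target_url) pairs.
    If k documents on one host all point to a single document on another host,
    each of those k edges receives authority weight 1/k, as proposed in [1]."""
    groups = defaultdict(list)
    for src, dst in links:
        if host(src) != host(dst):                 # same-host links confer no authority
            groups[(host(src), dst)].append((src, dst))
    weights = {}
    for edges in groups.values():
        k = len(edges)
        for edge in edges:
            weights[edge] = 1.0 / k
    return weights

# Two pages on host-a.com both endorse the same page on host-b.com,
# so each edge carries authority weight 0.5 instead of 1.
print(authority_edge_weights([
    ("http://host-a.com/p1", "http://host-b.com/doc"),
    ("http://host-a.com/p2", "http://host-b.com/doc"),
]))
```

The paper applies the symmetric idea on the hub side (a hub weight of 1/l when a single document has l links to documents on one host); the same grouping trick works there.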

  8. Topic Distillation in a Hyperlinked Environment [1] • Approaches (contd.): • Compute a relevance weight for each node. • Eliminate non-relevant nodes from the graph by setting a threshold on the relevance weight. • Regulate the influence of a node based on its relevance.
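One way to picture the pruning step; the cosine similarity over term frequencies below is an illustrative stand-in for the relevance weight used in [1], and the threshold value is arbitrary:

```python
import math
from collections import Counter

def relevance_weight(doc_text, topic_text):
    """Cosine similarity between term-frequency vectors of a document and the topic."""
    d, t = Counter(doc_text.lower().split()), Counter(topic_text.lower().split())
    dot = sum(d[w] * t[w] for w in d)
    norm = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

def prune_nodes(nodes, topic_text, threshold=0.1):
    """Drop graph nodes whose relevance weight falls below the threshold."""
    return {n: text for n, text in nodes.items()
            if relevance_weight(text, topic_text) >= threshold}
```

The surviving relevance weights can also scale each node's contribution during the hub/authority iterations, which is the "regulate the influence" step above.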

  9. Topic Distillation in a Hyperlinked Environment [1] • Approaches (contd.): • Partial content analysis. • Content pruning by analyzing only a part of the graph, i.e., the nodes which are most influential in the outcome.

  10. Automatic Resource Compilation [2] • Goal: Automatically compile a resource list on any topic that is broad and well-represented on the Web. • Approach: • A search-and-growth phase. • A weighting phase: w(p,q) = 1 + n(t), where w(p,q) is a measure of the authority on the topic invested by page p in page q, and n(t) is the number of matches between terms in the topic description and the terms appearing in the anchor window of width B around the link (sketched below). • An iteration-and-reporting phase.
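A small sketch of that edge weight; the tokenization, the topic-term set, and the window indexing are assumptions made for illustration, not details taken from [2]:

```python
def anchor_window(page_tokens, link_position, width):
    """Terms within 'width' tokens on either side of the link (the window width B)."""
    lo, hi = max(0, link_position - width), link_position + width + 1
    return page_tokens[lo:hi]

def edge_weight(anchor_window_terms, topic_terms):
    """w(p, q) = 1 + n(t), where n(t) counts topic-description terms found in the window."""
    n_t = sum(1 for term in anchor_window_terms if term.lower() in topic_terms)
    return 1 + n_t

topic = {"bicycling", "cycling", "bike"}
tokens = "great links about bike racing and touring".split()
print(edge_weight(anchor_window(tokens, 3, 2), topic))  # 'bike' matches, so the weight is 2
```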

  11. Relaxation Labeling Technique [3] • First classify the unclassified documents in the neighborhood using a terms-only classifier, i.e., using only the text of the neighboring documents. • Iterate until convergence: • Recompute the class for each document using both the local text and the class information of the neighbors. • The relaxation is guaranteed to converge to a consistent state (a sketch follows).
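A hypothetical sketch of that loop; text_classifier stands for any classifier that accepts a page's text and, optionally, the current labels of its neighbors, and the convergence test simply checks whether any label changed:

```python
def relaxation_labeling(docs, links, text_classifier, max_iters=20):
    """docs: {doc_id: text}; links: list of (source_id, target_id) pairs."""
    # Bootstrap with a terms-only classification of every document.
    labels = {d: text_classifier(text) for d, text in docs.items()}
    neighbors = {d: [q for p, q in links if p == d] + [p for p, q in links if q == d]
                 for d in docs}
    for _ in range(max_iters):
        changed = False
        for d, text in docs.items():
            neighbor_labels = [labels[n] for n in neighbors[d] if n in labels]
            # Re-classify using the local text plus the neighbors' current classes.
            new_label = text_classifier(text, neighbor_labels)
            if new_label != labels[d]:
                labels[d], changed = new_label, True
        if not changed:        # consistent state: no document changed its class
            break
    return labels
```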

  12. Probabilistic Relational Model [4] • Web pages and links are modeled as entities and relationships respectively, and each of them is represented as a class. • A Bayesian network is created from the attributes of the entity-relationship model in order to model uncertainty and perform inference.

  13. Probabilistic Relational Model • By belief propagation, an approximate inference approach, we can use our prior knowledge to infer the unobserved cases. • Given new data with some unobserved variables, first assign the most likely values to them. • Based on the estimates of those marginal probabilities, we predict the correct classification.
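The sketch below runs plain loopy belief propagation on a pairwise model over linked pages, purely to illustrate the inference step; the PRM in [4] derives its network from the entity-relationship schema, and the priors and compatibility table here are hypothetical inputs:

```python
def loopy_bp(priors, edges, compat, iters=10):
    """priors: {page: [P(class 0), P(class 1), ...]} from, e.g., a text classifier.
    edges: list of (page, page) links.  compat[ci][cj]: compatibility of classes
    ci and cj across a link.  Returns approximate marginal beliefs per page."""
    classes = range(len(compat))
    nbrs = {p: set() for p in priors}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    # msg[(i, j)][c] is the message from page i to page j about j taking class c.
    msg = {(i, j): [1.0] * len(compat) for i in priors for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            out = []
            for cj in classes:
                total = 0.0
                for ci in classes:
                    incoming = 1.0
                    for k in nbrs[i]:
                        if k != j:
                            incoming *= msg[(k, i)][ci]
                    total += priors[i][ci] * compat[ci][cj] * incoming
                out.append(total)
            z = sum(out) or 1.0
            new[(i, j)] = [v / z for v in out]     # normalize for numerical stability
        msg = new
    beliefs = {}
    for i in priors:
        b = [priors[i][c] for c in classes]
        for k in nbrs[i]:
            b = [b[c] * msg[(k, i)][c] for c in classes]
        z = sum(b) or 1.0
        beliefs[i] = [v / z for v in b]            # take the argmax as the predicted class
    return beliefs
```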

  14. Probabilistic Relational Model • This approach proved effective when applied to the hypertext classification problem: by utilizing information from both the content and the link structure, it provides more accurate classification and the ability to do probabilistic reasoning.

  15. Integrating the DOM With Hyperlinks for Enhanced Topic Distillation [6] • A uniform, fine-grained model. • Web pages are represented by their tag trees (also called their Document Object Models, or DOMs). • DOM trees are interconnected by ordinary hyperlinks. • Mixed hubs are dis-aggregated.

  16. A new fine-grained model [7] • Figure: a Document Object Model (DOM) tree (html, head/body, table, tr, td, ...) with a frontier of differentiation separating a relevant subtree (links to art.qaz.com, ski.qaz.com) from an irrelevant subtree (links to www.fromages.com, www.teddingtoncheese.co.uk). • The markup of such a mixed hub:
  <html>…<body>…
  <table …>
    <tr><td>
      <table …>
        <tr><td><a href="http://art.qaz.com">art</a></td></tr>
        <tr><td><a href="http://ski.qaz.com">ski</a></td></tr>…
      </table>
    </td></tr>
    <tr><td>
      <ul>
        <li><a href="http://www.fromages.com">Fromages.com</a> French cheese…</li>
        <li><a href="http://www.teddingtoncheese.co.uk">Teddington…</a> Buy online…</li>
        …
      </ul>…
    </td></tr>
  </table>…
  </body></html>
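A small sketch of how such a page could be turned into the fine-grained graph of DOM nodes plus hyperlinks, using only the standard-library parser; identifying DOM nodes by their tag path is a simplification for illustration (siblings with the same path collapse together), not the construction used in [6] or [7]:

```python
from html.parser import HTMLParser

class DomGraphBuilder(HTMLParser):
    """Collects the DOM nodes of one page and the hyperlink edges that leave them."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.stack = []            # path of currently open tags
        self.dom_nodes = set()     # nodes identified by (page URL, tag path)
        self.hyperlinks = []       # (dom_node, target URL) edges

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        node = (self.page_url, "/".join(self.stack))
        self.dom_nodes.add(node)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hyperlinks.append((node, href))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

builder = DomGraphBuilder("http://example.org/hub.html")
builder.feed('<html><body><ul><li><a href="http://www.fromages.com">Fromages.com</a>'
             '</li></ul></body></html>')
print(builder.hyperlinks)
# [(('http://example.org/hub.html', 'html/body/ul/li/a'), 'http://www.fromages.com')]
```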

  17. Integrating the DOM With Hyperlinks for Enhanced Topic Distillation Figure 6: The fine-grained model of Web linkage which unifies hyperlinks and DOM structure

  18. Integrating the DOM With Hyperlinks for Enhanced Topic Distillation • Benefits: • Reduces topic drift. • Identifies and extracts the regions (DOM subtrees) relevant to the query from: • a broader hub; • a hub with additional, less relevant contents and links.

  19. Web Page Classification Based on Document Structure [5] • Web pages that belong to a particular category have some similarity in their structure: • Information pages. • Research pages. • Personal home pages. • The general structural information of any page can be deduced from the placement of links, text and images, including equations and graphs.

  20. Web Page Categories Based on Structural Similarities • Information pages: • A logo on the top followed by a navigation bar linking the page to other important pages. • The ratio of link text (amount of text with links) to normal text also tends to be relatively high. • Research pages: • Contain large amounts of text, equations and graphs in the form of images. • The number of distinctive gray levels/color shades in the images also provides a cue.

  21. Web Page Categories Based on Structural Similarities • Personal pages: • The name and address of the person appear prominently at the top of the page. • A photograph of the person concerned. • Towards the bottom of the page, the person provides links to his publications, if there are any, and other useful references or links to his favorite destinations on the web.

  22. Feature Extraction • Textual information: • The number and placement of links in a page provide valuable information about the broad category the page belongs to. • The ratio of the number of characters in links to the total number of characters in the page (a sketch follows).
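A possible way to compute that link-text ratio with the standard-library HTML parser; handling nested anchors with a simple counter is an implementation choice made here, not something specified in the original work:

```python
from html.parser import HTMLParser

class LinkTextRatio(HTMLParser):
    """Ratio of characters inside <a> elements to all visible characters on the page."""
    def __init__(self):
        super().__init__()
        self.open_links = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_links += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.open_links:
            self.link_chars += n

    def ratio(self):
        return self.link_chars / self.total_chars if self.total_chars else 0.0

p = LinkTextRatio()
p.feed('<body><a href="/a">Home</a> Welcome to the site <a href="/b">Research</a></body>')
print(round(p.ratio(), 2))  # 0.39; navigation-heavy information pages score higher
```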

  23. Feature Extraction • Image Information • Information pages have more colors than personal homepages, which in turn have more colors than research pages • The histogram of synthetic images generally tends to concentrate at a few bands of color shades. In contrast, the histogram of natural images is spread over a larger area • Information pages usually contain many natural images, while research pages contain a number of synthetic images
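One way to quantify those histogram cues, assuming the Pillow imaging library is available (Image.open / convert / histogram); the bin count and the interpretation thresholds are illustrative:

```python
from PIL import Image  # assumed dependency; any library exposing a histogram would do

def histogram_features(path, top_bins=16):
    """Return (number of distinct gray levels, fraction of pixels in the most
    populated bins).  Synthetic images such as plots and equations concentrate
    their mass in a few shades; natural photographs spread it over many."""
    hist = Image.open(path).convert("L").histogram()   # 256 gray-level counts
    total = sum(hist) or 1
    distinct_levels = sum(1 for count in hist if count > 0)
    concentration = sum(sorted(hist, reverse=True)[:top_bins]) / total
    return distinct_levels, concentration

# A bar chart might yield something like (12, 0.99), a photograph (256, 0.35);
# few distinct levels and high concentration suggest a synthetic image.
```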

  24. Feature Extraction • Other information: • Approaches using classification based on video and other multimedia content are presently not implemented.

  25. Results • Number of pages on which the implementation was tested: ~4000 • Pages categorized: ~3700 • Pages categorized correctly: ~3250 • Percentage categorized correctly: 87.83%

  26. Web Page Categories Based on Structural Similarities • Conclusions and future work for the approach: • This approach, augmented with traditional text-based approaches, could be used for effective categorization of web pages. • Improvement in feature selection. • Automate the training process. • Has to be experimented with on more data sets.

  27. References [1] K. Bharat and M. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998. [2] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Proceedings of the 7th World Wide Web Conference, 1998. [3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of ACM SIGMOD, 1998.

  28. References [4] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic Models of Text and Link Structure for Hypertext Classification. IJCAI Workshop on "Text Learning: Beyond Supervision", Seattle, WA, August 2001. [5] Arul Prakash Asirvatham, Kranthi Kumar Ravi, and C. V. Jawahar. Web Page Classification Based on Document Structure. [6] S. Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. Proceedings of the 10th International World Wide Web Conference, Hong Kong, May 2001. [7] S. Chakrabarti, M. M. Joshi, and V. B. Tawde. Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks. Proceedings of ACM SIGIR 2001, New Orleans, LA, September 2001.
