
Hypertext Categorization using Hyperlink Patterns and Meta Data


Presentation Transcript


  1. Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani, Séan Slattery, Yiming Yang Carnegie Mellon University

  2. How is hypertext different? • Link information (possibly useful but noisy) • Diverse authorship • Short text - topic not obvious from the text alone • Structure / position within the web graph • Author-supplied features (meta-tags) • External sources of information (meta-data) • Bold, italics, headings, etc.

  3. Goal • Present several hypotheses about regularities in hypertext classification tasks • Describe methods to exploit these regularities • Evaluate the different methods and regularities on real-world hypertext datasets

  4. Regularities in Hypertext • No Regularity • “Encyclopedia” Regularity • “Co-Referencing” Regularity • Partial “Co-Referencing” Regularity • Preclassified Regularity • Meta-Data Regularity

  5. • No Regularity: the documents are linked at random, or at least independently of the document class. • "Encyclopedia" Regularity: the majority of linked documents share the same class as the document; encyclopedia articles generally reference other articles which are topically similar. • "Co-Referencing" Regularity: documents of the same class tend to link to documents not of that class, but which are topically similar to each other; university student index pages tend not to link to other student index pages, but do link mostly to the home pages of students.
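
The difference between "No Regularity" and the "Encyclopedia" regularity can be made concrete by measuring how often a hyperlink points to a document of the same class. The sketch below is not from the paper; it assumes a toy `docs` mapping from document id to a (class label, outgoing links) pair.

# A minimal sketch, assuming a hypothetical `docs` structure, of a test for
# the "Encyclopedia" regularity: the fraction of links whose target shares
# the class of the linking document.

def same_class_link_fraction(docs):
    same, total = 0, 0
    for doc_id, (label, links) in docs.items():
        for target in links:
            if target not in docs:
                continue  # ignore links that leave the labeled collection
            total += 1
            if docs[target][0] == label:
                same += 1
    return same / total if total else 0.0

# A value near the class prior suggests "No Regularity";
# a value well above it suggests the "Encyclopedia" regularity.
docs = {
    "a.html": ("sports", ["b.html", "c.html"]),
    "b.html": ("sports", ["a.html"]),
    "c.html": ("finance", ["a.html"]),
}
print(same_class_link_fraction(docs))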

  6. • Partial "Co-Referencing" Regularity: the "Co-Referencing" regularity where we might have more than a few "noisy" links; many students may link to pages about their hobbies, but also to a wide variety of other pages which are less unique to student home pages. • Pre-Classified Regularity: either one page, or some small set of pages, may contain lists of hyperlinks to pages that are mostly members of the same class, e.g. any page from the Yahoo topic hierarchy. • Meta-Data Regularity: meta-data available from external sources on the web can be exploited in the form of additional features, e.g. movie reviews for movie classification, or online discussion boards for other topic classification tasks (such as stock market prediction or competitive analysis).

  7. • Ignore Links: use standard text classifiers on the text of the document itself; this also serves as the baseline. • Use All the Text From Neighbors: augment the text of each document with the text of its neighbors, adding more topic-related words to the document. • Use All the Text From Neighbors Separately: add the words of linked documents, but treat them as if they come from a separate vocabulary; a simple way to do this is to prefix the words in the linked documents with a tag, such as linked-word:
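
As a rough illustration of the three link-text options above (a sketch under assumed data structures, not the authors' code), the helpers below take a `corpus` mapping page ids to token lists and a `neighbors` mapping page ids to the ids of linked pages; the last helper implements the linked-word: prefixing idea.

# A minimal sketch of the three representations on slide 7.
# `corpus` and `neighbors` are assumed, made-up structures.

def ignore_links(doc_id, corpus, neighbors):
    # Baseline: use only the document's own text.
    return list(corpus[doc_id])

def add_neighbor_text(doc_id, corpus, neighbors):
    # Augment the document with the text of its neighbors (same vocabulary).
    tokens = list(corpus[doc_id])
    for n in neighbors.get(doc_id, []):
        tokens.extend(corpus.get(n, []))
    return tokens

def add_neighbor_text_separately(doc_id, corpus, neighbors):
    # Treat neighbor words as a separate vocabulary by prefixing them.
    tokens = list(corpus[doc_id])
    for n in neighbors.get(doc_id, []):
        tokens.extend("linked-word:" + w for w in corpus.get(n, []))
    return tokens

corpus = {"p1": ["machine", "learning"], "p2": ["text", "classification"]}
neighbors = {"p1": ["p2"], "p2": []}
print(add_neighbor_text_separately("p1", corpus, neighbors))
# ['machine', 'learning', 'linked-word:text', 'linked-word:classification']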

  8. • Look for Linked Document Subsets: search for the topically similar linked pages; at the top level, this is a clustering problem, finding similar documents among all the documents linked to by documents of the same class. • Use the Identity of the Linked Documents: represent each page with only the names of the pages it links to. • Use External Features / Meta-Data: collect features that relate two or more entities/documents being classified, using information extraction techniques; these extracted features can then be used in a similar fashion, either through the identity of the related documents or through their text in various ways.
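
One way to realize the "identity of the linked documents" representation (a sketch with made-up page names, not the paper's implementation) is to describe each page by a bag of link:<target> tokens, so that a standard learner can pick up on shared link targets such as a common index page. The example below uses scikit-learn's CountVectorizer.

# A sketch, assuming a hypothetical `link_graph` from page to outgoing links.
from sklearn.feature_extraction.text import CountVectorizer

link_graph = {
    "student1.html": ["hobbies.html", "dept-index.html"],
    "student2.html": ["dept-index.html", "advisor.html"],
    "course1.html":  ["syllabus.html", "dept-index.html"],
}

# Treat each outgoing link as a single token ("link:<target>").
link_docs = [" ".join("link:" + t for t in targets)
             for targets in link_graph.values()]

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(link_docs)   # pages-by-link-targets matrix
print(vectorizer.get_feature_names_out())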


  10. Learning Algorithms Used • Naïve Bayes (NB) • Probabilistic, Builds a Generative Model • k Nearest Neighbor (kNN) • Example-based • First Order Inductive Learner (FOIL) • Relational Learner
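
For the first two learners, common off-the-shelf stand-ins exist; the sketch below is not the authors' experimental setup, but shows scikit-learn's MultinomialNB (Naïve Bayes) and KNeighborsClassifier (kNN) on a tiny bag-of-words example. FOIL, the relational learner, has no scikit-learn counterpart and is omitted here.

# A minimal sketch with made-up training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

train_text = ["stocks market earnings", "soccer match goal", "bank earnings report"]
train_y = ["finance", "sports", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(train_text)

nb = MultinomialNB().fit(X, train_y)                      # generative model
knn = KNeighborsClassifier(n_neighbors=1).fit(X, train_y)  # example-based

test = vec.transform(["quarterly earnings call"])
print(nb.predict(test), knn.predict(test))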

  11. Datasets • A collection of up to 50 web pages from each of 4285 companies (as used in Ghani et al. 2000) • Two types of classification (labels obtained from www.hoovers.com) • Coarse-grained classification - 28 classes • Fine-grained classification - 255 classes • Classification is at the level of companies, so the task is to classify each company by collapsing all of the web pages in its corporate website.
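
The company-level setup can be illustrated with a small sketch (hypothetical data, not the actual Hoovers-labelled crawl): all crawled pages of a company are collapsed into a single document before classification.

def collapse_company_pages(pages):
    # Concatenate every page of a company into one training document.
    return {company: " ".join(page_texts) for company, page_texts in pages.items()}

# `pages` is an assumed mapping from company name to the text of its crawled pages.
pages = {
    "AcmeCorp": ["welcome to acme", "acme quarterly earnings"],
    "Globex":   ["globex products", "globex investor relations"],
}
company_docs = collapse_company_pages(pages)
print(company_docs["AcmeCorp"])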

  12. Accuracy for 28 Class Task

  13. Accuracy for 255 Class Task

  14. Accuracy Vs. Feature Size

  15. Conclusions • Hyperlinks can be extremely noisy and harmful for classification • Meta-data about websites can be useful, and techniques for automatically finding meta-data should be explored • Naïve Bayes and kNN are suitable since they scale well with the noise and the feature-set size, while FOIL has the power to discover relational regularities that the others cannot explicitly identify
