
Hypertext Categorization using Hyperlink Patterns and Meta Data


Presentation Transcript


  1. Hypertext Categorization using Hyperlink Patterns and Meta Data Rayid Ghani, Séan Slattery, Yiming Yang Carnegie Mellon University

  2. How is hypertext different? • Link information (possibly useful but noisy) • Diverse authorship • Short text - topic not obvious from the text alone • Structure / position within the web graph • Author-supplied features (meta-tags) • External sources of information (meta-data) • Bold, italics, headings, etc.

  3. Goal • Present several hypotheses about regularities in hypertext classification tasks • Describe methods to exploit these regularities • Evaluate the different methods and regularities on real-world hypertext datasets

  4. Regularities in Hypertext • No Regularity • “Encyclopedia” Regularity • “Co-Referencing” Regularity • Partial “Co-Referencing” Regularity • Preclassified Regularity • Meta-Data Regularity

  5. • No Regularity: the documents are linked at random, or at least independently of the document class. • "Encyclopedia" Regularity: the majority of linked documents share the same class as the document; encyclopedia articles generally reference other articles which are topically similar. • "Co-Referencing" Regularity: documents of the same class tend to link to documents not of that class, but which are topically similar to each other; university student index pages tend not to link to other student index pages, but do link mostly to the home pages of students.
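
The difference between "No Regularity" and the "Encyclopedia" regularity can be made concrete by measuring how often a hyperlink points to a document of the same class. The sketch below is not from the paper; it assumes a toy `docs` mapping from document id to a (class label, outgoing links) pair.

# A minimal sketch, assuming a hypothetical `docs` structure, of a test for
# the "Encyclopedia" regularity: the fraction of links whose target shares
# the class of the linking document.

def same_class_link_fraction(docs):
    same, total = 0, 0
    for doc_id, (label, links) in docs.items():
        for target in links:
            if target not in docs:
                continue  # ignore links that leave the labeled collection
            total += 1
            if docs[target][0] == label:
                same += 1
    return same / total if total else 0.0

# A value near the class prior suggests "No Regularity";
# a value well above it suggests the "Encyclopedia" regularity.
docs = {
    "a.html": ("sports", ["b.html", "c.html"]),
    "b.html": ("sports", ["a.html"]),
    "c.html": ("finance", ["a.html"]),
}
print(same_class_link_fraction(docs))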

  6. • Partial "Co-Referencing" Regularity: the "Co-Referencing" regularity where we might have more than a few "noisy" links; many students may link to pages about their hobbies, but also to a wide variety of other pages which are less unique to student home pages. • Pre-Classified Regularity: either one page, or some small set of pages, may contain lists of hyperlinks to pages that are mostly members of the same class, e.g. any page from the Yahoo topic hierarchy. • Meta-Data Regularity: meta-data available from external sources on the web can be exploited in the form of additional features, e.g. movie reviews for movie classification, or online discussion boards for other topic classification tasks (such as stock market prediction or competitive analysis).

  7. • Ignore Links: use standard text classifiers on the text of the document itself; this also serves as the baseline. • Use All the Text From Neighbors: augment the text of each document with the text of its neighbors, adding more topic-related words to the document. • Use All the Text From Neighbors Separately: add the words of linked documents, but treat them as if they come from a separate vocabulary; a simple way to do this is to prefix the words in the linked documents with a tag, such as linked-word:
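
As a rough illustration of the three link-text options above (a sketch under assumed data structures, not the authors' code), the helpers below take a `corpus` mapping page ids to token lists and a `neighbors` mapping page ids to the ids of linked pages; the last helper implements the linked-word: prefixing idea.

# A minimal sketch of the three representations on slide 7.
# `corpus` and `neighbors` are assumed, made-up structures.

def ignore_links(doc_id, corpus, neighbors):
    # Baseline: use only the document's own text.
    return list(corpus[doc_id])

def add_neighbor_text(doc_id, corpus, neighbors):
    # Augment the document with the text of its neighbors (same vocabulary).
    tokens = list(corpus[doc_id])
    for n in neighbors.get(doc_id, []):
        tokens.extend(corpus.get(n, []))
    return tokens

def add_neighbor_text_separately(doc_id, corpus, neighbors):
    # Treat neighbor words as a separate vocabulary by prefixing them.
    tokens = list(corpus[doc_id])
    for n in neighbors.get(doc_id, []):
        tokens.extend("linked-word:" + w for w in corpus.get(n, []))
    return tokens

corpus = {"p1": ["machine", "learning"], "p2": ["text", "classification"]}
neighbors = {"p1": ["p2"], "p2": []}
print(add_neighbor_text_separately("p1", corpus, neighbors))
# ['machine', 'learning', 'linked-word:text', 'linked-word:classification']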

  8. • Look for Linked Document Subsets: search for the topically similar linked pages; at the top level, this is a clustering problem, finding similar documents among all the documents linked to by documents of the same class. • Use the Identity of the Linked Documents: represent each page with only the names of the pages it links to. • Use External Features / Meta-Data: collect features that relate two or more entities/documents being classified, using information extraction techniques; these extracted features can then be used in a similar fashion, either through the identity of the related documents or through their text in various ways.
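
One way to realize the "identity of the linked documents" representation (a sketch with made-up page names, not the paper's implementation) is to describe each page by a bag of link:<target> tokens, so that a standard learner can pick up on shared link targets such as a common index page. The example below uses scikit-learn's CountVectorizer.

# A sketch, assuming a hypothetical `link_graph` from page to outgoing links.
from sklearn.feature_extraction.text import CountVectorizer

link_graph = {
    "student1.html": ["hobbies.html", "dept-index.html"],
    "student2.html": ["dept-index.html", "advisor.html"],
    "course1.html":  ["syllabus.html", "dept-index.html"],
}

# Treat each outgoing link as a single token ("link:<target>").
link_docs = [" ".join("link:" + t for t in targets)
             for targets in link_graph.values()]

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(link_docs)   # pages-by-link-targets matrix
print(vectorizer.get_feature_names_out())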


  10. Learning Algorithms Used • Naïve Bayes (NB) • Probabilistic, Builds a Generative Model • k Nearest Neighbor (kNN) • Example-based • First Order Inductive Learner (FOIL) • Relational Learner
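
For the first two learners, common off-the-shelf stand-ins exist; the sketch below is not the authors' experimental setup, but shows scikit-learn's MultinomialNB (Naïve Bayes) and KNeighborsClassifier (kNN) on a tiny bag-of-words example. FOIL, the relational learner, has no scikit-learn counterpart and is omitted here.

# A minimal sketch with made-up training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

train_text = ["stocks market earnings", "soccer match goal", "bank earnings report"]
train_y = ["finance", "sports", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(train_text)

nb = MultinomialNB().fit(X, train_y)                      # generative model
knn = KNeighborsClassifier(n_neighbors=1).fit(X, train_y)  # example-based

test = vec.transform(["quarterly earnings call"])
print(nb.predict(test), knn.predict(test))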

  11. Datasets • A collection of up to 50 web pages from each of 4285 companies (as used in Ghani et al. 2000) • Two types of classification (labels obtained from www.hoovers.com) • Coarse-grained classification - 28 classes • Fine-grained classification - 255 classes • Classification is at the level of companies, so the task is to classify each company by collapsing all of the web pages in its corporate website.
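
The company-level setup can be illustrated with a small sketch (hypothetical data, not the actual Hoovers-labelled crawl): all crawled pages of a company are collapsed into a single document before classification.

def collapse_company_pages(pages):
    # Concatenate every page of a company into one training document.
    return {company: " ".join(page_texts) for company, page_texts in pages.items()}

# `pages` is an assumed mapping from company name to the text of its crawled pages.
pages = {
    "AcmeCorp": ["welcome to acme", "acme quarterly earnings"],
    "Globex":   ["globex products", "globex investor relations"],
}
company_docs = collapse_company_pages(pages)
print(company_docs["AcmeCorp"])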

  12. Accuracy for 28 Class Task

  13. Accuracy for 255 Class Task

  14. Accuracy Vs. Feature Size

  15. Conclusions • Hyperlinks can be extremely noisy and harmful for classification • Meta-data about websites can be useful, and techniques for automatically finding meta-data should be explored • Naïve Bayes and kNN are suitable since they scale well with the noise and the feature-set size, while FOIL has the power to discover relational regularities that the others cannot explicitly identify
