Automatic Web Page Categorization by Link and Context Analysis

Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani

Introduction • Document retrieval on the Web • Search engines – keyword-based searches • Classified categories – each category lists Web sites relevant to that category

Introduction • Document d Category c • Requires understanding of both d and c • Has traditionally been accomplished manually • Disadvantages • Growth rate, number of web pages • Highly subjective, lesser quality

Introduction • Automatic classification • Text categorization • Build the representation of a category using a training set of documents pre-categorized under it • Compare representation of a given document d with representation of the category c to decide if d belongs to c • Other approaches • Basic idea – categorization by content
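The two text-categorization steps above (build a category representation from pre-categorized documents, then compare a document against it) can be sketched as follows. This is only an illustration under assumptions not made in the slides: a bag-of-words representation, cosine similarity, and an arbitrary threshold of 0.3. The function names and training sentences are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term-frequency vector of a text (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())

def build_profile(training_docs):
    """Category representation: summed term frequencies of its training documents."""
    profile = Counter()
    for doc in training_docs:
        profile.update(bag_of_words(doc))
    return profile

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def belongs(doc, profile, threshold=0.3):
    """Decide whether document d belongs to category c (threshold is an assumption)."""
    return cosine(bag_of_words(doc), profile) >= threshold

# Hypothetical training set for a "biology" category.
biology = build_profile(["cell biology and genetics", "molecular biology of the cell"])
in_category = belongs("genetics of the cell", biology)
out_of_category = belongs("stock market prices", biology)
```

Note that this approach needs the text of d itself, which is exactly what categorization by context avoids.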

Introduction • Categorization by context • Uses the context surrounding a link • Uses relevance hints that are present in the structure of HTML documents • Advantage • Ability to deal with multimedia material since it analyzes context and not content • Theseus [Teseo]

Improving Web search engines • AltaVista: “refine” capability • Infoseek: grouping of query results, retrieving similar pages • Automatic categorization techniques → better Web retrieval tools, organized material, e.g. Lycos, Infoseek (Content Classification Engine - CCE)

Categorization by context • Basic idea • The referring Web page must contain enough hints about the document’s content • These hints are sufficient to classify the document • What are these hints? • Anchor text of a link: <A>…</A> • Page title • Section titles

Architecture • Tasks performed • Spidering • Structure analysis • URL categorization • Weight combination • Catalog update

Spidering and HTML Structure Analysis <html> <head> <title> Yahoo! – Science: Biology </title> </head> <body> ... <ul> <li> <a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a> - introductory resource including information on chemistry, biochemistry, genetics, cell and molecular biology, and immunology. <li> ...

Spidering and HTML Structure Analysis • The following URL context path is created: http://esg-www.mit.edu:8001/esgbio: “MIT Biology Hypertextbook”: “introductory resource including information on chemistry, biochemistry, genetics, cell and molecular biology, and immunology”: “Yahoo! – Science: Biology”
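A context path like the one above could be extracted roughly as follows. This is a minimal sketch using Python's standard `html.parser`, not the Perl and Java components of Theseus. The class name `ContextPathParser`, the tuple layout (URL, anchor text, trailing description, page title), and the cleanup of the leading dash are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class ContextPathParser(HTMLParser):
    """Collects, per link: URL, anchor text, the text following the link
    in the same list item, and the page title."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []          # one dict per <a href=...>
        self._in_title = False
        self._in_anchor = False
        self._current = None     # link still collecting trailing text

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"url": href, "anchor": "", "trailing": ""}
                self.links.append(self._current)
                self._in_anchor = True
        elif tag == "li":
            self._current = None  # a new list item ends the previous context

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_anchor = False
        elif tag in ("li", "ul"):
            self._current = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None:
            key = "anchor" if self._in_anchor else "trailing"
            self._current[key] += data

    def context_paths(self):
        def clean(text):
            # collapse whitespace, then drop the leading " - " and final "."
            return " ".join(text.split()).strip(" -.")
        title = " ".join(self.title.split())
        return [(l["url"], clean(l["anchor"]), clean(l["trailing"]), title)
                for l in self.links]

# Sample page adapted from the slide (a plain hyphen replaces the dash in the title).
page = """<html><head><title>Yahoo! - Science: Biology</title></head>
<body><ul>
<li><a href="http://esg-www.mit.edu:8001/esgbio/">MIT Biology Hypertextbook</a>
 - introductory resource including information on chemistry, biochemistry,
 genetics, cell and molecular biology, and immunology.
</ul></body></html>"""

parser = ContextPathParser()
parser.feed(page)
parser.close()
paths = parser.context_paths()
```

Section titles, which the slides also list as hints, are left out of this sketch. They could be tracked the same way by recording the most recent heading text.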

URL Categorization • One URL may have several context paths • Category tree – each node identifies a category • URL categorization finds the most appropriate categories to which the URL should belong • Produces a sequence of weights, one for each node in the category tree • URL: N1=w1, N2=w2, N3=w3, …, Nn=wn • Each weight wi represents a degree of confidence that the URL belongs under node Ni
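One way to picture the per-node weights is the sketch below, which scores a single context path against a tiny category tree. The slides do not specify the matching rule, so everything here is assumed: the category terms, the per-position weights, and simple word overlap as the matching criterion are hypothetical stand-ins.

```python
# Hypothetical category profiles: node name -> terms that suggest it.
CATEGORY_TERMS = {
    "Science/Biology": {"biology", "genetics", "cell"},
    "Science/Chemistry": {"chemistry", "biochemistry"},
    "Recreation/Travel": {"travel", "hotel"},
}

# Assumed importance of each position in the context path.
POSITION_WEIGHT = {"anchor": 3.0, "description": 1.0, "title": 2.0}

def categorize(context_path):
    """Return {node: weight} for one context path, given as {position: text}."""
    weights = {node: 0.0 for node in CATEGORY_TERMS}
    for position, text in context_path.items():
        words = set(text.lower().replace(",", " ").replace(":", " ").split())
        for node, terms in CATEGORY_TERMS.items():
            # each matching term contributes the weight of the position it was found in
            weights[node] += POSITION_WEIGHT[position] * len(words & terms)
    return weights

path = {
    "anchor": "MIT Biology Hypertextbook",
    "description": "introductory resource including information on chemistry, "
                   "biochemistry, genetics, cell and molecular biology, and immunology",
    "title": "Yahoo! - Science: Biology",
}
weights = categorize(path)
```

With these assumed numbers the biology node receives the largest weight, because the category term appears in the anchor text, the description, and the page title.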

Weight Combination • Weights from all context paths for a URL are added and normalized • If the weight of a node is greater than a certain threshold, the URL is categorized under that node
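The combination step can be sketched as below. The slides say only that weights are added and normalized, so dividing by the total weight is an assumption, as are the threshold value of 0.5 and the function name `combine`.

```python
from collections import defaultdict

def combine(weight_lists, threshold=0.5):
    """Sum per-node weights over all context paths of a URL, normalize them
    so they add up to 1, and keep the nodes whose weight exceeds the threshold."""
    totals = defaultdict(float)
    for weights in weight_lists:
        for node, w in weights.items():
            totals[node] += w
    norm = sum(totals.values())
    if norm == 0:
        return {}, []
    normalized = {node: w / norm for node, w in totals.items()}
    chosen = [node for node, w in normalized.items() if w > threshold]
    return normalized, chosen

# Weights from two hypothetical context paths of the same URL.
normalized, chosen = combine([
    {"Biology": 8.0, "Chemistry": 2.0},
    {"Biology": 4.0, "Chemistry": 2.0},
])
```

Here the URL would be filed under "Biology" only, since its normalized weight (0.75) is the only one above the assumed threshold.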

Theseus • Theseus is a tool built to verify the validity of the method • Components • TreeTagger: a part-of-speech tagger • HTML parser written in Perl • HTML structure analyzer (produces the context tree) written in Java • Experiments were run using the Arianna catalog

Theseus: Exploiting Noun Phrases • What is noun-phrase analysis? • “a high school female student” • without noun-phrase analysis → “high school” • with noun-phrase analysis → detects that the subject of the phrase is not “high school” • Does it improve the effectiveness of classification? • Fewer documents per category • Overall improvement of about 5%

Theseus: Identifying Site Structure, Link Identification • Performs initial breadth-first analysis to a depth of 3 • Repeated links (occurrence of 90% or more) are considered structural links and eventually get discarded • Link identification is performed in the initial phase of site analysis • Ability to recognize CGI references
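The structural-link filter could look roughly like the sketch below. It assumes that "occurrence of 90% or more" means a link target appears on at least 90% of the pages analyzed for a site; the slides do not define the measure precisely. The function name and sample site are hypothetical.

```python
from collections import Counter

def split_structural_links(pages, min_fraction=0.9):
    """pages maps a page URL to the list of link targets found on that page.
    Returns the set of structural links and the per-page links that remain."""
    counts = Counter()
    for links in pages.values():
        counts.update(set(links))            # count each target once per page
    structural = {url for url, n in counts.items()
                  if n / len(pages) >= min_fraction}
    content = {page: [u for u in links if u not in structural]
               for page, links in pages.items()}
    return structural, content

# Hypothetical site: every page links to /home, half of them also link to /news.
pages = {
    f"/p{i}": ["/home", f"/item{i}"] + (["/news"] if i % 2 == 0 else [])
    for i in range(10)
}
structural, content = split_structural_links(pages)
```

In this example only "/home" is treated as a structural link (a navigation bar entry, say) and discarded, while "/news" and the per-page item links are kept for categorization.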

Theseus: Integration With a Search Engine • Example: Yahoo! • Several benefits • avoid separate spidering of Web documents • provide support for queries within categories – “Search within this category” • Vice versa • category information can be used to group query results – improved presentation

Theseus: Assessment • Experiment: Categorize a subset of Yahoo! pages • Obtained the same categorization in most cases • Classifies approximately 500 sites per hour • Is more precise • “microbiology journals” instead of “biology journals”

Open Issues • Building category profiles • By hand • Learning techniques • Possible solution: minimal category profiles, to be extended in the learning phase • Proper ranking of documents in the catalog

Part-of-speech Tagging • The task of POS tagging is to assign part-of-speech tags to words, reflecting their syntactic category. Words can often belong to different syntactic categories in different contexts. For instance, the string "books" has two readings: in the sentence "he books tickets" it is a third-person singular verb, but in the sentence "he reads books" it is a plural noun. A POS tagger should segment a word, determine its possible readings, and assign the right reading given the context.
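The "books" ambiguity can be illustrated with a toy tagger. This sketch is not how TreeTagger works; it uses a hypothetical four-word lexicon and a single hand-written context rule, only to show how the preceding tag can select between two readings.

```python
# Hypothetical lexicon: each word lists its possible readings.
LEXICON = {
    "he": ["PRON"],
    "books": ["NOUN", "VERB"],
    "reads": ["VERB"],
    "tickets": ["NOUN"],
}

def tag(sentence):
    """Assign one tag per word, using the previous tag to resolve ambiguity."""
    words = sentence.split()
    tags = []
    for word in words:
        readings = LEXICON.get(word.lower(), ["NOUN"])  # unknown words default to NOUN
        if len(readings) == 1:
            choice = readings[0]
        elif tags and tags[-1] == "PRON" and "VERB" in readings:
            choice = "VERB"    # right after a pronoun subject, prefer the verb reading
        else:
            choice = "NOUN" if "NOUN" in readings else readings[0]
        tags.append(choice)
    return list(zip(words, tags))

verb_reading = tag("he books tickets")
noun_reading = tag("he reads books")
```

In the first sentence "books" follows a pronoun and is tagged as a verb; in the second it follows a verb and is tagged as a noun.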
