
Automatic Classification of Text Databases Through Query Probing

Panagiotis G. Ipeirotis

Luis Gravano

Columbia University

Mehran Sahami

E.piphany Inc.


Search-only Text Databases

  • Sources of valuable information

  • Hidden behind search interfaces

  • Non-crawlable

    Example: Microsoft Support KB


Interacting With Searchable Text Databases

  • Searching: Metasearchers

  • Browsing: Use Yahoo-like directories

  • Browse & search: “Category-enabled” metasearchers


Searching Text Databases: Metasearchers

  • Select the good databases for a query

  • Evaluate the query at these databases

  • Combine the query results from the databases

    Examples: MetaCrawler, SavvySearch, Profusion


Browsing Through Text Databases

  • Yahoo-like web directories:

    • InvisibleWeb.com

    • SearchEngineGuide.com

    • TheBigHub.com

      Example from InvisibleWeb.com

      Computers > Publications > ACM DL

  • Category-enabled metasearchers

    • User-defined category (e.g. Recipes)


Problem With Current Classification Approach

  • Classification of databases is done manually

  • This requires a lot of human effort!


How to Classify Text Databases Automatically: Outline

  • Definition of classification

  • Strategies for classifying searchable databases through query probing

  • Initial experiments


Database Classification: Two Definitions

  • Coverage-based classification:

    • The database contains many documents about the category (e.g. Basketball)

    • Coverage: #docs about this category

  • Specificity-based classification:

    • The database contains mainly documents about this category

    • Specificity: (#docs about this category) / |DB|
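
A minimal sketch of the two measures (the document counts below are hypothetical, chosen to match the Basketball example on the next slide):

```python
# Hypothetical database of 5,000 documents, 1,000 of them about Basketball.
db_size = 5000
docs_about_basketball = 1000

coverage = docs_about_basketball                 # absolute count: 1000
specificity = docs_about_basketball / db_size    # fraction of the DB: 0.2
```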


Database Classification: An Example

  • Category: Basketball

  • Coverage-based classification

    • ESPN.com, NBA.com

  • Specificity-based classification

    • NBA.com, but not ESPN.com


Categorizing a Text Database: Two Problems

  • Find the category of a given document

  • Find the category of all the documents inside the database


Categorizing Documents

  • Several text classifiers available

  • RIPPER (AT&T Research, William Cohen 1995)

    • Input: A set of pre-classified, labeled documents

    • Output: A set of classification rules


Categorizing Documents: RIPPER

  • Training set: Preclassified documents

    • “Linux as a web server”: Computers

    • “Linux vs. Windows: …”: Computers

    • “Jordan was the leader of Chicago Bulls”: Sports

    • “Smoking causes lung cancer”: Health

  • Output: Rule-based classifier

    • IF linux THEN Computers

    • IF jordan AND bulls THEN Sports

    • IF lung AND cancer THEN Health
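
A minimal sketch of applying a RIPPER-style rule set (the rules mirror the slide; the whole-word tokenization and first-match policy are simplifying assumptions, not RIPPER's actual behavior):

```python
# Minimal rule-based classifier in the spirit of the RIPPER output above.
# A document matches a rule if it contains every word in the rule's set.

RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document):
    """Return the category of the first rule whose words all appear."""
    words = set(document.lower().split())
    for required, category in RULES:
        if required <= words:      # subset test: all rule words present
            return category
    return None                    # no rule fired

print(classify("Linux as a web server"))                   # Computers
print(classify("Jordan was the leader of Chicago Bulls"))  # Sports
```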


Precision and Recall of Document Classifier

During the training phase:

  • 100 documents about computers

  • “Computer” rules matched 50 docs

  • From these 50 docs, 40 were about computers

    • Precision = 40/50 = 0.8

    • Recall = 40/100 = 0.4
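
The same arithmetic as a short sketch (the numbers are the ones from this slide):

```python
docs_in_category = 100     # training documents truly about computers
matched, correct = 50, 40  # docs the "Computers" rules fired on / correctly

precision = correct / matched          # 40/50 = 0.8
recall = correct / docs_in_category    # 40/100 = 0.4
```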


From Document to Database Classification

  • If we know the categories of all the documents, we are done!

  • But databases do not export such data!

    How can we extract this information?


Our Approach: Query Probing

  • Design a small set of queries to probe the databases

  • Categorize the database based on the probing results


Designing and Implementing Query Probes

The probes should extract information about the categories of the documents in the database

  • Start with a document classifier (RIPPER)

  • Transform each rule into a query

    IF lung AND cancer THEN Health → +lung +cancer

    IF linux THEN Computers → +linux

  • Get number of matches for each query


Three Categories and Three Databases

linux computers

ACM DL

jordan AND bulls sports

lung AND cancer health

NBA.com

PubMED


Using the Results for Classification

We use the results to estimate coverage and specificity values
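
A first-cut sketch of this step using raw match counts (the counts are hypothetical, and summing them as an estimate of |DB| assumes the categories roughly partition the database; the next slide explains why the raw counts also need adjusting):

```python
# Raw match counts per category, summed over that category's probe queries.
matches = {"Computers": 900, "Sports": 80, "Health": 20}   # hypothetical

db_size = sum(matches.values())   # rough |DB| estimate from the probes

coverage = dict(matches)                                   # docs per category
specificity = {c: n / db_size for c, n in matches.items()}
print(specificity)   # {'Computers': 0.9, 'Sports': 0.08, 'Health': 0.02}
```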


Adjusting Query Results

  • Classifiers are not perfect!

    • Queries do not “retrieve” all the documents that belong to a category

    • Queries for one category “match” documents that do not belong to this category

  • From the training phase of the classifier, we use its precision and recall


Precision & Recall Adjustment

  • "Computers" category:

    • Rule: “linux”, Precision = 0.7

    • Rule: “cpu”, Precision = 0.9

    • Recall (for all the rules) = 0.4

  • Probing with queries for “Computers”:

    • Query: +linux → X1 matches → 0.7·X1 correct matches

    • Query: +cpu → X2 matches → 0.9·X2 correct matches

  • From the X1 + X2 documents found:

    • Expect 0.7·X1 + 0.9·X2 to be correct

    • Expect (0.7·X1 + 0.9·X2) / 0.4 total computer docs
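
The adjustment as a sketch (the precisions and recall come from the classifier's training phase, as above; the match counts X1 = X2 = 100 are hypothetical):

```python
def adjusted_count(matches_per_rule, precisions, recall):
    """Estimate the true number of category documents from probe matches."""
    # Scale each rule's matches by its precision to drop false positives,
    # then divide by recall to account for documents the rules missed.
    correct = sum(p * x for p, x in zip(precisions, matches_per_rule))
    return correct / recall

# Worked example with the numbers from this slide:
print(adjusted_count([100, 100], [0.7, 0.9], 0.4))   # (70 + 90) / 0.4 = 400.0
```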


Initial Experiments

  • Used a collection of 20,000 newsgroup articles

  • Formed 5 categories:

    • Computers (comp.*)

    • Science (sci.*)

    • Hobbies (rec.*)

    • Society (soc.* + alt.atheism)

    • Misc (misc.forsale)

  • RIPPER trained with 10,000 newsgroup articles

  • Classifier: 29 rules, 32 words used

    • IF windows AND pc THEN Computers (precision~0.75)

    • IF satellite AND space THEN Science (precision~0.9)


Web Databases Probed

  • Using the newsgroup classifier, we probed four web databases:

    • Cora (www.cora.jprc.com)

      CS Papers archive (Computers)

    • American Scientist (www.amsci.org)

      Science and technology magazine (Science)

    • All Outdoors (www.alloutdoors.com)

      Articles about outdoor activities (Hobbies)

    • Religion Today (www.religiontoday.com)

      News and discussion about religions (Society)


Results

  • Only 29 queries per web site

  • No need for document retrieval!


Conclusions

  • Easy classification using only a small number of queries

  • No need for document retrieval

    • Only need a result like: “X matches found”

  • Not limited to search-only databases

    • Every searchable database can be classified this way

  • Not limited to topical classification


Current Issues

  • Comprehensive classification scheme

  • Representative training data


Future Work

  • Use a hierarchical classification scheme

  • Test different search interfaces

    • Boolean model

    • Vector-space model

    • Different capabilities

  • Compare with document sampling (Callan et al.’s work – SIGMOD99, adapted for the classification task)

  • Study classification efficiency when documents are accessible


Related Work

  • Gauch (JUCS 1996)

  • Etzioni et al. (JIIS 1997)

  • Hawking & Thistlewaite (TOIS 1999)

  • Callan et al. (SIGMOD 1999)

  • Meng et al. (CoopIS 1999)

