Automatic Classification of Text Databases Through Query Probing

Panagiotis G. Ipeirotis

Luis Gravano

Columbia University

Mehran Sahami

E.piphany Inc.


Search-only Text Databases

  • Sources of valuable information

  • Hidden behind search interfaces

  • Non-crawlable

    Example: Microsoft Support KB


Interacting With Searchable Text Databases

  • Searching: Metasearchers

  • Browsing: Use Yahoo-like directories

  • Browse & search: “Category-enabled” metasearchers


Searching Text Databases: Metasearchers

  • Select the good databases for a query

  • Evaluate the query at these databases

  • Combine the query results from the databases

    Examples: MetaCrawler, SavvySearch, Profusion
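The three metasearcher steps above can be sketched as follows. This is only an illustration under stated assumptions: the `relevance` scoring function, the per-database search callables, and the merge-by-raw-score strategy are all hypothetical and are not taken from MetaCrawler, SavvySearch, or Profusion.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (document title, score); a hypothetical result shape


def metasearch(
    query: str,
    databases: Dict[str, Callable[[str], List[Hit]]],
    relevance: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Hit]:
    # Step 1: select the databases that look best for this query.
    ranked = sorted(databases, key=lambda name: relevance(name, query), reverse=True)
    selected = ranked[:top_k]

    # Step 2: evaluate the query at each selected database.
    hits: List[Hit] = []
    for name in selected:
        hits.extend(databases[name](query))

    # Step 3: combine the results (naively, by raw score; an assumption).
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

In practice, scores from different databases are rarely comparable, so the final merge step would need more care than this sketch shows.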


Browsing Through Text Databases

  • Yahoo-like web directories:

    • InvisibleWeb.com

    • SearchEngineGuide.com

    • TheBigHub.com

      Example from InvisibleWeb.com

      Computers > Publications > ACM DL

  • Category-enabled metasearchers

    • User-defined category (e.g. Recipes)


Problem With Current Classification Approach

  • Classification of databases is done manually

  • This requires a lot of human effort!


How to Classify Text Databases Automatically: Outline

  • Definition of classification

  • Strategies for classifying searchable databases through query probing

  • Initial experiments


Database Classification: Two Definitions

  • Coverage-based classification:

    • The database contains many documents about the category (e.g. Basketball)

    • Coverage: #docs about this category

  • Specificity-based classification:

    • The database contains mainly documents about this category

    • Specificity: #docs/|DB|
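The two definitions can be made concrete with a small sketch. The database names and document counts below are hypothetical and chosen only to show how a database can rank high on one measure and low on the other.

```python
def coverage(docs_in_category: int) -> int:
    """Coverage is the raw number of documents about the category."""
    return docs_in_category


def specificity(docs_in_category: int, db_size: int) -> float:
    """Specificity is the fraction of the database that is about the category."""
    if db_size <= 0:
        raise ValueError("db_size must be positive")
    return docs_in_category / db_size


# Hypothetical counts, for illustration only.
general_sports_db = {"basketball_docs": 5_000, "size": 100_000}
basketball_db = {"basketball_docs": 4_000, "size": 5_000}

for name, db in [("general sports", general_sports_db),
                 ("basketball only", basketball_db)]:
    print(name,
          coverage(db["basketball_docs"]),
          specificity(db["basketball_docs"], db["size"]))
```

With these made-up numbers, the general sports database has the higher coverage, while the basketball-only database has the higher specificity.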


Database Classification: An Example

  • Category: Basketball

  • Coverage-based classification

    • ESPN.com, NBA.com

  • Specificity-based classification

    • NBA.com, but not ESPN.com


Categorizing a Text Database: Two Problems

  • Find the category of a given document

  • Find the category of all the documents inside the database


Categorizing Documents

  • Several text classifiers available

  • RIPPER (AT&T Research, William Cohen 1995)

    • Input: A set of pre-classified, labeled documents

    • Output: A set of classification rules


Categorizing Documents: RIPPER

  • Training set: Preclassified documents

    • “Linux as a web server”: Computers

    • “Linux vs. Windows: …”: Computers

    • “Jordan was the leader of Chicago Bulls”: Sports

    • “Smoking causes lung cancer”: Health

  • Output: Rule-based classifier

    • IF linux THEN Computers

    • IF jordan AND bulls THEN Sports

    • IF lung AND cancer THEN Health
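A rule set like the one above can be applied with very little code. The sketch below only applies already-learned rules; it does not implement RIPPER's rule learning, and the tokenization and first-match-wins policy are assumptions made for illustration.

```python
import re
from typing import List, Optional, Set, Tuple

# Each rule: (words that must all appear, category). Taken from the slide.
RULES: List[Tuple[Set[str], str]] = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]


def classify(text: str, rules: List[Tuple[Set[str], str]] = RULES) -> Optional[str]:
    """Return the category of the first rule whose words all occur in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for required, category in rules:
        if required <= words:  # every required word is present
            return category
    return None  # no rule fired
```

For example, `classify("Smoking causes lung cancer")` returns `"Health"`, while a text mentioning only "jordan" matches no rule.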


Precision and Recall of Document Classifier

During the training phase:

  • 100 documents about computers

  • “Computer” rules matched 50 docs

  • From these 50 docs 40 were about computers

    • Precision = 40/50 = 0.8

    • Recall = 40/100 = 0.4
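The arithmetic above can be written as a small helper. The function and parameter names are illustrative and not part of the original presentation.

```python
from typing import Tuple


def precision_recall(matched: int, correct: int, relevant: int) -> Tuple[float, float]:
    """Compute (precision, recall) for one category.

    matched:  documents the category's rules fired on
    correct:  matched documents that truly belong to the category
    relevant: all documents that truly belong to the category
    """
    if matched <= 0 or relevant <= 0:
        raise ValueError("matched and relevant must be positive")
    return correct / matched, correct / relevant


# Numbers from the slide: 100 computer docs, 50 matched, 40 of those correct.
precision, recall = precision_recall(matched=50, correct=40, relevant=100)
```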


From Document to Database Classification

  • If we know the categories of all the documents, we are done!

  • But databases do not export such data!

    How can we extract this information?


Our Approach: Query Probing

  • Design a small set of queries to probe the databases

  • Categorize the database based on the probing results


Designing and Implementing Query Probes

The probes should extract information about the categories of the documents in the database

  • Start with a document classifier (RIPPER)

  • Transform each rule into a query

    IF lung AND cancer THEN Health → +lung +cancer

    IF linux THEN Computers → +linux

  • Get number of matches for each query
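The rule-to-query transformation and the counting step can be sketched as below. The `count_matches` callable stands in for a real search interface and is hypothetical, as is the toy in-memory "database" used to exercise it.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[List[str], str]  # (required words, category)


def rule_to_query(words: List[str]) -> str:
    """Turn the words of one rule into a conjunctive query, e.g. '+lung +cancer'."""
    return " ".join("+" + word.lower() for word in words)


def probe(rules: List[Rule], count_matches: Callable[[str], int]) -> Dict[str, int]:
    """Send one query per rule and sum the reported match counts per category."""
    totals: Dict[str, int] = {}
    for words, category in rules:
        totals[category] = totals.get(category, 0) + count_matches(rule_to_query(words))
    return totals


def make_toy_counter(docs: List[str]) -> Callable[[str], int]:
    """Hypothetical stand-in for a search interface over an in-memory list."""
    tokenized = [set(doc.lower().split()) for doc in docs]

    def count_matches(query: str) -> int:
        required = {term.lstrip("+") for term in query.split()}
        return sum(1 for words in tokenized if required <= words)

    return count_matches
```

Only the match counts are used; no documents are retrieved from the database being probed.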


Three Categories and Three Databases

linux computers

ACM DL

jordan AND bulls sports

lung AND cancer health

NBA.com

PubMED


Using the Results for Classification

We use the results to estimate coverage and specificity values


Adjusting Query Results

  • Classifiers are not perfect!

    • Queries do not “retrieve” all the documents that belong to a category

    • Queries for one category “match” documents that do not belong to this category

  • To adjust the match counts, we use the precision and recall measured during the classifier's training phase


Precision & Recall Adjustment

  • Computer-category:

    • Rule: “linux”, Precision = 0.7

    • Rule: “cpu”, Precision = 0.9

    • Recall (for all the rules) = 0.4

  • Probing with queries for “Computers”:

    • Query: +linux → X1 matches → 0.7 X1 correct matches

    • Query: +cpu → X2 matches → 0.9 X2 correct matches

  • From the X1 + X2 documents found:

    • Expect 0.7 X1 + 0.9 X2 to be correct

    • Expect (0.7 X1 + 0.9 X2) / 0.4 total computer docs
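The adjustment above amounts to a precision-weighted sum of match counts divided by the category's recall. A minimal sketch follows; the function name and the example match counts (100 and 50) are hypothetical, while the precision and recall values come from the slide.

```python
from typing import List, Tuple


def estimate_category_docs(probe_results: List[Tuple[float, int]], recall: float) -> float:
    """Estimate how many documents in the database belong to a category.

    probe_results: one (rule precision, number of matches) pair per query
    recall:        recall of the category's rules, measured during training
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    # Scale each match count by its rule's precision to discount false matches.
    expected_correct = sum(precision * matches for precision, matches in probe_results)
    # Divide by recall to account for category documents that no rule retrieves.
    return expected_correct / recall


# Hypothetical match counts: X1 = 100 for +linux, X2 = 50 for +cpu.
estimate = estimate_category_docs([(0.7, 100), (0.9, 50)], recall=0.4)
```

With these counts, the estimate is (0.7 × 100 + 0.9 × 50) / 0.4 = 287.5 computer documents.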


Initial Experiments

  • Used a collection of 20,000 newsgroup articles

  • Formed 5 categories:

    • Computers (comp.*)

    • Science (sci.*)

    • Hobbies (rec.*)

    • Society (soc.* + alt.atheism)

    • Misc (misc.sale)

  • RIPPER trained with 10,000 newsgroup articles

  • Classifier: 29 rules, 32 words used

    • IF windows AND pc THEN Computers (precision~0.75)

    • IF satellite AND space THEN Science (precision~0.9)


Web Databases Probed

  • Using the newsgroup classifier we probed four web databases:

    • Cora (www.cora.jprc.com)

      CS Papers archive (Computers)

    • American Scientist (www.amsci.org)

      Science and technology magazine (Science)

    • All Outdoors (www.alloutdoors.com)

      Articles about outdoor activities (Hobbies)

    • Religion Today (www.religiontoday.com)

      News and discussion about religions (Society)


Results

  • Only 29 queries per web site

  • No need for document retrieval!


Conclusions

  • Easy classification using only a small number of queries

  • No need for document retrieval

    • Only need a result like: “X matches found”

  • Not limited to search-only databases

    • Every searchable database can be classified this way

  • Not limited to topical classification
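Since only a line such as “X matches found” is needed from each result page, the count can be pulled out with a regular expression. The pattern below is an assumption about the wording; real search interfaces phrase their result counts differently and would each need their own pattern.

```python
import re
from typing import Optional

# Assumed wording: "<number> matches found", with optional thousands separators.
MATCH_COUNT = re.compile(r"(\d[\d,]*)\s+match(?:es)?\s+found", re.IGNORECASE)


def parse_match_count(page_text: str) -> Optional[int]:
    """Return the reported number of matches, or None if no count is found."""
    found = MATCH_COUNT.search(page_text)
    if found is None:
        return None
    return int(found.group(1).replace(",", ""))
```

Returning `None` when nothing matches lets a caller tell an unparseable page apart from a genuine count of zero.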


Current Issues

  • Comprehensive classification scheme

  • Representative training data


Future Work

  • Use a hierarchical classification scheme

  • Test different search interfaces

    • Boolean model

    • Vector-space model

    • Different capabilities

  • Compare with document sampling (Callan et al.’s work – SIGMOD99, adapted for the classification task)

  • Study classification efficiency when documents are accessible


Related Work

  • Gauch (JUCS 1996)

  • Etzioni et al. (JIIS 1997)

  • Hawking & Thistlewaite (TOIS 1999)

  • Callan et al. (SIGMOD 1999)

  • Meng et al. (CoopIS 1999)

