
Probe, Count, and Classify: Categorizing Hidden Web Databases


Presentation Transcript


  1. Probe, Count, and Classify: Categorizing Hidden Web Databases. Panagiotis G. Ipeirotis and Luis Gravano (Columbia University), Mehran Sahami (E.piphany Inc.). DIMACS Summer School Tutorial on New Frontiers in Data Mining, Theme: Web -- Thursday, August 16, 2001

  2. Surface Web vs. Hidden Web • Surface Web: link structure, crawlable • Hidden Web: no link structure; documents “hidden” behind search forms

  3. Do We Need the Hidden Web? Example: PubMed/MEDLINE • PubMed (www.ncbi.nlm.nih.gov/PubMed), search “cancer” → 1,341,586 matches • AltaVista, “cancer site:www.ncbi.nlm.nih.gov” → 21,830 matches

  4. Interacting With Searchable Text Databases • Searching: metasearchers • Browsing: Yahoo!-like web directories such as InvisibleWeb.com and SearchEngineGuide.com • Example from InvisibleWeb.com: Health > Publications > PubMED (created manually!)

  5. Classifying Text Databases Automatically: Outline • Definition of classification • Classification through query probing • Experiments

  6. Database Classification: Two Definitions • Coverage-based classification: the database contains many documents about a category. Coverage: number of documents about that category. • Specificity-based classification: the database contains mainly documents about a category. Specificity: fraction of the database about that category (#docs about the category / |DB|).
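
A minimal sketch of these two quantities, assuming the per-category document counts of a database are already known; the category names and numbers below are illustrative, not from the talk.

```python
# Coverage and specificity for one database, assuming we already know
# how many of its documents fall under each category.
# The profile below is hypothetical.

def coverage(category_counts, category):
    """Coverage: absolute number of documents about the category."""
    return category_counts.get(category, 0)

def specificity(category_counts, category, db_size):
    """Specificity: fraction of the database that is about the category."""
    return category_counts.get(category, 0) / db_size if db_size else 0.0

counts = {"Sports": 800, "Health": 200}
db_size = 1000
print(coverage(counts, "Sports"))              # 800
print(specificity(counts, "Sports", db_size))  # 0.8
```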

  7. Database Classification: An Example • Category: Basketball • Coverage-based classification • ESPN.com, NBA.com, not KnicksTerritory.com • Specificity-based classification • NBA.com, KnicksTerritory.com, not ESPN.com

  8. Database Classification: More Details • Tc, Ts: “editorial” choices, thresholds for coverage and specificity • Tc: coverage threshold (e.g., 100) • Ts: specificity threshold (e.g., 0.5) • Ideal(D): set of classes for database D. Class C is in Ideal(D) if: D has “enough” coverage and specificity (Tc, Ts) for C and all of C’s ancestors, and D fails to have both “enough” coverage and specificity for each child of C • [Slide figure: example hierarchy with Root at the top, SPORTS (C=800, S=0.8) and HEALTH (C=200, S=0.2) below it, and BASKETBALL (S=0.5) and BASEBALL (S=0.5) below SPORTS]
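
A minimal sketch of this Ideal(D) definition, assuming the true coverage and specificity of every category are somehow known (the probing machinery that estimates them comes later in the talk); the hierarchy, counts, and thresholds are illustrative.

```python
# Ideal(D): classes C such that the database qualifies (enough coverage AND
# specificity) for C and all of C's ancestors, but for none of C's children.
# Hierarchy, counts, and thresholds below are illustrative.

TC, TS = 100, 0.5

children = {"Root": ["Sports", "Health"],
            "Sports": ["Basketball", "Baseball"],
            "Health": [], "Basketball": [], "Baseball": []}
cov  = {"Root": 1000, "Sports": 800, "Health": 200, "Basketball": 500, "Baseball": 300}
spec = {"Root": 1.0,  "Sports": 0.8,  "Health": 0.2, "Basketball": 0.5, "Baseball": 0.3}

def qualifies(c):
    return cov[c] >= TC and spec[c] >= TS

def ideal(node="Root"):
    """Ideal(D) classes in the subtree rooted at `node`."""
    if not qualifies(node):
        return set()
    qualifying_children = [c for c in children[node] if qualifies(c)]
    if not qualifying_children:
        return {node}      # node qualifies, no child does: node is in Ideal(D)
    result = set()
    for c in qualifying_children:
        result |= ideal(c)
    return result

print(ideal())   # {'Basketball'} with these numbers
```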

  9. From Document to Database Classification • If we know the categories of all documents inside the database, we are done! • We do not have direct access to the documents. • Databases do not export such data! How can we extract this information?

  10. Our Approach: Query Probing • Train a rule-based document classifier. • Transform classifier rules into queries. • Adaptively send queries to databases. • Categorize the databases based on the adjusted number of query matches.

  11. Training a Rule-based Document Classifier • Feature Selection: Zipf’s law pruning, followed by information-theoretic feature selection [Koller & Sahami’96] • Classifier Learning: AT&T’s RIPPER [Cohen 1995] • Input: A set of pre-classified, labeled documents • Output: A set of classification rules • IF linux THEN Computers • IF jordan AND bulls THEN Sports • IF lung AND cancer THEN Health
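
As a concrete illustration of how such rules classify a document, here is a minimal sketch; the rules mirror the slide’s examples, and the simple all-terms-present matching is only a stand-in for RIPPER’s actual behavior.

```python
# Rule-based document classification sketch: a rule fires when every one of
# its antecedent terms appears in the document. Rules mirror the slide's
# examples; the matching logic is an illustration, not RIPPER itself.

rules = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document, rules):
    words = set(document.lower().split())
    return {category for terms, category in rules if terms <= words}

print(classify("Jordan and the Bulls won again", rules))   # {'Sports'}
```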

  12. Constructing Query Probes • Transform each rule into a query: IF lung AND cancer THEN health → +lung +cancer; IF linux THEN computers → +linux • Send the queries to the database • Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule) • These documents would have been classified by the rule under its associated category!
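
A small sketch of this rule-to-probe transformation, assuming the “+term” conjunctive query syntax shown on the slide; the rules and the dictionary layout are illustrative.

```python
# Turn each rule's antecedent into a conjunctive "+term" query probe.
# Rules are illustrative; only the query string construction matters here.

rules = [
    ({"lung", "cancer"}, "Health"),
    ({"linux"}, "Computers"),
]

def rule_to_probe(terms):
    return " ".join("+" + t for t in sorted(terms))

probes = {category: rule_to_probe(terms) for terms, category in rules}
print(probes)   # {'Health': '+cancer +lung', 'Computers': '+linux'}
```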

  13. Adjusting Query Results • Classifiers are not perfect! • Queries do not “retrieve” all the documents in a category • Queries for one category “match” documents not in this category • From the classifier’s training phase we know its “confusion matrix”

  14. Confusion Matrix • M: the classifier’s confusion matrix, estimated during training; each column corresponds to a correct class, each row to the class documents get classified into • Example: 10% of “Sports” documents are classified as “Computers”, so 10% of the 5000 “Sports” docs go to “Computers” • M · Coverage(D) ≈ ECoverage(D)
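
A sketch of how such a matrix could be estimated from held-out labeled documents; the class names and the (correct, assigned) pairs are illustrative, chosen so that 10% of the “Sports” documents land in “Computers” as in the slide’s example.

```python
# Build the confusion matrix M from held-out labeled documents:
# M[i][j] = fraction of documents whose correct class is j that the
# classifier assigns to class i. The data below is hypothetical.
import numpy as np

classes = ["Computers", "Sports", "Health"]
idx = {c: i for i, c in enumerate(classes)}

# (correct_class, assigned_class) pairs for held-out documents
held_out = ([("Sports", "Sports")] * 9 + [("Sports", "Computers")] * 1 +
            [("Computers", "Computers")] * 8 + [("Computers", "Health")] * 2 +
            [("Health", "Health")] * 10)

M = np.zeros((len(classes), len(classes)))
for correct, assigned in held_out:
    M[idx[assigned], idx[correct]] += 1
M /= M.sum(axis=0, keepdims=True)     # normalize each correct-class column

print(np.round(M, 2))
```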

  15. Confusion Matrix Adjustment: Compensating for the Classifier’s Errors • M is diagonally dominant, hence invertible • Coverage(D) ≈ M⁻¹ · ECoverage(D) • Multiplying by M⁻¹ better approximates the correct result
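
A minimal numeric sketch of the adjustment, assuming a 3-category confusion matrix and raw match counts (both made up): multiplying the observed ECoverage vector by M⁻¹ recovers an estimate of the true per-category coverage.

```python
# Confusion-matrix adjustment: the raw probe counts approximate
# M @ Coverage(D), so multiplying by M's inverse recovers a better
# estimate of the true coverage. Matrix and counts are illustrative.
import numpy as np

# Rows: assigned class, columns: correct class (diagonally dominant).
M = np.array([[0.80, 0.10, 0.00],
              [0.10, 0.90, 0.10],
              [0.10, 0.00, 0.90]])

ecoverage = np.array([900.0, 4650.0, 950.0])   # raw match counts per category

coverage = np.linalg.inv(M) @ ecoverage        # Coverage(D) ≈ M^{-1} · ECoverage(D)
print(np.round(coverage))                      # ≈ [ 500. 5000. 1000.]
```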

  16. Classifying a Database • Send the query probes for the top-level categories • Get the number of matches for each probe • Calculate Specificity and Coverage for each category • “Push” the database to the qualifying categories (with Specificity > Ts and Coverage > Tc) • Repeat for each of the qualifying categories • Return the classes that satisfy the coverage/specificity conditions • The result, Approximate(D), is the approximation of the Ideal(D) classification (see the sketch below)
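
A hedged sketch of this top-down loop. `send_probe` is a hypothetical stand-in for sending one query probe and reading off the reported number of matches, and the specificity estimate here (a category’s share of the matches at its level) is a simplification of the paper’s estimator.

```python
# Top-down database classification sketch: probe the current level,
# estimate coverage and specificity, recurse into qualifying categories.
# send_probe(query) is hypothetical; thresholds match the earlier slides.

TC, TS = 100, 0.5

def classify_db(node, children, probes, send_probe):
    """Approximate classification of the database in the subtree under `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    cov = {c: sum(send_probe(q) for q in probes[c]) for c in kids}
    total = sum(cov.values()) or 1
    qualifying = [c for c in kids if cov[c] >= TC and cov[c] / total >= TS]
    if not qualifying:
        return {node}              # deepest level where the database qualifies
    result = set()
    for c in qualifying:
        result |= classify_db(c, children, probes, send_probe)
    return result

# Hypothetical demo with hard-coded match counts standing in for a database.
children = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Baseball"]}
probes = {"Sports": ["+basketball", "+baseball"], "Health": ["+aspirin"],
          "Basketball": ["+basketball"], "Baseball": ["+baseball"]}
fake_counts = {"+basketball": 600, "+baseball": 30, "+aspirin": 10}
print(classify_db("Root", children, probes, lambda q: fake_counts.get(q, 0)))
# {'Basketball'}
```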

  17. Real Example: ACM Digital Library (Tc=100, Ts=0.5)

  18. Experiments: Data • 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles (April-May 2000): • Newsgroups assigned by hand to hierarchy nodes • RIPPER trained with 54,000 articles (1,000 articles per leaf) • 27,000 articles used to estimate the confusion matrices • Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes

  19. Comparison With Alternatives • DS: Random sampling of documents via query probes • Callan et al., SIGMOD’99 • Different task: Gather vocabulary statistics • We adapted it for database classification • TQ: Title-based Probing • Yu et al., WISE 2000 • Query probes are simply the category names

  20. Experiments: Metrics • Accuracy of classification results: • Expanded(N) = N and all its descendants • Correct = Expanded(Ideal(D)) • Classified = Expanded(Approximate(D)) • Precision = |Correct ∩ Classified| / |Classified| • Recall = |Correct ∩ Classified| / |Correct| • F-measure = 2 · Precision · Recall / (Precision + Recall) • Cost of classification: number of queries sent to the database
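
A small sketch of these metrics over an illustrative hierarchy: both the ideal and the approximate node sets are expanded with their descendants before computing precision, recall, and F-measure.

```python
# Evaluation-metric sketch: expand each node set with all descendants,
# then compare the ideal and approximate classifications.
# The hierarchy and the example node sets are illustrative.

children = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Baseball"]}

def expanded(nodes):
    """A node set together with all descendants of its members."""
    out, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(children.get(n, []))
    return out

def f_measure(ideal_nodes, approx_nodes):
    correct, classified = expanded(ideal_nodes), expanded(approx_nodes)
    overlap = correct & classified
    p = len(overlap) / len(classified) if classified else 0.0
    r = len(overlap) / len(correct) if correct else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(f_measure({"Sports"}, {"Basketball"}))   # 0.5
```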

  21. Average F-measure, Controlled Databases (PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing)

  22. Experimental Results: Controlled Databases • Feature selection helps. • Confusion-matrix adjustment helps. • F-measure above 0.8 for most <Tc, Ts> combinations. • Results degrade gracefully with hierarchy depth. • Relatively small number of probes needed for most <Tc, Ts> combinations tried. • Also, probes are short: 1.5 words on average; 4 words maximum. • Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.]

  23. Web Databases • 130 real databases classified from InvisibleWeb™. • Used InvisibleWeb’s categorization as the correct one. • Simple “wrappers” for querying (only the number of matches is needed). • The Ts, Tc thresholds are not known (unlike with the Controlled Databases) but are implicit in the InvisibleWeb categorization. • We can learn/validate the thresholds (tricky but easy!), as sketched below. • More details in the paper!
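
One plausible way to learn the thresholds, sketched under assumptions: grid-search the (Tc, Ts) pair that maximizes average F-measure against the InvisibleWeb categorization on a training split (slide 35 mentions 3-fold cross-validation). `classify_with`, `f_measure`, `ideal_of`, and the candidate grid are hypothetical stand-ins, not the paper’s exact procedure.

```python
# Threshold-learning sketch: pick the (Tc, Ts) pair with the best average
# F-measure on a set of training databases. All helpers and the candidate
# grid below are hypothetical stand-ins.

def learn_thresholds(train_dbs, classify_with, f_measure, ideal_of):
    candidates = [(tc, ts) for tc in (1, 2, 4, 8, 16, 32, 64, 128)
                           for ts in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)]

    def avg_f(tc, ts):
        scores = [f_measure(ideal_of(db), classify_with(db, tc, ts))
                  for db in train_dbs]
        return sum(scores) / len(scores)

    return max(candidates, key=lambda pair: avg_f(*pair))
```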

  24. Web Databases: Learning Thresholds

  25. Experimental Results: Web Databases • 130 real Web databases. • F-measure above 0.7 for the best <Tc, Ts> combination learned. • 185 query probes per database needed on average for classification. • Also, probes are short: 1.5 words on average; 4 words maximum.

  26. Conclusions • Accurate classification using only a small number of short queries • No need for document retrieval • Only need a result like: “X matches found” • No need for any cooperation or special metadata from databases • URL: http://qprober.cs.columbia.edu

  27. Current and Future Work • Build “wrappers” automatically • Extend to non-topical categories • Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked) • Extend to other classifiers (e.g., SVMs or Bayesian models) • Integrate with searching (connection with database selection?)

  28. Questions?

  29. Contributions • Easy, inexpensive method for database classification • Uses results from document classification • “Indirect” classification of the documents in a database: does not inspect documents, only the number of matches • Adjustment of results according to the classifier’s performance • Easy wrapper construction • No need for any metadata from the database

  30. Related Work • Callan et al., SIGMOD 1999 • Gauch et al., ProFusion • Dolin et al., Pharos • Yu et al., WISE 2000 • Raghavan and Garcia-Molina, VLDB 2001

  31. Controlled Databases • 500 databases built using 419,000 newsgroup articles • One label per document • 350 databases with a single (not necessarily leaf) category • 150 databases with varying category mixes • Database sizes range from 25 to 25,000 articles • Indexed and queried using SMART

  32. F-measure for Different Hierarchy Depths (PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing; Tc=8, Ts=0.3)

  33. Query Probes Per Controlled Database

  34. Web Databases: Number of Query Probes

  35. 3-fold Cross-validation

  36. Real Confusion Matrix for Top Node of Hierarchy
