When one Sample is not Enough: Improving Text Database Selection using Shrinkage

Presentation Transcript


  1. When one Sample is not Enough: Improving Text Database Selection using Shrinkage. Panos Ipeirotis, Luis Gravano. Computer Science Department, Columbia University

  2. “Regular” Web Pages and Text Databases • “Regular” Web • Link structure • Crawlable • Documents indexed by search engines • Text Databases (a.k.a. “Hidden Web”, “Deep Web”…) • Usually no link structure • Documents “hidden” in databases • Documents not indexed by search engines • Need to query each collection individually Panos Ipeirotis - Columbia University

  3. Text Databases: Examples • Search on U.S. Patent and Trademark Office (USPTO) database: [wireless network] → 26,012 matches (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html) • Search on Google restricted to USPTO database site: [wireless network site:patft.uspto.gov] → 0 matches as of June 10th, 2004

  4. Metasearchers Provide Access to Distributed Text Databases • Database selection relies on simple content summaries: vocabulary, word frequencies • Example content summary: PubMed (11,868,552 documents): …, aids 121,491, cancer 1,562,477, heart 691,360, hepatitis 121,129, thrombopenia 27,960, … • [Figure: a metasearcher must decide where to send the query [thrombopenia] among PubMed, NYTimes Archives, and USPTO; the content summaries shown list thrombopenia with frequencies 27,960, 42, and 0]

  5. Extracting Content Summaries from Autonomous Text Databases • Step 1: Send queries to databases • Step 2: Retrieve top matching documents • Step 3: If the stopping criterion is met (e.g., sample > 300 docs), exit; else go to Step 1 • Content summary contains the words in the sample and the document frequency of each word • Problem: Summaries from small samples are highly incomplete
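
The sampling loop on slide 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `search` is a hypothetical callable standing in for a database's query interface, query words are drawn at random from the words seen so far (one possible choice that the slide does not specify), and `top_k`, `max_queries`, and the toy data are assumptions.

```python
import random
from collections import Counter


def sample_content_summary(search, seed_words, max_docs=300, top_k=4,
                           max_queries=1000, seed=0):
    """Build a sample-based content summary by repeated querying.

    `search(word, top_k)` is a hypothetical interface that returns up to
    `top_k` (doc_id, text) pairs matching `word`.
    """
    rng = random.Random(seed)
    vocabulary = set(seed_words)   # candidate query words
    sampled = {}                   # doc_id -> set of words in that document
    queries_sent = 0
    # Stop when the sample is large enough or the query budget is spent.
    while vocabulary and len(sampled) < max_docs and queries_sent < max_queries:
        word = rng.choice(sorted(vocabulary))           # Step 1: send a query
        queries_sent += 1
        for doc_id, text in search(word, top_k):        # Step 2: top matches
            if doc_id not in sampled and len(sampled) < max_docs:
                sampled[doc_id] = set(text.lower().split())
                vocabulary |= sampled[doc_id]
    # Content summary: document frequency of each word within the sample.
    doc_freq = Counter()
    for words in sampled.values():
        doc_freq.update(words)
    return len(sampled), doc_freq


# Toy database (made-up documents, for illustration only).
toy_docs = {
    1: "cancer metastasis treatment",
    2: "cancer treatment",
    3: "heart treatment",
}


def toy_search(word, top_k):
    matches = [(i, t) for i, t in toy_docs.items() if word in t.split()]
    return matches[:top_k]


sample_size, summary = sample_content_summary(toy_search, ["treatment"], max_docs=3)
print(sample_size, dict(summary))
```

Dividing each document frequency by `sample_size` gives the word-probability estimates that the later slides adjust.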

  6. Problem: Summaries Derived from Small Samples Fundamentally Incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important • Small document samples miss many low-frequency words • [Figure: Log(Frequency) vs. Rank for the 10% most frequent words in the PubMed database, sample = 300; example: endocarditis, ~9,000 docs / ~0.1%]

  7. Improving Sample-based Content Summaries • Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other • Classification available from directories (e.g., Open Directory) or derived automatically (e.g., QProber) • Challenge: Improve content summary quality without increasing sample size

  8. Databases with Similar Topics • Cancerlit contains “metastasis”, not found during sampling • CancerBacup contains “metastasis” • Databases under same category have similar vocabularies, and can complement each other

  9. Content Summaries for Categories • Databases under same category share similar vocabulary • Higher-level category content summaries provide additional useful estimates of “word probabilities” • Can use all estimates in category path

  10. Enhancing Summaries Using “Shrinkage” • Word-probability estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to database • By combining estimates from category and database content summaries we get better estimates

  11. Shrinkage-based Estimations • Adjust probability estimates: Pr[metastasis | D] = λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000 • Select λi weights to maximize the probability that the summary of D is from a database under all its parent categories

  12. Computing Shrinkage-based Summaries • Category path: Root, Health, Cancer, D • Pr[metastasis | D] = λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000 • Pr[treatment | D] = λ1 * 0.015 + λ2 * 0.12 + λ3 * 0.179 + λ4 * 0.184 • … • Automatic computation of λi weights using an EM algorithm • Computation performed offline → No query overhead • Avoids “sparse data” problem and decreases estimation risk
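
The slides do not spell out the EM procedure, so the following is only a generic EM sketch for fitting mixture weights, not the authors' exact algorithm. It assumes the four estimates on slide 12 are listed in the order Root, Health, Cancer, D, and it fits the weights against hypothetical held-out word counts (`held_out_counts`). Some form of held-out data is needed: fitting against the same sample that produced D's own estimates would push all the weight onto D.

```python
def em_mixture_weights(components, counts, iterations=50):
    """Fit mixture weights (the lambdas) by EM.

    components: list of dicts mapping word -> probability estimate
    counts:     dict mapping word -> held-out count (an assumption here)
    """
    k = len(components)
    lam = [1.0 / k] * k                      # start from uniform weights
    for _ in range(iterations):
        beta = [0.0] * k
        for word, c in counts.items():
            parts = [lam[i] * components[i].get(word, 0.0) for i in range(k)]
            total = sum(parts)
            if total == 0.0:
                continue                     # no component explains this word
            # E-step: share of this word's count attributed to each component
            for i in range(k):
                beta[i] += c * parts[i] / total
        norm = sum(beta)
        if norm == 0.0:
            break
        lam = [b / norm for b in beta]       # M-step: renormalize the shares
    return lam


def shrunk_probability(word, components, lam):
    """Shrinkage estimate: weighted sum of the per-component estimates."""
    return sum(l * p.get(word, 0.0) for l, p in zip(lam, components))


# Estimates from slide 12, assumed to be ordered Root, Health, Cancer, D.
components = [
    {"metastasis": 0.002, "treatment": 0.015},   # Root
    {"metastasis": 0.05, "treatment": 0.12},     # Health
    {"metastasis": 0.092, "treatment": 0.179},   # Cancer
    {"metastasis": 0.000, "treatment": 0.184},   # D (sample-based)
]
held_out_counts = {"metastasis": 3, "treatment": 10}   # hypothetical numbers

lam = em_mixture_weights(components, held_out_counts)
print(lam, shrunk_probability("metastasis", components, lam))
```

Because the category components assign metastasis a non-zero probability, the shrunk estimate for D becomes non-zero even though D's own sample never saw the word.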

  13. Shrinkage Weights and Summary • [Table: new estimates vs. old estimates] • Shrinkage increases word-probability estimates for underestimates (e.g., metastasis) • Shrinkage decreases word-probability estimates for overestimates (e.g., aids) • …it also introduces (with small probabilities) spurious words (e.g., football)

  14. Is Shrinkage Always Necessary? • Shrinkage used to reduce uncertainty (variance) of estimations • Small samples of large databases → high variance (in sample: 10 out of 100 documents contain metastasis; in database: ? out of 10,000,000 documents) • Small samples of small databases → small variance (in sample: 10 out of 100 documents contain metastasis; in database: ? out of 200 documents) • Shrinkage less useful (or even harmful) when uncertainty is low

  15. Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When word frequency estimates are uncertain, assigned score has high variance: shrinkage improves score estimates • When word frequency estimates are reliable, assigned score has small variance: shrinkage unnecessary • [Figures: probability distribution of the database score for a query, over the range 0 to 1; unreliable score estimate: use shrinkage; reliable score estimate: shrinkage might hurt] • Solution: Use shrinkage adaptively in a query- and database-specific manner

  16. Searching Algorithm • One-time process: (1) Extract document samples (2) Get database classification (3) Compute shrinkage-based summaries • For every query Q: (1) For each database D: (a) use a regular database selection algorithm to compute the query score for D using the old, “unshrunk” summary; (b) analyze the uncertainty of the score; (c) if the uncertainty is high, use the new, shrinkage-based summary instead and compute a new query score for D (2) Evaluate Q over the top-k scoring databases
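
The per-query part of slide 16 might look roughly like the sketch below. The scoring function and the uncertainty test are passed in as parameters because the slides do not define them; `toy_score` and `toy_is_uncertain` are hypothetical stand-ins (not CORI, bGlOSS, or the authors' variance analysis), and the two databases and their probabilities are made up.

```python
def select_databases(query, databases, score, is_uncertain, k=3):
    """Rank databases for a query, applying shrinkage adaptively."""
    scored = []
    for db in databases:
        # Score with the old, unshrunk summary first.
        s = score(query, db["unshrunk"])
        # Switch to the shrinkage-based summary only if the score is uncertain.
        if is_uncertain(query, db):
            s = score(query, db["shrunk"])
        scored.append((s, db["name"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


def toy_score(query, summary):
    """Hypothetical score: product of the query words' probabilities."""
    result = 1.0
    for word in query:
        result *= summary.get(word, 0.0)
    return result


def toy_is_uncertain(query, db):
    """Hypothetical test: a query word is missing from the unshrunk summary."""
    return any(word not in db["unshrunk"] for word in query)


# Made-up summaries for illustration.
databases = [
    {"name": "A",
     "unshrunk": {"cancer": 0.30},
     "shrunk": {"cancer": 0.28, "metastasis": 0.04}},
    {"name": "B",
     "unshrunk": {"cancer": 0.10, "metastasis": 0.001},
     "shrunk": {"cancer": 0.10, "metastasis": 0.002}},
]

query = ["metastasis"]
adaptive = select_databases(query, databases, toy_score, toy_is_uncertain, k=1)
never_shrink = select_databases(query, databases, toy_score, lambda q, d: False, k=1)
print(adaptive, never_shrink)
```

In this toy case database A is selected only when shrinkage is applied, because its sample missed the query word.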

  17. Evaluation: Goals • Examine quality of shrinkage-based summaries • Examine effect of shrinkage on database selection

  18. Experimental Setup • Three data sets: two standard testbeds from TREC (“Text Retrieval Conference”), with 200 databases and 100 queries with associated human-assigned document relevance judgments; 315 real Web databases • Two sets of experiments: content summary quality; database selection accuracy

  19. Results: Content Summary Quality • Recall: How many words in the database are also in the summary? Shrinkage-based summaries include 10%-90% more words than unshrunk summaries • Precision: How many words in the summary are also in the database? Shrinkage-based summaries include 5%-15% words that are not in the actual database
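
As a rough illustration of the two questions on slide 19, recall and precision can be computed as plain vocabulary overlaps. This is a simplified, unweighted reading of the slide; the exact definitions used in the experiments are not given here, and the word sets below are made up.

```python
def summary_recall(summary_words, database_words):
    """Fraction of the database's words that also appear in the summary."""
    database = set(database_words)
    if not database:
        return 0.0
    return len(database & set(summary_words)) / len(database)


def summary_precision(summary_words, database_words):
    """Fraction of the summary's words that also appear in the database."""
    summary = set(summary_words)
    if not summary:
        return 0.0
    return len(summary & set(database_words)) / len(summary)


# Made-up vocabularies: "football" plays the role of a spurious word.
summary_vocab = {"cancer", "metastasis", "football"}
database_vocab = {"cancer", "metastasis", "treatment", "tumor"}

recall = summary_recall(summary_vocab, database_vocab)
precision = summary_precision(summary_vocab, database_vocab)
print(recall, precision)
```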

  20. Results: Content Summary Quality • Rank correlation: Is word ranking in summary similar to ranking in database? Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries • Kullback-Leibler divergence: Is probability distribution in summary similar to distribution in database? Shrinkage improves bad cases, making very good ones worse → Motivates adaptive application of shrinkage!

  21. Results: Database Selection • Metric: R(K) = X / Y • X = # of relevant documents in the selected K databases • Y = # of relevant documents in the best K databases • [Figure: results for CORI (a state-of-the-art database selection algorithm) with stemming over one TREC testbed]
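
The R(K) metric on slide 21 can be computed as in the sketch below, assuming the number of relevant documents per database is known for the query (as with the TREC relevance judgments). The function name and the counts are illustrative assumptions.

```python
def r_at_k(selected, relevant_counts, k):
    """R(K) = X / Y for one query.

    selected:        database names in the order the algorithm ranked them
    relevant_counts: dict mapping database name -> # of relevant documents
    """
    # X: relevant documents in the K databases the algorithm selected.
    x = sum(relevant_counts.get(name, 0) for name in selected[:k])
    # Y: relevant documents in the K databases that hold the most of them.
    y = sum(sorted(relevant_counts.values(), reverse=True)[:k])
    return x / y if y else 0.0


# Made-up relevance counts for illustration.
relevant_counts = {"A": 10, "B": 5, "C": 1}
score = r_at_k(["B", "C", "A"], relevant_counts, k=2)
print(score)
```

Here the selected databases B and C hold 6 relevant documents while the best two (A and B) hold 15, so R(2) is 6 / 15.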

  22. Other Experiments • Choice of database selection algorithm (CORI, bGlOSS, Language Modeling) • Comparison with VLDB’02 hierarchical database selection algorithm • Universal vs. adaptive application of shrinkage • Effect of stemming • Effect of stop-word elimination

  23. Conclusions • Developed strategy to automatically summarize contents of hidden-web text databases • Content summaries are critical for efficient metasearching • Strategy assumes no cooperation from databases • Shrinkage improves content summary quality by exploiting topical similarity • Shrinkage is efficient: no increase in document sample size required • Developed adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way

  24. Thank you! Shrinkage-based content summary generation implemented and available for download at: http://sdarts.cs.columbia.edu Questions?
