
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection


Presentation Transcript


  1. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection
  Panagiotis G. Ipeirotis, Luis Gravano
  Computer Science Department, Columbia University

  2. Distributed Search? Why? “Surface” Web vs. “Hidden” Web
  “Surface” Web:
  • Link structure
  • Crawlable
  • Documents indexed by search engines
  “Hidden” Web:
  • No link structure
  • Documents “hidden” in databases
  • Documents not indexed by search engines
  • Need to query each collection individually

  3. Hidden Web: Examples
  • PubMed search: [diabetes] → 178,975 matches (PubMed is at http://www.ncbi.nlm.nih.gov/PubMed)
  • Google search: [diabetes site:www.ncbi.nlm.nih.gov] → 119 matches

  4. Distributed Search: Challenges
  • Select good databases for the query
  • Evaluate the query at these databases
  • Merge results from the databases
  [Figure: a Hidden Web metasearcher over PubMed, the Library of Congress, and ESPN; each database exposes a content summary (vocabulary, word frequencies), e.g., “kidneys 220,000 / stones 40,000”, “kidneys 20 / stones 950”, “kidneys 5 / stones 40”]

  5. Database Selection Problems
  • How to extract content summaries?
  • How to use the extracted content summaries?
  [Figure: a metasearcher routing the query “cancer” across three Web databases with content summaries: Database 1 (basketball 4, cancer 4,532, cpu 23), Database 2 (basketball 4, cancer 60,298, cpu 0), Database 3 (basketball 6,340, cancer 2, cpu 0)]

  6. Extracting Content Summaries from Web Databases
  • No direct access to remote documents other than by querying
  • Resort to query-based document sampling:
    • Send queries to the database
    • Retrieve a document sample
    • Use the sample to create an approximate content summary

  7. “Random” Query-Based Sampling (Callan et al., SIGMOD’99, TOIS 2001)
  • Pick a word and send it as a query to the database
  • Retrieve the top-k documents returned (e.g., k = 4)
  • Repeat until “enough” (e.g., 300) documents are retrieved
  • Use word frequencies in the sample to create the content summary

  Word        Frequency in Sample
  cancer      150 (out of 300)
  aids        114 (out of 300)
  heart        98 (out of 300)
  …
  basketball    2 (out of 300)
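For concreteness, here is a minimal Python sketch of this sampling loop. The probe dictionary and the search_database(query, k) helper, which returns (doc_id, text) pairs for the top-k matches, are hypothetical stand-ins for the remote database's search interface.

import random
from collections import Counter

def random_query_sampling(dictionary, search_database, k=4, sample_size=300):
    # Probe with random one-word queries until "enough" documents are sampled.
    sample_texts, seen_ids = [], set()
    for _ in range(10 * sample_size):              # bound the number of probes
        if len(sample_texts) >= sample_size:
            break
        query = random.choice(dictionary)          # pick a word at random
        for doc_id, text in search_database(query, k):   # top-k matches only
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                sample_texts.append(text)
    # Word document frequencies within the sample form the approximate summary.
    summary = Counter()
    for text in sample_texts:
        summary.update(set(text.lower().split()))
    return summary      # e.g., summary["cancer"] == 150 (out of 300 documents)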

  8. Random Sampling: Problems
  • No actual word frequencies are computed for content summaries, only a “ranking” of words
  • Many words are missing from content summaries (many rare words)
  • Many queries return very few or no matches
  • Many words appear in only one or two documents
  [Figure: number of documents vs. word rank, illustrating Zipf’s law]

  9. Our Technique: Focused Probing
  • Train document classifiers
  • Find representative words for each category
  • Use classifier rules to derive a topically focused sample from the database
  • Estimate actual document frequencies for all discovered words

  10. Focused Probing: Training (SIGMOD 2001)
  • Start with a predefined topic hierarchy and preclassified documents
  • Train a document classifier for each node
  • Extract rules from the classifiers:
    • Root level: ibm AND computers → Computers; lung AND cancer → Health; …
    • Health level: angina → Heart; hepatitis AND liver → Hepatitis; …

  11. Focused Probing: Sampling
  Sampling proceeds in rounds; in each round, the rules associated with each node are turned into queries for the database.
  • Transform each rule into a query
  • For each query:
    • Send it to the database
    • Record the number of matches
    • Retrieve the top-k matching documents
  • At the end of each round:
    • Analyze the matches for each category
    • Choose the category to focus on
  Output:
  • A representative document sample
  • Actual frequencies for some “important” words
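A minimal sketch of these probing rounds, assuming hypothetical node objects whose rules attribute holds (terms, child_category) pairs and a search_database(query, k) helper that returns the match count plus the top-k documents; this illustrates the round structure, not the authors' exact implementation.

def focused_probing(root, search_database, k=4, coverage_threshold=0.1):
    node, sample, match_counts = root, [], {}
    while node is not None:
        by_child = {child: 0 for child in node.children}
        for terms, child in node.rules:        # one round: this node's rules
            query = " AND ".join(terms)
            n_matches, top_docs = search_database(query, k)
            match_counts[query] = n_matches    # absolute frequency information
            sample.extend(top_docs)
            by_child[child] += n_matches
        total = sum(by_child.values())
        if not by_child or total == 0:         # leaf reached: stop descending
            break
        # Focus the next round on the child category attracting most matches.
        best = max(by_child, key=by_child.get)
        node = best if by_child[best] / total >= coverage_threshold else None
    return sample, match_counts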

  12. Sample Frequencies and Actual Frequencies
  Document frequencies in the sample:
  • “liver” appears in 200 out of 300 sample documents
  • “kidney” appears in 100 out of 300 sample documents
  • “hepatitis” appears in 30 out of 300 sample documents
  Document frequencies in the actual database?
  • The query “liver” returned 140,000 matches
  • The query “hepatitis” returned 20,000 matches
  • “kidney” was not a query probe…
  We can exploit the number of matches reported for one-word queries.

  13. Adjusting Document Frequencies
  • We know the ranking r of words according to document frequency in the sample
  • We know the absolute document frequency f of some words from one-word queries
  • Mandelbrot’s formula empirically connects word frequency f and rank r
  • We use curve fitting to estimate the absolute frequency of all words in the sample
  [Figure: fitted curve of frequency f vs. rank r]
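A minimal curve-fitting sketch: Mandelbrot's formula models frequency as f(r) = P(r + p)^(-B), a generalization of Zipf's law, so fitting P, p, and B on the words whose absolute frequencies are known from one-word probes lets us extrapolate to every rank in the sample. The probe ranks and match counts below are hypothetical.

import numpy as np
from scipy.optimize import curve_fit

def mandelbrot(r, P, p, B):
    # Mandelbrot's formula: document frequency as a function of word rank.
    return P * (r + p) ** (-B)

def estimate_frequencies(known_ranks, known_freqs, all_ranks):
    # Fit P, p, B on words with known absolute frequencies (one-word probes)...
    (P, p, B), _ = curve_fit(mandelbrot, known_ranks, known_freqs,
                             p0=(max(known_freqs), 1.0, 1.0),
                             bounds=([0, 0, 0], [np.inf, np.inf, np.inf]))
    # ...then evaluate the fitted curve at every rank observed in the sample.
    return mandelbrot(np.asarray(all_ranks, dtype=float), P, p, B)

# Hypothetical example: probe words at ranks 1, 3, 10, and 40 had known match
# counts; the unprobed word at rank 2 (e.g., "kidney") gets an estimate too.
estimates = estimate_frequencies([1, 3, 10, 40],
                                 [140_000, 20_000, 4_000, 600],
                                 [1, 2, 3, 10, 40])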

  14. Actual PubMed Content Summary
  • Extracted automatically
  • ~27,500 words in the extracted content summary
  • Fewer than 200 queries sent
  • At most 4 documents retrieved per query

  PubMed content summary
  Number of documents: 3,868,552
  Category: Health, Diseases
  cancer      1,398,178
  aids          106,512
  heart         281,506
  hepatitis      23,481
  …
  basketball        907
  cpu               487

  The extracted content summary accurately represents the size, contents, and classification of the database.

  15. Focused Probing: Contributions
  • Focuses database sampling on dense topic areas
  • Estimates absolute document frequencies of words
  • Classifies databases along the way
  • The classification is useful for database selection

  16. Database Selection Problems
  • How to extract content summaries?
  • How to use the extracted content summaries?
  [Figure: the same metasearcher scenario as in slide 5: the query “cancer” routed across three Web databases with their content summaries]

  17. Database Selection and Extracted Content Summaries
  • Database selection algorithms assume complete content summaries
  • Content summaries extracted by (small-scale) sampling are inherently incomplete (Zipf’s law)
  • Queries with undiscovered words are problematic
  Database classification helps:
  • Similar topics ↔ similar content summaries
  • Extracted content summaries complement each other

  18. Content Summaries for Categories: Example
  • Cancerlit contains “metastasis”, not found during sampling
  • CancerBacup contains “diabetes”, not found during sampling
  • The Cancer category content summary contains both

  19. Hierarchical DB Selection: Outline
  • Create aggregated content summaries for categories
  • Hierarchically direct queries using the categories
  Category content summaries are more complete than database content summaries.
  Various traversal techniques are possible.
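A minimal sketch of aggregating database summaries into a category summary, treating each summary as a mapping from word to estimated document frequency. Plain summation is just one possible merge (slide 25 lists merging techniques as future work), and the frequencies below are hypothetical stand-ins for the slide-18 databases.

from collections import Counter

def category_summary(db_summaries):
    merged = Counter()
    for summary in db_summaries:
        merged.update(summary)        # add estimated frequencies word by word
    return merged

# Hypothetical frequencies for the slide-18 example databases:
cancerlit = Counter({"cancer": 91_000, "metastasis": 12_000})
cancerbacup = Counter({"cancer": 4_300, "diabetes": 850})
cancer_category = category_summary([cancerlit, cancerbacup])
# cancer_category now covers both "metastasis" and "diabetes"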

  20. Hierarchical DB Selection: Example
  To select D databases:
  • Use a “flat” DB selection algorithm to score the categories
  • Proceed to the category with the highest score
  • Repeat until the category is a leaf, or the category has fewer than D databases
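A minimal sketch of this descent, assuming hypothetical Category objects with children, databases, and an aggregated summary attribute, plus flat_score and flat_select functions standing in for any flat selection algorithm; it illustrates the traversal, not the paper's exact procedure.

def hierarchical_select(root, query, D, flat_score, flat_select):
    node = root
    while node.children:
        # Score each child category with the flat algorithm over its
        # aggregated content summary, then move to the best-scoring child.
        best = max(node.children, key=lambda c: flat_score(query, c.summary))
        if len(best.databases) < D:
            break                     # too few databases below: stop here
        node = best
    # Finally, run flat selection over the databases under the chosen node.
    return flat_select(query, node.databases, D)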

  21. Experiments: Content Summary Extraction
  Focused Probing compared to Random Sampling:
  • Better vocabulary coverage
  • Better word ranking
  • More efficient for the same sample size: retrieves the same number of documents using fewer queries, since topic detection helps ignore “off-topic” documents
  • More effective for the same sample size: each retrieved document “represents” many unretrieved ones, so “on-topic” sampling yields a better sample
  More results in the paper: 4 types of classifiers (SVM, Ripper, C4.5, Bayes), frequency estimation, different data sets…
  [Figure: actual vs. sample word orderings; the Focused Probing sample ordering (aids, basketball, cancer, heart, …, pneumonia) matches the actual ordering, while the Random Sampling ordering does not]

  22. Experiments: Database Selection
  Data set and workload:
  • 50 real Web databases
  • 50 TREC Web Track queries
  Metric: Precision @ 15
  • For each query, pick 3 databases
  • Retrieve 5 documents from each database
  • Return the 15 documents to the user
  • Mark “relevant” and “irrelevant” documents
  Good database selection algorithms choose databases with relevant documents.
  [Figure: a query routed by database selection to 3 of the many databases, e.g., the Library of Congress (LoC)]
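A minimal sketch of the Precision @ 15 computation described above; the retrieve(db, k) helper and the relevance judgment map are hypothetical.

def precision_at_15(selected_dbs, retrieve, relevance):
    # 3 databases x 5 documents each = 15 documents returned to the user.
    returned = [doc for db in selected_dbs[:3] for doc in retrieve(db, k=5)]
    relevant = sum(1 for doc in returned if relevance.get(doc, False))
    return relevant / len(returned)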

  23. Experiments: Precision of Database Selection Algorithms
  Hierarchical database selection improves precision drastically:
  • Category content summaries are more complete
  • Topic-based database clustering helps
  The best result for centralized search is ~0.35, but centralized search is not an option for the Hidden Web!
  More results in the paper: different flat selection algorithms, more content summary extraction algorithms…

  24. Contributions
  • A technique for extracting content summaries from completely autonomous Hidden-Web databases
  • A technique for estimating frequencies, making it possible to distinguish large from small databases
  • Hierarchical database selection that exploits classification, drastically improving the precision of distributed search
  Content summary extraction is implemented and available for download at: http://sdarts.cs.columbia.edu

  25. Future Work
  • Different techniques for merging content summaries when creating category content summaries
  • The effect of frequency estimation on database selection
  • Different hierarchy-traversal algorithms for hierarchical database selection
