
Distributed Search over the Hidden Web



  1. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. Panagiotis G. Ipeirotis & Luis Gravano

  2. Outline • Introduction • Background • Focused Probing for Content Summary Construction • Exploiting Topic Hierarchies for Database Selection • Experiments: Data and Metrics • Experimental Results • Conclusion and Future Work

  3. Introduction • Search engines create their indexes by spidering or crawling Web pages • Hidden-Web sources store their content in searchable databases

  4. An Example… • Searching the medical database CANCERLIT (www.cancer.gov) with the query [lung and cancer] returns 68,430 matches. • Searching Google with the query [“lung and cancer” site:www.cancer.gov] returns 23 matches.

  5. Meta-searchers • Tools for searching hidden data sources • Rely on statistical content summaries of each database • Perform three main tasks: • Database Selection • Query Translation • Result Merging

  6. This Paper Presents • An algorithm to derive content summaries from “uncooperative” databases • A Database Selection Algorithm that exploits: • Extracted content summaries • Hierarchical classification of the databases

  7. Background: Existing Database Selection Algorithms

  8. Contd.: Database Selection Algorithms • Assumption: query words are independently distributed over database documents. • The answer to a query is the set of all documents that satisfy the Boolean expression. • Deficiency: are the content summaries accurate and up to date?

  9. Uniform Probing for Content Summary Construction • Extracts a document sample from a given database D and computes the frequency of each observed word w in the sample, SampleDF(w).

  10. The Algorithm: • Step 1: Start with an empty content summary, where SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary. • Step 2: Pick a word and send it as a query to database D. • Step 3: Retrieve the top-k documents returned. • Step 4: If the number of retrieved documents exceeds a pre-specified threshold, stop; else continue the sampling process by returning to Step 2.
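
A minimal sketch of this sampling loop in Python. The database.search(word, k) interface returning (doc_id, text) pairs is an assumption of the sketch, not the paper's API; RS-Ord and RS-Lrd differ only in where the probe dictionary comes from:

import random

def uniform_probe_sample(database, dictionary, target_docs=300, top_k=4):
    # Hypothetical interface: database.search(word, k) returns up to k
    # (doc_id, text) pairs for the documents matching the query word.
    sample_df = {}               # SampleDF(w) for each observed word w
    seen = set()                 # documents already in the sample
    while len(seen) < target_docs:             # Step 4: stop at threshold
        probe = random.choice(dictionary)      # Step 2: pick a query word
        for doc_id, text in database.search(probe, k=top_k):  # Step 3
            if doc_id in seen:
                continue
            seen.add(doc_id)
            for w in set(text.lower().split()):
                sample_df[w] = sample_df.get(w, 0) + 1
    return sample_df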

  11. 2 Versions of Algorithm • RS-Ord : RandomSampling-OtherResource • RS-Lrd : RandomSampling-LearnedResource

  12. Deficiencies: • ActualDF(w) for each word w is not revealed • RS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches

  13. Database Classification • Rationale : Queries closely associated with topical categories retrieve mainly documents about that category • Place the database in a classification scheme based on the number of matches

  14. Automation and Hierarchical Classification • Automates classification using queries derived automatically from a rule-based document classifier. • A rule-based classifier is a set of logical rules defining classification decisions, e.g., jordan AND bulls → Sports; hepatitis → Health. • Apply this principle recursively to create a hierarchical classifier.
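
A minimal sketch of such a rule-based classifier (the rule set and names are illustrative only); note that each rule's antecedent doubles as a Boolean query probe such as [jordan AND bulls]:

# Illustrative rules: a conjunction of words implies a category.
RULES = [
    ({"jordan", "bulls"}, "Sports"),
    ({"hepatitis"}, "Health"),
]

def classify(document_words, rules=RULES):
    # A document receives every category whose rule words all occur in it.
    return [category for words, category in rules if words <= document_words]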

  15. Focused Probing • Sends query probes and extracts the number of matches without retrieving any documents. • Calculates two metrics, Coverage(Ci) and Specificity(Ci), for each subcategory Ci. • If Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into category Ci.
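
A sketch of the classification decision, under the simplifying assumption that Coverage(Ci) is the total number of matches for Ci's probes and Specificity(Ci) is Ci's share of the matches across all sibling subcategories (a simplified reading of the two metrics; the paper's exact definitions may differ):

def qualifying_subcategories(match_counts, tc=10, ts=0.3):
    # match_counts: {subcategory Ci: total matches for Ci's query probes}.
    # tc and ts stand in for the thresholds Tc and Ts.
    total = sum(match_counts.values())
    chosen = []
    for ci, coverage in match_counts.items():
        specificity = coverage / total if total else 0.0
        if coverage >= tc and specificity >= ts:
            chosen.append(ci)
    return chosen   # the database is classified into these subcategories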

  16. Authors’ Algorithm • Exploit the topic hierarchy • Produce a document sample that • is topically representative of the database’s contents • yields an accurate content summary efficiently

  17. Content-Summary Construction • Steps of the algorithm: • Query the database using focused probing to: • Retrieve a document sample. • Generate a preliminary content summary • Categorize the database. • Estimate the absolute frequencies of the words retrieved from the database.

  18. Building Content Summaries from Extracted Documents • ActualDF(w): • The actual number of documents in the database that contain word w. • The algorithm knows this number only if [w] is a single-word query probe that was issued to the database. • SampleDF(w): • The number of documents in the extracted sample that contain word w.

  19.–21. Focused Probing for Content Summary Construction (three slides of figures illustrating the focused-probing algorithm; the figures are not preserved in this transcript)
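
Since the figures are missing, here is a hedged sketch of the sampling step they describe; database.search(query, k) returning the reported match count plus the top-k documents is an assumed interface:

def focused_probe_sample(database, probes_by_category, top_k=4):
    # Issue topic-derived probes, record each probe's match count, and
    # add the top-k returned documents to the sample.
    actual_df, sample_df, seen = {}, {}, set()
    for category, probes in probes_by_category.items():
        for probe in probes:
            n_matches, docs = database.search(probe, k=top_k)
            if len(probe.split()) == 1:        # single-word probe:
                actual_df[probe] = n_matches   # its ActualDF is revealed
            for doc_id, text in docs:
                if doc_id in seen:
                    continue
                seen.add(doc_id)
                for w in set(text.lower().split()):
                    sample_df[w] = sample_df.get(w, 0) + 1
    return actual_df, sample_df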

  22. Estimating Absolute Document Frequencies • Use Mandelbrot’s formula P(r + p)^(-B) for the rank–frequency distribution of words to estimate the unknown ActualDF(·) frequencies. • Sort words in descending order of their SampleDF(·) frequencies. • Focus on the words with known ActualDF(·) frequencies. • Find the P, B, and p parameter values that best fit these data. • Estimate ActualDF(wi) for each word wi with unknown ActualDF(wi) as P(ri + p)^(-B), where ri is the rank of wi.
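
A sketch of this fit in Python; using scipy's curve_fit is a choice of the sketch, not necessarily the paper's implementation:

import numpy as np
from scipy.optimize import curve_fit

def estimate_actual_df(sample_df, known_actual_df):
    # Fit Mandelbrot's formula f(r) = P * (r + p) ** (-B) on the words
    # whose ActualDF is known, then predict ActualDF for the rest.
    ranked = sorted(sample_df, key=sample_df.get, reverse=True)
    rank = {w: i + 1 for i, w in enumerate(ranked)}  # rank 1 = most frequent

    def mandelbrot(r, P, p, B):
        return P * (r + p) ** (-B)

    known = [w for w in ranked if w in known_actual_df]
    xs = np.array([rank[w] for w in known], dtype=float)
    ys = np.array([known_actual_df[w] for w in known], dtype=float)
    (P, p, B), _ = curve_fit(mandelbrot, xs, ys,
                             p0=(ys.max(), 1.0, 1.0), bounds=(0.0, np.inf))
    return {w: known_actual_df[w] if w in known_actual_df
            else mandelbrot(rank[w], P, p, B) for w in ranked}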

  23. Example (figure not preserved in the transcript)

  24. Creating Content Summaries for Topic Categories Example: • “metastasis” did not appear in any of the documents sampled from CANCERLIT during probing • Cancer-BACUP, classified under “Cancer,” has a high ActualDFest(metastasis) = 3,569 • Convey this information by associating with category “Cancer” a content summary obtained by merging the summaries of all databases under that category • In the merged summary, ActualDFest(w) is the sum of the estimated document frequencies of w over the databases under the category
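
A minimal sketch of the merge (the summary dictionaries are hypothetical):

def category_summary(database_summaries):
    # Each summary maps word w -> ActualDFest(w) for one database; the
    # category summary sums these frequencies across its databases.
    merged = {}
    for summary in database_summaries:
        for w, df in summary.items():
            merged[w] = merged.get(w, 0) + df
    return merged

This is how “Cancer” would acquire a non-zero ActualDFest(metastasis) from Cancer-BACUP even though CANCERLIT's own sample missed the word.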

  25. Creating Content Summaries for Topic Categories (figure not preserved in the transcript)

  26. Selecting Databases Hierarchically: Algorithm • Inputs: a query Q, the number of target databases K, the top category C • Steps:

  HierSelect(Query Q, Category C, int K)
  1: Use a flat database selection algorithm to assign a score for Q to each subcategory of C
  2: if there is a subcategory of C with a non-zero score
  3:   Pick the subcategory Cj with the highest score
  4:   if NumDBs(Cj) >= K            // Cj has enough databases
  5:     return HierSelect(Q, Cj, K)
  6:   else                          // Cj does not have enough databases
  7:     return DBs(Cj) ∪ FlatSelect(Q, C − Cj, K − NumDBs(Cj))
  8: else                            // no subcategory of C has a non-zero score
  9:   return FlatSelect(Q, C, K)
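
A Python rendering of the pseudocode above; score, flat_select, num_dbs, and dbs are assumed hooks into the flat selection algorithm and the hierarchy's bookkeeping:

def hier_select(query, category, k, score, flat_select, num_dbs, dbs):
    # Comments refer to the numbered lines of HierSelect on slide 26.
    scores = {c: score(query, c) for c in category.subcategories}   # line 1
    positive = {c: s for c, s in scores.items() if s > 0}
    if not positive:                                                # line 8
        return flat_select(query, category, k)                      # line 9
    best = max(positive, key=positive.get)                          # line 3
    if num_dbs(best) >= k:                                          # line 4
        return hier_select(query, best, k,
                           score, flat_select, num_dbs, dbs)        # line 5
    # line 7: keep all databases under best; fill the rest from C - Cj,
    # modeled here with an exclude argument on the flat selector.
    return dbs(best) | flat_select(query, category,
                                   k - num_dbs(best), exclude=best)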

  27. Example: Topic hierarchy for database selection (babe AND ruth, K = 3) (figure not preserved in the transcript)

  28. Experiments: Data and Metrics • Evaluate two main sets of techniques: • 1. Content-summary construction techniques • 2. Database selection techniques • Evaluate the algorithms using two data sets: • Controlled Database Set • Web Database Set

  29. Data Sets • Controlled Database Set • 500,000 newsgroup articles from 54 newsgroups • 81,000 articles to train document classifiers over the 72-node topic hierarchy • 419,000 articles to build the set of controlled databases • The set contained 500 databases ranging in size from 25 to 25,000 documents.

  30. Data Sets • Web Database Set • 50 real web-accessible databases over which there is no control • Databases picked randomly from two directories of hidden-web databases, namely InvisibleWeb and CompletePlanet

  31. Content-Summary Construction • Test variations of the Focused Probing technique against RS-Ord and RS-Lrd. • Focused Probing: • Evaluated configurations with different underlying document classifiers for query-probe creation • Different values for the thresholds Ts and Tc • Varied the specificity threshold Ts from 0 to 1 • Fixed the coverage threshold at Tc = 10.

  32. Database Selection Effectiveness • Underlying database selection algorithm: the hierarchical algorithm • Relies on a “flat” database selection algorithm • Chosen flat algorithms: CORI, bGlOSS • Adapted both algorithms to work with category content summaries.

  33. Database Selection Effectiveness • Content-summary construction • Evaluated how the hierarchical database selection algorithm behaved over content summaries generated by different techniques • Also studied the QPilot strategy • Exploits HTML links to characterize text databases.

  34. Content Summary Quality • Metric: the content summaries’ coverage of the actual database vocabulary • ctf = Σ_{w ∈ Tr} ActualDF(w) / Σ_{w ∈ Td} ActualDF(w) • Tr = the set of terms in the content summary; Td = the complete set of words in the database vocabulary • Results: • Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd • The coverage of the Focused Probing summaries increases for lower values of the threshold Ts
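
The ratio itself is a one-liner once the frequencies are available (a sketch; the dictionaries are hypothetical):

def ctf_ratio(summary_terms, vocabulary, actual_df):
    # Tr = summary_terms, Td = vocabulary; actual_df maps w -> ActualDF(w).
    covered = sum(actual_df[w] for w in summary_terms if w in actual_df)
    return covered / sum(actual_df[w] for w in vocabulary)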

  35. Content Summary Quality • Correlation of word rankings: used the Spearman Rank Correlation Coefficient (SRCC) • to measure how well a content summary orders words by frequency with respect to the actual word-frequency order in the database. • Result: the Focused Probing methods have higher SRCC values than RS-Ord and RS-Lrd.
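
A sketch of the measurement using scipy (an assumed tool choice), computed over the words that the summary and the database share:

from scipy.stats import spearmanr

def summary_srcc(summary_df, actual_df):
    # Rank-correlate the summary's frequency ordering with the true one.
    shared = sorted(set(summary_df) & set(actual_df))
    rho, _ = spearmanr([summary_df[w] for w in shared],
                       [actual_df[w] for w in shared])
    return rho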

  36. Content Summary Quality (figure not preserved in the transcript)

  37. Content Summary Quality - Efficiency • Focused Probing techniques on average retrieve one document per query sent • RS-Lrd retrieves about one document per two queries. • RS-Ord unnecessarily issues many queries that produce no document matches.

  38. Content Summary Quality • Focused Probing techniques produce significantly better-quality summaries than RS-Ord and RS-Lrd • in terms of vocabulary coverage • and word-ranking preservation.

  39. Database Selection Effectiveness • Methodology: • Web set of real web-accessible databases • 50 queries from the Web Track of TREC • Each database selection algorithm picked 3 databases per query • Retrieved the top 5 documents for each query • Human evaluators judged • the relevance of each retrieved document to its query

  40. Database Selection Effectiveness • Measured the precision of a technique for each query q as the fraction of the retrieved documents that the evaluators judged relevant to q. (Figure: average precision of the different database selection algorithms; not preserved in the transcript.)

  41. Database Selection Effectiveness • Analysis: • All the flat selection techniques suffer from incomplete coverage of the underlying probing-generated summaries. • QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.

  42. Hierarchical vs. flat database selection • The hierarchical algorithm using CORI as flat database selection has 50% better precision than CORI for flat selection with the same content summaries. • For bGlOSS, the improvement is 92%. • Reason: Topic hierarchy compensates for incomplete content summaries.

  43. Hierarchical vs. flat database selection • Measured the fraction of times that the hierarchical database selection algorithm picked, for a query, a database • that produced matches for the query • and was given a zero score by the flat database selection algorithm of choice.

  44. Conclusion • Presented a novel and efficient method for constructing content summaries of web-accessible text databases • Presented a hierarchical database selection algorithm that exploits the database content summaries • The algorithm exploits the generated classification to produce accurate results even with imperfect content summaries.

  45. Future Work • Alternative hierarchy-traversal techniques: for example, “route” queries to multiple categories when appropriate • Examine the effect of absolute-frequency estimation on database selection • Alternative methods for creating content summaries.
