
Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection



Presentation Transcript


  1. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection

  2. Agenda • The Hidden Web • Database selection algorithms • An algorithm that extracts a document sample • Database classification • An algorithm for content summary construction • Estimating document frequencies: ActualDF(·) frequencies • Database selection algorithms using categorization and content summaries • Experiments

  3. The Hidden Web • Also known as the Deep Web • Most of the Web's information is buried far down on dynamically generated sites • Standard search engines never find it.

  4. The Hidden Web • Search engines create their indexes by spidering or crawling Web pages. • To be discovered, a page must be static and linked to other pages. • Search engines cannot retrieve content in the Hidden Web.

  5. The Hidden Web • Those pages do not exist until they are created dynamically as the result of a specific search. • Hidden Web sources store their content in searchable databases. • Those databases only produce results dynamically, in response to a direct request. • A direct query is a laborious, “one at a time” way to search.

  6. The Size of the Hidden Web According to a study based on data collected between March 13 and 30, 2000: • Public information on the hidden Web is currently 400 to 550 times larger than the commonly defined World Wide Web. • The total quality content of the hidden Web is 1,000 to 2,000 times greater than that of the surface Web. • The hidden Web is the fastest-growing category of new information on the Internet.

  7. Putting those Findings in Perspective • The search engines with the largest indexes (Google, Northern Light, etc.) cover at most 16% of the surface Web. • Since such engines miss the hidden Web, Internet searchers are searching only 0.03% of the pages available to them today.

  8. 10-yr. Growth Trends in Cumulative Original Information Content

  9. The main Issues • An algorithm to derive content summaries from “uncooperative” databases by using “focused query probes” • A novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases.

  10. Some Techniques We Will Discuss • A document sampling technique for text databases that results in higher quality database content summaries • A technique to estimate the absolute document frequencies of the words in the content summaries

  11. Some Techniques We Will Discuss (2) • A database selection algorithm that proceeds hierarchically over a topical classification scheme. • Experimental evaluation of the new algorithms using both “controlled” databases and 50 real web-accessible databases.

  12. An Example • Searching the medical database CANCERLIT (www.cancer.gov) with the query [lung AND cancer] returns 68,430 matches. • Searching Google with the query [“lung” AND “cancer” site:www.cancer.gov] returns 23 matches. • None of the pages returned corresponds to the database documents.

  13. An Example (2) • These results show that the valuable CANCERLIT content is not indexed by this search engine.

  14. Metasearchers • Provide one-stop access to the information in text databases. • A metasearcher performs three main tasks: • After receiving a query, it finds the best databases to evaluate the query (database selection) • It translates the query into a suitable form for each database (query translation) • It retrieves and merges the results from the different databases (result merging) and returns them to the user.

  15. Database Selection Algorithms • Based on statistics that characterize each database’s content • These statistics are referred to as content summaries • Content summaries usually include the document frequencies of the words that appear in the database

  16. Database Selection Algorithms (2) • Content summaries provide sufficient information to the database selection component of a metasearcher to decide which databases are the most promising to evaluate a given query.

  17. Database Selection Algorithms (3) • Try to find the best databases to evaluate a given query • Use the document frequency of each word: the number of different documents that contain the word • Use NumDocs: the number of documents stored in the database

  18. An Example – bGlOSS (Boolean Glossary-of-Servers Server), a Flat Selection Algorithm • Documents are represented as words with position information. • Queries are expressions composed of words and connectives such as “and”, “or”, “not”, and proximity operations such as “within k words of”. • The answer to a query is the set of all the documents that satisfy the Boolean expression.

  19. An Example – bGlOSS (2) • Given the query [breast AND cancer], bGlOSS estimates that |C| · (df(breast)/|C|) · (df(cancer)/|C|) ≈ 74,569 documents match in the CANCERLIT database • |C| – the number of documents in the database • df(·) – the number of documents that contain a given word
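
The estimate is just the independence assumption applied to the database size. A minimal Python sketch of this estimator; the frequency values in the usage line are illustrative, not the actual CANCERLIT statistics:

```python
def bgloss_estimate(db_size: int, *dfs: int) -> float:
    """bGlOSS estimate for a conjunctive query [w1 AND ... AND wn]:
    |C| * (df(w1)/|C|) * ... * (df(wn)/|C|), assuming the words
    appear independently across documents."""
    estimate = float(db_size)
    for df in dfs:
        estimate *= df / db_size
    return estimate

# Illustrative numbers only (not the real CANCERLIT frequencies):
print(bgloss_estimate(200_000, 120_000, 130_000))  # -> 78000.0
```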

  20. Supplying the Content Summary • The metasearcher relies on each database to supply its content summary • If the databases do not report any detailed metadata about their contents, the metasearcher must rely on manually generated descriptions of the database contents • This approach does not scale to the thousands of text databases on the web

  21. STARTS – Stanford Protocol Proposal for Internet Retrieval and Search • An emerging protocol for Internet retrieval and search that facilitates the task of querying multiple document sources.

  22. STARTS – Stanford Protocol Proposal for Internet Retrieval and Search (2) • The goal – to facilitate the three main tasks a metasearcher performs: • Choosing the best sources to evaluate a query • Evaluating the query at these sources • Merging the query results from these sources • STARTS mainly deals with what information needs to be exchanged between sources and metasearchers

  23. An Algorithm that Extracts a Document Sample from a Given Database • SampleDF(w): the frequency of each observed word w in the extracted sample

  24. An Algorithm that Extracts a Document Sample from a Given Database (2) 1. Start with an empty content summary, where SampleDF(w) = 0 for each word w, and a general, comprehensive word dictionary. 2. Pick a word and send it as a query to database D. 3. Retrieve the top-k documents returned. 4. If the number of retrieved documents exceeds a prespecified threshold, stop; otherwise continue the sampling process by returning to step 2.

  25. Two Versions of the Algorithm • RS-Ord (RandomSampling-OtherResource): picks a random word from the dictionary in step 2. • RS-Lrd (RandomSampling-LearnedResource): selects the next query from among the words that have already been discovered during sampling. A sketch of both variants follows.
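
A minimal sketch of the sampling loop covering both variants. Here query_db is a stand-in for the database's search interface, assumed to return the matching documents (each as a collection of words) for a one-word query; a real database would be queried over HTTP and its result pages parsed.

```python
import random

def sample_database(query_db, dictionary, k=4, doc_threshold=300,
                    strategy="RS-Ord"):
    """Extract a document sample and SampleDF(w) counts from a database.
    RS-Ord picks each probe word at random from the full dictionary;
    RS-Lrd picks it from the words already seen during sampling."""
    sample_df = {}                 # SampleDF(w); unseen words count as 0
    seen_words = set()
    retrieved = 0
    while retrieved < doc_threshold:        # stop past the threshold
        pool = (list(seen_words)
                if strategy == "RS-Lrd" and seen_words else dictionary)
        word = random.choice(pool)          # step 2: pick a probe word
        docs = query_db(word)[:k]           # step 3: keep the top-k docs
        retrieved += len(docs)
        for doc in docs:
            for w in set(doc):              # count each word once per doc
                sample_df[w] = sample_df.get(w, 0) + 1
                seen_words.add(w)
    return sample_df
```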

  26. More about the Algorithm • The actual frequency ActualDF(w) for each word w is not revealed by this process. • The computed document frequencies nevertheless contain information about the relative ordering of the words in the database.

  27. More about the Algorithm (2) • Two databases with the same focus but significantly different sizes might be assigned similar content summaries. • A word picked at random from the dictionary is likely not to occur in any document of an arbitrary database.

  28. Database Classification • One way to characterize the contents of a text database is to classify it in a hierarchy of topics, according to the type of documents that it contains.

  29. Database Classification (2) • A method to automate the classification of web-accessible databases, based on the principle of “focused probing” • A rule-based document classifier – a set of logical rules defining classification decisions

  30. Hierarchical Classification • Categories can be further divided into subcategories, resulting in multiple levels of classifiers, one for each internal node of a classification hierarchy • It is possible to create a hierarchical classifier that recursively divides the space into successively smaller topics.

  31. Classifying a Database An algorithm that • uses a hierarchical scheme, • automatically maps rule-based document classifiers into queries, • which are then used to probe and classify text databases (a sketch of the rule-to-query mapping follows).
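
A sketch of that mapping under assumed data structures: each rule pairs a conjunction of words with a category, the conjunction becomes a boolean query probe, and the match counts the database reports are accumulated per category. The rules and the count_matches interface below are hypothetical.

```python
# Hypothetical classification rules: (antecedent words, category).
RULES = [
    (("breast", "cancer"), "Health"),
    (("soccer",), "Sports"),
]

def rule_to_query(antecedent):
    """A rule's antecedent becomes a boolean query probe."""
    return " AND ".join(antecedent)

def probe_category_counts(count_matches, rules=RULES):
    """count_matches(query) stands in for the database's search
    interface and returns the reported number of matches.
    The per-category totals drive the classification decision."""
    totals = {}
    for antecedent, category in rules:
        query = rule_to_query(antecedent)
        totals[category] = totals.get(category, 0) + count_matches(query)
    return totals
```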

  32. Classifying a Database (2) The algorithm provides a way to zoom in • on the topics that are most representative of a given database’s contents, • which we can then exploit for accurate and efficient content summary construction.

  33. Focused Probing for Content Summary Construction – Algorithm The algorithm consists of two main steps: • Query the database using focused probing in order to: retrieve a document sample, generate a preliminary content summary, and categorize the database. • Estimate the absolute frequencies of the words retrieved from the database.

  34. Generating a content summary for a database using focused query probing

  35. Generating a content summary for a database using focused query probing(2)

  36. Generating a content summary for a database using focused query probing(3)

  37. Building Content Summaries from Extracted Documents • ActualDF(w): • the actual number of documents in the database that contain word w. • The algorithm knows this number only if [w] is a single-word query probe that was issued to the database. • SampleDF(w): • the number of documents in the extracted sample that contain word w.

  38. Building Content Summaries from Extracted Documents (2) • Retrieve the top-k documents returned by each query. • Compute SampleDF(w). • If a word w appears in document samples retrieved during later phases of the algorithm, all its SampleDF(w) counts are added together. • Keep track of the number of matches produced by each single-word query [w] – the ActualDF(w) frequency.
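
A bookkeeping sketch for a single probe, assuming the database reports its total match count alongside the top-k results:

```python
def record_probe(sample_df, actual_df, query, match_count, docs):
    """Update the running summaries after one query probe.
    query: the probe string; match_count: the number of matches the
    database reports; docs: the top-k retrieved documents, each a
    collection of words."""
    words = query.split()
    if len(words) == 1:                    # single-word probe [w]:
        actual_df[words[0]] = match_count  # ActualDF(w) is known exactly
    for doc in docs:
        for w in set(doc):                 # count each word once per doc
            sample_df[w] = sample_df.get(w, 0) + 1  # accumulate SampleDF(w)
```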

  39. Estimating Absolute Document Frequencies • Exploit the SampleDF(·) frequencies derived from the document sample to rank all observed words from most frequent to least frequent. • Exploit the ActualDF(·) frequencies derived from one-word query probes to potentially boost the document frequencies of “nearby” words w for which we only know SampleDF(w) but not ActualDF(w).

  40. Focused Probing Technique for Content Summary Construction – Summary The technique: • Estimates the absolute document frequency of the words in a database. • Automatically classifies the database in a hierarchical classification scheme along the way.

  41. Estimating Unknown ActualDF(·) Frequencies • After probing we have: • The rank of all observed words in the sample documents retrieved. • The actual frequencies of some of those words in the database.

  42. Estimating Unknown ActualDF(·) Frequencies (2) • A relationship between the rank r and the frequency f of a word: • f = P(r + p)^(-B) • P, B, and p are parameters of the specific document collection

  43. Estimating Unknown ActualDF(·) Frequencies (3) • Step 1: Sort words in descending order of their SampleDF(·) frequencies to determine the rank ri of each word wi. • Step 2: Focus on words with known ActualDF(·) frequencies. • Step 3: Use the SampleDF-based ranks and the ActualDF frequencies to find the P, B, and p parameter values that best fit the data.

  44. Estimating Unknown ActualDF(·) Frequencies (4) • Step 4: Estimate ActualDF(wi) for all words wi with unknown ActualDF(wi) as P(ri + p)^(-B), where ri is the rank of word wi as computed in Step 1. A curve-fitting sketch follows.
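
A sketch of Steps 1-4 using SciPy's curve_fit to recover P, p, and B from the words with known frequencies; real collections may need parameter bounds or a log-space fit for numerical stability.

```python
import numpy as np
from scipy.optimize import curve_fit

def freq_rank_law(r, P, p, B):
    """Frequency-rank relationship f = P * (r + p) ** (-B)."""
    return P * (r + p) ** (-B)

def estimate_frequencies(sample_df, actual_df):
    """Fit P, p, B on words with known ActualDF, then predict the rest
    from their SampleDF-based rank."""
    # Step 1: rank words by descending SampleDF (rank 1 = most frequent).
    ranked = sorted(sample_df, key=sample_df.get, reverse=True)
    rank = {w: i + 1 for i, w in enumerate(ranked)}
    # Steps 2-3: fit the curve on words with known actual frequencies.
    known = [w for w in ranked if w in actual_df]
    r = np.array([rank[w] for w in known], dtype=float)
    f = np.array([actual_df[w] for w in known], dtype=float)
    (P, p, B), _ = curve_fit(freq_rank_law, r, f,
                             p0=(f.max(), 1.0, 1.0), maxfev=10_000)
    # Step 4: estimate ActualDF for the remaining words from their rank.
    return {w: actual_df.get(w, freq_rank_law(rank[w], P, p, B))
            for w in ranked}
```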

  45. A Database Selection Algorithm that Exploits the Database Categorization and Content Summaries • “Propagate” the database content summaries to the categories of the hierarchical classification scheme • Use the content summaries of categories and databases to perform database selection hierarchically, by zooming in on the most relevant portions of the topic hierarchy

  46. Creating Content Summaries for Topic Categories • Assumption – databases classified under similar topics tend to have similar vocabularies. • Problem – database selection algorithms might produce inaccurate conclusions for queries with one or more words missing from the relevant content summaries.

  47. Creating Content Summaries for Topic Categories (2) • Solution – • Associate content summaries with the categories of the topic hierarchy used by the probing algorithm. • Treat each category as a large “database” and perform database selection hierarchically (a sketch of the summary propagation follows).
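
A minimal sketch of that propagation, under the assumption that a category's summary adds up the (estimated) document frequencies of every database classified under it:

```python
def category_summary(db_summaries):
    """db_summaries: one {word: estimated document frequency} dict per
    database under the category. The category is treated as one large
    'database' whose frequencies are the sums of its members'."""
    summary = {}
    for db in db_summaries:
        for word, freq in db.items():
            summary[word] = summary.get(word, 0) + freq
    return summary
```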

  48. Selecting Databases Hierarchically • The algorithm chooses the best databases for a query. • By exploiting the database categorization, this hierarchical algorithm manages to compensate for the necessarily incomplete database content summaries produced by query probing.

  49. Selecting the K most specific databases for a query hierarchically
  HierSelect(Query Q, Category C, int K)
  1: Use a database selection algorithm to assign a score for Q to each subcategory of C
  2: if there is a subcategory of C with a non-zero score
  3:    Pick the subcategory Cj with the highest score

  50. Selecting the K most specific databases for a query hierarchically (2)
  4:    if NumDBs(Cj) >= K // Cj has enough databases
  5:       return HierSelect(Q, Cj, K)
  6:    else // Cj does not have enough databases
  7:       return DBs(Cj) ∪ FlatSelect(Q, C − Cj, K − NumDBs(Cj))
  8: else // no subcategory of C has a non-zero score
  9:    return FlatSelect(Q, C, K)
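
A runnable sketch of HierSelect. Here score and flat_select stand in for the underlying flat database selection algorithm (e.g. a bGlOSS-style scorer), and each category object is assumed to expose its subcategories plus the full list of databases classified under it.

```python
def hier_select(query, category, k, score, flat_select):
    """Select the k most specific databases for `query` by descending
    the topic hierarchy while the best subcategory still holds at least
    k databases, then falling back to flat selection."""
    scores = {c: score(query, c) for c in category.subcategories}
    positive = {c: s for c, s in scores.items() if s > 0}
    if positive:                                     # lines 2-3
        best = max(positive, key=positive.get)
        if len(best.databases) >= k:                 # lines 4-5
            return hier_select(query, best, k, score, flat_select)
        # lines 6-7: take all of Cj, fill the rest from C - Cj flatly
        rest = [db for db in category.databases
                if db not in best.databases]
        return best.databases + flat_select(query, rest,
                                            k - len(best.databases))
    return flat_select(query, category.databases, k)  # lines 8-9
```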
