
Metasearch


Presentation Transcript


  1. Metasearch Thanks to Eric Glover

  2. Outline
  • The general search tool
  • Three C’s of Metasearch and other important issues
  • Metasearch engine architecture
  • Current metasearch projects
  • Advanced metasearch

  3. A generic search tool
  • Query entered into an interface (see the sketch below)
  • Applied to a database of content
  • Results are scored/ordered
  • Displayed through the interface
  [Diagram: a single search tool with an Interface, a Database, and an Ordering Policy]
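To make the pipeline concrete, here is a minimal Python sketch of such a generic search tool. The in-memory database and the term-overlap scoring function are illustrative stand-ins, not any particular engine's ordering policy.

```python
# Minimal sketch of a generic search tool: a query is applied to a database,
# results are scored/ordered, and the top results are displayed.
# The in-memory database and term-overlap scoring are illustrative stand-ins.

def score(query: str, document: str) -> float:
    """Score a document by the fraction of query terms it contains."""
    terms = query.lower().split()
    doc_terms = set(document.lower().split())
    return sum(t in doc_terms for t in terms) / len(terms)

def search(query: str, database: list[str], limit: int = 10) -> list[str]:
    """Apply the query to the database, score each document, and order results."""
    ranked = sorted(database, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:limit]

if __name__ == "__main__":
    db = ["metasearch engines combine results from many services",
          "coverage of a single web search engine",
          "a recipe for lentil soup"]
    for hit in search("metasearch coverage", db):
        print(hit)
```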

  4. Why do metasearch?
  [Diagram: three separate search tools (1, 2, 3), each with its own Interface, Database, and Ordering Policy]

  5. Definition
  The word meta comes from the Greek meaning “situated beyond, transcending.” A metasearch tool is a tool which “transcends” the offerings of a single service by allowing you to access many different Web search tools from one site. More specifically, a metasearch tool permits you to launch queries to multiple search services via a common interface. (From Carleton College Library)
  Note: Metasearch is not necessarily limited to the Web.

  6. The three C’s of metasearch
  • Coverage - How much of the total is accessible from a single interface
  • Consistency - One interface, resistant to single search service failures
  • Convenience - One stop shop
  [Diagram: a single Interface in front of Service1, Service2, and Service3]

  7. Coverage
  • Coverage refers to the total accessible content, specifically what percentage of the total content is reachable
  • According to the journal Nature in July 1999 (Lawrence and Giles 99):
    • Web search engines collectively covered 42% of the web (estimated at 800 million total indexable pages)
    • The most covered by any single engine was only 16%
  • Some search services have proprietary content, accessible only through their interface, e.g., an auction site
  • Search services have different rates of “refresh”
  • Special-purpose search engines are more “up to date” on a special topic

  8. Consistency and Convenience
  • Consistency
    • One interface
    • User can access multiple DIFFERENT search services using the same interface
    • Will work even if one search service goes down, or is inconsistent
  • Convenience
    • One stop shop
    • User need not know about all possible sources
    • Metasearch system owners will automatically add new sources as needed
    • One interface improves convenience (as well as consistency)
    • User does not have to manually change their query for each source

  9. Metasearch issues
  • What do you search?
    • Source selection
  • How do you search?
    • Query translation -- syntax/semantics
    • Query submission
    • Use of specialized knowledge or actions
  • How to process results
    • Fusion of results
    • Actions to improve quality
  • How to evaluate
    • Did the individual query succeed?
    • Were the correct decisions made?
    • Does this architecture work?
  • Decisions after search
    • Search again, or do something differently

  10. Metasearch issues
  • Performance of underlying services
    • Time, reliability, consistency
  • Result quality of a single search service
    • How “relevant” are results on average
  • Duplicate detection
  • Freshness -- how often are results updated
    • Update rate, dead links, changed content
  • Probability of finding relevant results
    • For the given query, for each potential source
    • For the given need category
    • GLOSS, and similar models
  • How to evaluate
    • Is it better than some other service?
    • Especially important with source selection
    • Feedback for improving the fusion policy
    • Learning for personalization

  11. www.beaucoup.com

  12. Some metasearch engines
  • Ask Jeeves -- Natural language; combines local and outside content with a very simple interface
  • Sherlock -- Integrates web and local searching
  • Metacrawler -- Early web metasearch engine, some formalizations
  • SavvySearch -- Research on various methods of source selection
  • ProFusion -- Attempted to predict the subject of the query and considered the expected performance of search engines, both for relevance and for time
  • Inquirus -- Content-based metasearch engine
  • Inquirus2 -- Preference-based metasearch engine

  13. www.ixquick.com

  14. Architecture
  [Diagram: the User Interface passes the Query to a Dispatcher, which submits it to Service1, Service2, and Service3; a Result Retriever collects their responses, and a Fusion Policy merges them into the Results shown in the User Interface]

  15. Architecture -- Dispatcher
  • Query translation (see the sketch below)
    • Each search service has a different interface (for queries)
    • Preserve semantics while converting the syntax
    • Could result in loss of expressiveness
  • Query submission
    • Sends the query to the service
  • Source selection
    • Choose sources to be queried
  • Some systems use wrapper code, or agents, as their dispatch mechanism
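As a rough illustration of the query-translation step, the sketch below rewrites one query into per-service syntax. The service names and syntax rules are assumptions for illustration, not the query languages of real engines.

```python
# Sketch of the dispatcher's query-translation step: one query is rewritten
# into each service's syntax before submission. The service names and syntax
# rules below are illustrative assumptions, not real engines' query languages.

from dataclasses import dataclass

@dataclass
class Service:
    name: str
    and_operator: str    # how the service expresses conjunction ("" = implicit AND)
    phrase_quotes: bool  # whether the service supports quoted phrases

def translate(terms: list[str], phrase: str | None, service: Service) -> str:
    """Preserve the query's semantics while converting its syntax."""
    parts = list(terms)
    if phrase is not None:
        # If phrases are unsupported, fall back to plain terms (loss of expressiveness).
        parts.append(f'"{phrase}"' if service.phrase_quotes else phrase)
    separator = f" {service.and_operator} " if service.and_operator else " "
    return separator.join(parts)

services = [Service("ServiceA", "AND", True),
            Service("ServiceB", "+", False),
            Service("ServiceC", "", True)]
for s in services:
    print(s.name, "->", translate(["metasearch"], "source selection", s))
```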

  16. Result processor
  • Accept results from a search service
  • Parse results and extract the relevant information, e.g., title, URL, etc. (see the sketch below)
  • Can request more results (feedback to the dispatcher)
  • Advanced processors could get more information about the document, e.g., by using special-purpose tools
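A hedged sketch of a result processor follows. The HTML pattern assumes a made-up results-page format; in practice each service needs its own parser (wrapper).

```python
# Sketch of a result processor: parse a service's results page into
# (title, URL, summary) records. The HTML pattern assumes a made-up result
# format; in practice each service needs its own parser (wrapper).

import re
from typing import NamedTuple

class Result(NamedTuple):
    title: str
    url: str
    summary: str

RESULT_PATTERN = re.compile(
    r'<a class="result" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>'
    r'\s*<p class="snippet">(?P<summary>[^<]+)</p>',
    re.IGNORECASE,
)

def parse_results(html: str) -> list[Result]:
    """Extract title, URL, and summary from the (hypothetical) results page."""
    return [Result(m["title"], m["url"], m["summary"])
            for m in RESULT_PATTERN.finditer(html)]
```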

  17. Result Fusion
  • How to merge results in a meaningful manner?
  [Diagram: three ranked result lists to be merged into one: A = [a1, a2, a3], B = [b1, b2, b3], C = [c1, c2, c3]]

  18. Result Fusion
  • Not comparing apples to apples
    • Incomplete information: we only have a summary
    • Each search service has its own ranking policy
    • Which is better, AltaVista #2 or Google #3?
    • Summaries and titles are not consistent between search services
    • Don’t have complete information
  • Questions/issues
    • How much do you trust regular search engine ranks?
    • Could AltaVista #3 be better than AltaVista #2?
    • Is one search engine always better than another?
    • How do you integrate between search engines?
    • What about results returned by more than one search service?

  19. Fusion policy - a typical approach
  • First, determine the “score” on a fixed range for each result from each search engine
    • In the early days most search engines returned their scores
    • The score could be a function of the rank, or occasionally based on the keywords in the title/summary/URL
  • Second, assign a weight for each search engine
    • Could be based on the predicted subject, stated preferences, or special knowledge about a particular search engine
  • Example (see the sketch below):
    • Service1: A1=1.000, A2=1.000, A3=0.95, A4=0.5
    • Service2: B1=0.95, B2=0.95, B3=0.89, B4=0.8
    • With W1=0.9 and W2=1.0, the final ordering would be: [B1, B2, A1, A2, B3, A3, B4, A4]
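A minimal sketch of this score-and-weight fusion, reproducing the example numbers above; the data structures are illustrative only.

```python
# A minimal sketch of the score-and-weight fusion policy described above:
# each result's per-service score is multiplied by the service's weight,
# and the merged list is sorted by the weighted score.

def fuse(results_by_service: dict[str, list[tuple[str, float]]],
         weights: dict[str, float]) -> list[str]:
    merged = [(doc, score * weights[service])
              for service, results in results_by_service.items()
              for doc, score in results]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in merged]

results = {
    "Service1": [("A1", 1.000), ("A2", 1.000), ("A3", 0.95), ("A4", 0.5)],
    "Service2": [("B1", 0.95), ("B2", 0.95), ("B3", 0.89), ("B4", 0.8)],
}
weights = {"Service1": 0.9, "Service2": 1.0}
print(fuse(results, weights))
# ['B1', 'B2', 'A1', 'A2', 'B3', 'A3', 'B4', 'A4']
```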

  20. Source Selection
  • Choosing which services to query (see the sketch below)
  • GLOSS -- Choose the databases most likely to contain “relevant” materials
  • SavvySearch (old) -- Choose the sources most likely to have the most valuable results, based on past responses
  • SavvySearch (new) -- Choose the sources most appropriate for the user’s category
  • ProFusion -- User chooses:
    • 1: Fastest sources
    • 2: Sources most likely to have “relevant” results, based on the predicted subject
    • 3: Explicit user selection
  • Watson -- Choose both general-purpose sources and the most “appropriate” special-purpose sources
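As a rough illustration (not the method of any specific system above), the sketch below ranks candidate sources by an estimated usefulness for the query's predicted category, with a penalty for slow services; the categories, estimates, and penalty weight are invented.

```python
# Illustrative source-selection sketch (not any specific system above): rank
# candidate services by an estimated usefulness for the query's predicted
# category, minus a penalty for slow services. Categories, estimates, and the
# penalty weight are invented for this example.

def select_sources(category: str,
                   relevance_estimates: dict[str, dict[str, float]],
                   avg_response_time: dict[str, float],
                   k: int = 3,
                   time_penalty: float = 0.05) -> list[str]:
    def usefulness(service: str) -> float:
        return (relevance_estimates[service].get(category, 0.0)
                - time_penalty * avg_response_time[service])
    return sorted(relevance_estimates, key=usefulness, reverse=True)[:k]

estimates = {
    "GeneralEngine":  {"science": 0.6, "shopping": 0.5},
    "ScienceEngine":  {"science": 0.9, "shopping": 0.1},
    "ShoppingEngine": {"science": 0.1, "shopping": 0.9},
}
times = {"GeneralEngine": 1.0, "ScienceEngine": 2.0, "ShoppingEngine": 1.5}
print(select_sources("science", estimates, times, k=2))
# ['ScienceEngine', 'GeneralEngine']
```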

  21. MetaCrawler
  • Used user result clicks -- implicit vs. explicit feedback
    • Not all pages clicked are relevant
    • Assumed pages not clicked were not relevant
  • Parameters examined (sketched below)
    • Comprehensiveness of individual search engines -- measured as the Unique Document Percentage (UDP), which relates to coverage
      • As expected, overlap was low (considering the first ten documents only)
    • Relative contribution of each search engine -- the Viewed Document Share (VDS)
      • As expected, all services used contributed to the VDS: the maximum of the eight search services was 23% and the minimum 4%, with four of them contributing 15% or more
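The sketch below computes the two metrics named on this slide. The definitions are paraphrased from the slide text, so treat them as approximations of MetaCrawler's actual measures.

```python
# Sketch of the two metrics named above. The definitions are paraphrased from
# the slide: UDP = share of an engine's returned documents returned by no other
# engine; VDS = share of all user-viewed (clicked) documents contributed by the
# engine. Treat them as approximations of MetaCrawler's actual measures.

def unique_document_percentage(results_by_engine: dict[str, set[str]]) -> dict[str, float]:
    udp = {}
    for engine, docs in results_by_engine.items():
        others = set().union(*(d for e, d in results_by_engine.items() if e != engine))
        udp[engine] = len(docs - others) / len(docs) if docs else 0.0
    return udp

def viewed_document_share(clicks_by_engine: dict[str, int]) -> dict[str, float]:
    total = sum(clicks_by_engine.values())
    return {e: c / total for e, c in clicks_by_engine.items()} if total else {}
```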

  22. ProFusion
  • Focus was primarily on source selection
  • ProFusion considered:
    • Performance (time)
    • Ability of sources to locate relevant results, via query subject prediction
  • Design (see the sketch below)
    • A set of 13 categories and a dictionary of 4000 terms used to predict the subject
    • For each category, each of the six search engines is “tested” and scored based on the number of relevant results retrieved
    • Individual search engine “scores” are used to produce a rank ordering by subject, and to fuse results
  • Parameters examined
    • Human judgements of some queries
    • Every search engine was different
    • ProFusion demonstrated improvements when “auto-pick” was used
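A minimal sketch of dictionary-based subject prediction in the spirit of this slide; the toy term dictionary and categories are invented for illustration (the real system used 13 categories and roughly 4000 terms).

```python
# Minimal sketch of dictionary-based subject prediction in the spirit of
# ProFusion: each query term votes for a category and the most-voted category
# wins. The toy dictionary is invented; the real system used 13 categories and
# roughly 4000 terms.

TERM_TO_CATEGORY = {
    "protein": "science", "genome": "science",
    "stocks": "business", "earnings": "business",
}

def predict_category(query: str) -> str | None:
    counts: dict[str, int] = {}
    for term in query.lower().split():
        category = TERM_TO_CATEGORY.get(term)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(predict_category("genome protein folding"))  # science
```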

  23. SavvySearch (early work)
  • Similar to ProFusion: choose sources based on the query
  • Assign “scores” for each query term based on previous query results
    • Forms a t × n matrix (terms by search engines) called a meta-index (sketched below)
    • The score is based on performance for each term
    • Two “events”: No Results and Visits
    • Scores are adjusted for the number of query terms
    • Response time is also stored for each search engine
  • Search engines are chosen based on the query terms and past performance
    • System load determines the total number queried
  • Evaluated via a pilot study
    • Compared various variations of the sources chosen and their rank
    • As predicted, using the meta-index method was better than random
    • Also examined improvements over time in the no-result count
  • In the new version, the user chooses a “category” and the appropriate services are used
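A sketch of a SavvySearch-style meta-index follows; the +/-1 update amounts and the per-term normalization are assumptions, not the published algorithm.

```python
# Sketch of a SavvySearch-style meta-index: a terms-by-engines matrix of scores
# that is increased when results are visited and decreased when a query returns
# nothing. The +/-1 updates and per-term normalization are assumptions, not the
# published algorithm.

from collections import defaultdict

class MetaIndex:
    def __init__(self):
        # meta_index[term][engine] -> accumulated score
        self.meta_index = defaultdict(lambda: defaultdict(float))

    def record_no_results(self, terms, engine):
        for t in terms:
            self.meta_index[t][engine] -= 1.0

    def record_visit(self, terms, engine):
        for t in terms:
            self.meta_index[t][engine] += 1.0

    def score(self, terms, engine):
        # Adjust for the number of query terms so long and short queries compare.
        return sum(self.meta_index[t][engine] for t in terms) / len(terms)

    def rank_engines(self, terms, engines, k=2):
        # Choose the k engines with the best past performance for these terms.
        return sorted(engines, key=lambda e: self.score(terms, e), reverse=True)[:k]
```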

  24. Advanced metasearch
  • Improvements
    • Ordering policy versus a fusion policy -- Inquirus, and some personal search agents
  • Using complete information -- download the document before scoring
    • All documents are scored consistently regardless of source
    • Allows for improved relevance judgements
    • Can hurt performance
  • Query modification -- Inquirus 2, Watson, others? (see the sketch below)
    • User queries are modified to improve the ability to find “category-specific” results
    • Query modification allows general-purpose search engines to be used for special needs
  • Need-based scoring -- Inquirus 2, learning systems
    • Document scores are based on more than the query
    • Can use past user history, or other factors such as subject
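As an illustration of category-specific query modification, the sketch below appends expansion terms chosen for the desired result category; the expansion terms are invented, not the actual Inquirus 2 rules.

```python
# Illustration of category-specific query modification: the user's query is
# expanded with terms that steer a general-purpose engine toward the desired
# kind of result. The expansion terms are invented, not the actual
# Inquirus 2 rules.

CATEGORY_EXPANSIONS = {
    "research paper": ['"abstract"', '"references"'],
    "product review": ["review", "rating"],
}

def modify_query(query: str, category: str) -> str:
    return " ".join([query] + CATEGORY_EXPANSIONS.get(category, []))

print(modify_query("support vector machines", "research paper"))
# support vector machines "abstract" "references"
```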

  25. Design/Research areas
  • Source selection
    • With and without complete information
    • Learning interfaces and predicting the contents of sources
  • Intelligent fusion
    • Without complete information
    • Considering the user’s preferences
  • Resource efficiency
    • Knowing how to “plan” a search
  • User interfaces
    • How to minimize loss of expressiveness
    • How to preserve the capabilities of select services
    • How to hide slow performance

  26. Business issues for metasearch
  • How does one use others’ resources and not get blocked?
  • Skimming!
