
Choosing and Using the Best Metas

Choosing and Using the Best Metas: Hyper-searching the Web. Michael Hunter, Reference Librarian, Hobart and William Smith Colleges, for Rochester Regional Library Council Member Libraries’ Staff. Sponsored by the Rochester Regional Library Council.


Presentation Transcript


  1. Choosing and Using the Best Metas: Hyper-searching the Web • Michael Hunter, Reference Librarian, Hobart and William Smith Colleges • For Rochester Regional Library Council Member Libraries’ Staff • Sponsored by the Rochester Regional Library Council • Supported by Library Services and Technology Act (LSTA) and/or Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library • 2002

  2. For Today … • Metas: History and Functions • Search and Retrieval Issues • Major Players in 2003 • Clustering Technology • More Good Metas • Web Search Agents • Evaluating Metasearch Services

  3. Metasearch defined . . . • Group of search engines, subject directories and/or databases made searchable through a common interface. • Results may or may not follow the original source’s rankings • Today our focus is free metaengines using subject directories (Yahoo, LII, OD) and crawler-based engines as sources (Google, FAST, Teoma) • We will NOT examine specialized or Deep Web metas

  4. A GOOD Meta will . . . • Re-format queries to be compatible with search syntax of each source • Enable searchers to use advanced features (when the sources support them) • Indicate overlapping results without repeating them • Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc. • Use only sources with unique databases
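
The first point above, re-formatting a query for each source, can be sketched as follows. This is only an illustration: the source names (engine_a, engine_b, engine_c) and their syntax rules are hypothetical and do not describe any real engine.

```python
# Hypothetical syntax rules per source (assumptions for illustration only).
SOURCE_SYNTAX = {
    "engine_a": {"joiner": " AND ", "prefix": "", "phrases": True},
    "engine_b": {"joiner": " ", "prefix": "+", "phrases": True},
    "engine_c": {"joiner": " ", "prefix": "", "phrases": False},
}


def reformat_query(terms, source):
    """Render a list of required terms in the syntax of one source.

    A term containing a space is treated as a phrase. If the source does
    not support phrase search, the phrase is broken into single words.
    """
    rules = SOURCE_SYNTAX[source]
    parts = []
    for term in terms:
        if " " in term and rules["phrases"]:
            parts.append(f'"{term}"')
        elif " " in term:
            parts.extend(term.split())
        else:
            parts.append(term)
    return rules["joiner"].join(rules["prefix"] + part for part in parts)


query = ["kids of survival", "art"]
for name in SOURCE_SYNTAX:
    print(name, "->", reformat_query(query, name))
```

Note how the phrase is silently weakened for the source without phrase support; a meta that does this without telling the searcher returns looser matches from that source.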

  5. The beginnings of metasearch • A conceptual descendant of Veronica • March 1995 – Harvest (later Savvysearch, now Search.com) developed at Colorado State by Daniel Dreilinger • July 1995 – Metacrawler developed at U. of Washington by Selberg and Etzioni • “Metacrawler Architecture for Resource Aggregation on the Web” (1996)

  6. The beginnings of metasearch • 1996 - Dogpile • 1998 - Ixquick • 1999 - Kartoo • 2000 - Ithaki • 2001 - Vivisimo

  7. More facts about metas • “Flavor” determined by choice of sources • Comprehensive: Vivisimo, Ixquick, Metacrawler • General lifestyle, popular culture: Dogpile, Profusion • Commercial: Search.com, Excite@home

  8. Metas and retrieval • Metas search quickly but not deeply • Search time or a quantity of searches is purchased from sources (typically top 10-50 hits from each) • Metas are subject to time-out limits from their sources • Each source is usually NOT searched for each query

  9. Metas and retrieval • “Dumbing Down the Query” • Advanced features are often unavailable; when they are offered, they are limited to those shared among the sources • Default setting for time-out is the shortest; set to maximum for more comprehensive searches (when available) • For most metas, advertising is the only source of revenue; software sales are rare
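
“Dumbing down the query” amounts to taking the intersection of the feature sets of the selected sources. A minimal sketch, assuming invented source names and feature lists:

```python
# Hypothetical feature support per source (assumptions, not real data).
SOURCE_FEATURES = {
    "engine_a": {"phrase", "and", "or", "not"},
    "engine_b": {"phrase", "and", "or"},
    "engine_c": {"phrase", "and"},
}


def shared_features(selected_sources):
    """Return only the features that every selected source supports."""
    feature_sets = [SOURCE_FEATURES[name] for name in selected_sources]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


print(sorted(shared_features(["engine_a", "engine_b", "engine_c"])))
print(sorted(shared_features(["engine_a", "engine_b"])))
```

Adding one weak source shrinks the shared set for the whole search, which is one reason to deselect sources when a query needs Boolean “not” or “or”.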

  10. Metas and retrieval • What is their place in my search strategy? • Metas best used for simple searches, with little (or no) syntactic complexity • Use them to find the top few sites on a topic • For a quick overview of a topic’s coverage on the Web in general • Use them “as a last resort” for highly focused topics that elude your usual search tools • As a possible indication of coverage of a topic among several engines (NOTE: problematic)

  11. Searching the metas • Results depend on • Choice of sources • Query processing speed OF THE SOURCE • Length of time spent at each source
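
The interaction of these three factors can be illustrated with a small simulation. The latencies and hit lists below are invented, and no network requests are made; the sketch only shows why a longer time-out can bring in more sources and therefore more results.

```python
# Simulated sources: name -> (response time in seconds, ranked hits).
# All values are made up for illustration.
SIMULATED_SOURCES = {
    "engine_a": (0.5, ["a1", "a2", "a3"]),
    "engine_b": (2.0, ["b1", "b2"]),
    "engine_c": (6.0, ["c1", "c2", "c3", "c4"]),
}


def collect(sources, timeout, per_source_limit=10):
    """Keep hits only from sources that answer within the time-out.

    per_source_limit models the cap on hits taken from each source.
    """
    gathered = {}
    for name, (latency, hits) in sources.items():
        if latency <= timeout:
            gathered[name] = hits[:per_source_limit]
    return gathered


for timeout in (1.0, 3.0, 10.0):
    gathered = collect(SIMULATED_SOURCES, timeout)
    total = sum(len(hits) for hits in gathered.values())
    print(f"time-out {timeout:>4}s: {len(gathered)} sources, {total} hits")
```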

  12. A search comparison . . . • Searched heterotropia (abnormal binocular vision) on 4/21/03; number of results at the shortest and longest time-out settings: • Vivisimo: 77 (shortest), 126 (longest) • Ixquick: 37 “from at least 450 results” • Profusion: 30 (shortest), 39 (longest) • Metacrawler: 42 (shortest), 61 (longest) • Webcrawler: 31 (shortest), 80 (longest) • Dogpile: 29 (no time-out option) • Excite: 41 (shortest), 31 (longest)

  13. Stability of Results • Searched “kids of survival” (modern art group) as a phrase at 3-minute intervals (time-outs at default setting), 4/21/03

  14. Metas and ranking options • Listing by SOURCE • Usually retains ranking of source • COMBINED Listing options • Indicate source of each result • Indicate duplicates without repeating them • Indicate position in original source’s ranking • “Most duplicated hits” listed first • Disclose paid listings (if disclosed by source)
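
A combined listing with these properties can be sketched in a few lines. This is one possible approach, not the method of any particular meta; the source names and URLs are placeholders, and real services would also need to normalize URLs before comparing them.

```python
from collections import defaultdict


def combine(listings):
    """Merge ranked lists from several sources into one combined listing.

    listings maps a source name to its ranked list of URLs. Each URL
    appears once in the output, together with every (source, rank) pair
    where it was found. URLs found in the most sources come first; ties
    are broken by the best original rank, then alphabetically.
    """
    found = defaultdict(list)
    for source, urls in listings.items():
        for rank, url in enumerate(urls, start=1):
            found[url].append((source, rank))
    return sorted(
        found.items(),
        key=lambda item: (-len(item[1]), min(r for _, r in item[1]), item[0]),
    )


listings = {
    "engine_a": ["x.org", "y.org", "z.org"],
    "engine_b": ["y.org", "w.org"],
    "engine_c": ["y.org", "x.org"],
}
for url, origins in combine(listings):
    print(url, origins)
```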

  15. Vivisimo • http://vivisimo.com • Sources: Altavista, Yahoo, MSN, Netscape, Lycos, LookSmart, Gigablast, Vizzavi, BBC, Librarian’s Index to the Internet plus 11 specialized news sources and 7 specialized business, medical and governmental sources • Offers full Boolean and phrase search (if supported by the source)

  16. Vivisimo • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”) • Results combined • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated

  17. Vivisimo • Other features: • Results are clustered by keyword prevalence or website of origin • Offers a preview of each result in a separate window • Offers vertical searches: Top News, Business News, Tech News, Sports News

  18. Clustering results (“folders”) • Automated “subject analysis” • Facilitates navigation and query refinement • Can be hierarchical (folders within folders) • One document may appear in several folders • Northern Light first public search engine to make use of folders

  19. Clustering technology in a metasearch environment • Real-time processing of results retrieved from sources • Variety of data can be returned from each source • Url • Title • First few sentences • Human-created summary • Folder creation varies according to data from sources and processing time available at the moment of the query

  20. Clustering – Step 1 • Significant terms are identified from all results based on • Frequency of term(s) • Position of term(s) • Normalization algorithms applied • Documents analyzed for word variants (stemming) • Norms set (“authority control”), e.g. “game downloads”, “download games”, “downloading games” • Folder “labels” created
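
The normalization step can be illustrated with the three example phrases from this slide. The stemmer below is a deliberately crude stand-in (it only strips “ing” and “s”); commercial clustering engines use proprietary and far more careful algorithms.

```python
from collections import Counter


def crude_stem(word):
    """Strip a trailing 'ing' or 's'. A toy stemmer, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_label(phrase):
    """Map word-order and word-form variants of a phrase to one norm."""
    return " ".join(sorted(crude_stem(word) for word in phrase.split()))


phrases = ["game downloads", "download games", "downloading games"]
counts = Counter(normalize_label(phrase) for phrase in phrases)
print(counts)
```

All three variants collapse to a single norm, so their frequencies add up and the combined term becomes a stronger candidate for a folder label.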

  21. Clustering – Step 2 • Each result from the sources is matched against the set of folder labels and assigned to one or more folders • By linguistic analysis (term position, predictive descriptive importance) • By statistical analysis (term frequency) • Final, proprietary analysis combines these (and more) • Remember: The full documents are not available to a meta for this type of processing
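
Folder assignment can be sketched as a simple term match against each result's title and snippet, which is all a meta has to work with. This is a simplified stand-in for the linguistic and statistical analysis described above; the result records and example.com URLs are hypothetical.

```python
def assign_to_folders(results, labels):
    """Place each result in every folder whose label terms it contains.

    Only the title and snippet are examined, since the full document is
    not available to the meta. A result may land in several folders.
    """
    folders = {label: [] for label in labels}
    for result in results:
        text = f"{result['title']} {result['snippet']}".lower()
        words = set(text.split())
        for label in labels:
            if all(term in words for term in label.lower().split()):
                folders[label].append(result["url"])
    return folders


results = [
    {"url": "http://example.com/1", "title": "Free game downloads",
     "snippet": "Download a puzzle game today"},
    {"url": "http://example.com/2", "title": "Game news",
     "snippet": "Latest headlines"},
]
print(assign_to_folders(results, ["game", "puzzle", "news"]))
```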

  22. Profusion • http://profusion.com • Sources: Altavista, Yahoo, MSN, About.com, Adobe PDF, AOL, LookSmart, Lycos, Netscape, Raging Search, Teoma, WiseNut • Offers full Boolean and phrase search (if supported by the source)

  23. Profusion • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”) • Offers option of results listed by source or combined listing • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated

  24. Profusion • Other features: • Results can be sorted by relevance score, title or URL • “Similar Result” enhancement • Profusion Relevance Score shown • Search terms highlighted in results listing • “Set Search Alert” feature stores searches and alerts user to page changes; requires setting up a (free) account • Search Analysis available • Offers vertical searches: Deep Web content in 21 broad categories; News

  25. Ixquick • http://ixquick.com • Sources: Altavista, Netscape, Gigablast, Adobe PDF, Avaya PDF, AskJeeves, Teoma, Go, Open Directory, Overture, Kanoodle, LookSmart, WiseNut, FindWhat, Yahoo, MSN • Offers full Boolean and phrase search (if supported by the source) • Offers the following customizations: • Selection of sources searched • Length of search (“time-out period”)

  26. Ixquick • Results combined • Source for each result given • Ranking data from that source given • Duplicates noted, but not repeated

  27. Ixquick • Other features: • Offers 7 field searches (when supported by sources) • Clusters hits from same site • Highlights search terms in each hit • Offers “Related Searches” • Offers vertical searches: MP3, News, Pictures

  28. iBoogie • http://iboogie.com • Sources: Altavista, Yahoo, MSN, FAST, FindWhat, Teoma, WiseNut, OpenFind • Boolean and phrase search somewhat unreliable • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”)

  29. iBoogie • Results combined • Source for each result given • Duplicates noted, but not repeated • Other features: • Adult content filter (when supported by source) • Language limit (when supported by source) • Clusters results by keyword and/or website • Offers “Similar Pages” enhancement • Offers vertical searches: Newspapers, Bookstores, Reference, Shopping

  30. Metacrawler • http://metacrawler.com • Sources: FAST, Google, About.com, AskJeeves, FindWhat, LookSmart, Inktomi (?), Open Directory, Overture, Search Hippo, Sprinks, Teoma • Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source) • Offers the following customizations: • Selection of sources searched • Total number of results retrieved • Length of search (“time-out period”)

  31. Metacrawler • Offers option of results listed by source or combined listing • Source for each result given • Duplicates noted, but not repeated • Other features: • Offers Related Searches • “More like this” results enhancement • Offers a wide range of vertical searches: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

  32. Dogpile • http://dogpile.com • Sources: Google, FAST, About.com, Ah-ha, AskJeeves, FindWhat, LookSmart, Open Directory, Search Hippo, Sprinks, Overture, Inktomi (?) • Offers Boolean “and”, “or” (no “not”) and phrase search (if supported by the source) • Offers the following customization: • Selection of sources searched

  33. Dogpile • Results listed ONLY by source • Source for each result given • Other features: • Offers Related Searches • Offers a wide range of vertical searches, similar to Metacrawler: Images, MP3, Shopping, Subject Directory, Multimedia, News, Message Boards

  34. Web Search Agents, aka desktop client search programs • Software must be purchased • Queries a fixed set of engines, directories, news and other databases • Sites that review and feature search agents • Searchenginewatch.com • Searchengineshowdown.com • www.botspot.com • www.agentland.com

  35. Web Search Agents: typical features • Queries are re-formulated to follow syntax of source databases • Duplicates removed • Additional ranking performed • Source given • Optional sort orders • Optional grouping of results into “folders” • Many output options (html, word processor, xml, e-mail and more)

  36. Web Search Agents: different from other metas? • Differences from the (good) free metas • Many more sources queried • Several output options • Update option (re-running the search at specified intervals) • Customizable search parameters

  37. Web Search Agents • BullsEye Pro 3.0 $199 • BullsEye Plus $49.99 • Covers 1000+ sources • Removes dead links • Multiple language capability • Government and News search groups • Customization of sources available for an additional fee • All other “typical features” • Available at intelliseek.com

  38. Web Search Agents • Copernic Pro 5.02 $79.95 • Copernic 2001 Plus $39.95 • Copernic Plus Basic Free • Pro version covers 1000+ sources • Removes dead links • Post-search refinement and processing of retrieved results • Automatic document summarizations (requires more software) • All other “typical features” • Available at www.copernic.com

  39. Ultrabar: choosing your own sources • Free download • Searches a small set of pre-selected engines and allows more to be added, including Deep Web resources • Offers search term highlighting • Does not re-formulate queries for each source • No output options • Available at ultrabar.com

  40. Evaluating metasearch services • What are the sources for the results? • Good general search engines and high-quality directories? Shopping engines? Do any sources share the same database? • What search features are offered? • Remember, these are only in effect for the sources that support them. • What results-based enhancements are offered? • Clustering? “More like this”? Highlighting of search terms? “Related Searches”?

  41. Evaluating metasearch services • What factors determine the ranking of results? • Is there any processing of results after retrieval from the sources? • Is the source and/or ranking in that source given for each hit? • Can the user expand the number of sources searched and/or the search time?

  42. Evaluating metasearch services • Use your own test-drive questions and compare with results from other meta-engines and good single engines and directories. • Search for questions in specialized subject areas you are familiar with (tests database depth). • Search for very recent topics (tests database freshness)

  43. Evaluating metasearch services • Check its popularity through an independent rating or popularity monitoring service • Media Metrix http://www.mediametrix.com/ • The oldest user-based rating service on the Web: lists top 50 most visited sites. • PC Data Online http://www.pcdataonline.com/reports/ • Check for information at the site • About, FAQ, Contact Us

  44. A GOOD meta will . . . • Re-format queries to be compatible with search syntax of each source • Enable searchers to use advanced features (when the sources support them) • Indicate overlapping results without repeating them • Perform additional processing of results, e.g. ranking for appropriateness, categorization, etc. • Use only sources with unique databases

  45. In conclusion . . . • How do metas fit into my search strategy? • Metas best used for simple searches, with little (or no) syntactic complexity • Use them to find the top few sites on a topic • For a quick overview of a topic’s coverage on the Web in general • Use them “as a last resort” for highly focused topics that elude your usual search tools • As a possible indication of coverage of a topic among several engines (NOTE: problematic) • Other uses??
