
Search Engines: The players and the field


meriel

Presentation Transcript


  1. Search Engines: The players and the field • The mechanics of a typical search. • The search engine wars. • Statistics from search engine logs. • The architecture of a search engine. • The query engine.

  2. Mechanics of a typical search

  3. Results & ads returned ranked

  4. Category of first result

  5. Result for phrase query

  6. Search on the Web
     • Corpus: The publicly accessible Web: static + dynamic
     • Goal: Retrieve high quality results relevant to the user’s need (not docs!)
     • Need
       • Informational – want to learn about something
       • Navigational – want to go to that page
       • Transactional – want to do something (web-mediated)
         • Access a service
         • Downloads
         • Shop
       • Gray areas
         • Find a good hub
         • Exploratory search “see what’s there”
     • Example queries: Tampere weather, Mars surface images, Nikon CoolPix, Low hemoglobin, United Airlines, Car rental Finland, Abortion morality

  7. Search Engines as Info Gatekeepers • Search engines are becoming the primary entry point for discovering web pages. • Ranking of web pages influences which pages users will view. • Exclusion of a site from search engines will cut off the site from its intended audience. • The privacy policy of a search engine is important. • Introna & Nissenbaum: “Defining the Web: The Politics of Search Engines” • Hindman et al: “Googlearchy: How a few Heavily-Linked Sites Dominate Politics on the Web”

  8. Search Engine Wars • The battle for domination of the web search space is heating up! • The competition is good news for users! • Crucial: advertising is combined with search results! • What if one of the search engines manages to dominate the space?

  9. Yahoo! • Synonymous with the dot-com boom, probably the best known brand on the web. • Started off as a web directory service in 1994, acquired leading search engine technology in 2003. • Has very strong advertising and e-commerce partners.

  10. Lycos • One of the pioneers of the field. • Introduced innovations that inspired the creation of Google.

  11. Google • The verb “google” has become synonymous with searching for information on the web. • Has raised the bar on search quality. • Has been the most popular search engine in the last few years. • Had a very successful IPO in August 2004. • Is innovative and dynamic. • Has restored the glamour that CS lost in the dot-com bust.

  12. Live Search (was: MSN Search) • Microsoft is synonymous with PC software. • Remember its victory in the browser wars with Netscape. • Developed its own search engine technology only recently; officially launched in Feb. 2005. • May link web search into its next version of Windows.

  13. Ask Jeeves • Specialises in natural language question answering. • Search driven by Teoma.

  14. Cuil • The latest kid on the block • Claims to have indexed 120B pages! • So far, it does not rank!

  15. Experiment with query syntax • Default is AND, e.g. “computer chess” is normally interpreted as “computer AND chess”, i.e. both keywords must be present in every hit. • “+chess” in a query means the user insists that “chess” be present in every hit. • “computer OR chess” means at least one of the keywords must be present in every hit. • “"computer chess"” (with quotation marks) means that the exact phrase “computer chess” must be present in every hit.
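
The AND, OR and phrase semantics above can be illustrated on a toy collection. This is a minimal sketch, not how any real engine parses queries: the four documents in `DOCS` and the helper names (`match_and`, `match_or`, `match_phrase`, `search`) are made up for illustration.

```python
import re

# Hypothetical toy collection; real engines work on an index, not raw text.
DOCS = {
    1: "computer chess programs beat grandmasters",
    2: "chess openings for beginners",
    3: "buying a new computer",
    4: "the chess computer market",
}

def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def match_and(doc, terms):
    """Default semantics: every keyword must occur in the document."""
    words = set(tokens(doc))
    return all(t in words for t in terms)

def match_or(doc, terms):
    """OR semantics: at least one keyword must occur in the document."""
    words = set(tokens(doc))
    return any(t in words for t in terms)

def match_phrase(doc, phrase):
    """Phrase semantics: the keywords must occur adjacently, in order."""
    words = tokens(doc)
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def search(matcher, arg):
    """Return the sorted ids of all documents accepted by the matcher."""
    return sorted(doc_id for doc_id, text in DOCS.items() if matcher(text, arg))

and_hits = search(match_and, ["computer", "chess"])      # docs 1 and 4
or_hits = search(match_or, ["computer", "chess"])        # all four docs
phrase_hits = search(match_phrase, "computer chess")     # doc 1 only
```

Document 4 contains both words but in the order “chess computer”, so it satisfies the AND query and not the phrase query.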

  16. Statistics from search engine logs

  17. The most popular search keywords

  18. Web search Users
      • Ill-defined queries
        • Short length
        • Imprecise terms
        • Sub-optimal syntax (80% of queries without operator)
        • Low effort in defining queries
      • Wide variance in
        • Needs
        • Expectations
        • Knowledge
        • Bandwidth
      • Specific behavior
        • 85% look over one result screen only
          • mostly above the fold
        • 78% of queries are not modified
          • 1 query/session
        • Follow links – “the scent of information” ...

  19. Query Distribution Power law: few popular broad queries, many rare specific queries
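
To get a feel for what a power law implies, the sketch below assumes a Zipf-style distribution in which the query of popularity rank r has frequency proportional to 1/r^s. The exponent `s = 1.0` and the vocabulary of 100,000 distinct queries are assumptions chosen for illustration; they are not taken from the slide or from any search log.

```python
def zipf_shares(num_queries, exponent=1.0):
    """Share of total traffic for each query rank under an assumed Zipf law."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_queries + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Assumed, illustrative parameters (not measured values).
shares = zipf_shares(100_000, exponent=1.0)

top_10_share = sum(shares[:10])    # traffic captured by the 10 most popular queries
tail_share = sum(shares[1000:])    # traffic captured by everything beyond rank 1000

print(f"top 10 queries:   {top_10_share:.1%} of traffic")
print(f"beyond rank 1000: {tail_share:.1%} of traffic")
```

Under these assumptions a handful of broad queries takes a large share of the traffic, while the long tail of rare, specific queries still adds up to a substantial share.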

  20. How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

  21. Architecture of a Search Engine [Diagram components: User, Search, Web spider, Indexer, The Web, Indexes, Ad indexes]

  22. Rate of web content change • 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999 [Cho00] • Mathematically, what does this seem to be? • What does this suggest for crawling policy?

  23. Diversity
      • Languages/Encodings
        • Hundreds of languages, W3C encodings: 55 (Jul01) [W3C01]
        • Home pages (1997): English 82%, Next 15: 13% [Babe97]
        • Google (mid 2001): English: 53%, JGCFSKRIP: 30%
      • Document & query topic
      Popular Query Topics (from 1 million Google queries, Apr 2000)
        Category   | Share | Subcategory             | Share
        Arts       | 14.6% | Arts: Music             | 6.1%
        Computers  | 13.8% | Regional: North America | 5.3%
        Regional   | 10.3% | Adult: Image Galleries  | 4.4%
        Society    | 8.7%  | Computers: Software     | 3.4%
        Adult      | 8%    | Computers: Internet     | 3.2%
        Recreation | 7.3%  | Business: Industries    | 2.3%
        Business   | 7.2%  | Regional: Europe        | 1.8%
        …          | …     | …                       | …

  24. Search Index - Inverted File Frequency • Also store position of word in web page (“offset”) and information on HTML structure.
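
A minimal sketch of an inverted file that records, for every word, the pages it occurs in and the offsets at which it occurs. The page names and the `build_index` helper are hypothetical, and HTML structure is not modelled here; a word's frequency in a page is simply the length of its offset list.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each word to {page: [offsets]}, where offset is the word position."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for offset, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(url, []).append(offset)
    return index

# Hypothetical pages, already stripped of markup.
pages = {
    "a.html": "chess is a game; computer chess is a field",
    "b.html": "computer science",
}
index = build_index(pages)

chess_postings = index["chess"]                  # {"a.html": [0, 5]}
chess_frequency = len(chess_postings["a.html"])  # 2 occurrences in a.html
computer_pages = sorted(index["computer"])       # ["a.html", "b.html"]
```

Storing offsets is what makes phrase queries answerable from the index alone: two words form a phrase in a page when their offsets differ by one.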

  25. The query engine • The interface between the search index, the user and the web. • Algorithmic details of commercial search engines are kept as trade secrets. • First step is retrieval of potential results from the index. • Second step is the ranking of the results based on their “relevance” to the query.
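
The two steps can be sketched as separate functions over a toy index. Since the ranking algorithms of commercial engines are trade secrets, the scoring used here (the sum of term frequencies) is only a stand-in chosen for illustration; `INDEX`, `retrieve`, `rank` and `answer` are hypothetical names.

```python
# Hypothetical toy index: term -> {document id: term frequency}.
INDEX = {
    "computer": {"d1": 3, "d2": 1, "d3": 2},
    "chess": {"d1": 1, "d2": 4},
}

def retrieve(index, terms):
    """Step 1: fetch candidate documents that contain every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

def rank(index, terms, candidates):
    """Step 2: order candidates by a relevance score (here: summed frequency)."""
    def score(doc):
        return sum(index[t][doc] for t in terms)
    # Higher score first; ties broken by document id for a stable order.
    return sorted(candidates, key=lambda d: (-score(d), d))

def answer(index, terms):
    """Run retrieval, then ranking, and return the ordered result list."""
    return rank(index, terms, retrieve(index, terms))

results = answer(INDEX, ["computer", "chess"])   # ["d2", "d1"]
```

Keeping retrieval and ranking apart means the scoring function can be swapped out without touching how candidates are fetched from the index.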

  26. Portal User Interface

  27. Crawling the Web • Mode of crawl: BFS (breadth-first search) • Frequency of crawl: important • robots.txt gives explicit directions on what not to crawl • Parallel machines crawl all the time
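
A minimal sketch of a breadth-first crawl that respects robots.txt. To stay self-contained it walks an in-memory link graph instead of fetching pages; the `WEB` graph, the `example.test` URLs, the robots.txt rules and the `toy-crawler` agent name are all made up. Only `urllib.robotparser.RobotFileParser` (with its `parse` and `can_fetch` methods) comes from the Python standard library. Crawl frequency and parallel crawling are not modelled.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical link graph standing in for the Web: page -> outgoing links.
WEB = {
    "http://example.test/": ["http://example.test/a", "http://example.test/private/x"],
    "http://example.test/a": ["http://example.test/b", "http://example.test/"],
    "http://example.test/b": [],
    "http://example.test/private/x": ["http://example.test/secret"],
}
# Hypothetical robots.txt for the site, given as a list of lines.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]

def bfs_crawl(seed, web, robots_lines, agent="toy-crawler"):
    """Visit pages breadth-first from the seed, skipping disallowed URLs."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    visited, order = {seed}, []
    queue = deque([seed])
    while queue:
        url = queue.popleft()              # FIFO queue gives breadth-first order
        if not robots.can_fetch(agent, url):
            continue                       # robots.txt says not to crawl this page
        order.append(url)
        for link in web.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order

crawl_order = bfs_crawl("http://example.test/", WEB, ROBOTS_TXT)
```

Because the disallowed page is never fetched, its outgoing links are never seen either, so `http://example.test/secret` is not reached in this toy graph.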
