
Search Engine Survey

Search Engine Survey. Hongfei Yan, 2/15/2007. Outline: Background Information (definition, history, how search engines work); General Search Engines (interface, databases, features; Google, Yahoo!, Baidu, Live); Open Source Search Engines (Lucene, SWISH-E)


Presentation Transcript


  1. Search Engine Survey Hongfei Yan 2/15/2007

  2. Outline • Background Information • Definition, history, how search engines work • General Search Engines • Interface, databases, features • Google, Yahoo!, Baidu, Live • Open Source Search Engines • Lucene, SWISH-E • Metasearch, Visual, and Answer Search Engines

  3. Definition of Search Engine • A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the Web, inside a corporate or proprietary network, or on a personal computer. • The search engine allows one to ask for content meeting specific criteria (typically content containing a given word or phrase) and retrieves a list of items that match those criteria. • This list is often sorted by some measure of the relevance of the results. • Search engines use regularly updated indexes to operate quickly and efficiently. • The term "search engine" usually refers to a Web search engine, which searches for information on the public Web.

  4. Timeline of Search Engines • (Timeline figure; labels: "full text" crawler-based engines; link popularity and PageRank)

  5. How search engines work • Web crawling • A crawler is an automated Web browser that follows every link it sees. Pages can be excluded through robots.txt. • Indexing • The contents of each page are analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). • Searching • When a user submits a query, the engine looks it up in the index and returns a list of the best-matching Web pages according to its criteria.
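The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not how any production engine is built: the in-memory `PAGES` dictionary, the `ROBOTS_TXT` rules, and the function names `crawl`, `build_index`, and `search` are all assumptions made for the example, and the "crawler" reads from the dictionary instead of the network.

```python
from collections import defaultdict, deque
import re
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": ("Search engines crawl the Web",
                            ["http://example.com/a", "http://example.com/private/x"]),
    "http://example.com/a": ("An index maps words to pages", ["http://example.com/"]),
    "http://example.com/private/x": ("Secret crawl notes", []),
}
# Hypothetical robots.txt rules for the site above.
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


def crawl(seed):
    """Follow every link breadth-first, skipping URLs excluded by robots.txt."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)
    seen, queue, fetched = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # excluded by robots.txt
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


def build_index(fetched):
    """Build an inverted index: word -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in fetched.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


fetched = crawl("http://example.com/")
index = build_index(fetched)
```

With this toy data, `search(index, "crawl")` finds only the home page, because the private page that also contains "crawl" was never fetched.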

  6. Storage costs and crawling time • Storage costs are not the limiting resource in search engine implementation. • Simply storing 10 billion pages of 10 KB each (compressed) requires 100 TB, plus another 100 TB or so for indexes, giving a total hardware cost of under $200k: 100 cheap PCs, each with four 500 GB disk drives. • However, a public search engine requires considerably more resources than this to calculate query results and to provide high availability. • Also, the costs of operating a large server farm are not trivial. • Crawling 10B pages with 100 machines, each crawling 100 pages/second, would take 1M seconds, or about 11.6 days, on a very high capacity Internet connection.
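The figures on this slide can be checked with a short back-of-the-envelope calculation. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and reads the crawl rate as 100 pages/second per machine; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the storage and crawl-time figures.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 TB = 10**12 bytes).
NUM_PAGES = 10 * 10**9          # 10 billion pages
PAGE_SIZE_BYTES = 10 * 10**3    # 10 KB per compressed page
TB = 10**12

page_storage_tb = NUM_PAGES * PAGE_SIZE_BYTES / TB   # storage for the pages
index_storage_tb = 100                               # "another 100 TB or so"
total_storage_tb = page_storage_tb + index_storage_tb

MACHINES = 100
DISKS_PER_MACHINE = 4
DISK_SIZE_GB = 500
capacity_tb = MACHINES * DISKS_PER_MACHINE * DISK_SIZE_GB / 1000

# Assumption: each of the 100 machines crawls 100 pages/second.
PAGES_PER_SECOND_PER_MACHINE = 100
crawl_seconds = NUM_PAGES / (MACHINES * PAGES_PER_SECOND_PER_MACHINE)
crawl_days = crawl_seconds / 86_400   # seconds per day

print(f"pages: {page_storage_tb:.0f} TB, total: {total_storage_tb:.0f} TB, "
      f"disk capacity: {capacity_tb:.0f} TB")
print(f"crawl time: {crawl_seconds:,.0f} s = {crawl_days:.1f} days")
```

The 200 TB of disk across the 100 PCs exactly covers the 200 TB of pages plus indexes, which is why the slide treats storage as the cheap part of the problem.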

  7. Outline • Background Information • Definition, history, how search engines work • General Search Engines • Interface, databases, features • Google, Yahoo!, Baidu, Live • Open Source Search Engines • Lucene, SWISH-E • Metasearch, Visual, and Answer Search Engines

  8. General Search Engine • Primary Search Engines • They are well known and heavily used. • They can potentially generate a great deal of traffic. * Google * Yahoo! * Baidu * Live • Secondary Web Search Engines • These are either smaller, or are not the primary search engine for accessing the databases of the search providers listed below. * Exalead * Gigablast * WiseNut • Dead Search Engines • These search engines used to offer their own database or unique search features. They have all abandoned their position in search, although they may still have some kind of search functionality. * AlltheWeb * AltaVista * Excite * Infoseek * Inktomi

  9. GSE: Minimalist User Interface

  10. GSE: Databases • Web: • Indexed Web pages (also includes URLs that have not been fully indexed). • Additional file types in the Web database include PDF, .ps, .doc, .xls, .txt, .ppt, .rtf, .asp, and more. • Ads: Paid advertisements, usually shown on the right side (or top) under a "Sponsored Links" heading.

  11. GSE: Google Database Components

  12. GSE: Features • A large, unique search engine database • Includes cached copies of pages • Uses not only PageRank but more than 150 criteria to determine relevance • Default Operation: Multiple search terms are processed as an AND operation by default. Phrase matches are ranked higher (Proximity Searching). • No truncation is available. • Case Sensitivity: None; using either lower or upper case returns the same hits.
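The default-AND, case-insensitive, and phrase-boost behaviours listed above can be illustrated with a small sketch. It is not Google's ranking, which combines many more signals; the `tokenize` and `search` functions and the sample documents are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lower-case the text so that case never affects matching."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(docs, query):
    """Implicit AND over all query terms; exact phrase matches rank first."""
    terms = tokenize(query)
    results = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        # Default AND: every term must be present in the document.
        if not terms or not all(term in tokens for term in terms):
            continue
        # Phrase match: the terms occur adjacent and in query order.
        n = len(terms)
        is_phrase = any(tokens[i:i + n] == terms
                        for i in range(len(tokens) - n + 1))
        results.append((0 if is_phrase else 1, doc_id))
    return [doc_id for _, doc_id in sorted(results)]


# Hypothetical documents for illustration.
DOCS = {
    "d1": "Search for an engine repair manual",
    "d2": "Open source search engine",
    "d3": "Engine parts",
}
```

Here `search(DOCS, "Search Engine")` and `search(DOCS, "SEARCH ENGINE")` return the same list: `d2` first because it contains the exact phrase, then `d1`, while `d3` is dropped because it lacks one of the terms.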

  13. GSE: Features contd. • Field searching • Language Limits: Default is all languages. 30+ language limits are available. • Stop Words: Searches almost all words except for operators like AND. • Display: Each hit shows the title, the URL, a brief extract showing text near the search terms, the file size, and, for many hits, a link to a cached copy of the page.

  14. Review of Google • In Feb. 1999 Google moved from its alpha test version to beta, and it officially launched on Sept. 21, 1999. • Since that time it has made its mark with its relevance ranking based on link analysis, cached pages, and aggressive growth. • Since its beta release, it has had phrase searching and the - for NOT, but it did not add an OR operation until Oct. 2000. • In Dec. 2000, it added title searching. • In June 2000 it announced a database of over 560 million pages, which grew to over 600 million by the end of 2000 and then 1.5 billion in Dec. 2001. • The 2+ billion reported on its home page as of April 2002 includes indexed pages, unindexed URLs, and other file formats. By Nov. 2002, it moved its claim up to 3 billion, and in Feb. 2004 it went to 4 billion. • While no official claim is given, 20+ billion is one current estimate.

  15. Review of Yahoo! • The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in Electrical Engineering at Stanford University, started their guide in a campus trailer in February 1994 as a way to keep track of their personal interests on the Internet. Before long they were spending more time on their home-brewed lists of favourite links than on their doctoral dissertations. Eventually, Jerry and David's lists became too long and unwieldy, and they broke them out into categories. When the categories became too full, they developed subcategories ... and the core concept behind Yahoo! was born. • In 2002, Yahoo! acquired Inktomi, and in 2003 it acquired Overture, which owned AlltheWeb and AltaVista. • In 2004, Yahoo! launched its own search engine based on the combined technologies of its acquisitions, providing a service that gave pre-eminence to the Web search engine over the directory.

  16. Review of Live • Live Search is the successor to MSN Search. This is the Microsoft Web search engine. Launched in September 2006, it uses its own, unique database. • In 2004 it debuted a beta version of its own results, powered by its own web crawler (called msnbot). • In early 2005 it started showing its own results live. At the same time, Microsoft ceased using results from Inktomi, now owned by Yahoo!. • In 2006, Microsoft migrated to a new search platform - Windows Live Search, retiring the "MSN Search" name in the process.

  17. Review of Baidu • Baidu (Chinese: 百度; pinyin: bǎi dù) is a popular Chinese search engine which launched in 2000 and can search text and images. As of January 2007, and since at least May 2006, it has ranked fourth in Alexa's Internet rankings, with a market share of 52 percent. • Baidu provides an index of over 1 billion web pages.

  18. Outline • Background Information • Definition, history, how search engines work • General Search Engines • Interface, databases, features • Google, Yahoo!, Baidu, Live • Open Source Search Engines • Lucene, SWISH-E • Metasearch, Visual, and Answer Search Engines

  19. Lucene, lucene.apache.org • Lucene is a free and open source information retrieval API, originally implemented in Java by Doug Cutting. Lucene has been ported to programming languages including Perl, C#, C++, Python, Ruby and PHP. • It is suitable for any application that requires full-text indexing and searching capability. • At the core of Lucene's logical architecture is the notion of a document containing fields of text. This flexibility allows Lucene's API to be agnostic of file format. Text from PDFs, HTML, Microsoft Word documents, and many other formats can be indexed so long as their textual information can be extracted.
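The "document containing fields of text" idea can be sketched without Lucene itself. The Python below is not Lucene's API or any of its ports; `FieldIndex`, `html_to_text`, and the sample documents are hypothetical names used only to show why an index built on extracted text does not need to know the original file format.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, discarding tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Format-specific extraction step: HTML in, plain text out."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class FieldIndex:
    """Toy index over documents made of named text fields."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> document ids

    def add_document(self, doc_id, fields):
        # The index only ever sees field names and plain text.
        for name, text in fields.items():
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                self.postings[(name, term)].add(doc_id)

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))


index = FieldIndex()
index.add_document("report.html", {
    "title": "Quarterly Report",
    "body": html_to_text("<p>Revenue grew in <b>Asia</b></p>"),
})
index.add_document("notes.txt", {
    "title": "Meeting notes",
    "body": "Discuss the quarterly revenue",
})
```

Searching `index.search("body", "revenue")` matches both documents even though one started as HTML and the other as plain text, while `index.search("title", "quarterly")` matches only the report because the search is restricted to one field.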

  20. SWISH-E, swish-e.org • Swish-e stands for Simple Web Indexing System for Humans - Enhanced. It is used to index collections of documents ranging up to one million documents in size and includes import filters for many document types. • Many sites use Swish-e

  21. Outline • Background knowledge • Definition, history, how search engines work • General Search Engines • Interface, databases, features • Google, Yahoo!, Baidu, Live • Open Source Search Engines • Lucene, SWISH-E • Metasearch, Visual, and Answer Search Engines

  22. Visual Search Engine • A search returns both a list of search results and a tag cloud. The tag cloud contains the original search terms surrounded by related tags. The closer a suggested keyword is to the search terms, and the larger it appears (in font size and boldness), the more relevant it is deemed. Holding the mouse over a term displays a new set of results in the bottom window and also shows another keyword cloud overlaying the original.

  23. VSE: Quintura.com

  24. Metasearch Engines • Unlike search engines, metacrawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page.
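The send-to-several-engines-and-blend idea can be sketched as follows. The two "engines" are stub functions returning fixed, made-up result lists, and round-robin interleaving with duplicate removal is just one simple blending strategy chosen for illustration; real metasearch engines may merge results differently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest


# Hypothetical stand-ins for real search engines: each returns a ranked URL list.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/", "http://a.example/2"]


def engine_b(query):
    return ["http://shared.example/", "http://b.example/1"]


def metasearch(query, engines):
    """Send the query to all engines at once and blend the ranked results."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns the result lists in the same order as `engines`.
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    blended, seen = [], set()
    # Round-robin: take rank 1 from every engine, then rank 2, and so on.
    for rank_row in zip_longest(*result_lists):
        for url in rank_row:
            if url is not None and url not in seen:
                seen.add(url)
                blended.append(url)
    return blended


results = metasearch("search engine survey", [engine_a, engine_b])
```

A URL returned by both engines appears once in the blended list, at the best rank at which any engine returned it.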

  25. MSE: Vivisimo

  26. MSE: Kartoo.com

  27. Answer-based search engines • Answers.com: presents reference content in over four million entries, collected from multiple sources.

  28. Reference • http://en.wikipedia.org/wiki/Search_engine • http://www.searchengineshowdown.com/ • http://searchenginewatch.com/ • http://www.searchtools.com/tools/tools.html • ……
