How to Build a Search Engine

Presentation Transcript

  1. How to Build a Search Engine 樂倍達數位科技股份有限公司 范綱岷 (Kung-Ming Fung), kmfung@doubleservice.com, 2008/04/01

  2. Outline • Introduction • Different Kinds of Search Engine • Architecture • Robot, Spider, Crawler • HTML and HTTP • Indexing • Keyword Search • Evaluation Criteria • Related Work • Discussion • About Google • Ajax: A New Approach to Web Applications • References

  3. Introduction

  4. Different Kinds of Search Engine • Directory Search • Full Text Search • Web pages • News • Images • … • Meta Search

  5. Number of pages: Directory < Full-text < Meta • Directory Search (directory-style) • ODP: Open Directory Project, http://dmoz.org/ • Full-Text Search (full-text retrieval) • Google, http://www.google.com/

  6. Meta Search (integrated) • MetaCrawler, http://www.metacrawler.com/ • Aibang (愛幫), http://www.aibang.com/

  7. Simplified control flow of the meta search engine. Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html

  8. Architecture • Simple architecture: WWW -> Robot/Spider/Crawler -> Database -> Indexing -> Keyword Search

  9. Typical high-level architecture of a Web crawler Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  10. Typical anatomy of a large-scale crawler. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  11. High Level Google Architecture Reference: A Survey On Web Information Retrieval Technologies

  12. The architecture of a standard meta search engine. Reference: Web Search – Your Way

  13. The architecture of a meta search engine. Reference: Web Search – Your Way

  14. Cyclic architecture for search engines Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  15. Robot, Spider, Crawler • The robot is the component of a search engine that collects data; it is also known as a spider or crawler. It automatically gathers pages from websites on a configured schedule, usually starting from a set of seed sites and recursively following the links they contain. • A major performance stress is DNS lookup.

  16. Goal • Resolving the hostname in the URL to an IP address using DNS (Domain Name System). • Connecting a socket to the server and sending the request. • Receiving the requested page in response. (A sketch of these steps follows.)
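
A minimal sketch of these three steps in Python, using only the standard library (the host name is a placeholder):

import socket

host = "www.example.com"                    # placeholder host
ip = socket.gethostbyname(host)             # 1. resolve the hostname via DNS
sock = socket.create_connection((ip, 80))   # 2. connect a socket to the server
sock.sendall(("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode())  # send the request
response = b""
while True:                                 # 3. receive the requested page
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()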

  17. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  18. Amount of static and dynamic pages at a given depth (dynamic pages: 5 levels; static pages: 15 levels). Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  19. Policy • A selection policy that states which pages to download. • A re-visit policy that states when to check for changes to the pages. • A politeness policy that states how to avoid overloading Web sites. • A parallelization policy that states how to coordinate distributed Web crawlers. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
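
These four policies could be carried around as one configuration object; a minimal sketch in Python (all field names and default values are hypothetical):

from dataclasses import dataclass

@dataclass
class CrawlPolicy:
    max_depth: int = 5             # selection: how far from the seed pages to go
    revisit_seconds: int = 86400   # re-visit: recheck each page about once a day
    delay_seconds: float = 1.0     # politeness: pause between requests to one host
    num_workers: int = 8           # parallelization: number of coordinated crawlers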

  20. The view of Web Crawler Reference: Structural abstractions of hypertext documents for Web-based retrieval

  21. Flow of a basic sequential crawler Reference: Crawling the Web.
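
A minimal sequential crawler along these lines, in Python with only the standard library (the seed URL and page limit are placeholders; a real crawler would also honor robots.txt and politeness delays):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # FIFO frontier gives breadth-first order
    seen = {seed}
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode(errors="replace")
        except OSError:
            continue           # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen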

  22. A multi-threaded crawler model Reference: Crawling the Web.

  23. HTML and HTTP • HTML – Hypertext Markup Language • HTTP – Hypertext Transfer Protocol • TCP – Transmission Control Protocol • HTTP is built on top of TCP. • Hyperlink • A hyperlink is expressed as an anchor tag with an href attribute. • <a href="http://www.ntust.edu.tw/">NTUST</a> • URL – Uniform Resource Locator (http://www.ntust.edu.tw/)

  24. Request:
GET / HTTP/1.0
Response:
HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html>
<head>
<title>NTUST</title>
</head>
<body>
...
</body>
</html>

  25. For checking a URL Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  26. Operation of a crawler Reference: Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering.

  27. Get new URLs Reference: Crawling on the World Wide Web.

  28. HTML Tag Tree Reference: Crawling the Web.

  29. HTML Tag Tree Reference: Crawling the Web.

  30. Strategies • Breadth-first • Backlink-count • Batch-pagerank • Partial-pagerank • OPIC (On-line Page Importance Computation) • Larger-sites-first Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
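
Most of these strategies amount to ordering the URL frontier by a priority score; a minimal sketch using a heap (the scores stand in for backlink counts and are made up):

import heapq

frontier = []  # min-heap of (-score, url): the highest-priority URL pops first

def push(url, score):
    heapq.heappush(frontier, (-score, url))

def pop():
    neg_score, url = heapq.heappop(frontier)
    return url

push("http://small.example/", 10)     # hypothetical backlink counts
push("http://popular.example/", 250)
print(pop())  # http://popular.example/ is crawled first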

  31. Re-visit policy • Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: F_p(t) = 1 if p is equal to the live copy at time t, and F_p(t) = 0 otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.

  32. Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: A_p(t) = 0 if p has not been modified since it was last crawled, and A_p(t) = t - (time of last modification of p) otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
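
A small sketch of both measures in Python (timestamps are assumed to be seconds since the epoch):

def freshness(local_copy, live_copy):
    # binary: 1 if the repository copy matches the live page, else 0
    return 1 if local_copy == live_copy else 0

def age(t, last_modified, last_crawled):
    # 0 while the local copy is current; otherwise time since the live page changed
    return 0.0 if last_modified <= last_crawled else t - last_modified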

  33. Robot Exclusion http://www.robotstxt.org/wc/exclusion.html • The robots exclusion protocol • The robots META tag

  34. The Robots Exclusion Protocol - /robots.txt • Where to create the robots.txt file? At the top level of the web site, e.g. the robots.txt for http://www.w3.org/ lives at http://www.w3.org/robots.txt.

  35. URLs are case sensitive, and "/robots.txt" must be all lower-case • Examples: • To exclude all robots from the entire server:
User-agent: *
Disallow: /
• To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

  36. To exclude a single robot:
User-agent: BadBot
Disallow: /
• To allow a single robot (all other robots are excluded by the second record):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

  37. To exclude all files except one: • Either move the files to be excluded into a separate, disallowed directory:
User-agent: *
Disallow: /~joe/docs/
• Or list each excluded file explicitly:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

  38. A sample robots.txt file:
# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/
# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Systems/
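
A crawler can check such rules with Python's standard urllib.robotparser module (the site URL and agent name below are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()                                        # fetch and parse the file
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"))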

  39. The Robots META Tag • <meta name="robots" content="noindex,nofollow"> • Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>...

  40. Examples: • <meta name="robots" content="index,follow"> • <meta name="robots" content="noindex,follow"> • <meta name="robots" content="index,nofollow"> • <meta name="robots" content="noindex,nofollow"> • INDEX: whether an indexing robot should index the page • FOLLOW: whether a robot should follow links on the page • The defaults are INDEX and FOLLOW.
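
A minimal sketch of honoring this tag with Python's standard html.parser (the sample page is made up):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Tracks the page's robots directives; defaults are INDEX and FOLLOW."""
    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                content = (d.get("content") or "").lower()
                self.index = "noindex" not in content
                self.follow = "nofollow" not in content

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="noindex,follow"></head></html>')
print(p.index, p.follow)  # False True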

  41. Indexing • In general, the index is built by storing every word or phrase of a page in a keyword index file. Besides the page content itself, keywords that the page author declares in meta tags are also commonly indexed. • TF, IDF, inverted (reverse) index • Stop words
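
A minimal TF-IDF sketch in Python (the two toy documents are the ones used in the inverted-index example later in the deck):

import math
from collections import Counter

docs = {
    "d1": "my care is loss of care with old care done",
    "d2": "your care is gain of care with new care won",
}
tf = {did: Counter(text.split()) for did, text in docs.items()}  # term frequencies

def tf_idf(term, did):
    df = sum(1 for d in tf if term in tf[d])  # document frequency (assumed > 0)
    idf = math.log(len(tf) / df)              # inverse document frequency
    return tf[did][term] * idf

print(tf_idf("loss", "d1"))  # only d1 contains "loss", so idf = ln(2)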

  42. (b) is an inverted index of (a). Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

  43. d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10). • d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10). • tid: token ID • did: document ID • pos: position (the numbers in parentheses) Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  44. Two variants of the inverted index data structure for d1 ("My care is loss of care with old care done.") and d2 ("Your care is gain of care with new care won."):
• Document-level postings:
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2
• Postings with positions (did/positions):
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10
Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  45. The inverted index is usually stored on disk • Implemented using a B-tree or a hash table
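
A minimal in-memory sketch of both variants, using Python dictionaries as the hash table (toy documents from the previous slide):

from collections import defaultdict

docs = {
    "d1": "my care is loss of care with old care done",
    "d2": "your care is gain of care with new care won",
}

postings = defaultdict(set)                          # variant 1: token -> set of doc IDs
positions = defaultdict(lambda: defaultdict(list))   # variant 2: token -> did -> [pos]

for did, text in docs.items():
    for pos, token in enumerate(text.split(), start=1):
        postings[token].add(did)
        positions[token][did].append(pos)

print(sorted(postings["care"]))   # ['d1', 'd2']
print(dict(positions["care"]))    # {'d1': [2, 6, 9], 'd2': [2, 6, 9]}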

  46. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  47. Keyword Search • The retrieval software largely determines whether a search engine sees wide use: users can generally judge a system only by its search speed and the quality of its results, and both are the retrieval software's responsibility. • Artificial intelligence, natural language • Ranking: PageRank, HITS • Query Expansion

  48. WAIS: • The Wide Area Information System (WAIS) is software that builds full-text indexes and provides full-text retrieval of network resources. It consists of three parts: a server, a client, and a protocol. • Query modes: • Keyword • Concept-based • Fuzzy • Natural language

  49. PageRank • A page can have a high PageRank if there are many pages pointing to it, or if some pages that point to it themselves have a high PageRank. Reference: A Survey On Web Information Retrieval Technologies

  50. We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Reference: A Survey On Web Information Retrieval Technologies
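
This formula can be computed iteratively; a minimal Python sketch (the three-page link graph is made up, and a production ranker would iterate until convergence rather than for a fixed count):

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # initial ranks
    for _ in range(iterations):
        # PR(p) = (1 - d) + d * sum over in-neighbors q of PR(q)/C(q)
        pr = {p: (1 - d) + d * sum(pr[q] / len(targets)
                                   for q, targets in links.items() if p in targets)
              for p in pages}
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
print(pagerank(graph))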