How to Build a Search Engine

Presentation Transcript

  1. How to Build a Search Engine 樂倍達數位科技股份有限公司 范綱岷 (Kung-Ming Fung), kmfung@doubleservice.com, 2008/04/01

  2. Outline • Introduction • Different Kinds of Search Engine • Architecture • Robot, Spider, Crawler • HTML and HTTP • Indexing • Keyword Search • Evaluation Criteria • Related Work • Discussion • About Google • Ajax: A New Approach to Web Applications • References

  3. Introduction

  4. Different Kinds of Search Engine • Directory Search • Full Text Search • Web pages • News • Images • … • Meta Search

  5. Number of pages: Directory < Full-text < Meta • Directory Search (directory-style) • ODP: Open Directory Project, http://dmoz.org/ • Full-Text Search (full-text retrieval) • Google, http://www.google.com/

  6. Meta Search (integrated) • MetaCrawler, http://www.metacrawler.com/ • Aibang (愛幫), http://www.aibang.com/

  7. Simplified control flow of the meta search engine. Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html

  8. Architecture • Simple architecture: WWW -> Robot/Spider/Crawler -> Database -> Indexing -> Keyword Search

  9. Typical high-level architecture of a Web crawler Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  10. Typical anatomy of a large-scale crawler. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  11. High Level Google Architecture Reference: A Survey On Web Information Retrieval Technologies

  12. The architecture of a standard meta search engine. Reference: Web Search – Your Way

  13. The architecture of a meta search engine. Reference: Web Search – Your Way

  14. Cyclic architecture for search engines Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  15. Robot, Spider, Crawler • The robot is the component of a search engine that collects data; it is also known as a spider or crawler. It automatically gathers pages from websites on a configured schedule, usually starting from a set of seed sites and recursively following the links they contain. • A major performance stress is DNS lookup.

  16. Goal • Resolving the hostname in the URL to an IP address using DNS (Domain Name System). • Connecting a socket to the server and sending the request. • Receiving the requested page in response. (A sketch of these steps follows.)
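
A minimal sketch of these three steps in Python, using only the standard library (the host name is a placeholder):

import socket

host = "www.example.com"                    # placeholder host
ip = socket.gethostbyname(host)             # 1. resolve the hostname via DNS
sock = socket.create_connection((ip, 80))   # 2. connect a socket to the server
sock.sendall(("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode())  # send the request
response = b""
while True:                                 # 3. receive the requested page
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()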

  17. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  18. Amount of static and dynamic pages at a given depth (dynamic pages: 5 levels; static pages: 15 levels). Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  19. Policy • A selection policy that states which pages to download. • A re-visit policy that states when to check for changes to the pages. • A politeness policy that states how to avoid overloading Web sites. • A parallelization policy that states how to coordinate distributed Web crawlers. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
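
These four policies could be carried around as one configuration object; a minimal sketch in Python (all field names and default values are hypothetical):

from dataclasses import dataclass

@dataclass
class CrawlPolicy:
    max_depth: int = 5             # selection: how far from the seed pages to go
    revisit_seconds: int = 86400   # re-visit: recheck each page about once a day
    delay_seconds: float = 1.0     # politeness: pause between requests to one host
    num_workers: int = 8           # parallelization: number of coordinated crawlers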

  20. The view of Web Crawler Reference: Structural abstractions of hypertext documents for Web-based retrieval

  21. Flow of a basic sequential crawler Reference: Crawling the Web.
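
A minimal sequential crawler along these lines, in Python with only the standard library (the seed URL and page limit are placeholders; a real crawler would also honor robots.txt and politeness delays):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])   # FIFO frontier gives breadth-first order
    seen = {seed}
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode(errors="replace")
        except OSError:
            continue           # skip unreachable pages
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen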

  22. A multi-threaded crawler model Reference: Crawling the Web.

  23. HTML and HTTP • HTML – Hypertext Markup Language • HTTP – Hypertext Transfer Protocol • TCP – Transmission Control Protocol • HTTP is built on top of TCP. • Hyperlink • A hyperlink is expressed as an anchor tag with an href attribute. • <a href="http://www.ntust.edu.tw/">NTUST</a> • URL – Uniform Resource Locator (http://www.ntust.edu.tw/)

  24. Request:
GET / HTTP/1.0
Response:
HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html>
<head>
<title>NTUST</title>
</head>
<body>
...
</body>
</html>

  25. For checking a URL Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  26. Operation of a crawler Reference: Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering.

  27. Get new URLs Reference: Crawling on the World Wide Web.

  28. HTML Tag Tree Reference: Crawling the Web.

  29. HTML Tag Tree Reference: Crawling the Web.

  30. Strategies • Breadth-first • Backlink-count • Batch-pagerank • Partial-pagerank • OPIC (On-line Page Importance Computation) • Larger-sites-first Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
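
Most of these strategies amount to ordering the URL frontier by a priority score; a minimal sketch using a heap (the scores stand in for backlink counts and are made up):

import heapq

frontier = []  # min-heap of (-score, url): the highest-priority URL pops first

def push(url, score):
    heapq.heappush(frontier, (-score, url))

def pop():
    neg_score, url = heapq.heappop(frontier)
    return url

push("http://small.example/", 10)     # hypothetical backlink counts
push("http://popular.example/", 250)
print(pop())  # http://popular.example/ is crawled first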

  31. Re-visit policy • Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: F_p(t) = 1 if p is equal to the live copy at time t, and F_p(t) = 0 otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.

  32. Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: A_p(t) = 0 if p has not been modified since it was last crawled, and A_p(t) = t - (time of last modification of p) otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
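
A small sketch of both measures in Python (timestamps are assumed to be seconds since the epoch):

def freshness(local_copy, live_copy):
    # binary: 1 if the repository copy matches the live page, else 0
    return 1 if local_copy == live_copy else 0

def age(t, last_modified, last_crawled):
    # 0 while the local copy is current; otherwise time since the live page changed
    return 0.0 if last_modified <= last_crawled else t - last_modified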

  33. Robot Exclusion http://www.robotstxt.org/wc/exclusion.html • The robots exclusion protocol • The robots META tag

  34. The Robots Exclusion Protocol - /robots.txt • Where to create the robots.txt file? At the top level of the web site, e.g. the robots.txt for http://www.w3.org/ lives at http://www.w3.org/robots.txt.

  35. URLs are case sensitive, and "/robots.txt" must be all lower-case • Examples: • To exclude all robots from the entire server:
User-agent: *
Disallow: /
• To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

  36. To exclude a single robot:
User-agent: BadBot
Disallow: /
• To allow a single robot (all other robots are excluded by the second record):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

  37. To exclude all files except one: • Either move the files to be excluded into a separate, disallowed directory:
User-agent: *
Disallow: /~joe/docs/
• Or list each excluded file explicitly:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

  38. A sample robots.txt file:
# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/
# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Systems/
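
A crawler can check such rules with Python's standard urllib.robotparser module (the site URL and agent name below are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder site
rp.read()                                        # fetch and parse the file
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"))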

  39. The Robots META Tag • <meta name="robots" content="noindex,nofollow"> • Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>...

  40. Examples: • <meta name="robots" content="index,follow"> • <meta name="robots" content="noindex,follow"> • <meta name="robots" content="index,nofollow"> • <meta name="robots" content="noindex,nofollow"> • INDEX: whether an indexing robot should index the page • FOLLOW: whether a robot should follow links on the page • The defaults are INDEX and FOLLOW.
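
A minimal sketch of honoring this tag with Python's standard html.parser (the sample page is made up):

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Tracks the page's robots directives; defaults are INDEX and FOLLOW."""
    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                content = (d.get("content") or "").lower()
                self.index = "noindex" not in content
                self.follow = "nofollow" not in content

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="noindex,follow"></head></html>')
print(p.index, p.follow)  # False True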

  41. Indexing • In general, the index is built by storing every word or phrase of a page in a keyword index file. Besides the page content itself, keywords that the page author declares in meta tags are also commonly indexed. • TF, IDF, inverted (reverse) index • Stop words
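
A minimal TF-IDF sketch in Python (the two toy documents are the ones used in the inverted-index example later in the deck):

import math
from collections import Counter

docs = {
    "d1": "my care is loss of care with old care done",
    "d2": "your care is gain of care with new care won",
}
tf = {did: Counter(text.split()) for did, text in docs.items()}  # term frequencies

def tf_idf(term, did):
    df = sum(1 for d in tf if term in tf[d])  # document frequency (assumed > 0)
    idf = math.log(len(tf) / df)              # inverse document frequency
    return tf[did][term] * idf

print(tf_idf("loss", "d1"))  # only d1 contains "loss", so idf = ln(2)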

  42. (b) is an inverted index of (a). Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

  43. d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10). • d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10). • tid: token ID • did: document ID • pos: position (the numbers in parentheses) Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  44. Two variants of the inverted index data structure for d1 ("My care is loss of care with old care done.") and d2 ("Your care is gain of care with new care won."):
• Document-level postings:
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2
• Postings with positions (did/positions):
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10
Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  45. The inverted index is usually stored on disk • Implemented using a B-tree or a hash table
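
A minimal in-memory sketch of both variants, using Python dictionaries as the hash table (toy documents from the previous slide):

from collections import defaultdict

docs = {
    "d1": "my care is loss of care with old care done",
    "d2": "your care is gain of care with new care won",
}

postings = defaultdict(set)                          # variant 1: token -> set of doc IDs
positions = defaultdict(lambda: defaultdict(list))   # variant 2: token -> did -> [pos]

for did, text in docs.items():
    for pos, token in enumerate(text.split(), start=1):
        postings[token].add(did)
        positions[token][did].append(pos)

print(sorted(postings["care"]))   # ['d1', 'd2']
print(dict(positions["care"]))    # {'d1': [2, 6, 9], 'd2': [2, 6, 9]}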

  46. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled. Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  47. Keyword Search • The retrieval software largely determines whether a search engine sees wide use: users can generally judge a system only by its search speed and the quality of its results, and both are the retrieval software's responsibility. • Artificial intelligence, natural language • Ranking: PageRank, HITS • Query Expansion

  48. WAIS: • The Wide Area Information System (WAIS) is software that builds full-text indexes and provides full-text retrieval of network resources. It consists of three parts: a server, a client, and a protocol. • Query modes: • Keyword • Concept-based • Fuzzy • Natural language

  49. PageRank • A page can have a high PageRank if there are many pages pointing to it, or if some pages that point to it themselves have a high PageRank. Reference: A Survey On Web Information Retrieval Technologies

  50. We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Reference: A Survey On Web Information Retrieval Technologies
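
This formula can be computed iteratively; a minimal Python sketch (the three-page link graph is made up, and a production ranker would iterate until convergence rather than for a fixed count):

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # initial ranks
    for _ in range(iterations):
        # PR(p) = (1 - d) + d * sum over in-neighbors q of PR(q)/C(q)
        pr = {p: (1 - d) + d * sum(pr[q] / len(targets)
                                   for q, targets in links.items() if p in targets)
              for p in pages}
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
print(pagerank(graph))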