1. IST 441: Crawling and Indexing with Nutch. Presented by: Sujatha Das
Slides courtesy: Saurabh Kataria
Instructor: C. Lee Giles
2. Outline: Brief Overview
Nutch as a complete web search engine
Installation/Usage (with Demo)
3. Search Engine: Basic Workflow
4. Today's class: build a complete web search engine with Nutch
What is Nutch?
Open source search engine
Written in Java
Built on top of Apache Lucene
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
Attractive Features:
Customizable
Extensible (e.g., can be extended to use Solr for enhanced portability)
+Plugins
+MapReduce & Distributed FS (Hadoop)
5. Why Nutch? Scalable
Index local host or entire Internet
Portable
Runs anywhere with Java
Flexible
Plugin system + API
Java-based, open source, many customizable scripts (http://lucene.apache.org/nutch/)
Code pretty easy to read & work with
Better than implementing it yourself!
6. Data Structures used by Nutch: Web Database or WebDB
Mirrors the properties/structure of the web graph being crawled
Segment
Intermediate index
Contains pages fetched in a single run
Index
Final inverted index obtained by merging segments (Lucene)
7. WebDB: customized graph database
Used by Crawler only
Persistent storage for pages & links
Page DB: Indexed by URL and hash; contains content, outlinks, fetch information & score
Link DB: contains source to target links, anchor text
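Conceptually (a sketch of the logical records, not the on-disk format), the two parts of the WebDB hold entries like:
  Page:  URL, MD5 hash of the content, fetch schedule, number of outlinks, score
  Link:  source URL -> target URL, plus the anchor text of the link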
8. Segment: collection of pages fetched in a single run
Contains:
Output of the fetcher
List of links to be fetched in the next run, called the fetchlist
Limited life span (default 30 days)
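On disk, a segment is a directory of per-run data. A sketch of the 0.7-era layout (subdirectory names vary across Nutch versions):
  segments/20090401123456/   (segments are named by creation timestamp)
    fetchlist/    URLs scheduled for this run
    fetcher/      per-URL fetch status
    content/      raw downloaded content
    parse_text/   extracted plain text
    parse_data/   outlinks and metadata
    index/        per-segment Lucene index, before merging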
9. Index: to be discussed later
10. A generic Website Structure
11. Crawling: a cyclic process
the crawler generates a set of fetchlists from the WebDB
fetchers download the content from the Web
the crawler updates the WebDB with new links that were found
and then the crawler generates a new set of fetchlists
Repeat until the specified depth is reached
12. Nutch as a crawler
13. Nutch as a complete web search engine
14. Crawling: 10-stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
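For reference, the single crawl command roughly decomposes into these individual tools. A sketch using 0.7-era command names taken from the steps above (exact flags and arguments differ between Nutch versions; inject's arguments are elided):
  bin/nutch admin db -create          # step 1: create a new WebDB
  bin/nutch inject db ...             # step 2: inject root URLs (flags elided)
  bin/nutch generate db segments      # step 3: new segment with a fetchlist
  s=`ls -d segments/2* | tail -1`     # pick the newest (timestamp-named) segment
  bin/nutch fetch $s                  # step 4: fetch pages on the fetchlist
  bin/nutch updatedb db $s            # step 5: add newly discovered links
  # repeat steps 3-5 once per depth level; then updatesegs, index
  # (e.g. bin/nutch index $s), dedup, and merge run over the finished segments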
15. Demo: Configuration
Configuration files (XML)
Required user parameters
http.agent.name
http.agent.description
http.agent.url
http.agent.email
Adjustable parameters for every component
E.g. for fetcher:
Threads-per-host
Threads-per-ip
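A minimal conf/nutch-site.xml setting the four required agent properties might look like the sketch below. All values are placeholders; note that 0.7-era releases use <nutch-conf> rather than <configuration> as the root element:
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>ist441-crawler</value>        <!-- placeholder crawler name -->
    </property>
    <property>
      <name>http.agent.description</name>
      <value>IST 441 class crawler</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://ist.psu.edu/</value>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>student@example.edu</value>   <!-- placeholder contact email -->
    </property>
  </configuration>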
16. Configuration: URL filters (text file: conf/crawl-urlfilter.txt)
Regular expressions that filter URLs during crawling
E.g.
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
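Rules are evaluated top to bottom and the first matching pattern decides, so a filter file restricted to one domain usually ends with a catch-all reject. A sketch modeled on the stock crawl-urlfilter.txt, here aimed at the demo site:
  # skip file:, ftp:, and mailto: URLs
  -^(file|ftp|mailto):
  # skip image and archive suffixes
  -\.(gif|jpg|png|ico|css|zip|gz|exe)$
  # skip URLs containing characters that usually indicate query pages
  -[?*!@=]
  # accept hosts in the target domain
  +^http://([a-z0-9]*\.)*ist.psu.edu/
  # reject everything else
  -.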
17. Installation & Usage: Installation
Software needed
Nutch release
Java
Apache Tomcat (for GUI)
Cygwin (for Windows)
18. Installation & Usage: Usage
Crawling
Initial URLs (text file or DMOZ file)
Required parameters (conf/nutch-site.xml)
URL filters (conf/crawl-urlfilter.txt)
Indexing
Automatic
Searching
Location of files (WAR file, index)
The Tomcat server
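Deploying the search GUI usually amounts to dropping the Nutch WAR into Tomcat and starting Tomcat from the crawl directory. A sketch with placeholder paths (in some releases the index location is instead set via the searcher.dir property in the webapp's nutch-site.xml):
  rm -rf $TOMCAT_HOME/webapps/ROOT*       # replace Tomcat's default webapp
  cp nutch*.war $TOMCAT_HOME/webapps/ROOT.war
  cd <crawl dir>                          # directory created by bin/nutch crawl
  $TOMCAT_HOME/bin/catalina.sh start      # webapp looks for the index here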
19. Demo: Site we will crawl: http://ist.psu.edu
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
Analyze the database:
bin/nutch readdb <db dir> -stats
bin/nutch readdb <db dir> -dumppageurl
bin/nutch readdb <db dir> -dumplinks
bin/nutch readdb <db dir> -linkurl <linkurl>
s=`ls -d <segment dir>/* | head -1`
bin/nutch segread -dump $s
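Putting it together for the demo (directory names are placeholders; bin/nutch crawl creates the db and segments directories under the -dir argument):
  echo 'http://ist.psu.edu/' > urls.txt
  bin/nutch crawl urls.txt -dir crawl.ist -depth 3 >& crawl.log
  bin/nutch readdb crawl.ist/db -stats
  s=`ls -d crawl.ist/segments/* | head -1`
  bin/nutch segread -dump $s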
20. References
http://lucene.apache.org/nutch/ -- Official website
http://wiki.apache.org/nutch/ -- Nutch wiki
http://lucene.apache.org/nutch/release/ -- Nutch source code
www.nutchinstall.blogspot.com -- Installation guide
http://www.robotstxt.org/wc/robots.html -- The web robot pages