1. IST 441: Crawling and Indexing with Nutch. Presented by: Sujatha Das
Slides courtesy: Saurabh Kataria
Instructor: C. Lee Giles
2. Outline: Brief Overview
Nutch as a complete web search engine
Installation/Usage (with Demo)
3. Search Engine: Basic Workflow
4. Today's class: build a complete web search engine with Nutch
What is Nutch?
Open source search engine
Written in Java
Built on top of Apache Lucene
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
Attractive Features:
Customizable
Extensible (e.g., can be extended to use Solr for enhanced portability)
+Plugins
+MapReduce & Distributed FS (Hadoop)
5. Why Nutch? Scalable
Index local host or entire Internet
Portable
Runs anywhere with Java
Flexible
Plugin system + API
Java-based, open source, many customizable scripts (http://lucene.apache.org/nutch/)
Code pretty easy to read & work with
Better than implementing it yourself!
6. Data Structures used by Nutch: Web Database or WebDB
Mirrors the properties/structure of the web graph being crawled
Segment
Intermediate index
Contains pages fetched in a single run
Index
Final inverted index obtained by merging segments (Lucene)
7. WebDB: customized graph database
Used by Crawler only
Persistent storage for pages & links
Page DB: Indexed by URL and hash; contains content, outlinks, fetch information & score
Link DB: contains source to target links, anchor text
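Conceptually (a sketch of the logical records, not the on-disk format), the two parts of the WebDB hold entries like:
  Page:  URL, MD5 hash of the content, fetch schedule, number of outlinks, score
  Link:  source URL -> target URL, plus the anchor text of the link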
8. Segment: collection of pages fetched in a single run
Contains:
Output of the fetcher
List of links to be fetched in the next run, called the fetchlist
Limited life span (default 30 days)
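On disk, a segment is a directory of per-run data. A sketch of the 0.7-era layout (subdirectory names vary across Nutch versions):
  segments/20090401123456/   (segments are named by creation timestamp)
    fetchlist/    URLs scheduled for this run
    fetcher/      per-URL fetch status
    content/      raw downloaded content
    parse_text/   extracted plain text
    parse_data/   outlinks and metadata
    index/        per-segment Lucene index, before merging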
9. Index: to be discussed later
10. A generic Website Structure
11. Crawling: a cyclic process
the crawler generates a set of fetchlists from the WebDB
fetchers download the content from the Web
the crawler updates the WebDB with new links that were found
and then the crawler generates a new set of fetchlists
Repeat until the specified depth is reached
12. Nutch as a crawler
13. Nutch as a complete web search engine
14. Crawling: 10-stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
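For reference, the single crawl command roughly decomposes into these individual tools. A sketch using 0.7-era command names taken from the steps above (exact flags and arguments differ between Nutch versions; inject's arguments are elided):
  bin/nutch admin db -create          # step 1: create a new WebDB
  bin/nutch inject db ...             # step 2: inject root URLs (flags elided)
  bin/nutch generate db segments      # step 3: new segment with a fetchlist
  s=`ls -d segments/2* | tail -1`     # pick the newest (timestamp-named) segment
  bin/nutch fetch $s                  # step 4: fetch pages on the fetchlist
  bin/nutch updatedb db $s            # step 5: add newly discovered links
  # repeat steps 3-5 once per depth level; then updatesegs, index
  # (e.g. bin/nutch index $s), dedup, and merge run over the finished segments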
15. Demo: Configuration
Configuration files (XML)
Required user parameters
http.agent.name
http.agent.description
http.agent.url
http.agent.email
Adjustable parameters for every component
E.g. for fetcher:
Threads-per-host
Threads-per-ip
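A minimal conf/nutch-site.xml setting the four required agent properties might look like the sketch below. All values are placeholders; note that 0.7-era releases use <nutch-conf> rather than <configuration> as the root element:
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>ist441-crawler</value>        <!-- placeholder crawler name -->
    </property>
    <property>
      <name>http.agent.description</name>
      <value>IST 441 class crawler</value>
    </property>
    <property>
      <name>http.agent.url</name>
      <value>http://ist.psu.edu/</value>
    </property>
    <property>
      <name>http.agent.email</name>
      <value>student@example.edu</value>   <!-- placeholder contact email -->
    </property>
  </configuration>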
16. Configuration: URL filters (text file: conf/crawl-urlfilter.txt)
Regular expressions that filter URLs during crawling
E.g.
To ignore files with certain suffixes:
-\.(gif|exe|zip|ico)$
To accept hosts in a certain domain:
+^http://([a-z0-9]*\.)*apache.org/
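Rules are evaluated top to bottom and the first matching pattern decides, so a filter file restricted to one domain usually ends with a catch-all reject. A sketch modeled on the stock crawl-urlfilter.txt, here aimed at the demo site:
  # skip file:, ftp:, and mailto: URLs
  -^(file|ftp|mailto):
  # skip image and archive suffixes
  -\.(gif|jpg|png|ico|css|zip|gz|exe)$
  # skip URLs containing characters that usually indicate query pages
  -[?*!@=]
  # accept hosts in the target domain
  +^http://([a-z0-9]*\.)*ist.psu.edu/
  # reject everything else
  -.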
17. Installation & Usage: Installation
Software needed
Nutch release
Java
Apache Tomcat (for GUI)
Cygwin (for Windows)
18. Installation & Usage: Usage
Crawling
Initial URLs (text file or DMOZ file)
Required parameters (conf/nutch-site.xml)
URL filters (conf/crawl-urlfilter.txt)
Indexing
Automatic
Searching
Location of files (WAR file, index)
The Tomcat server
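Deploying the search GUI usually amounts to dropping the Nutch WAR into Tomcat and starting Tomcat from the crawl directory. A sketch with placeholder paths (in some releases the index location is instead set via the searcher.dir property in the webapp's nutch-site.xml):
  rm -rf $TOMCAT_HOME/webapps/ROOT*       # replace Tomcat's default webapp
  cp nutch*.war $TOMCAT_HOME/webapps/ROOT.war
  cd <crawl dir>                          # directory created by bin/nutch crawl
  $TOMCAT_HOME/bin/catalina.sh start      # webapp looks for the index here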
19. Demo: Site we will crawl: http://ist.psu.edu
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
Analyze the database:
bin/nutch readdb <db dir> -stats
bin/nutch readdb <db dir> -dumppageurl
bin/nutch readdb <db dir> -dumplinks
bin/nutch readdb <db dir> -linkurl <linkurl>
s=`ls -d <segment dir>/* | head -1`
bin/nutch segread -dump $s
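Putting it together for the demo (directory names are placeholders; bin/nutch crawl creates the db and segments directories under the -dir argument):
  echo 'http://ist.psu.edu/' > urls.txt
  bin/nutch crawl urls.txt -dir crawl.ist -depth 3 >& crawl.log
  bin/nutch readdb crawl.ist/db -stats
  s=`ls -d crawl.ist/segments/* | head -1`
  bin/nutch segread -dump $s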
20. References
http://lucene.apache.org/nutch/ -- Official website
http://wiki.apache.org/nutch/ -- Nutch wiki
http://lucene.apache.org/nutch/release/ -- Nutch source code
www.nutchinstall.blogspot.com -- Installation guide
http://www.robotstxt.org/wc/robots.html -- The web robot pages