
IST 441: Crawling and Indexing with Nutch

Presentation Transcript


    1. IST 441: Crawling and Indexing with Nutch
       Presented by: Sujatha Das
       Slides courtesy: Saurabh Kataria
       Instructor: C. Lee Giles

    2. Outline:
       - Brief overview
       - Nutch as a complete web search engine
       - Installation/usage (with demo)

    3. Search Engine: Basic Workflow

    4. Today’s class: build a complete web search engine with Nutch
       What is Nutch?
       - An open-source search engine
       - Written in Java
       - Built on top of Apache Lucene
       - Nutch = crawler + indexer/searcher (Lucene) + GUI
       Attractive features:
       - Customizable
       - Extensible (e.g. extend to Solr for enhanced portability)
       - Plugins
       - MapReduce & distributed FS (Hadoop)

    5. Why Nutch?
       - Scalable: index a local host or the entire Internet
       - Portable: runs anywhere with Java
       - Flexible: plugin system + API
       - Java-based, open source, with many customizable scripts (http://lucene.apache.org/nutch/)
       - The code is fairly easy to read and work with
       - Better than implementing it yourself!

    6. Data structures used by Nutch
       - Web database (WebDB): mirrors the properties/structure of the web graph being crawled
       - Segment: an intermediate index; contains the pages fetched in a single run
       - Index: the final inverted index, obtained by “merging” segments (Lucene)

    7. WebDB
       - A customized graph database, used by the crawler only
       - Persistent storage for “pages” & “links”
       - Page DB: indexed by URL and hash; contains content, outlinks, fetch information & score
       - Link DB: contains “source to target” links and anchor text

    8. Segment
       - A collection of pages fetched in a single run
       - Contains the output of the fetcher and the list of links to be fetched in the next run, called the “fetchlist”
       - Limited life span (default: 30 days)

    9. Index To be discussed later

    10. A generic Website Structure

    11. Crawling: a cyclic process
       - The crawler generates a set of fetchlists from the WebDB
       - Fetchers download the content from the Web
       - The crawler updates the WebDB with the new links that were found
       - The crawler then generates a new set of fetchlists
       - Repeat until you reach the “depth” (see the sketch below)
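
    A minimal shell sketch of this cycle, using the 0.x-era tools listed on slide 14 (the directory names, the urls.txt seed file, and the depth of 3 are illustrative assumptions; exact flags vary between Nutch versions):

        # one-time setup: create the WebDB and inject the seed URLs
        bin/nutch admin db -create
        bin/nutch inject db -urlfile urls.txt
        # the crawl cycle, repeated once per level of depth
        for i in 1 2 3; do
          bin/nutch generate db segments        # write a fetchlist into a new segment
          s=`ls -d segments/* | tail -1`        # pick the newest segment
          bin/nutch fetch $s                    # download the pages on the fetchlist
          bin/nutch updatedb db $s              # feed newly discovered links back into the WebDB
        done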

    12. Nutch as a crawler

    13. Nutch as a complete web search engine

    14. Crawling: a 10-stage process
       bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
       1. admin db -create: Create a new WebDB.
       2. inject: Inject root URLs into the WebDB.
       3. generate: Generate a fetchlist from the WebDB in a new segment.
       4. fetch: Fetch content from URLs in the fetchlist.
       5. updatedb: Update the WebDB with links from fetched pages.
       6. Repeat steps 3-5 until the required depth is reached.
       7. updatesegs: Update segments with scores and links from the WebDB.
       8. index: Index the fetched pages.
       9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
       10. merge: Merge the indexes into a single index for searching.
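
    For example, a concrete one-shot invocation (the seed file and output directory names are illustrative):

        bin/nutch crawl urls.txt -dir crawl.demo -depth 3 >& crawl.log

    This single command drives stages 1-10 above; the individual tools can also be run by hand, as sketched earlier, when finer control is needed.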

    15. Demo: configuration
       Configuration files (XML)
       Required user parameters:
       - http.agent.name
       - http.agent.description
       - http.agent.url
       - http.agent.email
       Adjustable parameters for every component, e.g. for the fetcher:
       - threads-per-host
       - threads-per-ip
       (A minimal example follows this list.)
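
    A minimal conf/nutch-site.xml sketch setting the four required agent parameters (all values are placeholders; settings here override the defaults in conf/nutch-default.xml, and the adjustable per-component parameters go in the same file):

        <?xml version="1.0"?>
        <configuration>
          <property>
            <name>http.agent.name</name>
            <value>ist441-crawler</value>              <!-- placeholder agent name -->
          </property>
          <property>
            <name>http.agent.description</name>
            <value>IST 441 class crawler</value>
          </property>
          <property>
            <name>http.agent.url</name>
            <value>http://example.edu/crawler</value>  <!-- placeholder URL -->
          </property>
          <property>
            <name>http.agent.email</name>
            <value>crawler@example.edu</value>         <!-- placeholder address -->
          </property>
        </configuration>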

    16. Configuration: URL filters (text file: conf/crawl-urlfilter.txt)
       Regular expressions filter URLs during crawling, e.g.:
       - To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
       - To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
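
    Putting those two rules together, a minimal conf/crawl-urlfilter.txt might look like this (the apache.org domain is illustrative; the first rule whose pattern matches a URL decides, so the catch-all goes last):

        # ignore files with suffixes we do not want to fetch
        -\.(gif|exe|zip|ico)$
        # accept hosts in the apache.org domain
        +^http://([a-z0-9]*\.)*apache.org/
        # reject everything else
        -.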

    17. Installation & usage: installation
       Software needed:
       - A Nutch release
       - Java
       - Apache Tomcat (for the GUI)
       - Cygwin (for Windows)
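
    A sketch of the setup on a Unix-like system (the release version and JDK path are illustrative assumptions):

        tar xzf nutch-0.9.tar.gz              # unpack the Nutch release
        cd nutch-0.9
        export JAVA_HOME=/usr/lib/jvm/java    # point Nutch at your Java installation
        bin/nutch                             # with no arguments, lists the available commands

    On Windows, run the same commands inside a Cygwin shell.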

    18. Installation & usage: usage
       Crawling:
       - Initial URLs (text file or DMOZ file)
       - Required parameters (conf/nutch-site.xml)
       - URL filters (conf/crawl-urlfilter.txt)
       Indexing:
       - Automatic (part of the crawl)
       Searching:
       - Location of files (WAR file, index)
       - The Tomcat server
       (An end-to-end sketch follows this list.)
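
    An end-to-end sketch under the same illustrative assumptions as above (searcher.dir is the property the search webapp reads to locate the crawl data):

        echo 'http://ist.psu.edu/' > urls.txt                # initial URLs
        bin/nutch crawl urls.txt -dir crawl.demo -depth 3    # crawling + automatic indexing
        cp nutch-0.9.war $CATALINA_HOME/webapps/ROOT.war     # deploy the search GUI to Tomcat
        # point searcher.dir at crawl.demo (or start Tomcat from that
        # directory) so the webapp can find the index, then start Tomcat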

    19. Demo
       Site we will crawl: http://ist.psu.edu
       bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
       Analyze the database:
       bin/nutch readdb <db dir> -stats
       bin/nutch readdb <db dir> -dumppageurl
       bin/nutch readdb <db dir> -dumplinks
       bin/nutch readdb <db dir> -linkurl <linkurl>
       s=`ls -d <segment dir>/* | head -1`
       bin/nutch segread -dump $s
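
    For instance, assuming the 0.x layout where the crawl directory keeps the WebDB under db/, the statistics command for the demo crawl above becomes (names illustrative):

        bin/nutch readdb crawl.demo/db -stats    # reports counts such as pages and links in the WebDB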

    20. References
       - http://lucene.apache.org/nutch/ -- official website
       - http://wiki.apache.org/nutch/ -- Nutch wiki
       - http://lucene.apache.org/nutch/release/ -- Nutch source code
       - www.nutchinstall.blogspot.com -- installation guide
       - http://www.robotstxt.org/wc/robots.html -- the Web Robots Pages
