An Introduction To Heritrix

Gordon Mohr

Chief Technologist, Web Projects

Internet Archive


Web Collection

  • Since 1996

  • Over 4×10^10 resources (URI+time)

  • Over 400 TB (compressed)


Web Collection: via Alexa

  • Alexa Internet

    • Private company

    • Crawling for IA since 1996

  • 2-month rolling snapshots

    • Recent: 3 billion URIs, 35 million websites, 20 TB

  • Crawling software

    • Sophisticated

    • Weighted towards popular sites

    • Proprietary: we only receive the data


Heritrix: Motivations #1

  • Deeper, specialized, in-house crawling

    • Sites of topical interest

    • Contractual crawls for libraries and governments

      • US Library of Congress

        • Elections, current events, government websites

      • UK Public Record Office, US National Archives

        • Government websites

    • Using our own software & machines


Heritrix: Motivations #2

  • Open source

    • Encourage collaboration on features and best practices

    • Avoid duplication of work, incompatibilities

  • Archival-quality

    • Perfect copies

    • Keep up with changing web

    • Meet evolving needs of Internet Archive and International Internet Preservation Consortium


Heritrix

New

Open-source

Extensible

Web-scale

Archival-quality

Web crawling software


Heritrix: Use Cases

  • Broad Crawling

    • Large, as-much-as-possible

  • Focused Crawling

    • Collect specific sites/topics deeply

  • Continuous Crawling

    • Revisit changed sites

  • Experimental Crawling

    • Novel approaches


Heritrix: Project

  • Heritrix means heiress

  • Java, modular

  • Project website: http://crawler.archive.org

    • News, downloads, documentation

    • Sourceforge: open source hosting site

      • Source-code control (CVS)

      • Issue databases

  • GNU Lesser General Public License (LGPL)

  • Outside contributions


http://crawler.archive.org


Heritrix: Milestones

  • Summer 2003: Prototypes created and tested against existing crawlers; requirements collected from IA and IIPC

  • October 2003-April 2004: Nordic Web Archive programmers join project, add capabilities

  • January 2004: First public beta (0.2.0)

    • Used for all in-house crawling since

  • February & June 2004: Workshops for Heritrix users at national libraries

  • August 2004: Version 1.0.0 released


Heritrix: Architecture

  • Basic loop:

    1. Choose a URI from among all those scheduled

    2. Fetch that URI

    3. Analyze or archive the results

    4. Select discovered URIs of interest, and add to those scheduled

    5. Note that the URI is done and repeat

  • Parallelized across threads (and eventually, machines)
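
The basic loop can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not Heritrix's actual Java code: the `crawl` function, the `fetch`, `extract_links`, and `in_scope` callables, and the in-memory `pages` dictionary are all hypothetical stand-ins chosen so the example runs without network access.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, in_scope, max_uris=1000):
    """Toy version of the five-step loop; every name here is illustrative."""
    scheduled = deque(seeds)      # URIs waiting to be fetched
    seen = set(seeds)             # URIs already scheduled or done
    archive = {}                  # URI -> fetched content

    while scheduled and len(archive) < max_uris:
        uri = scheduled.popleft()             # 1. choose a scheduled URI
        content = fetch(uri)                  # 2. fetch it
        archive[uri] = content                # 3. archive the result
        for link in extract_links(content):   # 4. select discovered URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                scheduled.append(link)
        # 5. the URI is done (it stays in `seen`); repeat
    return archive

# A fake three-page "web": each URI maps to the list of links found on it.
pages = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/", "http://example.org/b"],
    "http://example.org/b": [],
}
result = crawl(
    seeds=["http://example.org/"],
    fetch=lambda uri: pages.get(uri, []),
    extract_links=lambda content: content,
    in_scope=lambda uri: uri.startswith("http://example.org/"),
)
print(sorted(result))
```

A parallel version would run several such workers against a shared, thread-safe set of scheduled URIs; that part is left out here to keep the sketch short.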


Key components of Heritrix

  • Scope

    which URIs should be included

    (seeds + rules)

  • Frontier

    which URIs are done, or waiting to be done

    (queues and lists/maps)

  • Processor chains

    configurable sequential tasks to do to each URI

    (code modules + configuration)
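
To make the first two components concrete, here is a small Python sketch under simplifying assumptions. The `Scope` and `Frontier` classes, their method names, and the "same host as a seed" rule are invented for illustration and are not Heritrix's real classes or configuration.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Scope:
    """Seeds plus one rule: accept URIs on the same hosts as the seeds."""
    def __init__(self, seeds):
        self.seeds = list(seeds)
        self.allowed_hosts = {urlsplit(s).hostname for s in self.seeds}

    def accepts(self, uri):
        return urlsplit(uri).hostname in self.allowed_hosts

class Frontier:
    """Tracks which URIs are done and which are waiting, one queue per host."""
    def __init__(self):
        self.queues = defaultdict(deque)   # host -> waiting URIs
        self.known = set()                 # every URI ever scheduled
        self.done = set()                  # URIs that have been finished

    def schedule(self, uri):
        if uri not in self.known:
            self.known.add(uri)
            self.queues[urlsplit(uri).hostname].append(uri)

    def next_uri(self):
        for queue in self.queues.values():
            if queue:
                return queue.popleft()
        return None

    def finished(self, uri):
        self.done.add(uri)

scope = Scope(["http://example.org/"])
frontier = Frontier()
for uri in scope.seeds + ["http://example.org/about", "http://elsewhere.example/"]:
    if scope.accepts(uri):
        frontier.schedule(uri)

first = frontier.next_uri()
frontier.finished(first)
second = frontier.next_uri()
third = frontier.next_uri()   # nothing left, so this is None
print(first, second, third)
```

Keeping one queue per host is a design choice that makes per-host politeness easy to enforce later; a real frontier would also need to persist its state to disk for large crawls.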


Heritrix: Architecture


Heritrix: Processor Chains

  • Prefetch

    • Ensure conditions are met

  • Fetch

    • Network activity (HTTP, DNS, FTP, etc.)

  • Extract

    • Analyze – especially for new URIs

  • Write

    • Save archival copy to disk

  • Postprocess

    • Feed URIs back to Frontier, update crawler state
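
The chain idea, a fixed sequence of small steps applied to each URI, can be illustrated with plain Python functions. This is only a sketch: the function names mirror the five stages above, but the shared `state` dictionary, the `FAKE_WEB`, `ARCHIVE`, and `FRONTIER` globals, and the skip flag are assumptions made for the example, not Heritrix's processor API.

```python
FAKE_WEB = {"http://example.org/": "see http://example.org/a and http://example.org/b"}
ARCHIVE = []    # stands in for archival files on disk
FRONTIER = []   # stands in for the real frontier

def prefetch(state):
    # Ensure conditions are met: here, only http(s) URIs may proceed.
    if not state["uri"].startswith(("http://", "https://")):
        state["skip"] = True

def fetch(state):
    # Stand-in for network activity: look the URI up in the fake web.
    state["content"] = FAKE_WEB.get(state["uri"], "")

def extract(state):
    # Analyze the content for new URIs (whitespace-separated tokens here).
    state["links"] = [t for t in state["content"].split() if t.startswith("http")]

def write(state):
    # Save an "archival copy" of what was fetched.
    ARCHIVE.append((state["uri"], state["content"]))

def postprocess(state):
    # Feed discovered URIs back to the frontier.
    FRONTIER.extend(state["links"])

CHAIN = [prefetch, fetch, extract, write, postprocess]

def process(uri):
    """Run one URI through every processor in order, stopping if one says skip."""
    state = {"uri": uri, "skip": False, "content": "", "links": []}
    for processor in CHAIN:
        if state["skip"]:
            break
        processor(state)
    return state

fetched = process("http://example.org/")
skipped = process("mailto:someone@example.org")
print(FRONTIER, len(ARCHIVE), skipped["skip"])
```

Because the chain is just a list, swapping in a different extractor or writer is a configuration change rather than a code change, which is the point of the modular design.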


Heritrix: Features & Limitations

  • Other key features:

    • Web UI console to control & monitor crawl

    • Very configurable inclusion, exclusion, politeness policies

  • Limitations:

    • Requires sophisticated operator

    • Large crawls hit single-machine limits

    • No capacity for automatic revisit of changed material

  • Generally:

    • Good for focused & experimental crawling use cases; not yet for broad and continuous
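
As an illustration of what a configurable politeness policy can look like, the Python sketch below waits a multiple of the last fetch duration before contacting the same host again, clamped between a minimum and a maximum. The policy shape, the function names, and all default values are assumptions made for this example; the slides do not specify how Heritrix computes its delays.

```python
def politeness_delay(fetch_seconds, factor=5.0, min_delay=2.0, max_delay=30.0):
    """Seconds to wait before contacting the same host again.

    Assumed policy: a multiple of the last fetch duration,
    clamped to [min_delay, max_delay]. The defaults are made up.
    """
    return max(min_delay, min(max_delay, factor * fetch_seconds))

next_allowed = {}  # host -> earliest time (in seconds) of the next fetch

def record_fetch(host, started, finished):
    """Remember when this host may be contacted again."""
    next_allowed[host] = finished + politeness_delay(finished - started)

def may_fetch(host, now):
    """True if the host has never been fetched or its delay has elapsed."""
    return now >= next_allowed.get(host, 0.0)

# A fetch that took 1 second leads to a 5 second wait under these defaults.
record_fetch("example.org", started=100.0, finished=101.0)
print(may_fetch("example.org", now=103.0), may_fetch("example.org", now=106.0))
```

Tying the delay to the observed fetch time means slow servers are automatically contacted less often, while the clamp keeps the wait within operator-chosen bounds.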


Heritrix console


Heritrix settings


Heritrix logs


Heritrix reports


Heritrix: Current Uses

  • Weekly, Monthly, 6-monthly, and special one-time crawls

  • Hundreds to thousands of specific target sites

  • Over 20 million collected URIs per crawl

  • Crawls run for 1-2 weeks


Heritrix: Performance

  • Not yet stressed or optimized

    • Current crawls limited by material to crawl and chosen politeness, not our performance

  • Typical observed rates (actual focused crawls)

    • 20-40 URIs/sec (peaking over 60)

    • 2-3 Mbps (peaking over 20 Mbps)

  • Limits imposed by memory usage

    • Over 10,000 hosts and over 10 million URIs (on a 512 MB machine; more on larger machines)
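
As a rough sanity check, the observed rates can be turned into crawl totals with simple arithmetic. The Python sketch below assumes the typical rates are sustained for a whole crawl of one to two weeks and that "Mbps" means 10^6 bits per second; both are simplifying assumptions, and the helper names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def uris_collected(rate_per_sec, days):
    """Total URIs if a constant rate were held for the given number of days."""
    return rate_per_sec * days * SECONDS_PER_DAY

def avg_bytes_per_uri(mbps, rate_per_sec):
    """Average bytes transferred per URI at a given bandwidth and URI rate."""
    return mbps * 1_000_000 / 8 / rate_per_sec

low = uris_collected(20, 7)     # 20 URIs/sec for one week
high = uris_collected(40, 14)   # 40 URIs/sec for two weeks
per_uri = avg_bytes_per_uri(2.5, 30)  # midpoints of the typical ranges

print(f"{low:,} to {high:,} URIs")
print(f"{per_uri:,.0f} bytes per URI")
```

Under these assumptions the totals land between about 12 million and 48 million URIs, the same order of magnitude as the "over 20 million collected URIs per crawl" reported for current uses, with an average of roughly 10 KB transferred per URI.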


Heritrix: Future Plans

  • Larger scale crawl capacity

    • Giant focused crawls

    • Broad whole-web crawls

  • New protocols & formats

  • Automate expert operator tasks

  • Continuous and dynamic crawling

    • Revisit sites as they change

    • Dynamically rank sites and URIs


Latest Developments

  • 1.2 Release (next week)

    • Configurable canonicalization

      • Handles common session-IDs, URI variations

    • Politeness by IP address

    • Experimental more memory-efficient Frontier

    • Bug fixes

  • 1.4 Release (January 2005)

    • Memory robustness

    • Experimental multi-machine distribution support
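
To show what canonicalization means in practice, the Python sketch below maps several spellings of the same page to one URI so the crawler does not fetch it repeatedly. The specific rules (lowercasing the host, dropping a leading "www.", removing `jsessionid` path parameters and a short list of session-ID query parameters) and the `canonicalize` function itself are assumptions chosen for illustration; they are not the rule set shipped with Heritrix.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of query parameters that carry session IDs.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}

def canonicalize(uri):
    """Return a normalized form of `uri` for duplicate detection."""
    parts = urlsplit(uri)
    # Lowercase the host and drop a leading "www." (an assumed rule).
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port:
        host = f"{host}:{parts.port}"
    # Remove ";jsessionid=..." path parameters.
    path = re.sub(r";jsessionid=[^/?#]*", "", parts.path, flags=re.IGNORECASE) or "/"
    # Drop session-ID query parameters, keeping the others in order.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))

variant = "HTTP://WWW.Example.org/page;jsessionid=ABC123?id=7&PHPSESSID=xyz#top"
plain = "http://example.org/page?id=7"
print(canonicalize(variant))
print(canonicalize(variant) == canonicalize(plain))
```

Making the rules configurable matters because some of them are unsafe on some sites: for example, a host with and without "www." occasionally serves different content.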


The End

  • Questions?

