
Crawl RSS


Presentation Transcript


  1. Crawl RSS Kristinn Sigurðsson National and University Library of Iceland IIPC GA 2014 – Paris

  2. The problem
  • Certain sites change very frequently
    • News sites especially
  • While we can capture all the stories by visiting once per day, week, month or even year, they may have been modified several times in between, and the front page changes will be missed

  3. RSS feed advantages
  • Changes to the feed are highly likely to signify that an actual change has occurred
  • A single RSS feed informs on changes both to the presumed “front page” and to article or item pages
  • RSS feeds are generally smaller (in bytes) than the front page (just the HTML) of a site
  • Crawling the RSS feed frequently is more likely to be tolerated

  4. How it works 1/4
  • On first load all feed elements are loaded
  • A feed element is uniquely identified by its
    • URL
    • Timestamp
  • Each element plus the front page is visited
    • Embeds are downloaded
    • No further links are followed
  • Strict controls need to be in place to halt scope leakage
    • Each feed element should lead to a very finite number of URLs to crawl
    • Basically, just get minimal embeds, do not follow links
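The "URL + timestamp" identification can be sketched in a few lines of Java. This is an illustrative reading of the slide, not the add-on's actual code: the names RssFeedReader, FeedItem and fetchItems are hypothetical, and a real implementation would need far more robust feed parsing.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.ZonedDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssFeedReader {

        // A feed element is uniquely identified by its URL and timestamp.
        public record FeedItem(String url, ZonedDateTime timestamp) {}

        // Fetch an RSS feed and return its items (link + pubDate of each <item>).
        public static List<FeedItem> fetchItems(String feedUrl) throws Exception {
            HttpResponse<byte[]> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(feedUrl)).GET().build(),
                    HttpResponse.BodyHandlers.ofByteArray());
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new java.io.ByteArrayInputStream(resp.body()));
            List<FeedItem> items = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("item");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                String link = item.getElementsByTagName("link").item(0).getTextContent().trim();
                String pubDate = item.getElementsByTagName("pubDate").item(0).getTextContent().trim();
                items.add(new FeedItem(link,
                        ZonedDateTime.parse(pubDate, DateTimeFormatter.RFC_1123_DATE_TIME)));
            }
            return items;
        }
    }

Keying on URL plus timestamp, rather than URL alone, is what lets later visits notice that an already-known item has been updated.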

  5. How it works 2/4
  • Once all the URLs generated by the initial feed elements have been crawled, the RSS feed may be revisited
    • IF the minimum wait between visits has elapsed
    • ELSE wait until the minimum time has elapsed
  • The second visit will (probably) show many already seen elements
    • Identified by URL + timestamp
    • If the feed is entirely unchanged, then the content hash will likely be unchanged
  • If a URL has a new timestamp, it is probable that the content of the item has changed
  • Only load items that have a timestamp more recent than the ‘most recently seen’ timestamp for each feed
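A hedged sketch of that revisit bookkeeping, reusing the hypothetical FeedItem record from the previous sketch; FeedState and its fields are illustrative names, not the add-on's classes.

    import java.time.Duration;
    import java.time.Instant;
    import java.time.ZonedDateTime;
    import java.util.ArrayList;
    import java.util.List;

    public class FeedState {
        private final Duration minWait;              // minimum wait between visits
        private Instant lastVisit = Instant.EPOCH;   // when the feed was last fetched
        private ZonedDateTime mostRecentSeen = null; // newest item timestamp seen so far

        public FeedState(Duration minWait) {
            this.minWait = minWait;
        }

        // The feed may only be revisited once the minimum wait has elapsed.
        public boolean mayRevisit(Instant now) {
            return Duration.between(lastVisit, now).compareTo(minWait) >= 0;
        }

        // Keep only items newer than the 'most recently seen' timestamp, then advance it.
        public List<RssFeedReader.FeedItem> newItems(List<RssFeedReader.FeedItem> fetched, Instant now) {
            lastVisit = now;
            List<RssFeedReader.FeedItem> fresh = new ArrayList<>();
            for (RssFeedReader.FeedItem item : fetched) {
                if (mostRecentSeen == null || item.timestamp().isAfter(mostRecentSeen)) {
                    fresh.add(item);
                }
            }
            for (RssFeedReader.FeedItem item : fresh) {
                if (mostRecentSeen == null || item.timestamp().isAfter(mostRecentSeen)) {
                    mostRecentSeen = item.timestamp();
                }
            }
            return fresh;
        }
    }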

  6. How it works 3/4
  • If there are changed or new elements
    • Fetch the ‘front page’ URI and the URIs of changed and new elements
    • If they match existing content hashes, they may be discarded; otherwise they are written to (W)ARCs
  • Do not revisit embedded content that we have already crawled
    • This massively reduces the amount of time it takes to complete each RSS visit
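The content-hash check might look roughly like the following. The class name and the in-memory map of hashes are assumptions for illustration (Heritrix's own deduplication machinery is more involved), but the decision mirrors the slide: unchanged content is discarded, changed or new content is written.

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class ContentHashDedup {
        // URL -> hex digest of the content most recently written for that URL
        private final Map<String, String> seenHashes = new HashMap<>();

        // Returns true if the content should be written to a (W)ARC, false if it is a duplicate.
        public boolean shouldWrite(String url, byte[] content) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            String hash = hex.toString();
            if (hash.equals(seenHashes.get(url))) {
                return false;                 // unchanged since last visit: discard
            }
            seenHashes.put(url, hash);        // changed or new: remember and write
            return true;
        }
    }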

  7. How it works 4/4
  • Once visit 2 is over
    • Check whether the minimum wait has elapsed,
    • rinse,
    • repeat
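Putting the previous sketches together, the "rinse, repeat" cycle amounts to a loop like the one below. Again, this is purely illustrative: the real add-on drives the cycle from inside Heritrix rather than a standalone loop.

    import java.time.Instant;
    import java.util.List;

    public class RssVisitLoop {
        public static void run(String feedUrl, FeedState state, ContentHashDedup dedup) throws Exception {
            while (true) {
                Instant now = Instant.now();
                if (!state.mayRevisit(now)) {
                    Thread.sleep(1_000);      // minimum wait not yet elapsed
                    continue;
                }
                List<RssFeedReader.FeedItem> fresh = state.newItems(RssFeedReader.fetchItems(feedUrl), now);
                // For each new/changed item (plus the front page): fetch, check the content
                // hash via dedup.shouldWrite(...), write to (W)ARC if new, fetch embeds only.
                System.out.printf("Visit at %s: %d new or changed items%n", now, fresh.size());
            }
        }
    }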

  8. Sites
  • Many sites have multiple feeds
    • Sometimes items will appear in more than one feed at a time
  • It is therefore possible to have multiple related feeds for one site
    • Such feeds are always crawled jointly and duplicate items are discarded
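One way to read "crawled jointly with duplicates discarded" is a shared seen-set across a site's related feeds, as sketched below; RssSite is an illustrative name, not necessarily how the add-on models it, and the timestamp filtering from the earlier sketch is omitted here for brevity.

    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class RssSite {
        private final List<String> feedUrls;                    // all related feeds of one site
        private final Set<String> seenItemUrls = new HashSet<>();

        public RssSite(List<String> feedUrls) {
            this.feedUrls = feedUrls;
        }

        // Visit all of the site's feeds together; an item appearing in several feeds is queued once.
        public Set<String> discoverNewItemUrls() throws Exception {
            Set<String> toCrawl = new LinkedHashSet<>();
            for (String feed : feedUrls) {
                for (RssFeedReader.FeedItem item : RssFeedReader.fetchItems(feed)) {
                    if (seenItemUrls.add(item.url())) {  // false if already seen via another feed
                        toCrawl.add(item.url());
                    }
                }
            }
            return toCrawl;
        }
    }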

  9. Example
  RSS Site: ruv.is
  State: HOLD_FOR_FEED_EMIT
  Number of discovered items: 0
  Minimum wait between emitting feeds (ms): 600000
  Earliest next feed emission: Mon May 12 14:49:48 GMT 2014
  URLs being crawled: 0
  Feeds last emitted: Mon May 12 14:39:48 GMT 2014
  Feeds:
    Feed: http://www.ruv.is/rss/frettir
      Most recent seen: Mon May 12 14:24:34 GMT 2014
      http://www.ruv.is/
    Feed: http://www.ruv.is/rss/erlent
      Most recent seen: Mon May 12 14:11:50 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/erlent
    Feed: http://www.ruv.is/rss/sport
      Most recent seen: Sun May 11 22:55:17 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/ithrottir
    Feed: http://www.ruv.is/rss/innlent
      Most recent seen: Mon May 12 14:24:34 GMT 2014
      http://www.ruv.is/
      http://www.ruv.is/innlent

  10. Configuration
  • Either via Heritrix’s CXML
  • Or using the database interface
    • Maintaining the DB is outside the scope of the add-on
  • Easy to add new configuration handlers
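Since configuration sources are meant to be pluggable, a new handler presumably only needs to answer "which sites, which related feeds, and which pages does each feed stand for" (as in the ruv.is example above). The interface below is a hypothetical illustration of that idea, not the add-on's actual API.

    import java.time.Duration;
    import java.util.List;

    // Hypothetical configuration handler: whatever the backing store (CXML, database, ...),
    // it only has to deliver the sites and related feeds the crawl should cover.
    public interface RssConfigurationSource {

        record FeedConfig(String feedUrl, List<String> impliedPages) {}

        record SiteConfig(String name, Duration minWaitBetweenVisits, List<FeedConfig> feeds) {}

        List<SiteConfig> loadSites();
    }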

  11. Crawl RSS - Heritrix 3 add-on
  • Available on GitHub: https://github.com/Landsbokasafn/crawlrss
  • Requires Heritrix 3.1.2 or newer
  • Stable, but still technically in ‘beta’
  • In use at NULI for almost a year now
    • First news sites
    • Now also select blogs and government sites
