Crawl Operators’ Workshop

Crawl Operators’ Workshop. Roger G. Coram. Topics. ExternalGeoLocationDecideRule Sheets IpAddressSetDecideRule. ExternalGeoLocationDecideRule. Legal Deposit legislation passed in April 2013. The Legal Deposit Libraries (Non-Print Works) Regulations 2013:

Presentation Transcript


  1. Crawl Operators’ Workshop
     Roger G. Coram

  2. Topics
     • ExternalGeoLocationDecideRule
     • Sheets
     • IpAddressSetDecideRule

  3. ExternalGeoLocationDecideRule
     • Legal Deposit legislation passed in April 2013.
     • The Legal Deposit Libraries (Non-Print Works) Regulations 2013:
       • 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
       • “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

  4. Geolocation
     • ExternalGeoLocationDecideRule requires:
       • A list of ISO 3166-1 country codes to be included in the crawl
         • GB, FR, DE, etc.
       • An implementation of ExternalGeoLookupInterface.

  5. ExternalGeoLookupInterface
     • Our implementation is based on MaxMind’s GeoLite2 database.
     • Freely available under ‘Creative Commons Attribution-ShareAlike 3.0 Unported License’.
     • Only ~30MB; can be held in memory.
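The decision described on slides 4 and 5 can be pictured as: look up the country for a resolved IP address, then accept the URL if that country is in the configured list. The following is a minimal sketch under stated assumptions, not the Heritrix or MaxMind API: the `GeoLookup` interface and `CountryScope` class are hypothetical names invented for illustration, and a small in-memory map of documentation-range IP addresses stands in for the GeoLite2 database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a geo-lookup service; this is NOT Heritrix's
// ExternalGeoLookupInterface, only an illustration of the idea.
interface GeoLookup {
    // Returns the ISO 3166-1 alpha-2 country code for an IP, if known.
    Optional<String> countryCodeFor(String ipAddress);
}

// Hypothetical helper: accepts an IP when its country is in the configured list.
class CountryScope {
    private final GeoLookup lookup;
    private final Set<String> acceptedCountryCodes;

    CountryScope(GeoLookup lookup, Set<String> acceptedCountryCodes) {
        this.lookup = lookup;
        this.acceptedCountryCodes = acceptedCountryCodes;
    }

    // Unknown IPs are treated as "not accepted" in this sketch.
    boolean accepts(String ipAddress) {
        return lookup.countryCodeFor(ipAddress)
                .map(acceptedCountryCodes::contains)
                .orElse(false);
    }

    public static void main(String[] args) {
        // Made-up table standing in for the real database (assumption).
        Map<String, String> table = Map.of(
                "192.0.2.10", "GB",
                "198.51.100.7", "US");
        GeoLookup lookup = ip -> Optional.ofNullable(table.get(ip));
        CountryScope scope = new CountryScope(lookup, Set.of("GB"));

        System.out.println(scope.accepts("192.0.2.10"));   // true
        System.out.println(scope.accepts("198.51.100.7")); // false
        System.out.println(scope.accepts("203.0.113.1"));  // false (unknown IP)
    }
}
```

Keeping the lookup behind an interface mirrors the split on the slides between the rule (which holds the country list) and the lookup implementation (which holds the database).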

  6. Configuration example: crawler-beans.cxml

     <!-- GEO-LOOKUP: specifying location of external database. -->
     <bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
       <property name="database" value="/dev/shm/geoip-city.mmdb"/>
     </bean>

     <!-- ... ACCEPT those in the UK... -->
     <bean id="externalGeoLookupRule" class="org.archive.crawler.deciderules.ExternalGeoLocationDecideRule">
       <property name="lookup">
         <ref bean="externalGeoLookup"/>
       </property>
       <property name="countryCodes">
         <list>
           <value>GB</value>
         </list>
       </property>
     </bean>

  7. Results
     • Short test crawl (1,000,000 seeds) produced:
       • 89,500,755 URLs in total.
       • 26,072 non-UK URLs which would not otherwise have been in scope.
       • 137 distinct hosts.

  8. IP-based Sheets
     “Hi,
     “I'm a senior system administrator for Webfusion / 123-reg.
     “We're currently experiencing lots of requests from crawler1.bl.uk to sites hosted on 81.21.76.62, this is part of our Parking platform, which links into Yahoo to allow customers to park domains and earn money.”
     • Large number of hosts on a single machine.
     • Need a way to reduce the load on a specific IP address.

  9. Sheets
     • “Sheets provide the ability to replace default settings on a per domain basis.”
     • Allow you to change any value on any named bean for a specific set of URLs.
     • Actually quite flexible:
       • SurtPrefixesSheetAssociation:
         • Applied by matching SURT prefixes.
       • DecideRuledSheetAssociation:
         • Applied via a series of DecideRules.
         • IpAddressSetDecideRule

  10. Configuration example, part 1: crawler-beans.cxml

      <bean id="extraPolite" class="org.archive.spring.Sheet">
        <property name="map">
          <map>
            <entry key="disposition.delayFactor" value="8.0"/>
            <entry key="disposition.minDelayMs" value="10000"/>
            <entry key="disposition.maxDelayMs" value="60000"/>
            <entry key="disposition.respectCrawlDelayUpToSeconds" value="60"/>
          </map>
        </property>
      </bean>

      <bean id="crawlLimited" class="org.archive.spring.Sheet">
        <property name="map">
          <map>
            <entry key="quotaEnforcer.serverMaxFetchResponses" value="25"/>
          </map>
        </property>
      </bean>
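To get a feel for what the extraPolite values mean in practice, the sketch below works through the arithmetic under a simplifying assumption: the wait before recontacting a server is the last fetch duration multiplied by delayFactor, then clamped to the range between minDelayMs and maxDelayMs. The `PolitenessDelay` class is a hypothetical illustration, not Heritrix code, and it deliberately ignores robots.txt Crawl-delay handling (respectCrawlDelayUpToSeconds) and anything else the real crawler may take into account.

```java
// Hypothetical illustration of a clamped politeness delay; not Heritrix code.
class PolitenessDelay {

    // Assumed model: delay = fetchDurationMs * delayFactor,
    // clamped to the range [minDelayMs, maxDelayMs].
    static long delayMs(long fetchDurationMs, double delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(fetchDurationMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    public static void main(String[] args) {
        // Values taken from the extraPolite sheet on the slide.
        double factor = 8.0;
        long min = 10_000;
        long max = 60_000;

        // A 0.5 s fetch scales to 4 s, so the 10 s minimum applies.
        System.out.println(delayMs(500, factor, min, max));    // 10000
        // A 2 s fetch scales to 16 s, which is inside the range.
        System.out.println(delayMs(2_000, factor, min, max));  // 16000
        // A 10 s fetch scales to 80 s, so the 60 s maximum applies.
        System.out.println(delayMs(10_000, factor, min, max)); // 60000
    }
}
```

Under this model the sheet guarantees at least ten seconds between requests to an affected server, however quickly that server responds.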

  11. Configuration example, part 2: crawler-beans.cxml

      <bean class="org.archive.crawler.spring.DecideRuledSheetAssociation">
        <property name="rules">
          <bean class="org.archive.modules.deciderules.IpAddressSetDecideRule">
            <property name="ipAddresses">
              <set>
                <value>81.21.76.62</value>
              </set>
            </property>
            <property name="decision" value="ACCEPT"/>
          </bean>
        </property>
        <property name="targetSheetNames">
          <list>
            <value>extraPolite</value>
            <value>crawlLimited</value>
          </list>
        </property>
      </bean>
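The association above keys on the resolved IP address rather than the hostname, which is what lets one rule cover every parked domain on the same machine. A rough sketch of that idea follows; the `IpSheetAssociation` class is a hypothetical name for illustration only, and the real Heritrix classes apply sheets through DecideRules rather than through a direct lookup like this.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration of "apply these sheets to any URL whose host
// resolves to one of these IPs"; not the Heritrix implementation.
class IpSheetAssociation {
    private final Set<String> ipAddresses;
    private final List<String> targetSheetNames;

    IpSheetAssociation(Set<String> ipAddresses, List<String> targetSheetNames) {
        this.ipAddresses = ipAddresses;
        this.targetSheetNames = targetSheetNames;
    }

    // Returns the sheets to overlay for a resolved IP, or an empty list.
    List<String> sheetsFor(String resolvedIp) {
        return ipAddresses.contains(resolvedIp) ? targetSheetNames : List.of();
    }

    public static void main(String[] args) {
        IpSheetAssociation association = new IpSheetAssociation(
                Set.of("81.21.76.62"),
                List.of("extraPolite", "crawlLimited"));

        // Any host resolving to the listed IP gets both sheets.
        System.out.println(association.sheetsFor("81.21.76.62")); // [extraPolite, crawlLimited]
        // Other IPs keep the default settings.
        System.out.println(association.sheetsFor("192.0.2.1"));   // []
    }
}
```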

  12. Thank you
      GitHub: https://github.com/ukwa/bl-heritrix-modules
      MaxMind: http://dev.maxmind.com/geoip/geoip2/geolite2/
