
Heritrix Mobile



Presentation Transcript


  1. Heritrix Mobile. Keith Enlow

  2. Introduction
     • Heritrix 3.1
     • Mobile Finder Web Service
     • 2 Options
       • Crawl desktop web pages (default)
       • Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries.

  3. Experiment
     [Diagram labels: Decision Making; Heritrix; Web Service (Mobile Finder); Heritrix]
     • Modified Heritrix 3.1 to include two options for crawling
       • Option 0: Crawl with desktop user agent
       • Option 1: Crawl with mobile user agent using Mobile Finder
     • Added a built-in mobile user agent adapted from Googlebot
     • Crawled a small set of URLs
     • Used Mobile Finder to determine whether a given URL has a mobile version
     • Wrote a small script to discover differences between the mobile and desktop versions
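The slides do not show the comparison script itself. As a rough sketch of what such a comparison might look like, the Python below tallies references in two already-downloaded HTML pages by resource type (HTML links, JavaScript, CSS, images), matching the categories of the later distribution slides. The names `LinkTally`, `tally`, and `compare` are hypothetical, and the choice of which tags count toward each category is an assumption, not the author's method.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkTally(HTMLParser):
    """Count references found in one HTML page, grouped by resource type."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        if tag == "a" and attrs.get("href"):
            self.counts["html"] += 1
        elif tag == "script" and attrs.get("src"):
            self.counts["javascript"] += 1
        elif tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.counts["css"] += 1
        elif tag == "img" and attrs.get("src"):
            self.counts["image"] += 1


def tally(html):
    """Return a Counter of reference counts for a single page."""
    parser = LinkTally()
    parser.feed(html)
    parser.close()
    return parser.counts


def compare(desktop_html, mobile_html):
    """Return {type: (desktop_count, mobile_count)} for the two pages."""
    desktop, mobile = tally(desktop_html), tally(mobile_html)
    kinds = ("html", "javascript", "css", "image")
    return {kind: (desktop[kind], mobile[kind]) for kind in kinds}
```

Calling `compare(desktop_html, mobile_html)` on the two versions of a page gives one pair of counts per category, which is enough to build per-type totals for a desktop versus mobile comparison.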

  4. <property name="userAgentTemplate"
               value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
     <property name="userAgentTemplateMobile"
               value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>
     <!-- Option # = Description
          0 [Default] Crawl using desktop user agent
          1 Crawl using mobile user agent + Mobile Finder Web Service -->
     <property name="CrawlOption" value="0"/>
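To illustrate how the `CrawlOption` value could select between the two templates, here is a minimal Python sketch. It is not Heritrix code: `build_user_agent` is a hypothetical helper, and the assumption is simply that option 0 picks the desktop template, option 1 picks the mobile one, and the `@VERSION@` and `@OPERATOR_CONTACT_URL@` placeholders are filled by plain string substitution.

```python
# Templates based on the configuration on this slide.
DESKTOP_TEMPLATE = (
    "Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)
MOBILE_TEMPLATE = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) "
    "AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 "
    "Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)

# Assumed mapping: 0 = desktop crawl (default), 1 = mobile crawl.
TEMPLATES = {0: DESKTOP_TEMPLATE, 1: MOBILE_TEMPLATE}


def build_user_agent(crawl_option, version, contact_url):
    """Fill in the template selected by crawl_option (hypothetical helper)."""
    try:
        template = TEMPLATES[crawl_option]
    except KeyError:
        raise ValueError(f"unknown crawl option: {crawl_option!r}") from None
    return (template
            .replace("@VERSION@", version)
            .replace("@OPERATOR_CONTACT_URL@", contact_url))
```

For example, `build_user_agent(0, "3.1.0", "http://example.org/bot")` yields the desktop string with both placeholders filled in; the version and contact URL here are made-up values.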

  5. URLs Crawled

     Desktop URL               Mobile URL
     www.huffingtonpost.com    m.huffpost.com
     www.foxnews.com           foxnews.mobi
     www.nbcnews.com           www.nbcnews.com
     www.whitehouse.gov        m.whitehouse.gov
     www.nasa.gov              mobile.nasa.gov
     www.ssa.gov               www.ssa.gov/mobile
     www.cornell.edu           m.cornell.edu/#home
     www.stanford.edu          m.stanford.edu
     www.mit.edu               m.mit.edu/mobile.mit.edu

  6. Redirection/Delivery
     • 200 Response (server-side redirect)
     • 302 “Temporary” relocation
     • 301 “Permanent” relocation
     • JavaScript Redirection (client-side redirect)
     • Media Queries
     • Style Sheets
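The server-visible mechanisms in this list can be told apart from the HTTP status code and, for media queries, from the page source. The Python sketch below shows one way to do that classification on an already-fetched response. `classify_delivery` and its labels are hypothetical, the media-query patterns only look for width-based queries, and none of this is taken from the modified Heritrix.

```python
import re

# Width-based media query inside a stylesheet or <style> block.
MEDIA_RULE = re.compile(
    r"@media\b[^{]*\((?:max|min)-(?:device-)?width", re.IGNORECASE)
# Width-based media query on a <link media="..."> attribute.
LINK_MEDIA = re.compile(
    r"<link\b[^>]*\bmedia\s*=\s*[\"'][^\"']*\((?:max|min)-(?:device-)?width",
    re.IGNORECASE)


def classify_delivery(status, headers, body=""):
    """Return (mechanism, redirect_target) for one fetched response.

    headers is a dict of response headers; keys are matched
    case-insensitively. redirect_target is None unless the response
    is a 301 or 302 carrying a Location header.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    if status == 301:
        return ("301 permanent relocation", lowered.get("location"))
    if status == 302:
        return ("302 temporary relocation", lowered.get("location"))
    if status == 200:
        if MEDIA_RULE.search(body) or LINK_MEDIA.search(body):
            return ("media queries", None)
        return ("200 response", None)
    return ("other", None)
```

JavaScript redirection is deliberately left out here, since it cannot be recognized from the status code and is only visible by inspecting or executing the script.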

  7. Tiny Limits
     • No JavaScript Engine
       • Heritrix cannot execute JavaScript code
       • It cannot catch client-side redirection and instead continues to crawl the desktop version of the web page.
     Note: The Mobile Finder Web Service will find the mobile page, so Heritrix will continue the crawl.
     • www.nasa.gov
     • www.ssa.gov
     • www.cornell.edu
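Without a JavaScript engine, the most a crawler can do on its own is scan the script text for common redirect idioms. The Python sketch below shows such a static heuristic: it only recognizes string-literal assignments to `location` and calls to `location.replace` or `location.assign`, and it will miss targets that are computed at run time. `find_js_redirect` is a hypothetical helper for illustration and is not part of Heritrix or Mobile Finder.

```python
import re

# Matches e.g.  window.location = "http://m.example.com/"
#          or   location.href = 'http://m.example.com/'
ASSIGNMENT = re.compile(
    r"(?:window\.|document\.|top\.|self\.)?location(?:\.href)?"
    r"\s*=\s*[\"']([^\"']+)[\"']",
    re.IGNORECASE)
# Matches e.g.  location.replace("http://m.example.com/")
CALL = re.compile(
    r"(?:window\.|document\.)?location\.(?:replace|assign)"
    r"\(\s*[\"']([^\"']+)[\"']\s*\)",
    re.IGNORECASE)


def find_js_redirect(html):
    """Return the first literal redirect target found in html, or None."""
    for pattern in (ASSIGNMENT, CALL):
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None
```

A comparison such as `location.href == "..."` is not matched, because the pattern requires a quote directly after a single `=`.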

  8. Desktop vs Mobile: Total Link Count

  9. HTML Distribution

  10. JavaScript Distribution

  11. CSS Distribution

  12. Image Distribution

  13. FIN
