Enhancing Web Harvesting Efficiency with Browser-Crawler Integration

Classical Model: Web Harvesting Pull from queue GET / W/ARC - text/cssimage/gif image/jpg videoJavaScript HTTP/1.0 200 OK

Differences Between a Crawler and a Browser Browsers grab all embedded resources as soon as possible Typical behavior is a burst of traffic followed by long pauses. Crawlers have to play by different rules Typical behavior is sustained traffic. Can quickly overwhelm a website Must apply intentional delays Must obey robots.txt rules

In Your Browser: Behind the Scenes

Browser Experiments Integrate a browser into link extraction pipeline: Log all requests and queue in Heritrix Headless browsers Smaller & Faster No need to render the page visually PhantomJS - built on webkit engine (Safari, iOS, Chrome, Android) Auto-QA, snapshot generation, scripted navigation…

A Different Approach Merging browsers & crawlers.…

Recent Use Cases & Implementations Open Planets (browser extractor module for H3): https://www.github.com/openplanets/wap INA (browser w/inline caching proxy; simulates user actions): https://github.com/davidrapin/fantomas NDIIPP/NDSA (a different hybrid approach…): https://github.com/adam-miller/ExternalBrowserExtractorHTML https://github.com/adam-miller/phantomBrowserExtractor (PhantomJS behavior scripts)

How much do we gain? Traditional Link Extraction: Baseline Test 7444 URIs (200 response) 795 URIs (404 response Browser only (full instance or scripted headless): ~30% less content PhantomJS (WITH traditional link extractor): +24% + Significant improvement in unique URI detection - Additional processing overhead …but can distribute load to dedicated browser nodes + Browser downloads in a separate workflow, asynchronous from Heritrix + JavaScript analytics

Other Strategies/Implementations • Data Mining & Analytics • Pre-Crawl Seed & Link Analysis • Link/Script Analysis during an Active Crawl • Post Crawl Link/Script Analysis, Patching & Auto QA • Native Feeds & Alternate Capture Methods • Data format and context is as important as the content • E.g. • Snapshot Generation & Recording

Enhancing Web Harvesting Efficiency with Browser-Crawler Integration

Enhancing Web Harvesting Efficiency with Browser-Crawler Integration

Presentation Transcript

THE CLASSICAL MODEL

Neo-Classical Growth Model

Classical/Neoclassical Model

The Classical Long-Run Model

Chapter 4 The Classical Model

Classical Model

The Dynamic Classical Model

The classical model of economics

Forest Harvesting in CB Model

The Classical Long-Run Model

Neo-classical Growth Model

Hayek Classical Model

Classical Linear Regression Model

Chapter 3 -- Classical Model

A New Model for Web Resource Harvesting

The Classical Long-Run Model

Classical Model: Web Harvesting

Harvesting Climate Model Uncertainty

A New Model for Web Resource Harvesting

Web Application Harvesting

THE CLASSICAL MODEL

A New Model for Web Resource Harvesting