Crawlers

Padmini Srinivasan

Computer Science Department

Department of Management Sciences

http://cs.uiowa.edu/~psriniva

[email protected]


Basics

  • What is a crawler?

  • HTTP client software that sends out an HTTP request for a page and reads the response (a minimal fetch sketch follows this list).

  • Timeouts

  • How much to download?

  • Exception handling

  • Error handling

  • Collect statistics: time-outs, etc.

  • Follows the Robots Exclusion Protocol (a de facto standard since 1994)
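A minimal sketch of the fetch step just listed: an HTTP client that requests a page with a timeout, caps how much it downloads, and records errors instead of crashing. It uses only the Python standard library; the user-agent string and size cap are illustrative assumptions, not part of the lecture.

import urllib.request
import urllib.error

MAX_BYTES = 100_000  # illustrative cap on how much of a page to download

def fetch(url, timeout=10):
    # Fetch a page; return (body, error) where exactly one is None.
    request = urllib.request.Request(url, headers={"User-Agent": "ClassCrawler/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(MAX_BYTES), None
    except (urllib.error.URLError, TimeoutError, ValueError) as exc:
        # Collect statistics on time-outs and other failures instead of crashing.
        return None, repr(exc)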



Tippie web site

# robots.txt for http://www.tippie.uiowa.edu/ or http://tippie.uiowa.edu/
# Rules for all robots accessing the site.
User-agent: *
Disallow: /error-pages/
Disallow: /includes/
Disallow: /Redirects/
Disallow: /scripts/
Disallow: /CFIDE/
# Individual folders that should not be indexed
Disallow: /vaughan/Board/
Disallow: /economics/mwieg/
Disallow: /economics/midwesttheory/
Disallow: /undergraduate/scholars/
Sitemap: http://tippie.uiowa.edu/sitemap.xml



Robots.txt

Allow all robots full access:

User-agent: *
Disallow:

Keep BadBot out but let Google in:

User-agent: BadBot
Disallow: /

User-agent: Google
Disallow:

Exclude all robots from the entire site:

User-agent: *
Disallow: /

A per-page alternative, placed in the HTML head:

<html>
<head>
<meta name="googlebot" content="noindex">

http://www.robotstxt.org/

Legally binding? No, but robots.txt has been used in legal cases. (A robots.txt check sketch follows.)
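A small sketch of honoring the Robots Exclusion Protocol with Python's standard-library parser. The rules are a subset of the Tippie robots.txt above; the crawler name and the URL paths being tested are made up for illustration.

from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /scripts/
Disallow: /CFIDE/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks every URL before fetching it.
print(rp.can_fetch("ClassCrawler", "http://tippie.uiowa.edu/scripts/form.cfm"))  # False: disallowed
print(rp.can_fetch("ClassCrawler", "http://tippie.uiowa.edu/faculty/"))          # True: no rule matches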



Types of crawlers

  • Get everything?

    • Broad, general-purpose crawls

  • Get everything on a topic?

    • Preferential, topical, focused, thematic

    • What are your objectives behind the crawl?

  • Keep it fresh

    • When does one run it? Get new versus check old?

  • How does one evaluate performance?

    • Sometimes? Continuously? What’s the Gold Standard?



Design



Crawler Parts

  • Frontier

    • List of “to be visited” URLs

    • FIFO (first in first out)

    • Priority queue (preferential)

    • When the Frontier is full

      • Does this happen? What to do?

    • When the Frontier is empty

      • Does this happen?

      • 10,000 pages crawled at an average of 7 links per page yields roughly 70,000 extracted links, yet perhaps 60,000 URLs in the frontier: how so?

      • Unique URLs only? (Frontier sketches follow this list.)
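Two frontier sketches matching the disciplines above: a FIFO queue for a breadth-first crawl and a bounded priority queue for a preferential crawl. The scores, the bound of 50,000 URLs, and the pruning strategy are illustrative assumptions.

import heapq
from collections import deque

class FIFOFrontier:
    # Breadth-first: URLs come out in the order they were discovered.
    def __init__(self):
        self.queue = deque()
        self.seen = set()              # keep only unique URLs

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next(self):
        return self.queue.popleft()

class PriorityFrontier:
    # Preferential: the highest-scored URL is crawled next.
    def __init__(self, max_size=50_000):
        self.heap = []                 # entries are (-score, url)
        self.seen = set()
        self.max_size = max_size

    def add(self, url, score):
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.heap, (-score, url))
            if len(self.heap) > self.max_size:
                # Frontier full: keep only the best-scored URLs (one crude option).
                self.heap = heapq.nsmallest(self.max_size, self.heap)
                heapq.heapify(self.heap)

    def next(self):
        return heapq.heappop(self.heap)[1]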



Crawler Parts

  • History

    • Time-stamped listing of visited URLs; a URL is recorded as it is taken out of the frontier

    • Can keep other information too: quality estimate, update frequency, rate of errors (in accessing page), last update date, anything you want to track related to the fetching of the page.

    • Fast lookup

      • Hashing scheme on the URL itself

      • Canonicalize:

        • Lowercasing;

        • remove anchor reference parts:

          • http://www……./faq.html#time

          • http://www……./faq.html

        • Remove tildes

        • Add or subtract trailing /

        • Remove default pages: index.html

        • Normalize paths: remove parent pointers (“..”) in the URL

          • http://www…./data/../mydata.html

        • Normalize port numbers: drop the default port (80)

      • Spider traps: very long URLs; limit URL length (a canonicalization sketch follows this list).
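A sketch of some of the canonicalization steps above (lowercasing the scheme and host, dropping the fragment, default pages, the default port, parent pointers, trailing slashes, and very long spider-trap URLs). The length limit and the list of default pages are illustrative assumptions, not a full specification.

from urllib.parse import urlparse, urlunparse
import posixpath

MAX_URL_LENGTH = 256                      # illustrative guard against spider traps
DEFAULT_PAGES = {"index.html", "index.htm", "default.html"}

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()         # lowercase the scheme and host
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]              # drop the default port
    path = posixpath.normpath(parts.path or "/")   # /data/../mydata.html -> /mydata.html
    if posixpath.basename(path) in DEFAULT_PAGES:
        path = posixpath.dirname(path)    # drop default pages such as index.html
    if not path.endswith("/") and "." not in posixpath.basename(path):
        path += "/"                       # consistent trailing slash for directories
    canonical = urlunparse((scheme, netloc, path, "", parts.query, ""))  # fragment removed
    if len(canonical) > MAX_URL_LENGTH:
        return None                       # treat as a likely spider trap
    return canonical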



Crawler Parts

  • Page Repository

    • Keep all of it? Some of it? Just the anchor texts?

    • You decide

  • Parse the web page

    • Index and store information (if creating a search engine of some kind)

    • What to index? How to index? How to store?

      • Stopwords, stemming, phrases, tag-tree evidence (DOM)

      • NOISE!

    • Extract URLs (a link-extraction sketch follows)

  • Google's original architecture: to be shown next time.
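A sketch of the URL-extraction step using the standard-library HTML parser; a fuller parser would also keep the anchor text and surrounding context for the scoring methods discussed later.

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

# Usage: extractor = LinkExtractor(page_url); extractor.feed(html_text); extractor.links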



And…

  • But why do crawlers actually work?

    • Topical locality hypothesis

      • An on-topic page tends to link to other on-topic pages.

      • Empirical test: two topically similar pages have a higher probability of linking to each other than two random pages on the web (Davison, 2000).

      • And that is also why browsing works!

    • Status locality?

      • High-status web pages are more likely to link to other high-status pages than to low-status pages.

      • Rationale from social theories: relationship asymmetry in social groups and the spontaneous development of social hierarchies.



Crawler Algorithms

  • Naïve best-first crawler

    • Best-N-first crawler

  • SharkSearch crawler

    • FishSearch

  • Focused crawler

  • Context Focused crawler

  • InfoSpiders

  • Utility-biased web crawlers



Naïve Best First Crawler

  • Compute the cosine similarity between the source page and the query/description; use it as the URL score (a scoring sketch follows)

  • Term frequency (TF) and inverse document frequency (IDF) weights

  • Multi-threaded variant: Best-N-first crawler (e.g., N = 256)
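A sketch of the naïve best-first scoring step: cosine similarity between the query/description and the text of the page on which a link was found. Plain TF weights are used here; adding IDF would require collection statistics, which this sketch omits.

import math
from collections import Counter

def cosine(text_a, text_b):
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Every URL extracted from a page inherits that page's score, e.g.:
# frontier.add(url, score=cosine(query_description, page_text))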



Naïve best-first crawler

  • Bayesian classifier to score URLs (Chakrabarti et al., 1999)

  • SVMs do better (Pant and Srinivasan, 2005); Naïve Bayes tends to produce skewed scores.

  • Use PageRank to score URLs?

    • How to compute it? Only partial data are available; PageRank based on the crawled data alone gives poor results.

    • Later: utility-biased crawler



Shark Search Crawler

  • From earlier Fish Search (de Bra et al.)

    • Depth bound; anchor text; link context; inherited scores

      score(u) = g * inherited(u) + (1 - g) * neighbourhood(u)

      inherited(u) = x * sim(p, q) if sim(p, q) > 0, else inherited(p), with x < 1

      neighbourhood(u) = b * anchor(u) + (1 - b) * context(u), with b < 1

      context(u) = 1 if anchor(u) > 0, else sim(aug_context, q)

      Depth bound: limits travel within a subspace once no more ‘relevant’ information is found. (A scoring sketch follows.)
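A sketch that follows the slide's formulas directly; sim() can be any similarity function (such as the cosine above), and the values of g, x, and b are arbitrary placeholders. The parent page is assumed to be a dict carrying its own text and inherited score.

def shark_score(parent_page, query, anchor_text, aug_context, sim, g=0.5, x=0.8, b=0.5):
    # Score a URL u found on parent page p for query q, per the slide's formulas.
    parent_sim = sim(parent_page["text"], query)
    inherited = x * parent_sim if parent_sim > 0 else parent_page["inherited"]

    anchor = sim(anchor_text, query)
    context = 1.0 if anchor > 0 else sim(aug_context, query)
    neighbourhood = b * anchor + (1 - b) * context

    return g * inherited + (1 - g) * neighbourhood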



Focused Crawler

  • Chakrabarti et al. Stanford/IIT

    • Topic taxonomy

    • User provided sample URLs

    • Classify these onto the taxonomy

      • Prob(c | url), where Prob(root | url) = 1.

    • User iterates selecting and deselecting categories

    • Mark the ‘good’ categories

    • When a page is crawled: relevance(page) = sum of Prob(c | page) over the good categories; used to score its URLs

    • When crawling:

      • Soft mode: use this relevance score to rank URLs

      • Hard mode: find the leaf category with the highest score; if it or any of its ancestors is marked relevant, add the page's URLs to the frontier, else do not (see the sketch below)
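A sketch of the two admission modes, under the assumption that classify(page) returns Prob(c | page) for every category c in the taxonomy, `good` is the set of user-marked categories, `leaves` is the set of leaf categories, and taxonomy_parents maps each category to its parent (None at the root).

def relevance_soft(page, good, classify):
    # Soft mode: the page's relevance (used to rank its out-links) is the
    # probability mass falling on the good categories.
    probs = classify(page)
    return sum(probs[c] for c in good)

def admit_hard(page, good, classify, taxonomy_parents, leaves):
    # Hard mode: take the leaf with the highest probability and walk up the
    # taxonomy; admit the page's out-links only if some node on that path is good.
    probs = classify(page)
    node = max(leaves, key=lambda c: probs[c])
    while node is not None:
        if node in good:
            return True
        node = taxonomy_parents.get(node)
    return False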



Context Focused Crawler

  • A rather different strategy

    • Topic locality hypothesis somewhat explicitly used here

    • Classifiers estimate the distance from a crawled page to a relevant page; this estimate scores the page's URLs.



Context Graph

Levels: L

Estimate the probability that a page belongs to class (level) x, for x = 1, 2, 3 (plus an “other” class).

Bayes theorem:

Prob(L1 | page) = {Prob(page | L1) * Prob(L1)} / Prob(page)

Prior: Prob(L1) = 1/L, where L is the number of levels. (A level-assignment sketch follows.)
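A sketch of the level-assignment step: one classifier per context-graph level supplies Prob(page | level), a uniform prior 1/L is applied as on the slide, and the most probable level (an estimate of the distance to a target page) is used to prioritize the page's out-links. level_likelihood is an assumed hook, not a named method from the paper.

def assign_level(page, level_likelihood, num_levels):
    # Return the most probable level for the page, or None for "other".
    prior = 1.0 / num_levels                     # Prob(Li) = 1/L
    # Prob(Li | page) is proportional to Prob(page | Li) * Prob(Li);
    # Prob(page) cancels when only the argmax is needed.
    posteriors = {level: level_likelihood(page, level) * prior
                  for level in range(1, num_levels + 1)}
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > 0 else None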



Utility-Biased Crawler

  • Considers both topical and status locality.

  • Estimates status via local properties

  • Combines using several functions.

    • One: Cobb-Douglas function

      • Utility(URL) = topicality^a * status^b, with a + b = 1

    • If a page is twice as high in topicality and twice as high in status, its utility is twice as high as well.

    • Increases in topicality (or status) alone yield progressively smaller increases in utility.



Estimating Status: the cool part

  • Local properties

    • M5’ decision tree algorithm

      • Information volume

      • Information location

        • http://www.somedomain.com/products/news/daily.html

      • Information specificity

      • Information brokerage

      • Link ratio: # links / # words

      • Quantitative ratio: # numbers / # words

      • Domain traffic: ‘reach’ data for domain obtained from Alexa Web Information Service

        • Pant & Srinivasan, 2010, ISR



Utility-Biased Crawler

  • Cobb-Douglas function

    • Utility(URL) = topicality^a * status^b, with a + b = 1

  • Should a be fixed (“one size fits all”), or should it vary with the subspace being crawled?

  • Target topicality level: d

  • Update rule: a = a + delta * (d - t), clipped so that 0 <= a <= 1

    • t: average estimated topicality of the last 25 pages fetched

    • Delta is a step size (0.01)

  • Assume a = 0.7, delta = 0.01, d = 0.7, and t = 0.9

    • a = 0.7 + 0.01 * (0.7 - 0.9) = 0.7 - 0.002 = 0.698

  • Assume a = 0.7, delta = 0.01, d = 0.7, and t = 0.4

    • a = 0.7 + 0.01 * (0.7 - 0.4) = 0.7 + 0.003 = 0.703
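A sketch of the Cobb-Douglas utility and the adaptive exponent update shown above. Topicality and status are assumed to be scores in [0, 1] produced by the crawler's own estimators; the step size default simply mirrors the slide's example.

def utility(topicality, status, a):
    b = 1.0 - a                                   # a + b = 1
    return (topicality ** a) * (status ** b)

def update_a(a, d, t, delta=0.01):
    # a = a + delta * (d - t), clipped to [0, 1];
    # t is the average estimated topicality of the last 25 pages fetched.
    return min(1.0, max(0.0, a + delta * (d - t)))

# The slide's first example: a = 0.7, delta = 0.01, d = 0.7, t = 0.9
# update_a(0.7, 0.7, 0.9) -> 0.698 (i.e., 0.7 - 0.002)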



It’s a matter of balance



Crawler Evaluation

  • What are good pages?

  • Web scale is daunting

  • User-driven crawls are short, but what about web agents?

  • Page importance can be assessed by:

    • Presence of query keywords

    • Similarity of the page to the query/description

    • Similarity to seed pages (a held-out sample)

    • A classifier (not the same one used in the crawler)

    • Link-based popularity (but within the topic?)



Summarizing Performance

  • Precision

    • Relevance is Boolean (yes/no)

      • Harvest rate: # of relevant pages / total # of pages crawled

    • Relevance is continuous

      • Average relevance over the crawled set

  • Recall

    • Target recall, using a held-out set of seed pages H:

      • |H ∩ pages crawled| / |H|

  • Robustness

    • Start the same crawler on disjoint seed sets and examine the overlap of the fetched pages. (A sketch of these measures follows.)
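A sketch of the summary measures above. `crawled` is the list of fetched URLs, `relevance` maps each URL to a Boolean or continuous relevance judgment, and `targets` is the held-out set H; all three are assumed inputs.

def harvest_rate(crawled, relevance):
    # Boolean relevance: fraction of crawled pages judged relevant.
    return sum(1 for u in crawled if relevance[u]) / len(crawled)

def average_relevance(crawled, relevance):
    # Continuous relevance: mean relevance over the crawled set.
    return sum(relevance[u] for u in crawled) / len(crawled)

def target_recall(crawled, targets):
    # Fraction of the held-out target set H that the crawl recovered.
    return len(set(crawled) & set(targets)) / len(targets)

def robustness(crawled_a, crawled_b):
    # Overlap between two runs of the same crawler started from disjoint seed sets.
    a, b = set(crawled_a), set(crawled_b)
    return len(a & b) / len(a | b)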



Sample Performance Graph



Summary

  • Crawler architecture

  • Crawler algorithms

  • Crawler evaluation

  • Assignment 1

    • Run two crawlers for 5,000 pages.

    • Start with the same set of seed pages for a topic.

    • Look at the overlap and report it over time (robustness).

