Crawling the Web

  • Web pages
    • Typically a few thousand characters long
    • Served over the Internet using the Hypertext Transfer Protocol (HTTP)
    • Viewed at the client end using browsers
  • Crawler
    • Fetches the pages to a computer
    • At that computer, automatic programs can analyze the hypertext documents

HTML
  • HyperText Markup Language
  • Lets the author
    • specify layout and typeface
    • embed diagrams
    • create hyperlinks
      • expressed as an anchor tag with an HREF attribute
      • HREF names another page using a Uniform Resource Locator (URL)
    • URL =
      • protocol field (“HTTP”) +
      • a server hostname (“www.cse.iitb.ac.in”) +
      • a file path (/, the ‘root’ of the published file system)
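
These three fields map directly onto what a standard URL parser returns; a minimal sketch in Python (the host name is just the example above):

    from urllib.parse import urlsplit

    parts = urlsplit("http://www.cse.iitb.ac.in/")
    print(parts.scheme)   # protocol field: 'http'
    print(parts.netloc)   # server hostname: 'www.cse.iitb.ac.in'
    print(parts.path)     # file path: '/' (the root)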

HTTP (Hypertext Transfer Protocol)
  • Built on top of the Transmission Control Protocol (TCP)
  • Steps (from the client end)
    • Resolve the server host name to an Internet address (IP)
      • Uses the Domain Name System (DNS)
      • DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
    • Contact the server using TCP
      • Connect to the default HTTP port (80) on the server
      • Send the HTTP request header (e.g., GET)
      • Fetch the response header
        • MIME (Multipurpose Internet Mail Extensions)
        • A meta-data standard for email and Web content transfer
      • Fetch the HTML page
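
A minimal sketch of these steps in Python using raw sockets (www.example.com is a stand-in host; HTTP/1.0 is used so the server closes the connection at the end of the page):

    import socket

    host = "www.example.com"
    ip = socket.gethostbyname(host)               # resolve host name via DNS
    sock = socket.create_connection((ip, 80))     # TCP connect to HTTP port 80
    sock.sendall(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
    response = b""
    while True:                                   # read response header + page
        chunk = sock.recv(4096)
        if not chunk:                             # server closed the connection
            break
        response += chunk
    sock.close()
    header, _, html = response.partition(b"\r\n\r\n")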

Crawl “all” Web pages?
  • Problem: there is no catalog of all accessible URLs on the Web
  • Solution (sketched below):
    • Start from a given set of URLs
    • Progressively fetch and scan them for new outlinking URLs
    • Fetch these pages in turn
    • Submit the text in each page to a text indexing system
    • And so on…
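
A minimal sketch of this loop in Python (the regex link extractor and the in-memory pages dict are simplifications; later slides add the concurrency, politeness, and duplicate-elimination machinery a real crawler needs):

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)              # start from a given set of URLs
        seen = set(seed_urls)
        pages = {}                               # stands in for the text indexer
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url).read().decode("utf-8", "replace")
            except OSError:
                continue                         # skip unreachable pages
            pages[url] = html                    # submit text to the indexer
            for href in re.findall(r'href="([^"]+)"', html, re.I):
                link = urljoin(url, href)        # resolve relative URLs
                if link.startswith("http") and link not in seen:
                    seen.add(link)               # fetch new outlinks in turn
                    frontier.append(link)
        return pages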

Crawling procedure
  • The basic procedure is simple
    • But a great deal of engineering goes into industry-strength crawlers
    • Industry crawlers crawl a substantial fraction of the Web
    • E.g.: AltaVista, Northern Light, Inktomi
  • No guarantee that all accessible Web pages will be located in this fashion
  • The crawler may never halt
    • Pages are added continually even as it runs

Crawling overheads
  • Delays involved in
    • Resolving the host name in the URL to an IP address using DNS
    • Connecting a socket to the server and sending the request
    • Receiving the requested page in response
  • Solution: Overlap the above delays by
    • fetching many pages at the same time

Anatomy of a crawler.
  • Page fetching threads
    • Start with DNS resolution
    • Finish when the entire page has been fetched
  • Each page
    • is stored in compressed form to disk/tape
    • is scanned for outlinks
  • Work pool of outlinks
    • Maintains network utilization without overloading it
      • Dealt with by the load manager
  • Continue till the crawler has collected a sufficient number of pages

(Figure: Typical anatomy of a large-scale crawler.)

Large-scale crawlers: performance and reliability considerations
  • Need to fetch many pages at the same time
    • to utilize the network bandwidth fully
    • a single page fetch may involve several seconds of network latency
  • Highly concurrent and parallelized DNS lookups
  • Use of asynchronous sockets
    • Explicit encoding of the state of a fetch context in a data structure
    • Polling sockets to check for completion of network transfers
    • Multi-processing or multi-threading at this scale: impractical
  • Care in URL extraction
    • Eliminating duplicates to reduce redundant fetches
    • Avoiding “spider traps”

DNS caching, pre-fetching and resolution
  • A customized DNS component with:
    • Custom client for address resolution
    • Caching server
    • Prefetching client

Custom client for address resolution
  • Tailored for concurrent handling of multiple outstanding requests
  • Allows issuing of many resolution requests together
    • polling at a later time for completion of individual requests
  • Facilitates load distribution among many DNS servers.
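
A sketch of such a client using a thread pool from Python's standard library (a real crawler would issue asynchronous DNS requests directly, but the contract is the same: submit many lookups, poll later):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    hosts = ["www.cse.iitb.ac.in", "www.w3.org", "www.example.com"]
    with ThreadPoolExecutor(max_workers=10) as pool:
        # issue many resolution requests together...
        futures = {host: pool.submit(socket.gethostbyname, host) for host in hosts}
        # ...and poll later for completion of individual requests
        for host, fut in futures.items():
            try:
                print(host, "->", fut.result(timeout=5))
            except Exception as err:
                print(host, "failed:", err)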

Caching server
  • With a large cache, persistent across DNS restarts
  • Residing largely in memory if possible.

Prefetching client
  • Steps
    • Parse a page that has just been fetched
    • Extract host names from HREF targets
    • Make DNS resolution requests to the caching server
  • Usually implemented using UDP
    • User Datagram Protocol
    • Connectionless, packet-based communication protocol
    • Does not guarantee packet delivery
  • Does not wait for resolution to be completed
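
A sketch of the prefetching idea, reusing a thread pool as the stand-in resolver (the regex extractor and cache dict are assumptions; note that nothing ever blocks on a result):

    import re
    import socket
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlsplit

    resolver = ThreadPoolExecutor(max_workers=10)  # stands in for the caching server
    dns_cache = {}

    def prefetch_dns(html):
        # Extract host names from HREF targets and fire resolution requests,
        # without waiting for completion (mirrors the UDP fire-and-forget).
        for href in re.findall(r'href="([^"]+)"', html, re.I):
            host = urlsplit(href).hostname
            if host and host not in dns_cache:
                dns_cache[host] = None             # mark as requested
                def store(fut, h=host):
                    if fut.exception() is None:
                        dns_cache[h] = fut.result()
                resolver.submit(socket.gethostbyname, host).add_done_callback(store)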

Multiple concurrent fetches
  • Managing multiple concurrent connections
    • A single download may take several seconds
    • Open many socket connections to different HTTP servers simultaneously
  • Multi-CPU machines not useful
    • crawling performance limited by network and disk
  • Two approaches
    • using multi-threading
    • using non-blocking sockets with event handlers

Multi-threading
  • Logical threads
    • Physical threads of control provided by the operating system (e.g., pthreads), OR
    • Concurrent processes
  • A fixed number of threads is allocated in advance
  • Programming paradigm (per thread; sketched below)
    • Create a client socket
    • Connect the socket to the HTTP service on a server
    • Send the HTTP request header
    • Read the socket (recv) until no more characters are available
    • Close the socket
  • Uses blocking system calls
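
A sketch of this paradigm in Python: a fixed pool of threads, each running exactly the create/connect/send/recv/close sequence above on URLs drawn from a shared queue (the print is a stand-in for storing and scanning the page):

    import queue
    import socket
    import threading
    from urllib.parse import urlsplit

    work = queue.Queue()

    def fetch_worker():
        while True:
            url = work.get()
            try:
                parts = urlsplit(url)
                sock = socket.create_connection((parts.hostname, parts.port or 80))
                sock.sendall(f"GET {parts.path or '/'} HTTP/1.0\r\n"
                             f"Host: {parts.hostname}\r\n\r\n".encode("ascii"))
                page = b""
                while True:                    # blocking recv until EOF
                    chunk = sock.recv(4096)
                    if not chunk:
                        break
                    page += chunk
                sock.close()
                print(url, len(page))          # stand-in: store page, scan outlinks
            except OSError:
                pass                           # a real crawler would log and retry
            finally:
                work.task_done()

    for _ in range(8):                         # fixed number of threads, in advance
        threading.Thread(target=fetch_worker, daemon=True).start()
    work.put("http://www.example.com/")
    work.join()                                # wait until all fetches finish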

Multi-threading: Problems
  • Performance penalty from
    • mutual exclusion
    • concurrent access to data structures
  • Slow disk seeks
    • A great deal of interleaved, random input-output on disk
    • Due to concurrent modification of the document repository by multiple threads

Non-blocking sockets and event handlers
  • Non-blocking sockets (sketched below)
    • A connect, send, or recv call returns immediately, without waiting for the network operation to complete
    • The status of the network operation is polled separately
  • The “select” system call
    • Lets the application suspend until more data can be read from or written to a socket
    • Times out after a pre-specified deadline
    • Monitors several sockets at the same time
  • More efficient memory management
    • Code that completes processing of one page is not interrupted by other completions
    • No need for locks and semaphores on the work pool
    • Only complete pages are appended to the log
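
A sketch of this event-driven style using Python's selectors module: each fetch context is an explicit data structure, and a single loop polls all sockets, so no locks are needed:

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def start_fetch(host, path="/"):
        sock = socket.socket()
        sock.setblocking(False)                    # connect returns immediately
        sock.connect_ex((host, 80))
        ctx = {"host": host, "buf": b"",           # explicit fetch-context state
               "req": f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")}
        sel.register(sock, selectors.EVENT_WRITE, ctx)

    def event_loop():
        while sel.get_map():
            # suspend until some socket is ready, with a pre-specified deadline
            for key, events in sel.select(timeout=5):
                sock, ctx = key.fileobj, key.data
                if events & selectors.EVENT_WRITE:  # connected: send the request
                    sock.sendall(ctx["req"])
                    sel.modify(sock, selectors.EVENT_READ, ctx)
                else:                               # readable: pull more bytes
                    chunk = sock.recv(4096)
                    if chunk:
                        ctx["buf"] += chunk
                    else:                           # EOF: the page is complete
                        sel.unregister(sock)
                        sock.close()
                        print(ctx["host"], len(ctx["buf"]))  # append to the log

    start_fetch("www.example.com")
    event_loop()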

Link extraction and normalization
  • Goal: obtain a canonical form of each URL
  • URL processing and filtering
    • Avoid multiple fetches of pages known by different URLs
    • A host name may map to many IP addresses
      • For load balancing on large sites
        • Mirrored contents / contents on the same file system
      • “Proxy pass”
        • Mapping of different host names to a single IP address
        • Needed to publish many logical sites
    • Relative URLs (example below)
      • Need to be interpreted w.r.t. a base URL
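
For example, Python's standard URL resolution shows how the same relative link names different pages under different bases (the paths here are made up):

    from urllib.parse import urljoin

    base = "http://www.cse.iitb.ac.in/faculty/index.html"
    print(urljoin(base, "soumen.html"))    # http://www.cse.iitb.ac.in/faculty/soumen.html
    print(urljoin(base, "/courses/"))      # http://www.cse.iitb.ac.in/courses/
    print(urljoin(base, "../index.html"))  # http://www.cse.iitb.ac.in/index.html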

Canonical URL

Formed by

  • Using a standard string for the protocol
  • Canonicalizing the host name
  • Adding an explicit port number
  • Normalizing and cleaning up the path
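
A sketch of these four steps in Python (the specific rules, such as lower-casing the host and collapsing “.” and “..” path segments, are assumptions; real crawlers apply many more):

    from urllib.parse import urlsplit, urlunsplit

    def canonical_url(url):
        parts = urlsplit(url)
        scheme = parts.scheme.lower() or "http"   # standard string for the protocol
        host = parts.hostname.lower()             # canonicalize the host name
        port = parts.port or 80                   # add an explicit port number
        segments = []                             # normalize and clean up the path
        for seg in (parts.path or "/").split("/"):
            if seg == "..":
                if segments:
                    segments.pop()
            elif seg not in ("", "."):
                segments.append(seg)
        path = "/" + "/".join(segments)
        return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

    print(canonical_url("HTTP://www.CSE.iitb.ac.in/a/./b/../index.html"))
    # -> http://www.cse.iitb.ac.in:80/a/index.html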

Robot exclusion
  • Check whether the server prohibits crawling a normalized URL
    • Specified in the robots.txt file in the HTTP root directory of the server
    • robots.txt specifies a list of path prefixes which crawlers should not attempt to fetch
  • Meant for crawlers only
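
Python ships a parser for this file; a minimal sketch (the URL and agent name are examples):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.example.com/robots.txt")
    rp.read()                          # fetch and parse the server's robots.txt
    ok = rp.can_fetch("MyCrawler/1.0", "http://www.example.com/private/page.html")
    print("allowed to fetch?", ok)     # False if a disallowed prefix matches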

Eliminating already-visited URLs
  • Checking if a URL has already been fetched
    • Before adding a new URL to the work pool
    • Needs to be very quick
    • Achieved by computing an MD5 hash function on the URL
  • Exploiting spatio-temporal locality of access
    • Two-level hash function (sketched below)
      • The most significant bits (say, 24) are derived by hashing the host name plus port
      • The lower-order bits (say, 40) are derived by hashing the path
      • The concatenated bits are used as a key in a B-tree
  • Qualifying URLs are added to the frontier of the crawl
  • Their hash values are added to the B-tree
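
A sketch of the two-level key in Python: URLs from the same server share their high-order bits, so they cluster near each other in the B-tree (the 24/40 split follows the slide; the B-tree itself is elided):

    import hashlib
    from urllib.parse import urlsplit

    def crawl_key(url):
        parts = urlsplit(url)
        hostport = f"{parts.hostname}:{parts.port or 80}".encode("utf-8")
        path = (parts.path or "/").encode("utf-8")
        hi = int.from_bytes(hashlib.md5(hostport).digest()[:3], "big")  # 24 bits
        lo = int.from_bytes(hashlib.md5(path).digest()[:5], "big")      # 40 bits
        return (hi << 40) | lo                    # 64-bit B-tree key

    print(hex(crawl_key("http://www.cse.iitb.ac.in/faculty/index.html")))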

Spider traps
  • Protect the crawler from crashing on
    • Ill-formed HTML
      • E.g.: a page with 68 kB of null characters
    • Misleading sites
      • An indefinite number of pages dynamically generated by CGI scripts
      • Paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server

Spider Traps: Solutions
  • No automatic technique can be foolproof
  • Check for URL length
  • Guards (sketched below)
    • Prepare regular crawl statistics
    • Add sites that dominate the statistics to the guard module
    • Disable crawling of active content such as CGI form queries
    • Eliminate URLs with non-textual data types
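
A sketch of such guards as simple URL filters (the threshold and the extension list are assumptions, not from the slides):

    MAX_URL_LEN = 256
    NON_TEXTUAL = (".jpg", ".png", ".gif", ".zip", ".pdf", ".exe")

    def passes_guards(url):
        if len(url) > MAX_URL_LEN:             # suspiciously long URLs: likely traps
            return False
        if "?" in url or "/cgi-bin/" in url:   # active content such as CGI queries
            return False
        if url.lower().endswith(NON_TEXTUAL):  # non-textual data types
            return False
        return True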

Avoiding repeated expansion of links on duplicate pages
  • Reduce redundancy in crawls
  • Duplicate detection
    • Mirrored Web pages and sites
  • Detecting exact duplicates
    • Check against MD5 digests of stored pages
    • Represent a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
  • Detecting near-duplicates
    • Even a single altered character completely changes the digest!
      • E.g.: the date of update, or the name and email of the site administrator
    • Solution: shingling (sketched below)
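
A minimal sketch of the shingling idea: two pages are near-duplicates if their sets of w-word windows overlap heavily (the window size w=4 and the Jaccard measure are standard choices, not from the slide):

    def shingles(text, w=4):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

    def resemblance(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb)

    page1 = "welcome to the research group home page " * 5 + "last updated 1999"
    page2 = "welcome to the research group home page " * 5 + "last updated 2001"
    print(resemblance(page1, page2))   # stays high despite the changed year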

Load monitor
  • Keeps track of various system statistics
    • Recent performance of the wide area network (WAN) connection
      • E.g.: latency and bandwidth estimates.
    • Operator-provided/estimated upper bound on open sockets for a crawler
    • Current number of active sockets.

Thread manager
  • Responsible for
    • Choosing units of work from the frontier
    • Scheduling the use of network resources
    • Distributing these requests over multiple ISPs, if appropriate
  • Uses statistics from the load monitor

Per-server work queues
  • Denial of service (DoS) defenses
    • Servers limit the speed or frequency of responses to any fixed client IP address
  • Avoiding triggering these DoS defenses (sketched below)
    • Limit the number of active requests to a given server IP address at any time
    • Maintain a queue of requests for each server
      • Use the HTTP/1.1 persistent socket capability
    • Distribute attention relatively evenly between a large number of sites
  • Access locality vs. politeness dilemma
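
A sketch of per-server queues with a politeness delay (the one-second spacing and the round-robin scan are assumptions):

    import time
    from collections import defaultdict, deque

    MIN_DELAY = 1.0                  # seconds between hits on the same server
    queues = defaultdict(deque)      # one request queue per server
    last_hit = {}                    # server -> time of the last request

    def enqueue(server, url):
        queues[server].append(url)

    def next_url():
        # Scan servers, distributing attention among many sites instead of
        # draining one queue; return None if every server was hit too recently.
        now = time.monotonic()
        for server, q in queues.items():
            if q and now - last_hit.get(server, 0.0) >= MIN_DELAY:
                last_hit[server] = now
                return q.popleft()
        return None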

Text repository
  • The crawler's last task
    • Dumping fetched pages into a repository
  • Decoupling the crawler from other functions is preferred, for efficiency and reliability
  • Page-related information is stored in two parts
    • meta-data
    • page contents

Storage of page-related information
  • Meta-data
    • Relational in nature
      • Usually managed by custom software to avoid relational database system overheads
      • The text index involves bulk updates
    • Includes fields like content-type, last-modified date, content-length, HTTP status code, etc.

Page contents storage
  • A typical HTML Web page compresses to 2-4 kB (using zlib)
  • File systems have a 4-8 kB file block size
    • Too large to store each page in its own file!
  • Page storage is managed by a custom storage manager (sketched below)
    • Provides simple access methods for
      • the crawler to add pages
      • subsequent programs (the indexer, etc.) to retrieve documents
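
A sketch of such a storage manager in Python: many zlib-compressed pages are packed into one log file instead of one file-system block per page (the record format here is an assumption):

    import struct
    import zlib

    # record format: [4-byte URL length][URL][4-byte data length][compressed page]
    def append_page(log, url, html):
        u = url.encode("utf-8")
        data = zlib.compress(html.encode("utf-8"))
        log.write(struct.pack("!I", len(u)) + u)
        log.write(struct.pack("!I", len(data)) + data)

    def read_pages(log):
        while True:
            head = log.read(4)
            if not head:
                break
            url = log.read(struct.unpack("!I", head)[0]).decode("utf-8")
            size = struct.unpack("!I", log.read(4))[0]
            yield url, zlib.decompress(log.read(size)).decode("utf-8")

    with open("pages.log", "wb") as f:
        append_page(f, "http://www.example.com/", "<html>hello</html>")
    with open("pages.log", "rb") as f:
        for url, html in read_pages(f):
            print(url, len(html))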

Page Storage
  • Small-scale systems
    • Repository fits within the disks of a single machine
    • Use of a storage manager (e.g., Berkeley DB)
      • Manages disk-based databases within a single file
      • Can be configured as a hash table or B-tree keyed by URL
        • To handle ordered access to pages
      • Or as a sequential log of page records
        • Since the indexer can handle pages in any order

Page Storage
  • Large-scale systems
    • Repository distributed over a number of storage servers
    • Storage servers
      • Connected to the crawler through a fast local network (e.g., Ethernet)
      • Pages are hashed to servers by URL
    • ‘T3’-grade leased lines needed
      • To handle 10 million pages (40 GB) per hour

(Figure: Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.)

Refreshing crawled pages
  • A search engine's index should be fresh
  • A Web-scale crawler never ‘completes’ its job
  • High variance in the rate of page changes
  • The HTTP “If-modified-since” request header
    • Impractical for a crawler: a request must still be issued for every page
  • Solution
    • At the commencement of a new crawling round, estimate which pages have changed

Determining page changes
  • The “Expires” HTTP response header
    • For pages that come with an expiry date
  • Otherwise, need to guess whether revisiting the page will yield a modified version
    • Maintain a score reflecting the probability that each page has been modified
    • The crawler fetches URLs in decreasing order of score
    • Assumption: the recent past predicts the future

Estimating page change rates
  • Brewington and Cybenko, and Cho
    • Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
  • Prerequisite
    • The average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
  • Small-scale intermediate crawler runs
    • To monitor fast-changing sites
      • E.g.: current news, weather, etc.
    • Intermediate indices are patched into the master index

Putting together a crawler
  • Reference implementation of the HTTP client protocol
    • World Wide Web Consortium (http://www.w3c.org/)
    • The w3c-libwww package

Design of the core components: Crawler class.
  • Copies bytes from network sockets to storage media
  • Three methods express the Crawler's contract with the user (sketched below)
    • Pushing a URL to be fetched to the Crawler (fetchPush)
    • A termination callback handler (fetchDone), called with the same URL
    • A method (start) which starts the Crawler's event loop
  • Implementation of the Crawler class
    • Needs two helper classes, called DNS and Fetch
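
A sketch of this contract in Python (the text's reference implementation builds on w3c-libwww in C++; here a synchronous urlopen stands in for the asynchronous DNS and Fetch helpers):

    from urllib.request import urlopen

    class Crawler:
        def __init__(self):
            self.frontier = []                   # URLs waiting to be fetched

        def fetchPush(self, url):
            self.frontier.append(url)            # push a URL to be fetched

        def fetchDone(self, url, page):
            print(url, len(page))                # termination callback: override me

        def start(self):                         # the Crawler's event loop
            while self.frontier:
                url = self.frontier.pop(0)
                try:
                    page = urlopen(url).read()   # stand-in for DNS + Fetch helpers
                except OSError:
                    page = b""
                self.fetchDone(url, page)

    crawler = Crawler()
    crawler.fetchPush("http://www.example.com/")
    crawler.start()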

Source: Chakrabarti and Ramakrishnan