
  1. Crawling the Web: problems and techniques Claudio Scordino, Ph.D. Student Computer Science Department - University of Pisa May 2004

  2. Outline • Introduction • Crawler architectures • Increasing the throughput • What pages we do not want to fetch • Spider traps • Duplicates • Mirrors

  3. Introduction • Job of a crawler (or spider): fetching the Web pages to a computer where they will be analyzed • The algorithm is conceptually simple, but… it’s a complex and underestimated activity

  4. Famous Crawlers • Mercator (Compaq, Altavista) • Java • Modular (components loaded dynamically) • Priority-based scheduling for URL downloads • The scheduling algorithm is a pluggable component • Different processing modules for different contents • Checkpointing • Allows the crawler to recover its state after a failure • In a distributed crawler, checkpointing is performed by the Queen

  5. Famous Crawlers • GoogleBot (Stanford, Google) • C/C++ • WebBase (Stanford) • HiWE: Hidden Web Exposer (Stanford) • Heritrix (Internet Archive) • http://www.crawler.archive.org/

  6. Famous Crawlers • Sphinx • Java • Visual and interactive environment • Relocatable: capable of executing on a remote host • Site-specific • Customizable crawling • Classifiers: site-specific content analyzers • Links to follow • Parts to process • Not scalable

  7. Crawler Architecture • [Diagram] Components: DNS and HTTP retrievers fetching pages from the Internet; a parser (HREFs extractor and normalizer) producing citations; URL filter; Duplicate URL Eliminator; a scheduler feeding the URL frontier (initialized with seed URLs); hosts table; load monitor; crawl metadata

  8. Web masters annoyed • Web Server administrators could be annoyed by: • Server overload • Solution: per-server queues • Fetching of private pages • Solution: Robot Exclusion Protocol • File: /robots.txt
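The Robot Exclusion Protocol check can be illustrated with a minimal Python sketch (the host name is hypothetical, and this is not code from any of the crawlers above):

```python
# Minimal sketch: honor /robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # hypothetical host
rp.read()                                     # fetch and parse the robots file

def allowed(url, agent="MyCrawler"):
    """Return True if the Robot Exclusion Protocol permits fetching `url`."""
    return rp.can_fetch(agent, url)

print(allowed("http://example.com/private/page.html"))
```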

  9. Crawler Architecture (revisited) • [Diagram] Adds per-server queues and a robots.txt module to the architecture of slide 7

  10. Mercator’s scheduler • FRONT-END: prioritizes URLs, assigning each a value between 1 and k • BACK-END: ensures politeness (no server overload) • Each back-end queue contains URLs of only a single host and specifies when that server may be contacted again
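A minimal sketch of such a two-level frontier, assuming k front-end priority queues, one back-end FIFO per host and a fixed politeness delay (illustrative values, not Mercator's actual parameters):

```python
# Sketch of a Mercator-style two-level URL frontier (priorities 0..k-1 here).
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, k=3, delay=10.0):
        self.front = [deque() for _ in range(k)]      # front-end: one queue per priority
        self.back = defaultdict(deque)                # back-end: one queue per host
        self.next_allowed = defaultdict(float)        # when each server may be contacted again
        self.delay = delay                            # politeness interval (seconds)

    def add(self, url, priority=0):
        self.front[priority].append(url)              # front-end prioritization

    def _refill(self):
        for q in self.front:                          # move URLs to their host's queue
            while q:
                url = q.popleft()
                self.back[urlsplit(url).hostname].append(url)

    def next_url(self):
        self._refill()
        now = time.time()
        for host, q in self.back.items():
            if q and self.next_allowed[host] <= now:  # this server may be contacted again
                self.next_allowed[host] = now + self.delay
                return q.popleft()
        return None                                   # no host is ready right now
```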

  11. Increasing the throughput • Parallelize the process to fetch many pages at the same time (~thousands per second) • Possible levels of parallelization: DNS resolution, HTTP retrieval, parsing

  12. Domain Name resolution • Problem: DNS requires time to resolve the server hostname

  13. Domain Name resolution • Asynchronous DNS resolver: • Concurrent handling of multiple outstanding requests • Not provided by most UNIX implementations of gethostbyname • GNU ADNS library • http://www.chiark.greenend.org.uk/~ian/adns/ • Mercator reduced the fraction of each thread’s elapsed time spent on DNS from 87% to 25%

  14. Domain Name resolution • Customized DNS component: • Caching server with persistent cache largely residing in memory • Prefetching • Hostnames extracted from HREFs, and requests made to the caching server • The crawler does not wait for resolution to be completed
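A minimal sketch of a caching resolver with prefetching; a thread pool stands in for an asynchronous resolver such as GNU ADNS, and this is not the component Mercator actually used:

```python
# Sketch: in-memory DNS cache plus fire-and-forget prefetching.
import socket
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    def __init__(self, workers=32):
        self.cache = {}                      # hostname -> IP address (kept in memory)
        self.pending = {}                    # hostname -> Future of an in-flight lookup
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, hostname):
        """Start resolving a hostname extracted from an HREF; do not wait."""
        if hostname not in self.cache and hostname not in self.pending:
            self.pending[hostname] = self.pool.submit(socket.gethostbyname, hostname)

    def resolve(self, hostname):
        """Blocking lookup; usually a cache hit if prefetch() ran earlier."""
        if hostname not in self.cache:
            future = self.pending.pop(hostname, None) or \
                     self.pool.submit(socket.gethostbyname, hostname)
            self.cache[hostname] = future.result()
        return self.cache[hostname]
```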

  15. Crawler Architecture (revisited) • [Diagram] Adds a DNS cache, an asynchronous DNS prefetch path and a DNS resolver client alongside the per-server queues and robots.txt module

  16. Page retrieval Problem: HTTP requires time to fetch a page • Multithreading • Blocking system calls (synchronous I/O) • pthreads multithreading library • Used in Mercator, Sphinx, WebRace • Sphinx uses a monitor to determine the optimal number of threads at runtime • Mutual exclusion overhead
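A minimal sketch of multithreaded retrieval with blocking system calls; the fixed thread count and the use of urllib are illustrative assumptions (Sphinx, by contrast, tunes the number of threads at runtime):

```python
# Sketch: a pool of worker threads doing blocking (synchronous) page fetches.
import threading, queue, urllib.request

url_queue = queue.Queue()
results, results_lock = {}, threading.Lock()   # shared state needs mutual exclusion

def worker():
    while True:
        url = url_queue.get()
        if url is None:                        # poison pill: stop this thread
            break
        try:
            page = urllib.request.urlopen(url, timeout=10).read()
            with results_lock:                 # the mutual exclusion overhead noted above
                results[url] = page
        except OSError:
            pass                               # fetch failed; a real crawler would log/retry
        url_queue.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(8)]
for t in threads:
    t.start()
```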

  17. Page retrieval • Asynchronous sockets • Do not block the process/thread • select monitors several sockets at the same time • Does not need mutual exclusion since it performs a serialized completion of threads (i.e. the code that completes processing the page is not interrupted by other completions). • Used in IXE (1024 connections at once)
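A minimal sketch of the select-based approach using Python's selectors module; the plain HTTP/1.0 requests and single event loop are illustrative assumptions, not IXE's code:

```python
# Sketch: many non-blocking sockets multiplexed by select() in one thread,
# so each response is processed to completion without locks.
import selectors, socket

sel = selectors.DefaultSelector()

def start_fetch(host, path="/"):
    s = socket.socket()
    s.setblocking(False)
    s.connect_ex((host, 80))                          # non-blocking connect
    sel.register(s, selectors.EVENT_WRITE, (host, path, b""))

def event_loop():
    while sel.get_map():
        for key, events in sel.select(timeout=1):
            s, (host, path, buf) = key.fileobj, key.data
            if events & selectors.EVENT_WRITE:        # connected: send the request
                s.send(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
                sel.modify(s, selectors.EVENT_READ, (host, path, buf))
            elif events & selectors.EVENT_READ:
                chunk = s.recv(4096)
                if chunk:                             # keep accumulating the response
                    sel.modify(s, selectors.EVENT_READ, (host, path, buf + chunk))
                else:                                 # connection closed: serialized completion
                    sel.unregister(s); s.close()
                    print(host, len(buf), "bytes")
```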

  18. Page retrieval • Persistent connection • Multiple documents requested on a single connection • Feature of HTTP 1.1 • Reduce the number of HTTP connection setups • Used in IXE
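A minimal sketch of fetching several documents over one persistent HTTP/1.1 connection (the host and paths are made up):

```python
# Sketch: http.client keeps the TCP connection open between HTTP/1.1 requests.
import http.client

conn = http.client.HTTPConnection("example.com")    # a single connection setup
for path in ["/a.html", "/b.html", "/c.html"]:
    conn.request("GET", path)
    body = conn.getresponse().read()                 # read fully before the next request
    print(path, len(body), "bytes")
conn.close()
```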

  19. IXE Crawler

  20. IXE Parser • Problem: parsing requires 30% of execution time • Possible solution: distributed parsing

  21. IXE Parser • [Diagram] The Parser extracts citations (URL1, URL2, …) and looks them up in a local URL → DocID cache; misses are sent to the URL Table Manager (the “Crawler”), which resolves them against the <UrlInfo> table (URL → DocID)

  22. A distributed parser • [Diagram] The <UrlInfo> table is partitioned by hashing URLs across several Table Managers (Hash(URL1) → Manager 2, Hash(URL2) → Manager 1); each parser (Parser 1 … Parser N) keeps a URL → DocID cache, so a HIT is resolved locally while a MISS is routed to the owning manager, which returns the DocID (assigning a new DocID if needed); a scheduler dispatches pages to parsers (Sched() → Parser 1)

  23. A distributed parser • Does this solution scale? • High traffic on the main link • Suppose that: • Average page size = 10KB • Average out-links per page = 10 • URL size = 40 characters (40 bytes) • DocID size = 5 byte • X = throughput (pages per second) • N = number of parsers

  24. A distributed parser • Bandwidth for web pages: X pages/s * 10 * 1024 bytes per page * 8 (byte → bit) = 81920*X bps • Bandwidth for DocID messages (cache hit): (X/N pages per parser) * 10 out-links per page * (40+5) bytes per DocID request and reply * 8 (byte → bit) * N parsers = 3600*X bps • Total 85520*X bps: on a 100 Mbps link, X ≈ 1226 pages per second
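The figures can be checked with a few lines of Python (assuming 100 Mbps is taken as 100 * 1024 * 1024 bps, which is what reproduces the slide's 1226):

```python
# Quick check of the bandwidth estimate on slide 24.
page_bw = 10 * 1024 * 8          # bits per page on the main link        -> 81920
msg_bw  = 10 * (40 + 5) * 8      # bits of DocID requests+replies / page ->  3600
X = (100 * 1024 * 1024) / (page_bw + msg_bw)
print(round(X))                  # ~1226 pages per second
```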

  25. What we don’t want to fetch • Spider traps • Duplicates • 2.1 Different URLs for the same page • 2.2 Already visited URLs • 2.3 Same document on different sites • 2.4 Mirrors • At least 10% of the hosts are mirrored

  26. Spider traps • Spider trap: hyperlink graph constructed unintentionally or malevolently to keep a crawler trapped • Infinitely “deep” Web sites • Problem: using CGI it is possible to generate an infinite number of pages • Solution: check the URL length

  27. Spider traps • Large number of dummy pages • Example: http://www.troutbums.com/Flyfactory/flyfactory/flyfactory/hatchline/hatchline/flyfactory/hatchline/flyfactory/hatchline/flyfactory/flyfactory/flyfactory/hatchline/flyfactory/hatchline/ • Solution: disable crawling • a guard removes from consideration any URL from a site which dominates the collection
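A minimal sketch of the two guards from slides 26-27: rejecting over-long URLs and refusing further URLs from a site that dominates the collection. Both thresholds are made-up values, not taken from any cited crawler:

```python
# Sketch: simple spider-trap guards applied before a URL enters the frontier.
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LEN = 256           # guards against infinitely "deep" CGI-generated URLs
MAX_PAGES_PER_HOST = 10_000 # guards against a host dominating the collection
pages_per_host = Counter()

def admit(url):
    if len(url) > MAX_URL_LEN:
        return False
    host = urlsplit(url).hostname
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False
    pages_per_host[host] += 1
    return True
```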

  28. Avoid duplicates • Problem almost nonexistent in classic IR • Duplicate content • wastes resources (index space) • annoys users

  29. Virtual Hosting • Example: http://www.cocacola.com and http://www.coke.com both map to 129.33.45.163 • Problem: Virtual Hosting (a feature of HTTP 1.1) allows mapping different sites to a single IP address • Could be used to create duplicates • Solution: rely on canonical hostnames (CNAMEs) provided by DNS

  30. Already visited URLs • Problem: how to recognize an already visited URL ? • The page is reachable by many paths • We need an efficient Duplicate URL Eliminator

  31. Already visited URLs • Bloom Filter: probabilistic data structure for set membership testing • [Diagram] The URL is fed to n hash functions (hash function 1 … hash function n), each of which sets one bit in a bit vector • Problem: false positives (new URLs marked as already seen)
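A minimal Bloom filter sketch for this membership test; the n hash functions are simulated by salting a single MD5 hash, and the sizes are illustrative rather than tuned:

```python
# Sketch: Bloom filter over a bit vector for "have we seen this URL?".
import hashlib

class BloomFilter:
    def __init__(self, bits=8 * 1024 * 1024, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.vector = bytearray(bits // 8)            # the bit vector

    def _positions(self, url):
        for i in range(self.hashes):                  # simulates hash function 1..n
            h = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, url):
        for p in self._positions(url):
            self.vector[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):                      # may return a false positive
        return all(self.vector[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```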

  32. Already visited URLs • URL hashing: each URL is hashed to a digest (e.g. a 128-bit MD5) • Using a 64-bit hash function, a billion URLs require 8GB • Does not fit in memory • Using the disk limits the crawling rate to 75 downloads per second

  33. Already visited URLs • Two-level hash function: 40 bits for hostname+port, 24 bits for the path • The crawler is likely to explore URLs within the same site • Relative URLs create a spatiotemporal locality of access • Exploit this kind of locality using a cache
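A minimal sketch of the two-level fingerprint combined with a small cache; MD5 and the in-memory set are stand-ins for whatever hash function and backing store a real crawler would use:

```python
# Sketch: 40-bit hostname+port hash concatenated with a 24-bit path hash.
import hashlib
from urllib.parse import urlsplit

def fingerprint(url):
    parts = urlsplit(url)
    host = f"{parts.hostname}:{parts.port or 80}"
    h_host = int.from_bytes(hashlib.md5(host.encode()).digest()[:5], "big")        # 40 bits
    h_path = int.from_bytes(hashlib.md5(parts.path.encode()).digest()[:3], "big")  # 24 bits
    return (h_host << 24) | h_path

seen_cache = set()        # recently seen fingerprints; same-site URLs hit this cache

def already_visited(url):
    fp = fingerprint(url)
    if fp in seen_cache:
        return True
    seen_cache.add(fp)    # a real crawler would also probe the on-disk structure here
    return False
```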

  34. Content based techniques • Problem: how to recognize duplicates based on the page contents? • Edit distance • Number of replacements required to transform one document into the other • Cost: l1*l2, where l1 and l2 are the lengths of the documents: impractical!

  35. Content based techniques • Hashing • A digest associated with each crawled page • Used in Mercator • Cost: one seek in the index for each new crawled page • Problem: pages could have minor syntactic differences! • site maintainer’s name, latest update • anchors modified • different formatting

  36. Content based techniques • Shingling • Shingle (or q-gram): contiguous subsequence of tokens taken from document d • representable by a fixed length integer • w-shingle: shingle of width w • S(d,w): w-shingling of document d • unordered set of distinct w-shingles contained in document d

  37. Content based techniques • Sentence: a rose is a rose is a rose • Tokens: a rose is a rose is a rose • 4-shingles: (a,rose,is,a) (rose,is,a,rose) (is,a,rose,is) (a,rose,is,a) (rose,is,a,rose) • S(d,4): {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}

  38. Content based techniques • Each token = 32 bits • w = 10 (a suitable value) • Each w-shingle = 320 bits • S(d,10) = set of 320-bit numbers • We can hash the w-shingles and keep 500 bytes of digests for each document

  39. Content based techniques • Resemblance of documents d1 and d2: the Jaccard coefficient r(d1,d2) = |S(d1,w) ∩ S(d2,w)| / |S(d1,w) ∪ S(d2,w)| • Eliminate pages that are too similar (pages whose resemblance value is close to 1)
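A minimal sketch of w-shingling and the resemblance test from slides 36-39; whitespace tokenization and the 0.9 threshold are illustrative assumptions:

```python
# Sketch: w-shingling and Jaccard resemblance between two documents.
def shingles(text, w=4):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(d1, d2, w=4):
    s1, s2 = shingles(d1, w), shingles(d2, w)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

d = "a rose is a rose is a rose"
print(shingles(d))                 # the three distinct 4-shingles from slide 37
print(resemblance(d, d) > 0.9)     # near-duplicates have resemblance close to 1
```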

  40. Mirrors • URL structure: access method + hostname + path (e.g. http://www.research.digital.com/SRC/) • Precision = relevant retrieved docs / retrieved docs

  41. Mirrors • URL String based • Vector Space model: term vector matching to compute the likelihood that a pair of hosts are mirrors • terms with df(t) < 100

  42. Mirrors • Hostname matching (precision: 27%) • Terms: substrings of the hostname • Term weighting: len(t) = number of segments obtained by breaking the term at ‘.’ characters • This weighting favours substrings composed of many segments, which are very specific

  43. Mirrors • Full path matching (precision: 59%) • Terms: entire paths • Term weighting: uses mdf = max df(t) over t ∈ collection • Connectivity based filtering stage (adds 19% precision): • Idea: mirrors share many common paths • Test for each common path whether it has the same set of out-links on both hosts • Remove hostnames from local URLs

  44. Mirrors • Positional word bigram matching (precision: 72%) • Terms creation: • Break the path into a list of words by treating ‘/’ and ‘.’ as breaks • Eliminate non-alphanumeric characters • Replace digits with ‘*’ (effect similar to stemming) • Combine successive pairs of words in the list • Append the ordinal position of the first word

  45. Mirrors • Positional Word Bigrams example: conferences/d299/advanceprogram.html → words: conferences d* advanceprogram html → terms: conferences_d*_0 d*_advanceprogram_1 advanceprogram_html_2
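A minimal sketch of the term-creation recipe from slide 44; the '_' join character is an assumption:

```python
# Sketch: build positional word bigram terms from a URL path.
import re

def positional_word_bigrams(path):
    words = [w for w in re.split(r"[/.]", path) if w]          # break at '/' and '.'
    words = [re.sub(r"[^A-Za-z0-9*]", "",                      # drop non-alphanumerics
                    re.sub(r"\d+", "*", w)) for w in words]    # replace digits with '*'
    return [f"{a}_{b}_{i}" for i, (a, b) in enumerate(zip(words, words[1:]))]

print(positional_word_bigrams("conferences/d299/advanceprogram.html"))
# ['conferences_d*_0', 'd*_advanceprogram_1', 'advanceprogram_html_2']
```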

  46. Mirrors • Host connectivity based (precision: 45%) • Consider all documents on a host as a single large document • Graph: • host → node • document on host A pointing to a document on host B → directed edge from A to B • Idea: two hosts are likely to be mirrors if their nodes point to the same nodes • Term vector matching • Terms: set of nodes that a host’s node points to

  47. References S. Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data, Morgan Kaufmann, 2002. Pages 17-43, 71-72. S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine, Proceedings of the 7th World Wide Web Conference (WWW7), 1998. A. Heydon and M. Najork, Mercator: A scalable, extensible Web crawler, World Wide Web Conference, 1999. K. Bharat, A. Broder, J. Dean, M. R. Henzinger, A comparison of Techniques to Find Mirrored Hosts on the WWW, Journal of the American Society for Information Science, 2000.

  48. References A. Heydon and M. Najork, High Performance Web Crawling, SRC Research Report 173, Compaq Systems Research Center, 26 September 2001. R. C. Miller and K. Bharat, SPHINX: a framework for creating personal, site-specific web crawlers, Proceedings of the 7th World-Wide Web Conference, 1998. D. Zeinalipour-Yazti and M. Dikaiakos, Design and Implementation of a Distributed Crawler and Filtering Processor, Proceedings of the 5th Workshop on Next Generation Information Technologies and Systems (NGITS 2002), June 2002.
