
Web Crawling and Automatic Discovery


Presentation Transcript


  1. Web Crawling and Automatic Discovery Donna Bergmark Cornell Information Systems bergmark@cs.cornell.edu CS502 Web Information Systems

  2. Web Resource Discovery • Finding info on the Web • Surfing (random strategy; goal is serendipity) • Searching (inverted indices; specific info) • Crawling (follow links; “all” the info) • Uses for crawling • Find stuff • Gather stuff • Check stuff CS502 Web Information Systems

  3. Definition Spider = robot = crawler Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web. CS502 Web Information Systems

  4. Crawlers and internet history • 1991: HTTP • 1992: 26 servers • 1993: 60+ servers; self-register; archie • 1994 (early) – first crawlers • 1996 – search engines abound • 1998 – focused crawling • 1999 – web graph studies • 2002 – use for digital libraries CS502 Web Information Systems

  5. So, why not write a robot? You’d think a crawler would be easy to write: • Pick up the next URL • Connect to the server • GET the URL • When the page arrives, get its links (optionally do other stuff) • REPEAT CS502 Web Information Systems
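That naive loop, sketched below in Python with only the standard library. The seed URL, the page cap, and the regex-based link extraction are illustrative choices, not part of the deck:

```python
# Naive crawler loop: pick a URL, fetch it, harvest its links, repeat.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

frontier = deque(["http://example.com/"])   # seed URL (placeholder)
seen = set(frontier)

while frontier and len(seen) < 100:         # cap so the sketch terminates
    url = frontier.popleft()                # pick up the next URL
    try:
        with urlopen(url, timeout=10) as page:          # connect, GET the URL
            html = page.read().decode("utf-8", errors="replace")
    except OSError:
        continue                            # unreachable page: move on
    for href in re.findall(r'href="([^"]+)"', html):    # get its links
        link = urljoin(url, href)
        if link not in seen:
            seen.add(link)
            frontier.append(link)           # REPEAT on every new link
```

The next slides show why this naive version is not enough: fetching, duplicate detection, politeness, and traps all need real handling.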

  6. The Central Crawler Function [Diagram: per-server URL queues (Server 1, Server 2, Server 3) feed the fetcher] • URL -> IP address via DNS • Connect a socket to the server; send HTTP request • Wait for the response: an HTML page CS502 Web Information Systems
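A minimal sketch of that fetch step at the socket level, assuming plain HTTP/1.0 on port 80 to keep it short (production crawlers use persistent HTTP/1.1 connections and asynchronous DNS):

```python
# Resolve the host, open a socket, send the request, collect the response.
import socket

def fetch(host, path="/"):
    ip = socket.gethostbyname(host)                  # URL -> IP address via DNS
    with socket.create_connection((ip, 80), timeout=10) as sock:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))        # send HTTP request
        chunks = []
        while True:                                  # wait for the response
            data = sock.recv(4096)
            if not data:                             # server closed: done
                break
            chunks.append(data)
    return b"".join(chunks)                          # headers + HTML page

print(fetch("example.com")[:200])
```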

  7. Handling the HTTP Response [Diagram: FETCH → document seen before? → if no, process this document: extract text, extract links] CS502 Web Information Systems
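One way to implement that pipeline, assuming a fingerprint set for the seen-before test (Mercator's actual content-seen test is more elaborate) and an HTMLParser subclass for extraction:

```python
# Fingerprint the body to answer "document seen before?",
# then pull out the text and the links.
import hashlib
from html.parser import HTMLParser

class PageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []

    def handle_data(self, data):              # extract text
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):    # extract links
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

seen_fingerprints = set()

def process(html):
    digest = hashlib.sha1(html.encode()).hexdigest()
    if digest in seen_fingerprints:           # seen before: skip it
        return None
    seen_fingerprints.add(digest)
    parser = PageParser()
    parser.feed(html)
    return parser.text, parser.links
```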

  8. LINK Extraction • Finding the links is easy (sequential scan) • Need to clean them up and canonicalize them • Need to filter them • Need to check for robot exclusion • Need to check for duplicates CS502 Web Information Systems
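A sketch of the cleanup and canonicalization steps (robot exclusion is shown on slide 11; the canonical form chosen here — resolve relative links, drop fragments, lower-case the host, filter non-HTTP schemes — is one reasonable convention, not the only one):

```python
# Clean up a raw href relative to the page it came from.
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

def canonicalize(base_url, href):
    url, _fragment = urldefrag(urljoin(base_url, href))  # absolute, no #frag
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):            # filter mailto:, javascript:
        return None
    host = parts.netloc.lower()                          # host is case-insensitive
    return urlunsplit((parts.scheme, host, parts.path or "/",
                       parts.query, ""))

raw = ["../a.html", "b.html#sec2", "mailto:x@y.z", "B.HTML"]
base = "http://Example.COM/dir/"
links = {u for h in raw if (u := canonicalize(base, h)) is not None}  # set dedupes
```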

  9. Update the Frontier [Diagram: FETCH → PROCESS → newly extracted URLs (URL1, URL2, URL3, …) are appended to the FRONTIER, which feeds the next FETCH] CS502 Web Information Systems

  10. Crawler Issues • System Considerations • The URL itself • Politeness • Visit Order • Robot Traps • The hidden web CS502 Web Information Systems

  11. Standard for Robot Exclusion • Martijn Koster (1994) • http://any-server:80/robots.txt • Maintained by the webmaster • Forbid access to pages, directories • Commonly excluded: /cgi-bin/ • Adherence is voluntary for the crawler CS502 Web Information Systems
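Checking the standard from a crawler is one call with Python's standard library (the user-agent string and URLs below are placeholders):

```python
# Fetch robots.txt and ask whether a given URL may be crawled.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.com/robots.txt")
rp.read()                                        # download and parse the file
ok = rp.can_fetch("MyCrawler/1.0", "http://example.com/cgi-bin/query")
print("allowed" if ok else "excluded")           # honoring the answer is voluntary
```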

  12. Visit Order • The frontier • Breadth-first: FIFO queue • Depth-first: LIFO queue • Best-first: Priority queue • Random • Refresh rate CS502 Web Information Systems
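The visit order falls straight out of the frontier's data structure; a sketch of the choices above (URLs and scores are placeholders):

```python
# One frontier, four disciplines.
from collections import deque
import heapq, random

frontier = deque(["url1", "url2", "url3"])
first = frontier.popleft()            # breadth-first: FIFO queue
last = frontier.pop()                 # depth-first: LIFO queue (stack)

scored = [(-0.9, "url4"), (-0.4, "url5")]   # best-first: priority queue
heapq.heapify(scored)                       # scores negated for a max-heap
_, best = heapq.heappop(scored)

urls = ["url6", "url7", "url8"]
random.shuffle(urls)                  # random order
```

Refresh rate is orthogonal to ordering: one common approach is to re-enqueue already-visited pages on a schedule rather than fetching each URL exactly once.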

  13. Robot Traps • Cycles in the Web graph • Infinite links on a page • Traps set out by the Webmaster CS502 Web Information Systems
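There is no general defense against traps, but cheap guards catch most of them; a sketch with illustrative limits (none of these numbers come from the slides):

```python
# Reject URLs that look like trap output before they enter the frontier.
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH, MAX_URL_LEN, MAX_PER_HOST = 10, 256, 1000
pages_per_host = Counter()

def safe_to_visit(url, depth):
    if depth > MAX_DEPTH:                        # cycles, endless link chains
        return False
    if len(url) > MAX_URL_LEN:                   # ever-growing generated URLs
        return False
    host = urlsplit(url).netloc
    if pages_per_host[host] >= MAX_PER_HOST:     # one site spewing pages
        return False
    pages_per_host[host] += 1
    return True
```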

  14. The Hidden Web • Dynamic pages increasing • Subscription pages • Username and password pages • Research in progress on how crawlers can “get into” the hidden web CS502 Web Information Systems

  15. MERCATOR CS502 Web Information Systems

  16. Mercator Features • One file configures a crawl • Written in Java • Can add your own code • Extend one or more of Mercator’s base classes • Add totally new classes called by your own code • Industrial-strength crawler: • uses its own DNS and java.net packages CS502 Web Information Systems

  17. The Web is a BIG Graph • “Diameter” of the Web • Cannot crawl even the static part, completely • New technology: the focused crawl CS502 Web Information Systems

  18. Crawling and Crawlers • The Web overlays the internet • A crawl overlays the Web [Diagram: a crawl growing outward from a seed page] CS502 Web Information Systems

  19. Focused Crawling CS502 Web Information Systems

  20. Focused Crawling [Diagram: two crawl trees from root R. A breadth-first crawl visits pages 1–7 in level order; a focused crawl prunes off-topic pages (marked X) and visits only pages 1–5] CS502 Web Information Systems

  21. Focused Crawling [Diagram repeated from slide 20] • Recall the cartoon for a focused crawl • A simple way to do it is with 2 “knobs” CS502 Web Information Systems

  22. Focusing the Crawl • Threshold: page is on-topic if correlation to the closest centroid is above this value • Cutoff: follow links from pages whose “distance” from closest on-topic ancestor is less than this value CS502 Web Information Systems
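A sketch of the two knobs, assuming cosine correlation over term-weight vectors (the function names and the example THRESHOLD and CUTOFF values are illustrative):

```python
# Knob 1: THRESHOLD decides whether a page is on-topic.
# Knob 2: CUTOFF bounds how far to tunnel through off-topic pages.
import math

THRESHOLD, CUTOFF = 0.3, 1     # example settings (slide 23 uses cutoff = 1)

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def on_topic(page_vec, centroids):
    # correlation to the closest centroid must clear the threshold
    return max(cosine(page_vec, c) for c in centroids) >= THRESHOLD

def distance(parent_distance, page_is_on_topic):
    # distance to the closest on-topic ancestor: resets when on-topic
    return 0 if page_is_on_topic else parent_distance + 1

def follow_links(d):
    return d < CUTOFF          # prune once the distance reaches the cutoff
```

With cutoff = 1, links are followed only from pages at distance 0, so the crawl still fetches the children of on-topic pages but stops one hop past the last on-topic ancestor, which is the tunneling behavior illustrated on the next slide.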

  23. Illustration [Diagram: a crawl tree with Corr >= threshold marking on-topic pages and Cutoff = 1; links are followed at most one hop past the last on-topic ancestor, and nodes beyond that (marked X) are pruned] CS502 Web Information Systems

  24. [Chart: “Closest” vs. “Furthest”] CS502 Web Information Systems

  25. Correlation vs. Crawl Length CS502 Web Information Systems

  26. Fall 2002 Student Project [Architecture diagram with components: Query → Centroid; Centroids and Dictionary; Term vectors; Collection URLs; Mercator; Chebyshev polynomials; HTML Collection Description] CS502 Web Information Systems

  27. Conclusion • We covered crawling – history, technology, deployment • Focused crawling with tunneling • We have a good experimental setup for exploring automatic collection synthesis CS502 Web Information Systems

  28. http://mercator.comm.nsdlib.org CS502 Web Information Systems
