
Crawlers and Crawling Strategies


Presentation Transcript


  1. Crawlers and Crawling Strategies CSCI 572: Information Retrieval and Search Engines Summer 2011

  2. Outline • Crawlers • Web • File-based • Characteristics • Challenges

  3. Why Crawling? • Origins were in the web • Web is a big “spiderweb”, so like a “spider”, crawl it • Focused approach to navigating the web • It’s not just visiting all pages at once • …or randomly • There needs to be a sense of purpose • Some pages are more important or different than others • Content-driven • Different crawlers for different purposes

  4. Different classifications of Crawlers • Whole-web crawlers • Must deal with different concerns than more focused vertical crawlers, or content-based crawlers • Politeness, ability to handle any and all protocols defined in the URL space • Deal with URL filtering, freshness and recrawling strategies • Examples: Heritrix, Nutch, Bixo, crawler-commons, clever uses of wget and curl, etc.

  5. Different classifications of Crawlers • File-based crawlers • Don’t require an understanding of protocol negotiation – that’s a hard problem in its own right! • Assume that the content is already local • Uniqueness is in the methodology for • File identification and selection • Ingestion methodology • Examples: OODT CAS, scripting (ls/grep/UNIX), internal appliances (Google), Spotlight

  6. Web-scale Crawling • What do you have to deal with? • Protocol negotiation • How do you get data from FTP, HTTP, SMTP, HDFS, RMI, CORBA, SOAP, BitTorrent, ed2k URLs? • Build a flexible protocol layer like Nutch did? • Determination of which URLs are important or not • Whitelists • Blacklists • Regular expressions
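
Whitelists and blacklists are typically just ordered lists of regular expressions applied to each candidate URL, much in the spirit of Nutch's regex URL filtering. The sketch below illustrates the idea in plain Java; the example.edu patterns are made up for illustration.

    import java.util.List;
    import java.util.regex.Pattern;

    // Minimal sketch of regex-based URL filtering: reject anything on the
    // blacklist, then accept only what some whitelist rule matches.
    public class UrlFilter {
        private final List<Pattern> whitelist;
        private final List<Pattern> blacklist;

        public UrlFilter(List<Pattern> whitelist, List<Pattern> blacklist) {
            this.whitelist = whitelist;
            this.blacklist = blacklist;
        }

        /** A URL is accepted if it matches no blacklist rule and at least one whitelist rule. */
        public boolean accept(String url) {
            for (Pattern p : blacklist) {
                if (p.matcher(url).find()) return false;
            }
            for (Pattern p : whitelist) {
                if (p.matcher(url).find()) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            UrlFilter filter = new UrlFilter(
                List.of(Pattern.compile("^https?://([a-z0-9.-]+\\.)?example\\.edu/")),
                List.of(Pattern.compile("\\.(jpg|gif|zip|exe)$")));
            System.out.println(filter.accept("https://www.example.edu/papers/index.html")); // true
            System.out.println(filter.accept("https://www.example.edu/data/archive.zip"));  // false
        }
    }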

  7. Politeness • How do you take into account that web servers and Internet providers can and will • Block you after a certain # of concurrent attempts • Block you if you ignore their crawling preferences codified in, e.g., a robots.txt file • Block you if you don’t specify a User Agent • Identify you based on • Your IP • Your User Agent
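
A minimal sketch of the two politeness basics above: declare a User Agent on every request and honor robots.txt before fetching. A production crawler would use a full robots.txt parser (e.g. crawler-commons); this simplified version only looks at Disallow lines and ignores per-agent groups, and the agent name and site are hypothetical.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RobotsCheck {
        // Made-up agent string for illustration: identify yourself and how to reach you.
        static final String USER_AGENT = "csci572-demo-bot/0.1 (+mailto:student@example.edu)";

        public static boolean isAllowed(String siteRoot, String path) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder(URI.create(siteRoot + "/robots.txt"))
                    .header("User-Agent", USER_AGENT)   // declare who you are on every request
                    .build();
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() != 200) return true;  // no robots.txt found: assume allowed

            for (String line : resp.body().split("\r?\n")) {
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:")) {
                    String prefix = line.substring("disallow:".length()).trim();
                    if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
                }
            }
            return true;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(isAllowed("https://www.example.edu", "/private/reports"));
        }
    }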

  8. Politeness • Queuing is very important • Maintain host-specific crawl patterns and policies • Sub-collection based using regex • Threading and brute force are your enemy • Respect robots.txt • Declare who you are
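
Host-specific queuing can be as simple as one queue per host plus a timestamp recording when that host may next be fetched. The sketch below is a single-threaded simplification; real fetchers such as Nutch's add threading, robots crawl-delay, and retry policy on top of this bookkeeping.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    // Per-host politeness queue: a host only becomes eligible again after delayMillis.
    public class PoliteQueue {
        private final Map<String, Queue<String>> perHost = new HashMap<>();
        private final Map<String, Long> nextFetchTime = new HashMap<>();
        private final long delayMillis;

        public PoliteQueue(long delayMillis) { this.delayMillis = delayMillis; }

        public void add(String host, String url) {
            perHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }

        /** Return the next URL whose host is past its politeness delay, or null if none is eligible. */
        public String poll() {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Queue<String>> e : perHost.entrySet()) {
                String host = e.getKey();
                if (now >= nextFetchTime.getOrDefault(host, 0L) && !e.getValue().isEmpty()) {
                    nextFetchTime.put(host, now + delayMillis);   // push this host's next slot out
                    return e.getValue().poll();
                }
            }
            return null;
        }
    }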

  9. Crawl Scheduling • When and where should you crawl? • Based on URL freshness within some N-day cycle? • Relies on unique identification of URLs and approaches for that • Based on per-site policies? • Some sites are less busy at certain times of the day • Some sites are on higher bandwidth connections than others • Profile this? • Adaptive fetching/scheduling • Deciding the above on the fly while crawling • Regular fetching/scheduling • Profiling the above and storing it away in policy/config
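
The two scheduling styles on this slide can be sketched as a fixed N-day cycle check versus an adaptive interval that shrinks when the page changed on the last fetch and grows when it did not. The 0.5/1.5 factors below are arbitrary illustration values, not from the slides.

    import java.time.Duration;
    import java.time.Instant;

    public class RecrawlSchedule {
        /** Regular scheduling: a URL is due once its N-day cycle has elapsed. */
        public static boolean isDue(Instant lastFetch, int cycleDays) {
            return Instant.now().isAfter(lastFetch.plus(Duration.ofDays(cycleDays)));
        }

        /** Adaptive scheduling: revisit changed pages sooner, back off on stable ones. */
        public static Duration nextInterval(Duration current, boolean pageChanged) {
            return pageChanged
                    ? Duration.ofMillis((long) (current.toMillis() * 0.5))   // changed: shrink interval
                    : Duration.ofMillis((long) (current.toMillis() * 1.5));  // unchanged: grow interval
        }
    }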

  10. Data Transfer • Download in parallel? • Download sequentially? • What to do with the data once you’ve crawled it – is it cached temporarily or persisted somewhere?
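
A hedged sketch of the parallel option: fetch with a bounded thread pool and persist each page straight to a local segment directory rather than holding it in memory. The URLs and the "segments" directory are placeholders for illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelFetcher {
        public static void main(String[] args) throws Exception {
            List<String> urls = List.of("https://www.example.edu/a.html",
                                        "https://www.example.edu/b.html");
            Files.createDirectories(Path.of("segments"));         // local persistence area
            HttpClient client = HttpClient.newHttpClient();
            ExecutorService pool = Executors.newFixedThreadPool(4); // bound the concurrency

            for (String url : urls) {
                pool.submit(() -> {
                    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
                    Path out = Path.of("segments", Integer.toHexString(url.hashCode()) + ".html");
                    client.send(req, HttpResponse.BodyHandlers.ofFile(out)); // stream body to disk
                    return out;
                });
            }
            pool.shutdown();   // let in-flight downloads finish, then exit
        }
    }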

  11. Identification of Crawl Path • Uniform Resource Locators • Inlinks • Outlinks • Parsed data • Source of inlinks, outlinks • Identification of URL protocol scheme/path • Deduplication
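
Outlink extraction plus deduplication can be sketched as: pull hrefs out of the fetched page, resolve them against the page's own URL, normalize, and collapse duplicates in a set. The naive regex below stands in for a real HTML parser (Nutch delegates this to its parse plugins); the sample markup is made up.

    import java.net.URI;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class OutlinkExtractor {
        private static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)");

        public static Set<String> outlinks(String baseUrl, String html) {
            Set<String> links = new LinkedHashSet<>();   // the set gives deduplication for free
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                URI resolved = URI.create(baseUrl).resolve(m.group(1)); // relative -> absolute
                links.add(resolved.normalize().toString());
            }
            return links;
        }

        public static void main(String[] args) {
            String html = "<a href='/about'>About</a> <a href=\"/about#team\">Team</a>";
            System.out.println(outlinks("https://www.example.edu/", html));
            // both anchors collapse to https://www.example.edu/about
        }
    }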

  12. File-based Crawlers • Crawling remote content, getting politeness down, dealing with protocols, and scheduling is hard! • Let some other component do that for you • CAS PushPull is a great example • Staging areas, delivery protocols • Once you have the content, there is still interesting crawling strategy to apply

  13. What’s hard? The file is already here • Identification of which files are important, and which aren’t • Content detection and analysis • MIME type, URL/filename regex, MAGIC detection, XML root characters detection, combinations of them • Apache Tika • Mapping of identified file types to mechanisms for extracting content and ingesting it
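
The "detect, then map to a mechanism" step might look like the sketch below: Apache Tika's facade guesses the MIME type from the name, magic bytes, and content, and a lookup table routes the file to an extraction/ingestion handler. The handler names and the file path are hypothetical placeholders, not part of Tika.

    import java.io.File;
    import java.io.IOException;
    import java.util.Map;
    import org.apache.tika.Tika;

    public class DetectAndRoute {
        // Hypothetical mapping from detected type to an ingestion mechanism.
        private static final Map<String, String> HANDLERS = Map.of(
                "application/pdf", "pdf-text-extractor",
                "text/html",       "html-parser",
                "image/jpeg",      "exif-metadata-extractor");

        public static void main(String[] args) throws IOException {
            Tika tika = new Tika();
            File f = new File("/data/staging/report.pdf");        // hypothetical staged file
            String mimeType = tika.detect(f);                     // e.g. "application/pdf"
            String handler = HANDLERS.getOrDefault(mimeType, "skip");
            System.out.println(mimeType + " -> " + handler);
        }
    }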

  14. Quick intro to content detection • By URL, or file name • People codified classification into URLs or file names • Think file extensions • By MIME Magic • Think digital signatures • By XML schemas, classifications • Not all XML is created equally • By combinations of the above
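
To make two of these signals concrete, here is a hedged, hand-rolled illustration of detection by file extension and by MIME magic (the first bytes of the file); Tika combines these signals, plus XML root-element inspection, behind a single detect() call.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class SimpleDetect {
        private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F'};   // "%PDF"
        private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};       // "PK\x03\x04"

        public static String detect(Path file) throws IOException {
            // Signal 1: classification codified in the file name / extension
            if (file.getFileName().toString().endsWith(".html")) return "text/html";

            // Signal 2: magic bytes ("digital signature") at the start of the stream
            try (InputStream in = Files.newInputStream(file)) {
                byte[] head = in.readNBytes(4);
                if (Arrays.equals(head, PDF_MAGIC)) return "application/pdf";
                if (Arrays.equals(head, ZIP_MAGIC)) return "application/zip";
            }
            return "application/octet-stream";   // fall back to a generic type
        }
    }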

  15. Case Study: OODT CAS • Set of components for science data processing • Deals with file-based crawling

  16. File-based Crawler Types • Auto-detect • Met Extractor • Std Product Crawler

  17. Other Examples of File Crawlers • Spotlight • Indexing your hard drive on Mac and making it readily available for fast free-text search • Involves CAS/Tika-like interactions • Scripting with ls and grep • You may find yourself doing this to run batch processing rapidly • Don’t encode the data transfer into the script! • It mixes concerns
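
The ls/grep-style crawl, written as a small program instead of a shell pipeline: walk a staging directory that some other component (a delivery protocol, CAS PushPull, rsync, ...) has already populated, and select only the files of interest. The staging path and filename pattern are assumptions for illustration; note that the crawl itself does no data transfer, keeping that concern separate.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class LocalFileCrawl {
        public static void main(String[] args) throws IOException {
            Path stagingArea = Path.of("/data/staging");   // hypothetical drop directory
            try (Stream<Path> files = Files.walk(stagingArea)) {
                files.filter(Files::isRegularFile)
                     .filter(p -> p.getFileName().toString().matches(".*\\.(dat|xml)$"))
                     .forEach(p -> System.out.println("ingest candidate: " + p));
            }
        }
    }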

  18. Challenges • Reliability • If a web-scale crawl fails part-way through, how do you recover? • Scalability • Web-based vs. file-based • Commodity versus appliance • Google or build your own • Separation of concerns • Separate processing from ingestion from acquisition

  19. Wrapup • Crawling is a canonical piece of a search engine • Its utility is seen in data systems across the board • Determine what your acquisition strategy is vis-à-vis your processing and ingestion strategy • Separate and insulate • Identify content flexibly
