

  1. (Web) Crawlers Domain. Presented by: Or Shoham, Amit Yaniv, Guy Kroupp, Saar Kohanovitch. Crawlers - Presentation 2 - April 2008

  2. Crawlers
  1. Crawlers: Background
  2. Unified Domain Model
  3. Individual Applications
     3.1 WebSphinx
     3.2 WebLech
     3.3 Grub
     3.4 Aperture
  4. Summary and Conclusions

  3. Crawlers – Background
  • What is a crawler?
    • Collects information about internet pages
    • Near-infinite number of web pages, no central directory
    • Uses links contained within pages to discover new pages to visit
  • How do crawlers work?
    • Pick a starting-page URL (a seed)
    • Load the starting page from the internet
    • Find all links in the page and enqueue them
    • Extract any desired information from the page
    • Loop
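
The loop above can be sketched in a few lines of Java. This is an illustrative, self-contained sketch (the "internet" is simulated by an in-memory map, and all names are hypothetical), not any particular crawler's implementation:

```java
import java.util.*;

// Minimal sketch of the crawl loop: seed, fetch, extract links, enqueue, repeat.
public class CrawlLoopSketch {
    // Simulated "internet": each URL maps to the links found on that page.
    static final Map<String, List<String>> WEB = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("a", "c"),
        "c", List.of());

    // Crawl from a seed URL; returns pages in visit order.
    public static List<String> crawl(String seed) {
        List<String> visited = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>();   // the URL queue
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();                           // pick next URL
            List<String> links = WEB.getOrDefault(url, List.of());  // "fetch" + "parse"
            visited.add(url);                                       // process page data here
            for (String link : links)                               // enqueue unseen links
                if (seen.add(link)) frontier.add(link);
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("a"));   // [a, b, c]
    }
}
```

The `seen` set is what keeps the loop finite: without it, pages "a" and "b", which link to each other, would be enqueued forever.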

  4. Crawlers – Background
  • Rules that apply across the domain:
    • All crawlers have a URL Fetcher
    • All crawlers have a Parser (Extractor)
    • All crawlers have a Crawler Manager
    • All crawlers have a Queue structure
    • Crawlers are multi-threaded processes
  • Strongly related to the search-engine domain
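
The multi-threading rule above can be made concrete: worker threads share one URL queue managed by a single coordinator. A minimal sketch using the JDK's concurrency utilities (all names are illustrative, not taken from any of the four applications):

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the domain's multi-threading rule: a pool of workers drains a
// shared, thread-safe URL queue under the control of one "Crawler Manager".
public class WorkerPoolSketch {
    public static Set<String> crawlAll(List<String> urls, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(urls); // shared URL queue
        Set<String> fetched = ConcurrentHashMap.newKeySet();           // thread-safe results
        ExecutorService pool = Executors.newFixedThreadPool(threads);  // the "Crawler Manager"
        for (int i = 0; i < threads; i++)
            pool.submit(() -> {
                String url;
                while ((url = queue.poll()) != null)  // drain the queue
                    fetched.add(url);                 // a real worker would fetch + parse here
            });
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println(crawlAll(List.of("a", "b", "c", "d"), 2).size()); // 4
    }
}
```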

  5. Unified Domain Class Diagram (common features; classes marked * were added by code modeling)
  Classes: SpiderConfig, Scheduler, Spider, ExternalDB, Merger, Queue, Thread, DB, Robots, StorageManager, Extractor, Filter, Fetcher, PageData, CrawlerHelper

  6. Unified Domain Sequence Diagram. Phases: pre-crawling; then, inside the main loop: pre-fetching, fetching and extracting, post-processing (some objects optional); finally, finish crawling.

  7. Unified Domain - Applications
  • For the User Modeling group, the applications were the first chance to see things in practice
  • For the entire group, the applications provided a fresh view of the domain, which led to many changes (Assignment 2)
  • With everyone viewing the applications in the domain context, most differences were explained as being application-specific
  • Interesting experiment: let a new Code Modeling group use the applications as the basis for the domain?

  8. WebSphinx
  • WebSphinx: Website-Specific Processors for HTML INformation eXtraction (2002)
  • The WebSphinx class library provides support for writing web crawlers in Java
  • Designation: small-scope crawls for mirroring, offline viewing, and hyperlink trees
  • Extensible, e.g. to saving information about page elements
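
The "extensible" point is the key design idea: the library fixes the crawl loop and the user subclasses it, overriding hooks. WebSphinx's documented pattern is of this kind (subclassing its crawler class and overriding visit/should-visit hooks). The sketch below shows the pattern in self-contained form with hypothetical names, not the real WebSphinx API:

```java
import java.util.*;

// Template-method pattern: the base class owns the crawl loop,
// subclasses override the filtering and processing hooks.
public class ExtensibleCrawlerSketch {
    static abstract class BaseCrawler {
        abstract boolean shouldVisit(String url);   // filter hook
        abstract void visit(String url);            // processing hook
        void run(List<String> urls) {               // fixed loop, not overridden
            for (String u : urls)
                if (shouldVisit(u)) visit(u);
        }
    }

    // Example subclass: keeps only HTML pages, as a mirroring crawler might.
    static class HtmlOnlyCrawler extends BaseCrawler {
        final List<String> saved = new ArrayList<>();
        boolean shouldVisit(String url) { return url.endsWith(".html"); }
        void visit(String url) { saved.add(url); }  // a real mirror would write to disk
    }

    public static List<String> demo() {
        HtmlOnlyCrawler c = new HtmlOnlyCrawler();
        c.run(List.of("a.html", "logo.png", "b.html"));
        return c.saved;
    }

    public static void main(String[] args) {
        System.out.println(demo());   // [a.html, b.html]
    }
}
```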

  9. WebSphinx Hyperlink Tree

  10. WebSphinx
  Mapping to the domain model: Settings, Robots, Filters, Spider, Queue (Configuration), Mirror, Extractor, Fetcher, PageData, StorageManager, Thread, Link, Scheduler, Element
  • Link: a link is a type of element, usually <A HREF=""></A>, which points to a specific page or file. Storing information about each link relative to our seeds can help us analyze results
  • Mirror: a collection of files (pages) intended to provide a perfect copy of another website
  • Element: web pages are composed of many elements (<element></element>). Elements can be nested (for example, <body> will have many child elements)
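
The Link definition above can be made concrete with a toy extractor that pulls HREF targets out of `<A>` elements. This is a deliberately naive regex sketch for illustration only; real HTML parsers (including WebSphinx's) handle far more than this:

```java
import java.util.*;
import java.util.regex.*;

// Toy link extractor: finds href="..." attributes of <a> elements.
public class LinkExtractorSketch {
    static final Pattern HREF =
        Pattern.compile("<a\\s+href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find())
            links.add(m.group(1));   // the URL inside the quotes
        return links;
    }

    public static void main(String[] args) {
        System.out.println(
            extractLinks("<p><a href=\"/x\">x</a> <A HREF=\"/y\">y</A></p>"));
        // [/x, /y]
    }
}
```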

  11. WebSphinx

  12. WebLech
  • WebLech allows you to "spider" a website and recursively download all the pages on it.

  13. WebLech
  • WebLech is a fully featured website download/mirror tool in Java, which supports:
    • downloading websites
    • emulating standard web-browser behavior
  • WebLech is multithreaded and will feature a GUI console

  14. WebLech
  • Open-source MIT license means it's totally free and you can do what you want with it
  • Pure Java code means you can run it on any Java-enabled computer
  • Multi-threaded operation for downloading lots of files at once
  • Supports basic HTTP authentication for accessing password-protected sites
  • HTTP referrer support maintains link information between pages (needed to spider some websites)

  15. WebLech
  • Lots of configuration options:
    • Depth-first or breadth-first traversal of the site
    • Candidate URL filtering, so you can stick to one web server, one directory, or just spider the whole web
    • Configurable caching of downloaded files allows restart without needing to download everything again
    • URL prioritization, so you can get interesting files first and leave boring files till last (or ignore them completely)
    • Checkpointing, so you can snapshot spider state in the middle of a run and restart without lots of processing
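
The depth-first/breadth-first choice above comes down to which end of the URL queue newly discovered links enter. A sketch using one `Deque` for both policies (the simulated site and all names are hypothetical, not WebLech's code):

```java
import java.util.*;

// One frontier, two policies: addFirst = stack (depth-first),
// addLast = queue (breadth-first).
public class TraversalSketch {
    // Simulated site: each page lists its outgoing links.
    static final Map<String, List<String>> SITE = Map.of(
        "/",   List.of("/a", "/b"),
        "/a",  List.of("/a1"),
        "/b",  List.of("/b1"),
        "/a1", List.of(),
        "/b1", List.of());

    public static List<String> crawl(String seed, boolean depthFirst) {
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        Set<String> seen = new HashSet<>(List.of(seed));
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.pollFirst();
            order.add(url);
            for (String link : SITE.getOrDefault(url, List.of()))
                if (seen.add(link)) {
                    if (depthFirst) frontier.addFirst(link);  // stack behavior
                    else            frontier.addLast(link);   // queue behavior
                }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/", false)); // breadth-first: [/, /a, /b, /a1, /b1]
        System.out.println(crawl("/", true));  // depth-first:   [/, /b, /b1, /a, /a1]
    }
}
```

URL prioritization fits the same frame: replace the `Deque` with a `PriorityQueue` ordered by an interest score.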

  16. Class Diagram

  17. Class Diagram (cont.)

  18. Class Diagram (cont.)

  19. Sequence Diagram

  20. Sequence Diagram (cont.)

  21. Common Features

  22. Common Features

  23. Unique Features

  24. Grub Crawler
  • A little bit about the SETI@home distributed-computing project
  • What are distributed crawlers?
  • Why distributed crawlers?
  • Pros & cons of distributed crawlers
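
A distributed crawler such as Grub splits the URL space across many client machines. One common way to do this (a sketch of the general idea, not Grub's actual scheme) is to hash each URL's host, so the same node always owns the same site and can keep de-duplication and politeness decisions local:

```java
// Hypothetical work-partitioning rule for a distributed crawler:
// hash the host name to pick a stable owning node.
public class PartitionSketch {
    public static int nodeFor(String host, int nodes) {
        return Math.floorMod(host.hashCode(), nodes);  // stable host -> node mapping
    }

    public static void main(String[] args) {
        int n = 4;
        // The same host always maps to the same node:
        System.out.println(nodeFor("example.org", n) == nodeFor("example.org", n)); // true
    }
}
```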

  25. Class Diagram

  26. Class Diagram (2): Spider & Thread; Config & Robot

  27. Class Diagram (3): Extractor; Queue & StorageManager; Fetcher

  28. Sequence Diagram

  29. Sequence Diagram

  30. Use Case

  31. Aperture
  • Development year: 2005
  • Designation: crawling and indexing
    • Crawls different information systems
    • Handles many common file formats
    • Flexible architecture
  • Main process phases:
    • Fetch information from a chosen source
    • Identify the source type (MIME type detection)
    • Extract full text and metadata
    • Store and index the information
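
The "identify source type" phase can be sketched with the JDK's built-in extension-based MIME lookup. Aperture itself uses richer, content-based detection; this only illustrates the idea of routing each fetched object to an extractor by MIME type (the extractor names are hypothetical):

```java
import java.net.URLConnection;

// Route a fetched object to an extractor based on its (guessed) MIME type.
public class MimeSketch {
    public static String extractorFor(String filename) {
        // Extension-based guess from the JDK's content-type table.
        String mime = URLConnection.guessContentTypeFromName(filename);
        if (mime == null) return "binary-extractor";       // unknown type
        if (mime.startsWith("text/")) return "text-extractor";
        return "metadata-extractor";
    }

    public static void main(String[] args) {
        System.out.println(URLConnection.guessContentTypeFromName("page.html")); // text/html
        System.out.println(extractorFor("page.html"));                           // text-extractor
    }
}
```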

  32. Aperture Web Demo • Go to: http://www.dfki.unikl.de/ApertureWebProject/

  33. Aperture Class Diagram
  Mapping to the domain model: Spider, SpiderConfig, Queue; Extractor (Extractor Types); Crawler Types; Thread, Scheduler, Robots; StorageManager; Fetcher, CrawlerHelper; DB
  Classes unique to Aperture:
  • CrawlReport (interface): helps the crawler keep necessary information about the crawl's changing status, failures, and successes
  • Mime (class): identifies the source type in order to choose the correct extractor. Aperture offers many extractors able to extract data and metadata from files, email, sites, calendars, etc.
  • DataObject and RDFContainer (classes): represent a source object after fetching it. The object includes the source's data and metadata in RDF format
  • Aperture offers a crawler for each data source; our domain focuses on web crawling

  34. Aperture Sequence Diagram

  35. Summary - ADOM
  • ADOM was helpful in establishing domain requirements
  • With a better understanding of ADOM, abstraction became easier: the level of abstraction increased with each assignment
  • Using XOR and OR constraints on relations was helpful in creating the domain class diagram
  • It was difficult not to get carried away with "it's optional, no harm in adding it" decisions

  36. Summary – Domain Modeling
  • Difficulty in modeling functional entities: functions are often contained within another class
  • Difficult to model when many optional entities exist, some of which heavily impact class relations and sequences
  • Vast differences in application scale
  • Next time, we'll pick a different domain…

  37. Crawlers • Thank you • Any questions?
