Web Crawler

seth
Presentation Transcript

  1. Workshop, 3rd Semester: Web Crawler

  2. What is a Web Crawler?
     • A crawler could be part of:
        • a search engine, where every page is searched for a search string
        • a spam agent that visits all sites and collects email addresses
        • a site map creation application

  3. Goal for the workshop
     • Develop a web crawler that takes a URI as input and visits all pages at the identified site.
     • The web crawler must search HTML documents for links and build a graph containing documents as vertices (nodes) and links as edges.
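The goal above can be sketched as a breadth-first traversal. This is a minimal Python sketch under stated assumptions, not the workshop's .NET solution: `crawl`, `fetch`, and `extract_links` are hypothetical names, "the identified site" is taken to mean "same host as the start URI", and fetching and link extraction are passed in as functions so the traversal can be tried without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch, extract_links):
    """Visit every page on the start URL's host; return {page: [linked pages]}.

    fetch(url) -> HTML text and extract_links(html) -> list of href strings
    are hypothetical callables supplied by the caller.
    """
    site = urlparse(start_url).netloc
    graph = {}                      # vertex (URL) -> list of edge targets
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:            # already visited
            continue
        graph[url] = []
        for href in extract_links(fetch(url)):
            # Resolve relative links and drop any "#fragment" part.
            target, _ = urldefrag(urljoin(url, href))
            if urlparse(target).netloc != site:
                continue            # stay on the identified site
            if target not in graph[url]:
                graph[url].append(target)
            queue.append(target)
    return graph
```

Because the visited check uses the graph itself, pages that link back to each other do not cause an endless loop.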

  4. Subjects
     • Graphs
     • HTML
     • Simple parsing
     • HTTP and TCP
     • Design patterns

  5. Graphs
     • To represent the site you can use a directed graph:
        • HTML documents are represented by vertices (nodes)
        • links are represented by edges
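One common way to hold such a directed graph is an adjacency structure that maps each vertex to the set of vertices it links to. The class below is an illustrative Python sketch; the name `DirectedGraph` and its methods are assumptions, not part of the workshop material.

```python
class DirectedGraph:
    """Directed graph stored as adjacency sets."""

    def __init__(self):
        self._edges = {}  # vertex -> set of successor vertices

    def add_vertex(self, v):
        self._edges.setdefault(v, set())

    def add_edge(self, source, target):
        # Adding an edge implicitly adds both endpoints as vertices.
        self.add_vertex(source)
        self.add_vertex(target)
        self._edges[source].add(target)

    def vertices(self):
        return set(self._edges)

    def successors(self, v):
        # Return a copy so callers cannot modify the graph by accident.
        return set(self._edges.get(v, ()))
```

A link from `index.html` to `about.html` would then be recorded with `add_edge("index.html", "about.html")`. The edge is directed, so `about.html` does not gain a link back.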

  6. HTML
     • Every page the crawler visits is an HTML document.
     • Each HTML document should be searched for link (anchor) tags, for example:
        • <a href = "http://www.prenhall.com">Prentice Hall</a>
     • Frames (frame sets) are to be treated as individual, independent HTML pages
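As an illustration of what "searching for link tags" can look like, the sketch below uses Python's standard `html.parser` module to collect the `href` of every `<a>` tag and the `src` of every `<frame>` tag, so that framed pages can be queued like any other page. `LinkCollector` and `collect_links` are hypothetical names, and a library parser is used here only for brevity; the workshop asks for hand-written parsing.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <frame> tags."""

    # Which attribute carries the target URL for each tag of interest.
    URL_ATTRIBUTE = {"a": "href", "frame": "src"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        wanted = self.URL_ATTRIBUTE.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```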

  7. Simple parsing
     • The simple version searches the HTML documents for links by using "brute force"
     • Other techniques are worth investigating:
        • state machines
        • regular expressions
        • recursive descent
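To make the contrast concrete, here is a hedged Python sketch of two of the techniques named above: a brute-force scan built on plain string searches, and a regular expression. Both function names are hypothetical, and both handle quoted attribute values only. The brute-force version also picks up any `href`, not just those inside `<a>` tags.

```python
import re


def links_brute_force(html):
    """Scan the text for href attributes using plain string searches."""
    links = []
    # Assumes lower() keeps the string length unchanged (true for ASCII markup).
    lower = html.lower()
    pos = lower.find("href")
    while pos != -1:
        i = pos + len("href")
        # Skip whitespace and the '=' sign between attribute name and value.
        while i < len(html) and html[i] in " \t\r\n=":
            i += 1
        if i < len(html) and html[i] in "\"'":
            quote = html[i]
            end = html.find(quote, i + 1)
            if end != -1:
                links.append(html[i + 1:end])
                i = end + 1
        pos = lower.find("href", i)
    return links


# Matches <a ... href="..."> or <a ... href='...'>, ignoring case.
HREF_RE = re.compile(r"""<a\b[^>]*?\bhref\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def links_regex(html):
    """Return the href values of all <a> tags matched by HREF_RE."""
    return HREF_RE.findall(html)
```

The regular expression is shorter and restricts matches to anchor tags. The brute-force scan is easier to step through in a debugger, which can matter when learning the technique.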

  8. Communication
     • For communication, use the .NET Socket class
     • HTTP and TCP
     • URI, URL, etc.
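Fetching a page over a raw socket means opening a TCP connection, sending an HTTP request as text, and reading the reply. The sketch below shows the idea with Python's `socket` module rather than the .NET class the workshop uses. The function names are hypothetical, and HTTP/1.0 with `Connection: close` is chosen as a simplifying assumption so the response ends when the server closes the connection.

```python
import socket


def build_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Connection: close",
        "",   # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP connection, send the request, read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Keeping request building and response splitting separate from the socket code lets both be tested without a network connection.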

  9. Design patterns
     • Design patterns should be applied wherever appropriate

  10. The program
      • Overall structure
         • Graph
         • HTML
         • HTTP
         • Crawler

  11. The Crawler

  12. The Graph Package

  13. The Communication Package

  14. The HTML Package

  15. Test site
      • http://public.noea.dk/fen/testsite/

  16. Schedule for the Week
      • Day 1 – Monday
         • .NET Socket class and graph implementation. (PQC)
      • Day 2 – Tuesday
         • Working with practical HTML parsing, coupling with the HTTP class. Parsing techniques. (CHJO)
      • Day 3 – Wednesday
         • Continue working on your own.
      • Day 4 – Thursday
         • Putting it all together: the graph, HTML, and HTTP parts. The crawler algorithm. (PQC)
      • Day 5 – Friday
         • Finishing up and preparing the presentation. (CHJO)
      • Presentation and evaluation
         • Thursday, April 29, 8.30 – 10.00

  17. Evaluation
      • Nothing in writing, but every team gives a presentation:
         • Results are presented in a PowerPoint presentation:
            • class model
            • design issues, etc.
         • A demo
         • Review and discussion of interesting code fragments
      • Approximately 20 minutes per team
      • Participation in the evaluation is mandatory.
      • A report must be handed in if you do not attend the evaluation.