
Creating a Web Crawler in 3 Steps


Presentation Transcript


  1. Creating a Web Crawler in 3 Steps Issac Goldstand isaac@cpan.org Mirimar Networks http://www.mirimar.net/

  2. The 3 steps • Creating the User Agent • Creating the content parser • Tying it together

  3. Step 1 – Creating the User Agent • libwww-perl (LWP) • OO interface for creating user agents that interact with remote websites and web applications • We will look at LWP::RobotUA

  4. Creating the LWP Object • User agent • Cookie jar • Timeout
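
The slide only names the three settings; as a minimal sketch (the values and cookie file name are illustrative, not from the deck), a plain LWP::UserAgent could be configured like this:

  use LWP::UserAgent;
  use HTTP::Cookies;

  # Illustrative values - the slide only names the three settings
  my $ua = LWP::UserAgent->new(
      agent      => 'MyBot/1.0',                # User agent string
      cookie_jar => HTTP::Cookies->new(
          file     => 'cookies.txt',            # Hypothetical cookie file
          autosave => 1,
      ),
      timeout    => 30,                         # Timeout in seconds
  );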

  5. Robot UA extras • Robot rules • Delay • use_sleep
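
delay and use_sleep appear in the implementation on the next slide; robot rules are not shown there, but as a sketch, LWP::RobotUA also accepts a WWW::RobotRules object as an optional third constructor argument, for example a DBM-backed one so fetched robots.txt rules persist between runs (the database file name here is hypothetical):

  use LWP::RobotUA;
  use WWW::RobotRules::AnyDBM_File;

  # Keep robots.txt rules in a DBM file so they survive between runs
  my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'mybot-rules.db');
  my $ua    = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org', $rules);
  $ua->delay(0.25);     # delay is given in minutes (here 15 seconds)
  $ua->use_sleep(1);    # sleep, rather than fail, when the delay has not yet passed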

  6. Implementation of Step 1

  use LWP::RobotUA;

  # First, create the user agent - MyBot/1.0
  my $ua = LWP::RobotUA->new('MyBot/1.0', 'isaac@cpan.org');
  $ua->delay(15/60);    # 15 seconds delay
  $ua->use_sleep(1);    # Sleep if delayed

  7. Step 2 – Creating the content parser • HTML::Parser • Event-driven parser mechanism • OO and function oriented interfaces • Hooks to functions at certain points
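
As a minimal sketch of the function-oriented (handler) interface, since the deck itself subclasses HTML::Parser on the following slides, a start-tag hook can be registered like this; $html is assumed to hold already-fetched page content:

  use HTML::Parser;

  # Print the target of every <a href="..."> as it is parsed
  my $p = HTML::Parser->new(
      api_version => 3,
      start_h     => [
          sub {
              my ($tagname, $attr) = @_;
              print $attr->{href}, "\n" if $tagname eq 'a' && defined $attr->{href};
          },
          'tagname, attr',              # argument spec for the hook
      ],
  );
  $p->parse($html);
  $p->eof;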

  8. Subclassing HTML::Parser • The biggest issue is non-persistence • CGI authors may be used to this, but it still makes for many caveats • You must implement your own state-preservation mechanism

  9. Implementation of Step 2

  package My::LinkParser;             # Parser class
  use base qw(HTML::Parser);

  use constant START    => 0;         # Define simple constants
  use constant GOT_NAME => 1;

  sub state {                         # Simple access methods
      return $_[0]->{STATE};
  }

  sub author {
      return $_[0]->{AUTHOR};
  }

  10. Implementation of Step 2 (cont)

  sub reset {                         # Clear parser state
      my $self = shift;
      undef $self->{AUTHOR};
      $self->{STATE} = START;
      return 0;
  }

  sub start {                         # Parser hook
      my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
      if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
          $self->{STATE}  = GOT_NAME;
          $self->{AUTHOR} = $attr->{content};
      }
  }

  11. Shortcut HTML::SimpleLinkExtor • Simple package to extract links from HTML • It extracts many kinds of links – we only want href-type links
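
A minimal usage sketch (assuming $html already holds the fetched page content):

  use HTML::SimpleLinkExtor;

  my $linkex = HTML::SimpleLinkExtor->new;
  $linkex->parse($html);
  my @hrefs = $linkex->a;       # href targets of <a> tags only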

  12. Step 3 – Tying it together • Simple application • Instantiate objects • Enter request loop • Spit data out somewhere • Add parsed links to queue

  13. Implementation of Step 3

  for (my $i = 0; $i < 10; $i++) {               # Parse loop
      my $response = $ua->get(pop @urls);        # Get HTTP response
      if ($response->is_success) {               # If response is OK
          $p->reset;
          $p->parse($response->content);         # Parse for author
          $p->eof;
          if ($p->state == 1) {                  # If state is GOT_NAME
              $authors{$p->author}++;            # then add author count
          } else {
              $authors{'Not Specified'}++;       # otherwise add default count
          }
          $linkex->parse($response->content);    # Parse for links
          unshift @urls, $linkex->a;             # and add links to queue
      }
  }

  14. End result

  #!/usr/bin/perl
  use strict;
  use LWP::RobotUA;
  use HTML::Parser;
  use HTML::SimpleLinkExtor;

  my @urls;                                      # List of URLs to visit
  my %authors;

  # First, create & setup the user agent
  my $ua = LWP::RobotUA->new('AuthorBot/1.0', 'isaac@cpan.org');
  $ua->delay(15/60);                             # 15 seconds delay
  $ua->use_sleep(1);                             # Sleep if delayed

  my $p      = My::LinkParser->new;              # Create parsers
  my $linkex = HTML::SimpleLinkExtor->new;

  $urls[0] = "http://www.beamartyr.net/";        # Initialize list of URLs

  15. End result

  for (my $i = 0; $i < 10; $i++) {               # Parse loop
      my $response = $ua->get(pop @urls);        # Get HTTP response
      if ($response->is_success) {               # If response is OK
          $p->reset;
          $p->parse($response->content);         # Parse for author
          $p->eof;
          if ($p->state == 1) {                  # If state is GOT_NAME
              $authors{$p->author}++;            # then add author count
          } else {
              $authors{'Not Specified'}++;       # otherwise add default count
          }
          $linkex->parse($response->content);    # Parse for links
          unshift @urls, $linkex->a;             # and add links to queue
      }
  }

  print "Results:\n";                            # Print results
  map { print "$_\t$authors{$_}\n" } keys %authors;

  16. End result

  package My::LinkParser;             # Parser class
  use base qw(HTML::Parser);

  use constant START    => 0;         # Define simple constants
  use constant GOT_NAME => 1;

  sub state {                         # Simple access methods
      return $_[0]->{STATE};
  }

  sub author {
      return $_[0]->{AUTHOR};
  }

  sub reset {                         # Clear parser state
      my $self = shift;
      undef $self->{AUTHOR};
      $self->{STATE} = START;
      return 0;
  }

  17. End result

  sub start {                         # Parser hook
      my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
      if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
          $self->{STATE}  = GOT_NAME;
          $self->{AUTHOR} = $attr->{content};
      }
  }

  18. What’s missing? • Full URLs for relative links • Non-HTTP links • Queues & caches • Persistent storage • Link (and data) validation
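
As a rough sketch of how the first few gaps could be closed inside the parse loop above (the URI module and the %seen hash are assumptions, not part of the deck; persistent storage and validation are left out):

  use URI;

  my %seen;                                                    # crude queue/cache
  for my $link ($linkex->a) {
      my $abs = URI->new_abs($link, $response->base);          # full URL for relative links
      next unless $abs->scheme && $abs->scheme =~ /^https?$/;  # skip mailto:, ftp:, etc.
      push @urls, $abs->as_string unless $seen{$abs->as_string}++;
  }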

  19. In review • Create robot user agent to crawl websites nicely • Create parsers to extract data from sites, and links to the next sites • Create a simple program to parse a queue of URLs

  20. Thank you! For more information: Issac Goldstand isaac@cpan.org http://www.beamartyr.net/ http://www.mirimar.net/
