Data-Mining the Web Using Perl

Burt L. Monroe

Director, Quantitative Social Science Initiative

Department of Political Science

The Pennsylvania State University


Data-Mining the Web

  • Examples

    • Election Returns in Luxembourg

      • Luxembourg Official Election Results, 2004

      • http://qssi.psu.edu/files/luxembourg.pl

    • Parliamentary Speech

      • The Congressional Record


How’d You Do That?

  • There are several programming languages with “straightforward” facilities for doing this. Most notably,

    • Perl

    • Python

    • Java

  • I’m going to talk about Perl, because

    • it’s the most established

    • it’s the one I know

  • It appears that Python may be preferable, but that’s for someone else to say.


What’s Perl?

  • Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language.

  • It is very very good at processing text.

    • note, webpages are just texts.

    • note, datasets (like a flat spreadsheet or Stata file) are just texts.

    • Social scientists might have some use for turning one into the other, no?

  • It has very useful facilities for building

    • Spiders

    • Scrapers

    • (and “agents”, “robots”, “crawlers”, etc.)


What’s a Spider?

  • A spider is a program designed to automatically gather webpages.

  • If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.


What’s a scraper?

  • A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage.

  • If you want to know who said “health” and how many times, you might want to build a scraper.


BEWARE!

  • Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use:

    • appropriating copyrighted materials

    • extracting email addresses for spammers

    • overwhelming servers to create “denial of service”

    • generally violating a site’s “terms of service” or “acceptable use policy”

  • If you are not careful to use legal and ethical good practices, you can

    • be denied access to a website altogether

    • get yourself or the university sued or even subjected to criminal penalties


Perl

  • Open-source

  • Cross-platform

    • (Windows – I recommend “ActivePerl” from http://www.activestate.com)

  • There are many websites with resources:

    • http://www.cpan.org (Comprehensive Perl Archive Network)

    • http://www.perlmonks.org (PerlMonks)

    • http://www.perl.org

    • http://perl.oreilly.com (O’Reilly Publishing)

  • Lots of mailing lists, etc.


Books

  • Basics of Perl

    • The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover.

    • Learning Perl (the Llama)

      • or, Learning Perl on Win32 Systems (the Gecko)

    • Programming Perl (the Camel)

  • Web-mining

    • Perl & LWP (the Blesbok, apparently)

    • Spidering Hacks

  • These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).


Running Perl

  • For machines with approved ActivePerl installations in Pond ...

    • Perl is located in c:/Perl/

  • For today,

    • we will operate entirely in the directory c:/Perl/eg/

    • To get there,

      • open Programs -> Accessories -> Command Prompt

      • At the prompt, type c:

      • Type cd Perl/eg

  • (In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)


The First Perl Program

  • Go to the QuaSSI Website for the example scripts for today’s workshop:

    • http://qssi.psu.edu/files/howdy.pl

  • Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\

  • Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl

  • That’s a complete Perl program.

  • Note: that’s all a program is – a text file.


Running a Perl Program

  • Go back to your command prompt.

  • Type perl -w howdy.pl

  • (The -w switch tells perl to print warnings about suspicious code. It must come before the script name; anything typed after howdy.pl is passed to the script as an argument instead.)


Modifying a program

  • Go back to WinEdt

  • Edit the text between the quotation marks to say something new

  • Click File -> Save

  • Go back to the command prompt

  • Hit the up arrow to recall the last command (perl -w howdy.pl), then press Enter

  • Look at that – you’re a programmer!


Break the program

  • Go back to WinEdt

  • Delete the semicolon at the end of the line

  • Save the file

  • Go back to the command prompt and run the program, with -w, again

  • What happened?


Perl at 30,000 feet

  • Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com

  • (I’m skipping even more than he did.)


Some generalities about Perl

  • Statements in Perl are, or usually can be, constructed in a fairly natural English-like way.

  • There are many ways to do any one thing.

  • The syntax can be off-putting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally.

  • Main syntax rule: end every statement with a semicolon (;)


Data Types

  • Scalars

  • Arrays and Lists

  • Hashes

  • References

  • Filehandles

  • Objects


Scalars

  • Numbers

    • Generally decimal floating point

    • (Can be made integer, octal, hexadecimal)

  • Strings

    • Can contain any character

    • Can be null: “”

    • Can be arbitrarily large


Strings

  • Single-quoted

    • characters are as shown with only two exceptions.

      • single-quote in a single-quoted string requires \'

      • backslash in a single-quoted string requires \\

  • Double-quoted

    • it will interpolate: variables and escape sequences are replaced by their values.

      • For example

        • $foo = "myfile";

        • $datafile = "$foo.txt";

        • will result in the variable $datafile holding the string "myfile.txt"

      • Another example

        • print 'Howdy\n'; will print:

          • Howdy\n

        • print "Howdy\n"; will print

          • Howdy

        • (\n is an escape sequence, standing for “new line”).


Scalar operators

  • Math

    • *, /, % (for modulo), ** (for exponentiation), etc.

  • Strings

    • x to repeat the thing on the left

      • "b" x 10 gives "bbbbbbbbbb"

    • . concatenates strings

      • ("na" x 16) . " Batman!" gives ...

  • Perl knows to convert when mixing these two types:

    • "3" * 4 gives 12

    • "3" . 4 gives "34"
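
A minimal sketch of these operators in a runnable script. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Repetition and concatenation on strings
my $row    = "b" x 10;                   # ten copies of "b"
my $chorus = ("na" x 16) . " Batman!";   # repeat first, then concatenate

# Perl converts between numbers and strings as the operator requires
my $product = "3" * 4;    # * is numeric, so "3" is treated as the number 3
my $joined  = "3" . 4;    # . is a string operator, so 4 is treated as "4"

# Two of the arithmetic operators
my $remainder = 17 % 5;   # modulo
my $power     = 2 ** 10;  # exponentiation

print "$row\n$chorus\n";
print "$product $joined $remainder $power\n";
```

The operator, not the variable, decides whether a value is used as a number or as a string.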


Comparing Scalars

Comparison         Numeric   String

  • Equal            ==        eq

  • Not equal        !=        ne

  • Less than        <         lt

  • Greater than     >         gt

  • Less / equal     <=        le

  • Greater / equal  >=        ge

    8 < 25 TRUE!

    "8" lt "25" FALSE! (strings are compared character by character, and "8" sorts after "2")
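
A short sketch of why the two families of operators matter, assuming some illustrative values. The same distinction shows up when sorting, where the numeric comparison operator <=> (not on the slide) is needed to get numeric order.

```perl
use strict;
use warnings;

# Numeric comparison: 8 is less than 25
my $numeric = (8 < 25) ? "true" : "false";

# String comparison goes character by character: "8" sorts after "2"
my $string = ("8" lt "25") ? "true" : "false";

# Sorting has the same two flavours
my @values     = (8, 25, 100);
my @as_strings = sort @values;                 # default sort compares as strings
my @as_numbers = sort { $a <=> $b } @values;   # <=> compares as numbers

print "8 < 25 is $numeric; \"8\" lt \"25\" is $string\n";
print "string order:  @as_strings\n";
print "numeric order: @as_numbers\n";
```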


Variables

  • A sign ($, @, or %), followed by a letter or underscore, followed by any mix of letters, digits, and underscores.

  • Sign determines the type:

    • $foo is a scalar

    • @foo is an array

    • %foo is a hash

  • Variables default to global (they apply in all parts of your program). This can be problematic.

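    • local $var temporarily gives a global variable a new value until the current “block” of code ends; code called from inside that block sees the temporary value too.

    • my $var creates a new variable that exists only inside the current block. This is the more usual construction.

    • the very common use strict; at the beginning of code requires variables to be declared (usually with my), which forces good practice in the use of local variables (creates more compile-time errors, but prevents more whoppers that could blow everything up.)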
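
The difference between local and my is easiest to see in a small sketch. The subroutine and variable names here are made up for illustration.

```perl
use strict;
use warnings;

our $level = "global";               # a package (global) variable

sub show_level { return $level; }    # reads whatever the global currently holds

sub with_local {
    local $level = "temporary";      # global value replaced until this sub returns
    return show_level();             # the called code sees "temporary"
}

sub with_my {
    my $level = "private";           # a new variable, visible only inside this block
    return show_level();             # the called code still sees the global
}

my $from_local = with_local();
my $from_my    = with_my();

print "with local: $from_local\n";
print "with my:    $from_my\n";
print "afterwards: $level\n";
```

In everyday code my is almost always what you want; local is mostly reserved for temporarily changing Perl's built-in global variables.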


Lists and Arrays

  • A list is an ordered set of (usually) scalars.

  • An array is a variable holding a list.

  • my @foo = (1,2,3);

  • my @bar = ("elephant", 3.14);

  • Can be constructed as lists of scalar variables:

    • my @data = ($name, $address, $SSN);


Using Arrays

  • Elements are indexed, from 0.

    • my @animals = ("frog", "bear", "elephant");

    • print $animals[2]; # prints elephant

    • Note: element is a scalar, so $ rather than @

  • Subsections are “slices”.

    • my @mammals = @animals[1,2];

  • Lots of functions for

    • using as a stack (moving things on and off the right or left side of the array).

    • sorting

    • joining two arrays

    • splitting a scalar string into an array

      • my $sentence = "This is my sentence.";

      • my @words = split(" ", $sentence);

      • # now @words contains ("This", "is", "my", "sentence.")
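
A sketch of the array functions mentioned above (push, pop, shift, unshift, sort, join, split), using the same example data plus a few made-up values.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");

# Stack-style operations on either end of the array
push @animals, "zebra";          # add to the right end
my $last  = pop @animals;        # remove from the right end
my $first = shift @animals;      # remove from the left end
unshift @animals, $first;        # put it back on the left end

# Sorting and joining
my @sorted = sort @animals;                # alphabetical order
my $line   = join(", ", @sorted);          # array to a single string
my @all    = (@animals, "newt", "owl");    # two lists flattened into one array

# Splitting a string on whitespace
my @words = split(" ", "This is my sentence.");

print "$line\n";
print scalar(@all), " animals in the combined array\n";
print scalar(@words), " words, the last one is '$words[-1]'\n";
```

Note that split leaves punctuation attached to the word it touches; stripping it takes a regular expression.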


Programming Controls

  • Control structures

    • if / elsif / else (Perl has no “then” keyword; the body goes in braces)

    • while

    • do {} while

    • do {} until

    • for ()

    • foreach() # loops over a list

  • Errors / warnings

    • die "message" kills program and prints "message".

    • warn "message" prints message and keeps going.
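
A small sketch that combines several of these controls. The classify subroutine and its thresholds are arbitrary illustration values, not anything from the workshop scripts.

```perl
use strict;
use warnings;

# Label a vote share; the cut-offs are made up for the example
sub classify {
    my ($share) = @_;
    if    ($share >= 50) { return "majority"; }
    elsif ($share >= 25) { return "substantial"; }
    else                 { return "minor"; }
}

my @shares = (62.5, 30, 4.1, -3);
my @labels;
foreach my $share (@shares) {
    if ($share < 0) {
        warn "Skipping impossible value $share\n";   # printed to STDERR, loop continues
        next;
    }
    push @labels, classify($share);
}

# A while loop runs until its condition becomes false
my $countdown = 3;
while ($countdown > 0) {
    $countdown--;
}

print join(", ", @labels), "\n";
```

Replacing warn with die in the loop would stop the whole program at the first bad value.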


Hashes

  • “Associative arrays”

  • A set of

    • values (any scalar), indexed by

    • keys (strings)

  • Example

    • my %info;

    • $info{ "name" } = "Burt Monroe";

    • $info{ "age" } = 39;

  • With hashes and arrays you can create almost any arbitrary data structure (even arrays of arrays, arrays of hashes, hashes of arrays, etc.)
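
A sketch of a plain hash and of one nested structure, a hash of arrays. The party names and vote counts are invented for the example.

```perl
use strict;
use warnings;

# A plain hash: keys are strings, values are scalars
my %info = (name => "Burt Monroe", age => 39);

# A hash of arrays: each value is a reference to an array
# (invented illustration data)
my %votes = (
    "Party A" => [120, 340, 95],
    "Party B" => [200, 180, 60],
);

push @{ $votes{"Party A"} }, 10;    # add one more count to Party A's array

# Sum each party's counts
my %total;
foreach my $party (sort keys %votes) {
    $total{$party} = 0;
    $total{$party} += $_ for @{ $votes{$party} };
    print "$party: $total{$party}\n";
}

print "Known fields: ", join(", ", sort keys %info), "\n";
print "No email on record\n" unless exists $info{email};
```

The nesting works because a reference is itself a scalar, so it can be stored as a hash value or an array element.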


File Handling

  • open() function opens a file for processing.

  • Prefix the filename to define how

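    • "<" for input from existing file (read)

    • ">" to create for output (write); this overwrites a file that already exists

    • ">>" to append to a file (that may not yet exist)

  • open (IN, "<myfile.txt") or die "Can't open myfile.txt: $!"; ($! holds the system's error message)

  • Can then use <> to read from the file. The above would be <IN>, which in scalar context returns the next line each time it is evaluated.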
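
A sketch of writing, appending, and reading. Note that it uses the three-argument form of open with a filehandle stored in a variable, which is a common alternative to the bareword IN form on the slide. The file name is hypothetical.

```perl
use strict;
use warnings;

my $file = "example_output.txt";    # hypothetical file name

# Write two lines (">" creates or overwrites the file)
open(my $out, ">", $file) or die "Can't write $file: $!";
print $out "first line\n";
print $out "second line\n";
close($out);

# Append one more line
open($out, ">>", $file) or die "Can't append to $file: $!";
print $out "third line\n";
close($out);

# Read the file back one line at a time
my @lines;
open(my $in, "<", $file) or die "Can't read $file: $!";
while (my $line = <$in>) {
    chomp $line;                    # strip the trailing newline
    push @lines, $line;
}
close($in);

print scalar(@lines), " lines read\n";
unlink $file;                       # remove the example file again
```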


Matching string patterns using regular expressions

  • This is where much of the power of Perl lies.

  • m/pattern/ will check the default variable ($_) for pattern.

  • $var =~ m/pattern/; will check $var for pattern.

  • If the pattern is in $var, then

    • $var =~ m/pattern/ is TRUE.

  • If you “group” part of the pattern and it is present,

    • $var =~ m/(pattern)/ is true, AND, now a variable named $1 contains the text matched by the first group.

    • Group more pieces of the pattern and the matches are stored in $2, $3, etc.

  • This only grabs the *first* match. To grab all, say

    • my @matches = ($var =~ m/(pattern)/g);

    • This will store every match in the array @matches.
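
A sketch of the three uses described above: a plain true/false match, captured groups, and collecting every match with /g. The sample sentence is invented.

```perl
use strict;
use warnings;

my $text = "Mr. Smith on health: health care and public health.";

# A plain match is simply true or false
my $found = ($text =~ m/health/) ? 1 : 0;

# Groups capture pieces of the first match into $1, $2, ...
my ($title, $name);
if ($text =~ m/(Mr|Ms)\. (\w+)/) {
    ($title, $name) = ($1, $2);
}

# With /g in list context, every match is returned
my @matches = ($text =~ m/(health)/g);
my $count   = scalar @matches;

print "$title $name mentioned 'health' $count times\n";
```

Copy $1 and $2 into named variables right away, as here, because the next successful match overwrites them.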


What’s a “regular expression”?

  • Combination of

    any literal character, number, etc.

    . any single character

    * zero or more of the previous

    + one or more of the previous

    ? zero or one of the previous

    [aeiou] character class – this is the vowels

    ^ beginning of the line

    $ end of the line

    \b word boundary

    \d \D digit / non-digit

    \s \S space / non-space

    \w \W word character / non-word character

    | or – match this or that

    () grouping

  • See handout for more.


Examples

  • Romeo|Juliet “Romeo” or “Juliet”

  • \d\d\d-\d\d\d\d a phone number

  • (\d\d\d-)?\d\d\d-\d\d\d\d phone #, maybe w/ area

  • \b[aeiou]\w+ a word starting w/ a vowel

  • \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b email add. (use the /i flag so lower-case letters match too)
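
A sketch that tries these patterns against some invented sample text. The first_match helper and the hash of patterns are hypothetical names. In Perl source the @ in the email pattern is written \@ so that it cannot be mistaken for an array, and the /i flag makes the letter classes case-insensitive.

```perl
use strict;
use warnings;

# The example patterns, precompiled with qr//
my %pattern = (
    lovers     => qr/Romeo|Juliet/,
    phone      => qr/(\d\d\d-)?\d\d\d-\d\d\d\d/,
    vowel_word => qr/\b[aeiou]\w+/,
    email      => qr/\b[A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i,
);

# Return the first piece of $text that matches the named pattern, or undef
sub first_match {
    my ($name, $text) = @_;
    return $text =~ /($pattern{$name})/ ? $1 : undef;
}

my $sample = "Juliet can be reached at 814-555-0123 or someone\@example.org";

foreach my $name (sort keys %pattern) {
    my $hit = first_match($name, $sample);
    print "$name: ", (defined $hit ? $hit : "(no match)"), "\n";
}
```

These patterns are deliberately loose. The phone pattern, for instance, will also match any other run of digits laid out the same way.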


Modules

  • Thousands of modules / packages available through CPAN.

  • ActivePerl gives a GUI for installing them in its “Perl Package Manager”.


A basic Perl example

  • Counting words.

    • counter1.pl
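
The workshop script counter1.pl is not reproduced in this transcript. As a rough sketch of how word counting can be done with the pieces covered so far (a hash, a regular expression, and sort), assuming an invented sample text:

```perl
use strict;
use warnings;

# Count how often each word occurs in a piece of text
sub count_words {
    my ($text) = @_;
    my %count;
    # \w+ with /g walks through every run of word characters
    while ($text =~ m/(\w+)/g) {
        $count{ lc $1 }++;       # lower-case so "Health" and "health" are merged
    }
    return %count;
}

my $speech = "Health care matters. Public health matters more.";
my %count  = count_words($speech);

# Most frequent first, ties broken alphabetically
foreach my $word (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```

To count words in a file instead, read it line by line and pass each line through the same loop.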


Grabbing from the web

  • The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does -- requesting and interpreting webpages.

  • There are a few basic modules that can do this.


LWP::Simple

  • lwpsimpleget.pl
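
The script lwpsimpleget.pl is not reproduced in this transcript. A minimal sketch of fetching one page with LWP::Simple might look like the following; the fetch_page name is hypothetical, and the script only contacts the network when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch one page and return its contents, or die with a message
sub fetch_page {
    my ($url) = @_;
    my $html = get($url);                 # get() returns undef if the request fails
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Example use:  perl fetch.pl http://www.example.com/
if (@ARGV) {
    my $html = fetch_page($ARGV[0]);
    print length($html), " characters received\n";
}
```

LWP::Simple gives no detail about why a request failed, which is one reason to move on to the richer modules.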


LWP::UserAgent

  • More elaborate than LWP::Simple.

  • I’m going to skip that one today, but it’s covered in detail in the main books

    • Perl & LWP

    • Spidering Hacks

  • Pretty much all of the functionality has been wrapped more intuitively into ...


WWW::Mechanize

  • mechanizeget.pl
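
The script mechanizeget.pl is not reproduced in this transcript. As a sketch of the basic WWW::Mechanize pattern (create an agent, get a page, look at what came back), with list_links as a hypothetical helper name:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Fetch a page and return the absolute URL of every link on it
sub list_links {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );   # die if a request fails
    $mech->get($url);
    return map { $_->url_abs->as_string } $mech->links;
}

# Example use:  perl links.pl http://www.example.com/
if (@ARGV) {
    print "$_\n" for list_links($ARGV[0]);
}
```

The raw HTML of the page is available from the same object through its content method, ready for scraping.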


Scraping

  • At its base, this is just extracting information from the page(s) you download.

  • Simple example:

    • freshair.pl
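
The script freshair.pl is not reproduced in this transcript. A generic sketch of regular-expression scraping, run on an invented HTML fragment standing in for a downloaded page:

```perl
use strict;
use warnings;

# An invented fragment standing in for a downloaded page
my $html = <<'HTML';
<table>
<tr><td class="commune">Commune A</td><td class="votes">1204</td></tr>
<tr><td class="commune">Commune B</td><td class="votes">387</td></tr>
</table>
HTML

# Pull out one (name, votes) pair per table row
my @rows;
while ($html =~ m{<td class="commune">(.*?)</td><td class="votes">(\d+)</td>}g) {
    push @rows, [$1, $2];
}

# Print as tab-separated text, ready for a spreadsheet or stats package
print join("\t", @$_), "\n" for @rows;
```

The pattern is tied to the exact markup of the page, so it must be revisited whenever the site changes its layout.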


Your agent can interact ...

  • For example, what if the webpage involves a form ...

  • Example

    • abstracts.pl

  • You can authenticate with username and password, run through proxy servers, and so on.
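
The script abstracts.pl is not reproduced in this transcript. As a sketch of form handling with WWW::Mechanize, assuming the page's first form has a text field called "query" (a hypothetical field name; inspect the real page's HTML to find the actual one):

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Fill in the first form on a page and return the HTML of the result
sub search_site {
    my ($url, $term) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );   # die if a request fails
    $mech->get($url);
    $mech->submit_form(
        form_number => 1,                     # the first form on the page
        fields      => { query => $term },    # "query" is a hypothetical field name
    );
    return $mech->content;
}

# Example use:  perl search.pl http://www.example.com/search health
if (@ARGV == 2) {
    print search_site(@ARGV);
}
```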


Spiders

  • Type 1 Requester

    • Requests a few items with known urls from a website.

  • Type 2 Requester

    • Requests a few items, then requests (some set of) pages to which those items link.

  • Type 3 Requester

    • Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website.

  • Type 4 Requester

    • Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web.

  • YOU – I am talking to YOU – in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry causing mayhem of many sorts. Write only spiders with known finite scope.
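
A sketch of a Type 1 requester with known, finite scope: it walks a fixed list of URLs and pauses between requests. The fetch_known_urls name is hypothetical. The fetching routine is passed in as a code reference (for real use it could be \&LWP::Simple::get), so the demonstration below runs against a stand-in and never touches the network.

```perl
use strict;
use warnings;

# Request a fixed list of URLs, pausing between requests.
# $fetch is a code reference that takes a URL and returns the page text,
# or undef on failure.
sub fetch_known_urls {
    my ($urls, $delay, $fetch) = @_;
    my %page;
    my $remaining = @$urls;
    foreach my $url (@$urls) {
        my $content = $fetch->($url);
        if (defined $content) {
            $page{$url} = $content;
        }
        else {
            warn "No response for $url\n";
        }
        sleep $delay if --$remaining > 0;    # be gentle with the server
    }
    return %page;
}

# Demonstration with a stand-in fetcher
my %fake_site = ("http://www.example.com/a" => "page A");
my %got = fetch_known_urls(
    ["http://www.example.com/a", "http://www.example.com/missing"],
    0,                                       # no delay needed for the stand-in
    sub { return $fake_site{ $_[0] } },
);

print scalar(keys %got), " page(s) fetched\n";
```

Against a real server, a delay of a second or more between requests is a reasonable starting point, and the site's terms of use still apply.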


Back to the Luxembourg Miner

  • Commune-level election results from Luxembourg.

    • luxembourg.pl


More on Scraping

  • All of the examples scraped / parsed using regular expressions.

  • More structured data like HTML is often better (or only) addressed with more specialized tools:

    • HTML::TokeParser

    • HTML::TreeBuilder

  • There are modules for scraping from XML, spreadsheets, databases, Word docs, PDFs.
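
As a sketch of the more structured approach, here is HTML::TokeParser pulling links out of an invented HTML string. It steps through the document tag by tag instead of relying on one large regular expression.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# An invented fragment standing in for a downloaded page
my $html = '<p><a href="/speech1.html">First speech</a> and '
         . '<a href="/speech2.html">Second   speech</a></p>';

my $parser = HTML::TokeParser->new(\$html) or die "Can't set up the parser";

my @links;
while (my $tag = $parser->get_tag("a")) {
    my $href = $tag->[1]{href};                    # $tag->[1] is the hash of attributes
    next unless defined $href;
    my $text = $parser->get_trimmed_text("/a");    # link text, whitespace collapsed
    push @links, [$href, $text];
}

print "$_->[0]\t$_->[1]\n" for @links;
```

Because the parser understands tags and attributes, this keeps working if the page changes attribute order or adds extra whitespace, which would break many hand-written patterns.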

