
LING/C SC/PSYC 438/538

Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong

  2. Administrivia • Homework 9 • Perl regex • Python re • import re • slightly complicated string handling: use raw strings (r'...') • https://docs.python.org/3/library/re.html
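
The point about raw strings can be illustrated with a minimal sketch. The sample text and variable names below are made up for illustration; only `re.search` from the standard `re` module is used.

```python
import re

# In an ordinary string literal, "\b" is the backspace character (ASCII 8).
# In a raw string, r"\b" is a backslash followed by "b", which the regex
# engine reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

text = "the cat sat"  # made-up sample text

plain_match = re.search(plain, text)  # searches for backspace + "cat" + backspace
raw_match = re.search(raw, text)      # searches for the whole word "cat"

print(len(plain), len(raw))
print(plain_match)
print(raw_match.group(0))
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python itself interprets.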

  3. File I/O Summary • Common: • open • filehandle (concept comes from the underlying OS) • streams: STDIN STDOUT STDERR (Perl) • streams: sys.stdin sys.stdout sys.stderr (Python) • close • Perl: https://perldoc.perl.org/perlopentut.html • <filehandle> (context: reads a line or the whole file) • print filehandle String • Python: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files • .read() (methods) • .readline() • .readlines() • .write(String) (no newline) • print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False) (function)
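
A short sketch of the Python side of this summary follows. The scratch file name and its contents are hypothetical; the calls are the standard file methods and `print` parameters listed on the slide.

```python
import os
import sys
import tempfile

# Hypothetical scratch file, so the example touches no real data.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# .write() adds no newline; print() appends end='\n' by default.
with open(path, "w", encoding="utf-8") as fh:
    fh.write("first line\n")
    print("second line", file=fh)

with open(path, encoding="utf-8") as fh:
    whole = fh.read()        # the entire file as one string

with open(path, encoding="utf-8") as fh:
    first = fh.readline()    # one line, trailing newline included
    rest = fh.readlines()    # the remaining lines, as a list

print(repr(whole))
print(repr(first), rest)
print("diagnostics can go to stderr", file=sys.stderr)
```

The `with` statement closes the filehandle automatically, which stands in for the explicit close step in the summary.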

  4. Regular Expressions to the rescue • https://xkcd.com/208/

  5. Regular Expressions from Hell Email validation: RFC 5322: (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

  6. Homework 9 • File: hw9.txt • 56 lines • Contents: each line has 3 fields • name of state or US territory (in alphabetical order) • population • area (sq. miles) • fields are separated by a tab (\t) • Source: Wikipedia

  7. Homework 9 Question 1 • Using Perl • supply the file hw9.txt on the command line • DO NOT MODIFY hw9.txt • read the file • use regex to extract the information • create hash table(s) indexed by name containing population and land area • Print a table of states/territories inversely ranked by land area • Print a table of states/territories ranked by population (i.e. 1st is highest population) • compute the density (population per sq. mile) • Print a table of states/territories ranked by density (i.e. 1st is highest density)

  8. Homework 9 Question 1 • Hints: • note that some state/territory names consist of more than one word • note that numeric values may have commas • read about @ARGV • read about split • read about tr: $num =~ tr/,//d deletes the pesky commas in $num • revisit sort parameters: https://perldoc.perl.org/functions/sort.html • if you need to trim whitespace from the ends: $line =~ s/^\s+|\s+$//g; • for nicely-formatted lists, read http://perldoc.perl.org/functions/sprintf.html about printf FORMAT

  9. Homework 9: Question 2 • 538 only (optional for 438): • Do the same exercise as Question 1 in Python3 using a dictionary or dictionaries • In your opinion, which code is simpler? • These may prove useful: • str.strip() • str.replace() • str.split() • sys.argv • int()
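
As a rough illustration of how the listed string methods fit together, the sketch below parses a single line. The line is invented and is not taken from hw9.txt; it only follows the tab-separated layout described earlier, and the variable names are hypothetical.

```python
# One made-up line: name, population and area separated by tabs,
# with commas inside the numbers.
line = "Example State\t1,234,567\t8,910\n"

# strip() drops the trailing newline; split("\t") keeps multi-word names intact.
name, population, area = line.strip().split("\t")

# replace() deletes the commas so that int() can convert the digits.
population = int(population.replace(",", ""))
area = int(area.replace(",", ""))

record = {name: (population, area)}
print(record)
```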

  10. Homework 9 • Usual submission rule: • ONE PDF file • Submit code/run/comments • Email subject heading: 438/538 Homework 9 Your Name • Due by midnight next Monday (review in class on Tuesday)

  11. regex • Read textbook chapter 2: section 1 on Regular Expressions

  12. Perl regex • Read up on the syntax of Perl regular expressions • Online tutorials • http://perldoc.perl.org/perlrequick.html • http://perldoc.perl.org/perlretut.html

  13. Perl regex • Perl regex matching: • $s =~ /foo/ (/…/ contains a regex) • can use in a conditional: • e.g. if ($s =~ /foo/) … • evaluates to true/false depending on what’s in $s • can also use as a statement: • e.g. $s =~ /foo/; • global variable $& contains the match • Perl regex match and substitute: • $s =~ s/foo/bar/ • s/…match…/…substitute…/ contains two expressions • modifies $s by replacing the first occurrence of match with substitute • s/…match…/…substitute…/g global substitution (replaces every occurrence)
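
For comparison, here is a minimal Python sketch of the same two operations. It is a rough analogue rather than an exact equivalent: Python strings are immutable, so `re.sub` returns a new string instead of modifying `s`. The sample string is made up.

```python
import re

s = "foo and more foo"  # made-up sample string

# Rough analogue of: if ($s =~ /foo/) ...
m = re.search(r"foo", s)
if m:
    matched = m.group(0)  # plays the role of Perl's $&

# Rough analogue of s/foo/bar/ : replace only the first occurrence.
once = re.sub(r"foo", "bar", s, count=1)

# Rough analogue of s/foo/bar/g : replace every occurrence.
everywhere = re.sub(r"foo", "bar", s)

print(matched)
print(once)
print(everywhere)
print(s)  # unchanged
```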

  14. Perl regex • Most useful with the code template for reading in a file line-by-line:
      open($fh, $ARGV[0]) or die "$ARGV[0] not found!\n";
      while ($line = <$fh>) {
          # do RE stuff with $line
      }
      close($fh);
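
The same read-a-file-line-by-line pattern can be sketched in Python. The helper name `matching_lines` and the demo file are hypothetical, introduced only so the example runs on its own.

```python
import os
import re
import tempfile


def matching_lines(path, pattern):
    """Return the lines of the file at `path` that contain a match for `pattern`."""
    hits = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:                      # iterates one line at a time
            if re.search(pattern, line):     # "do RE stuff" with the line
                hits.append(line.rstrip("\n"))
    return hits


# Demo on a made-up scratch file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("foo one\nbar two\nthree foo\n")

hits = matching_lines(demo_path, r"foo")
print(hits)
```

In a command-line script, `sys.argv[1]` would take the place of `demo_path`, mirroring `$ARGV[0]` in the Perl template.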

  15. Chapter 2: JM • spaces matter! • character class: Perl lingo

  16. Chapter 2: JM • range: in ASCII table order • backslash + lowercase letter for a class • uppercase variant for all but that class

  17. Chapter 2: JM

  18. Chapter 2: JM • can use (…) if > 1 char • Sheeptalk

  19. Perl regex • \S+ing\b • \s is a whitespace character, so \S is a non-whitespace character • + is repetition (1 or more) • \b is a word boundary (words are made up of \w characters)
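
The pattern \S+ing\b behaves the same way in Python's `re` module, so it can be tried with a small sketch. The sentence is made up for illustration.

```python
import re

text = "She was singing and dancing during the evening."  # made-up sentence

# \S+ needs at least one non-whitespace character before "ing",
# and \b requires a word boundary right after the "g".
words = re.findall(r"\S+ing\b", text)
print(words)
```

The result includes "during" and "evening", which shows that the pattern finds words ending in "ing" rather than only verb forms.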

  20. Perl regex • \b or \b{wb} • global variables • other boundary metacharacters: ^ (beginning of line), $ (end of line)
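
The anchors ^ and $ carry over to Python unchanged. A minimal sketch, using made-up strings:

```python
import re

lines = ["foo bar", "bar foo", "foo"]  # made-up sample lines

starts = [l for l in lines if re.search(r"^foo", l)]   # foo at the beginning
ends = [l for l in lines if re.search(r"foo$", l)]     # foo at the end
whole = [l for l in lines if re.search(r"^foo$", l)]   # the line is exactly foo

print(starts)
print(ends)
print(whole)
```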

  21. Perl regex: Unicode and \b • \b • \b{wb} • Note: global match in while-loop • Note: .*? is the non-greedy version of .*
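
The greedy versus non-greedy distinction can be seen in a short sketch. This uses Python's `re` module, with `re.finditer` standing in for a global match inside a while-loop; the sample text is made up. The Perl-specific \b{wb} is not covered here.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # made-up sample text

# Greedy: .* grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? grabs as little as possible, so each tag matches separately.
lazy = re.findall(r"<.*?>", text)

# Looping over successive matches, one at a time.
starts = [m.start() for m in re.finditer(r"<.*?>", text)]

print(greedy)
print(lazy)
print(starts)
```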

  22. Perl regex: Unicode and \w • list context • \w is [0-9A-Za-z_] • Definition is expanded for Unicode:
      use utf8;
      use open qw(:std :utf8);
      my $str = "school école École šola trường स्कूल škole โรงเรียน";
      @words = ($str =~ /(\w+)/g);
      foreach $word (@words) { print "$word\n" }
      • Pragma: https://perldoc.perl.org/open.html
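
Python 3 also treats \w as Unicode-aware for `str` patterns by default, which the sketch below illustrates. It uses only some of the words from the slide, written with escape sequences so the code does not depend on how the source file is encoded. Words containing combining marks may be segmented differently than in Perl and are left out here.

```python
import re

# "school école École šola škole", spelled with escapes for the accented letters.
text = "school \u00e9cole \u00c9cole \u0161ola \u0161kole"

# Default for str patterns: \w matches Unicode letters, digits and underscore.
words = re.findall(r"\w+", text)

# With re.ASCII, \w shrinks back to [0-9A-Za-z_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(words)
print(ascii_words)
```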

  23. Chapter 2: JM • Why? • * means zero or more repetitions of the previous char/expr • . means any single character • ? means previous char/expr is optional
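
The three operators can be checked one at a time. The sketch below uses `re.fullmatch`, so a candidate passes only if the whole string fits the pattern; the candidate strings are made up.

```python
import re

# ? makes the previous character optional.
optional = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

# * allows zero or more repetitions of the previous character.
star = [w for w in ["b", "ba", "baaa", "bab"] if re.fullmatch(r"ba*", w)]

# . stands for exactly one character (any character except a newline).
dot = [w for w in ["bag", "bug", "bg", "b g"] if re.fullmatch(r"b.g", w)]

print(optional)
print(star)
print(dot)
```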

  24. Chapter 2: JM • Precedence of operators • Example: Column 1 Column 2 Column 3 … • /Column [0-9]+ */ • /(Column [0-9]+ *)*/ • /house(cat(s|)|)/ (| = disjunction; ? = optional) • Perl: • in a regular expression, the text matched by the pattern within a pair of parentheses is stored in the global variables $1 (and $2 and so on) • (?: … ) group but exclude from storage • Precedence Hierarchy: space
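
The slide's patterns can be tried out in Python, where groups are read from the match object rather than from $1 and $2. This is a sketch with made-up candidate strings; the patterns themselves are taken from the slide.

```python
import re

text = "Column 1 Column 2 Column 3"

# Without parentheses, the final * applies only to the space before it.
single = re.match(r"Column [0-9]+ *", text).group(0)

# With parentheses, * repeats the whole "Column N " unit.
repeated = re.match(r"(Column [0-9]+ *)*", text).group(0)

# Nested disjunction with an empty alternative: house, housecat, housecats.
candidates = ["house", "housecat", "housecats", "houses"]
house = [w for w in candidates if re.fullmatch(r"house(cat(s|)|)", w)]

# Capturing groups are stored; (?: ... ) groups are not.
captured = re.match(r"(Column) ([0-9]+)", text).groups()
uncaptured = re.match(r"(?:Column) ([0-9]+)", text).groups()

print(repr(single))
print(repr(repeated))
print(house)
print(captured, uncaptured)
```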

  25. Online regex tester https://regex101.com

  26. Perl regex • http://perldoc.perl.org/perlretut.html • a match returns 1 (true) or "" (empty string, if false) • A shortcut: in list context, matching returns a list
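
Python has no scalar versus list context, but a rough counterpart exists: `re.search` returns a match object (truthy) or `None` (falsy), and `.groups()` returns the captured pieces together. A minimal sketch with a made-up input string:

```python
import re

line = "Lecture 12"  # made-up input

m = re.search(r"(\w+) (\d+)", line)

found = bool(m)            # True if the pattern matched, False otherwise
word, number = m.groups()  # all captured groups at once, as a tuple

missing = re.search(r"(\d+) (\w+)", line)  # no match: returns None

print(found, word, number)
print(missing)
```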
