
LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538, Lecture 7, Sandiway Fong. Today's topics: a note on Unicode; Homework 2 review; Homework 3 (due next Monday 11:59pm; usual rules: one PDF file etc.); Perl regex contd.


Presentation Transcript


  1. LING/C SC/PSYC 438/538 Lecture 7 Sandiway Fong

  2. Today's Topics • A note on Unicode • Homework 2 review • Homework 3: due next Monday 11:59pm • usual rules: one PDF file etc. • Perl regex contd.

  3. Unicode and \w • Recall: \w is [0-9A-Za-z_] • Experiment: use utf8; use open qw(:std :utf8); my $a = "school écoleÉcolešolatrườngस्कूलškoleโรงเรียน"; @words = ($a =~ /(\w+)/g); foreach $word (@words) { print "$word\n" }
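
A minimal sketch of the experiment on slide 3, assuming Perl 5.14 or later (needed for the /a modifier). On decoded character strings, \w is not limited to [0-9A-Za-z_]: it also matches Unicode letters, marks and digits. The sub names unicode_words and ascii_words and the sample string are illustrative and do not come from the lecture.

```perl
use strict;
use warnings;
use utf8;                    # string literals in this file are UTF-8
use open qw(:std :utf8);     # encode STDOUT/STDERR as UTF-8

# All maximal runs of \w characters, using Perl's default Unicode-aware
# semantics for decoded strings.
sub unicode_words {
    my ($text) = @_;
    return $text =~ /(\w+)/g;
}

# Same pattern, but /a restricts \w to the ASCII set [0-9A-Za-z_].
sub ascii_words {
    my ($text) = @_;
    return $text =~ /(\w+)/ga;
}

my $text = "school école škola trường";
print join("|", unicode_words($text)), "\n";   # accented words stay whole
print join("|", ascii_words($text)), "\n";     # accented letters split the words
```

Adding /a is one way to get back the narrower ASCII definition of \w when that is what a pattern really needs.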

  4. Homework 2 Review • Sample data file: • First try.. just try to detect a repeated word

  5. Homework 2 Review • Sample data file: • Sample output:

  6. Homework 2 Review • Key: think algorithmically… • think of a specific example first: five words w1 w2 w3 w4 w5 • Compare w1 with w2 • Compare w2 with w3 • Compare w3 with w4 • Compare w4 with w5

  7. Homework 2 Review • Generalize the specific example, then code it up • Array indices start from 0: array @words holds $words[0], $words[1], …, $words[n-1] • Compare w(1) with w(1+1) • Compare w(2) with w(2+1) • … • Compare w(n-2) with w(n-2+1) • Compare w(n-1) with w(n) • “for” loop implementation: the loop index must stop just before $#words (the last index), because each step compares an element with its successor
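
The loop described on slides 6 and 7 can be sketched as follows. The homework statement and the slide code are not part of this transcript, so this assumes the task is to report each word that is immediately followed by a copy of itself on one line of text. The sub name repeated_words and the case-insensitive comparison are assumptions.

```perl
use strict;
use warnings;

# Return every word that is immediately followed by the same word.
sub repeated_words {
    my ($line) = @_;
    my @words = split ' ', $line;        # split on runs of whitespace
    my @repeats;
    # Stop at $#words - 1 so that $words[$i + 1] always exists.
    for my $i (0 .. $#words - 1) {
        push @repeats, $words[$i] if lc $words[$i] eq lc $words[$i + 1];
    }
    return @repeats;
}

print join(", ", repeated_words("I saw the the cat")), "\n";
```

With this first pass, a run such as "the the the" is reported twice, once per adjacent pair.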

  8. Homework 2 Review

  9. Homework 2 Review

  10. Homework 2 Review

  11. Homework 2 Review

  12. Homework 2 Review

  13. Homework 2 Review

  14. Homework 2 Review a decent first pass …

  15. Homework 2 Review • Sample data file: • Output:

  16. Homework 2 Review • Second try.. merging multiple occurrences

  17. Homework 2 Review • Second try.. merging multiple occurrences • Sample data file: • Output:
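
One possible way to merge multiple occurrences, sketched under the same assumptions as the first pass (the slide's own code is not in the transcript): instead of reporting each adjacent pair, scan forward to the end of a run and report the word once with its run length. The name merged_repeats and the [word, count] return format are hypothetical.

```perl
use strict;
use warnings;

# Return one [word, run_length] pair per run of two or more identical words.
sub merged_repeats {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @runs;
    my $i = 0;
    while ($i < $#words) {
        my $j = $i;
        # Extend the run while the next word equals the first word of the run.
        $j++ while $j < $#words && lc $words[$j + 1] eq lc $words[$i];
        push @runs, [ $words[$i], $j - $i + 1 ] if $j > $i;
        $i = $j + 1;                     # continue after the run
    }
    return @runs;
}

for my $run (merged_repeats("the the the cat sat sat")) {
    my ($word, $count) = @$run;
    print "$word x $count\n";
}
```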

  18. Homework 2 Review • Third try.. implementing a simple table of exceptions

  19. Homework 2 Review • Third try.. table of exceptions • Sample data file: • Output:
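
A table of exceptions can be sketched with a hash lookup. The slide's actual table is not in the transcript; the entries "had" and "that" below are hypothetical examples of words that can legitimately appear twice in a row, and the sub name repeated_words_filtered is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical exception table: words allowed to repeat.
my %exception = map { $_ => 1 } qw(had that);

# Like the first pass, but skip any word listed in %exception.
sub repeated_words_filtered {
    my ($line) = @_;
    my @words = split ' ', $line;
    my @repeats;
    for my $i (0 .. $#words - 1) {
        my $word = lc $words[$i];
        next if $exception{$word};       # allowed repetition, do not report
        push @repeats, $words[$i] if $word eq lc $words[$i + 1];
    }
    return @repeats;
}

print join(", ", repeated_words_filtered("he had had the the book")), "\n";
```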

  20. Homework 3 Corpus file WSJ9_00x.txt (Tipster Vol 1.): <DOC> <DOCNO> WSJ891102-0170 </DOCNO> <DD> = 891102 </DD> <AN> 891102-0170. </AN> <HL> International: @ Australian Firm's Purchase </HL> <DD> 11/02/89 </DD> <SO> WALL STREET JOURNAL (J) </SO> <CO> A.FHF MBIO </CO> <IN> TENDER OFFERS, MERGERS, ACQUISITIONS (TNM) </IN> <DATELINE> SYDNEY </DATELINE> <TEXT> F.H. Faulding &amp; Co., an Australian pharmaceuticals company, said its Moleculon Inc. affiliate acquired Kalipharma Inc. for $23 million. Kalipharma is a New Jersey-based pharmaceuticals concern that sells products under the Purepac label. Faulding said it owns 33% of Moleculon's voting stock and has an agreement to acquire an additional 19%. That stake, together with its convertible preferred stock holdings, gives Faulding the right to increase its interest to 70% of Moleculon's voting stock. </TEXT> </DOC> Skip

  21. Homework 3 • Write a Perl program with regular expressions to extract all the dollar amounts from the file within the <TEXT> </TEXT> markups. • Examples (actual): • $23 million • $38.3 billion • C$1,000 • $95,142 • $3.01 • $38.375 • $45 • 16.125 Canadian dollars
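
As a starting point only, here is a rough sketch that handles the example formats listed on slide 21. It is not a complete solution: the set of scale words, the optional C or US prefix, and the sub name dollar_amounts are assumptions, and a real corpus will contain formats this pattern misses.

```perl
use strict;
use warnings;

# A number with optional thousands separators and decimal part: 95,142 or 38.375
my $number = qr/\d+(?:,\d{3})*(?:\.\d+)?/;

# Either "$23 million" / "C$1,000" style, or "16.125 Canadian dollars" style.
my $amount = qr/
    (?:C|US)? \$ $number (?: \s+ (?:million|billion|trillion) )?
  | $number \s+ (?:Canadian\s+)? dollars
/x;

# Return all dollar amounts found inside <TEXT> ... <\/TEXT> sections.
sub dollar_amounts {
    my ($doc) = @_;
    my @found;
    while ($doc =~ m{<TEXT>(.*?)</TEXT>}gs) {
        my $text = $1;                   # copy before matching again
        push @found, $text =~ /($amount)/g;
    }
    return @found;
}

my $sample = '<HL> $99 </HL> <TEXT> acquired Kalipharma Inc. for $23 million. </TEXT>';
print "$_\n" for dollar_amounts($sample);
```

The non-greedy (.*?) with the /s modifier keeps each match inside a single <TEXT> section even when the file holds many documents.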

  22. Homework 3 • Submit: • your program • document your regex • how many dollar amounts you extracted • The largest dollar amount • The smallest dollar amount • The median dollar amount (assume, for simplicity, that C$ and US$ are worth the same) • Appendix: list all the dollar amounts
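
For the largest, smallest and median figures, the extracted strings first have to be turned into numbers. The sketch below assumes, as the slide does, that C$ and US$ are worth the same; the helper names amount_value and median and the scale table are hypothetical.

```perl
use strict;
use warnings;

my %scale = (million => 1e6, billion => 1e9, trillion => 1e12);

# Convert a string such as "$23 million" or "C$1,000" to a plain number.
sub amount_value {
    my ($amount) = @_;
    my ($digits) = $amount =~ /(\d+(?:,\d{3})*(?:\.\d+)?)/ or return;
    $digits =~ tr/,//d;                  # drop thousands separators
    my ($word) = $amount =~ /\b(million|billion|trillion)\b/;
    return $digits * ($word ? $scale{$word} : 1);
}

# Median of a list of numbers (mean of the two middle values if even).
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int(@sorted / 2);
    return @sorted % 2 ? $sorted[$mid]
                       : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

my @values = sort { $a <=> $b } map { amount_value($_) } ('$23 million', 'C$1,000', '$45');
print "smallest: $values[0], largest: $values[-1], median: ", median(@values), "\n";
```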

  23. Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*)bar/ ) { print "matched <$1>\n"; } • Output: • matched <d is under the bar in the > • Notes: • the default variable $_ is also the default target for matching, so the bare /foo(.*)bar/ carries an implicit $_ =~ • variable $1 refers to the parenthesized part of the match, here (.*)

  24. Shortest vs. Greedy Matching • default behavior in a Perl RE match: • take the longest possible matching string • aka greedy matching • This behavior can be changed, see next slide

  25. Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print "matched <$1>\n"; } • Output: • matched <d is under the > • Notes: • ? immediately following a repetition operator like * (or +) makes the operator work in non-greedy mode

  26. Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print "matched <$1>\n"; } • Output: • greedy, with (.*): matched <d is under the bar in the > • shortest, with (.*?): matched <d is under the >
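
The two behaviors can be run side by side. This sketch uses the same sentence and patterns as the slides; only the wrapper sub capture_between is an addition for illustration.

```perl
use strict;
use warnings;

# Return what (.*) or (.*?) captures between "foo" and "bar".
sub capture_between {
    my ($text, $greedy) = @_;
    my ($captured) = $greedy ? $text =~ /foo(.*)bar/
                             : $text =~ /foo(.*?)bar/;
    return $captured;
}

my $sentence = "The food is under the bar in the barn.";
print "greedy:   matched <", capture_between($sentence, 1), ">\n";
print "shortest: matched <", capture_between($sentence, 0), ">\n";
```

The greedy pattern runs on to the last "bar" (inside "barn"), while the non-greedy one stops at the first.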

  27. Shortest vs. Greedy Matching • RE search is supposed to be fast • but search time is not necessarily proportional to the length of the input being searched • in fact, Perl RE matching can take exponential time (in the length of the input) • matching is non-deterministic • it may need to backtrack (revisit earlier choices) if it matches incorrectly part of the way through • [Figure: two plots of time against length, one linear and one exponential]

  28. Global Matching: scalar context • g flag in the condition of a while-loop
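
The slide's example is not in the transcript. A minimal sketch of the idiom: in scalar context, each evaluation of a /g match resumes where the previous one stopped, so a while-loop visits the matches one at a time. The sub name word_positions is hypothetical; pos() is Perl's built-in that reports where the last /g match on a string ended.

```perl
use strict;
use warnings;

# Collect [word, end_offset] for every \w+ match, one match per iteration.
sub word_positions {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\w+)/g) {          # scalar context: next match each time
        push @found, [ $1, pos($text) ];
    }
    return @found;
}

for my $hit (word_positions("cat dog")) {
    print "$hit->[0] ends at offset $hit->[1]\n";
}
```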

  29. Global Matching: list context • g flag in list context
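
Again the slide's own example is missing, so this is only a sketch: in list context a /g match returns all captured substrings at once, and when the pattern has several groups they come back as one flat list. The sub name time_fields and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Return every hour and minute captured by (\d+):(\d+), as one flat list.
sub time_fields {
    my ($text) = @_;
    my @fields = $text =~ /(\d+):(\d+)/g;   # list context: all matches at once
    return @fields;
}

print join(" ", time_fields("open 10:30 and close 11:45")), "\n";
```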

  30. Split • @array = split /re/, string • splits string into a list of substrings split by re. Each substring is stored as an element of @array. • Examples (from perlrequick tutorial):
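
The examples themselves did not survive in the transcript. The sketch below is written in the spirit of the perlrequick examples rather than copied from them; the sub names split_words and split_list are hypothetical.

```perl
use strict;
use warnings;

# Split on runs of whitespace.
sub split_words {
    my ($string) = @_;
    return split /\s+/, $string;
}

# Split on a comma followed by optional whitespace.
sub split_list {
    my ($string) = @_;
    return split /,\s*/, $string;
}

print join("|", split_words("Calvin and Hobbes")), "\n";
print join("|", split_list("1.618,2.718,   3.142")), "\n";
```

One caveat worth knowing: with /\s+/ a string that begins with whitespace yields an empty first element, whereas the special pattern ' ' (a single space in quotes) skips leading whitespace.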

  31. Split

  32. Matched Positions

  33. Matched Positions
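
The content of slides 32 and 33 was not captured. Assuming they cover Perl's built-in position variables, here is a short sketch: after a successful match, $-[0] holds the offset where the match starts and $+[0] the offset just past its end. The sub name match_span is hypothetical.

```perl
use strict;
use warnings;

# Return (start, end) offsets of the first match of $re in $text,
# or an empty list if there is no match.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($-[0], $+[0]);               # start offset, offset past the end
}

my $text = "hello world";
my ($start, $end) = match_span($text, qr/world/);
print "matched '", substr($text, $start, $end - $start), "' at $start..$end\n";
```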
