1 / 20

LING/C SC/PSYC 438/538 Computational Linguistics

LING/C SC/PSYC 438/538 Computational Linguistics. Sandiway Fong Lecture 2: 8/23. Administrivia. Anyone try installing Perl yet? Active State Perl http://www.activestate.com install the free version (not the trial versions). Background Reading. Textbook

qabil
Download Presentation

LING/C SC/PSYC 438/538 Computational Linguistics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. LING/C SC/PSYC 438/538Computational Linguistics Sandiway Fong Lecture 2: 8/23

  2. Administrivia • Anyone try installing Perl yet? • Active State Perl • http://www.activestate.com • install the free version • (not the trial versions)

  3. Background Reading • Textbook • Chapter 2: Regular Expressions and Automata • Section 2.1: Regular Expressions • Perl Regular Expressions (RE) • perlrequick - Perl regular expressions quick start • http://perldoc.perl.org/perlrequick.html • perlretut - Perl regular expressions tutorial • http://perldoc.perl.org/perlretut.html

  4. Regular Expressions • regular expressions are used in string pattern-matching • important tool in automated searching • formally equivalent to finite-state automata (FSA) and regular grammars • popular implementations • Unix grep • command line program • returns lines matching a regular expression • standard part of all Unix-based systems • including MacOS X (command-line interface in Terminal) • many shareware/freeware implementations available for Windows XP • just Google and see... • grep functionality is built into many programming languages • e.g. Perl • wildcard search in Microsoft Word • limited version of regular expressions (not full power) • with differences in notation

  5. Historical note grep: name comes from Unixedcommand g/re/p “search globally for lines matching the regular expression, and print them” [Source: http://en.wikipedia.org/wiki/Grep] ed is an obscure and difficult-to-use text edit program on Unix systems doesn’t need a screen display would work on an ancient teletype Regular Expressions wikipedia

  6. Formally a regular expression (regexp) is formed from: an alphabet (= set of characters) operators a regexp is shorthand for a set of strings (possibly infinite set) (strings are of finite length) Formally a set of strings is called a language a language that can be defined by a regular expression is called a regular language (not all languages are regular) Regular Expressions

  7. alphabet e.g. {a,b,c,...,z} set of lower case English letters Note: case is important operators a single symbol a an exactly n occurrences of a, n a positive integer an a3 aaa a* zero or more occurrences of a a+ one or more occurrences of a concatenation two regexps may be concatenated, the resulting string is also a regexp e.g. abc disjunction infix operator: | (vertical bar) e.g. a|b parentheses may be used for disambiguation e.g. gupp(y|ies) Regular Expressions

  8. Technically, a+ is not necessary aa* = a+ “a concatenated with a* (zero or more occurrences of a)” = “one or more occurrences of a” Disjunction [set of characters] set of characters enclosed in square brackets means match one of the characters e.g. [aeiou] matches any of the vowels a, e,i, o or u but not d dash (-) shorthand for a range e.g. [a-e] matches a, b, c, d or e Regular Expressions

  9. Range defined over a computer character set typically ASCII originally a 7 bit character set 2^7 = 128 (0-127) different characters ASCII = American Standard Code for Information Interchange e.g. [A-z] [0-9A-Za-z] http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml Regular Expressions

  10. Regular Expressions: Microsoft Word • terminology: • wildcard search

  11. Regular Expressions: Microsoft Word Note:zero or more times is missing in Microsoft Word

  12. Perl uses the same notation as grep (textbook also uses grep notation) More shorthand question mark (?) means the previous regexp is optional e.g. colou?r matches color or colour metacharacters or operators like ? have a function to match a question mark, escape it using a backslash (\) e.g. why\? ? in Microsoft Word means match any character More shorthand period (.) stands for any character (except newline) e.g. e.t matches eat as well as eet caret sign (^) as the first character of a range of characters [^set of characters] means don’t match any of the characters mentioned (after the caret) e.g. [^aeiou] any character except for one of the vowels listed Regular Expressions

  13. Text files in Unix consists of sequences of lines separated by a newline character (LF = line feed) Typically, text files are read a line at a time by programs Matching in Perl and grep is line-oriented (can be changed in Perl) Differences in platforms for line breaking: Unix: LF Windows (DOS): CR LF MacOS (X): CR affects binary file transfers Regular Expressions

  14. Line-oriented metacharacters: caret (^) at the beginning of a regexp string matches the “beginning of a line” e.g. ^The matches lines beginning with the sequence The Note: the caret is very overloaded... [^ab] a^b dollar sign ($) at the end of a regexp string matches the “end of the line” e.g. end\.$ matches lines ending in the sequence end. e.g. ^$ matches blank lines only e.g. ^ $ matches lines contains exactly one space Regular Expressions

  15. Word-oriented metacharacters: a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z] (historical reasons for this) \b matches a word boundary, e.g. a space or beginning or end of a line or a non-word character e.g. the matches the, they, breathe and other but \bthe will only match the and they the\b will match the and breathe \bthe\b will only match the (\< and \> can also be used to match the beginning and end of a word) e.g. \b99 matches 99 but not 299 also matches $99 Regular Expressions

  16. Note: definition of word does not include characters with accent marks extended ASCII character set 8 bit characters 128-255 does not include non-Roman characters towards multilingual computing Unicode one massive table Regular Expressions

  17. Range abbreviations: \d (digit) = [0-9] \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF) \w (word character) = [0-9a-zA-Z_] uppercase versions denote negation e.g. \W means a non-word character \D means a non-digit Repetition abbreviations: a? a optional a* zero or more a’s a+ one or more a’s a{n,m} between n and m a’s a{n,} at least n a’s a{n} exactly n a’s e.g. \d{7,} matches numbers with at least 7 digits e.g. \d{3}-\d{4} matches 7 digit telephone numbers with a separating dash Regular Expressions

  18. Run from the command line in Windows Start > Run... cmd (brings up command line interpreter) Running a Perl program: perl -help (gives options) perl filename.pl (runs Perl command file filename.pl) perl filename.pl inputfile.txt (runs Perl command file filename.pl, inputfile.txt is supplied to filename.pl) e.g. filename.pl reads and processes input file inputfile.txt Perl

  19. Example Perl program (match.pl) to read in a text file and print lines matching a regexp enclosed by /.../ Example input file (text.txt) Command perl match.pl text.txt open (F,$ARGV[0]) or die "$ARGV[0] not found!\n"; while (<F>) { print $_ if (/The/); } Perl This is a test. The cat sat on the mat. These shoes are made for walking. Otherwise, I thought it was cold. 45

  20. Program: open (F,$ARGV[0]) or die "$ARGV[0] not found!\n"; while (<F>) { print $_ if (/The/); } while (<F>) first evaluates <F> <F> reads in a line from the file referenced by F and places the line in the program variable $_ then it executes the program code between the curly braces then it goes back and reads another line it does this repeatedly while <F> produces a valid line – if we reach the end of the file, the while loop stops print $_ if (/.../); is conditional code that means print the contents of variable $_ if the regexp between the /.../ can be found in $_ Perl • Program explained: • open the file referenced in $AGRV[0] for input • $AGRV[0] is the first command line argument following the program name • F is the file descriptor associated with the opened file • if there is a problem opening the file, e.g. file doesn’t exist, program execution dies and prints the value of the string enclosed in double quotes "$ARGV[0] not found!\n"

More Related