LING 408/508: Computational Techniques for Linguists

LING 408/508: Computational Techniques for Linguists Lecture 15 9/26/2012

Outline • File input/output • String pattern matching • Regular expressions • Short assignment #9

Filenames
• File names in Unix, etc., use forward slashes
    filename = '/mydata/myfile.txt'
• File names in Windows use backward slashes
    filename = 'C:\\mydata\\myfile.txt'
    filename = r'C:\mydata\myfile.txt'
• But in Python under Windows, you can also use forward slashes
• If you are going to run a program on both types of OS, it’s advisable to always use forward slashes

File input
    filename = 'C:/mydata/myfile.txt'
    # reference to file object
    f = open(filename, 'r')  # read
    # read() reads entire file as a single string
    file_as_string = f.read()
    f.close()  # can be omitted for reading files, but closing is good practice

File input: shorter
    filename = 'C:/mydata/myfile.txt'
    # doesn't use explicit file object reference
    file_as_string = open(filename, 'r').read()

Newlines in text files
• Suppose file contents are:
    We like to go
    to Burger King.
• Calling read() returns the entire file as one string, with newlines
    >>> s = open(filename, 'r').read()
    >>> s
    'We like to go\nto Burger King.\n'

Read one line at a time
    filename = 'C:/mydata/textfile.txt'
    f = open(filename, 'r')
    # readline() reads a single line as a string,
    # automatically advances to next line
    line1 = f.readline()
    line2 = f.readline()

Suppose file contents are:
    We like to go
    to Burger King.
• Calling readline() twice:
    line1 = f.readline()
    line2 = f.readline()
• Result:
    line1 is 'We like to go\n'
    line2 is 'to Burger King.\n'

Read all lines at once
    filename = 'C:/mydata/textfile.txt'
    lines = open(filename, 'r').readlines()
• Returns a list of strings
  • one for each line
  • Each string is terminated with newline
• Alternatively:
    f = open(filename, 'r')  # read
    lines = f.readlines()

Suppose file contents are:
    We like to go
    to Burger King.
• Calling readlines():
    lines = f.readlines()
• Result:
    lines is ['We like to go\n', 'to Burger King.\n']

Iterate through one line at a time
    for line in open(filename, 'r'):
        # code to process line
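The reading idioms above can be combined with a with statement, which closes the file automatically. This is a minimal sketch, not part of the original slides; the temporary file name and its contents are assumptions chosen to mirror the earlier example.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained.
# (The path and contents are illustrative assumptions.)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as f:
    f.write('We like to go\nto Burger King.\n')

# The with statement closes the file when the block ends,
# even if an error occurs while reading.
stripped_lines = []
with open(sample_path, 'r') as f:
    for line in f:
        # Each line still ends in '\n'; rstrip('\n') removes it.
        stripped_lines.append(line.rstrip('\n'))

print(stripped_lines)
```

Stripping the newline is usually what you want before splitting a line into words or comparing it to another string.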

File output
    filename = '/home/echan3/myfile.txt'  # Unix
    f = open(filename, 'w')  # write
    # write 2 strings
    # should include newline characters so that
    # each string is written on a separate line
    f.write('hello Python\n')
    f.write('goodbye Python\n')
    f.close()  # should close an output file

List all files in a directory: use the os module
    >>> import os
    >>> os.listdir('C:\\Python32\\')
    ['DLLs', 'Doc', 'include', 'Lib', 'libs', 'LICENSE.txt', 'NEWS.txt', 'python.exe', 'pythonw.exe', 'README.txt', 'tcl', 'Tools', 'w9xpopen.exe']

List of files matching a pattern: glob module
    >>> import glob
    >>> glob.glob('C:\\Python32\\')
    ['C:\\Python32\\']
    >>> glob.glob('C:\\Python32\\*')
    ['C:\\Python32\\DLLs', 'C:\\Python32\\Doc', 'C:\\Python32\\include', 'C:\\Python32\\Lib', 'C:\\Python32\\libs', 'C:\\Python32\\LICENSE.txt', 'C:\\Python32\\NEWS.txt', 'C:\\Python32\\python.exe', 'C:\\Python32\\pythonw.exe', 'C:\\Python32\\README.txt', 'C:\\Python32\\tcl', 'C:\\Python32\\Tools', 'C:\\Python32\\w9xpopen.exe']
    >>> glob.glob('C:\\Python32\\*.exe')
    ['C:\\Python32\\python.exe', 'C:\\Python32\\pythonw.exe', 'C:\\Python32\\w9xpopen.exe']

Recursively read files in a directory
    import fnmatch
    import os
    #rootPath = 'C:\\Users\\Arizona\\Desktop\\classes\\508 fa 12\\lectures\\'
    rootPath = r'C:\Users\Arizona\Desktop\classes\508 fa 12\lectures'
    pattern = '*.pptx'  # Can include any UNIX shell-style wildcards
    # list of strings with full path name
    filelist = []
    for root, dirs, files in os.walk(rootPath):
        for filename in fnmatch.filter(files, pattern):
            filelist.append(os.path.join(root, filename))
    for fi in filelist:
        print(fi)
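To try the recursive search without depending on a particular machine's folders, the same os.walk and fnmatch.filter calls can be run against a temporary directory. This is a sketch under assumptions: the directory layout and file names below are invented for illustration.

```python
import fnmatch
import os
import tempfile

# Build a tiny directory tree (hypothetical names) to search.
root_path = tempfile.mkdtemp()
os.makedirs(os.path.join(root_path, 'sub'))
for rel in ['a.txt', 'b.pptx', os.path.join('sub', 'c.txt')]:
    with open(os.path.join(root_path, rel), 'w') as f:
        f.write('placeholder\n')

# os.walk visits root_path and every directory below it;
# fnmatch.filter keeps only the file names matching the wildcard.
matches = []
for root, dirs, files in os.walk(root_path):
    for name in fnmatch.filter(files, '*.txt'):
        full_path = os.path.join(root, name)
        # Store paths relative to root_path so the output is short.
        matches.append(os.path.relpath(full_path, root_path))
matches.sort()

print(matches)
```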

Output
    >>>
    C:\Users\Arizona\Desktop\classes\508 fa 12\lectures\508-fa12-lecture-14.pptx
    C:\Users\Arizona\Desktop\classes\508 fa 12\lectures\508-fa12-lecture-15.pptx
    C:\Users\Arizona\Desktop\classes\508 fa 12\lectures\508-fa12-lecture-16.pptx

Example program • Produce a sorted list of all the word types in a corpus. Write to an output file. • Since we don’t need to represent duplicates of words, we’ll use a set to store the words. • To print them in sorted order, we need to convert the set back to a list, and then sort the list. • Use file alice.txt, on the course web page

Using set.add
    corpus_f = 'C:/Users/Arizona/Desktop/alice.txt'
    output_f = 'C:/Users/Arizona/Desktop/alice_vocab.txt'
    vocab = set()  # empty set; note that {} would create an empty dict
    for line in open(corpus_f, 'r'):
        toks = line.split()
        for t in toks:
            vocab.add(t)
    vocab_list = sorted(list(vocab))
    of = open(output_f, 'w')
    for w in vocab_list:
        of.write(w + '\n')
    of.close()

Using set.update
    corpus_f = 'C:/Users/Arizona/Desktop/alice.txt'
    output_f = 'C:/Users/Arizona/Desktop/alice_vocab.txt'
    vocab = set()  # empty set; note that {} would create an empty dict
    for line in open(corpus_f, 'r'):
        vocab.update(line.split())
    vocab_list = sorted(list(vocab))
    of = open(output_f, 'w')
    for w in vocab_list:
        of.write(w + '\n')
    of.close()
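The vocabulary-building step can be checked on a few in-memory lines before running it on alice.txt. A minimal sketch, assuming made-up sample lines; it also shows why the empty collection must be created with set(), since {} is an empty dict and has no add method.

```python
# Sample lines standing in for a corpus file (illustrative assumption).
sample_lines = [
    'We like to go\n',
    'to Burger King.\n',
    'We like to go home.\n',
]

# An empty set must be written set(); {} creates an empty dict.
print(type({}).__name__, type(set()).__name__)

vocab = set()
for line in sample_lines:
    # split() with no argument splits on any whitespace,
    # so the trailing newline is discarded.
    vocab.update(line.split())

# sorted() accepts any iterable and returns a new list.
vocab_list = sorted(vocab)
print(vocab_list)
```

Capitalized tokens sort before lower-case ones because sorting compares character codes, which matches the ordering seen in the alice.txt output.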

    >>> vocab_list[:10]
    ['"\'TIS', '"--SAID', '"Come', '"Coming', '"Edwin', '"French,', '"HOW', '"He\'s', '"How', '"I']
    >>> vocab_list[1000:1010]
    ['Twinkle,', 'Two', "Two!'", 'Two,', 'Two.', 'Ugh,', 'Uglification,', 'V.', 'VERY', 'VI.']
    >>> vocab_list[5000:5010]
    ["verdict,'", 'verse', "verse,'", "verse.'", 'verses', "verses.'", 'very', 'vinegar', 'violence', 'violent']
    >>> vocab_list[-10:]
    ['yours', 'yours."\'', 'yourself', "yourself!'", 'yourself,', "yourself,'", "yourself.'", 'youth,', "youth,'", 'zigzag,']

Want to find strings with particular properties in a corpus • Looking at single strings • Morphological forms • Capitalization variants of words • Words with particular consonant/vowel patterns • Looking across multiple strings • Sentences and sentence boundaries • Names of people, organizations, locations • Numeric strings • Dates • Monetary amounts • Telephone numbers

Postal codes in U.S. or Canada
    85702
    85702-1234
    02138+2631
    K1A 0B1
    K0H9Z0
    G0N 3M0
• What is the pattern?
• Either: 5 digits, optionally followed by either “-” or “+”, and 4 digits
• Or: cap-letter, digit, cap-letter, optional space, digit, cap-letter, digit
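Looking ahead to the regular-expression notation introduced later in the lecture, the two-part description above can be written as a single pattern. This is one possible sketch, not the official answer; the name postal_pattern and the non-matching test strings are assumptions.

```python
import re

# Either: 5 digits, optionally followed by - or + and 4 digits
# Or: cap-letter, digit, cap-letter, optional space, digit, cap-letter, digit
postal_pattern = (
    r'^(?:[0-9]{5}(?:[-+][0-9]{4})?'
    r'|[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9])$'
)

candidates = [
    '85702', '85702-1234', '02138+2631',   # U.S. style
    'K1A 0B1', 'K0H9Z0', 'G0N 3M0',        # Canadian style
    '8570', '85702-12', 'k1a 0b1',         # should not match
]

valid = [c for c in candidates if re.search(postal_pattern, c)]
print(valid)
```

The (?: ... ) groups only group; they are used here so that ^ and $ apply to both alternatives.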

Adverb-adjective phrases
    good
    bad
    very good
    very bad
    very, very good
    very, very bad
    very, very, very good
    very, very, very bad
    very, very, very, very good
    very, very, very, very bad
• What is the general pattern?
• Optionally: any number of “very, ” followed by one “very ”
• And then followed by: either “good” or “bad”
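The same description can be expressed with the regular-expression operators covered later in the lecture (grouping, * for repetition, ? for optionality, | for choice). A hedged sketch; the variable names and the rejected examples are assumptions.

```python
import re

# Optionally: any number of "very, " followed by one "very "
# Then: either "good" or "bad"
very_pattern = r'^(?:(?:very, )*very )?(?:good|bad)$'

phrases = [
    'good',
    'very bad',
    'very, very good',
    'very, very, very bad',
    'very, good',       # comma without a following "very "
    'very very good',   # missing comma
]

accepted = [p for p in phrases if re.search(very_pattern, p)]
print(accepted)
```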

Morphological variants (simplified) beat, beats, beating drink, drinks, drinking fill, fills, filling • What is the pattern? • String of letters followed by one of the following: • nothing • s • ing

We won’t learn how to account for these: • like, liking • win, winning • drink, drank • Need finite-state transducers • Take LING 438/538

Regular expressions • Automatically search for patterns within a text • Concise and flexible syntax for specifying patterns of interest • Required characters • Optional characters • Repeated sequences • Generative capacity of a regular language / finite-state automaton

Pattern-specific matching through code
• Want to search for:
  • 2 words that begin with a capital letter, followed by 1 or more lower-case letters, and joined together by a hyphen
  • Winston-Salem, Wilkes-Barre, Sedro-Woolley
    def test(s):
        # find() returns -1 when there is no hyphen; index() would raise ValueError
        hyphen_idx = s.find('-')
        if hyphen_idx == -1:
            return False
        s1 = s[:hyphen_idx]
        s2 = s[hyphen_idx+1:]
        # check the length first so that s1[0] and s2[0] are safe to index
        b1 = len(s1) >= 2 and s1[0].isupper()
        b2 = all([x.islower() for x in s1[1:]])
        b3 = len(s2) >= 2 and s2[0].isupper()
        b4 = all([x.islower() for x in s2[1:]])
        return b1 and b2 and b3 and b4

Same example with regular expression
• Want to search for:
  • 2 words that begin with a capital letter, followed by 1 or more lower-case letters, and joined together by a hyphen
  • Winston-Salem, Wilkes-Barre, Sedro-Woolley
• With regular expression:
    # w is a single word from the corpus;
    # ^ and $ require the pattern to match the whole string
    import re
    re.search('^[A-Z][a-z]+-[A-Z][a-z]+$', w)
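Because ^ and $ anchor the pattern to the start and end of the string, the anchored pattern is suited to testing one word at a time. The sketch below contrasts that with searching running text, where word boundaries (\b) are one possible substitute for the anchors; the sample words and sentence are assumptions.

```python
import re

word_pattern = r'^[A-Z][a-z]+-[A-Z][a-z]+$'

# Testing individual words: the anchors make the whole word match.
words = ['Winston-Salem', 'Wilkes-Barre', 'Sedro-Woolley',
         'e-mail', 'Tucson', 'A-B']
hits = [w for w in words if re.search(word_pattern, w)]
print(hits)

# Searching a longer string: the anchored pattern fails, because the
# string as a whole is not a single hyphenated word.
corpusstr = 'They drove from Winston-Salem to Wilkes-Barre.'
anchored = re.search(word_pattern, corpusstr)
print(anchored)

# Replacing ^ and $ with \b (word boundary) finds the words in context.
in_text = re.findall(r'\b[A-Z][a-z]+-[A-Z][a-z]+\b', corpusstr)
print(in_text)
```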

re module and re.search
• re.search(p, s) checks whether the pattern p can be found somewhere inside the string s.
    >>> import re
    >>> print(re.search('a', 'q'))
    None
    >>> print(re.search('a', 'a'))
    <_sre.SRE_Match object at 0x056F1AD8>
• From list of words, create list of words ending in -ed
    # Let corpus be a list of strings
    matches = [w for w in corpus if re.search('ed', w)]
• Result is not what we want, because 'ed' is matched anywhere in the word:
    ['proposed', 'combined', 'reduce', 'experienced', 'urged', 'remedy', 'recommended', 'urged']

Boundary characters
• Use $ to specify match at end of string
    [w for w in wordlist if re.search('ed$', w)]
    ['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]
• Use ^ to specify match at beginning of string
    [w for w in wordlist if re.search('^ed', w)]
    ['ed.', 'edema', 'edematous', 'edentulous', 'edge', 'edged', 'edges', 'edgewise', 'edging', 'edgy', 'edible', 'edifice', 'edified', 'edifying', 'edit', 'edited', 'editing']
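The effect of the two anchors is easy to see on a short word list. A small sketch with an invented list (an assumption, not the course word list):

```python
import re

wordlist = ['proposed', 'reduce', 'remedy', 'urged', 'edit', 'edge', 'bird']

# No anchor: 'ed' may occur anywhere in the word.
contains_ed = [w for w in wordlist if re.search('ed', w)]

# $ anchors the match at the end of the string.
ends_ed = [w for w in wordlist if re.search('ed$', w)]

# ^ anchors the match at the beginning of the string.
starts_ed = [w for w in wordlist if re.search('^ed', w)]

print(contains_ed)
print(ends_ed)
print(starts_ed)
```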

Ranges
• Specify a range of characters with - and square brackets
    [s for s in corpus if re.search('^[a-z][A-Z]$', s)]
    ['pH', 'mV', 'pH', 'pH', 'pH', 'pH', 'pH', 'pH']
    [s for s in corpus if re.search('^[0-9][A-Z]$', s)]
    ['7A', '9N', '1M', '0C', '0C', '0C']
• Square brackets can also list individual characters
    [s for s in corpus if re.search('^[ghi][mno][jlk][def]$', s)]
    ['hold', 'hold', 'hold', 'hold', 'hold', 'hole', 'hole', 'golf', 'golf', 'hole', 'golf', 'gold']

Optionality (match zero or one)
• ? specifies that the previous character is optional.
• Match “email” and “e-mail”:
    '^e-?mail$'
• Apply to ranges:
    '^[ghi]?[mno]?[jlk]?[def]?$'
    ['e', 'f', 'g', 'gm', 'go', 'god', 'gold', 'golf', 'h', 'he', 'hold', 'hole', 'i', 'if', 'ij', 'in', 'ink', 'j', 'je', 'k', 'l', 'le', 'm', 'me', 'ml', 'n', 'ne', 'o', 'of', 'ol', 'old', 'ole']

Closure (match zero/one or more)
• Match one or more: +
    '^[ghi]+[mno]+[jlk]+[def]+$'
    ['gold', 'golf', 'hold', 'hole', 'holed', 'hooked']
• Match zero or more: *
    '^m*i*n*e*$'
    ['e', 'i', 'in', 'inn', 'm', 'me', 'mee', 'min', 'mine', 'mm', 'n', 'ne']
• Equivalent:
    '^aa*'
    '^a+'
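The three repetition operators (?, +, *) can be compared side by side. The word lists below are assumptions chosen so that each pattern accepts some words and rejects others.

```python
import re

# ? : the preceding character may appear zero or one time
optional_hits = [w for w in ['email', 'e-mail', 'e--mail', 'mail']
                 if re.search(r'^e-?mail$', w)]

# + : each bracketed set must appear one or more times, in order
plus_hits = [w for w in ['gold', 'hooked', 'old', 'holed', 'go']
             if re.search(r'^[ghi]+[mno]+[jlk]+[def]+$', w)]

# * : each letter may appear zero or more times, but in this order
star_hits = [w for w in ['mine', 'mm', 'inn', 'nim', 'mien']
             if re.search(r'^m*i*n*e*$', w)]

print(optional_hits)
print(plus_hits)
print(star_hits)
```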

Match specific quantity, or a range
• Match phone numbers: XXX-XXXX
  Equivalent:
    '^[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
    '^[0-9]{3}-[0-9]{4}$'
• Match numbers with between 10 and 13 digits, inclusive:
    '^[0-9]{10,13}$'
• Match numbers having >= 13 digits:
    '^[0-9]{13,}$'
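A quick check of the counted-repetition syntax, using invented phone-like strings and digit strings of known length (both are assumptions for illustration):

```python
import re

# {3} and {4}: exactly that many repetitions
phone_hits = [s for s in ['555-1234', '5551234', '55-12345', '555-12345']
              if re.search(r'^[0-9]{3}-[0-9]{4}$', s)]

# Digit strings of length 9, 10, 13, and 14
digit_strings = ['1' * n for n in (9, 10, 13, 14)]

# {10,13}: between 10 and 13 repetitions, inclusive
lengths_10_to_13 = [len(s) for s in digit_strings
                    if re.search(r'^[0-9]{10,13}$', s)]

# {13,}: 13 or more repetitions
lengths_13_plus = [len(s) for s in digit_strings
                   if re.search(r'^[0-9]{13,}$', s)]

print(phone_hits)
print(lengths_10_to_13)
print(lengths_13_plus)
```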

Complement operator
• When ^ is the first character inside square brackets, it matches everything except what’s in the brackets
  • Different from usage to specify beginning of string
• Match any character other than a vowel:
    [^aeiouAEIOU]
• Match strings containing no vowels, such as 435532, grrr, cyb3r, mmm, and zzzzzzzz
    '^[^aeiouAEIOU]+$'
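The pattern below uses ^ in both of its roles: outside the brackets it anchors the match to the start of the string, and inside the brackets it negates the character set. The test strings are taken from the slide, plus two assumed counterexamples.

```python
import re

no_vowel_pattern = r'^[^aeiouAEIOU]+$'

strings = ['435532', 'grrr', 'cyb3r', 'mmm', 'vowel', 'Ugh']
no_vowels = [s for s in strings if re.search(no_vowel_pattern, s)]

print(no_vowels)
```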

Wildcard operator
• The period . matches any character (except a newline, by default)
• Match 8-character words with a j as the 3rd character and t as the 6th
    '^..j..t..$'
    ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic']
• Match all words ending in -ing:
    '^.+ing$'
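A short sketch of the wildcard, including the one character it does not match by default. The word list is an assumption; re.DOTALL is the standard flag that lets . match a newline as well.

```python
import re

words = ['abjectly', 'majestic', 'subjects', 'object']
wildcard_hits = [w for w in words if re.search(r'^..j..t..$', w)]
print(wildcard_hits)

# By default . does not match a newline character.
default_match = re.search(r'a.b', 'a\nb')
# With the re.DOTALL flag, . matches any character, newline included.
dotall_match = re.search(r'a.b', 'a\nb', re.DOTALL)

print(default_match is None, dotall_match is not None)
```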

Special characters

Special characters: examples
• Phone numbers
    '^[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$'
    '^[0-9]{3}-[0-9]{4}$'
    r'^\d{3}-\d{4}$'
• Strings consisting of alphanumeric or underscore characters only
    '^[a-zA-Z0-9_]+$'
    r'^\w+$'
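For plain ASCII text the shorthand classes behave like the bracketed ranges above. One caveat worth knowing in Python 3: with ordinary (str) patterns, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is given. A minimal sketch; the test strings are assumptions.

```python
import re

# \d as shorthand for a digit
phone_ok = re.search(r'^\d{3}-\d{4}$', '555-1234') is not None

# 'na\u00efve' is "naive" written with i-diaeresis (a non-ASCII letter).
word = 'na\u00efve'

# The explicit ASCII ranges reject the accented letter...
ranges_match = re.search(r'^[a-zA-Z0-9_]+$', word) is not None
# ...while \w accepts it by default in Python 3...
w_match = re.search(r'^\w+$', word) is not None
# ...unless the re.ASCII flag restricts \w to [a-zA-Z0-9_].
w_ascii_match = re.search(r'^\w+$', word, re.ASCII) is not None

print(phone_ok, ranges_match, w_match, w_ascii_match)
```

For linguistic data containing accented or non-Latin characters, the default Unicode-aware behavior of \w is often the more useful one.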

Backslash and regular expressions
• Match a word boundary, which is written \b in a regular expression:
    pattern = '\b'    # wrong
    pattern = '\\b'   # correct
    pattern = r'\b'   # correct
• Python interprets a regular expression just like any other string
• Python interprets '\b' as a backspace character, rather than two separate characters \ and b.
• Solutions
  • Escape backslashes: '\\b'
  • Specify as raw string, so Python does not interpret backslashed characters
• You could just always specify regular expressions as raw strings
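The difference between the three spellings can be verified directly by checking string lengths and then using each as a pattern. A small sketch; the sample text is an assumption.

```python
import re

# '\b' is a single backspace character; the other two spellings are
# the two characters backslash + b, which the re module reads as
# "word boundary".
plain_len = len('\b')
escaped_len = len('\\b')
same_string = ('\\b' == r'\b')
print(plain_len, escaped_len, same_string)

text = 'cat concat cats cat.'

# Raw string: \b reaches the re module intact and means word boundary.
boundary_hits = re.findall(r'\bcat\b', text)

# Ordinary string: the pattern contains literal backspace characters,
# which do not occur in the text, so nothing matches.
backspace_hits = re.findall('\bcat\b', text)

print(boundary_hits)
print(backspace_hits)
```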

Match special characters
• For example, match the string '+'
    pattern = r'^\+$'
• Also applies to characters such as: * . [
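When the text to be matched comes from data rather than being typed by hand, the standard-library function re.escape adds the backslashes for you. A brief sketch; the example strings are assumptions.

```python
import re

# A backslash makes + a literal character instead of an operator.
plus_match = re.search(r'^\+$', '+') is not None

# re.escape inserts backslashes before regex special characters.
escaped = re.escape('3.5')
print(escaped)

# Unescaped, the . is a wildcard and also matches '315'.
loose = re.search('3.5', '315') is not None
# Escaped, the . must be a literal period.
strict = re.search(escaped, '315') is not None

print(plus_match, loose, strict)
```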

Disjunction and grouping
• Disjunction operator | allows for a choice between two or more matching substrings
• Use parentheses to group
  • Without parentheses, | separates the entire pattern: '^.+ed|ing$' means '^.+ed' or 'ing$'
• Words ending in -ed or -ing:
    '^.+(ed|ing)$'
• Words having a p, then ending in -ed or -ing:
    '^.+p(ed|ing)$'
• Words having a p, then ending in -ed, -ing, or -s:
    '^.+p(ed|ing|s)$'
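Since | has the lowest precedence of the regular-expression operators, the placement of parentheses changes what counts as an alternative. The sketch below compares an ungrouped and a grouped pattern on an assumed word list.

```python
import re

words = ['helped', 'singing', 'medical', 'ing', 'sing']

# Ungrouped: the alternatives are '^.+ed' and 'ing$'.
ungrouped = [w for w in words if re.search(r'^.+ed|ing$', w)]

# Grouped: one or more characters, then 'ed' or 'ing', then end of string.
grouped = [w for w in words if re.search(r'^.+(ed|ing)$', w)]

print(ungrouped)
print(grouped)
```

The ungrouped pattern accepts 'medical' (it begins with a character followed by 'ed') and the bare string 'ing', neither of which is a word plus an -ed or -ing ending.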

re.findall
• Use re.findall to return the matching strings
    >>> word = 'supercalifragilisticexpialidocious'
    >>> re.findall(r'[aeiou]', word)
    ['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
• Compare to re.search, which returns a match object for the first match only:
    >>> re.search('[aeiou]', word)
    <_sre.SRE_Match object at 0x059E1DB0>
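A match object returned by re.search can be inspected with its group() and start() methods, which makes the contrast with re.findall concrete. A minimal sketch using the word from the slide:

```python
import re

word = 'supercalifragilisticexpialidocious'

# re.search stops at the first match and returns a match object.
first = re.search(r'[aeiou]', word)
print(first.group(), first.start())

# re.findall returns every non-overlapping match as a list of strings.
all_vowels = re.findall(r'[aeiou]', word)
print(len(all_vowels))
```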

Stemming
• Retrieve the stem of a word ending in a particular suffix
    >>> re.findall(r'^.*(ing|ed|es|s)$', 'processing')
    ['ing']
• Whoops, the parentheses have selected the material to extract
• Use ?: to specify only disjunction and not selection
    >>> re.findall(r'^.*(?:ing|ed|es|s)$', 'processing')
    ['processing']
• Whoops, want to break apart into stem and suffix
• Use parentheses for both stem and suffix extraction
    >>> re.findall(r'^(.*)(ing|ed|es|s)$', 'processing')
    [('process', 'ing')]

Greedy matching
    >>> re.findall(r'^(.*)(ing|ed|es|s)$', 'processes')
    [('processe', 's')]
• Whoops, we want the longest suffix -es
• Default is greedy match (match as far as possible)
• Specify a non-greedy match with *?
    >>> re.findall(r'^(.*?)(ing|ed|es|s)$', 'processes')
    [('process', 'es')]
• The problem is not simple: this solution doesn’t work for all words
    >>> re.findall(r'^(.*?)(ing|ed|es|s)$', 'tables')
    [('tabl', 'es')]
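The non-greedy pattern can be wrapped in a small helper that also handles words with no listed suffix. This is an illustrative sketch only: split_suffix and SUFFIX_PATTERN are hypothetical names, and the helper inherits the limitation shown above for words like 'tables'.

```python
import re

# Non-greedy stem (.*?) followed by one of the listed suffixes.
SUFFIX_PATTERN = re.compile(r'^(.*?)(ing|ed|es|s)$')

def split_suffix(word):
    """Return (stem, suffix); suffix is '' if none of the suffixes match."""
    m = SUFFIX_PATTERN.search(word)
    if m is None:
        return (word, '')
    # groups() returns the text captured by each pair of parentheses.
    return m.groups()

for w in ['processing', 'processes', 'tables', 'walk']:
    print(w, split_suffix(w))
```

Handling cases such as 'tables' correctly requires knowledge of the language's morphology rather than a longer suffix list alone.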

Summary

Due 9/28 • We will go over answers in class.

1. Write a regular expression that matches hyphenated sequences of 3 words • Examples: • father-in-law • savings-and-loan • Assume alphabetic characters only
