This lecture outlines essential computational techniques for linguists focusing on string manipulation in Python. It covers the declaration of strings using single and double quotes, the creation of multi-line strings via triple quotes, and the use of escape sequences for special characters. The session explores string indexing, slicing, concatenation, repetition, and methods for type conversion. Additionally, it highlights built-in functions related to string handling, illustrating how to read user input and manipulate strings programmatically.
LING 408/508: Computational Techniques for Linguists Lecture 14 9/24/2012
Outline
• Strings
• Print function and string formatting
• Long HW #4
Declaring strings
• Strings can be enclosed in single quotes or double quotes
>>> s1 = 'spam'
>>> s2 = "spam"
>>> s5 = ''    # empty string
>>> s6 = ""    # empty string
• Multi-line strings: triple quotes
>>> s3 = '''spam
and eggs'''
>>> s4 = """spam
and eggs"""
Escape sequences: print special characters
• Whitespace characters (other than space):
\t    tab character
\n    newline
• Example:
>>> print('help\t\tme\nplease')
help            me
please
• A multi-line string contains a newline
>>> '''spam
and eggs'''
'spam\nand eggs'
Escape sequences
• Quotes need to be escaped to distinguish them from the markers that begin and end the string
\'    single quote
\"    double quote
• Example:
>>> 'spa\'m'
"spa'm"
>>> "spa\"m"
'spa"m'
Escape sequences
• However, double quotes may be used without escaping in a single-quoted string, and vice versa
>>> s1 = "I ask, \"where is Sarah?\""
>>> s2 = 'I ask, \'where is Sarah?\''
>>> s3 = "I ask, 'how is David?'"
>>> s4 = 'I ask, "how is David?"'
Backslash
• Since the backslash is interpreted as the start of an escape sequence, we need an escape sequence when we want the backslash character itself in a string
\\    backslash
>>> s1 = "hello \napalm"
>>> print(s1)
hello
apalm
>>> s2 = "hello \\napalm"
>>> print(s2)
hello \napalm
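One way to check how an escape sequence is stored is to compare string lengths. A small sketch (the variable names are illustrative only):

```python
# An escape sequence is typed with two characters but stored as one.
newline = "\n"       # a single newline character
two_chars = "\\n"    # a backslash followed by the letter n

print(len(newline))     # 1
print(len(two_chars))   # 2

# repr() shows a string the way it would be typed in source code.
print(repr("hello \\napalm"))
```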
Raw strings
• A raw string is indicated by an r preceding the string
• Raw strings turn off the escape mechanism; Python interprets the string contents literally
• Especially useful for file names
• Example:
filename1 = 'C:\\mydir\\myfile.txt'
filename2 = r'C:\mydir\myfile.txt'
One catch with raw strings
• Due to limitations of the tokenizer, a raw string may not end in a single backslash.
• Doesn’t work:
>>> dir1 = r'C:\mydir\'
SyntaxError: EOL while scanning string literal
• Must use doubled backslashes in an ordinary string, or omit the final backslash:
>>> dir2 = 'C:\\mydir\\'
>>> dir3 = r'C:\mydir'
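If the trailing backslash is needed, one possible workaround is to keep the raw string for the body of the path and append the final separator from an ordinary escaped literal. A sketch, with illustrative variable names:

```python
# A raw string cannot end in a single backslash, so the final separator
# is appended from an ordinary (escaped) string literal.
dir_raw = r'C:\mydir' + '\\'
dir_escaped = 'C:\\mydir\\'

print(dir_raw)                   # C:\mydir\
print(dir_raw == dir_escaped)    # True
```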
Indexing and slicing strings, just like lists
>>> s = 'python'
>>> s[3]
'h'
>>> s[:3]
'pyt'
>>> s[3:]
'hon'
>>> s[2:4]
'th'
>>> s[2:-2]
'th'
Strings are immutable (can’t be modified)
>>> s = 'python'
>>> s[0] = 'x'
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    s[0] = 'x'
TypeError: 'str' object does not support item assignment
>>> L = [1,2,3,4]    # but lists are mutable
>>> L[0] = 5
>>> L
[5, 2, 3, 4]
Create new strings
>>> s1 = 'python'
>>> s2 = 'big ' + s1
>>> s2
'big python'
>>> s3 = s2[:4] + 'ball ' + s2[4:]
>>> s3
'big ball python'
Looping over strings
>>> s1 = 'python'
>>> for c in s1:
...     print(c)
...
p
y
t
h
o
n
List comprehension over strings
>>> L = ['chicken', 'pot', 'pie']
>>> L2 = [s[::-1] for s in L]
>>> L2
['nekcihc', 'top', 'eip']
String operators
• Concatenation
>>> 'ham' + 'eggs'
'hameggs'
• Repetition
>>> 'eggs' * 3
'eggseggseggs'
• Membership
>>> 'cd' in 'abcde'
True
• Comparison operators
>>> 'b' < 'a'
False
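String comparison works character by character on Unicode code points, which is why uppercase letters sort before lowercase ones. A brief sketch (the word list is made up for illustration):

```python
# Strings compare character by character using Unicode code points.
print(ord('a'), ord('b'), ord('Z'))   # 97 98 90
print('b' < 'a')                      # False
print('Zebra' < 'apple')              # True: 'Z' (90) comes before 'a' (97)

words = ['pie', 'Pot', 'chicken']
plain_order = sorted(words)                  # ['Pot', 'chicken', 'pie']
caseless_order = sorted(words, key=str.lower)  # ['chicken', 'pie', 'Pot']
print(plain_order)
print(caseless_order)
```

Passing `key=str.lower` is one way to get a case-insensitive ordering without changing the strings themselves.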
Built-in functions
>>> str()    # string constructor
''
>>> str('hello')
'hello'
>>> str([1,2,3,4,5])
'[1, 2, 3, 4, 5]'
>>> str((1, 'bye', 3.14))
"(1, 'bye', 3.14)"
Built-in functions
>>> len('spam')
4
• Type conversion through type constructors: useful for reading data from files (convert string to numeric types)
>>> int('356')
356
>>> float('3.56')
3.56
>>> str(356)
'356'
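As a sketch of how these conversions are used on file data, the example below parses one tab-separated line. The line contents and field layout are made up for illustration, and `strip` and `split` are string methods covered later in the lecture:

```python
# A line as it might be read from a tab-separated data file
# (hypothetical contents).
line = 'spam\t356\t3.56\n'

fields = line.strip().split('\t')   # ['spam', '356', '3.56']
word = fields[0]
count = int(fields[1])              # string -> int
score = float(fields[2])            # string -> float

# After conversion the values support arithmetic.
print(word, count + 1, score * 2)
```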
Built-in functions
>>> a = input()    # read a string from the user
hello
>>> a
'hello'
>>> b = input()
[3,4,5]
>>> b
'[3,4,5]'
>>> c = eval(input())    # eval converts type
3
>>> c    # result is an integer
3
>>> d = eval(input())
[3,4,5]
>>> d
[3, 4, 5]
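Note that `eval` runs whatever expression the user types. When only literal values such as numbers and lists are expected, the standard library's `ast.literal_eval` is a more restrictive alternative. A sketch, with the typed text hard-coded so it runs without a keyboard:

```python
import ast

# Text as it might come back from input(); hard-coded for the example.
typed = '[3,4,5]'

value = ast.literal_eval(typed)   # accepts Python literals only
print(value, type(value))

# Anything that is not a plain literal is rejected instead of being run.
try:
    ast.literal_eval('len("spam")')
    rejected = False
except ValueError:
    rejected = True
print('function call rejected:', rejected)
```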
Many string methods
>>> dir(str)
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
String methods
>>> s = 'how_are_you'
>>> s.split('_')
['how', 'are', 'you']
>>> s = 'how are\tyou\n'
>>> s.split()    # splits on whitespace
['how', 'are', 'you']
>>> ''.join(['how', 'are', 'you'])
'howareyou'
>>> '_'.join(['how', 'are', 'you'])
'how_are_you'
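Because `split()` with no argument discards runs of whitespace, combining it with `join` is a common way to normalize spacing. A minimal sketch on made-up input:

```python
sentence = '  how   are\tyou\n'   # made-up input with messy whitespace

tokens = sentence.split()         # ['how', 'are', 'you']
normalized = ' '.join(tokens)     # 'how are you'

print(tokens)
print(normalized)
```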
String methods
>>> s = ' how are you\t'
>>> s.strip()
'how are you'
>>> s    # doesn’t modify the string
' how are you\t'
>>> s.lstrip()
'how are you\t'
>>> s.rstrip()
' how are you'
String methods
>>> s = 'good morning'
>>> s.startswith('goo')
True
>>> s.endswith('ning')
True
>>> s[:3] == 'goo'
True
>>> s[-4:] == 'ning'
True
String methods
>>> s = 'how are you\n'
>>> s.upper()    # shows return value
'HOW ARE YOU\n'
>>> s    # string was not modified
'how are you\n'
>>> s = s.upper()    # rebind s to the new string
>>> s
'HOW ARE YOU\n'
>>> s.lower()
'how are you\n'
>>> s.islower()    # returns boolean
False
>>> 'HELLO'.isupper()
True
count and replace
>>> s = 'ab1cd1ef1gh'
>>> s.count('1')
3
>>> s.replace('1', ' ')
'ab cd ef gh'
>>> s
'ab1cd1ef1gh'
find and index
>>> s = 'howareyou'
>>> s.find('are')
3
>>> s.index('are')
3
>>> s.find('no')    # not in 'howareyou'
-1
>>> s.index('no')
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    s.index('no')
ValueError: substring not found
find and index
>>> help(s.find)
Help on built-in function find:
find(...)
    S.find(sub [,start [,end]]) -> int
    Return the lowest index in S where substring sub is found,
    such that sub is contained within s[start:end]. Optional
    arguments start and end are interpreted as in slice notation.
    Return -1 on failure.
>>> help(s.index)
Help on built-in function index:
index(...)
    S.index(sub [,start [,end]]) -> int
    Like S.find() but raise ValueError when the substring is not found.
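The optional `start` argument makes it possible to continue a search from where the previous match ended. A sketch of collecting every occurrence this way; the helper name `find_all` is hypothetical, not a built-in:

```python
def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    positions = []
    if not sub:                 # an empty substring would never advance
        return positions
    start = text.find(sub)
    while start != -1:          # find returns -1 when nothing is left
        positions.append(start)
        start = text.find(sub, start + len(sub))
    return positions

print(find_all('ab1cd1ef1gh', '1'))   # [2, 5, 8]
print(find_all('howareyou', 'no'))    # []
```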
rfind
>>> s = 'ab.cdef.ghi'
>>> s.find('.')
2
>>> s.rfind('.')
7
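A typical use of `rfind` is splitting a file name at its last period. A short sketch with a hypothetical file name:

```python
filename = 'corpus.v2.txt'    # hypothetical file name

dot = filename.rfind('.')     # index of the last period, or -1 if none
if dot == -1:
    base, ext = filename, ''
else:
    base, ext = filename[:dot], filename[dot + 1:]

print(base)   # corpus.v2
print(ext)    # txt
```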
isalpha, isalnum, isdigit
>>> str.isalpha('abcde')    # str is name of type
True
>>> str.isalpha('abcde1')
False
>>> str.isalpha('abcde+++')
False
>>> str.isalnum('abcde12345')
True
>>> str.isdigit('abcde12345')
False
>>> str.isdigit('12345')
True
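These tests are more often called as methods on the string itself, which is equivalent to calling them through `str`. A sketch that sorts made-up tokens into categories:

```python
tokens = ['hello', 'R2D2', '42', 'etc.']   # made-up tokens

# Calling the method on the string is equivalent to str.isalpha(s).
same = 'hello'.isalpha() == str.isalpha('hello')
print(same)   # True

words = [t for t in tokens if t.isalpha()]     # letters only
numbers = [t for t in tokens if t.isdigit()]   # digits only
alnum = [t for t in tokens if t.isalnum()]     # letters and/or digits

print(words)     # ['hello']
print(numbers)   # ['42']
print(alnum)     # ['hello', 'R2D2', '42']
```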
Useful string constants (in module string, which is rather obsolete)
>>> import string    # not same as str!
>>> dir(string)
['Formatter', 'Template', '_TemplateMetaclass', '__builtins__', '__cached__', '__doc__', '__file__', '__name__', '__package__', '_multimap', '_re', '_string', 'ascii_letters', 'ascii_lowercase', 'ascii_uppercase', 'capwords', 'digits', 'hexdigits', 'octdigits', 'printable', 'punctuation', 'whitespace']
>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
Outline
• Strings
• Print function and string formatting
• Long HW #4
String formatting
• New style
  • format method on strings
  • Covered first
• Old style
  • Based on the C printf function, also found in C++, Java, etc.
  • Much code that you’ll see uses this style
  • Covered afterwards
print function
>>> print(5)
5
>>> print(3, 5, 7)
3 5 7
>>> print(3, 5, 7, sep='->')    # default: sep=' '
3->5->7
>>> for i in range(3):          # default: end='\n'
...     print(i, end=',')
...
0,1,2,
format method
>>> s = '{} is fun'
>>> s
'{} is fun'
>>> s.format('recursion')
'recursion is fun'
>>> s2 = s.format('recursion')
>>> s2
'recursion is fun'
>>> print(s.format('recursion'))    # print shows no quotes
recursion is fun
format method
>>> s = 'Sarah'
>>> d = 'David'
>>> print('{0} is neurotic'.format(d))
David is neurotic
>>> print('{} is also neurotic'.format(s))
Sarah is also neurotic
>>> print('{0} is wife of {1}'.format(s, d))
Sarah is wife of David
>>> print('{} is wife of {}'.format(s, d))
Sarah is wife of David
>>> print('{1} is husband of {0}'.format(s, d))
David is husband of Sarah
Codes for different types
>>> s = 'Sarah'
>>> d = 'David'
>>> # s: string
>>> print('{0:s}\'s car broke down'.format(d))
David's car broke down
>>> print('{0}\'s car broke down'.format(s))
Sarah's car broke down
>>> print('{0:d} is a crowd'.format(3))    # d: integer
3 is a crowd
>>> print('{0:f} is a crowd'.format(3))    # f: float
3.000000 is a crowd
Floating-point precision
>>> # default: 6 decimals for a float
>>> print('{0:f} is tasty'.format(3.14159))
3.141590 is tasty
>>> # specify 2 decimal places
>>> print('{0:.2f} is tasty'.format(3.14159))
3.14 is tasty
>>> # rounds, instead of truncating
>>> print('{0:.4f} is tasty'.format(3.14159))
3.1416 is tasty
>>> # to truncate instead, slice the string form
>>> s = '{0}'.format(3.14159)    # '3.14159'; the s code is not valid for a float
>>> print(s[:s.rfind('.')+4])
3.141
Justification and padding
>>> print('---{:6d}---'.format(12))     # default
---    12---
>>> print('---{:06d}---'.format(12))    # pad with zero
---000012---
>>> print('---{:>6d}---'.format(12))    # right justify
---    12---
>>> print('---{:<6d}---'.format(12))    # left justify
---12    ---
>>> print('---{:^6d}---'.format(12))    # center
---  12  ---
Different defaults for justification
>>> print('---{:6d}---'.format(5))       # numbers: right-justified
---     5---
>>> print('---{:6s}---'.format('hi'))    # strings: left-justified
---hi    ---
Print justified tables
>>> X = ['wee', 'longer', 'very-long']
>>> Y = [1, 2, 33]
>>> for i in range(len(X)):
...     s = '{:>15s}\t\t{:<d}'.format(X[i], Y[i])
...     print(s)
...
            wee         1
         longer         2
      very-long         33
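The same table can be built without a hard-coded column width. The sketch below is one possible variation, not the slide's method: it computes the width from the data, pairs the lists with `zip`, and uses spaces rather than tabs so that both columns line up.

```python
X = ['wee', 'longer', 'very-long']
Y = [1, 2, 33]

width = max(len(x) for x in X)   # widest label (9 here)

rows = []
for label, value in zip(X, Y):
    # A nested {w} field lets the width come from a variable.
    rows.append('{:>{w}s}  {:>3d}'.format(label, value, w=width))

print('\n'.join(rows))
```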
Old-style string formatting can be used
>>> print('%d is an integer' % 3)
3 is an integer
>>> print('%d and %d are integers' % (3, 4))
3 and 4 are integers
>>> print('pi is %.3f, I think' % 3.14159)
pi is 3.142, I think
>>> print('%s had a little lamb.' % 'Mary')
Mary had a little lamb.
>>> print('It tasted good.')
It tasted good.
Justification in old-style string formatting
>>> print('---%6d---' % 12)    # right justify
---    12---
>>> # DIFF: default right-just for strings
>>> print('---%6s---' % 'Mary')    # DIFF
---  Mary---
>>> # DIFFERENT: syntax for left justify
>>> print('---%-6d---' % 12)
---12    ---
>>> print('---%06.2f---' % 1.234)    # pad with zero
---001.23---
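To make the correspondence between the two styles concrete, the sketch below pairs each old-style expression with a new-style equivalent and checks that they produce the same string. The variable names are illustrative.

```python
n, name, x = 12, 'Mary', 1.234

pairs = [
    ('---%6d---' % n,    '---{:6d}---'.format(n)),       # right-justified number
    ('---%-6d---' % n,   '---{:<6d}---'.format(n)),      # left-justified number
    ('---%6s---' % name, '---{:>6s}---'.format(name)),   # new style needs > for strings
    ('---%06.2f---' % x, '---{:06.2f}---'.format(x)),    # zero padding, 2 decimals
]

for old, new in pairs:
    print(old, new, old == new)
```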
Outline
• Strings
• Print function and string formatting
• Long HW #4
Due Wednesday 10/3
• Link for data will be e-mailed to you
Example of concordancing software
http://www.filebuzz.com/software_screenshot/full/concordance-51987.gif
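A concordance lists each occurrence of a keyword together with its surrounding context. The following is a minimal keyword-in-context sketch using only `find`, slicing, and `format`. The function name `kwic`, the sample text, and the context width are illustrative assumptions, not the homework specification.

```python
def kwic(text, keyword, width=15):
    """Return one 'left [keyword] right' line per occurrence of keyword."""
    lines = []
    if not keyword:                      # nothing to search for
        return lines
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]   # context before the match
        right = text[end:end + width]              # context after the match
        # Right-justify the left context so the keywords line up.
        lines.append('{:>{w}s} [{}] {:<{w}s}'.format(left, keyword, right, w=width))
        start = text.find(keyword, end)
    return lines

sample = 'spam and eggs, eggs and spam, spam spam spam'
result = kwic(sample, 'eggs', width=10)
for line in result:
    print(line)
```

Matching here is by raw substring, so a keyword such as 'ham' would also match inside 'shame'; a fuller version would split the text into tokens first.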