
Statistical Natural Language Processing

Lecture 2 26/04/2011. Statistical Natural Language Processing. Outline. Overview of Python Lists and sets Functions and loops Strings and file I/O Dictionaries and tuples Modules and classes. Recommended reading. Slides from LING 508, Fall 2010

hayden

Presentation Transcript


  1. Lecture 2 26/04/2011 Statistical Natural Language Processing

  2. Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes

  3. Recommended reading • Slides from LING 508, Fall 2010 • http://www.u.arizona.edu/~echan3/508.html • Python tutorial • http://docs.python.org/tutorial/

  4. For Java programmers • Python & Java: A Side-by-Side Comparison • Quick comparison of the two languages • http://pythonconquerstheuniverse.wordpress.com/category/java-and-python/ • Python for Java Programmers • Incomplete tutorial on Python, with Java examples side-by-side • http://python.computersci.org/Main/TableOfContents

  5. Install these • Python 2.6 • http://www.python.org/ • NumPy 1.5.1 • http://numpy.scipy.org/ • Matplotlib 1.0.1 • http://matplotlib.sourceforge.net/ • Contains the pyplot module

  6. Mac OS X 10.6 Snow Leopard • Python and NumPy are already built in, but Matplotlib is not • Matplotlib is incompatible with the built-in versions of Python and NumPy • So you’ll need to download and install Python, NumPy, and Matplotlib

  7. Alternative: install PyLab • PyLab includes NumPy and Matplotlib • http://www.scipy.org/PyLab • So, instead of: >>> import matplotlib • You can do: >>> from pylab import matplotlib

  8. Set Python environment variable • Create a directory mypythoncode for your Python code • Example: C:\Users\Arizona\Desktop\539\mypythoncode\ • Set environment variable so Python knows where to find your code • Windows Vista: • right-click on My Computer • choose "Advanced system properties" • add a new User variable called PYTHONPATH • set the value of the variable to the full path of the mypythoncode directory

  9. Set Python environment variable • OS X, Unix/Linux, etc.: • csh • edit .cshrc • setenv PYTHONPATH /home/me/mypythoncode • bash • edit .bashrc or .bash_profile • export PYTHONPATH=/home/me/mypythoncode
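One way to confirm that the variable was picked up is to inspect it from inside Python. The sketch below is only an illustration: the helper name pythonpath_entries is made up for this example, and it assumes the variable holds one or more directories separated by the platform's path separator.

```python
import os
import sys

def pythonpath_entries():
    """Return the directories listed in PYTHONPATH, or an empty list if it is unset.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    raw = os.environ.get('PYTHONPATH', '')
    # os.pathsep is ';' on Windows and ':' on OS X, Unix, and Linux
    return [p for p in raw.split(os.pathsep) if p]

entries = pythonpath_entries()
print(entries)    # directories taken from the environment variable
print(sys.path)   # the full list of directories Python searches on import
```

If the first list is empty after you have set the variable, the shell or IDLE session was probably started before the change and needs to be restarted.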

  10. Beware of Windows… • When you repeatedly execute code and cancel execution (with control-C), sometimes the processes continue anyway, and after a while IDLE won’t let you run your code • Solution: • press ctrl-alt-delete • start Task Manager • select lowest pythonw.exe processes • click on End Process • sometimes you might have to restart IDLE

  11. Python in this course • The Marsland book uses Python and numerical Python. • Next lecture: NumPy • You’ll need to learn some Python in order to: • Read the code • Use it and modify it for some assignments • Not all assignments will involve Python, and portions of assignments may be completed in other languages.

  12. Why Python • NLP community uses it • Language features • Datatypes built into language: strings, lists, hash tables • Automatic garbage collection • Dynamic typing • Easy to read: • Forced indentation • Code is concise, not verbose like Java • Gentle learning curve

  13. Hello World in Python and Java • Python 2.X: print 'Hello World!' • Java: class HelloWorld { public static void main(String[] args) { System.out.println("Hello World!"); } }

  14. Comments • Comments are ignored by the Python interpreter but are useful for describing the purpose of a section of code a = 1 # everything after hash mark is a comment b = 2 c = 3 # statement below does not execute because # it is within a comment # d = 4

  15. Overview of basic data types • Integers: • 3, 8, -2, 100 • Floating-point numbers • 3.14159, 0.0001, -.101010101, 2.34e+18 • Booleans • True, False • Strings • 'hello', "GOODBYE" • Python does not have characters; a single character is a string of length 1 • None • Value is None

  16. Overview of compound data types • Lists • [1,2,3,4,5], ['how', 'are', 'you'] • Elements are indexed: element at index 0 of [1,2,3,4,5] is 1 • Tuples • (1, 2, 3, 'a', 'b') • Sets: these are the same: • set([1,2,3,4,5]) • set([1,1,2,2,3,3,4,4,5,5]) • Elements are not indexed • Dictionaries (hash table) • {'a':1, 'b':2, 'c':3} • Map 'a' to 1, map 'b' to 2, map 'c' to 3 • Example application: represent the frequencies of letters in a text
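The letter-frequency application mentioned above could be sketched as follows. This is a minimal example under stated assumptions: the function name letter_frequencies and the sample text are invented here, and the choice to ignore case and non-alphabetic characters is one possible design, not something the slides specify.

```python
def letter_frequencies(text):
    """Map each letter in text to the number of times it occurs (case-insensitive)."""
    freq = {}
    for ch in text.lower():
        if ch.isalpha():
            # get() returns 0 the first time a letter is seen
            freq[ch] = freq.get(ch, 0) + 1
    return freq

freq = letter_frequencies('Banana bread')
print(freq['a'])   # 4
print(freq['b'])   # 2
```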

  17. Python is dynamically typed • Don’t explicitly specify types of variables • Python interpreter keeps track of types b = 3 # b is an integer b = False # b is now a boolean def myfunc(x, y): # type not specified return x + y
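A short sketch of what dynamic typing means in practice, reusing myfunc from the slide. The calls with strings and lists are additions for illustration; they work because the + operator is defined for those types too.

```python
def myfunc(x, y):   # argument types are not declared
    return x + y

print(myfunc(3, 4))          # 7: integers are added
print(myfunc('ab', 'cd'))    # abcd: strings are concatenated
print(myfunc([1], [2, 3]))   # [1, 2, 3]: lists are concatenated

b = 3
print(type(b) is int)    # True
b = False                # the same name can be rebound to another type
print(type(b) is bool)   # True
```

The type check happens at run time: calling myfunc(3, 'ab') is accepted by the interpreter until the addition is actually attempted, at which point it raises a TypeError.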

  18. Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes

  19. Creating lists a = [1, 2, 3] # list of integers b = [True, False] # list of booleans c = ['a', 'b', 'cde'] # list of strings d = [7, 'cat', False] # can mix types e = [] # empty list f = [[1,2,3], [4,5,6]] # list of 2 lists g = [[], [[]]] # list of 2 lists

  20. List indices • Positive indices: 0 1 2 3 4 • Elements: 'a' 'b' 'c' 'd' 'e' • Negative indices: -5 -4 -3 -2 -1

  21. Indexing into lists >>> mylist = ['a', 'b', 'c', 'd', 'e'] >>> mylist[0] 'a' >>> mylist[1] 'b' >>> mylist[2] 'c' >>> mylist[3] 'd' >>> mylist[4] 'e'

  22. Negative indexing >>> mylist[-1] 'e' >>> mylist[-2] 'd' >>> mylist[-3] 'c' >>> len(mylist) # built-in function length 5 >>> mylist[len(mylist)-1] 'e'

  23. Creating new lists through slices • Syntax: mylist[start_idx:stop_idx(:step_size)] • start_idx • begin accessing at this index (inclusive) • default: 0 • stop_idx • stop accessing at this index (exclusive) • default: len(list) • step_size: (optional) • number of items to step through • default: 1
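The optional step_size is easiest to see on a small list. The variable names below are chosen only for this sketch. Note that the defaults listed above apply to a positive step; with a negative step the slice runs from the end of the list toward the beginning.

```python
L = ['a', 'b', 'c', 'd', 'e']

middle = L[1:4]          # ['b', 'c', 'd']: start inclusive, stop exclusive
every_other = L[::2]     # ['a', 'c', 'e']: default start and stop, step 2
odd_positions = L[1::2]  # ['b', 'd']
backwards = L[::-1]      # ['e', 'd', 'c', 'b', 'a']: negative step reverses

# A slice is a new list, so changing it leaves the original untouched
copy = L[:]
copy[0] = 'z'
print(L[0])      # a
print(copy[0])   # z
```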

  24. Creating new lists through slices >>> L = ['a', 'b', 'c', 'd', 'e'] >>> L[:2] # up to but not including index 2 ['a', 'b'] >>> L[2:] # beginning at index 2 ['c', 'd', 'e'] >>> L[2:5] ['c', 'd', 'e'] >>> L[2:4] ['c', 'd']

  25. Built-in functions: len, sorted >>> L = [4,3,1,5,2] >>> len(L) 5 >>> sorted(L) [1, 2, 3, 4, 5] >>> L # does not modify original [4, 3, 1, 5, 2] >>> L = sorted(L) # create a new >>> L # sorted list [1, 2, 3, 4, 5]

  26. Built-in range function • Returns a list containing an arithmetic progression of integers • Syntax: range([start,] stop[, step]) • Examples: >>> range(5) [0,1,2,3,4] >>> range(3,6) [3,4,5] >>> range(3,8,2) [3,5,7]
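The list results above are what Python 2 prints. As a note for anyone running a later interpreter: in Python 3, range returns a lazy range object rather than a list, so wrapping the call in list() gives the same result on both versions. The variable names below are only for this sketch.

```python
first_five = list(range(5))        # [0, 1, 2, 3, 4]
three_to_five = list(range(3, 6))  # [3, 4, 5]
odd_steps = list(range(3, 8, 2))   # [3, 5, 7]
countdown = list(range(5, 0, -1))  # [5, 4, 3, 2, 1]: a negative step counts down

print(first_five)
print(countdown)
```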

  27. List methods >>> L = [1, 2, 2] >>> L.append(4) # append a single object >>> L [1, 2, 2, 4] >>> L.extend([3,4]) # extend with a list >>> L [1, 2, 2, 4, 3, 4] >>> [1,2] + [3,4] # same as extend [1, 2, 3, 4]

  28. List methods >>> L [2, 2, 5, 3, 4] >>> L.reverse() # reverse list, modify it >>> L [4, 3, 5, 2, 2] >>> L.sort() # sort the list, modify it >>> L [2, 2, 3, 4, 5] >>> L = [3,2,1] # return a sorted list, >>> sorted(L) # but don’t modify list [1, 2, 3] >>> L [3, 2, 1]

  29. Sets >>> S = set() # call set constructor >>> S set([]) >>> S = set([1,2,2,3,3]) >>> S set([1, 2, 3]) • Won’t necessarily display in sorted order: >>> set([6,65,4,21,3,4,7,1]) set([65, 3, 4, 6, 7, 1, 21])

  30. Searching in a list vs. a set • Linear time to search for an item in a list >>> L = range(1000000) >>> 999999 in L # takes 0.24 seconds True • Constant time to search for an item in a set >>> S = set(range(1000000)) >>> 999999 in S # takes 0.0 seconds True
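The difference can be measured with the standard timeit module. This is a rough sketch: the collection size and repetition count are arbitrary choices for the example, and the actual timings depend on the machine, so no specific numbers are claimed here.

```python
import timeit

N = 100000
L = list(range(N))
S = set(L)
target = N - 1   # worst case for the list: the last element

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: target in L, number=100)
set_time = timeit.timeit(lambda: target in S, number=100)

print('list: %.6f seconds' % list_time)
print('set:  %.6f seconds' % set_time)
```

The list search scans elements one by one, so its time grows with N, while the set uses hashing and stays roughly constant.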

  31. Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes

  32. Functions: can have default values for arguments def f(a, b): return a + b def g(a, b=7): return a + b • Consequence: only one function definition is needed, unlike Java, where you write a separate definition for each combination of arguments used f(3, 4) # returns 7 g(3, 4) # returns 7 g(3) # returns 10

  33. Functions: default return value is None >>> def f(): print 'hello' >>> x = f() hello >>> print x None

  34. For loops L = [1, 2, 3, 4, 5] L2 = [] L3 = [] for i in range(len(L)): L2.append(L[i] * 2) for x in L: L3.append(x * 2)

  35. Bubble sort def bubblesort(L): for i in range(len(L)-1): swap_made = False for j in range(len(L)-1): if L[j+1] < L[j]: L[j], L[j+1] = L[j+1], L[j] swap_made = True if swap_made==False: # list is sorted break
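Because the transcript flattens the indentation, here is the bubble sort laid out as runnable code. The placement of the swap_made check after the inner loop is an assumption about the original slide, though it is the only placement under which the early exit makes sense. The sample list is invented for the example.

```python
def bubblesort(L):
    """Sort the list L in place; returns None."""
    for i in range(len(L) - 1):
        swap_made = False
        for j in range(len(L) - 1):
            if L[j + 1] < L[j]:
                # swap neighbours using tuple assignment
                L[j], L[j + 1] = L[j + 1], L[j]
                swap_made = True
        if not swap_made:   # a full pass without swaps: list is sorted
            break

data = [4, 3, 1, 5, 2]
bubblesort(data)
print(data)   # [1, 2, 3, 4, 5]
```

Like L.sort(), this version modifies the list it is given rather than returning a new one.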

  36. Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes

  37. Declaring strings • Strings can be enclosed in single quotes or double quotes s1 = 'spam' s2 = "spam"

  38. Indexing and slicing strings, just like lists >>> s = 'python' >>> s[3] 'h' >>> s[:3] 'pyt' >>> s[3:] 'hon' >>> s[2:4] 'th' >>> s[2:-2] 'th'

  39. Strings are immutable >>> s = 'python' >>> s[0] = 'x' Traceback (most recent call last): File "<pyshell#25>", line 1, in <module> s[0] = 'x' TypeError: 'str' object does not support item assignment >>> L = [1,2,3,4] # but lists are mutable >>> L[0] = 5 >>> L [5, 2, 3, 4]

  40. Concatenation >>> s1 = 'python' >>> s2 = 'big ' + s1 >>> s2 'big python' >>> s1 = 'big ' + s1 >>> s3 = s2[:4] + 'ball ' + s2[4:] >>> s3 'big ball python'

  41. Built-in functions >>> len('spam') 4 • Type conversion through type constructors: useful for reading data from files (convert string to numeric types) >>> int('356') 356 >>> float('3.56') 3.5600000000000001 >>> str(356) '356'
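As an illustration of why these conversions matter when reading data, the sketch below turns one line of whitespace-separated text into numbers. The function name parse_numbers and the sample line are made up for this example.

```python
def parse_numbers(line):
    """Convert a line such as '3 4.5 -2e3\\n' into a list of floats."""
    return [float(tok) for tok in line.split()]

values = parse_numbers('3 4.5 -2e3\n')
print(values)        # [3.0, 4.5, -2000.0]
print(sum(values))   # -1992.5
```

A token that is not a valid number, such as 'abc', makes float() raise a ValueError, so real input usually needs a check or a try/except around the conversion.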

  42. Length of a string >>> s = 'hello' >>> len(s) 5

  43. String methods >>> s = 'howXareXyou' >>> s.split('X') ['how', 'are', 'you'] >>> s = 'how are\tyou\n' >>> s.split() # splits on whitespace ['how', 'are', 'you'] >>> s = ' how are you\n' >>> s.strip() # also lstrip and rstrip 'how are you'

  44. String methods >>> ''.join(['how', 'are', 'you']) 'howareyou' >>> 'X'.join(['how', 'are', 'you']) 'howXareXyou'
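Since split and join are inverses of each other, combining them gives a simple way to clean up spacing. The helper name normalize_whitespace is an invention for this sketch.

```python
def normalize_whitespace(s):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return ' '.join(s.split())

print(normalize_whitespace('  how   are\tyou\n'))   # how are you

# Splitting on a separator and joining with the same separator
# gives back the original string
s = 'howXareXyou'
print('X'.join(s.split('X')) == s)   # True
```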

  45. String methods >>> s = 'goodmorning' >>> s.startswith('goo') True >>> s.endswith('ning') True >>> s[:3]=='goo' # slices and equality True >>> s[-4:]=='ning' True

  46. String methods >>> s = 'how are you\n' >>> s.upper() # shows return value 'HOW ARE YOU\n' >>> s # variable was not modified 'how are you\n' >>> s = s.upper() # modify it >>> s 'HOW ARE YOU\n' >>> 'hello'.isupper() False

  47. Input from a file: print each word in a corpus # open file for reading f = open('C:/myfile.txt', 'r') # read each line one at a time # "line" is a newline-terminated string for line in f: # convert to a list of strings tokens = line.split() # perform operation on each string for tok in tokens: print tok

  48. File output outfilename = '/home/echan3/myfile.txt' of = open(outfilename, 'w') # write # write a string # should include newline character of.write('hello Python\n') of.close() # need to close output file # in reading a file, you don’t have to # call close()
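Writing and reading can be combined in one small round trip. The sketch below swaps in the with statement, which is not what the slides use: it closes the file automatically at the end of the block, for both reading and writing. The temporary directory is used only so the example does not depend on a particular path.

```python
import os
import tempfile

# Hypothetical location for the example file
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')

# Write two newline-terminated lines; the file is closed on leaving the block
with open(path, 'w') as out:
    out.write('hello Python\n')
    out.write('how are you\n')

# Read the file back line by line and collect the tokens
tokens = []
with open(path, 'r') as f:
    for line in f:
        tokens.extend(line.split())

print(tokens)   # ['hello', 'Python', 'how', 'are', 'you']
```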

  49. Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes

  50. Tuples • Quick data structure to group variables together • Group together variables of different types • Multiple return values for functions >>> t = ('a', 1, True) >>> t = (1, (2, ((3, 4), 5))) >>> def x(): return (3, 4) >>> (a, b) = x() >>> e = () # empty tuple >>> e = (1,) # one-element tuple, note the comma
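A small sketch of the "multiple return values" use case. The function name min_and_max and the sample data are chosen for this example only.

```python
def min_and_max(values):
    """Return the smallest and largest element as a 2-tuple."""
    return (min(values), max(values))

(lo, hi) = min_and_max([4, 3, 1, 5, 2])
print(lo)   # 1
print(hi)   # 5

# Tuple assignment also swaps two variables without a temporary
a, b = 'x', 'y'
a, b = b, a
print(a)   # y
```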
