
LING 408/508: Computational Techniques for Linguists



Lecture 18, 10/3/2012


Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


Structure of dictionary for phone book problem

  • 2 people named “Sarah Connor”

  • “Kyle Reese” and “Arnold Schwarz” have the same phone number

Key: (string, string) tuple    Value: list of integers

('Mitt','Romney')              [9997777]
('Sarah','Connor')             [1234567, 1010101]
('Kyle','Reese')               [7654321]
('Arnold','Schwarz')           [7654321]
('Arnold','Ventura')           [2233444]
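
The table above can be written directly as a Python dictionary literal. This is a minimal sketch: the name pb follows the comments on the later slides, and the lookup of ('John', 'Connor') is a hypothetical unlisted name added only to illustrate .get().

```python
# Phone book from the table above.
# keys: (first, last) tuples; values: lists of integer phone numbers
pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],
    ('Arnold', 'Ventura'): [2233444],
}

# Look up all numbers listed under one name.
sarah_numbers = pb[('Sarah', 'Connor')]
print(sarah_numbers)    # [1234567, 1010101]

# Hypothetical unlisted name: .get() returns the default instead of raising KeyError.
missing = pb.get(('John', 'Connor'), [])
print(missing)          # []
```

Tuples work as keys here because they are immutable and hashable; the values stay as lists so that more numbers can be appended later.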


Other queries (DON’T LOOK AHEAD!!! SPOILERS FOLLOW)

# pb is a dictionary

# keys: (string, string)

# values: list of integers

1. How many names are listed in the phone book?

2. What are the distinct first names listed in the phone book?

3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

4. How many phone numbers are there in the phone book?

5. How many phone numbers are there such that at least 2 people share that phone number?


Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

1. How many names are listed in the phone book?

len(pb.keys())


Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

2. What are the distinct first names listed in the phone book?

set([first for (first, last) in pb.keys()])


Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

Example:

pb = {('Bob', 'Barker'):[1234567],
      ('Bobby', 'Barker'):[1234567],
      ('Sue', 'Parker'):[3333333, 4444444]}


3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

ppl = []

for (name, ph_nums) in pb.items():
    for num in ph_nums:
        ppl.append((name, num))

num_distinct_ppl = len(ppl)

# >>> ppl

# [(('Sue', 'Parker'), 3333333),

# (('Sue', 'Parker'), 4444444),

# (('Bob', 'Barker'), 1234567),

# (('Bobby', 'Barker'), 1234567)]
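
As a compact variant of the loop above, the same (name, number) pairs can be built with a nested list comprehension. This sketch reuses the example dictionary from the slide and keeps its assumption that each (name, number) pair stands for one person.

```python
pb = {('Bob', 'Barker'): [1234567],
      ('Bobby', 'Barker'): [1234567],
      ('Sue', 'Parker'): [3333333, 4444444]}

# One (name, number) pair per listed person.
ppl = [(name, num) for (name, ph_nums) in pb.items() for num in ph_nums]

num_distinct_ppl = len(ppl)
print(num_distinct_ppl)    # 4
```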


Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

4. How many phone numbers are there in the phone book?

all_nums = set()

for nums in pb.values():
    all_nums.update(nums)

num_phone_nums = len(all_nums)
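
The loop above can also be collapsed into a single set union. This is only an alternative sketch, run on the small example dictionary from the previous question; set().union() accepts any number of iterables, so the value lists can be passed in directly.

```python
pb = {('Bob', 'Barker'): [1234567],
      ('Bobby', 'Barker'): [1234567],
      ('Sue', 'Parker'): [3333333, 4444444]}

# Union of all value lists; the shared number 1234567 is counted once.
all_nums = set().union(*pb.values())

num_phone_nums = len(all_nums)
print(num_phone_nums)    # 3
```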


Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

5. How many phone numbers are there such that at least 2 people share that phone number?

  • Need to get all numbers with >= 2 names

  • Construct a dictionary mapping a number to a list of names


# key: number
# value: list of names
num_to_names = {}

for (name, numlist) in pb.items():
    for num in numlist:
        names = num_to_names.get(num, [])
        names.append(name)
        num_to_names[num] = names # necessary

num_shared = 0

for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

print('quantity of phone numbers that are shared:')
print(num_shared)
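
One way to avoid the get / append / reassign pattern is collections.defaultdict, which creates the empty list the first time a key is seen. The sketch below is an alternative to the slide's code, not a replacement for it, and it assumes the five-entry phone book from the start of the lecture.

```python
from collections import defaultdict

pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],
    ('Arnold', 'Ventura'): [2233444],
}

# key: number; value: list of names (a missing key starts as an empty list)
num_to_names = defaultdict(list)
for (name, numlist) in pb.items():
    for num in numlist:
        num_to_names[num].append(name)

# Count the numbers that are listed under at least 2 names.
num_shared = sum(1 for names in num_to_names.values() if len(names) >= 2)
print(num_shared)    # 1
```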


Using a list comprehension

num_shared = 0

for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

# x refers to (num, names) tuples

shared = len([x for x in num_to_names.items() if len(x[1])>=2])


Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


Frequencies of words in a corpus: types and tokens

  • Brown corpus of English:

    • 1160743 word tokens

    • 49680 word types

  • Type: a distinct word

    • “with”

  • Token: an individual occurrence of a word

    • “with” occurs 7270 times
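
The type/token distinction can be checked on a toy example. The sentence below is made up for illustration (it is not from the Brown corpus), and splitting on whitespace is an assumed, simplified notion of a word.

```python
# Hypothetical one-line "corpus"
text = "the cat sat with the dog and the bird sat with me"

tokens = text.split()       # every occurrence counts
types = set(tokens)         # each distinct word counts once

num_tokens = len(tokens)
num_types = len(types)
with_count = tokens.count('with')

print(num_tokens, num_types, with_count)    # 12 8 2
```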


Word frequency

  • Write a program that reads a text file and writes an output file listing each word type and its token frequency, in order of decreasing frequency.

  • Output format:

    1000 hello

    800 fish

    40 lion

  • Plot the frequency distribution.


# dictionary to store word frequencies
# maps a string to an integer
w_to_freq = {}

# read file one line at a time,
# break each line into individual tokens,
# and count the tokens
for line in open('C:/brown-corpus.txt', 'r'):
    tokens = line.split()
    for tok in tokens:
        w_to_freq[tok] = w_to_freq.get(tok,0) + 1


# w_to_freq.items() yields (string, int) tuples
# convert to list of (int, string) tuples
# so that we can sort by frequency
freqs_words = []

for (w, freq) in w_to_freq.items():
    freqs_words.append((freq, w))

# sort by decreasing frequency
freqs_words.sort(reverse=True)

# write output file
of = open('C:/output.txt', 'w')

for (freq, w) in freqs_words:
    of.write('{:8d}\t{:s}\n'.format(freq, w))

of.close()
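
The same pipeline can be sketched with collections.Counter, which handles the counting step. To keep the example runnable without any files, the corpus and the output are in-memory io.StringIO objects holding made-up text; those stand-ins and the alphabetical tie-break are assumptions of this sketch, not part of the slides.

```python
import io
from collections import Counter

# Stand-in for the corpus file (hypothetical text).
corpus = io.StringIO("fish hello fish\nhello fish lion\n")

# Count tokens line by line.
w_to_freq = Counter()
for line in corpus:
    w_to_freq.update(line.split())

# Sort by decreasing frequency; break ties alphabetically.
freqs_words = sorted(w_to_freq.items(), key=lambda item: (-item[1], item[0]))

# Stand-in for the output file, written in the slide's output format.
out = io.StringIO()
for (w, freq) in freqs_words:
    out.write('{:8d}\t{:s}\n'.format(freq, w))

report = out.getvalue()
print(report)
```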


Using a list comprehension

# freqs_words = []

# for (w, freq) in w_to_freq.items():

#     freqs_words.append((freq, w))

# same, using a list comprehension

freqs_words = [(f,w) for (w,f) in w_to_freq.items()]


Frequency and rank

  • Sort words by decreasing frequency

  • Rank = order in sorted list

    • Rank 1: most-frequent word

    • Rank 2: second most-frequent word

    • etc.

  • Plot word frequencies by rank
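
Once the list is sorted by decreasing frequency, the rank is just the position in the list counted from 1. A minimal sketch, assuming a small hand-made freqs_words list rather than real corpus counts:

```python
# Hypothetical (frequency, word) pairs, already sorted by decreasing frequency.
freqs_words = [(3, 'fish'), (2, 'hello'), (1, 'lion')]

# enumerate(..., start=1) attaches rank 1 to the most frequent word.
ranked = [(rank, freq, w) for (rank, (freq, w)) in enumerate(freqs_words, start=1)]

for (rank, freq, w) in ranked:
    print(rank, freq, w)
```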


Plotting

  • Download and install matplotlib

  • http://matplotlib.sourceforge.net/


import matplotlib.pyplot as plt

# some code omitted

# read word frequencies from a corpus

#

# counts is a list of integers in

# decreasing order

counts = [freq for (freq,w) in freqs_words]

plt.plot(counts)

plt.xlabel('word rank')

plt.ylabel('word frequency')

plt.title('Word frequency vs. rank')

plt.show()


Plot of word frequencies, linear scale (Out-of-date figure: forgot to include labels and titles)

[Figure: y-axis: frequency (in 10,000s); x-axis: rank (in 10,000s)]


Plot of word frequencies, zoom in


import matplotlib.pyplot as plt

# some code omitted

# read word frequencies from a corpus

# counts is a list of integers in

# decreasing order

plt.plot(counts)

plt.xscale('log') # logarithmic scales

plt.yscale('log') # for x- and y- axes

plt.xlabel('word rank')

plt.ylabel('word frequency')

plt.title('Word frequency vs. rank')

plt.show()


Plot of word frequencies, log-log scale

log10(10) = 1

log10(100) = 2

log10(1000) = 3


Plot of word frequencies, log-log scale

log10(10) = 1

log10(100) = 2

log10(1000) = 3

~10 types with freq. > 10,000

~100 types: 1,000 < freq. < 10,000

~1,000 types: 100 < freq. < 1,000

~10,000 types: 10 < freq. < 100

10,000s of types: 1 < freq. < 10
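
Band counts like the ones read off the plot can be computed by grouping frequencies by order of magnitude. The sketch below uses a short made-up counts list, not the Brown corpus figures, and its bands include the lower boundary (10**b <= freq < 10**(b+1)), which differs slightly from the strict inequalities on the slide.

```python
from collections import Counter

# Hypothetical word frequencies in decreasing order.
counts = [12000, 4000, 950, 300, 120, 45, 12, 7, 3, 2, 1, 1]

# The number of digits minus 1 gives the order of magnitude of a positive integer.
bands = Counter(len(str(freq)) - 1 for freq in counts)

for b in sorted(bands, reverse=True):
    print('10^{} <= freq < 10^{}: {} types'.format(b, b + 1, bands[b]))
```

Counting digits avoids floating-point rounding that a direct logarithm could introduce at exact powers of ten.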


Word frequency distributions in language

  • There are a few common words

  • A large, but not huge, number of medium-frequency words

  • Very, very many low-frequency words


Word frequencies exemplify Zipf’s law

  • Power law distribution

  • The frequency F of a word w is inversely proportional to the rank R of w:

    F ∝ 1 / R

    i.e., F × R = k, for some constant k

  • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type

    freq. at rank 50: ∝ 1 / 50

    freq. at rank 150: ∝ 1 / 150

    ( 1 / 50 ) / ( 1 / 150 ) = 3
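
The arithmetic in this example can be checked exactly with fractions. The helper zipf_freq is a hypothetical name for the idealized relation F = k / R; real corpus frequencies only follow it approximately.

```python
from fractions import Fraction

def zipf_freq(rank, k=1):
    """Idealized Zipf frequency F = k / R (k is an arbitrary constant)."""
    return Fraction(k, rank)

# The constant k cancels in the ratio, so any k gives the same answer.
ratio = zipf_freq(50) / zipf_freq(150)
print(ratio)    # 3

# F x R = k holds at every rank under the idealized law.
product = zipf_freq(50, k=1000) * 50
print(product)  # 1000
```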


Near-linear relationship between freq. and rank in log-log scale

log10(10) = 1

log10(100) = 2

log10(1000) = 3
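
Taking logarithms of F × R = k gives log F = log k − log R, a straight line with slope −1 in log-log space. The sketch below estimates that slope by ordinary least squares; loglog_slope is a hypothetical helper, and the input is synthetic, perfectly Zipfian data rather than corpus counts, so the slope comes out at almost exactly −1.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(freq) for freq in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for (x, y) in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic counts following F = k / R with k = 10000.
ideal_counts = [10000 / rank for rank in range(1, 201)]

slope = loglog_slope(ideal_counts)
print(round(slope, 6))
```

On real word counts the fitted slope would be expected to be near −1 rather than exactly −1, which is why the slide calls the relationship near-linear.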



Most-frequent words: function words (perform grammatical functions)



Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


References and mutability in the context of dictionary values

  • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?

    >>> d = {'Anna': 104, 'Jack':[303, 304]}

    >>> anna_room = d['Anna']

    >>> anna_room = 105

    >>> d['Anna'] = anna_room # 1

    >>> jack_rooms = d['Jack']

    >>> jack_rooms.append(305)

    >>> d['Jack'] = jack_rooms # 2

    >>> d

    {'Anna': 105, 'Jack':[303, 304, 305]}


References and mutability in the context of dictionary values

  • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?

    >>> d = {'Anna': 104, 'Jack':[303, 304]}

    >>> anna_room = d['Anna']

    >>> anna_room = 105

    >>> # d['Anna'] = anna_room # 1

    >>> jack_rooms = d['Jack']

    >>> jack_rooms.append(305)

    >>> # d['Jack'] = jack_rooms # 2

    >>> d

    {'Anna': 104, 'Jack':[303, 304, 305]}
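
The difference between the two cases can be made visible with the is operator, which tests whether two names refer to the same object. This is a small sketch of the session above with both numbered lines left out; the variable same_list is added only for illustration.

```python
d = {'Anna': 104, 'Jack': [303, 304]}

anna_room = d['Anna']
anna_room = 105                        # rebinds the name; d is untouched

jack_rooms = d['Jack']
same_list = jack_rooms is d['Jack']    # both names refer to one list object
jack_rooms.append(305)                 # mutates that shared list in place

print(same_list)    # True
print(d)            # {'Anna': 104, 'Jack': [303, 304, 305]}
```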


(simplifying representation of keys)

d = {'Anna': 104, 'Jack':[303, 304]}

d
    Key: 'Anna'    Ref: <address1>  ->  Type: Integer    Data: 104
    Key: 'Jack'    Ref: <address2>  ->  Type: List       Data: [303, 304]


d = {'Anna': 104, 'Jack':[303, 304]}

anna_room = d['Anna']

anna_room = 105

d['Anna'] = anna_room # 1

d
    Key: 'Anna'    Ref: <address1>, then <address3> after line (1)
    Key: 'Jack'    Ref: <address2>

Name: anna_room    Ref: <address1>, then <address3> after anna_room = 105

<address1>    Type: Integer    Data: 104
<address2>    Type: List       Data: [303, 304]
<address3>    Type: Integer    Data: 105


d = {'Anna': 104, 'Jack':[303, 304]}

jack_rooms = d['Jack']

jack_rooms.append(305)

# d['Jack'] = jack_rooms # NOT NEEDED

d
    Key: 'Anna'    Ref: <address1>
    Key: 'Jack'    Ref: <address2>

Name: jack_rooms    Ref: <address2>

<address1>    Type: Integer    Data: 104
<address2>    Type: List       Data: [303, 304], then [303, 304, 305] after append


  • But suppose you begin with an empty dictionary.

  • Assignment is necessary because upon first encountering a key, an empty list is assigned to the variable rooms. This empty list is not yet a value for the key name.

  • Subsequently, when encountering a key already in the dictionary, the execution of the last line does not change the structure of the dictionary.

    a = [('Jack', 303), ('Jack', 304)]

    d = {}

    for (name, room) in a:
        rooms = d.get(name, [])
        rooms.append(room)    # append the room number, not the name
        d[name] = rooms # necessary
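
A variant of the loop above uses dict.setdefault, which stores the empty list in the dictionary the first time a key is seen and returns the stored list, so no reassignment is needed. This is a sketch of an alternative idiom; the ('Anna', 104) entry is added here only to show a second key.

```python
a = [('Jack', 303), ('Jack', 304), ('Anna', 104)]

d = {}
for (name, room) in a:
    # setdefault inserts [] for a new key, then returns the list stored in d,
    # so append changes the dictionary's own list.
    d.setdefault(name, []).append(room)

print(d)    # {'Jack': [303, 304], 'Anna': [104]}
```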

