LING 408/508: Computational Techniques for Linguists


Lecture 18, 10/3/2012

Outline
  • Applications of dictionaries
    • Phone book
    • Word frequencies and Zipf’s Law
  • References, mutability, and dictionary values
Structure of dictionary for phone book problem
  • 2 people named “Sarah Connor”
  • “Kyle Reese” and “Arnold Schwarz” have the same phone number

Key: (string, string) tuple    Value: list of integers

('Mitt','Romney')              [9997777]
('Sarah','Connor')             [1234567, 1010101]
('Kyle','Reese')               [7654321]
('Arnold','Schwarz')           [7654321]
('Arnold','Ventura')           [2233444]
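The table above can be written directly as a Python dictionary literal. The following is a minimal sketch, assuming the variable name `pb` used on the later slides; the lookups at the end are illustrative additions, not part of the original slide.

```python
# Phone book from the slide: keys are (first, last) tuples,
# values are lists of integer phone numbers.
pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],   # two people share this name
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],          # same number as Kyle Reese
    ('Arnold', 'Ventura'): [2233444],
}

# Tuples are immutable, so they can serve as dictionary keys.
sarah_numbers = pb[('Sarah', 'Connor')]

# Membership tests check the keys, not the values.
has_kyle = ('Kyle', 'Reese') in pb
has_john = ('John', 'Connor') in pb
```

Lists could not be used as keys here because they are mutable and therefore unhashable; that is why each name is stored as a tuple.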

Other queries (DON’T LOOK AHEAD!!! SPOILERS FOLLOW)

# pb is a dictionary

# keys: (string, string)

# values: list of integers

1. How many names are listed in the phone book?

2. What are the distinct first names listed in the phone book?

3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

4. How many phone numbers are there in the phone book?

5. How many phone numbers are there such that at least 2 people share that phone number?

Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

1. How many names are listed in the phone book?

len(pb.keys())

Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

2. What are the distinct first names listed in the phone book?

set([first for (first, last) in pb.keys()])

Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

Example:

pb = {('Bob', 'Barker'): [1234567],
      ('Bobby', 'Barker'): [1234567],
      ('Sue', 'Parker'): [3333333, 4444444]}


3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

# each (name, number) pair counts as one distinct person
ppl = []
for (name, ph_nums) in pb.items():
    for num in ph_nums:
        ppl.append((name, num))
num_distinct_ppl = len(ppl)

# >>> ppl    (the order of the items may differ)
# [(('Sue', 'Parker'), 3333333),
#  (('Sue', 'Parker'), 4444444),
#  (('Bob', 'Barker'), 1234567),
#  (('Bobby', 'Barker'), 1234567)]

Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

4. How many phone numbers are there in the phone book?

all_nums = set()
for nums in pb.values():
    all_nums.update(nums)
num_phone_nums = len(all_nums)

Other queries

# pb is a dictionary

# keys: (string, string)

# values: list of integers

5. How many phone numbers are there such that at least 2 people share that phone number?

  • Need to get all numbers with >= 2 names
  • Construct a dictionary mapping a number to a list of names

# key: number
# value: list of names
num_to_names = {}
for (name, numlist) in pb.items():
    for num in numlist:
        names = num_to_names.get(num, [])
        names.append(name)
        num_to_names[num] = names # necessary

# count the numbers that have at least 2 names
num_shared = 0
for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

print('quantity of phone numbers that are shared:')
print(num_shared)
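As an alternative to the get-then-assign pattern above, the standard library's `collections.defaultdict` creates the empty list automatically the first time a key is seen. This is a sketch of that variation, not the approach the slides use; it reuses the Bob/Bobby/Sue example data from the earlier slide.

```python
from collections import defaultdict

pb = {('Bob', 'Barker'): [1234567],
      ('Bobby', 'Barker'): [1234567],
      ('Sue', 'Parker'): [3333333, 4444444]}

# key: number, value: list of names
# defaultdict(list) supplies a fresh [] for any missing key,
# so no explicit assignment back into the dictionary is needed.
num_to_names = defaultdict(list)
for name, numlist in pb.items():
    for num in numlist:
        num_to_names[num].append(name)

# Count the numbers that are listed under at least 2 names.
num_shared = sum(1 for names in num_to_names.values() if len(names) >= 2)
print(num_shared)
```

One caveat: merely reading a missing key from a `defaultdict` inserts it, so later lookups that should not modify the dictionary are safer with `num_to_names.get(num)` or `num in num_to_names`.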

Using a list comprehension

num_shared = 0
for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

# x refers to (num, names) tuples

shared = len([x for x in num_to_names.items() if len(x[1])>=2])

Outline
  • Applications of dictionaries
    • Phone book
    • Word frequencies and Zipf’s Law
  • References, mutability, and dictionary values
Frequencies of words in a corpus: types and tokens
  • Brown corpus of English:
    • 1160743 word tokens
    • 49680 word types
  • Type: a distinct word
    • “with”
  • Token: an individual occurrence of a word
    • “with” occurs 7270 times
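The type/token distinction can be made concrete with a few lines of code. The sketch below uses a short made-up sentence (not the Brown corpus) and simple whitespace splitting as the tokenizer; both are assumptions chosen for illustration.

```python
# Hypothetical toy text, split on whitespace.
text = "the cat sat with the dog and the bird sat with the cat"

tokens = text.split()     # every occurrence of a word is a token
types = set(tokens)       # each distinct word is a type

num_tokens = len(tokens)
num_types = len(types)

# Number of tokens of the type "with"
with_count = tokens.count('with')

print(num_tokens, num_types, with_count)
```

Here the token count is the length of the list, while the type count is the size of the set built from it.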
Word frequency
  • Write a program that reads a text file and writes an output file listing each word type and its token frequency, in order of decreasing frequency.
  • Output format:

1000 hello

800 fish

40 lion

  • Plot the frequency distribution.

# dictionary to store word frequencies
# maps a string to an integer
w_to_freq = {}

# read file one line at a time,
# break each line into individual tokens,
# and count the tokens
for line in open('C:/brown-corpus.txt', 'r'):
    tokens = line.split()
    for tok in tokens:
        w_to_freq[tok] = w_to_freq.get(tok,0) + 1


# w_to_freq.items() yields (string, int) tuples
# convert to list of (int, string) tuples
# so that we can sort by frequency
freqs_words = []
for (w, freq) in w_to_freq.items():
    freqs_words.append((freq, w))

# sort by decreasing frequency
freqs_words.sort(reverse=True)

# write output file
of = open('C:/output.txt', 'w')
for (freq, w) in freqs_words:
    of.write('{:8d}\t{:s}\n'.format(freq, w))
of.close()

Using a list comprehension

# freqs_words = []
# for (w, freq) in w_to_freq.items():
#     freqs_words.append((freq, w))

# same, using a list comprehension

freqs_words = [(f,w) for (w,f) in w_to_freq.items()]
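The counting and sorting steps can also be done with `collections.Counter` from the standard library. This is a sketch of that alternative, not the method used in the lecture; the two input lines are made up so the example runs without the corpus file.

```python
from collections import Counter

# Hypothetical stand-in for the lines of a corpus file.
lines = ["hello fish hello", "lion hello fish"]

# Counter is a dict subclass that maps each word to its count.
w_to_freq = Counter()
for line in lines:
    w_to_freq.update(line.split())

# most_common() returns (word, count) pairs in decreasing order of count;
# swap each pair to get the (freq, word) layout used on the slides.
freqs_words = [(f, w) for (w, f) in w_to_freq.most_common()]

for freq, w in freqs_words:
    print('{:8d}\t{:s}'.format(freq, w))
```

The two approaches can order ties differently: `most_common()` keeps equally frequent words in first-seen order, whereas sorting `(freq, word)` tuples with `reverse=True` breaks ties by reverse alphabetical order of the word.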

Frequency and rank
  • Sort words by decreasing frequency
  • Rank = order in sorted list
    • Rank 1: most-frequent word
    • Rank 2: second most-frequent word
    • etc.
  • Plot word frequencies by rank
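Ranks follow directly from the position of each word in the sorted list. A minimal sketch, assuming a `freqs_words` list of `(freq, word)` tuples like the one built earlier (the three entries here are the example values from the output-format slide):

```python
# (freq, word) tuples in arbitrary order
freqs_words = [(40, 'lion'), (1000, 'hello'), (800, 'fish')]

# Sort by decreasing frequency
freqs_words.sort(reverse=True)

# enumerate(..., start=1) numbers the sorted entries 1, 2, 3, ...
# so the most frequent word gets rank 1.
ranked = [(rank, w, freq)
          for rank, (freq, w) in enumerate(freqs_words, start=1)]

for rank, w, freq in ranked:
    print(rank, w, freq)
```

When the counts are later passed to `plt.plot(counts)`, the x-axis is the list index, which starts at 0 rather than 1; this matters on a logarithmic x-axis, where 0 cannot be drawn.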
Plotting
  • Download and install matplotlib
  • http://matplotlib.sourceforge.net/

import matplotlib.pyplot as plt

# some code omitted

# read word frequencies from a corpus

#

# counts is a list of integers in

# decreasing order

counts = [freq for (freq,w) in freqs_words]

plt.plot(counts)

plt.xlabel('word rank')

plt.ylabel('word frequency')

plt.title('Word frequency vs. rank')

plt.show()

Plot of word frequencies, linear scale (Out-of-date figure: forgot to include labels and titles)

[Figure: word frequency (in 10,000s) plotted against rank (in 10,000s)]

import matplotlib.pyplot as plt

# some code omitted

# read word frequencies from a corpus

# counts is a list of integers in

# decreasing order

plt.plot(counts)

plt.xscale('log') # logarithmic scales

plt.yscale('log') # for x- and y- axes

plt.xlabel('word rank')

plt.ylabel('word frequency')

plt.title('Word frequency vs. rank')

plt.show()

Plot of word frequencies, log-log scale

[Figure: word frequency plotted against rank on logarithmic axes. Axis tick reminder: log10(10) = 1, log10(100) = 2, log10(1000) = 3]

Plot of word frequencies, log-log scale

[Figure: the same log-log plot, annotated with the approximate number of word types in each frequency band]

  • ~10 types with freq. > 10,000
  • ~100 types with 1,000 < freq. < 10,000
  • ~1,000 types with 100 < freq. < 1,000
  • ~10,000 types with 10 < freq. < 100
  • 10,000s of types with 1 < freq. < 10

Word frequency distributions in language
  • There are a few common words
  • A large, but not huge, number of medium-frequency words
  • Very, very many low-frequency words
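One way to see this shape in data is to count how many word types fall into each order-of-magnitude frequency band. The sketch below uses a tiny made-up `w_to_freq` dictionary; the helper name `band` and the sample counts are hypothetical, and real corpus counts would give far larger numbers in the low bands.

```python
from collections import Counter

# Hypothetical word counts standing in for a real corpus.
w_to_freq = {'the': 12000, 'of': 5000, 'fish': 800,
             'lion': 40, 'zebra': 3, 'gnu': 1}

def band(freq):
    """Order of magnitude of a positive integer count:
    0 for 1-9, 1 for 10-99, 2 for 100-999, and so on."""
    return len(str(freq)) - 1

# Map each band to the number of word types whose count falls in it.
types_per_band = Counter(band(f) for f in w_to_freq.values())

for b in sorted(types_per_band, reverse=True):
    print('10^{} to 10^{}: {} types'.format(b, b + 1, types_per_band[b]))
```

Counting digits avoids floating-point logarithms, which keeps the band boundaries exact for integer counts.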
Word frequencies exemplify Zipf’s law
  • Power law distribution
  • The frequency F of a word w is inversely proportional to the rank R of w:

F ∝ 1 / R

i.e., F × R = k, for some constant k

  • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type

freq. at rank 50: ∝ 1 / 50

freq. at rank 150: ∝ 1 / 150

( 1 / 50 ) / ( 1 / 150 ) = 3
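The arithmetic in the example can be checked in a few lines. This is a sketch of an idealized Zipf distribution, F = k / R; the constant `k = 60000` and the function name `zipf_freq` are arbitrary choices for illustration, and real corpora only follow the law approximately.

```python
def zipf_freq(rank, k):
    """Frequency predicted by an ideal Zipf distribution: F = k / R."""
    return k / rank

k = 60000   # hypothetical constant; its value cancels out in the ratio

f50 = zipf_freq(50, k)
f150 = zipf_freq(150, k)

# Ratio of the two predicted frequencies
ratio = f50 / f150

# F x R is the same constant at every rank
products = [zipf_freq(r, k) * r for r in (1, 50, 150)]

print(f50, f150, ratio)
```

Because k appears in both the numerator and the denominator, the ratio depends only on the two ranks: 150 / 50 = 3.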

Near-linear relationship between freq. and rank in log-log scale

[Figure: log-log plot of word frequency vs. rank, showing a roughly straight line. Axis tick reminder: log10(10) = 1, log10(100) = 2, log10(1000) = 3]
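The straight line follows from taking logarithms of F = k / R, which gives log F = log k - log R: a line with slope -1 in log-log coordinates. The sketch below checks this numerically for an ideal Zipf distribution; the constant `k` and the chosen ranks are assumptions for illustration.

```python
import math

k = 10000                     # hypothetical constant
ranks = [1, 10, 100, 1000]
freqs = [k / r for r in ranks]   # ideal Zipf frequencies

log_ranks = [math.log10(r) for r in ranks]
log_freqs = [math.log10(f) for f in freqs]

# Slope between each pair of neighboring points in log-log space
slopes = [(log_freqs[i + 1] - log_freqs[i]) /
          (log_ranks[i + 1] - log_ranks[i])
          for i in range(len(ranks) - 1)]

print(slopes)
```

On real data the slope is only approximately constant, which is why the slide describes the relationship as near-linear.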

Outline
  • Applications of dictionaries
    • Phone book
    • Word frequencies and Zipf’s Law
  • References, mutability, and dictionary values
References and mutability in the context of dictionary values
  • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?

>>> d = {'Anna': 104, 'Jack':[303, 304]}

>>> anna_room = d['Anna']

>>> anna_room = 105

>>> d['Anna'] = anna_room # 1

>>> jack_rooms = d['Jack']

>>> jack_rooms.append(305)

>>> d['Jack'] = jack_rooms # 2

>>> d

{'Anna': 105, 'Jack':[303, 304, 305]}

References and mutability in the context of dictionary values
  • In the code below, Anna has a room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?

>>> d = {'Anna': 104, 'Jack':[303, 304]}

>>> anna_room = d['Anna']

>>> anna_room = 105

>>> # d['Anna'] = anna_room # 1

>>> jack_rooms = d['Jack']

>>> jack_rooms.append(305)

>>> # d['Jack'] = jack_rooms # 2

>>> d

{'Anna': 104, 'Jack':[303, 304, 305]}
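The difference between the two cases can be checked with the `is` operator, which tests whether two names refer to the same object. This is a minimal sketch; the second dictionary `d2` and the name `jack_copy` are additions for illustration, showing how to avoid the shared reference when that is not wanted.

```python
d = {'Anna': 104, 'Jack': [303, 304]}

# Rebinding: anna_room is pointed at a different integer object.
# The dictionary still refers to 104.
anna_room = d['Anna']
anna_room = 105

# Mutation: jack_rooms and d['Jack'] are the same list object,
# so appending through one name is visible through the other.
jack_rooms = d['Jack']
same_list = jack_rooms is d['Jack']
jack_rooms.append(305)

# Copying: list(...) builds a new list, so changes to the copy
# do not reach the dictionary.
d2 = {'Jack': [303, 304]}
jack_copy = list(d2['Jack'])
jack_copy.append(305)

print(d)
print(d2)
```

Integers are immutable, so the only way to change Anna's room is to assign a new value to the key; lists are mutable, so Jack's entry can be changed in place.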

(simplifying representation of keys)

d = {'Anna': 104, 'Jack':[303, 304]}

[Diagram: d holds two entries. Key 'Anna' holds Ref <address1>, which points to an object of Type Integer with Data 104. Key 'Jack' holds Ref <address2>, which points to an object of Type List with Data [303, 304].]


d = {'Anna': 104, 'Jack':[303, 304]}
anna_room = d['Anna']
anna_room = 105
d['Anna'] = anna_room # 1

[Diagram: Initially, key 'Anna' in d and the name anna_room both hold Ref <address1>, an Integer object with Data 104. After anna_room = 105, the name anna_room holds Ref <address3>, a new Integer object with Data 105. After line 1, key 'Anna' also holds Ref <address3>. Key 'Jack' is unchanged: Ref <address2>, a List object with Data [303, 304].]


d = {'Anna': 104, 'Jack':[303, 304]}
jack_rooms = d['Jack']
jack_rooms.append(305)
# d['Jack'] = jack_rooms # NOT NEEDED

[Diagram: Key 'Anna' holds Ref <address1>, an Integer object with Data 104. Key 'Jack' in d and the name jack_rooms both hold Ref <address2>, a single List object. The append changes that object's Data in place from [303, 304] to [303, 304, 305], so the dictionary sees the new room without any reassignment.]


But suppose you begin with an empty dictionary.

  • Assignment is necessary because, upon first encountering a key, d.get(name, []) returns a new empty list, which is assigned to the variable rooms. This list is not yet stored in the dictionary as the value for the key name.
  • Subsequently, when encountering a key already in the dictionary, executing the last line does not change the structure of the dictionary, because rooms already refers to the list stored under that key.

a = [('Jack', 303), ('Jack', 304)]
d = {}
for (name, room) in a:
    rooms = d.get(name, [])
    rooms.append(room)
    d[name] = rooms # necessary
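The built-in method `dict.setdefault` combines the lookup and the first-time assignment into one step. The following sketch is an alternative to the pattern above, not the slide's own code; the extra `('Anna', 104)` entry is added here only to show a second key.

```python
a = [('Jack', 303), ('Jack', 304), ('Anna', 104)]

d = {}
for name, room in a:
    # setdefault returns d[name] if the key exists; otherwise it
    # stores the given empty list under name and returns that list.
    # Either way, the returned list is the one held by the dictionary,
    # so appending to it updates the dictionary's value in place.
    d.setdefault(name, []).append(room)

print(d)
```

Unlike `d.get(name, [])`, which returns a fallback list that is never stored, `setdefault` inserts the fallback into the dictionary, so no separate assignment line is required.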