

LING 408/508: Computational Techniques for Linguists

Lecture 18

10/3/2012


Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


Structure of dictionary for phone book problem

  • 2 people named “Sarah Connor”

  • “Kyle Reese” and “Arnold Schwarz” have the same phone number

Key: (string, string) tuple      Value: list of integers

('Mitt','Romney')                [9997777]
('Sarah','Connor')               [1234567, 1010101]
('Kyle','Reese')                 [7654321]
('Arnold','Schwarz')             [7654321]
('Arnold','Ventura')             [2233444]
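
As a minimal sketch, the table above could be written as a Python dictionary like this. The name pb matches the later slides; the lookups and the name ('John', 'Connor') are illustrative assumptions, not part of the lecture.

```python
# Phone book from the slide: keys are (first, last) tuples,
# values are lists of integer phone numbers.
pb = {
    ('Mitt', 'Romney'): [9997777],
    ('Sarah', 'Connor'): [1234567, 1010101],  # two people share this name
    ('Kyle', 'Reese'): [7654321],
    ('Arnold', 'Schwarz'): [7654321],         # same number as Kyle Reese
    ('Arnold', 'Ventura'): [2233444],
}

# Look up every number listed under one name.
sarah_nums = pb[('Sarah', 'Connor')]

# get() returns a default instead of raising KeyError for a missing name.
# ('John', 'Connor') is a hypothetical name that is not in the phone book.
missing = pb.get(('John', 'Connor'), [])

print(sarah_nums)  # prints [1234567, 1010101]
print(missing)     # prints []
```

Tuples work as keys because they are immutable; a list such as ['Sarah', 'Connor'] could not be used as a dictionary key.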


Other queries (DON’T LOOK AHEAD!!! SPOILERS FOLLOW)

# pb is a dictionary

# keys: (string, string)

# values: list of integers

1. How many names are listed in the phone book?

2. What are the distinct first names listed in the phone book?

3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

4. How many phone numbers are there in the phone book?

5. How many phone numbers are there such that at least 2 people share that phone number?


1. How many names are listed in the phone book?

len(pb.keys())  # len(pb) gives the same result


2. What are the distinct first names listed in the phone book?

set([first for (first, last) in pb.keys()])


3. How many distinct people are listed in the phone book? It’s possible that different people have the same name, and that multiple people have the same phone number.

Example:

pb = {('Bob', 'Barker'):[1234567],
      ('Bobby', 'Barker'):[1234567],
      ('Sue', 'Parker'):[3333333, 4444444]}


ppl = []
for (name, ph_nums) in pb.items():
    for num in ph_nums:
        # each (name, number) pair counts as one distinct person
        ppl.append((name, num))
num_distinct_ppl = len(ppl)

# >>> ppl
# [(('Bob', 'Barker'), 1234567),
#  (('Bobby', 'Barker'), 1234567),
#  (('Sue', 'Parker'), 3333333),
#  (('Sue', 'Parker'), 4444444)]
# (insertion order in Python 3.7+; earlier versions may order the items differently)


4. How many phone numbers are there in the phone book?

all_nums = set()
for nums in pb.values():
    all_nums.update(nums)
num_phone_nums = len(all_nums)
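
To check the answers to queries 1 through 4, here is a small runnable sketch that applies them to the Bob Barker example dictionary. The variable names num_names, first_names, and num_people are assumptions chosen for illustration, and the comprehension forms are just one possible way to write each query.

```python
# Example phone book from the slide for query 3.
pb = {('Bob', 'Barker'): [1234567],
      ('Bobby', 'Barker'): [1234567],
      ('Sue', 'Parker'): [3333333, 4444444]}

# 1. Number of names: one key per name.
num_names = len(pb)

# 2. Distinct first names: a set comprehension over the keys.
first_names = {first for (first, last) in pb}

# 3. Distinct people: one person per (name, number) pair,
#    so add up the lengths of the value lists.
num_people = sum(len(nums) for nums in pb.values())

# 4. Distinct phone numbers: collect every number into a set.
all_nums = set()
for nums in pb.values():
    all_nums.update(nums)
num_phone_nums = len(all_nums)

print(num_names, sorted(first_names), num_people, num_phone_nums)
# prints: 3 ['Bob', 'Bobby', 'Sue'] 4 3
```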


5. How many phone numbers are there such that at least 2 people share that phone number?

  • Need to get all numbers with >= 2 names

  • Construct a dictionary mapping a number to a list of names


# key: number
# value: list of names
num_to_names = {}
for (name, numlist) in pb.items():
    for num in numlist:
        names = num_to_names.get(num, [])
        names.append(name)
        num_to_names[num] = names # necessary when num is not yet a key

num_shared = 0
for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

print('quantity of phone numbers that are shared:')
print(num_shared)


Using a list comprehension

# loop version
num_shared = 0
for (num, names) in num_to_names.items():
    if len(names) >= 2:
        num_shared += 1

# same count, using a list comprehension
# x refers to (num, names) tuples
shared = len([x for x in num_to_names.items() if len(x[1])>=2])
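
As an alternative sketch (not the approach used on the slides), collections.defaultdict from the standard library can build the number-to-names mapping without the explicit get-and-reassign step. The three-entry pb below is a shortened assumption based on the earlier phone book slide.

```python
from collections import defaultdict

# Shortened phone book: Kyle Reese and Arnold Schwarz share a number.
pb = {('Kyle', 'Reese'): [7654321],
      ('Arnold', 'Schwarz'): [7654321],
      ('Sarah', 'Connor'): [1234567, 1010101]}

# defaultdict(list) creates an empty list the first time a key is accessed,
# and that list is stored in the dictionary immediately.
num_to_names = defaultdict(list)
for name, numlist in pb.items():
    for num in numlist:
        num_to_names[num].append(name)

# Count the numbers that map to two or more names.
num_shared = sum(1 for names in num_to_names.values() if len(names) >= 2)
print(num_shared)  # prints 1
```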


Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


Frequencies of words in a corpus: types and tokens

  • Brown corpus of English:

    • 1160743 word tokens

    • 49680 word types

  • Type: a distinct word

    • “with”

  • Token: an individual occurrence of a word

    • “with” occurs 7270 times
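
The type/token distinction can be illustrated on a toy sentence. This is only a sketch: the sentence is made up for illustration, and splitting on whitespace is a simplifying assumption about tokenization.

```python
# A made-up sentence standing in for a corpus.
text = 'the cat sat with the dog and the bird'

# Tokens: every individual occurrence of a word.
tokens = text.split()

# Types: the distinct words, so duplicates collapse in a set.
types = set(tokens)

num_tokens = len(tokens)
num_types = len(types)

# Token frequency of one type.
the_count = tokens.count('the')

print(num_tokens, num_types, the_count)  # prints: 9 7 3
```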


Word frequency

  • Write a program that reads a text file and writes an output file listing each word type and its token frequency, in order of decreasing frequency.

  • Output format:

    1000    hello

    800     fish

    40      lion

  • Plot the frequency distribution.


# dictionary to store word frequencies
# maps a string to an integer
w_to_freq = {}

# read file one line at a time,
# break each line into individual tokens,
# and count the tokens
for line in open('C:/brown-corpus.txt', 'r'):
    tokens = line.split()
    for tok in tokens:
        w_to_freq[tok] = w_to_freq.get(tok,0) + 1


# w_to_freq.items() yields (string, int) tuples
# convert to list of (int, string) tuples
# so that we can sort by frequency
freqs_words = []
for (w, freq) in w_to_freq.items():
    freqs_words.append((freq, w))

# sort by decreasing frequency
freqs_words.sort(reverse=True)

# write output file
of = open('C:/output.txt', 'w')
for (freq, w) in freqs_words:
    of.write('{:8d}\t{:s}\n'.format(freq, w))
of.close()


Using a list comprehension

# freqs_words = []
# for (w, freq) in w_to_freq.items():
#     freqs_words.append((freq, w))

# same, using a list comprehension
freqs_words = [(f,w) for (w,f) in w_to_freq.items()]
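
For comparison, the standard library's collections.Counter can do the counting and the sorting in one object. This is an alternative sketch, not the method from the slides; the two-line lines list is a made-up stand-in for the corpus file.

```python
from collections import Counter

# Made-up stand-in for the lines of a corpus file.
lines = ['the fish saw the lion', 'the lion saw a fish']

# Counter is a dictionary subclass that maps each item to its count.
w_to_freq = Counter()
for line in lines:
    w_to_freq.update(line.split())

# most_common() returns (word, count) pairs in decreasing order of count.
for w, freq in w_to_freq.most_common():
    print('{:8d}\t{:s}'.format(freq, w))
```

Words with equal counts may come out in a different order than with the sort on (freq, w) tuples, which breaks ties by the word itself.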


Frequency and rank

  • Sort words by decreasing frequency

  • Rank = order in sorted list

    • Rank 1: most-frequent word

    • Rank 2: second most-frequent word

    • etc.

  • Plot word frequencies by rank
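
A short sketch of how ranks could be attached to words once they are sorted. The counts reuse the sample numbers from the output-format slide, and the name ranked is an assumption for illustration.

```python
# Sample counts taken from the output-format example.
w_to_freq = {'fish': 800, 'hello': 1000, 'lion': 40}

# Sort (freq, word) tuples by decreasing frequency.
freqs_words = sorted(((f, w) for (w, f) in w_to_freq.items()), reverse=True)

# enumerate(..., start=1) numbers the sorted items 1, 2, 3, ...
# so the most frequent word gets rank 1.
ranked = [(rank, w, f) for rank, (f, w) in enumerate(freqs_words, start=1)]

print(ranked)
# prints: [(1, 'hello', 1000), (2, 'fish', 800), (3, 'lion', 40)]
```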


Plotting

  • Download and install matplotlib

  • http://matplotlib.sourceforge.net/


import matplotlib.pyplot as plt

# some code omitted
# read word frequencies from a corpus

# counts is a list of integers in
# decreasing order
counts = [freq for (freq,w) in freqs_words]

plt.plot(counts)
plt.xlabel('word rank')
plt.ylabel('word frequency')
plt.title('Word frequency vs. rank')
plt.show()


Plot of word frequencies, linear scale (out-of-date figure: forgot to include labels and titles)

[Figure: frequency (in 10,000s) plotted against rank (in 10,000s)]


Plot of word frequencies, zoom in


import matplotlib.pyplot as plt

# some code omitted
# read word frequencies from a corpus
# counts is a list of integers in
# decreasing order

plt.plot(counts)
plt.xscale('log') # logarithmic scales
plt.yscale('log') # for x- and y- axes
plt.xlabel('word rank')
plt.ylabel('word frequency')
plt.title('Word frequency vs. rank')
plt.show()


Plot of word frequencies, log-log scale

[Figure: both axes logarithmic; tick marks at log10(10) = 1, log10(100) = 2, log10(1000) = 3]


Plot of word frequencies, log-log scale

[Figure: both axes logarithmic; tick marks at log10(10) = 1, log10(100) = 2, log10(1000) = 3. Annotations on the curve:]

  • ~10 types with freq. > 10,000

  • ~100 types with 1,000 < freq. < 10,000

  • ~1,000 types with 100 < freq. < 1,000

  • ~10,000 types with 10 < freq. < 100

  • Tens of thousands of types with 1 < freq. < 10


Word frequency distributions in language

  • There are a few common words

  • A large, but not huge, number of medium-frequency words

  • Very, very many low-frequency words


Word frequencies exemplify Zipf’s law

  • Power law distribution

  • The frequency F of a word w is inversely proportional to the rank R of w:

    F ∝ 1 / R

    i.e., F × R = k, for some constant k

  • Example: the 50th most common word type should occur three times as frequently as the 150th most common word type

    freq. at rank 50: ∝ 1 / 50

    freq. at rank 150: ∝ 1 / 150

    ( 1 / 50 ) / ( 1 / 150 ) = 3
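
The arithmetic in the example can be checked with a few lines of code. This sketch assumes an idealized Zipf distribution; the constant k = 60000 and the function name zipf_freq are hypothetical values chosen only for illustration, not measurements from the Brown corpus.

```python
# Hypothetical constant k; under an ideal Zipf's law, F = k / R.
k = 60000

def zipf_freq(rank):
    """Predicted frequency of the word at the given rank."""
    return k / rank

# The word at rank 50 is predicted to be 3 times as frequent
# as the word at rank 150, whatever the value of k.
ratio = zipf_freq(50) / zipf_freq(150)

# Frequency times rank gives back the same constant at every rank.
products = [zipf_freq(r) * r for r in (1, 50, 150)]

print(ratio)     # prints 3.0
print(products)  # prints [60000.0, 60000.0, 60000.0]
```

Real corpora only approximate this relationship, so products computed from actual counts would vary around a rough constant rather than match exactly.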


Near-linear relationship between freq. and rank in log-log scale

[Figure: both axes logarithmic; tick marks at log10(10) = 1, log10(100) = 2, log10(1000) = 3]
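
Why the plot is near-linear: taking logarithms of F = k / R gives log F = log k − log R, a straight line with slope −1 in log-log coordinates. The sketch below fits a least-squares line to idealized data to confirm this; the constant k = 60000 and the chosen ranks are hypothetical, and real word counts would give a slope that is only approximately −1.

```python
import math

# Idealized Zipf data with a hypothetical constant k.
k = 60000
ranks = [1, 10, 100, 1000]
freqs = [k / r for r in ranks]

# Move to log-log coordinates.
xs = [math.log10(r) for r in ranks]
ys = [math.log10(f) for f in freqs]

# Ordinary least-squares fit of y = slope * x + intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# The slope should be very close to -1 and the intercept to log10(k).
print(round(slope, 6), round(intercept, 6))
```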


What kinds of words are frequent and infrequent?


Most-frequent words: function words (perform grammatical functions)


Least-frequent words: content words (express meaning)


Outline

  • Applications of dictionaries

    • Phone book

    • Word frequencies and Zipf’s Law

  • References, mutability, and dictionary values


References and mutability in the context of dictionary values

  • In the code below, Anna has one room, and Jack has two rooms. Anna changes rooms, and Jack adds another room. What happens when line (1) is omitted? What happens when line (2) is omitted?

    >>> d = {'Anna': 104, 'Jack':[303, 304]}

    >>> anna_room = d['Anna']

    >>> anna_room = 105

    >>> d['Anna'] = anna_room # 1

    >>> jack_rooms = d['Jack']

    >>> jack_rooms.append(305)

    >>> d['Jack'] = jack_rooms # 2

    >>> d

    {'Anna': 105, 'Jack':[303, 304, 305]}


  • With lines (1) and (2) commented out, Anna’s room in d stays 104, but Jack’s list still gains room 305:

    >>> d = {'Anna': 104, 'Jack':[303, 304]}

    >>> anna_room = d['Anna']

    >>> anna_room = 105

    >>> # d['Anna'] = anna_room # 1

    >>> jack_rooms = d['Jack']

    >>> jack_rooms.append(305)

    >>> # d['Jack'] = jack_rooms # 2

    >>> d

    {'Anna': 104, 'Jack':[303, 304, 305]}
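
The difference between the two cases can be made explicit with the is operator, which tests whether two names refer to the same object. This is a sketch built on the slide's dictionary; the names anna_unchanged, same_object, and copy_rooms, and room 306, are assumptions added for illustration.

```python
d = {'Anna': 104, 'Jack': [303, 304]}

# Integers are immutable: assigning 105 rebinds the name anna_room
# to a different object and leaves the dictionary untouched.
anna_room = d['Anna']
anna_room = 105
anna_unchanged = (d['Anna'] == 104)

# Lists are mutable: jack_rooms and d['Jack'] refer to the same list,
# so append() changes the object that the dictionary also refers to.
jack_rooms = d['Jack']
same_object = jack_rooms is d['Jack']
jack_rooms.append(305)

# Copying the list first gives an independent object;
# changes to the copy do not show up in the dictionary.
copy_rooms = list(d['Jack'])
copy_rooms.append(306)

print(anna_unchanged, same_object)  # prints: True True
print(d)           # prints: {'Anna': 104, 'Jack': [303, 304, 305]}
print(copy_rooms)  # prints: [303, 304, 305, 306]
```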


(simplifying representation of keys)

d = {'Anna': 104, 'Jack':[303, 304]}

d
    Key: 'Anna'    Ref: <address1>  ->  Type: Integer    Data: 104
    Key: 'Jack'    Ref: <address2>  ->  Type: List       Data: [303, 304]


d = {'Anna': 104, 'Jack':[303, 304]}
anna_room = d['Anna']
anna_room = 105
d['Anna'] = anna_room # 1

After anna_room = d['Anna']:
    d
        Key: 'Anna'    Ref: <address1>  ->  Type: Integer    Data: 104
        Key: 'Jack'    Ref: <address2>  ->  Type: List       Data: [303, 304]
    Name: anna_room    Ref: <address1>

After anna_room = 105:
    Name: anna_room    Ref: <address3>  ->  Type: Integer    Data: 105
    (d is unchanged: Key 'Anna' still has Ref: <address1>)

After d['Anna'] = anna_room:
    d
        Key: 'Anna'    Ref: <address3>
        Key: 'Jack'    Ref: <address2>


d = {'Anna': 104, 'Jack':[303, 304]}
jack_rooms = d['Jack']
jack_rooms.append(305)
# d['Jack'] = jack_rooms # NOT NEEDED

After jack_rooms = d['Jack']:
    d
        Key: 'Anna'    Ref: <address1>  ->  Type: Integer    Data: 104
        Key: 'Jack'    Ref: <address2>  ->  Type: List       Data: [303, 304]
    Name: jack_rooms   Ref: <address2>

After jack_rooms.append(305):
    The list at <address2> is changed in place:  Type: List    Data: [303, 304, 305]
    Key 'Jack' and Name jack_rooms both still have Ref: <address2>


  • But suppose you begin with an empty dictionary.

  • Assignment is necessary because upon first encountering a key, an empty list is assigned to the variable rooms. This empty list is not yet a value for the key name.

  • Subsequently, when encountering a key already in the dictionary, the execution of the last line does not change the structure of the dictionary.

    a = [('Jack', 303), ('Jack', 304)]
    d = {}
    for (name, room) in a:
        rooms = d.get(name, [])
        rooms.append(room)
        d[name] = rooms # necessary
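
One way to avoid the explicit reassignment is dict.setdefault, which stores the default in the dictionary before returning it. This is an alternative sketch, not the code from the slides; the extra ('Anna', 104) pair is an assumption added so that the example has two keys.

```python
# (name, room) pairs; the 'Anna' pair is added for illustration.
a = [('Jack', 303), ('Jack', 304), ('Anna', 104)]

d = {}
for (name, room) in a:
    # setdefault() inserts an empty list for a new key and returns it;
    # for an existing key it returns the list already stored in d.
    # Either way, append() modifies the list that d refers to.
    d.setdefault(name, []).append(room)

print(d)  # prints: {'Jack': [303, 304], 'Anna': [104]}
```

By contrast, d.get(name, []) returns a fresh list without storing it, which is why the version with get needs the assignment back into d.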

