LING 438/538 Computational Linguistics
Sandiway Fong
Lecture 21: 11/7

Administrivia

• Short Lecture Today
• Homework 5
• out today
• due next Tuesday
• usual rules
Britney Spears
• Webpage
• http://news.bbc.co.uk/cbbcnews/hi/music/newsid_1953000/1953614.stm

from the BBC website

• typos: #2 is wrong, and #3 and #5 shouldn’t be the same

use this list for the homework

1. Brittany
2. Brittney
3. Britany
4. Britny
5. Briteny
6. Britteny
7. Briney
8. Brittny
=9. Brintey
=9. Britanny

Question 1: Britney Spears
• Question 1
• Part 1 (3pts)
• Compute the edit distances for the misspellings of Britney (Spears)
• use insert=delete=1, substitute=2
• Part 2 (3pts)
• Compute the edit distances for the misspellings of Britney (Spears)
• use insert=delete=1, substitute=1
• Part 3 (4pts)
• Come up with a metric that correctly ranks the top 7 misspellings for either Part 1 or Part 2
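The two cost settings in Parts 1 and 2 can be compared with a small dynamic-programming routine. The following is a minimal sketch of the standard minimum edit distance algorithm, not a prescribed solution; the function name `edit_distance` and its parameter names are assumptions made for illustration.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance from source to target (dynamic programming)."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete from source
                dist[i][j - 1] + ins_cost,                       # insert into source
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
            )
    return dist[n][m]


# Textbook-style check with both cost settings
print(edit_distance("intention", "execution"))              # 8 (substitute=2)
print(edit_distance("intention", "execution", sub_cost=1))  # 5 (substitute=1)
print(edit_distance("Britny", "Britney"))                   # 1 (one insertion)
```

Keeping the three costs as parameters means the same routine covers Part 1 (substitute=2) and Part 2 (substitute=1) without any change to the table-filling logic.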
Making Money from Misspellings
• Webpage
• http://news.bbc.co.uk/1/hi/sci/tech/1575060.stm
Excerpts:

US legal authorities are appealing for help in tracking down John Zuccarini, who they say is making more than a million dollars a year from a collection of misspelled domain names.

The Federal Trade Commission is now looking for ways to recover the cash Mr Zuccarini has made from the domain names.

Mr Zuccarini has been practising a novel variation of cybersquatting which usually involves gaining control of a website that you have no real claim to, and then offering it for sale to the rightful owner at a premium.

The domains registered by Mr Zuccarini were typically misspellings of well-known names. Mr Zuccarini has reportedly registered 15 variations of the spelling of Cartoon Network TV channel, and 41 of pop star Britney Spears.

Corpus
• homework corpus
• WSJ9_041.txt
• from the course homepage
• Wall Street Journal articles (July 26–28 1989)
• this is the text file you will use
• contains almost 22,000 lines and 150,000 words
• use only the text between the SGML markers
• <TEXT> </TEXT>
• example...
Question 2

<TEXT>

Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as \$26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker.

Sun reported last month that management errors, rather than a weakness in the market for computer workstations, would result in lower earnings or a "slight loss" in the quarter ended June 30.

But the amount cited in yesterday's disclosure was far greater than analysts had suspected, and suggested deepening troubles.

"It is extremely disconcerting," said Peter Rogers, computer analyst at Robertson Stephens &amp; Co. in San Francisco.

"Many of us had been led to believe that most (of the management-systems problems) had been put behind them.

It looks like there is another layer."

"On the surface, it would lead one to conclude that Sun has at least temporarily completely lost control of its operations," Mr. Rogers added.

The maker of high-performance desktop computers now says the loss was probably between \$20 million and \$26 million, compared with year-ago net income of \$25.3 million, or 66 cents a share.

The huge fourth-quarter loss will bring year-end earnings to between \$55 million and \$61 million, or between 72 cents and 78 cents a share, compared with year-ago net of \$66.4 million, or 89 cents a share.

Sun said it expects to report fourth-quarter revenue of \$425 million to \$435 million, up 16% to 19% from a year earlier.

That would contrast sharply with Sun's third quarter, when revenue surged 92%, and would put full-year revenue in the \$1.75 billion to \$1.77 billion range, up from \$1.05 billion a year earlier.

Sun said the problems that led to the loss have "largely been resolved," and that it received record bookings in its fourth quarter.

Still, Sun said profitability in the current quarter, ending Sept. 30, can't be assured.

The company added that a return to profitability will depend on the effectiveness of cost-cutting measures and its ability to obtain parts.

A spokeswoman said the company still faces a shortage of certain parts.

In June, Sun said operations were disrupted by a change to a new system for getting information to management.

The company also cited faulty forecasting of demand, problems in manufacturing new machines and a shortage of certain parts.

The spokeswoman reiterated that the company sees strong demand for its products, and believes the market for computer workstations remains healthy.

Sun said it has imposed a hiring freeze in all areas except sales and customer service, postponed moving into new facilities and curtailed other expenses.

The spokeswoman wouldn't say how much the cost-cutting measures will save.

The announcement was made after the market closed.

Sun's stock closed at \$16.25, up 62.5 cents, in national over-the-counter trading.

</TEXT>

Question 2
• edit out other stuff...

<DOC>

<DOCNO> WSJ890728-0079 </DOCNO>

<DD> = 890728 </DD>

<AN> 890728-0079. </AN>

<HL> Major Deficit

@ Signaled by Sun

@ Microsystems

@ ---

@ Firm to Post Quarterly Loss

@ As Much as \$26 Million;

@ Deepening Trouble Seen

@ ----

@ By Carrie Dolan

@ Staff Reporter of The Wall Street Journal </HL>

<DD> 07/28/89 </DD>

<SO> WALL STREET JOURNAL (J) </SO>

<CO> SUNW </CO>

<IN> COMPUTERS AND INFORMATION TECHNOLOGY (CPR) </IN>

<DATELINE> MOUNTAIN VIEW, Calif. </DATELINE>
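Stripping the header fields by hand is tedious for a file of this size, so the step can be sketched in code. This is only an illustration: it assumes the markers are always written in upper case as `<TEXT>` and `</TEXT>`, as in the excerpt above, and the function name `extract_text` and the shortened sample document (including its closing `</DOC>`) are made up for the example.

```python
import re

# Non-greedy match so that each <TEXT>...</TEXT> block is captured separately
TEXT_BLOCK = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def extract_text(sgml):
    """Return the body of every <TEXT> block, dropping header fields."""
    return [block.strip() for block in TEXT_BLOCK.findall(sgml)]


# Shortened, hypothetical document in the same shape as the corpus entries
sample = """<DOC>
<DOCNO> WSJ890728-0079 </DOCNO>
<HL> Major Deficit </HL>
<TEXT>
The announcement was made after the market closed.
</TEXT>
</DOC>"""

print(extract_text(sample))
# ['The announcement was made after the market closed.']
```

For the homework file, the same function could be applied to the contents of WSJ9_041.txt read as one string, with the resulting blocks written out to a new file for counting.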

For homework Question 2, the suggestion is to use the NSP software package, though you may use any other package you want.

(Ngram Statistics Package) NSP

Ted Pedersen’s Perl-based Ngram Statistics Package (NSP)

http://www.d.umn.edu/~tpederse/nsp.html

you need to install a free Perl distribution on your system if one is not already available, e.g. ActiveState Perl

you only need to use the Perl program file

count.pl

NSP on Windows

command line options

perl count.pl --help

Usage: count.pl [OPTIONS] DESTINATION SOURCE [[, SOURCE] ...]

Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.

OPTIONS:

--ngram N Creates n-grams of N tokens each. N = 2 by default.

--newLine Prevents n-grams from spanning across the new-line character.
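To make the two options concrete, here is a rough Python stand-in for what n-gram counting does. It is not NSP and does not reproduce count.pl's output format or tokenization; the function name `count_ngrams`, the `new_line` flag, and the whitespace tokenization are all assumptions made for this sketch.

```python
from collections import Counter


def count_ngrams(lines, n=2, new_line=False):
    """Count n-grams of whitespace-separated tokens.

    With new_line=True, n-grams never span two input lines,
    which mirrors the idea behind the --newLine option.
    """
    if new_line:
        streams = [line.split() for line in lines]
    else:
        streams = [[tok for line in lines for tok in line.split()]]
    counts = Counter()
    for tokens in streams:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


lines = ["Sun said it", "Sun said the"]
print(count_ngrams(lines)[("Sun", "said")])                # 2
print(count_ngrams(lines)[("it", "Sun")])                  # 1 (spans the line break)
print(count_ngrams(lines, new_line=True)[("it", "Sun")])   # 0
```

The last two lines show why the option matters: without it, the final token of one line and the first token of the next are counted as a bigram.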

Question 2
• (4pts)
• List the most frequent closed-class word (for each class) in the corpus
• use the definition of closed-classes (and your judgment) listed in section 8.1 of the textbook
• (2pts)
• What is the most frequent proper noun?
• What is the most frequent (non-auxiliary) verb?
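Once unigram counts are available, picking the most frequent member of a class is a lookup against a word list. The sketch below assumes you have built the word list yourself; the function name `most_frequent_in_class` and the tiny `determiners` set are hypothetical and are not the textbook's definition of the class.

```python
from collections import Counter


def most_frequent_in_class(tokens, word_class):
    """Return (word, count) for the most frequent token in word_class, or None."""
    counts = Counter(
        tok.lower() for tok in tokens if tok.lower() in word_class
    )
    return counts.most_common(1)[0] if counts else None


# Hypothetical, deliberately incomplete class list for illustration only
determiners = {"a", "an", "the"}
tokens = "Sun said it will post a loss of as much as the loss in the quarter".split()
print(most_frequent_in_class(tokens, determiners))  # ('the', 2)
```

Lower-casing is a design choice here: it merges sentence-initial "The" with "the", which is usually what you want for closed-class words but not for proper nouns.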
Question 2
• (8pts)
• compute the probability of the (similar) sentences
• Bristol-Myers agreed to merge with Sun Microsystems
• Bristol-Myers and Sun Microsystems agreed to merge
• using both the bigram and trigram approximations
• use add-one smoothing where relevant
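The calculation can be sketched as follows, on a toy corpus rather than the homework file. Several choices here are assumptions rather than requirements: each sentence is prefixed with a single START symbol, the first word of a sentence is conditioned on START alone even in the trigram case, and the vocabulary size used for add-one smoothing counts word types but not START. The names `train` and `sentence_prob` are made up for the example.

```python
from collections import Counter

START = "START"


def train(sentences, n):
    """Collect n-gram counts, history counts and vocabulary size."""
    ngrams, histories, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = [START] + words
        for i in range(1, len(tokens)):
            # Up to n-1 preceding tokens, fewer near the sentence start
            history = tuple(tokens[max(0, i - n + 1):i])
            ngrams[history + (tokens[i],)] += 1
            histories[history] += 1
    return ngrams, histories, len(vocab)


def sentence_prob(sentence, model, n):
    """n-gram approximation of p(sentence) with add-one smoothing."""
    ngrams, histories, vocab_size = model
    tokens = [START] + sentence.split()
    prob = 1.0  # p(START) = 1
    for i in range(1, len(tokens)):
        history = tuple(tokens[max(0, i - n + 1):i])
        seen = ngrams[history + (tokens[i],)]
        prob *= (seen + 1) / (histories[history] + vocab_size)
    return prob


toy = ["Sun agreed to merge", "Sun said it"]  # toy corpus, not WSJ9_041.txt
bigram_model = train(toy, 2)
trigram_model = train(toy, 3)
print(sentence_prob("Sun agreed to merge", bigram_model, 2))
print(sentence_prob("Sun agreed to merge", trigram_model, 3))
```

With real counts from NSP, only the lookup tables change; the product of smoothed conditional probabilities stays the same. For long sentences, summing log probabilities instead of multiplying avoids underflow.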
Note

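given the chain rule

$$p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$$

what is $w_1$?

if we’re talking about a sentence, $w_1$ = START, and we assume

p(START) = 1

Example:

for a sentence beginning with Sun ... (see the file excerpt below), the first probability to estimate is

p(Sun|START)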

Note

Pedersen’s program does not take into account START

you’ll have to calculate this separately or modify the corpus before running NSP...

sentence start symbol = START

file:

START Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as \$26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker.
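Modifying the corpus this way can be automated. The sketch below assumes the extracted text has one sentence per line, which is how the excerpt above is laid out; the function name `add_start_symbol` is hypothetical.

```python
def add_start_symbol(lines, start="START"):
    """Prefix every non-empty line with the sentence-start symbol."""
    return [f"{start} {line.strip()}" for line in lines if line.strip()]


sentences = [
    "Sun said it expects to report fourth-quarter revenue.",
    "",
    "The announcement was made after the market closed.",
]
for line in add_start_symbol(sentences):
    print(line)
# START Sun said it expects to report fourth-quarter revenue.
# START The announcement was made after the market closed.
```

Counting bigrams on the modified file then gives counts such as (START, Sun) directly, provided n-grams are not allowed to span line breaks.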

Summary
• for both 438/538
• Question 1: 10pts
• Question 2: 14pts
• Total: 24 pts