
LING 438/538 Computational Linguistics

Sandiway Fong

Lecture 21: 11/7



Administrivia

  • Short Lecture Today

  • Homework 5

    • out today

    • due next Tuesday

    • usual rules

Homework 5

Britney Spears

  • Webpage

    • from the BBC website

    • typos: #2 is wrong, and #3 and #5 shouldn’t be the same

use this list for the homework

[list of Britney misspellings shown on the slide; only the entry "=9. Britanny" survives]

Question 1: Britney Spears

  • Question 1

    • Part 1 (3pts)

      • Compute the edit distances for the misspellings of Britney (Spears)

      • use insert=delete=1, substitute=2

    • Part 2 (3pts)

      • Compute the edit distances for the misspellings of Britney (Spears)

      • use insert=delete=1, substitute=1

    • Part 3 (4pts)

      • Come up with a metric that correctly ranks the top 7 misspellings for either Part 1 or Part 2
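The two cost settings in Parts 1 and 2 can be explored with a small dynamic-programming routine. The following is only a sketch of the standard minimum edit distance algorithm; the function name and parameter names are made up for illustration, and the textbook's classic intention/execution pair is used instead of a homework item.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum edit distance between two strings via dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                        # delete
                dist[i][j - 1] + ins_cost,                        # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # match or substitute
            )
    return dist[-1][-1]


print(edit_distance("intention", "execution"))              # Part 1 costs: 8
print(edit_distance("intention", "execution", sub_cost=1))  # Part 2 costs: 5
```

With substitute=2 a substitution never beats a delete plus an insert, so the two settings can rank the same misspellings differently.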

Making Money from Misspellings

  • Webpage


US legal authorities are appealing for help in tracking down John Zuccarini, who they say is making more than a million dollars a year from a collection of misspelled domain names.

The Federal Trade Commission is now looking for ways to recover the cash Mr Zuccarini has made from the domain names.


Mr Zuccarini has been practising a novel variation of cybersquatting which usually involves gaining control of a website that you have no real claim to, and then offering it for sale to the rightful owner at a premium.

The domains registered by Mr Zuccarini were typically misspellings of well-known names. Mr Zuccarini has reportedly registered 15 variations of the spelling of Cartoon Network TV channel, and 41 of pop star Britney Spears.

Question 2



  • homework corpus

    • WSJ9_041.txt

    • from the course homepage

      • Wall Street Journal articles (July 26–28 1989)

      • this is the text file you will use

      • contains almost 22,000 lines and 150,000 words

      • use only the text between the SGML markers

      • <text> </text>

      • example...



Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as $26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker.

Sun reported last month that management errors, rather than a weakness in the market for computer workstations, would result in lower earnings or a "slight loss" in the quarter ended June 30.

But the amount cited in yesterday's disclosure was far greater than analysts had suspected, and suggested deepening troubles.

"It is extremely disconcerting," said Peter Rogers, computer analyst at Robertson Stephens & Co. in San Francisco.

"Many of us had been led to believe that most (of the management-systems problems) had been put behind them.

It looks like there is another layer."

"On the surface, it would lead one to conclude that Sun has at least temporarily completely lost control of its operations," Mr. Rogers added.

The maker of high-performance desktop computers now says the loss was probably between $20 million and $26 million, compared with year-ago net income of $25.3 million, or 66 cents a share.

The huge fourth-quarter loss will bring year-end earnings to between $55 million and $61 million, or between 72 cents and 78 cents a share, compared with year-ago net of $66.4 million, or 89 cents a share.

Sun said it expects to report fourth-quarter revenue of $425 million to $435 million, up 16% to 19% from a year earlier.

That would contrast sharply with Sun's third quarter, when revenue surged 92%, and would put full-year revenue in the $1.75 billion to $1.77 billion range, up from $1.05 billion a year earlier.


Sun said the problems that led to the loss have "largely been resolved," and that it received record bookings in its fourth quarter.

Still, Sun said profitability in the current quarter, ending Sept. 30, can't be assured.

The company added that a return to profitability will depend on the effectiveness of cost-cutting measures and its ability to obtain parts.

A spokeswoman said the company still faces a shortage of certain parts.

In June, Sun said operations were disrupted by a change to a new system for getting information to management.

The company also cited faulty forecasting of demand, problems in manufacturing new machines and a shortage of certain parts.

The spokeswoman reiterated that the company sees strong demand for its products, and believes the market for computer workstations remains healthy.

Sun said it has imposed a hiring freeze in all areas except sales and customer service, postponed moving into new facilities and curtailed other expenses.

The spokeswoman wouldn't say how much the cost-cutting measures will save.

The announcement was made after the market closed.

Sun's stock closed at $16.25, up 62.5 cents, in national over-the-counter trading.



  • edit out other stuff...


    <DOCNO> WSJ890728-0079 </DOCNO>

    <DD> = 890728 </DD>

    <AN> 890728-0079. </AN>

    <HL> Major Deficit

    @ Signaled by Sun

    @ Microsystems

    @ ---

    @ Firm to Post Quarterly Loss

    @ As Much as $26 Million;

    @ Deepening Trouble Seen

    @ ----

    @ By Carrie Dolan

    @ Staff Reporter of The Wall Street Journal </HL>

    <DD> 07/28/89 </DD>


    <CO> SUNW </CO>
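One possible way to keep only the article text and drop header fields like those above is a regular expression over the raw file. This is a sketch, not part of the assignment: `extract_text` is a hypothetical helper, the tag is matched case-insensitively because the exact casing in WSJ9_041.txt is an assumption, and entities such as `&amp;` are unescaped on the assumption that they appear in the file.

```python
import html
import re

# Non-greedy match so each <text>...</text> block is captured separately.
TEXT_BLOCK = re.compile(r"<text>(.*?)</text>", re.IGNORECASE | re.DOTALL)


def extract_text(sgml):
    """Return the contents of every <text>...</text> block as a list of strings."""
    return [html.unescape(block.strip()) for block in TEXT_BLOCK.findall(sgml)]


sample = """<DOCNO> WSJ890728-0079 </DOCNO>
<CO> SUNW </CO>
<TEXT>
The announcement was made after the market closed.
</TEXT>"""
print(extract_text(sample))
```

Everything outside the markers (DOCNO, HL, DD, CO and so on) is simply never matched, so it does not contribute to the word counts.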



(Ngram Statistics Package) NSP

  • for homework question 2

    • suggest you use the NSP software package,

    • brew your own, or

    • any other package you want to use...

  • Ted Pedersen’s Perl-based Ngram Statistics Package (NSP)

    • you need to install a free Perl on your system if not already available

    • e.g. ActiveState Perl

you only need to use the Perl program file

NSP on Windows

command line options

perl --help


Counts up the frequency of all n-grams occurring in SOURCE. Sends to DESTINATION the list of n-grams found, along with the frequencies of combinations of the n tokens that the n-gram is composed of. If SOURCE is a directory, all text files in it are counted.


--ngram N Creates n-grams of N tokens each. N = 2 by default.

--newLine Prevents n-grams from spanning across the new-line character.
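For anyone brewing their own counter, the behaviour of the two options above can be approximated in a few lines. This is a sketch only: `count_ngrams` and its parameters are hypothetical names, and tokens are split on whitespace, which is an assumption and may not match how NSP tokenizes punctuation, so the counts can differ from NSP's.

```python
from collections import Counter


def count_ngrams(text, n=2, new_line=False):
    """Count n-grams of whitespace-separated tokens.

    With new_line=True, no n-gram spans a line break.
    """
    chunks = text.splitlines() if new_line else [text]
    counts = Counter()
    for chunk in chunks:
        tokens = chunk.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


text = "Sun said it expects\nThe company added"
print(count_ngrams(text, n=2).most_common(3))
print(("expects", "The") in count_ngrams(text, n=2, new_line=True))
```

If the corpus has one sentence per line, the `new_line=True` setting keeps bigrams from pairing the last word of one sentence with the first word of the next.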

Question 2

  • (4pts)

    • List the most frequent closed-class word (for each class) in the corpus

      • use the definition of closed classes given in section 8.1 of the textbook (and your judgment)

  • (2pts)

    • What is the most frequent proper noun?

    • What is the most frequent (non-auxiliary) verb?


  • (8pts)

  • compute the probability of the (similar) sentences

    • Bristol-Myers agreed to merge with Sun Microsystems

    • Bristol-Myers and Sun Microsystems agreed to merge

  • using both the bigram and trigram approximations

  • use add-one smoothing where relevant
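The bigram and trigram approximations with add-one smoothing can be sketched as follows. Several choices here are assumptions rather than anything the slides specify: the helper names, padding each sentence with n-1 START symbols, taking the vocabulary size V as the number of word types in the corpus (START excluded), and not modelling a sentence-end symbol. Each conditional probability is estimated as (count(context + word) + 1) / (count(context) + V).

```python
from collections import Counter
from math import prod

START = "START"


def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-token contexts, padding with START."""
    grams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = [START] * (n - 1) + tokens
        for i in range(len(tokens)):
            gram = tuple(padded[i:i + n])
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts


def sentence_prob(tokens, sentences, n=2):
    """Add-one smoothed n-gram approximation of the probability of tokens."""
    grams, contexts = ngram_counts(sentences, n)
    vocab_size = len({w for s in sentences for w in s})
    padded = [START] * (n - 1) + tokens
    return prod(
        (grams[tuple(padded[i:i + n])] + 1)
        / (contexts[tuple(padded[i:i + n - 1])] + vocab_size)
        for i in range(len(tokens))
    )


toy_corpus = [["a", "b"], ["a", "c"]]
print(sentence_prob(["a", "b"], toy_corpus, n=2))
print(sentence_prob(["a", "b"], toy_corpus, n=3))
```

On a real corpus the product of many small probabilities underflows quickly, so summing log probabilities is the usual alternative.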



given the chain rule

$p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$

  • what is $w_1$?

    • if we’re talking about a sentence, $w_1$ = START (the sentence start symbol)

    • p(START) = 1

  • Pedersen’s program does not take START into account

    • you’ll have to calculate this separately or modify the corpus before running NSP...

  • e.g. the sentence beginning with Sun ...


START Sun Microsystems Inc. said it will post a larger-than-expected fourth-quarter loss of as much as $26 million and may show a loss in the current first quarter, raising further troubling questions about the once high-flying computer workstation maker.
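Modifying the corpus before counting could look like the sketch below. It assumes the extracted text already has one sentence per line, which may not hold for the raw file, and `add_start_symbol` is a hypothetical helper name.

```python
def add_start_symbol(lines, start="START"):
    """Prefix every non-empty line with the sentence start symbol."""
    return [f"{start} {line.strip()}" for line in lines if line.strip()]


lines = [
    "Sun said the problems have been resolved.",
    "",
    "The announcement was made after the market closed.",
]
for line in add_start_symbol(lines):
    print(line)
```

After this step, a bigram such as (START, Sun) is counted like any other bigram, so p(Sun | START) can be read off the same counts.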

Question 2



  • for both 438/538

    • Question 1: 10pts

    • Question 2: 14pts

    • Total: 24 pts
