
A BAYESIAN APPROACH TO SPELLING CORRECTION


‘Noisy channels’

  • In a number of tasks involving natural language, the problem can be viewed as recovering an ‘original signal’ distorted by a ‘noisy channel’:

    • Speech recognition

    • Spelling correction

    • OCR / handwriting recognition

    • (less felicitously perhaps): pronunciation variation

  • This metaphor has provided the justification for the Bayesian approach to statistical NLP, which has also found application outside these areas


Spelling Errors

They are leaving in about fifteen minuets to go to her house

The study was conducted mainly be John Black.

The design an construction of the system will take more than one year.

Hopefully, all with continue smoothly in my absence.

Can they lave him my messages?

I need to notified the bank of this problem.

He is trying to fine out.


Handwriting recognition

  • From Woody Allen’s Take the Money and Run (1969)

    • Allen (a bank robber) walks up to the teller and hands him a note that reads: "I have a gun. Give me all your cash."

  • The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun", Allen says.

  • "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.


Spelling errors

  • How common are spelling errors?

    • 0.05% in carefully edited newswire

    • 1-3% in ‘normal’ human-written text

    • 20% of web queries are misspelled (Google includes spelling correction algorithms)

    • 38% in applications like directory lookup

    • Handwriting recognition errors:

      • Apple Newton: 2-3%


Types of spelling errors

  • Damerau (1964): 80% of all misspelled words (non-word errors) caused by SINGLE-ERROR MISSPELLINGS:

    • INSERTION: thether

    • DELETION: the th

    • SUBSTITUTION: the  thw

    • TRANSPOSITION: the  hte


Dealing with spelling errors (Kukich, 1992)

  • Three increasingly broad problems:

    • NON-WORD ERROR DETECTION: ‘graffe’ instead of ‘giraffe’

    • ISOLATED WORD-ERROR CORRECTION: replacing ‘graffe’ with ‘giraffe’ without looking at context

    • CONTEXT-DEPENDENT ERROR DETECTION / CORRECTION: also detecting spelling errors that result in a real word


Detecting non-word errors: Dictionaries

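  • Peterson (1986): large dictionaries may do more harm than good, because rare words can mask misspellings of common ones

    • wont (a rare word that hides a misspelling of ‘won’t’)

    • veery (a rare word that hides a misspelling of ‘very’)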

  • Damerau and Mays (1989): no evidence this was the case
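
The trade-off Peterson describes can be illustrated with a minimal sketch of dictionary-based non-word detection. The function name `non_words` and both toy dictionaries are assumptions made up for this example; they are not taken from any of the cited papers.

```python
def non_words(tokens, dictionary):
    """Return the tokens that are absent from the dictionary (flagged as non-word errors)."""
    return [t for t in tokens if t.lower() not in dictionary]

# Toy dictionaries (illustrative assumptions, not real word lists).
small = {"i", "very", "much", "go", "will", "not"}
large = small | {"wont", "veery"}  # the larger list also contains rare words

typed = ["I", "wont", "go"]  # assume the writer meant "won't"

flagged_small = non_words(typed, small)  # the typo is caught
flagged_large = non_words(typed, large)  # the typo is masked by the rare word "wont"
```

With the small dictionary the mistyped token is flagged; with the large one it passes as a legitimate word, which is the effect Peterson warned about.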


The Noisy Channel model


Bayesian inference

  • ‘Bayesian inference’ is the name given to techniques typically used in diagnostics to identify the CAUSE of certain OBSERVATIONS

  • The name ‘Bayesian’ comes from the fact that Bayes’ rule is used to ‘turn around’ a problem: from one of estimating the posterior probability of the CAUSE given the OBSERVATIONS to one of estimating the likelihood of the OBSERVATIONS given the CAUSE, together with the prior probability of the CAUSE


Bayesian inference: the equations

  • (These are equations that we will encounter again and again for different tasks)

  • The statistical formulation of the problem of finding the most likely ‘explanation’ (cause C) for the observation O:

    $\hat{C} = \arg\max_{C} P(C \mid O)$

  • Using Bayes’ Rule, this probability can be ‘turned around’:

    $P(C \mid O) = \frac{P(O \mid C)\, P(C)}{P(O)}$


Bayesian equations, 2

  • Some of these quantities are easy to compute, but others much less so, especially P(O)

  • Fortunately, we don’t really need to compute this term! (It’s the same for ALL ‘explanations’, so it does not change which one is most likely)

  • This equation is a pattern that we will encounter again and again:

    $\hat{C} = \arg\max_{C} P(O \mid C)\, P(C)$
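
A small sketch may help show why P(O) can be dropped. It picks the most likely cause from the product of likelihood and prior, then confirms that dividing by P(O) gives the same winner. The function names and all probabilities below are invented for illustration, in the spirit of the diagnostic setting mentioned earlier.

```python
def most_likely_cause(likelihood, prior):
    """Return argmax over causes C of P(O|C) * P(C), for one fixed observation O.

    likelihood: dict mapping cause -> P(O | cause)
    prior:      dict mapping cause -> P(cause)
    """
    return max(likelihood, key=lambda c: likelihood[c] * prior[c])

def posterior(likelihood, prior):
    """Full posterior P(C|O), where P(O) is the sum over causes of P(O|C) * P(C)."""
    p_o = sum(likelihood[c] * prior[c] for c in likelihood)
    return {c: likelihood[c] * prior[c] / p_o for c in likelihood}

# Made-up numbers: P(symptom | cause) and P(cause).
likelihood = {"flu": 0.9, "cold": 0.6, "allergy": 0.2}
prior = {"flu": 0.05, "cold": 0.30, "allergy": 0.65}

best = most_likely_cause(likelihood, prior)
post = posterior(likelihood, prior)
```

Because P(O) divides every candidate's score by the same constant, the ranking of causes is identical with or without it.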


Applying the Bayesian Method to Spelling: Kernighan et al., 1990

  • The program correct takes words rejected by the Unix spell program and generates a list of potential correct words

  • Two steps:

    • Proposing candidate corrections

    • Scoring the candidates

  • An example of isolated word-error correction


Proposing candidate corrections

  • The noisy channel assumption: the misspelled word is the result of a ‘noisy channel’, namely the typist performing a single MISTYPING OPERATION

  • Four possible operations (correct → typed):

    • INSERTION: x → xy

    • DELETION: xy → x

    • SUBSTITUTION: x → y

    • REVERSAL: xy → yx

  • At most one operation involved (cf. Damerau, 1964)
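
The candidate-proposal step can be sketched as follows: undo each of the four operations at every position of the typo and keep the results that are dictionary words. This is a minimal illustration and not the code of Kernighan et al.; the function name, the toy dictionary, and the lowercase-ASCII alphabet are assumptions.

```python
import string

def single_edit_candidates(typo, dictionary, alphabet=string.ascii_lowercase):
    """Map each dictionary word that could yield `typo` via one typist
    operation to the set of operation names that would explain it."""
    candidates = {}

    def add(word, operation):
        if word in dictionary and word != typo:
            candidates.setdefault(word, set()).add(operation)

    for i in range(len(typo) + 1):
        left, right = typo[:i], typo[i:]
        # The typist DELETED a letter: put one back at this position.
        for ch in alphabet:
            add(left + ch + right, "deletion")
        if right:
            # The typist INSERTED a letter: take it out.
            add(left + right[1:], "insertion")
            # The typist SUBSTITUTED a letter: try every other letter.
            for ch in alphabet:
                if ch != right[0]:
                    add(left + ch + right[1:], "substitution")
        if len(right) > 1:
            # The typist REVERSED two adjacent letters: swap them back.
            add(left + right[1] + right[0] + right[2:], "reversal")
    return candidates

# Toy dictionary (an assumption for this example).
dictionary = {"actress", "cress", "caress", "access", "across", "acres", "acre", "the"}
candidates = single_edit_candidates("acress", dictionary)
```

Note that `acre` is not proposed, because reaching it from `acress` would take two operations, which the single-error assumption rules out.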


Example: acress


Scoring the candidates

  • Choose the correction c with the highest probability given the typo t:

    $\hat{c} = \arg\max_{c} P(t \mid c)\, P(c)$

  • P(c): estimated from word frequencies in a 44M-word corpus, with smoothing (Good-Turing)
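
The scoring step can be sketched as below. Two caveats: the prior uses simple add-k smoothing as a stand-in for Good-Turing (a deliberate simplification), and every count and channel probability is invented for illustration rather than taken from the 44M-word corpus. The function names are likewise hypothetical.

```python
def smoothed_prior(word, counts, vocab_size, k=0.5):
    """Add-k estimate of P(c); a simple stand-in for Good-Turing smoothing."""
    total = sum(counts.values())
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

def best_correction(channel, counts, vocab_size):
    """Return the candidate c maximising P(t|c) * P(c).

    channel maps each candidate c to an assumed P(t|c) for the observed typo t.
    """
    return max(channel, key=lambda c: channel[c] * smoothed_prior(c, counts, vocab_size))

# Invented numbers, for illustration only.
counts = {"actress": 1000, "acres": 3000, "across": 8000, "cress": 0}
channel = {"actress": 1e-4, "acres": 3e-5, "across": 1e-5, "cress": 1e-6}

winner = best_correction(channel, counts, vocab_size=len(counts))
```

Smoothing matters here because an unseen candidate such as `cress` would otherwise get a prior of exactly zero and could never be chosen, whatever its channel probability.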


A simplification

THE TRAINING CORPUS:

acress actress actress acress acres acres acres

(one acress typed for actress, one typed for acres)

WOULD WANT:

Likelihoods: P(acress|actress) = 1/3, P(acress|acres) = 1/4

APPROXIMATE WITH:

P(acress|actress) = del[c,t] / count[ct] = 1/3 (?)
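
The two estimates can be checked on the toy corpus with a short sketch. It assumes that each acress token is paired with its intended word (actress once, acres once), which is one reading of the slide rather than something it states; the function names are hypothetical.

```python
from fractions import Fraction

# (intended word, typed word) pairs: an assumed reading of the toy corpus.
pairs = [
    ("actress", "acress"), ("actress", "actress"), ("actress", "actress"),
    ("acres", "acress"), ("acres", "acres"), ("acres", "acres"), ("acres", "acres"),
]

def word_level_likelihood(typed, intended, pairs):
    """P(typed | intended), estimated by counting whole words."""
    intended_total = sum(1 for c, _ in pairs if c == intended)
    joint = sum(1 for c, t in pairs if c == intended and t == typed)
    return Fraction(joint, intended_total)

def local_deletion_estimate(kept, dropped, pairs):
    """del[kept, dropped] / count[kept + dropped], counted over intended words."""
    bigram = kept + dropped
    # How often the bigram occurs in what the typist meant to write.
    bigram_count = sum(intended.count(bigram) for intended, _ in pairs)
    # Tokens where the first occurrence of the bigram was typed as `kept` alone.
    deletions = sum(
        1 for intended, typed in pairs
        if typed != intended and typed == intended.replace(bigram, kept, 1)
    )
    return Fraction(deletions, bigram_count)

whole_word = word_level_likelihood("acress", "actress", pairs)
local = local_deletion_estimate("c", "t", pairs)
```

On this tiny corpus the two estimates coincide because `ct` occurs only in `actress`. On real data the local estimate pools evidence from every word containing `ct`, which is what makes it usable when a specific misspelling was never observed.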


Confusion matrices

  • The likelihood P(typo|correction) is difficult to compute directly, but can be estimated by looking at LOCAL FACTORS only

  • Entry [m,n] in a CONFUSION MATRIX for SUBSTITUTION will tell us how often n is used instead of m

  • Kernighan et al. used four confusion matrices:

    • del[x,y] (number of times x is typed instead of correct xy)

    • ins[x,y] (number of times xy is typed instead of correct x)

    • sub[x,y] (number of times y is typed instead of correct x)

    • trans[x,y] (number of times yx is typed instead of correct xy)


Estimating the likelihood of a typo
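
The formula on this slide did not survive in the transcript. The following is a sketch reconstructed from the four confusion-matrix definitions given above, not a copy of the original slide. It assumes t is the typed word, c the intended word, and p the position of the single edit; count[...] is how often the letter or letter pair occurs in the training corpus.

```latex
P(t \mid c) =
\begin{cases}
  \dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{count}[c_{p-1} c_p]}     & \text{if deletion} \\[2ex]
  \dfrac{\mathrm{ins}[c_{p-1}, t_p]}{\mathrm{count}[c_{p-1}]}         & \text{if insertion} \\[2ex]
  \dfrac{\mathrm{sub}[c_p, t_p]}{\mathrm{count}[c_p]}                 & \text{if substitution} \\[2ex]
  \dfrac{\mathrm{trans}[c_p, c_{p+1}]}{\mathrm{count}[c_p c_{p+1}]}   & \text{if reversal}
\end{cases}
```

The argument order follows the definitions in this transcript, where the first index is always the correct letter; other presentations of the same model may order the indices of sub differently.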


Resulting likelihoods


Evaluation (3 judges, 329 triples)


More sophisticated methods

  • MINIMUM EDIT DISTANCE: allow for the possibility of more than one error per word

  • N-GRAM models: use context (to detect errors that result in real words)
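
As a sketch of the first idea, the function below computes a minimum edit distance by dynamic programming. It is the optimal-string-alignment variant, which counts the same four operations used earlier, and it assumes a unit cost for each (other cost schemes are common). The name `edit_distance` is chosen for this example.

```python
def edit_distance(source, target):
    """Optimal string alignment distance: insertions, deletions, substitutions
    and swaps of adjacent letters each cost 1."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i letters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j letters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "hte" -> "the".
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

A corrector built on this could propose every dictionary word within some distance threshold of the typo, instead of only the words one operation away.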


References

  • Jurafsky and Martin, chapter 5

  • Kernighan, M. D., Church, K. W., and Gale, W. A. (1990). A spelling correction method based on a noisy channel model. COLING-90, 205-211.

  • Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 377-439.

  • More recent work:

    • Brill, E. and Moore, R. (2000). An improved error model for noisy channel spelling correction. Proc. ACL 2000.

