Chapter 5 Probabilistic models of pronunciation and spelling

Xiaomeng Su, 6 November

Main points
• This chapter discusses the problem of detecting and correcting spelling errors.
• It first introduces the detection and correction problems and summarizes typical human spelling error patterns.
• It then introduces ways to solve the spelling problem: Bayes' rule and the noisy channel model.

Outline
• 5.1 Dealing with spelling errors.
• 5.2 Spelling error patterns.
• 5.3 Detecting non-word errors.
• 5.4 Probabilistic models.
• 5.5 Applying the Bayesian method to spelling.
• 5.6 Minimum edit distance.
• 5.11 Summary.

5.1 Dealing with spelling errors
• Application areas
  • Typed text (word processors).
  • Optical character recognition, OCR (optical scanners).
  • On-line handwriting recognition (Palm, Chinese).
• Classification of spelling correction (Kukich 1992)
  • Non-word error detection: detecting spelling errors that result in non-words (graffe for giraffe).
  • Isolated-word error correction: correcting spelling errors that result in non-words (correcting graffe to giraffe, but looking only at the word in isolation).
  • Context-dependent error detection and correction: using the context to help detect and correct real-word errors (dessert for desert, or there for three).

5.2 Spelling error patterns
• The number and nature of spelling errors in human-typed text differ from those caused by pattern-recognition devices such as OCR and handwriting recognizers.
• Number.
  • 1-3% in human-typed text.
  • Varies widely for OCR: 0.2-20%.
  • Palm relies on a special input script.
• Nature.

Nature of spelling errors
• Human typing errors
  • Insertion: the as ther
  • Deletion: the as th
  • Substitution: the as thw
  • Transposition: the as teh
• Another dimension of classification
  • Typographic errors: keyboard related (spell as spwll).
  • Cognitive errors: the writer does not know how to spell the word (separate as seperate).
• OCR errors
  • Substitution
  • Multisubstitution
  • Space deletion
  • Insertion
  • Failure

An example of OCR errors
• Correct: The quick brown fox jumps over the lazy dog.
• Recognized: ’lhe q~ick brown foxjurnps ovcr tb l azy dog.
• Errors: substitutions (e → c) and multisubstitutions (T → ’l, m → rn, he → b) are caused by visual similarity rather than keyboard distance; failures (u → ~) are cases where the OCR system does not select any letter with sufficient accuracy.

5.3 Detecting non-word errors
• Non-word errors in text, whether typed by humans or scanned, are commonly detected by looking each word up in a dictionary.
• Small or large dictionary?
  • Argument for small: a large dictionary contains rare words that resemble misspellings of other words (wont for won’t).
  • Argument for large: empirical studies found that large dictionaries are more helpful than harmful.
• Use a model of morphology to deal with inflection.
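The dictionary lookup described above can be sketched in a few lines of Python. This is only an illustration: the four-word `DICTIONARY` and the function name `find_non_words` are hypothetical, the tokenizer is a simple regular expression, and inflection is not handled at all.

```python
import re

# Toy dictionary (an assumption for this sketch); a real system would
# load a large word list and add a morphology model for inflection.
DICTIONARY = {"the", "giraffe", "is", "tall"}

def find_non_words(text, dictionary):
    """Return the tokens of `text` that do not appear in `dictionary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in dictionary]

print(find_non_words("The graffe is tall", DICTIONARY))  # ['graffe']
```

A larger dictionary lowers the number of false alarms on valid words, at the cost of accepting rare words such as wont that may really be misspellings.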

5.4 Probabilistic models
• The noisy channel model.

Equation for picking the best word

Using Bayes' rule to make the equation computable
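The equations for these two slides are not reproduced in this text. A plausible reconstruction, assuming the standard noisy channel formulation and the notation used later in the chapter (t for the observed typo, c for a candidate correction, and C, an assumed symbol, for the set of candidates), is:

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} \; p(c \mid t)
        = \operatorname*{argmax}_{c \in C} \; \frac{p(t \mid c)\, p(c)}{p(t)}
        = \operatorname*{argmax}_{c \in C} \; p(t \mid c)\, p(c)
```

The denominator p(t) can be dropped because it is the same for every candidate c, which leaves the likelihood p(t|c) times the prior p(c).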

5.5 Applying the Bayesian method to spelling
• Bayesian algorithm
  • Proposing candidate corrections.
  • Scoring the candidates.
• Proposing candidates
  • Simplifying assumption: each misspelling contains a single spelling error.
  • Example: the misspelling acress.
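Under the single-error assumption, candidate proposal can be sketched as follows: apply every possible single insertion, deletion, substitution, and transposition to the typo and keep the results that are real words. The function name `single_edit_candidates` and the small `VOCABULARY` are assumptions made for this illustration, not part of the chapter.

```python
import string

def single_edit_candidates(typo, vocabulary):
    """Return vocabulary words reachable from `typo` by one edit operation."""
    letters = string.ascii_lowercase
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    # Undo an insertion error by deleting one letter from the typo.
    deletes = {a + b[1:] for a, b in splits if b}
    # Undo a transposition error by swapping two adjacent letters.
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Undo a substitution error by replacing one letter.
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in letters}
    # Undo a deletion error by inserting one letter.
    inserts = {a + ch + b for a, b in splits for ch in letters}
    edits = deletes | transposes | substitutes | inserts
    return sorted((edits & vocabulary) - {typo})

# Toy vocabulary, assumed for this sketch only.
VOCABULARY = {"actress", "cress", "caress", "access", "across", "acres",
              "giraffe", "the"}

print(single_edit_candidates("acress", VOCABULARY))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Note the direction of each operation: an edit applied to the typo reverses the error the typist made, so a deletion here corresponds to an insertion error in the original typing.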

Example

Scoring the correction
• The prior p(c) can be estimated by counting how often the word c occurs in some corpus.
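A minimal sketch of the counting estimate is shown below. The add-0.5 smoothing is an assumption of this sketch (it keeps unseen candidates from receiving probability zero), and the eight-token corpus and the function name `estimate_priors` are made up for illustration.

```python
from collections import Counter

def estimate_priors(corpus_tokens, candidates, smoothing=0.5):
    """Estimate p(c) for each candidate from smoothed corpus counts."""
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    # Vocabulary size includes candidates never seen in the corpus.
    v = len(set(corpus_tokens) | set(candidates))
    return {c: (counts[c] + smoothing) / (n + smoothing * v)
            for c in candidates}

# Toy corpus, assumed for this sketch only.
corpus = "the actress met an actress across the road".split()
priors = estimate_priors(corpus, ["actress", "across", "cress"])
for word, p in sorted(priors.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.3f}")
```

With a realistic corpus the raw counts dominate and the smoothing term only matters for rare or unseen candidates.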

Calculating p(t|c)
• Still a research question.
• Can be estimated.
• Some simple ways exist, for example the confusion matrix.
• Confusion matrix
  • A square 26×26 table that records how many times one letter was incorrectly used instead of another.
  • For example, the cell [o,e] in a substitution confusion matrix gives the number of times that e was substituted for o.
  • Usually there are four confusion matrices: deletion, insertion, substitution, and transposition.
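As a rough illustration of the substitution matrix, the sketch below counts substitutions from a list of (intended, typed) word pairs and turns a cell into a likelihood by dividing by how often the intended letter occurs. The function names, the use of a `Counter` keyed by letter pairs instead of a literal 26×26 array, and the normalization by letter counts are assumptions of this sketch; the training pairs reuse misspellings quoted earlier in the chapter.

```python
from collections import Counter

def build_substitution_matrix(pairs):
    """Count substitutions; the key (x, y) means y was typed for intended x."""
    sub = Counter()
    for intended, typed in pairs:
        # Assumes both words have equal length and differ only by substitutions.
        for x, y in zip(intended, typed):
            if x != y:
                sub[(x, y)] += 1
    return sub

def substitution_likelihood(sub, letter_counts, intended, typed):
    """Estimate p(typed letter | intended letter) from the counts."""
    return sub[(intended, typed)] / letter_counts[intended]

# Toy training pairs (intended word, typed word), for illustration only.
pairs = [("across", "acress"), ("separate", "seperate"), ("spell", "spwll")]
sub = build_substitution_matrix(pairs)
letter_counts = Counter("".join(intended for intended, _ in pairs))

print(sub[("o", "e")])  # 1
print(substitution_likelihood(sub, letter_counts, "a", "e"))
```

The deletion, insertion, and transposition matrices would be built the same way from pairs that exhibit those error types.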

Result
• Context of the misspelling: ...was called a “stellar and versatile acress whose combination of sass and glamour has defined here...”
• Chapter 6 will show how to augment the prior probability by using surrounding words.

5.6 Minimum edit distance
• Previous sections relied on a simplifying assumption: a single spelling error per word.
• We need a more powerful algorithm to handle multiple errors: the minimum edit distance algorithm.
• String distance is a metric of how alike two strings are.
• The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.

Three methods for representing distances

Minimum edit distance algorithm
• It is an application of dynamic programming, which solves problems by combining solutions to subproblems.
• The edit-distance matrix.

Minimum edit distance algorithm
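The dynamic-programming computation over the edit-distance matrix can be sketched as follows. This is a minimal illustration, not necessarily the exact formulation on the slide: the function name and the cost parameters are assumptions, with every operation costing 1 by default and `sub_cost=2` giving the variant in which a substitution counts as a deletion plus an insertion.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the minimum cost of editing `source` into `target`."""
    n, m = len(source), len(target)
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitute
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

Each cell is filled from its three neighbours, so the whole matrix takes time proportional to the product of the two string lengths.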

5.11 Summary
• We can present many language problems as if a clean string of symbols had been corrupted by passing through a noisy channel, and it is our job to recover the original string.
• One way to do this is to consider all possible original strings and rank them by their probability.
• We use Bayes' rule to break down the probability into a prior and a likelihood.
• The prior is computed from word frequencies. The likelihood is computed by training a simple probabilistic model (a confusion matrix, a decision tree, or hand-written rules) on a database.
• Minimum edit distance is introduced to handle words with multiple spelling errors. The minimum edit distance algorithm can be used to compute the distance between two strings.
