Contemporary Spelling Correction: Decoding the Noisy Channel

Bob Carpenter

Alias I, Inc.

Kinds of Spelling Mistakes: Typos
  • Typos are wrong characters by mistake
  • Insertions
    • “appellate” as “appellare”, “prejudice” as “prejudsice”
  • Deletions
    • “plaintiff” as “paintiff”, “judgement” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
  • Substitutions
    • “habeas” as “haceas”
  • Transpositions
    • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”
    • “subpoena” as “subpeona”
    • “plaintiff” as “plaitniff”
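The four typo operations above can be enumerated mechanically. A minimal sketch of a one-edit candidate generator (my own illustration, not code from the talk):

```python
def one_edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution, or
    transposition away from `word` -- the four typo types above."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}

# "fruad" is one transposition from "fraud"; "paintiff" one deletion
# from "plaintiff"; "haceas" one substitution from "habeas".
```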
Kinds of Spelling Mistakes: Brainos
  • Brainos are wrong characters “on purpose”
  • The kinds of mistakes found in lists of “common” misspellings
  • Very common in general web queries
  • Derive from either pronunciation or spelling or deep semantic confusions
  • English is particularly bad due to irregularity
  • Probably (?) common in other languages importing words
Brainos: Soundalikes
  • Latinates
    • “subpoena” as “supena”, “judicata” as “judicada”, “voir” as “voire”
  • Consonant Clusters & Flaps
    • “privilege” as “priveledge”, “rescission” as “recision”, “collateral” as “colaterall”, “latter” as “ladder”, “estoppel” as “estopple”, “withholding” as “witholding”
  • Vowel Reductions
    • “collateral” as “collaterel”, “punitive” as “punative”
  • Vowel Clusters
    • “respondeat” as “respondiat”, “lien” as “lein”; “estoppel” as “estopple”, “habeas” as “habeeas”, “conveniens” as “convieniens”
  • Marker Vowels
    • “foreclosure” as “forclosure”
  • Multiples
    • “subpoena” as “supena” (two deletes)
Brainos: Confusions
  • Substitute more common or just plain different
    • Names: “Opperman” as “Oppenheimer”; “Eisenstein” as “Einstein”
  • Pronunciation Confusions
    • Transpositions; “preclusion” as “perclusion”, “meruit” as “meriut”
  • Irregular word forms
    • “juries” as “jurys” or “jureys”; “men” as “mans”
    • English is particularly bad for this, too
  • Tokenization issues
    • “AT&T” vs. “AT & T” vs. “A.T.&T.”, …
    • Correct variant (if unique) depends on search engine’s notion of “word”
  • Word Boundaries
    • “in camera” as “incamera”, “qui tam” as “quitam”, “injunction” as “in junction”, “foreclosure” as “for closure”, “dramshop” as “dram shop”
“Old School” Spelling Correction
  • Damerau, 1964, “A technique for computer detection and correction of spelling errors”. Comms. ACM.
    • One word (token) at a time
    • Only looked at unknown words not in dictionary
    • Suggest “closest” alternatives (first or multiple in order)
    • Closeness measured in number of edits (edit distance)
      • Deletions, Insertions, Substitutions, and sometimes Transpositions
      • Often results in ties
      • Good word game
      • With 50 characters and a 50-character query, get 50^50 ≈ 10^84 alternatives
      • Can search whole space in linear time using dynamic programming
    • This technique lives on in many apps
    • Simple, fast and only requires a word list
Edit Distance (Damerau/Levenshtein)
  • Quadratic time; linear space algorithm
  • E.g. D(“John”, “Jan”) = 2 [D(“John”, “Bob”) = 3]
    • Edits: match ‘J’, subst ‘a’ for ‘o’, delete ‘h’, match ‘n’

score(I,J) = Min( score(I-1,J-1) + match(I,J),
                  score(I,J-1) + delete(J),
                  score(I-1,J) + insert(I) )

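The recursion above translates directly into a dynamic-programming table. A minimal implementation (my own, with the optional Damerau transposition case included):

```python
def edit_distance(source, target):
    """Damerau-Levenshtein distance via dynamic programming.

    Quadratic time in the string lengths; the full table could be
    reduced to two rows for the linear space the slide mentions.
    """
    m, n = len(source), len(target)
    # score[i][j] = cost of editing source[:i] into target[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        score[i][0] = i                      # i deletions
    for j in range(n + 1):
        score[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            score[i][j] = min(
                score[i - 1][j - 1] + subst,  # match / substitute
                score[i - 1][j] + 1,          # delete
                score[i][j - 1] + 1,          # insert
            )
            # transposition (the Damerau extension)
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                score[i][j] = min(score[i][j], score[i - 2][j - 2] + 1)
    return score[m][n]

# edit_distance("John", "Jan") -> 2, edit_distance("John", "Bob") -> 3,
# matching the example above; edit_distance("fraud", "fruad") -> 1.
```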
“Middle Aged” Spelling Correction
  • Still look at single words not in a dictionary and list of common misspellings
  • Model Likely Edits
    • Whole words
      • “acceptable” as “acceptible”; “truant” as “truent”, etc.
    • Sound Sequences
      • “ie”   “ei”; “mm”   “m”
    • Typos
      • Closeness on keyboard (depends on your keyboard – mixtures)
      • “q” as “w”; “y” as “u” (substitutions)
      • “q” as “qw” or “wq” (insertions)
    • Position in Word
      • Edits more likely internally, next at end, least in front
      • Psychology of reading left-to-right & early resolution
      • “plantiff” (mid) > “plaintff” (end) > “laintiff” (front)
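The edit-likelihood ideas above can be sketched as a weighted substitution cost. The adjacency map and the position weights below are illustrative guesses of mine, not values from the talk:

```python
# Substitution cost that depends on keyboard adjacency and on position
# in the word: internal edits are cheapest, end edits next, front edits
# least likely. The adjacency map and weights are invented for
# illustration, not taken from any real keyboard model.

ADJACENT = {"q": "wa", "w": "qes", "y": "tu", "u": "yi", "o": "ip"}

def subst_cost(a, b, position, length):
    if a == b:
        return 0.0
    base = 0.5 if b in ADJACENT.get(a, "") else 1.0  # nearby keys cheaper
    if position == 0:
        return base * 2.0        # front edits least likely
    if position == length - 1:
        return base * 1.5        # end edits next
    return base                  # internal edits most likely

# "q" typed as "w" mid-word is a cheap, plausible typo; "q" as "z" is not.
```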
“Contemporary” Spelling Correction
  • Find most likely intended query given observed query
  • Integrated Probabilistic Model
    • Model of Query Likelihood (source): P(query)
    • Model of Edit Likelihood (channel): P(realization|query)
  • Shannon’s “Noisy Channel Model” (1940s):
    • Find most likely query (Q) given realization (R)

ARGMAX_Q P(Q | R) [Problem]

= ARGMAX_Q P(R | Q) * P(Q) / P(R) [Defn. Conditional]

= ARGMAX_Q P(R | Q) * P(Q) [P(R) constant]

Simple Example of Correction
  • Query Likelihood Model
    • P(“hte”) = 1/1,000,000
    • P(“the”) = 1/20
  • Edit Likelihood Model
    • P(“hte” | “the”) = P(transpose(“th”)) * P(match(“e”)) = 1/500 * 99/100 ~ 1/500
    • P(“hte” | “hte”) = P(match(“h”)) * P(match(“t”)) * P(match(“e”)) ~ 1/1

  • Therefore:
    • P(“hte” | “the”) * P(“the”) = 1/500 * 1/20 = 1/10,000
      >> P(“hte” | “hte”) * P(“hte”) = 1/1 * 1/1,000,000

General Approach Solves Several Problems
  • Orders alternatives based on likelihood
    • First best or ranked n-best alternatives
    • N-best is a tricky user-interface issue for web search
  • Measures likelihood that query is in error
    • Allows tuning of rejection thresholds for precision/recall
  • Measures likelihood that correction is correct
    • As posterior probability in the Bayesian model
  • Principled balance of query vs. edit likelihoods
    • Empirical issue determined by measurable user behavior
    • E.g. Word processors and web search very different
  • Suggests Valid Word Substitutions in Phrases
    • “pro bono” as “per bono”
    • “Peter principle” as “Peter principal”
    • Google e.g. “fodr” → “ford” but “fodr baggins” → “frodo baggins”
Alias-i’s Approach
  • Models fully retrainable per application
    • Out-of-the-box solutions not feasible
    • Tailored query and edit models based on user application behavior
    • Scalable to gigabytes w/o pruning and to arbitrary amounts of data with selective pruning
  • Character-level model for queries: P(query)
  • Generalizes to subphrases of unknown tokens
    • E.g. “likelihoods” flagged as error by PowerPoint
    • E.g. “likelihood” not flagged as error by PowerPoint
  • Or Token-sensitive output (only output known words in corpus)
  • Allows efficient search based on prefixes
  • Flexible framework for edit likelihoods: P(realization|query)
    • Models likely substitutions in domain
Source Language Models
  • Character n-grams:

P(c0, …, cn-1)

= PROD i<n P(ci | c0, …, ci-1) [chain rule]

~ PROD i<n P’(ci | ci-n+1, …, ci-1) [n-gram]

  • Generalized Witten-Bell smoothing (~ state of the art):

P’(d | cC) = lambda(cC) * PML(d | cC) + (1 – lambda(cC)) * P’(d | C)

    • where d, c are characters and C is a sequence of characters,
    • PML is the maximum likelihood estimator,
    • the recursion grounds out in the uniform estimate, and
    • lambda(X) = count(X) / (count(X) + K * outcomes(X)) [in [0,1]]
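The smoothing recursion above can be sketched as follows. The value of K and the alphabet size are illustrative choices of mine, not LingPipe's actual defaults:

```python
from collections import defaultdict

# Witten-Bell-smoothed character n-gram language model: interpolate the
# maximum-likelihood estimate for each context with the estimate for the
# shorter context, grounding out in a uniform distribution.

class CharNGramLM:
    def __init__(self, order=3, K=5.0, alphabet_size=256):
        self.order = order
        self.K = K
        self.alphabet_size = alphabet_size
        self.context_count = defaultdict(int)  # count(C)
        self.ngram_count = defaultdict(int)    # count(C + d)
        self.outcomes = defaultdict(set)       # distinct d seen after C

    def train(self, text):
        for i in range(len(text)):
            for n in range(self.order):        # contexts of length 0..order-1
                if i - n < 0:
                    break
                context, d = text[i - n:i], text[i]
                self.context_count[context] += 1
                self.ngram_count[context + d] += 1
                self.outcomes[context].add(d)

    def prob(self, d, context):
        context = context[-(self.order - 1):] if self.order > 1 else ""
        return self._prob(d, context)

    def _prob(self, d, context):
        if context == "":
            backoff = 1.0 / self.alphabet_size        # uniform ground case
        else:
            backoff = self._prob(d, context[1:])      # shorter context
        c = self.context_count[context]
        if c == 0:
            return backoff
        lam = c / (c + self.K * len(self.outcomes[context]))
        p_ml = self.ngram_count[context + d] / c
        return lam * p_ml + (1.0 - lam) * backoff
```

Training on a corpus of queries gives the source model P(query) as a product of these per-character estimates.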
Training Data for Query Model: P(Query)
  • Trained independently of edit model
  • Captures domain-specific features more than edits
  • Appropriate Text Corpus matches problem
    • Overall stats: “trt” → “tart” or “tort” (depends on domain)
    • Phrasal Context: “linzer trt” vs. “trt reform”
  • Implicitly models number of possible “hits” for query
  • Can train per field for complex queries
    • E.g. author, institution, MeSH term, abstract in MEDLINE
  • Can retrain query models as new data arrives
  • Training data must match use data
    • e.g. all caps, mixed case, etc.
    • May normalize queries plus training data
Training Data for Edit Model: P(realization|query)

1. No training data

  • A priori typos:
    • Characters near each other on keyboard are likely typos
    • More careful typing near beginning and end of word
  • A priori brainos:
    • Vowel sequences confusable with vowel sequences
    • Consonants that sound alike easily confused (‘t’ vs. ‘d’, etc.)
    • Consonants likely doubled or undoubled in error
    • More common in unstressed syllables (approximately later)

2. Bootstrap raw query logs

  • Can do this step with simpler model, such as ispell
  • Better with the first approximation model above (like EM)
  • Estimates rate of various errors and likely substitutions
Training Data for Edit Model: P(realization|query) (cont.)
  • 3. Sample of Correct/Error Classified Queries
    • Better estimate of error edit rates (not specific errors)
      • Estimate likely insert/delete/substitute/transpose errors
      • Requires unbiased sample of errors and correct queries
      • Search engines report 10-15% of queries have errors!!!
      • Need ~100 examples of each type of error on average
    • Requires unbiased sample of errors (correct not necessary)
      • Need about 100 examples average per character, or about 5K examples total assuming 50 editable characters
      • We can find these using “active learning” or bootstrapping
      • Requires best guess of correction using simpler method
Training Data for Edit Model: P(realization|query) (cont.)
  • 4. Fully Supervised Learning
    • Same samples as in (3) above
    • Editor(s) provides correction for errors
      • Only a few days work with a halfway decent interface
      • Should use two editors on same sample to cross-validate
      • Multiple editors also provide a bound on human performance
    • Almost always significantly better than bootstrap methods
Evaluating Accuracy: Correcting the Right Queries
  • Need the labeled training data!
  • Are we correcting the right queries?
  • Confusion Matrix
    • True Positive: Error that is corrected
    • True Negative: Good query that is not corrected
    • False Positive: Good query that is corrected
    • False Negative: Error that is not corrected
  • Performance Metrics
    • Precision = TP / (TP + FP): % of corrections that were errors
    • Specificity = TN / (TN + FP): % of good queries that are not corrected
    • Recall = TP / (TP + FN): % of errors that are corrected
    • Accuracy = (TP + TN) / (TP + TN + FP + FN): % of queries for which we do the right thing
  • Can balance false alarms and missed corrections
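The metrics above follow directly from the confusion-matrix counts. A small sketch, with counts invented purely for illustration:

```python
# Precision, specificity, recall, and accuracy from confusion-matrix
# counts. Note TN / (TN + FP) is specificity (the rate at which good
# queries are left alone), distinct from recall = TP / (TP + FN).

def query_correction_metrics(tp, tn, fp, fn):
    return {
        "precision":   tp / (tp + fp),  # % of corrections that were errors
        "specificity": tn / (tn + fp),  # % of good queries not corrected
        "recall":      tp / (tp + fn),  # % of errors that are corrected
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative counts for a batch of 1,000 queries:
m = query_correction_metrics(tp=120, tn=850, fp=10, fn=20)
```

Raising the rejection threshold moves counts from FP to TN (and from TP to FN), trading recall for precision.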
Evaluating Accuracy: Returning the Proper Correction
  • Correction Accuracy
    • % of corrections that were properly corrected
    • Combine with precision on the to-correct decision
  • Overall Accuracy
    • % of queries that are TN or TP with right correction
Evaluating Accuracy: MSN Case Study

Cucerzan and Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. Proc. EMNLP.

  • 10-15% estimate of queries with errors
  • Training by Bootstrapping Query Logs (method 2)
  • Scoring one human against another: 90%
  • System accuracy against averaged humans: 82%
  • System accuracy on valid queries: 85%
  • System accuracy on queries with errors: 67%
  • System accuracy with baseline edit model
    • 80% total; 83% valid queries; 66% queries with errors
  • 8% lower estimates for auto-eval over sequential logs
  • 5% higher estimate for “reasonable” vs. exact correction
  • Good News
    • Web search is as hard as it gets – multi-topic and multi-lingual
Evaluating Efficiency
  • May trade accuracy for efficiency along the receiver operating characteristic (ROC) curve
    • Smaller model size by token or characters
    • Smaller search space
    • Higher rejection threshold increases efficiency, reduces recall, and increases precision
  • Standalone Server Deployment
    • Allows larger shared models in memory
    • Simple timeout robustness from web server
    • Models require CRSW (concurrent-read, single-write) synchronization
      • Any number of concurrent queries share same model w/o blocking
      • No queries can run while model is changing
  • Correction may be done in parallel to search (not pure latency)
    • Do not need to evaluate number of queries returned,
      • though this may be combined post-hoc with results for tighter rejection
  • Should easily scale to requirements
    • 1 million queries in 8 hours on a single multiprocessor server
      • That’s 25-50 queries/second
      • LMs run at 2 million characters/second on desktop
But wait, that’s not all for LingPipe 2.0
  • Character and Token-level Language Models
  • Ranked Terminology Discovery
    • collocations within corpus (chi square independence test)
    • “what’s new” across corpora (binomial t-test)
  • Binary & Multiway Classification
    • Bayesian framework; language model implementations
    • Extensive probabilistic confusion matrix scoring
    • E.g. Topic (e.g. which newsgroup, which section of paper)
    • E.g. Sentiment (e.g. positive or negative product review)
    • E.g. Language (critical for multi-lingual applications)
    • E.g. De-duplication of message streams
    • E.g. Spam detection
  • Hierarchical Clustering
    • General framework; Language model implementations
    • E.g. Self-organizing web results
  • Chunking (high throughput Bayesian model)
    • E.g. Named entities, noun phrases and verb phrases
  • Implementations of standard evaluations and corpora
Design Standards
  • Extensive use of standard patterns
    • E.g. corpus visitors, abstract adapters, factories for runtime pluggable implementations
    • Mostly immutable & final (efficiency, state stability & testability)
    • Modules all support CRSW synchronization
  • Highly Modular Interfaces
    • Allows implementation plug and play
    • Most interfaces have abstract adapters
    • E.g. SpellChecker interface, AbstractSpellChecker adapter with abstract edit model, and ConstantSpellChecker and ProbabilisticSpellChecker implementations
  • Simple or Complex Tuning Parameterizations
    • Reasonable Defaults
    • M.S./Ph.D.-level tuning options (popular for theses)
  • Follows Sun’s coding standards
Engineering & Support Standards
  • Active and Responsive User Group Forum
  • Tutorial examples of all modules
    • Most include industry-standard evaluations
  • Thorough Unit Testing (JUnit)
    • More good examples of API usage
    • Windows XP & Linux for Java 1.4.2 and 1.5.0
  • Profile-based tuning (JProfiler)
    • Speed, Memory and Disk access
  • Full javadoc of public/protected API
    • Classes are shy about their privates as a rule
    • Types are as specific as possible (many adapters)
  • Integration at command-line, XML or API levels
Other Applications
  • Case Restoration
    • Source: Train on mixed case data
    • Channel: Case switching costs nothing; others infinite
    • E.g. “LOUISE MCNALLY TEACHES AT POMPEU FABRA” becomes “Louise McNally teaches at Pompeu Fabra”
    • Useful for speech output or some old teletype feeds
    • Lita et al. 2003. tRuEcasIng. Proc. ACL ’03.
  • Punctuation Restoration
    • Channel: Punctuation insertion costs nothing; others infinite
    • Also useful for speech output
  • Chinese Tokenization (Bill Teahan)
    • Source: Train on space-separated tokens
    • Channel: Spaces insert free; others infinite
    • Teahan et al. 2000. A compression-based algorithm for Chinese word segmentation. Computational Linguistics.
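The case-restoration channel above can be made concrete: only case variants of the observed token are reachable (every other edit has infinite cost), so the source model decides. The frequency table below is invented for illustration, not a trained model:

```python
# Toy case restoration: the channel allows only case changes, so the
# choice falls entirely to the source model -- here a unigram frequency
# table standing in for a real mixed-case language model.

source_freq = {"louise": 1, "Louise": 50, "LOUISE": 2,
               "teaches": 40, "Teaches": 3, "TEACHES": 1}

def restore_case(token):
    # channel: only case variants are reachable; all others cost infinity
    variants = [token.lower(), token.capitalize(), token.upper()]
    return max(variants, key=lambda v: source_freq.get(v, 0))

print(" ".join(restore_case(t) for t in "LOUISE TEACHES".split()))
# -> Louise teaches
```

Punctuation restoration and Chinese tokenization follow the same pattern with free insertion of punctuation or spaces, respectively.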
Decoding L33T-speak
  • “L33T” is l33t-speak for “elite”
  • Used by gamers (pwn4g3) and spammers (med|catiOn)
  • Substitute numbers (e.g. ‘E’ to ‘3’, ‘A’ to ‘4’, ‘O’ to ‘0’, ‘I’ to ‘1’)
  • Substitute punctuation (e.g. ‘/\’ for ‘A’, ‘|’ for ‘L’, ‘\/\/’ for ‘W’)
  • Some standard typos (e.g. ‘p’ for ‘o’)
  • De-duplicate or duplicate characters freely
  • Delete characters relatively freely
  • Insert/delete space or punctuation freely
  • Get creative
  • Examples from my Spam from this week:
  • VàLIUM CíAL1SS ViÁGRRA; MACR0MEDIA, M1CR0S0FT, SYMANNTEC $20 EACH; univers.ty de-gree online; HOt penny pick fue|ed by high demand; Fwd; cials-tabs, 24 hour sale online; HOw 1s yOur health; Your C A R D D E B T can be wipe clean; Savvy players wOuld be wise tO l0ad up early; Im fed up of my Pa|n med|catiOn pr0b|em; Y0ur wIfe needs tO cOpe with the PaIn; End your gIrlfr1end's Med!ca| prOcedures n0w; C,E*L.E*B,R.E'X    2oo  m'gg
  • Piece of cake to correct (pwn4g3 = “ownage”, a popular taunt if you win)
  • More info:
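The substitution rules above can be sketched as a toy deterministic decoder. The mapping table is my own small sample, and a real decoder would score alternatives with the full noisy-channel model rather than apply fixed replacements:

```python
# Toy l33t-speak decoder: reverse the number and punctuation
# substitutions listed above, longest patterns first so multi-character
# glyphs like '\/\/' win over their parts.

LEET = {"3": "e", "4": "a", "0": "o", "1": "i", "|": "l",
        "/\\": "a", "\\/\\/": "w"}

def deleet(text):
    for src in sorted(LEET, key=len, reverse=True):
        text = text.replace(src, LEET[src])
    return text.lower()

print(deleet("pwn4g3"))     # -> pwnage
print(deleet("M1CR0S0FT"))  # -> microsoft
```

Note the deterministic table yields “pwnage”, not “ownage”; recovering the ‘p’-for-‘o’ typo as well requires the probabilistic channel.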