Contemporary Spelling Correction Decoding the Noisy Channel

Contemporary Spelling Correction: Decoding the Noisy Channel • Bob Carpenter • Alias-i, Inc. • carp@alias-i.com

Kinds of Spelling Mistakes: Typos • Typos are wrong characters by mistake • Insertions • “appellate” as “appellare”, “prejudice” as “prejudsice” • Deletions • “plaintiff” as “paintiff”, “judgement” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment” • Substitutions • “habeas” as “haceas” • Transpositions • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy” • “subpoena” as “subpeona” • “plaintiff” as “plaitniff”

Kinds of Spelling Mistakes: Brainos • Brainos are wrong characters “on purpose” • The kinds of mistakes found in lists of “common” misspellings • Very common in general web queries • Derive from either pronunciation or spelling or deep semantic confusions • English is particularly bad due to irregularity • Probably (?) common in other languages importing words

Brainos: Soundalikes • Latinates • “subpoena” as “supena”, “judicata” as “judicada”, “voir” as “voire” • Consonant Clusters & Flaps • “privilege” as “priveledge”, “rescission” as “recision”, “collateral” as “colaterall”, “latter” as “ladder”, “estoppel” as “estopple”, “withholding” as “witholding” • Vowel Reductions • “collateral” as “collaterel”, “punitive” as “punative” • Vowel Clusters • “respondeat” as “respondiat”, “lien” as “lein”; “estoppel” as “estopple”, “habeas” as “habeeas”, “conveniens” as “convieniens” • Marker Vowels • “foreclosure” as “forclosure” • Multiples • “subpoena” as “supena” (two deletes)

Brainos: Confusions • Substitute more common or just plain different • Names: “Opperman” as “Oppenheimer”; “Eisenstein” as “Einstein” • Pronunciation Confusions • Transpositions; “preclusion” as “perclusion”, “meruit” as “meriut” • Irregular word forms • “juries” as “jurys” or “jureys”; “men” as “mans” • English is particularly bad for this, too • Tokenization issues • “AT&T” vs. “AT & T” vs. “A.T.&T.”, … • Correct variant (if unique) depends on search engine’s notion of “word” • Word Boundaries • “in camera” as “incamera”, “qui tam” as “quitam”, “injunction” as “in junction”, “foreclosure” as “for closure”, “dramshop” as “dram shop”

“Old School” Spelling Correction • Damerau, 1964, “A technique for computer detection and correction of spelling errors”. Comms. ACM. • One word (token) at a time • Only looked at unknown words not in dictionary • Suggest “closest” alternatives (first or multiple in order) • Closeness measured in number of edits (edit distance) • Deletions, Insertions, Substitutions, and sometimes Transpositions • Often results in ties • Good word game • With 50 characters and a 50-character query, get 50^50 ≈ 9 × 10^84 alternatives • Can search whole space in linear time using dynamic programming • This technique lives on in many apps • Simple, fast and only requires a word list

Edit Distance (Damerau/Levenshtein) • Quadratic time; linear space algorithm • E.g. D(“John”, “Jan”) = 2 [D(“John”, “Bob”) = 3] • Edits: match ‘J’, subst ‘a’ for ‘o’, delete ‘h’, match ‘n’ • score(I,J) = min( score(I-1,J-1) + match(I,J), score(I-1,J) + delete(J), score(I,J-1) + insert(I) )
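The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the deck's implementation: it assumes unit costs for insert, delete and substitute, leaves out Damerau's transposition edit, and the function name `edit_distance` is made up for the example. Only two rows of the table are kept at a time, which gives the quadratic-time, linear-space behavior the slide mentions.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit costs, keeping two rows of the table."""
    # previous[j] = cost of turning the source prefix seen so far into target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            match_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j - 1] + match_cost,  # match or substitute
                previous[j] + 1,               # delete s_char
                current[j - 1] + 1,            # insert t_char
            ))
        previous = current
    return previous[-1]
```

Calling `edit_distance("John", "Jan")` and `edit_distance("John", "Bob")` reproduces the two distances quoted on the slide. Because transpositions are not modeled here, a swap such as “fraud” as “fruad” costs two substitutions rather than one edit.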

“Middle Aged” Spelling Correction • Still look at single words not in a dictionary and list of common misspellings • Model Likely Edits • Whole words • “acceptable” as “acceptible”; “truant” as “truent”, etc. • Sound Sequences • “ie” → “ei”; “mm” → “m” • Typos • Closeness on keyboard (depends on your keyboard – mixtures) • “q” as “w”; “y” as “u” (substitutions) • “q” as “qw” or “wq” (insertions) • Position in Word • Edits more likely internally, next at end, least in front • Psychology of reading left-to-right & early resolution • “plantiff” (mid) > “plaintff” (end) > “laintiff” (front)

“Contemporary” Spelling Correction • Find most likely intended query given observed query • Integrated Probabilistic Model • Model of Query Likelihood (source): P(query) • Model of Edit Likelihood (channel): P(realization|query) • Shannon’s “Noisy Channel Model” (1940s): • Find most likely query (Q) given realization (R) • ARGMAX_Q P(Q | R) [Problem] = ARGMAX_Q P(R | Q) * P(Q) / P(R) [Defn. Conditional] = ARGMAX_Q P(R | Q) * P(Q) [R constant]

Simple Example of Correction • Query Likelihood Model • P(“hte”) = 1/1,000,000 • P(“the”) = 1/20 • Edit Likelihood Model • P(“hte” | “the”) = P(transpose(“th”)) * P(match(“e”)) = 1/500 * 99/100 = 99/50000 ~ 1/500 • P( “hte” | “hte”) = P(match(“h”)) * P(match(“t”)) * P(match(“e”)) ~ 1/1 • Therefore: • P(“hte” | “the” ) * P(“the”) = 1/500 * 1/20 = 1/10,000 >> P(“hte” | “hte” ) * P(“hte”) = 1/1 * 1/1,000,000
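The arithmetic on this slide can be checked with exact fractions. The sketch below uses only the probabilities quoted on the slide; the dictionary and function names are invented for illustration, and P(“hte” | “hte”) is rounded to 1 as the slide does.

```python
from fractions import Fraction

# Source model P(query): the two figures quoted on the slide.
query_prob = {"the": Fraction(1, 20), "hte": Fraction(1, 1_000_000)}

# Channel model P(realization | query) for the observed string "hte".
channel_prob = {
    "the": Fraction(1, 500) * Fraction(99, 100),  # transpose "th", then match "e"
    "hte": Fraction(1, 1),                        # three matches, rounded to 1
}

def decode(candidates):
    """Return (best candidate, all scores) under P(R | Q) * P(Q)."""
    scores = {q: channel_prob[q] * query_prob[q] for q in candidates}
    return max(scores, key=scores.get), scores

best, scores = decode(["the", "hte"])
```

Without the slide's rounding of 99/50000 to 1/500, the score for “the” is exactly 99/1,000,000, which is 99 times the score for leaving “hte” unchanged, so the correction wins by about two orders of magnitude.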

General Approach Solves Several Problems • Orders alternatives based on likelihood • First best or ranked n-best alternatives • N-best is a tricky user-interface issue for web search • Measures likelihood that query is in error • Allows tuning of rejection thresholds for precision/recall • Measures likelihood that correction is correct • As posterior probability in the Bayesian model • Principled balance of query vs. edit likelihoods • Empirical issue determined by measurable user behavior • E.g. Word processors and web search very different • Suggests Valid Word Substitutions in Phrases • “pro bono” as “per bono” • “Peter principle” as “Peter principal” • Google e.g. “fodr” → “ford” but “fodr baggins” → “frodo baggins”

Alias-i’s Approach • Models fully retrainable per application • Out-of-the-box solutions not feasible • Tailored query and edit models based on user application behavior • Scalable to gigabytes w/o pruning and to arbitrary amounts of data with selective pruning • Character-level model for queries: P(query) • Generalizes to subphrases of unknown tokens • E.g. “likelihoods” flagged as error by PowerPoint • E.g. “likelihood” not flagged as error by PowerPoint • Or Token-sensitive output (only output known words in corpus) • Allows efficient search based on prefixes • Flexible framework for edit likelihoods: P(realization|query) • Models likely substitutions in domain

Source Language Models • Character n-grams: P(c_0,…,c_{n-1}) = PROD_{i<n} P(c_i | c_0,…,c_{i-1}) [chain rule] ~ PROD_{i<n} P’(c_i | c_{i-n+1},…,c_{i-1}) [n-gram] • Generalized Witten-Bell smoothing (~ state of the art): P’(d | c C) = lambda(c C) * P_ML(d | c C) + (1 – lambda(c C)) * P’(d | C) • where d, c are characters, and C a sequence of characters, • P_ML is the maximum likelihood estimator, • the recursion grounds on the uniform estimate, and • lambda(X) = count(X) / (count(X) + K * outcomes(X)) [in [0,1]]
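The interpolation formula above can be sketched as a small character model. This is an illustrative reading of the slide, not LingPipe's API: the class name `CharNGramModel`, the explicit `alphabet` argument and the default `k=1.0` are assumptions made for the example. Here count(X) is taken to be the number of times context X was followed by any character, and outcomes(X) the number of distinct characters seen after X.

```python
from collections import Counter, defaultdict

class CharNGramModel:
    """Character n-gram model with Witten-Bell style interpolation (sketch)."""

    def __init__(self, order, alphabet, k=1.0):
        self.order = order        # n-gram length; contexts hold order - 1 chars
        self.alphabet = alphabet  # every training character must be in here
        self.k = k                # the constant K in lambda(X)
        self.followers = defaultdict(Counter)  # context -> next-char counts

    def train(self, text):
        for i, ch in enumerate(text):
            # Record ch after every context length from 0 up to order - 1.
            for length in range(min(i, self.order - 1) + 1):
                self.followers[text[i - length:i]][ch] += 1

    def prob(self, ch, context):
        keep = self.order - 1
        context = context[-keep:] if keep > 0 else ""
        return self._smoothed(ch, context)

    def _smoothed(self, ch, context):
        # Back off by dropping the earliest context character; below the
        # empty context the recursion grounds on the uniform estimate.
        if context:
            backoff = self._smoothed(ch, context[1:])
        else:
            backoff = 1.0 / len(self.alphabet)
        seen = self.followers.get(context)
        if not seen:
            return backoff  # unseen context: lambda is 0
        count = sum(seen.values())
        lam = count / (count + self.k * len(seen))
        return lam * seen[ch] / count + (1.0 - lam) * backoff
```

Because the base case is uniform over the alphabet, every character gets non-zero probability even after an unseen context, and the estimates for a given context sum to one across the alphabet.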

Training Data for Query Model: P(Query) • Trained independently of edit model • Captures domain-specific features more than edits • Appropriate Text Corpus matches problem • Overall stats: “trt” → “tart” or “tort” (depends on domain) • Phrasal Context: “linzer trt” vs. “trt reform” • Implicitly models number of possible “hits” for query • Can train per field for complex queries • E.g. author, institution, MeSH term, abstract in MEDLINE • Can retrain query models as new data arrives • Training data must match use data • e.g. all caps, mixed case, etc. • May normalize queries plus training data

Training Data for Edit Model: P(realization|query) 1. No training data • A priori typos: • Characters near each other on keyboard are likely typos • More careful typing near beginning and end of word • A priori brainos: • Vowel sequences confusable with vowel sequences • Consonants that sound alike easily confused (‘t’ vs. ‘d’, etc.) • Consonants likely doubled or undoubled in error • More common in unstressed syllables (approximately later) 2. Bootstrap raw query logs • Can do this step with simpler model, such as ispell • Better with the first approximation model above (like EM) • Estimates rate of various errors and likely substitutions

Training Data for Edit Model: P(realization|query) (cont.) 3. Sample of Correct/Error Classified Queries • Better estimate of error edit rates (not specific errors) • Estimate likely insert/delete/substitute/transpose errors • Requires unbiased sample of errors and correct queries • Search engines report 10-15% of queries have errors!!! • Need ~100 examples of each error type on average • Requires unbiased sample of errors (correct not necessary) • Need about 100 examples average per character, or about 5K examples total assuming 50 editable characters • We can find these using “active learning” or bootstrapping • Requires best guess of correction using simpler method

Training Data for Edit Model: P(realization|query) (cont.) 4. Fully Supervised Learning • Same samples as in (3) above • Editor(s) provide corrections for errors • Only a few days’ work with a halfway decent interface • Should use two editors on same sample to cross-validate • Multiple editors also provide a bound on human performance • Almost always significantly better than bootstrap methods

Evaluating Accuracy: Correcting the Right Queries • Need the labeled training data! • Are we correcting the right queries? • Confusion Matrix • True Positive: Error that is corrected • True Negative: Good query that is not corrected • False Positive: Good query that is corrected • False Negative: Error that is not corrected • Performance Metrics • Precision = TP / (TP + FP): % of corrected queries that really were errors • Specificity = TN / (TN + FP): % of good queries that are left uncorrected • Recall (sensitivity) = TP / (TP + FN): % of errors that are corrected • Accuracy = (TP + TN) / (TP + TN + FP + FN): % of queries for which we do the right thing • Can balance false alarms and missed corrections
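The four ratios above are easy to compute once the confusion-matrix counts are in hand. The sketch below is illustrative only: the class name and the counts in the usage note are made up, not results from the talk. The ratio TN / (TN + FP) is exposed under its usual name, specificity.

```python
from dataclasses import dataclass

@dataclass
class ConfusionCounts:
    tp: int  # errors that were corrected
    tn: int  # good queries left alone
    fp: int  # good queries that were "corrected"
    fn: int  # errors that were missed

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)
```

For a hypothetical sample of 1,000 queries, `ConfusionCounts(tp=60, tn=850, fp=30, fn=60)` gives recall 0.5 and accuracy 0.91. Raising the rejection threshold would typically move counts from FP to TN and from TP to FN, trading recall for precision.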

Evaluating Accuracy: Returning the Proper Correction • Correction Accuracy • % of corrections that were properly corrected • Combine with precision on the to-correct decision • Overall Accuracy • % of queries that are TN or TP with right correction

Evaluating Accuracy: MSN Case Study • Cucerzan and Brill. 2004. Spelling correction as an iterative process that exploits the collective knowledge of web users. Proc. ACL. • 10-15% estimate of queries with errors • Training by Bootstrapping Query Logs (method 2) • Scoring one human against another: 90% • System accuracy against averaged humans: 82% • System accuracy on valid queries: 85% • System accuracy on queries with errors: 67% • System accuracy with baseline edit model • 80% total; 83% valid queries; 66% queries with errors • 8% lower estimates for auto-eval over sequential logs • 5% higher estimate for “reasonable” vs. exact correction • Good News • Web search is as hard as it gets – multi-topic and multi-lingual

Evaluating Efficiency • May trade accuracy for efficiency along the receiver operating characteristic (ROC) curve • Smaller model size by token or characters • Smaller search space • Higher rejection threshold increases efficiency, reduces recall, and increases precision • Standalone Server Deployment • Allows larger shared models in memory • Simple timeout robustness from web server • Models require CRSW (concurrent-read, single-write) synchronization • Any number of concurrent queries share same model w/o blocking • No queries can run while model is changing • Correction may be done in parallel to search (not pure latency) • Do not need to evaluate number of queries returned, though this may be combined post-hoc with results for tighter rejection • Should easily scale to requirements • 1 million queries in 8 hours on a single multiprocessor server • That’s 25-50 queries/second • LMs run at 2 million characters/second on desktop
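The synchronization requirement on this slide (any number of queries read the model at once, but none runs while the model is being changed) is the classic readers-writer pattern. A minimal Python sketch follows, assuming nothing about how LingPipe implements it; `CrswLock` and its method names are hypothetical.

```python
import threading
from contextlib import contextmanager

class CrswLock:
    """Concurrent-read, single-write lock (sketch; writers can starve)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of queries currently reading
        self._writing = False   # True while the model is being changed

    @contextmanager
    def reading(self):
        with self._cond:
            while self._writing:          # wait out any model update
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:    # last reader lets a writer in
                    self._cond.notify_all()

    @contextmanager
    def writing(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()
```

A query handler would wrap its decoding in `with lock.reading():`, and a retraining job would wrap the model swap in `with lock.writing():`. This simple version gives readers priority, so a steady stream of queries can delay an update; a production server would likely add writer preference or swap in a freshly built model instead.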

But wait, that’s not all for LingPipe 2.0 • Character and Token-level Language Models • Ranked Terminology Discovery • collocations within corpus (chi square independence test) • “what’s new” across corpora (binomial t-test) • Binary & Multiway Classification • Bayesian framework; language model implementations • Extensive probabilistic confusion matrix scoring • E.g. Topic (e.g. which newsgroup, which section of paper) • E.g. Sentiment (e.g. positive or negative product review) • E.g. Language (critical for multi-lingual applications) • E.g. De-duplication of message streams • E.g. Spam detection • Hierarchical Clustering • General framework; language model implementations • E.g. Self-organizing web results • Chunking (high throughput Bayesian model) • E.g. Named entities, noun phrases and verb phrases • Implementations of standard evaluations and corpora

Design Standards • Extensive use of standard patterns • E.g. corpus visitors, abstract adapters, factories for runtime pluggable implementations • Mostly immutable & final (efficiency, state stability & testability) • Modules all support CRSW synchronization • Highly Modular Interfaces • Allows implementation plug and play • Most interfaces have abstract adapters • E.g. SpellChecker interface, AbstractSpellChecker adapter with abstract edit model, and ConstantSpellChecker and ProbabilisticSpellChecker implementations • Simple or Complex Tuning Parameterizations • Reasonable Defaults • M.S./Ph.D.-level tuning options (popular for theses) • Follows Sun’s coding standards

Engineering & Support Standards • Active and Responsive User Group Forum • Tutorial examples of all modules • Most include industry-standard evaluations • Thorough Unit Testing (JUnit) • More good examples of API usage • Windows XP & Linux for Java 1.4.2 and 1.5.0 • Profile-based tuning (JProfiler) • Speed, Memory and Disk access • Full javadoc of public/protected API • Classes are shy about their privates as a rule • Types are as specific as possible (many adapters) • Integration at command-line, XML or API levels

Other Applications • Case Restoration • Source: Train on mixed case data • Channel: Case switching costs nothing; others infinite • E.g. “LOUISE MCNALLY TEACHES AT POMPEU FABRA” becomes “Louise McNally teaches at Pompeu Fabra” • Useful for speech output or some old teletype feeds • Vlad-Lita et al. 2003. tRuEasIng. ACL ’03. • Punctuation Restoration • Channel: Punctuation insertion costs nothing; others infinite • Also useful for speech output • Chinese Tokenization (Bill Teahan) • Source: Train on space-separated tokens • Channel: Spaces insert free; others infinite • Teahan et al. 2000. A compression-based algorithm for Chinese word segmentation. CL Journal.
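Case restoration shows how little the channel has to do when only one kind of edit is allowed. The sketch below is a deliberately simplified stand-in for the setup on this slide: instead of a character language model it uses per-token counts of cased forms as the source model, and the channel permits only case changes. The function names and the tiny training string are invented for the example.

```python
from collections import Counter, defaultdict

def train_case_model(mixed_case_text):
    """Source model: count each cased form seen for a lowercased token."""
    forms = defaultdict(Counter)
    for token in mixed_case_text.split():
        forms[token.lower()][token] += 1
    return forms

def restore_case(text, forms):
    """Pick the most frequent cased form; only case may change (channel)."""
    restored = []
    for token in text.split():
        seen = forms.get(token.lower())
        # Unseen tokens fall back to lowercase, which is still a pure case edit.
        restored.append(seen.most_common(1)[0][0] if seen else token.lower())
    return " ".join(restored)

forms = train_case_model("Louise McNally teaches semantics and McNally teaches syntax")
```

With that training text, `restore_case("LOUISE MCNALLY TEACHES", forms)` recovers the mixed-case forms, including the internal capital in “McNally”. A token-level model cannot use context or handle unseen names well, which is where a character-level source model would be expected to help.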

Decoding L33T-speak • “L33T” is l33t-speak for “elite” • Used by gamers (pwn4g3) and spammers (med|catiOn) • Substitute numbers (e.g. ‘E’ to ‘3’, ‘A’ to ‘4’, ‘O’ to ‘0’, ‘I’ to ‘1’) • Substitute punctuation (e.g. ‘/\’ for ‘A’, ‘|’ for ‘L’, ‘\/\/’ for ‘W’) • Some standard typos (e.g. ‘p’ for ‘o’) • De-duplicate or duplicate characters freely • Delete characters relatively freely • Insert/delete space or punctuation freely • Get creative • Examples from my Spam from this week: • VàLIUM CíAL1SS ViÁGRRA; MACR0MEDIA, M1CR0S0FT, SYMANNTEC $20 EACH; univers.ty de-gree online; HOt penny pick fue|ed by high demand; Fwd; cials-tabs, 24 hour sale online; HOw 1s yOur health; Your C A R D D E B T can be wipe clean; Savvy players wOuld be wise tO l0ad up early; Im fed up of my Pa|n med|catiOn pr0b|em; Y0ur wIfe needs tO cOpe with the PaIn; End your gIrlfr1end's Med!ca| prOcedures n0w; C,E*L.E*B,R.E'X 2oo m'gg • Piece of cake to correct (pwn4g3 = “ownage”, a popular taunt if you win) • More info: http://en.wikipedia.org/wiki/Leet
