
Swapnil Chhajer schhajer@usc.edu http://schhajer.co.nr

Spelling Correction for Search Engine Queries. Bruno Martins and Mario J. Silva. Proceedings of EsTAL-04, España for Natural Language Processing (2004).


Presentation Transcript


  1. Spelling Correction for Search Engine Queries
     Bruno Martins and Mario J. Silva
     Proceedings of EsTAL-04, España for Natural Language Processing (2004)
     Swapnil Chhajer schhajer@usc.edu http://schhajer.co.nr

  2. Topics Covered in Class
     • Peter Norvig’s Spelling Corrector: Query Processing [33-35]
     • Levenshtein Algorithm: Query Processing [36-41]
     • Evaluation Metrics: Precision & Recall: Introduction to Information Retrieval [16]
     • Soundex Algorithm: Query Processing [18]

  3. Motivation & Abstract
     • Misspelled queries retrieve pages that contain the same misspelled words, so the most appropriate pages are left out.
     • 10-12% of queries are misspelled.
     • Goal: give the user the single best match instead of making the user choose from a list of possible corrections.

  4. Google: Spelling Correction

  5. Spelling Correction
     • Uses
       • Correcting documents being indexed
       • Retrieving matching documents when the query contains a spelling error
     • Flavors
       • Isolated words
         • Check each word on its own
         • Unable to catch typos that are themselves correctly spelled words, e.g., from vs. form
       • Context-sensitive
         • Look at surrounding words, e.g., I flew form Heathrow to Narita.
     “a paragraph cud half mini flaws but wood bee past by the isolated spill checker”

  6. General Issues in Spelling Correction
     • UI
       • “Did you mean” works for one suggestion.
       • What about multiple possible corrections?
     • Computational cost
       • Spelling correction is potentially expensive
       • Avoid running it on every query
       • Maybe run it only on queries that match few documents
       • Guess: spelling correction in major search engines is efficient enough to be run on every query

  7. Kinds of Spelling Mistakes: Typos
     • Wrong characters typed by mistake
     • Four main categories:
       • Insertions (extra letter)
         • “prejudice” as “prejudsice”
       • Deletions (missing letter)
         • “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
       • Substitutions (wrong letter)
         • “habeas” as “haceas”, “appellate” as “appellare”
       • Transpositions (adjacent letters swapped)
         • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as “subpeona”, “plaintiff” as “plaitniff”
     • 80-95% of these errors differ from the correct spelling in just one of the four ways.
     • Keyboard layout is important in such cases.
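The four typo categories can be made concrete with a small classifier. The following is only a sketch: the function name `classify_single_error` is hypothetical and not from the paper, and it handles only words that differ by exactly one edit.

```python
def classify_single_error(correct, typed):
    """Name the single edit that turns `correct` into `typed`, or return None."""
    if correct == typed:
        return None
    # Skip the prefix that both spellings share.
    i = 0
    while i < min(len(correct), len(typed)) and correct[i] == typed[i]:
        i += 1
    rest_c, rest_t = correct[i:], typed[i:]
    if len(typed) == len(correct) + 1 and rest_t[1:] == rest_c:
        return "insertion"       # an extra letter was typed
    if len(typed) + 1 == len(correct) and rest_c[1:] == rest_t:
        return "deletion"        # a letter is missing
    if len(typed) == len(correct):
        if rest_c[1:] == rest_t[1:]:
            return "substitution"    # one wrong letter
        if (len(rest_c) >= 2 and rest_c[0] == rest_t[1]
                and rest_c[1] == rest_t[0] and rest_c[2:] == rest_t[2:]):
            return "transposition"   # two adjacent letters swapped
    return None                  # more than one edit apart


for correct, typed in [("prejudice", "prejudsice"), ("plaintiff", "paintiff"),
                       ("habeas", "haceas"), ("fraud", "fruad")]:
    print(correct, typed, classify_single_error(correct, typed))
```

Pairs that need two or more edits, such as “subpoena” typed as “supena”, return `None` here.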

  8. Kinds of Spelling Mistakes: Brainos
     • Wrong characters typed on purpose
     • Most common type of mistake in general web queries
     • Mistakes derived from pronunciation, spelling, or semantic confusions
     • Brainos: Soundalike (phonetic errors)
       • “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”, “withholding” as “witholding”, “foreclosure” as “forclosure”
     • Brainos: Confusions
       • “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys” or “jureys”, “dramshop” as “dram shop”

  9. Dictionary Storage: Ternary Search Trees (TST)
     • Data structure: Ternary Search Tree (TST)
       • A type of trie, limited to 3 children per node.
       • A trie is the common name for a tree that stores strings, with one node for every common prefix and the strings stored in extra leaf nodes.
     • Searching: O(log(n)+k)
       • n: number of strings in the tree
       • k: length of the string being searched for

  10. TST Continued…
      Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an associated frequency of 1
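A minimal sketch of a ternary search tree that stores word frequencies, using the five words from the figure. The class and method names (`Node`, `TernarySearchTree`, `insert`, `frequency`) are assumptions for illustration, and this variant marks the end of a word with a frequency counter on the node rather than with a separate leaf node.

```python
class Node:
    """One character of the tree with three children: lower, equal, higher."""
    __slots__ = ("char", "lo", "eq", "hi", "freq")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None
        self.freq = 0  # 0 means that no stored word ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq=1):
        """Store `word`, adding `freq` to its frequency."""
        if word:
            self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.char:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq += freq
        return node

    def frequency(self, word):
        """Return the stored frequency of `word`, or 0 if it is absent."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.freq
        return 0


tree = TernarySearchTree()
for w in ("to", "too", "toot", "tab", "so"):
    tree.insert(w)
print(tree.frequency("toot"), tree.frequency("tax"))
```

Each lookup follows `lo`/`hi` links to find the right character at a position and an `eq` link to advance to the next character, which is where the log(n) and k terms of the search cost come from.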

  11. Spelling Correction Algorithm
      • Can be implemented using edit distance, rule-based techniques, n-grams, probabilistic techniques, neural nets, similarity key techniques, or combinations of these.
      • Goal: to find the edit distance based on different strategies.
      • A shorter distance implies a better correction.
      • Soundex system:
        • Indexing based on sound.
        • Devised to help with the problem of phonetic errors.
      • Metaphone systems:
        • Specific to the English language
        • Transform words into codes based on phonetic properties
        • Based on consonants & diphthongs
      • Spelling correction for the web
        • Context-dependent correction is largely wasted effort, as users rarely type more than three terms in a query
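Two of the techniques named above, edit distance and Soundex, are small enough to sketch. This is not the paper's implementation: `levenshtein` is the classic dynamic-programming distance (it counts a transposition as two edits), and `soundex` follows the common American Soundex rules, which target English.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Four-character sound code: first letter plus up to three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    previous = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]


print(levenshtein("habeas", "haceas"), soundex("latter"), soundex("ladder"))
```

Soundalike pairs such as “latter”/“ladder” and “subpoena”/“supena” get the same code, which is what makes a sound-based index useful for phonetic errors.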

  12. Spelling Correction Algorithm Continued…
      • The user-entered query is tokenized, ignoring non-word characters.
      • All words are converted to lower case and checked for correct spelling.
      • The frequencies of correctly spelled words are updated. This acts as feedback to the system, which can help the spell checker predict patterns in users’ searches.
      • Misspelled words are replaced by correctly spelled words.
      • Finally, a new query is presented to the user as a suggestion, together with the results page for the original query.
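The query flow described on this slide might look roughly like the sketch below. The names `suggest_query`, `frequencies` and `correct` are hypothetical: `frequencies` stands in for the dictionary with its word counts, and `correct` for whatever routine picks a replacement for a misspelled word.

```python
import re
from collections import Counter


def suggest_query(query, frequencies, correct):
    """Return a corrected query to suggest, or None if nothing changed."""
    # Tokenize on word characters and lower-case everything.
    tokens = re.findall(r"\w+", query.lower())
    suggestion = []
    for token in tokens:
        if token in frequencies:
            frequencies[token] += 1      # feedback: the word was seen again
            suggestion.append(token)
        else:
            # Fall back to the original token when no correction is found.
            suggestion.append(correct(token) or token)
    return " ".join(suggestion) if suggestion != tokens else None


frequencies = Counter({"spelling": 1, "correction": 1})
corrections = {"speling": "spelling"}   # stand-in for the real corrector
print(suggest_query("Speling, correction!", frequencies, corrections.get))
```

The caller would still run the original query and show the returned string only as a suggestion next to those results.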

  13. Spelling Correction Algorithm Continued…
      • The algorithm is divided into 2 phases:
        • Phase 1: Generate a set of candidate suggestions
        • Phase 2: Select the best choice among those candidates
      • Phase 1
        • 9 steps; at each step, look up the dictionary for words that relate to the original misspelling. Candidate words:
          • Differ in one character from the original word.
          • Differ in two characters from the original word.
          • Differ in one letter removed or added.
          • Differ in one letter removed or added, plus one letter different.
          • Differ in repeated characters removed.
          • Correspond to 2 concatenated words (space between words eliminated).
          • Differ in having two consecutive letters exchanged & 1 character different.
          • Have the original word as a prefix.
          • Differ in repeated characters removed & 1 character different.
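A rough sketch of Phase 1, covering only five of the nine steps (one character different, one letter removed or added, repeated characters removed, two concatenated words, original word as a prefix). It assumes the dictionary is a plain Python set rather than a TST, and the function name `candidates` and the default ASCII alphabet are illustrative; a Portuguese dictionary would need accented letters in the alphabet.

```python
import re
import string


def candidates(word, dictionary, alphabet=string.ascii_lowercase):
    """Dictionary words reachable from `word` by a few of the Phase 1 steps."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = set()
    for left, right in splits:
        if right:
            edits.add(left + right[1:])               # one letter removed
            for ch in alphabet:
                edits.add(left + ch + right[1:])      # one character different
        for ch in alphabet:
            edits.add(left + ch + right)              # one letter added
    found = {w for w in edits if w in dictionary}

    collapsed = re.sub(r"(.)\1+", r"\1", word)        # repeated characters removed
    if collapsed in dictionary:
        found.add(collapsed)

    for left, right in splits:                        # two concatenated words
        if left in dictionary and right in dictionary:
            found.add(left + " " + right)

    if word:                                          # original word as a prefix
        found.update(w for w in dictionary if w.startswith(word))

    found.discard(word)
    return found


words = {"plaintiff", "fourth", "amendment", "judgment", "to", "too"}
print(candidates("paintiff", words), candidates("fourthamendment", words))
```

Scanning the whole set for prefix matches is linear in the dictionary size; a TST can answer the same question by walking down from the node that ends the prefix.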

  14. Spelling Correction Algorithm Continued…
      • Phase 2: Heuristics used
        • Return a candidate if it differs only in accented characters.
        • Return a candidate if it differs in only one character, with the error corresponding to an adjacent letter in the same row of the keyboard.
        • Return the smallest candidate, if there are candidates with the same Metaphone key as the original string.
        • Return a candidate if it differs in only one character, with the error corresponding to an adjacent letter in an adjacent row of the keyboard.
        • Otherwise, return the last word.
      • Heuristics are applied sequentially, moving to the next only if no matching words are found.
      • If more than one word matches, return the one whose first character matches the original.
      • If there is still more than one, choose the word with the highest frequency.
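The sequential selection can be sketched as follows. Only the accent heuristic and the same-row keyboard heuristic are implemented; the Metaphone and adjacent-row steps are omitted, a QWERTY layout is assumed, and falling back to all candidates when no heuristic matches is an assumption of this sketch, not the paper's rule. All names (`choose`, `adjacent_key_slip`, and so on) are hypothetical.

```python
import unicodedata

KEYBOARD_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumed QWERTY layout


def strip_accents(text):
    """Drop combining marks, so that 'ñ' compares equal to 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_only_in_accents(word, candidate):
    return candidate != word and strip_accents(candidate) == strip_accents(word)


def adjacent_key_slip(word, candidate):
    """True if exactly one character differs and the two keys are row neighbours."""
    if len(word) != len(candidate):
        return False
    diffs = [(x, y) for x, y in zip(word, candidate) if x != y]
    if len(diffs) != 1:
        return False
    x, y = diffs[0]
    return any(x in row and y in row and abs(row.index(x) - row.index(y)) == 1
               for row in KEYBOARD_ROWS)


def choose(word, candidates, frequency):
    """Apply the heuristics in order, then the two tie-breakers."""
    candidates = list(candidates)
    matches = []
    for heuristic in (differs_only_in_accents, adjacent_key_slip):
        matches = [c for c in candidates if heuristic(word, c)]
        if matches:
            break
    if not matches:
        matches = candidates          # assumption: fall back to every candidate
    if not matches:
        return None
    same_start = [c for c in matches if c[:1] == word[:1]]
    pool = same_start or matches      # tie-breaker 1: first character matches
    return max(pool, key=lambda c: frequency.get(c, 0))  # tie-breaker 2


print(choose("appellare", ["appellant", "appellate"], {"appellant": 10}))
```

In the usage line, “appellate” wins despite its lower frequency because `r` and `t` are neighbouring keys, so the keyboard heuristic fires before frequency is ever consulted.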

  15. Results Comparison
      • Aspell Spell Checker
        • http://aspell.sourceforge.net/
        • Aspell uses the Metaphone algorithm with a near-miss strategy
      • 48.33% of the correct forms were guessed correctly.
      • The proposed checker outperformed Aspell by 1.66%.
      Table legend: * = doesn’t detect the misspelling; - = failed to return a suggestion.

  16. Results Comparison Continued…
      • Tumba!: search engine for the Portuguese web
      Table: Results from the spelling checker with Tumba!

  17. Conclusion & Future Work
      • The spelling checker uses a ternary search tree data structure for storing the dictionary.
      • Two popular Portuguese newspapers served as the data source.
      • Search engine queries may contain company or person names. In such cases, keeping two dictionaries, one in the TST used for correction and another in a hash table used only for checking valid words, could yield good results.

  18. Pros & Cons
      • Pros
        • Considered various factors affecting edit distance, including probabilistic estimations.
        • Used a feedback system to improve the quality of the results returned for user queries.
      • Cons
        • Did not consider context-sensitive spell checking.
        • Not a language-independent system; mainly focused on Portuguese words.
        • No discussion of spell-corrected completion suggestions as a query is incrementally entered.

  19. References
      • Bob Carpenter, “Contemporary Spelling Correction - Decoding the Noisy Channel”
      • Whitelaw, Hutchinson, Chung and Ellis, “Using the Web for Language Independent Spellchecking and Autocorrection”
      • Choudhury, Thomas, Mukherjee, Basu and Ganguly, “How Difficult is it to Develop a Perfect Spell-checker? A Cross-linguistic Analysis through Complex Network Approach”
