  1. URL: http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/ CS506/606: Text Normalization Richard Sproat, Steven Bedrick TA: Emily Tucker-Prud’hommeaux Fall 2011 Introduction

  2. Text Normalization Course Outline • This course will consist of a combination of • a (few) lectures, • discussion of papers from the literature, • a lab component where the class as a team will build a set of modules for text normalization using the Thrax open-source finite-state grammar toolkit. • For most classes, there will be a combination of reading discussion, and discussion of progress on the project.

  3. Text Normalization Text Normalization • Conversion of text that includes ‘non-standard’ words like numbers, abbreviations, misspellings . . . into normal words. • Abbreviation expansion (including novel abbreviations) • Expansion of numbers into ‘number names’ • Correction of misspellings • Disambiguation in cases where there is ambiguity

  4. Text Normalization Where is normalization needed? • Very little in cases like this: Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’ So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

  5. Text Normalization Where is normalization needed? • A lot in cases like this:

  6. Text Normalization Humans are pretty good at this: can you read this? f u cn rd ths thn ur dng btr thn ny autmtc txt nrmlztion prgrm cn do.

  7. Text Normalization How about this? Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

  8. Text Normalization Or this? Goccdrnia to a hscheearcr at Emabrigdc Yinervtisu, it teosn’d rttaem in tahw rredo the stteerl in a drow are, the ylno tprmoetni gihnt is taht the trisf and tsal rtteel be at the tghir eclap. The tser can be a lotat ssem and you can litls daer it touthiw morbelp. Siht is ecuseab the nuamh dnim seod not daer yrvee rtetel by fstlei, but the drow as a elohw.

  9. Text Normalization Two components of text normalization • Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it. • Which of those is right for the particular context?

  10. Text Normalization An illustration • He has 123 goats. • Lotus 123 for Windows • I live at 123 King Avenue.

  11. Text Normalization Two components of text normalization • A component that gives you the set of possibilities: • 123 = one hundred (and) twenty three • 123 = one twenty three • 123 = one two three • A component that tells you which one(s) are appropriate to a particular context.
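The split into a candidate component and a context component can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CANDIDATES` table is hard-coded for 123 and stands in for a real generator, and the names `pick_reading`, `normalize`, and `STREET_WORDS`, along with the two context rules, are hypothetical and not taken from the course materials.

```python
# Component 1 (assumed, hard-coded): the set of possible readings for a token.
# A real system would generate these rather than look them up.
CANDIDATES = {
    "123": {
        "cardinal": "one hundred twenty three",
        "paired": "one twenty three",
        "digits": "one two three",
    }
}

# Hypothetical cue words suggesting that a number is a street address.
STREET_WORDS = {"avenue", "street", "road"}


def pick_reading(tokens, i):
    """Component 2: choose a reading type from the neighbouring words."""
    previous = tokens[i - 1].lower() if i > 0 else ""
    following = {t.strip(".,").lower() for t in tokens[i + 1 : i + 3]}
    if previous == "lotus":          # product name: read digit by digit
        return "digits"
    if following & STREET_WORDS:     # address: read as "one twenty three"
        return "paired"
    return "cardinal"                # default: ordinary number name


def normalize(sentence):
    """Replace each token that has candidates with its chosen reading."""
    tokens = sentence.split()
    out = []
    for i, token in enumerate(tokens):
        readings = CANDIDATES.get(token.strip(".,"))
        out.append(readings[pick_reading(tokens, i)] if readings else token)
    return " ".join(out)


print(normalize("He has 123 goats"))
print(normalize("Lotus 123 for Windows"))
print(normalize("I live at 123 King Avenue."))
```

Keeping the two components separate means the candidate table can grow (or be replaced by a grammar) without touching the context rules, and vice versa.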

  12. Text Normalization A concrete example of finite-state methods in text normalization: digit to number name translation • Factor digit string: • 123 → 1 · 10^2 + 2 · 10^1 + 3 • Translate factors into number names: • 10^2 → hundred • 2 · 10^1 → twenty • 1 · 10^1 + 3 → thirteen • Languages vary on how extensive these lexicons are. Some (e.g. Chinese) have very regular (hence very simple) number name systems; others (e.g. Urdu/Hindi) have a large set of number names with a name for almost every number from 1 to 100. • Each of these steps can be accomplished with FSTs.
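The two steps on this slide (factor, then translate) can be mimicked in plain Python. This is a sketch only: it uses ordinary functions and small dictionaries rather than finite-state transducers, it handles digit strings of one to three digits, and the names `factor` and `english_name` are illustrative assumptions.

```python
# Small English lexicon; the teens are the "1 · 10^1 + u" factors.
UNITS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
         6: "six", 7: "seven", 8: "eight", 9: "nine"}
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def factor(digits):
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1·10^2 + 2·10^1 + 3·10^0."""
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits)]


def english_name(digits):
    """Translate the factors of a 1-3 digit string into a number name."""
    by_power = {power: digit for digit, power in factor(digits)}
    h, t, u = by_power.get(2, 0), by_power.get(1, 0), by_power.get(0, 0)
    words = []
    if h:
        words += [UNITS[h], "hundred"]
    if t == 1:                      # 1·10^1 + u is one lexical item
        words.append(TEENS[u])
    else:
        if t:
            words.append(TENS[t])
        if u:
            words.append(UNITS[u])
    return " ".join(words) or "zero"


print(factor("123"))
print(english_name("123"))
print(english_name("113"))
```

The teens are looked up as a unit, which is the point the slide makes with 1 · 10^1 + 3 → thirteen: the size of such lookup tables is what varies from language to language.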

  13. Urdu (Hindi) Number Names Text Normalization

  14. Text Normalization Digit string factoring transducer (fragment)

  15. Text Normalization Germanic “decade flop”: 24 → vier und zwanzig

  16. Text Normalization 70’s

  17. Text Normalization Digit-string to number name translation: German • Factor digit string: • 123 → 1 · 10^2 + 2 · 10^1 + 3 • Flip decades and units: 2 · 10^1 + 3 → 3 + 2 · 10^1 • Translate factors into number names: • 10^2 → hundert • 2 · 10^1 → zwanzig • 1 · 10^1 + 3 → dreizehn
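The German pipeline (factor, flip, translate) can be sketched the same way. Again this is plain Python standing in for the transducers, limited to one to three digits; the function names are hypothetical, and the choices to write the result as a single word and to use "eins" for a final standalone 1 follow standard German usage rather than anything stated on the slide.

```python
UNITS = {1: "ein", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf",
         6: "sechs", 7: "sieben", 8: "acht", 9: "neun"}
TEENS = ["zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
         "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def factor(digits):
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. 1·10^2 + 2·10^1 + 3·10^0."""
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits)]


def flip(factors):
    """Decade flop: move the units factor in front of the tens factor."""
    if len(factors) < 2:
        return list(factors)
    return factors[:-2] + [factors[-1], factors[-2]]


def translate(flipped):
    """Walk the flipped factors in order and emit German number words."""
    tens_digit = next((d for d, p in flipped if p == 1), 0)
    units_digit = next((d for d, p in flipped if p == 0), 0)
    words = []
    for digit, power in flipped:
        if digit == 0:
            continue
        if power == 2:
            words += [UNITS[digit], "hundert"]
        elif power == 0:
            if tens_digit == 1:
                continue                       # absorbed into the teen name
            if tens_digit == 0:
                words.append("eins" if digit == 1 else UNITS[digit])
            else:
                words += [UNITS[digit], "und"]
        elif digit == 1:
            words.append(TEENS[units_digit])   # 1·10^1 + u is lexical
        else:
            words.append(TENS[digit])
    return words


def german_name(digits):
    """Join the words without spaces, as German orthography does."""
    return "".join(translate(flip(factor(digits)))) or "null"


print(translate(flip(factor("24"))))
print(german_name("123"))
```

Because `translate` simply emits words in the order it receives the factors, the flip step is what puts "vier" before "zwanzig"; removing it would yield the English order.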

  18. Text Normalization German number grammar (fragment)

  19. Text Normalization Concrete example from English Consider a machine that maps between digit strings and their reading as number names in English. 30,294,005,179,018,903.56 → thirty quadrillion, two hundred and ninety four trillion, five billion, one hundred seventy nine million, eighteen thousand, nine hundred three, point five six
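For larger inputs like the one above, the mapping can be approximated by splitting the integer part into groups of three digits, naming each group, and reading the fractional digits one at a time. The sketch below is an assumption-laden stand-in for the machine the slide describes: `below_thousand` and `number_name` are hypothetical names, the scale list stops at quadrillion, and the output uses one uniform style (no "and", no comma before "point"), so it differs slightly from the wording on the slide.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def below_thousand(n):
    """Name an integer in the range 0..999."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        words.append(TENS[rest // 10])
        if rest % 10:
            words.append(ONES[rest % 10])
    elif rest or not hundreds:
        words.append(ONES[rest])
    return " ".join(words)


def number_name(text):
    """Name a digit string with optional thousands commas and a fraction."""
    integer, _, fraction = text.replace(",", "").partition(".")
    n = int(integer)
    groups = []                       # three-digit groups, least significant first
    while n:
        n, group = divmod(n, 1000)
        groups.append(group)
    parts = []
    for scale, group in reversed(list(enumerate(groups))):
        if group:                     # all-zero groups are not pronounced
            parts.append((below_thousand(group) + " " + SCALES[scale]).strip())
    name = ", ".join(parts) if parts else "zero"
    if fraction:                      # fractional digits are read one by one
        name += " point " + " ".join(ONES[int(d)] for d in fraction)
    return name


print(number_name("30,294,005,179,018,903.56"))
```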

  20. Text Normalization 566 states and 1492 arcs

  24. Text Normalization NSW Classification

  46. Text Normalization Introduction to Thrax • The OpenGrm Thrax tools compile grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers. Thrax makes use of functionality in the OpenFst library to create, access and manipulate the compiled grammars. It is named after Dionysius Thrax (Διονύσιος ὁ Θρᾷξ) (170 BC – 90 BC), the reputed first Greek grammarian. • http://www.openfst.org/twiki/bin/view/GRM/Thrax

  47. Text Normalization Reading Assignment • Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001.
