Problem 1: Word Segmentation

Presentation Transcript


  1. Problem 1: Word Segmentation whatdoesthisreferto what does this refer to

  2. Application: Chinese Text

  3. Application: Internet Domain Names www.visitbritain.com Visit Britain

  4. Statistical Machine Learning
  • Best segmentation = one with highest probability
  • Probability of a segmentation = P(first word) × P(rest of segmentation)
  • P(word) = estimated by counting

  5. Statistical Machine Learning
  • choosespain
  • Choose Spain
  • Chooses pain
  • P(“Choose Spain”) > P(“Chooses Pain”)

  6. Example
  • segment(“nowisthetime…”)
  • Pf(“n”) × Pr(“owisthetime…”)
  • Pf(“no”) × Pr(“wisthetime…”)
  • Pf(“now”) × Pr(“isthetime…”)
  • Pf(“nowi”) × Pr(“sthetime…”)
  • … (Pf = P(first word), Pr = P(rest of segmentation))
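
The candidates above pair every possible first word with the remaining text. A small sketch of that enumeration in Python (the 20-character cap on the first word is an assumption, not something stated on the slide):

    def splits(text, max_len=20):
        "All (first, rest) candidate pairs for the first word."
        return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

    for first, rest in splits("nowisthetime")[:4]:
        print(first, "|", rest)
    # n | owisthetime
    # no | wisthetime
    # now | isthetime
    # nowi | sthetime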

  7. Example • segment(“nowisthetime…”)

  8. The Complete Program
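
The program on this slide appeared as an image. Below is a minimal sketch of the approach described on slides 4–6, assuming a toy unigram count table (COUNTS) in place of real corpus counts; the 20-character cap on candidate word length and the penalty for unseen words are illustrative choices, not taken from the slide.

    from functools import lru_cache
    from math import log10

    # Toy unigram counts; a stand-in for counts from a large English corpus.
    COUNTS = {"choose": 100, "chooses": 20, "spain": 50, "pain": 80,
              "now": 300, "is": 900, "the": 2000, "time": 400}
    TOTAL = sum(COUNTS.values())

    def Pword(word):
        "P(word), estimated by counting; unseen words get a length-based penalty."
        if word in COUNTS:
            return COUNTS[word] / TOTAL
        return 10.0 / (TOTAL * 10 ** len(word))

    @lru_cache(maxsize=None)
    def segment(text):
        "Best segmentation = the candidate split with the highest probability."
        if not text:
            return ()
        candidates = ((text[:i],) + segment(text[i:])
                      for i in range(1, min(len(text), 20) + 1))
        # Summing log-probabilities avoids numerical underflow on long inputs.
        return max(candidates,
                   key=lambda words: sum(log10(Pword(w)) for w in words))

    print(segment("choosespain"))   # ('choose', 'spain')
    print(segment("nowisthetime"))  # ('now', 'is', 'the', 'time')

With counts from a real corpus the same structure handles arbitrary text; the toy table above only covers the examples from slides 5 and 6.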

  9. Performance
  • Accuracy = 98%
  • Trained on 1.7B words (English)
  • Typical errors:
    • baseratesoughtto → base rate sought to
    • smallandinsignificant → small and in significant
    • ginormousego → g in or mouse go

  10. Some Results
  • whorepresents.com → [“who”, “represents”]
  • therapistfinder.com → [“therapist”, “finder”]
  • expertsexchange.com → [“experts”, “exchange”]
  • speedofart.net → [“speed”, “of”, “art”]
  • penisland.com → error: expected [“pen”, “island”]

  11. Problem 2: Spelling Correction
  • Mehran Sahami
  • Typical word processor: Tehran Salami
  • But Google can …

  12. Statistical Machine Learning
  • Best correction = one with highest probability
  • Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)
  • P(c as a word) = estimated by counting
  • P(original is a typo for c) = proportional to number of changes

  13. The Complete Program
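
This slide's code was also an image. Below is a minimal sketch of a corrector following the model on slide 12, assuming a plain-text corpus file (big.txt, a hypothetical name) for the word counts; P(original is a typo for c) is approximated here by simply preferring candidates reachable in fewer single-character edits.

    import re
    from collections import Counter

    def tokens(text):
        "Lower-case words in a text."
        return re.findall(r"[a-z]+", text.lower())

    # Assumed corpus file; any large plain-text English corpus will do.
    WORDS = Counter(tokens(open("big.txt").read()))
    TOTAL = sum(WORDS.values())

    def P(word):
        "P(word as a word), estimated by counting."
        return WORDS[word] / TOTAL

    def edits1(word):
        "All strings one edit (delete, transpose, replace, insert) away."
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(candidates):
        "Subset of candidates that appear in the corpus."
        return {w for w in candidates if w in WORDS}

    def correction(word):
        "Best correction = candidate with highest probability, preferring fewer edits."
        candidates = (known([word]) or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=P)

    print(correction("korrect"))  # likely 'correct' with typical English counts

A real error model would weight individual edits rather than just counting them; the ordering of the original word, one-edit candidates, and two-edit candidates stands in for "fewer changes are more probable".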

  14. Problem 3: Speech Recognition • An informal, incomplete grammar of the English language runs over 1,700 pages. • Invariably, simple models and a lot of data trump more elaborate models based on less data.

  15. Problem 3: Speech Recognition • If you have a lot of data, memorisation is a good policy. • For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.

  16. Problem 3: Speech Recognition

  17. Problem 3: Speech Recognition

  18. Problem 3: Speech Recognition “Every time I fire a linguist, the performance of our speech recognition system goes up.” --- Fred Jelinek

  19. Problem 4: Machine Translation

  20. Conclusion (Statistical) [Machine] Learning Is The Ultimate Agile Development Tool Peter Norvig (Director of Research, Google)
