
Language Processing in the Web Era



Presentation Transcript


  1. Language Processing in the Web Era Kuansan Wang ISRC Microsoft Research, Redmond WA

  2. Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problems

  3. CIKM’08

  4. There is no data like more data
  • …that can be correctly exploited

  5. NLP = Data + Model
  • Data: size does matter, but…
  • Language styles
  • Global, multi-lingual
  • Dynamic
  • Model
  • Simple, but not overly simplistic
  • The less human involvement, the better
  • Machines don't have to work the same way as humans
  • For many tasks, machines have outperformed humans

  6. [Diagram: document fields — URL, HTML title, heading, anchor text, caption, body — contrasted with search queries such as "google earning", "earnings", "GOOG", "gooogle", "quarterly report", …]

  7. Tackling the gap between query and document languages
  • Machine translation
  • Miller et al. (SIGIR '99): latent query generation
  • Berger and Lafferty (SIGIR '99): explicit query model
  • Jin et al. (SIGIR '02): title/body as parallel text
  • Smoothing
  • Lafferty and Zhai (SIGIR '01): divergence model
  • Zhai and Lafferty (SIGIR '02): two-stage smoothing
  • Questions
  • Quantitatively, how big is the problem, really?
  • Computationally, what insights point toward solutions?

  8. Microsoft Web N-gram Service
  • Cloud-based web service
  • Web documents/search queries received by Bing (EN-US market)
  • Live with June '09 and April '10 snapshots
  • Training/adaptation tokens: ~1.2T per snapshot
  • CALM (ICASSP 2009)
  • http://web-ngram.research.microsoft.com

  9. Cross-Language Perplexities on Query June-09 Snapshot

  10. Query bigram PPL and OOV rate
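For reference, the two quantities plotted on this slide can be computed as follows. The bigram table, vocabulary, and back-off constant below are toy stand-ins rather than the actual Bing Web n-gram model, so this is only a sketch of the definitions.

```python
import math

# Toy bigram model with made-up probabilities.
BIGRAM_P = {
    ("<s>", "google"): 0.2,
    ("google", "earnings"): 0.1,
    ("earnings", "</s>"): 0.3,
}
VOCAB = {"google", "earnings"}
UNK_P = 1e-6  # crude floor for unseen bigrams, in place of real smoothing

def bigram_ppl(tokens):
    """Perplexity of a token sequence under the toy bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_p = sum(math.log(BIGRAM_P.get(pair, UNK_P))
                for pair in zip(padded, padded[1:]))
    return math.exp(-log_p / (len(padded) - 1))

def oov_rate(tokens):
    """Fraction of tokens outside the model vocabulary."""
    return sum(t not in VOCAB for t in tokens) / len(tokens)

print(bigram_ppl(["google", "earnings"]))  # low PPL: every bigram was seen
print(oov_rate(["google", "earning"]))     # "earning" is OOV -> 0.5
```

The slide's cross-snapshot comparison amounts to running these two measurements with the model trained on one snapshot and the queries drawn from another.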

  11. Dynamics of the Web: N-gram Counts

  12. Rapid Pace of the Web
  • Top unigrams for Web docs change a lot
  • Ref.: top 100K words from MS Web N-gram
  • Search queries change more quickly
  • Real-time media even more so
  • Twitter, Facebook updates
  • The Web is not a "dead" corpus
  • Adaptation capability is critical for Web NLP

  13. MAP Decision Approach
  • Channel coding; Bayesian minimum risk…
  • Speech, MT, parsing, tagging, information retrieval
  • S_opt = arg max_S P(S|O) = arg max_S P(O|S) P(S)
  • P(O|S): transformation model
  • P(S): prior
  • [Diagram: Signal (S) → Distortion Channel → Output (O)]
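As a concrete illustration of the decision rule on this slide, a minimal noisy-channel MAP decoder can be sketched over a hand-made candidate set. The spelling example and all probabilities below are invented for the illustration, not taken from the talk.

```python
import math

# Minimal MAP decoder: S_opt = arg max_S P(O|S) * P(S), scored in
# log space. PRIOR plays the role of P(S), CHANNEL of P(O|S).
PRIOR = {"their": 0.6, "there": 0.4}
CHANNEL = {
    ("thier", "their"): 0.30,
    ("thier", "there"): 0.05,
}

def map_decode(observation, candidates):
    """Return the candidate signal maximizing log P(O|S) + log P(S)."""
    return max(
        candidates,
        key=lambda s: math.log(CHANNEL[(observation, s)]) + math.log(PRIOR[s]),
    )

print(map_decode("thier", ["their", "there"]))  # -> their
```

Both the prior and the channel favor "their" here; the interesting cases, addressed on the next slide, are the ones where the two models disagree.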

  14. Plug-in MAP Problem
  • The MAP decision is optimal only if P(S) and P(O|S) are the "real" distributions
  • Adjustments are needed when the probabilistic models include estimation errors or mismatch
  • Simple logarithmic interpolation:
  • S_opt = arg max_S [log P(O|S) + α log P(S)]
  • "Random field"/machine learning:
  • S_opt = arg max_S log P(S|O) = arg max_S Σ_i α_i log P(f_i|O)
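The effect of the interpolation weight α can be seen in a toy decoder: with α = 0 only the transformation model votes, and a larger α lets the prior overturn the decision. All numbers below are illustrative, chosen so the decision flips as α grows.

```python
import math

# Log-linear plug-in MAP: S_opt = arg max_S [log P(O|S) + alpha * log P(S)].
CHANNEL = {("obs", "A"): 0.2, ("obs", "B"): 0.1}  # P(O|S)
PRIOR = {"A": 0.1, "B": 0.4}                      # P(S)

def decode(observation, candidates, alpha):
    """MAP decision with an interpolation weight alpha on the prior."""
    return max(
        candidates,
        key=lambda s: math.log(CHANNEL[(observation, s)])
                      + alpha * math.log(PRIOR[s]),
    )

print(decode("obs", ["A", "B"], alpha=0.0))  # channel only -> A
print(decode("obs", ["A", "B"], alpha=1.0))  # standard MAP -> B
```

In practice α is tuned on held-out data precisely because the plugged-in estimates of P(S) and P(O|S) are imperfect.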

  15. Challenging Problems
  • Generalizability
  • Robustness
  • Adaptability
  • Implementation efficiency
  • Cost
  • When do we need complex models?

  16. Case Study: Word Breaker
  • O: Twitter hashtag or URL domain name (e.g. "247moms", "w84um8")
  • S: what the user meant to say (e.g. "24_7_moms", "w8_4_u_m8" = "wait for you mate")
  • [Diagram: Signal (S) → Transformation Channel → Output (O)]

  17. Word Breaking Challenge
  • Norvig (CIKM 2008): Large Data + Simple Model
  • Unigram model
  • Good enough, but sometimes yields embarrassing outcomes
  • Simple extension to trigram
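A Norvig-style unigram word breaker can be sketched in a few lines of dynamic programming. The word counts below are hand-made stand-ins for a Web-scale unigram model, and the flat unknown-word penalty replaces proper smoothing; this is a sketch of the technique, not the talk's actual system.

```python
import math
from functools import lru_cache

# Hand-made counts standing in for a Web-scale unigram model.
COUNTS = {"24": 50, "7": 80, "moms": 30, "mom": 40, "s": 5,
          "wait": 60, "for": 200, "you": 300, "mate": 20}
TOTAL = sum(COUNTS.values())

def log_p(word):
    # Crude unknown-word penalty in place of proper smoothing.
    return math.log(COUNTS.get(word, 0.01) / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return the best (log-probability, word list) segmentation of text."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        rest_score, rest_words = segment(text[i:])
        candidates.append((log_p(text[:i]) + rest_score,
                           [text[:i]] + rest_words))
    return max(candidates, key=lambda c: c[0])

print(segment("247moms")[1])  # -> ['24', '7', 'moms']
```

The trigram extension mentioned on the slide replaces the independent `log_p(word)` terms with probabilities conditioned on the previous words, which is what suppresses the "embarrassing outcomes" of the unigram model.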

  18. Additional Challenges for Web Applications
  • Demo: bing.com/?q=word+breaker+web+era

  19. Prior Art
  • Simple heuristics
  • BI: binomial model (Venkataraman, 2001)
  • log P(O|S) = n log P# + (|S| − n − 1) log (1 − P#)
  • GM: geometric mean (Koehn and Knight, 2003)
  • Widely used, especially in MT systems
  • WL: word-length normalization (Khaitan et al., 2009)
  • log P(O|S) = Σ_i log P(|w_i|)
  • ME: maximum entropy principle (WWW '11)
  • P# = 0.5, α = 1.0, P(S) using MS Web N-gram
  • Bayesian
  • Modular linguistic model (Brent, 1999)
  • Dirichlet process (Goldwater et al., 2006)
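For concreteness, the BI baseline can be written out directly. The reading below — |S| as the character count of the signal and n as the number of inserted boundaries — is my interpretation of the slide's formula. Note that with the P# = 0.5 setting quoted on the slide, every segmentation of a given string receives the same channel score, which helps explain why BI is only a baseline.

```python
import math

def log_p_binomial(words, p_boundary=0.5):
    """Binomial model: each of the |S|-1 gaps between characters
    independently becomes a word boundary with probability P#.
    log P(O|S) = n*log(P#) + (|S| - n - 1)*log(1 - P#)."""
    n_chars = sum(len(w) for w in words)   # |S|
    n = len(words) - 1                     # boundaries inserted
    return (n * math.log(p_boundary)
            + (n_chars - n - 1) * math.log(1 - p_boundary))

# With P# = 0.5, all segmentations of "247moms" score identically:
print(log_p_binomial(["24", "7", "moms"]))  # 6 gaps -> 6 * log(0.5)
print(log_p_binomial(["247moms"]))          # also 6 * log(0.5)
```

When every candidate gets the same P(O|S), the decision is carried entirely by the prior P(S), i.e. the Web n-gram language model.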

  20. [Results charts: word-breaking accuracy on title, query, and anchor-text data.] Note: BI, WL are cheating experiments!

  21. Best = Right Data + Smart Model
  • Style of language trumps size of data
  • The right data alleviates the plug-in MAP problem
  • Complicated machine-learning artillery is not required; simple methods suffice
  • Performance scales with model power, as mathematically predicted
  • A smart model gives us:
  • Rudimentary multi-lingual capability
  • Fast inclusion of new words/phrases
  • Reduced need for human labor

  22. From Word Breaking to Spelling Correction
  • http://www.spellerchallenge.com
