
Language Processing in the Web Era



Presentation Transcript


  1. Language Processing in the Web Era Kuansan Wang ISRC Microsoft Research, Redmond WA

  2. Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problems

  3. CIKM’08

  4. There is no data like more data
  • …that can be correctly exploited

  5. NLP = Data + Model
  • Data: size does matter, but…
  • Language styles
  • Global, multi-lingual
  • Dynamic
  • Model
  • Simple, but not overly simplistic
  • The less human involvement, the better
  • Machines don't have to work the same way as humans
  • For many tasks, machines have outperformed humans

  6. [Diagram: document fields — URL, HTML title, heading, anchor text, caption, body — contrasted with search queries such as "google earning", "earnings", "GOOG", "gooogle", "quarterly report", …]

  7. Tackling the gap between query and document languages
  • Machine translation
  • Miller et al. (SIGIR '99): latent query generation
  • Berger and Lafferty (SIGIR '99): explicit query model
  • Jin et al. (SIGIR '02): title/body as parallel text
  • Smoothing
  • Lafferty and Zhai (SIGIR '01): divergence model
  • Zhai and Lafferty (SIGIR '02): two-stage smoothing
  • Questions
  • Quantitatively, how big is the problem, really?
  • Computationally, what insights point toward solutions?

  8. Microsoft Web N-gram Service
  • Cloud-based web service
  • Web documents/search queries received by Bing (EN-US market)
  • Live with June '09 and April '10 snapshots
  • Training/adaptation tokens: ~1.2T per snapshot
  • CALM (ICASSP 2009)
  • http://web-ngram.research.microsoft.com

  9. Cross-Language Perplexities on Query June-09 Snapshot

  10. Query bigram PPL and OOV rate
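For reference, the two quantities plotted on this slide can be computed as follows. The bigram table, vocabulary, and back-off constant below are toy stand-ins rather than the actual Bing Web n-gram model, so this is only a sketch of the definitions.

```python
import math

# Toy bigram model with made-up probabilities.
BIGRAM_P = {
    ("<s>", "google"): 0.2,
    ("google", "earnings"): 0.1,
    ("earnings", "</s>"): 0.3,
}
VOCAB = {"google", "earnings"}
UNK_P = 1e-6  # crude floor for unseen bigrams, in place of real smoothing

def bigram_ppl(tokens):
    """Perplexity of a token sequence under the toy bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_p = sum(math.log(BIGRAM_P.get(pair, UNK_P))
                for pair in zip(padded, padded[1:]))
    return math.exp(-log_p / (len(padded) - 1))

def oov_rate(tokens):
    """Fraction of tokens outside the model vocabulary."""
    return sum(t not in VOCAB for t in tokens) / len(tokens)

print(bigram_ppl(["google", "earnings"]))  # low PPL: every bigram was seen
print(oov_rate(["google", "earning"]))     # "earning" is OOV -> 0.5
```

The slide's cross-snapshot comparison amounts to running these two measurements with the model trained on one snapshot and the queries drawn from another.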

  11. Dynamics of the Web: N-gram Counts

  12. Rapid Pace of the Web
  • Top unigrams for Web docs change a lot
  • Ref.: top 100K words from MS Web N-gram
  • Search queries change more quickly
  • Real-time media even more so
  • Twitter, Facebook updates
  • The Web is not a "dead" corpus
  • Adaptation capability is critical for Web NLP

  13. MAP Decision Approach
  • Channel coding; Bayesian minimum risk…
  • Speech, MT, parsing, tagging, information retrieval
  • S_opt = arg max_S P(S|O) = arg max_S P(O|S) P(S)
  • P(O|S): transformation model
  • P(S): prior
  • [Diagram: Signal (S) → Distortion Channel → Output (O)]
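As a concrete illustration of the decision rule on this slide, a minimal noisy-channel MAP decoder can be sketched over a hand-made candidate set. The spelling example and all probabilities below are invented for the illustration, not taken from the talk.

```python
import math

# Minimal MAP decoder: S_opt = arg max_S P(O|S) * P(S), scored in
# log space. PRIOR plays the role of P(S), CHANNEL of P(O|S).
PRIOR = {"their": 0.6, "there": 0.4}
CHANNEL = {
    ("thier", "their"): 0.30,
    ("thier", "there"): 0.05,
}

def map_decode(observation, candidates):
    """Return the candidate signal maximizing log P(O|S) + log P(S)."""
    return max(
        candidates,
        key=lambda s: math.log(CHANNEL[(observation, s)]) + math.log(PRIOR[s]),
    )

print(map_decode("thier", ["their", "there"]))  # -> their
```

Both the prior and the channel favor "their" here; the interesting cases, addressed on the next slide, are the ones where the two models disagree.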

  14. Plug-in MAP Problem
  • The MAP decision is optimal only if P(S) and P(O|S) are the "real" distributions
  • Adjustments are needed when the probabilistic models include estimation errors or mismatch
  • Simple logarithmic interpolation:
  • S_opt = arg max_S [log P(O|S) + α log P(S)]
  • "Random field"/machine learning:
  • S_opt = arg max_S log P(S|O) = arg max_S Σ_i α_i log P(f_i|O)
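The effect of the interpolation weight α can be seen in a toy decoder: with α = 0 only the transformation model votes, and a larger α lets the prior overturn the decision. All numbers below are illustrative, chosen so the decision flips as α grows.

```python
import math

# Log-linear plug-in MAP: S_opt = arg max_S [log P(O|S) + alpha * log P(S)].
CHANNEL = {("obs", "A"): 0.2, ("obs", "B"): 0.1}  # P(O|S)
PRIOR = {"A": 0.1, "B": 0.4}                      # P(S)

def decode(observation, candidates, alpha):
    """MAP decision with an interpolation weight alpha on the prior."""
    return max(
        candidates,
        key=lambda s: math.log(CHANNEL[(observation, s)])
                      + alpha * math.log(PRIOR[s]),
    )

print(decode("obs", ["A", "B"], alpha=0.0))  # channel only -> A
print(decode("obs", ["A", "B"], alpha=1.0))  # standard MAP -> B
```

In practice α is tuned on held-out data precisely because the plugged-in estimates of P(S) and P(O|S) are imperfect.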

  15. Challenging Problems
  • Generalizability
  • Robustness
  • Adaptability
  • Implementation efficiency
  • Cost
  • When do we need complex models?

  16. Case Study: Word Breaker
  • O: Twitter hashtag or URL domain name (e.g. "247moms", "w84um8")
  • S: what the user meant to say (e.g. "24_7_moms", "w8_4_u_m8" = "wait for you mate")
  • [Diagram: Signal (S) → Transformation Channel → Output (O)]

  17. Word Breaking Challenge
  • Norvig (CIKM 2008): Large Data + Simple Model
  • Unigram model
  • Good enough, but sometimes yields embarrassing outcomes
  • Simple extension to trigram
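A Norvig-style unigram word breaker can be sketched in a few lines of dynamic programming. The word counts below are hand-made stand-ins for a Web-scale unigram model, and the flat unknown-word penalty replaces proper smoothing; this is a sketch of the technique, not the talk's actual system.

```python
import math
from functools import lru_cache

# Hand-made counts standing in for a Web-scale unigram model.
COUNTS = {"24": 50, "7": 80, "moms": 30, "mom": 40, "s": 5,
          "wait": 60, "for": 200, "you": 300, "mate": 20}
TOTAL = sum(COUNTS.values())

def log_p(word):
    # Crude unknown-word penalty in place of proper smoothing.
    return math.log(COUNTS.get(word, 0.01) / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return the best (log-probability, word list) segmentation of text."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        rest_score, rest_words = segment(text[i:])
        candidates.append((log_p(text[:i]) + rest_score,
                           [text[:i]] + rest_words))
    return max(candidates, key=lambda c: c[0])

print(segment("247moms")[1])  # -> ['24', '7', 'moms']
```

The trigram extension mentioned on the slide replaces the independent `log_p(word)` terms with probabilities conditioned on the previous words, which is what suppresses the "embarrassing outcomes" of the unigram model.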

  18. Additional Challenges for Web Applications
  • Demo: bing.com/?q=word+breaker+web+era

  19. Prior Art
  • Simple heuristics
  • BI: binomial model (Venkataraman, 2001)
  • log P(O|S) = n log P# + (|S| − n − 1) log (1 − P#)
  • GM: geometric mean (Koehn and Knight, 2003)
  • Widely used, especially in MT systems
  • WL: word-length normalization (Khaitan et al., 2009)
  • log P(O|S) = Σ_i log P(|w_i|)
  • ME: maximum entropy principle (WWW '11)
  • P# = 0.5, α = 1.0, P(S) using MS Web N-gram
  • Bayesian
  • Modular linguistic model (Brent, 1999)
  • Dirichlet process (Goldwater et al., 2006)
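For concreteness, the BI baseline can be written out directly. The reading below — |S| as the character count of the signal and n as the number of inserted boundaries — is my interpretation of the slide's formula. Note that with the P# = 0.5 setting quoted on the slide, every segmentation of a given string receives the same channel score, which helps explain why BI is only a baseline.

```python
import math

def log_p_binomial(words, p_boundary=0.5):
    """Binomial model: each of the |S|-1 gaps between characters
    independently becomes a word boundary with probability P#.
    log P(O|S) = n*log(P#) + (|S| - n - 1)*log(1 - P#)."""
    n_chars = sum(len(w) for w in words)   # |S|
    n = len(words) - 1                     # boundaries inserted
    return (n * math.log(p_boundary)
            + (n_chars - n - 1) * math.log(1 - p_boundary))

# With P# = 0.5, all segmentations of "247moms" score identically:
print(log_p_binomial(["24", "7", "moms"]))  # 6 gaps -> 6 * log(0.5)
print(log_p_binomial(["247moms"]))          # also 6 * log(0.5)
```

When every candidate gets the same P(O|S), the decision is carried entirely by the prior P(S), i.e. the Web n-gram language model.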

  20. [Results charts: word-breaking accuracy on title, query, and anchor-text data.] Note: BI, WL are cheating experiments!

  21. Best = Right Data + Smart Model
  • Style of language trumps size of data
  • The right data alleviates the plug-in MAP problem
  • Complicated machine-learning artillery is not required; simple methods suffice
  • Performance scales with model power, as mathematically predicted
  • A smart model gives us:
  • Rudimentary multi-lingual capability
  • Fast inclusion of new words/phrases
  • Reduced need for human labor

  22. From Word Breaking to Spelling Correction
  • http://www.spellerchallenge.com
