Language Processing in the Web Era
Presentation Transcript

Language Processing in the Web Era

Kuansan Wang


Microsoft Research, Redmond WA

Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problems

There is no data like more data

That can be correctly exploited

NLP = Data + Model

  • Data: size does matter, but…

    • Language styles

    • Global, multi-lingual

    • Dynamic

  • Model

    • Simple, but not overly simplistic

    • The less human involvement, the better

      • Machines don’t have to work the same way as humans

      • For many tasks, machines have outperformed humans


(Diagram: Anchor Text, HTML Title, Search Queries)

Example queries: “google earning”, “earnings GOOG”, “gooogle quarterly report”



Tackling the gap between query and document languages

  • Machine Translation

    • Miller et al. (SIGIR-99): latent query generation

    • Berger and Lafferty (SIGIR-99): explicit query model

    • Jin et al. (SIGIR-02): title/body as parallel text

  • Smoothing

    • Lafferty and Zhai (SIGIR-01): divergence model

    • Zhai and Lafferty (SIGIR-02): two-stage smoothing

  • Questions

    • Quantitatively, how big is the problem, really?

    • Computationally, what insights point toward solutions?

Microsoft Web N-gram Service

  • Cloud-based Web Service

  • Web documents/search queries received by Bing (EN-US market)

    • Live with June-09, April-10 snapshots

    • Training/adaptation tokens: ~1.2T per snapshot

    • CALM (ICASSP-2009)


Cross-Language Perplexities on Query

(Chart: cross-language perplexities, June-09 snapshot)
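As a refresher on the metric the chart reports, perplexity is the exponential of the negative mean log-probability of the tokens. A minimal sketch with two hypothetical unigram models (all probabilities invented for illustration):

```python
import math

def perplexity(tokens, logprob):
    # Perplexity = exp(-(1/N) * sum of log P(w)); lower means a better fit.
    return math.exp(-sum(logprob(w) for w in tokens) / len(tokens))

# Hypothetical unigram log-probabilities from a document-trained model
# vs. a query-trained model (illustrative numbers only).
doc_lp = {"google": math.log(0.02), "earnings": math.log(0.001)}.get
query_lp = {"google": math.log(0.05), "earnings": math.log(0.01)}.get

query = ["google", "earnings"]
# The query-trained model fits query text better, so its perplexity on the
# query is lower; that gap is what a cross-language comparison measures.
```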

Rapid Pace of the Web

  • Top unigrams for Web docs change a lot

    • Ref. Top 100K words from MS Web N-gram

  • Search query changes more quickly

  • Real-time media even more so

    • Twitter, Facebook updates

  • Web is not a “dead” corpus

    • Adaptation capability critical for Web NLP

MAP Decision Approach

  • Channel Coding; Bayesian Minimum Risk…

  • Speech, MT, Parsing, Tagging, Information Retrieval

  • Sopt = arg max P(S|O) = arg max P(O|S)P(S)

    • P(O|S): transformation model

    • P(S): prior



(Diagram: Signal (S) → channel → Output (O))
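The decision rule above amounts to scoring every candidate signal by channel likelihood plus prior, in log space. A minimal sketch; the toy vocabulary and the anagram-based channel model are invented for illustration:

```python
import math

def map_decode(observation, candidates, channel_logprob, prior_logprob):
    # MAP rule: Sopt = arg max over S of [log P(O|S) + log P(S)]
    return max(candidates,
               key=lambda s: channel_logprob(observation, s) + prior_logprob(s))

# Toy spelling example: a crude channel that only checks whether the
# observation is an anagram of the candidate, plus a unigram prior.
prior = {"the": math.log(0.9), "ten": math.log(0.1)}
def channel(o, s):
    return math.log(0.8) if sorted(o) == sorted(s) else math.log(0.2)

best = map_decode("teh", list(prior), channel, prior.get)  # → "the"
```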

Plug-in MAP Problem

  • MAP decision is optimal only if P(S) and P(O|S) are “real” distributions

  • Adjustments needed when the probabilistic models include estimation errors or mismatch

    • Simple logarithmic interpolation:

      • Sopt = arg max [log P(O|S) + α · log P(S)]

    • “Random Field”/Machine Learning:

      • Sopt = arg max log P(S|O) = arg max Σi αi · log P(fi|O)
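The effect of the interpolation weight α can be seen on a toy pair of hypotheses; the candidate structure and every number below are hypothetical:

```python
import math

def plugin_map(candidates, alpha):
    # Plug-in MAP with log interpolation: arg max [log P(O|S) + α · log P(S)]
    return max(candidates, key=lambda c: c["log_channel"] + alpha * c["log_prior"])

# Two hypothetical readings of the same observation.
cands = [
    {"s": "a like", "log_channel": math.log(0.6), "log_prior": math.log(0.01)},
    {"s": "alike",  "log_channel": math.log(0.4), "log_prior": math.log(0.20)},
]
# With α = 0 the channel model alone picks "a like";
# with α = 1 the prior flips the decision to "alike".
```

Tuning α is exactly the adjustment the slide calls for when the plug-in models are mismatched.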

Challenging Problems

  • Generalizability

  • Robustness

  • Adaptability

  • Implementation efficiency

  • Cost

  • When do we need complex models?

Case Study: Word Breaker

  • O: Twitter hashtag or URL domain name (e.g. “247moms”, “w84um8”)

  • S: what user meant to say (e.g. “24_7_moms”, “w8_4_u_m8” = “wait for you mate”)



(Diagram: Signal (S) → channel → Output (O))

Word Breaking Challenge

  • Norvig (CIKM 2008): Large Data + Simple Model

    • Unigram model

    • Good enough, but sometimes yields embarrassing outcomes

  • Simple extension: replace the unigram model with a trigram model
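The unigram approach can be sketched as a best-split recursion over all break points; the tiny log-probability table stands in for Web-scale counts and is purely illustrative:

```python
from functools import lru_cache

# Hypothetical unigram log-probabilities (a real system would use
# Web N-gram counts); unknown chunks get a heavy penalty.
LOGP = {"24": -3.0, "7": -2.5, "2": -2.0, "4": -2.0,
        "moms": -4.0, "mom": -4.5, "s": -6.0}
UNKNOWN = -20.0

@lru_cache(maxsize=None)
def segment(text):
    # Return (score, words): the max-probability unigram segmentation.
    if not text:
        return 0.0, ()
    best = (float("-inf"), ())
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        score = LOGP.get(head, UNKNOWN) + tail_score
        if score > best[0]:
            best = (score, (head,) + tail_words)
    return best

score, words = segment("247moms")  # → ("24", "7", "moms")
```

A trigram extension would condition each chunk’s probability on the two preceding words instead of scoring chunks independently.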

Additional Challenges for Web Applications

  • Demo:

Prior Art

  • Simple heuristics

    • BI: Binomial Model (Venkataraman, 2001)

      • log P(O|S) = n · log P# + (|S| − n − 1) · log(1 − P#)

    • GM: Geometric Mean (Koehn and Knight, 2003)

      • Widely used, especially in MT systems

    • WL: Word Length Normalization (Khaitan et al., 2009)

      • log P(O|S) = Σ log P(|wi|)

    • ME: Maximum Entropy Principle (WWW’11)

      • P# = 0.5, α = 1.0, P(S) using MS Web N-gram

  • Bayesian

    • Modular Linguistic Model (Brent, 1999)

    • Dirichlet Process (Goldwater et al, 2006)




Note: BI and WL are cheating experiments!
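The two closed-form heuristics above are cheap to compute; P# = 0.5 matches the setting on the slide, while the word-length distribution here is invented for illustration:

```python
import math

def binomial_logprob(n_chars, n_breaks, p_break=0.5):
    # BI (Venkataraman, 2001): n breaks among |S| - 1 candidate positions,
    # log P(O|S) = n · log P# + (|S| - n - 1) · log(1 - P#)
    return (n_breaks * math.log(p_break)
            + (n_chars - n_breaks - 1) * math.log(1 - p_break))

def wordlength_logprob(words, length_logp):
    # WL: log P(O|S) = Σ log P(|wi|), scoring only the word lengths.
    return sum(length_logp[len(w)] for w in words)

# "24 7 moms": 7 characters, 2 breaks; hypothetical length distribution.
bi = binomial_logprob(n_chars=7, n_breaks=2)
length_logp = {1: math.log(0.3), 2: math.log(0.4), 4: math.log(0.2)}
wl = wordlength_logprob(["24", "7", "moms"], length_logp)
```

Both scores depend only on where the breaks fall, not on which words result, which is why they need the true segmentation length statistics to work ("cheating").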

Best = Right Data + Smart Model

  • Style of language trumps size of data

  • Right data alleviates Plug-in MAP problem

    • Complicated machine learning artillery not required; simple methods suffice

    • Performance scales with model power, as mathematically predicted

  • Smart model gives us:

    • Rudimentary multi-lingual capability

    • Fast inclusion of new words/phrases

    • Reduced need for human labor


From Word Breaking to Spelling Correction