Language Processing in the Web Era


Kuansan Wang

ISRC, Microsoft Research, Redmond, WA

Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problems


There is no data like more data

…that can be correctly exploited

NLP = Data + Model

  • Data, size does matter, but…

    • Language styles

    • Global, multi-lingual

    • Dynamic

  • Model

    • Simple, but not overly simplistic

    • The less human involvement, the better

      • Machines don't have to work the same way as humans

      • For many tasks, machines have outperformed humans


Document language: anchor text, HTML titles

Search queries: "google earning", "earnings GOOG", "gooogle quarterly report"



Tackling the gap between query and document languages

  • Machine Translation

    • Miller et al. (SIGIR-99): latent query generation

    • Berger and Lafferty (SIGIR-99): explicit query model

    • Jin et al. (SIGIR-02): title/body as parallel text

  • Smoothing

    • Lafferty and Zhai (SIGIR-01): divergence model

    • Zhai and Lafferty (SIGIR-02): two-stage smoothing

  • Questions

    • Quantitatively, how big is the problem, really?

    • Computationally, what insights can guide solutions?
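The smoothing line of work cited above (Lafferty and Zhai) scores documents by the likelihood of the query under a smoothed document language model. A minimal sketch of query-likelihood retrieval with Dirichlet smoothing, using tiny made-up documents (the `mu` value and the toy corpus are illustrative assumptions, not from the talk):

```python
import math
from collections import Counter

def dirichlet_score(query, doc_tokens, collection_counts, collection_size, mu=2000):
    """Log query likelihood under a Dirichlet-smoothed document language model."""
    doc_counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    score = 0.0
    for term in query:
        p_coll = collection_counts[term] / collection_size   # background (collection) model
        p = (doc_counts[term] + mu * p_coll) / (dlen + mu)   # smoothed P(term | doc)
        score += math.log(p)                                 # assumes term occurs somewhere in the collection
    return score
```

Smoothing lets a document score non-zero even for query terms it does not contain, which is exactly what bridges part of the query/document language gap.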

Microsoft Web N-gram Service

  • Cloud-based Web Service

  • Web documents/search queries received by Bing (EN-US market)

    • Live with June-09, April-10 snapshots

    • Training/adaptation tokens: ~1.2T per snapshot

    • CALM (ICASSP-2009)


Cross-Language Perplexities on Query

[Chart: query bigram perplexity (PPL) and out-of-vocabulary (OOV) rate, June-09 snapshot]
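The two quantities in the chart can be computed with a few lines. A sketch of bigram perplexity (with add-one smoothing, a deliberately simple stand-in for the smoothing actually used) and OOV rate of test text against a training vocabulary; the toy token lists are illustrative:

```python
import math
from collections import Counter

def bigram_ppl_and_oov(train_tokens, test_tokens):
    """Bigram perplexity (add-one smoothed) of test text under a model
    trained elsewhere, plus the test OOV rate against the training vocabulary."""
    vocab = set(train_tokens)
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    V = len(vocab) + 1                      # +1 for the catch-all <unk> token
    mapped = [t if t in vocab else "<unk>" for t in test_tokens]
    log_prob = sum(
        math.log((bigrams[(p, c)] + 1) / (unigrams[p] + V))   # add-one smoothing
        for p, c in zip(mapped, mapped[1:])
    )
    ppl = math.exp(-log_prob / (len(mapped) - 1))
    oov = sum(t not in vocab for t in test_tokens) / len(test_tokens)
    return ppl, oov
```

Training on document-style text and testing on query-style text (or vice versa) is what produces the cross-language perplexity gaps the slide reports.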

Dynamics of the Web: N-gram Counts

Rapid Pace of the Web

  • Top unigrams for Web docs change a lot

    • Ref. Top 100K words from MS Web N-gram

  • Search query changes more quickly

  • Real-time media even more so

    • Twitter, Facebook updates

  • Web is not a “dead” corpus

    • Adaptation capability critical for Web NLP

MAP Decision Approach

  • Channel Coding; Bayesian Minimum Risk…

  • Speech, MT, Parsing, Tagging, Information Retrieval

  • S_opt = arg max_S P(S|O) = arg max_S P(O|S) P(S)

    • P(O|S): transformation model

    • P(S): prior



[Diagram: Signal (S) → channel → Output (O)]
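The MAP rule above can be sketched as an exhaustive search over candidate signals, scoring each by channel log-likelihood plus prior. A toy spelling-style example; the candidate set, prior numbers, and the edit-distance channel are all illustrative assumptions, not the talk's models:

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row DP."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# Toy prior P(S) over signals (illustrative numbers)
PRIOR_LOGP = {"google": math.log(0.09), "goggle": math.log(0.01)}

def channel_logp(observation, signal):
    # assumed channel: log P(O|S) decays linearly with edit distance
    return -2.0 * edit_distance(observation, signal)

def map_decode(observation, candidates):
    """S_opt = arg max_S [log P(O|S) + log P(S)]."""
    return max(candidates, key=lambda s: channel_logp(observation, s) + PRIOR_LOGP[s])
```

The same skeleton underlies speech, MT, tagging, and retrieval; only the candidate generator, channel, and prior change.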

Plug-in MAP Problem

  • MAP decision is optimal only if P(S) and P(O|S) are “real” distributions

  • Adjustments needed when the probabilistic models include estimation errors or mismatch

    • Simple logarithmic interpolation:

      • S_opt = arg max_S [log P(O|S) + α log P(S)]

    • “Random Field”/Machine Learning:

      • S_opt = arg max_S log P(S|O) = arg max_S Σ_i α_i · log P(f_i|O)
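The logarithmic interpolation above is a one-liner once the component scores exist. A sketch with made-up log-scores showing how the weight α moves the decision between the channel's choice and the prior's choice (all numbers are illustrative):

```python
def interpolated_decode(candidates, channel_logp, prior_logp, alpha=1.0):
    """Plug-in MAP: arg max_S [log P(O|S) + alpha * log P(S)]."""
    return max(candidates, key=lambda s: channel_logp[s] + alpha * prior_logp[s])

# Toy scores: the channel prefers "a", the prior prefers "b"
channel = {"a": -1.0, "b": -3.0}
prior = {"a": -5.0, "b": -1.0}
```

With a small α the (possibly mis-estimated) prior is discounted and the channel's choice wins; with α = 1 the rule reduces to plain MAP. Tuning α on held-out data is the standard remedy for the estimation errors and mismatch the slide describes.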

Challenging Problems

  • Generalizability

  • Robustness

  • Adaptability

  • Implementation efficiency

  • Cost

  • When do we need complex models?

Case Study: Word Breaker

  • O: Twitter hashtag or URL domain name (e.g. "247moms", "w84um8")

  • S: what the user meant to say (e.g. "24_7_moms", "w8_4_u_m8" = "wait for you mate")



[Diagram: Signal (S) → channel → Output (O)]

Word Breaking Challenge

  • Norvig (CIKM 2008): Large Data + Simple Model

    • Unigram model

    • Good enough, but sometimes yields embarrassing outcomes

  • Simple extension: replace the unigram with a trigram model
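The unigram approach attributed to Norvig can be sketched as a memoized search over all split points, choosing the segmentation with the highest product of word probabilities. The tiny vocabulary and the per-character penalty for unknown chunks are assumptions for illustration, not real Web N-gram statistics:

```python
import math
from functools import lru_cache

# Toy unigram log-probabilities (illustrative, not real Web N-gram counts)
LOGP = {w: math.log(p) for w, p in {
    "wait": 0.02, "for": 0.05, "you": 0.05, "mate": 0.01,
    "w8": 1e-4, "4": 1e-3, "u": 1e-3, "m8": 1e-4,
}.items()}

def word_logp(w):
    # assumed per-character penalty: keeps long unknown chunks costly
    return LOGP.get(w, math.log(1e-4) * len(w))

def segment(text, max_word_len=10):
    """Most probable segmentation under a unigram model (Norvig-style)."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return 0.0, ()
        options = []
        for j in range(i + 1, min(i + max_word_len, len(text)) + 1):
            score, rest = best(j)
            options.append((word_logp(text[i:j]) + score, (text[i:j],) + rest))
        return max(options)
    return list(best(0)[1])
```

The trigram extension keeps the same search but conditions each word's probability on the two preceding words, which is what removes the unigram model's occasional embarrassing splits.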

Additional Challenges for Web Applications

  • (Live demo)

Prior Art

  • Simple heuristics

    • BI: Binomial Model (Venkataraman, 2001)

      • log P(O|S) = n · log P# + (|S| − n − 1) · log(1 − P#)

    • GM: Geometric Mean (Koehn and Knight, 2003)

      • Widely used, especially in MT systems

    • WL: Word Length Normalization (Khaitan et al., 2009)

      • log P(O|S) = Σ_i log P(|w_i|)

    • ME: Maximum Entropy Principle (WWW'11)

      • P# = 0.5, α = 1.0, P(S) using MS Web N-gram

  • Bayesian

    • Modular Linguistic Model (Brent, 1999)

    • Dirichlet Process (Goldwater et al, 2006)
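The BI channel model above treats each of the |S| − 1 character gaps as an independent break with probability P#. It can be computed directly (the example words are from the earlier slide; P# = 0.5 follows the ME bullet):

```python
import math

def bi_channel_logp(words, p_break=0.5):
    """BI model: log P(O|S) = n*log(P#) + (|S| - n - 1)*log(1 - P#),
    where n is the number of breaks and |S| the character length of the string."""
    n = len(words) - 1
    length = sum(len(w) for w in words)
    return n * math.log(p_break) + (length - n - 1) * math.log(1 - p_break)
```

Note that with P# = 0.5 every segmentation of the same string scores identically (each of the |S| − 1 gaps contributes log 0.5 whether broken or not), so the prior P(S) alone decides the outcome.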




Note: BI and WL are cheating experiments!

Best = Right Data + Smart Model

  • Style of language trumps size of data

  • Right data alleviates Plug-in MAP problem

    • Complicated machine learning artillery not required; simple methods suffice

    • Performance scales with model power, as mathematically predicted

  • Smart model gives us:

    • Rudimentary multi-lingual capability

    • Fast inclusion of new words/phrases

    • Reduced need for human labor

From Word Breaking to Spelling Correction
