
Language Processing in the Web Era

Kuansan Wang

ISRC

Microsoft Research, Redmond WA



Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problem



CIKM’08



There is no data like more data

That can be correctly exploited



NLP = Data + Model

  • Data, size does matter, but…

    • Language styles

    • Global, multi-lingual

    • Dynamic

  • Model

    • Simple, but not overly simplistic

    • The less human involvement, the better

      • Machines don’t have to work the same way humans do

      • On many tasks, machines have outperformed humans



[Figure: text streams associated with a web document: URL, Anchor Text, HTML Title, Heading, Caption, Body, and Search Queries (e.g. “google earning”, “earnings GOOG”, “gooogle quarterly report”)]



Tackling the gap between query and document languages

  • Machine Translation

    • Miller et al. (SIGIR-99): latent query generation

    • Berger and Lafferty (SIGIR-99): explicit query model

    • Jin et al. (SIGIR-02): title/body as parallel text

  • Smoothing

    • Lafferty and Zhai (SIGIR-01): divergence model

    • Zhai and Lafferty (SIGIR-02): two-stage smoothing

  • Questions

    • Quantitatively, how big is the problem really?

    • Computationally, what insights point toward solutions?



Microsoft Web N-gram Service

  • Cloud-based Web Service

  • Web documents/search queries received by Bing (EN-US market)

    • Live with June-09, April-10 snapshots

    • Training/adaptation tokens: ~1.2T per snapshot

    • CALM (ICASSP-2009)

  • http://web-ngram.research.microsoft.com



Cross-Language Perplexities on Query

June-09 Snapshot



Query bigram PPL and OOV rate



Dynamics of the Web: N-gram Counts



Rapid Pace of the Web

  • Top unigrams for Web docs change a lot

    • Ref. Top 100K words from MS Web N-gram

  • Search query changes more quickly

  • Real-time media even more so

    • Twitter, Facebook updates

  • Web is not a “dead” corpus

    • Adaptation capability critical for Web NLP



MAP Decision Approach

  • Channel Coding; Bayesian Minimum Risk…

  • Speech, MT, Parsing, Tagging, Information Retrieval

  • S_opt = arg max_S P(S|O) = arg max_S P(O|S) P(S)

    • P(O|S): transformation model

    • P(S): prior

[Diagram: Signal (S) → Channel (Distortion) → Output (O)]
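
As a rough illustration of the MAP decision rule on this slide, the sketch below scores each candidate signal S by log P(O|S) + log P(S) and keeps the highest-scoring one. The candidate set and every probability are toy values invented for the example, and `map_decode` is a hypothetical helper name, not something from the talk.

```python
import math

# Toy numbers, invented for illustration only.
# Observed output O is the misspelled query "gooogle".
PRIOR = {"google": 0.9, "goggle": 0.0999, "gooogle": 0.0001}   # P(S)
CHANNEL = {"google": 0.05, "goggle": 0.01, "gooogle": 0.9}     # P(O | S)


def map_decode(prior, channel):
    """Return arg max over S of log P(O|S) + log P(S)."""
    def log_posterior(s):
        return math.log(channel[s]) + math.log(prior[s])
    return max(prior, key=log_posterior)


best = map_decode(PRIOR, CHANNEL)
print(best)
```

With these toy values the strong prior on "google" outweighs the channel's preference for leaving the string unchanged.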



Plug-in MAP Problem

  • MAP decision is optimal only if P(S) and P(O|S) are “real” distributions

  • Adjustments needed when the probabilistic models include estimation errors or mismatch

    • Simple logarithmic interpolation:

      • S_opt = arg max_S [log P(O|S) + α log P(S)]

    • “Random Field”/Machine Learning:

      • S_opt = arg max_S log P(S|O) = arg max_S Σ_i α_i log P(f_i|O)
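
To make the simple logarithmic interpolation concrete, here is a minimal sketch in which the weight α on the prior changes the decision. The two candidates, their probabilities, and the name `interpolated_decode` are assumptions made up for illustration; the talk does not give numbers.

```python
import math

# Toy values, invented for illustration: S -> (P(O|S), P(S)).
CANDIDATES = {"A": (0.5, 0.01), "B": (0.1, 0.2)}


def interpolated_decode(candidates, alpha):
    """Return arg max over S of log P(O|S) + alpha * log P(S)."""
    def score(s):
        p_o_given_s, p_s = candidates[s]
        return math.log(p_o_given_s) + alpha * math.log(p_s)
    return max(candidates, key=score)


print(interpolated_decode(CANDIDATES, alpha=1.0))  # plain MAP
print(interpolated_decode(CANDIDATES, alpha=0.1))  # prior down-weighted
```

With α = 1 the rule reduces to the plain MAP decision; shrinking α lets the transformation model dominate, which is one way to compensate for a poorly estimated prior.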



Challenging Problems

  • Generalizability

  • Robustness

  • Adaptability

  • Implementation efficiency

  • Cost

  • When do we need complex models?



Case Study: Word Breaker

  • O: Twitter hashtag or URL domain name (e.g. “247moms”, “w84um8”)

  • S: what the user meant to say (e.g. “24_7_moms”, “w8_4_u_m8” = “wait for you mate”)

[Diagram: Signal (S) → Channel (Transformation) → Output (O)]



Word Breaking Challenge

  • Norvig (CIKM 2008): Large Data + Simple Model

    • Unigram model

    • Good enough, but sometimes yields embarrassing outcomes

  • Simple extension to trigram:
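
A unigram word breaker of the kind the first bullet describes can be sketched with dynamic programming over split points. This is only an illustration under stated assumptions: the tiny `UNIGRAM` table, the unknown-word penalty, and the function names are all hypothetical, whereas a real system would draw probabilities from a large corpus.

```python
import math

# Hypothetical toy unigram probabilities, invented for illustration.
UNIGRAM = {
    "24": 0.02, "7": 0.03, "moms": 0.01, "mom": 0.015, "s": 0.001,
    "2": 0.01, "4": 0.02, "w8": 0.001, "u": 0.01, "m8": 0.001,
}
MAX_WORD_LEN = 20


def word_logprob(word):
    """Log probability of a word; unseen words get a length-based penalty."""
    if word in UNIGRAM:
        return math.log(UNIGRAM[word])
    # Assumed penalty: longer unknown strings are exponentially less likely.
    return math.log(1e-6) - len(word) * math.log(10)


def segment(text):
    """Return the most probable split of text under the unigram model."""
    # best[i] holds (log probability, words) for the prefix text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        options = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score, words = best[start]
            word = text[start:end]
            options.append((score + word_logprob(word), words + [word]))
        best.append(max(options, key=lambda option: option[0]))
    return best[-1][1]


print(segment("247moms"))
print(segment("w84um8"))
```

Because each word is scored independently, the model has no notion of context, which is the limitation a higher-order n-gram extension addresses.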



Additional Challenges for Web Applications

  • Demo: bing.com/?q=word+breaker+web+era



Prior Art

  • Simple heuristics

    • BI: Binomial Model (Venkataraman, 2001)

      • log P(O|S) = n * log P# + (|S|– n – 1) * log (1-P#)

    • GM: Geometric Mean (Koehn and Knight, 2003)

      • Widely used, especially in MT systems

    • WL: Word Length Normalization (Khaitan et al., 2009)

      • log P(O|S) = Σ log P(|wi|)

    • ME: Maximum Entropy Principle (WWW’11)

      • P# = 0.5, α = 1.0, P(S) using MS Web N-gram

  • Bayesian

    • Modular Linguistic Model (Brent, 1999)

    • Dirichlet Process (Goldwater et al, 2006)
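
The binomial (BI) formula and the ME setting P# = 0.5 can be related with a short numeric check. The sketch assumes that |S| counts the characters of the string, so that there are |S| – 1 possible boundary positions and n of them are used; that reading, and the function name, are assumptions rather than something the slide states.

```python
import math


def binomial_log_likelihood(n_boundaries, length, p_boundary):
    """log P(O|S) = n * log P# + (|S| - n - 1) * log(1 - P#)."""
    return (n_boundaries * math.log(p_boundary)
            + (length - n_boundaries - 1) * math.log(1.0 - p_boundary))


length = len("247moms")  # 7 characters, hence 6 possible boundary positions

# With P# = 0.5 every segmentation receives the same transformation score,
# so only the prior P(S) can distinguish between candidates.
at_half = [binomial_log_likelihood(n, length, 0.5) for n in range(length)]
print(at_half)

# With a small P#, segmentations with fewer boundaries are favored.
few = binomial_log_likelihood(0, length, 0.1)
many = binomial_log_likelihood(2, length, 0.1)
print(few, many)
```

Under this reading, P# = 0.5 makes the transformation term a constant, leaving the decision entirely to the language model prior.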



[Figure: panels labeled Title, Query, Anchor]

Note: BI and WL are cheating experiments!



Best = Right Data + Smart Model

  • Style of language trumps size of data

  • Right data alleviates Plug-in MAP problem

    • Complicated machine learning artillery not required; simple methods suffice

    • Performance scales with model power, as mathematically predicted

  • Smart model gives us:

    • Rudimentary multi-lingual capability

    • Fast inclusion of new words/phrases

    • Reduced need for human labor



http://www.spellerchallenge.com

From Word Breaking to Spelling Correction

