
Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing

Shane Bergsma

Johns Hopkins University

Fall, 2011


Research Vision

Robust processing of human language requires knowledge beyond what’s in small manually-annotated data sets

Many NLP successes exploit web-scale raw data:

  • Google Translate

  • IBM’s Watson

  • Things people use every day:

    • spelling correction, speech recognition, etc.


More data is better data

(Figure) Learning curves from [Banko & Brill, 2001]: grammar correction task @ Microsoft.


This Talk

Derive lots of knowledge from web-scale data and apply it to syntax, semantics, and discourse:

  • Raw text on the web (Google N-grams)

    • Part 1: Non-referential pronouns

  • Bilingual text (words plus their translations)

    • Part 2: Parsing noun phrases

  • Visual data (labelled online images)

    • Part 3: Learning the meaning of words


Search Engines for NLP

  • Early web work: Use an Internet search engine to get data

    [Keller & Lapata, 2003]


Search Engines

  • Search Engines for NLP: some objections

    • Scientific: not reproducible, unreliable [Kilgarriff, 2007, “Googleology is bad science.”]

    • Practical: Too slow for millions of queries


N-grams

  • Google N-gram Data [Brants & Franz, 2006]

    • N words in sequence + their count on web

    • A compressed version of all the text on web

      • 24 GB zipped fits on your hard drive

    • Enables better features for a range of tasks [Bergsma et al. ACL 2008, IJCAI 2009, ACL 2010, etc.]


Part 1: Non-Referential Pronouns

E.g. the word “it” in English

  • “You can make it in advance.”

    • referential (50-75%)

  • “You can make it in Hollywood.”

    • non-referential (25-50%)


Non-Referential Pronouns

  • [Hirst, 1981]: detect non-referential pronouns, “lest precious hours be lost in bootless searches for textual referents.”

  • Most existing pronoun/coreference systems just ignore the problem

  • A common ambiguity:

    • “it” comprises 1% of English tokens


Non-Ref Detection as Classification

  • Input:

    s = “You can make it in advance”

  • Output:

    Is it a non-referential pronoun in s?

    Method: train a supervised classifier to make this decision on the basis of some features

[Evans, 2001, Boyd et al. 2005, Müller 2006]


A Machine Learning Approach

h(x) = w∙x (predict non-ref if h(x) > 0)

  • Typical ‘lexical’ features: binary indicators of context:

    x = (previous-word=make, next-word=in, previous-two-words=can+make, …)

  • Use training data to learn good values for the weights, w

    • Classifier learns, e.g., to give negative weight to PPs immediately preceding ‘it’ (e.g. … from it)
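
To make this concrete, here is a minimal sketch (not the talk's implementation) of binary indicator features scored by a linear classifier; the feature names follow the slide, but the helper functions, weights and example sentence are invented for illustration.

```python
# Minimal sketch: binary indicator features over the pronoun's context,
# scored by a linear classifier h(x) = w.x. Weights/examples are invented.

def lexical_features(tokens, i):
    """Binary indicator features for the pronoun at position i."""
    feats = set()
    if i > 0:
        feats.add("previous-word=" + tokens[i - 1])
    if i + 1 < len(tokens):
        feats.add("next-word=" + tokens[i + 1])
    if i > 1:
        feats.add("previous-two-words=" + tokens[i - 2] + "+" + tokens[i - 1])
    return feats

def h(feats, w):
    """Linear score; predict non-referential if the score is > 0."""
    return sum(w.get(f, 0.0) for f in feats)

# e.g. a learned negative weight for a preposition right before 'it' ("... from it")
w = {"previous-word=from": -1.2, "next-word=in": 0.3}
x = lexical_features("He suffered from it briefly".split(), 3)
print("non-referential" if h(x, w) > 0 else "referential")   # -> referential
```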


Better: Features from the Web

[Bergsma, Lin, Goebel, ACL 2008]

  • Convert sentence to a context pattern:

    “make ____ in advance”

  • Collect counts from the web:

    • “make it/them in advance”

      • 442 vs. 449 occurrences in Google N-gram Data

    • “make it/them in Hollywood”

      • 3421 vs. 0 occurrences in Google N-gram Data


Applying the Web Counts

  • How wide should the patterns span?

    • We can use all that Google N-gram Data allows:

      You can make _

      can make _ in

      make _ in advance

      _ in advance .

    • Five 5-grams, four 4-grams, three 3-grams and two bigrams

  • What fillers to use? (e.g. it, they/them, any NP?)


Web Count Features

“it”:

  • 5-grams: log-cnt(“You can make it in”), log-cnt(“can make it in advance”), log-cnt(“make it in advance .”), ...

  • 4-grams: log-cnt(“You can make it”), log-cnt(“can make it in”), ...

  • ...

“them”:

  • 5-grams: log-cnt(“You can make them in”), ...

  • ...
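
As a rough sketch of how these features could be generated, the snippet below enumerates the context patterns around the pronoun for each filler and looks them up in a count table; the helper names and the toy counts are assumptions, not the released code.

```python
import math

def patterns(tokens, i, filler, max_n=5):
    """All 2- to max_n-grams spanning position i, with the pronoun replaced by `filler`."""
    out = []
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(tokens):
                ngram = tokens[start:start + n]
                ngram[i - start] = filler
                out.append(" ".join(ngram))
    return out

def web_count_features(tokens, i, counts):
    """Real-valued log-count features, one per (filler, pattern) pair."""
    feats = {}
    for filler in ("it", "them", "they"):
        for p in patterns(tokens, i, filler):
            feats["log-cnt(%s)" % p] = math.log(1 + counts.get(p, 0))
    return feats

# Toy counts standing in for the Google N-gram data:
counts = {"make it in advance": 442, "make them in advance": 449}
feats = web_count_features("You can make it in advance .".split(), 3, counts)
print(feats["log-cnt(make it in advance)"], feats["log-cnt(make them in advance)"])
```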


A Machine Learning Approach Revisited

h(x) = w∙x (predict non-ref if h(x) > 0)

  • Typical features: binary indicators of context:

    x = (previous-word=make, next-word=in, previous-two-words=can+make, …)

  • New features: real-valued counts in web text:

    x = (log-cnt(“make it in advance”), log-cnt(“make them in advance”), log-cnt(“make * in advance”), …)

  • Key conclusion: classifiers with web features are robust on new domains!

[Bergsma, Pitler, Lin, ACL 2010]


NADA

[Bergsma & Yarowsky, DAARC 2011]

  • Non-Anaphoric Detection Algorithm:

    • a system for identifying non-referential pronouns

  • Works on raw sentences; no parsing/tagging of input needed

  • Classifies ‘it’ in up to 20,000 sentences/second

  • It works well when used out-of-domain

    • Because it’s got those Web count features

http://code.google.com/p/nada-nonref-pronoun-detector/


Using web counts works great… but is it practical?

  • All N-grams in the Google N-gram corpus: 93 GB

  • Extract N-grams of length-4 only: 33 GB

  • Extract N-grams containing it, they, them only: 500 MB

  • Lower-case, truncate tokens to four characters, replace special tokens (e.g. named entities, pronouns, digits) with symbols, etc.: 189 MB

  • Encode tokens (6 bytes) and values (2 bytes), store only changes from previous line: 44 MB

  • gzip the resulting file: 33 MB
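
A hypothetical sketch of the first filtering and normalization steps; the tab-separated “ngram<TAB>count” input format and the normalization details are assumptions, not the released NADA code.

```python
PRONOUNS = {"it", "they", "them"}

def normalize(token):
    """Lower-case, map digit strings to a symbol, truncate to four characters."""
    token = token.lower()
    if token.isdigit():
        return "<num>"
    return token[:4]

def filter_ngrams(lines):
    """Keep only 4-grams containing it/they/them, in normalized form."""
    for line in lines:
        ngram, count = line.rsplit("\t", 1)
        tokens = ngram.split()
        if len(tokens) == 4 and PRONOUNS & {t.lower() for t in tokens}:
            yield " ".join(normalize(t) for t in tokens), int(count)

sample = ["make it in advance\t442", "of the United States\t9999"]
print(list(filter_ngrams(sample)))   # [('make it in adva', 442)]
```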


NADA versus Other Systems


Part 1: Conclusion

  • N-gram data better than search engines

  • Classifiers with N-gram counts are very effective, particularly on new domains

  • But we needed a large corpus of manually-annotated data to learn how to use the counts

  • We’ll see now how bilingual data can provide the supervision (for some problems)


Part 2: Coordination Ambiguity in NPs

  • (1) [dairy and meat] production

  • (2) [sustainability] and [meat production]

    yes: [dairy production] in (1)

    no: [sustainability production] in (2)

  • new semantic features from raw web text and a new approach to using bilingual data as soft supervision

  • [Bergsma, Yarowsky & Church, ACL 2011]


    Coordination Ambiguity

    • Words whose POS tags match pattern:

      [DT|PRP$] (N.*|J.*) and [DT|PRP$] (N.*|J.*) N.*

    • Output: Decide if one NP or two

    • Resolving Coordination is classic hard problem

      • Treebank doesn’t annotate NP-internal structure

      • Modern parsers thus do very poorly on these decisions (78% Minipar, 79% for C&C parser)

      • For training/evaluation, we patched Treebank with Vadas & Curran ’07 NP annotations
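
For illustration only, a rough regular expression over a “word_TAG” token string that approximates the POS pattern above; the tag-string encoding, the optional determiner, and the example are assumptions, not the paper's implementation.

```python
import re

# Tokens encoded as "word_TAG"; this loosely mirrors the pattern
# [DT|PRP$] (N.*|J.*) and [DT|PRP$] (N.*|J.*) N.*  (determiners made optional here).
COORD_NP = re.compile(
    r"(?:\S+_(?:DT|PRP\$)\s+)?\S+_(?:N|J)\S*\s+and_CC\s+"
    r"(?:\S+_(?:DT|PRP\$)\s+)?\S+_(?:N|J)\S*\s+\S+_N\S*"
)

tagged = "dairy_NN and_CC meat_NN production_NN"
print(bool(COORD_NP.search(tagged)))   # True -> candidate coordination ambiguity
```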


    One Noun Phrase or Two:A Machine Learning Approach

    Input: “dairy and meat production” → features: x

    x = (…, first-noun=dairy, …

    second-noun=meat, …

    first+second-noun=dairy+meat, …)

    h(x) = w∙x (predict one NP if h(x) > 0)

    • Set w via training on annotated training data using some machine learning algorithm


    Leveraging Web-Derived Knowledge

    [dairy and meat] production

    • If there is only one NP, then it is implicitly talking about “dairy production”

      • Do we see this phrase occurring a lot on the web? [Yes]

    sustainability and [meat production]

    • If there is only one NP, then it is implicitly talking about “sustainability production”

      • Do we see this phrase occurring a lot on the web? [No]

  • Classifier has features for these counts (see the sketch below)

    • But the web can give us more!
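
A minimal sketch of how these implicit-phrase counts could become classifier features; the function name and the count values are invented for the example.

```python
import math

def implicit_phrase_features(first, second, head, counts):
    """If '[first and second] head' is one NP, it implies the phrase 'first head'.
    Use (log) web counts of the implied and explicit phrases as features."""
    return {
        "log-cnt(first+head)":  math.log(1 + counts.get(first + " " + head, 0)),
        "log-cnt(second+head)": math.log(1 + counts.get(second + " " + head, 0)),
    }

# Invented counts standing in for web counts:
counts = {"dairy production": 12000, "meat production": 45000}
print(implicit_phrase_features("dairy", "meat", "production", counts))           # both phrases attested
print(implicit_phrase_features("sustainability", "meat", "production", counts))  # "sustainability production" unattested
```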


    Features for Explicit Paraphrases

    (Figure) Paraphrase pattern templates over the conjuncts ❶, ❷ and the head ❸.

    New paraphrases extending ideas in [Nakov & Hearst, 2005]


    (Diagram) Human-Annotated Data (small) supplies training examples such as “motor and heating fuels”, “freedom and security agenda”, “conservation and good management”; Raw Data (HUGE), the Google N-gram Data, supplies feature vectors x1, x2, x3, x4; Machine Learning produces the classifier h(x).


    Using Bilingual Data

    • Bilingual data: a rich source of paraphrases

      dairy and meat production ↔ producción láctea y cárnica

    • Build a classifier which uses bilingual features

      • Applicable when we know the translation of the NP


    Bilingual “Paraphrase” Features

    (Figure) Bilingual paraphrase pattern templates over the conjuncts ❶, ❷ and the head ❸, matched in the translations.


    (Diagram) The same pipeline with Bilingual Data (medium): Human-Annotated Data (small) supplies training examples (“motor and heating fuels”, “freedom and security agenda”, “conservation and good management”); the Translation Data supplies feature vectors x1, x2, x3, x4; Machine Learning produces the classifier h(xb).


    (Diagram) Two views of the same unlabeled bitext examples (e.g. “coal and steel money”, “North and South Carolina”, “rocket and mortar attacks”, “pollution and transport safety”, “business and computer science”, “insurrection and regime change”, “the environment and air transport”, “the Bosporus and Dardanelles straits”): h(xm) is trained on the labeled training examples plus features from the Google data; h(xb) on the training examples plus features from the translation data.


    (Diagram) Each classifier labels the bitext examples on which it is most confident (e.g. “coal and steel money” and “rocket and mortar attacks” are the first to be added), and those newly-labeled examples are added to the other view’s training data.


    (Diagram) The process repeats, with each classifier’s confidently-labeled examples growing the other’s training set.

    Co-Training: [Yarowsky ’95], [Blum & Mitchell ’98]
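
A generic sketch of such a two-view co-training loop; the classifier interface, feature views, and confidence-selection rule below are generic assumptions, not the specific setup of the paper.

```python
def cotrain(labeled, unlabeled, train, rounds=10, k=5):
    """Generic two-view co-training sketch.
    `train(view, examples)` returns a classifier with a `score(x)` method
    giving (label, confidence). Each round, each view labels its k most
    confident unlabeled examples and adds them to the other view's data."""
    data = {"mono": list(labeled), "bi": list(labeled)}
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for src, dst in (("mono", "bi"), ("bi", "mono")):
            clf = train(src, data[src])
            scored = sorted(((clf.score(x), x) for x in pool),
                            key=lambda item: -item[0][1])[:k]
            for (label, _conf), x in scored:
                data[dst].append((x, label))
                pool.remove(x)
    return train("mono", data["mono"]), train("bi", data["bi"])
```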


    (Figure) Error rate (%) of the co-trained classifiers h(xb)i and h(xm)i over co-training iterations.


    (Figure) Error rate (%) on the Penn Treebank (PTB): an unsupervised baseline, classifiers trained on 800 PTB training examples, and the co-trained h(xm)N seeded with only 2 training examples.

    Part 2: Conclusion

    • Knowledge from large-scale monolingual corpora is crucial for parsing noun phrases

      • New paraphrase features

    • New way to use bilingual data as soft supervision to guide the use of monolingual features


    Part 3: Using visual data to learn the meaning of words

    • Large volumes of visual data also reveal meaning (semantics), but in a language-universal way

    • Humans label their images as they post them online, providing the word-meaning link

    • There are lots of images to work with

    [from Facebook’s Twitter feed]


    (Images) English web images: cockatoo, turtle, candle; Spanish web images: cacatúa, tortuga, vela.

    [Bergsma and Van Durme, IJCAI 2011]


    Linking bilingual words by web-based visual similarity

    Step 1: Retrieve online images via Google Image Search (in each lang.), 20 images for each word

    • Google competitive with “hand-prepared datasets”

    [Fergus et al., 2005]


    Step 2: Create Image Feature Vectors

    Color histogram features
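
A minimal sketch of one way to build a color-histogram feature vector; the binning choices and the random stand-in image are assumptions for the example, not the paper's exact feature extraction.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """image: HxWx3 uint8 RGB array. Returns a normalized joint RGB histogram."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return (hist / hist.sum()).ravel()   # 8*8*8 = 512-dimensional feature vector

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)   # stand-in for a web image
print(color_histogram(img).shape)                           # (512,)
```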


    Step 2: Create Image Feature Vectors

    SIFT keypoint features

    Using David Lowe’s software [Lowe, 2004]


    Step 3: Compute an Aggregate Similarity for Two Words

    (Figure) Cosine similarity between image feature vectors (example values: 0.33, 0.55, 0.19, 0.46); the aggregate score takes the best match for one English image and averages over all English images.
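
A hedged sketch of one plausible reading of the aggregation in the figure (for each English image take its best cosine match among the foreign images, then average); the exact aggregation in the paper may differ, and the toy vectors are invented.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def aggregate_similarity(english_vecs, foreign_vecs):
    """Average, over the English images, of each image's best cosine match
    among the foreign word's images."""
    best = [max(cosine(e, f) for f in foreign_vecs) for e in english_vecs]
    return sum(best) / len(best)

# Toy feature vectors standing in for the 20 images retrieved per word:
eng = [np.random.rand(512) for _ in range(20)]
spa = [np.random.rand(512) for _ in range(20)]
print(aggregate_similarity(eng, spa))
```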


    Output: Ranking of Foreign Translations by Aggregate Visual Similarities

    • Lots of details in the paper:

    • Finding a class of words where this works (physical objects)

    • Comparing visual similarity to string similarity (cognate finder)


    Task #2: Lexical Semantics from Images

    • Selectional Preference:

      • Is noun X a plausible object for verb Y?

    Can you eat “migas”?

    Can you eat “carillon”?

    Can you eat “mamey”?

    • [Bergsma and Goebel, RANLP 2011]


    Conclusion

    • Robust NLP needs to look beyond human-annotated data to exploit large corpora

    • Size matters:

      • Many NLP systems trained on 1 million words

      • We use:

        • billions of words in bitexts

        • trillions of words of monolingual text

        • online images: hundreds of billions

          (× 1000 words each → 100 trillion words!)


    Questions + Thanks

    • Gold sponsors:

    • Platinum sponsors (collaborators):

      • Kenneth Church (Johns Hopkins), Randy Goebel (Alberta), Dekang Lin (Google), Emily Pitler (Penn), Benjamin Van Durme (Johns Hopkins) and David Yarowsky (Johns Hopkins)

