

Unambiguous + Unlimited = Unsupervised
Using the Web for Natural Language Processing Problems

Marti Hearst

School of Information, UC Berkeley

UCB Neyman Seminar

October 25, 2006

This research supported in part by NSF DBI-0317510



Natural Language Processing

  • The ultimate goal: write programs that read and understand stories and conversations.

    • This is too hard! Instead we tackle sub-problems.

  • There have been notable successes lately:

    • Machine translation is vastly improved

    • Speech recognition is decent in limited circumstances

    • Text categorization works with some accuracy



Automatic Help Desk Translation at MS



How can a machine understand these differences?

Get the cat with the gloves.



How can a machine understand these differences?

Get the sock from the cat with the gloves.

Get the glove from the cat with the socks.



How can a machine understand these differences?

  • Decorate the cake with the frosting.

  • Decorate the cake with the kids.

  • Throw out the cake with the frosting.

  • Throw out the cake with the kids.



Why is this difficult?

  • Same syntactic structure, different meanings.

  • Natural language processing algorithms have to deal with the specifics of individual words.

  • Enormous vocabulary sizes.

    • The average English speaker’s vocabulary is around 50,000 words,

    • Many of these can be combined with many others,

    • And they mean different things when they do!



How to tackle this problem?

  • The field was stuck for quite some time.

    • Hand-enter all semantic concepts and relations

  • A new approach started around 1990

    • Get large text collections

    • Compute statistics over the words in those collections

  • There are many different algorithms.



Size Matters

Recent realization: bigger is better than smarter!

Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL



Example Problem

  • Grammar checker example:

    Which word to use?

    <principal> vs. <principle>

  • Solution: use well-edited text and look at which words surround each use:

    • I am in my third year as the principal of Anamosa High School.

    • School-principal transfers caused some upset.

    • This is a simple formulation of the quantum mechanical uncertainty principle.

    • Power without principle is barren, but principle without power is futile. (Tony Blair)



Using Very, Very Large Corpora

  • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:

    • Principal: “high school”

    • Principle: “rule”

  • At grammar-check time, choose the spelling best predicted by the surrounding words.

  • Surprising results:

    • Log-linear improvement even to a billion words!

    • Getting more data is better than fine-tuning algorithms!



The Effects of LARGE Datasets

  • From Banko & Brill ‘01



How to Extend this Idea?

  • This is an exciting result …

  • BUT relies on having huge amounts of text that has been appropriately annotated!



How to Avoid Manual Labeling?

  • “Web as a baseline” (Lapata & Keller ’04, ’05)

  • Main idea: apply web-determined counts to every problem imaginable.

    • Example: for t in {<principal>, <principle>}

    • Compute f(w-1, t, w+1) for each candidate t

    • The largest count wins (see the sketch below)
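
A minimal sketch of this count-and-compare step, assuming a hypothetical get_hit_count helper that returns the number of search-engine hits for a quoted phrase:

    def get_hit_count(phrase):
        """Hypothetical stand-in for a search-engine page-hit query."""
        raise NotImplementedError("replace with a real search-engine count lookup")

    def choose_word(prev_word, next_word, candidates=("principal", "principle")):
        # Score each candidate t by the web count of the context trigram (w-1, t, w+1).
        scores = {t: get_hit_count(f'"{prev_word} {t} {next_word}"') for t in candidates}
        # The candidate with the largest count wins.
        return max(scores, key=scores.get)

    # e.g. choose_word("school", "transfers") would be expected to prefer "principal"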



Web as a Baseline

  • Works very well in some cases

    • machine translation candidate selection

    • article generation

    • noun compound interpretation

    • noun compound bracketing

    • adjective ordering

  • But lacking in others

    • spelling correction

    • countability detection

    • prepositional phrase attachment

  • How to push this idea further?

(Slide legend: significantly better than the best supervised algorithm; not significantly different from the best supervised.)



Using Unambiguous Cases

  • The trick: look for unambiguous cases to start

  • Use these to improve the results beyond what co-occurrence statistics indicate.

  • An Early Example:

    • Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling ’93

    • Problem: Prepositional Phrase attachment

      • I eat/v spaghetti/n1 with/p a fork/n2.

      • I eat/v spaghetti/n1 with/p sauce/n2.

    • Question: does n2 attach to v or to n1?



Using Unambiguous Cases

  • How to do this with unlabeled data?

  • First try:

    • Parse some text into phrase structure

    • Then compute certain co-occurrences

      f(v, n1, p), f(n1, p), f(v, n1)

    • Problem: results not accurate enough

  • The trick: look for unambiguous cases:

    • Spaghetti with sauce is delicious. (pre-verbal)

    • I eat with a fork. (no direct object)

  • Use these to improve the results beyond what co-occurrence statistics indicate (see the sketch below).
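
A rough sketch of how the unambiguous cases can bootstrap the attachment decision. The counts are assumed to have been collected already from pre-verbal NPs (noun evidence) and from verbs with no direct object (verb evidence), and the scoring is a simplification of Hindle and Rooth's actual statistic:

    from collections import Counter

    # Counts gathered from (nearly) unambiguous contexts -- assumed to be filled elsewhere.
    verb_prep = Counter()   # f(v, p): "I eat with a fork" (no direct object)
    noun_prep = Counter()   # f(n1, p): "Spaghetti with sauce is delicious" (pre-verbal NP)
    verb_freq = Counter()
    noun_freq = Counter()

    def attach(v, n1, p):
        """Compare how strongly p associates with the verb vs. with the noun."""
        p_given_v = verb_prep[(v, p)] / verb_freq[v] if verb_freq[v] else 0.0
        p_given_n = noun_prep[(n1, p)] / noun_freq[n1] if noun_freq[n1] else 0.0
        return "verb" if p_given_v > p_given_n else "noun"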



Unambiguous + Unlimited = Unsupervised

  • Apply the Unambiguous Case Idea to the Very, Very Large Corpora idea

    • The potential of these approaches is not fully realized

  • Our work (with Preslav Nakov):

    • Structural Ambiguity Decisions

      • PP-attachment

      • Noun compound bracketing

      • Coordination grouping

    • Semantic Relation Acquisition

      • Hypernym (ISA) relations

      • Verbal relations between nouns

        • SAT Analogy problems



Applying U + U = U to Structural Ambiguity

  • We introduce the use of (nearly) unambiguous features:

    • Surface features

    • Paraphrases

  • Combined with n-grams

  • Computed from very, very large corpora

  • Achieve state-of-the-art results without labeled examples.


Noun Compound Bracketing

(a) [ [liver cell] antibody ]  (left bracketing)

(b) [ liver [cell line] ]  (right bracketing)

In (a), the antibody targets the liver cell.

In (b), the cell line is derived from the liver.


Dependency Model

  • right bracketing: [ w1 [w2 w3] ]

    • w2 w3 is a compound (modified by w1)

      • home health care

    • w1 and w2 independently modify w3

      • adult male rat

  • left bracketing: [ [w1 w2] w3 ]

    • only one modificational choice possible

      • law enforcement officer




Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing (see the voting sketch below).

  • We use the same general approach for two other structural ambiguity problems.
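
A minimal sketch of the combination step, assuming each source (the bigram model, each surface feature, each paraphrase pattern) has already produced a "left", "right", or abstaining None vote; the tie-breaking rule shown here is an assumption, not necessarily the one used in the actual system:

    def majority_vote(votes, default="right"):
        """Combine per-source bracketing votes by simple majority."""
        left = sum(1 for v in votes if v == "left")
        right = sum(1 for v in votes if v == "right")
        if left == right:           # ties fall back to a default class
            return default
        return "left" if left > right else "right"

    # majority_vote(["left", "right", "left", None])  ->  "left"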


Computing Bigram Statistics

  • Dependency Model, Frequencies

    • Compare #(w1,w2) to #(w1,w3)

  • Dependency Model, Probabilities

    • Pr(left) = Pr(w1→w2 | w2) · Pr(w2→w3 | w3)

    • Pr(right) = Pr(w1→w3 | w3) · Pr(w2→w3 | w3)

  • Since Pr(w2→w3 | w3) appears in both, we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)



Using n-grams to estimate probabilities

  • Using page hits as a proxy for n-gram counts

    • Pr(w1→w2 | w2) = #(w1,w2) / #(w2)

      • #(w2): word frequency; query for “w2”

      • #(w1,w2): bigram frequency; query for “w1 w2”

    • counts smoothed by adding 0.5

  • Use χ² to determine whether w1 is associated with w2 (indicating left bracketing), and likewise whether w1 is associated with w3 (see the sketch below)



Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.



Web-derived Surface Features

  • Authors often disambiguate noun compounds using surface markers, e.g.:

    • amino-acid sequence → left

    • brain stem’s cell → left

    • brain’s stem cell → right

  • The enormous size of the Web makes these frequent enough to be useful (see the query sketch below).
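
One way to operationalize these markers: generate the hyphenated and possessive variants of the compound, query each, and count hits as left/right votes. The sketch below covers only a small subset of the feature inventory described on the following slides, with get_hit_count again a hypothetical search wrapper:

    from collections import Counter

    def surface_feature_votes(w1, w2, w3, get_hit_count):
        """Tally bracketing votes from a few (nearly) unambiguous surface variants."""
        variants = {
            f'"{w1}-{w2} {w3}"':    "left",    # left dash:   cell-cycle analysis
            f'"{w1} {w2}-{w3}"':    "right",   # right dash:  donor T-cell
            f'"{w1} {w2}\'s {w3}"': "left",    # possessive on 2nd word: brain stem's cell
            f'"{w1}\'s {w2} {w3}"': "right",   # possessive on 1st word: brain's stem cell
        }
        votes = Counter()
        for query, direction in variants.items():
            votes[direction] += get_hit_count(query)
        return votes       # e.g. Counter({"left": 1240, "right": 87})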


Web-derived Surface Features: Dash (Hyphen)

  • Left dash

    • cell-cycle analysis → left

  • Right dash

    • donor T-cell → right

  • Double dash

    • T-cell-depletion → unusable…


Web-derived Surface Features: Possessive Marker

  • Attached to the first word

    • brain’s stem cell → right

  • Attached to the second word

    • brain stem’s cell → left

  • Combined features

    • brain’s stem-cell → right


Web-derived Surface Features: Capitalization

  • anycase – lowercase – uppercase

    • Plasmodium vivax Malaria → left

    • plasmodium vivax Malaria → left

  • lowercase – uppercase – anycase

    • brain Stem cell → right

    • brain Stem Cell → right

  • Disable this on:

    • Roman numerals

    • Single-letter words: e.g. vitamin D deficiency


Web-derived Surface Features: Embedded Slash

  • Left embedded slash

    • leukemia/lymphoma cell → right


Web-derived Surface Features: Parentheses

  • Single-word

    • growth factor (beta) → left

    • (brain) stem cell → right

  • Two-word

    • (growth factor) beta → left

    • brain (stem cell) → right


Web-derived Surface Features: Comma, Dot, Semicolon

  • Following the first word

    • home. health care → right

    • adult, male rat → right

  • Following the second word

    • health care, provider → left

    • lung cancer: patients → left


Web-derived Surface Features: Dash to External Word

  • External word to the left

    • mouse-brain stem cell → right

  • External word to the right

    • tumor necrosis factor-alpha → left


Other Web-derived Features: Abbreviation

  • After the second word

    • tumor necrosis (TN) factor → left

  • After the third word

    • tumor necrosis factor (NF) → right

  • We query for, e.g., “tumor necrosis tn factor”

  • Problems:

    • Roman numerals: IV, VI

    • US states: CA

    • Short words: me


Other Web-derived Features: Concatenation

  • Consider health care reform

    • healthcare: 79,500,000

    • carereform: 269

    • healthreform: 812

  • Adjacency model

    • healthcare vs. carereform

  • Dependency model

    • healthcare vs. healthreform

  • Triples

    • “healthcarereform” vs. “health carereform”


Other Web-derived Features: Reorder

  • Reorders for “health care reform”

    • “care reform health” → right

    • “reform health care” → left


Other Web-derived Features: Internal Inflection Variability

  • Vary inflection of second word

    • tyrosine kinase activation

    • tyrosine kinases activation


Other Web-derived Features: Switch the First Two Words

  • Predict right, if we can reorder

    • adult male rat as

    • male adult rat



Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.



Paraphrases

  • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)

    • Prepositional

      • stem cells in the brain → right

      • cells from the brain stem → left

    • Verbal

      • virus causing human immunodeficiency → left

    • Copula

      • office building that is a skyscraper → right
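
These paraphrases can also be turned into web queries. A minimal sketch for the prepositional case, using a truncated preposition list (the full system uses ~150 prepositions, plus the verbal and copula patterns listed on the next slide):

    def prepositional_paraphrase_queries(w1, w2, w3, prepositions=("of", "in", "from", "for")):
        """Generate paraphrase queries whose hit counts vote for a bracketing.
        Right bracketing:  "w2 w3 PREP the w1"   (stem cells in the brain)
        Left bracketing:   "w3 PREP the w1 w2"   (cells from the brain stem)"""
        queries = []
        for p in prepositions:
            queries.append((f'"{w2} {w3} {p} the {w1}"', "right"))
            queries.append((f'"{w3} {p} the {w1} {w2}"', "left"))
        return queries

    # prepositional_paraphrase_queries("brain", "stem", "cells")[:2]
    # -> [('"stem cells of the brain"', 'right'), ('"cells of the brain stem"', 'left')]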



Paraphrases

  • prepositional paraphrases:

    • We use: ~150 prepositions

  • verbal paraphrases:

    • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.

  • copula paraphrases:

    • We use: is/was and that/which/who

  • optional elements:

    • articles: a, an, the

    • quantifiers: some, every, etc.

    • pronouns: this, these, etc.



Our U + U + U Algorithm

  • Compute bigram estimates

  • Compute estimates from surface features

  • Compute estimates from paraphrases

  • Combine these scores with a voting algorithm to choose left or right bracketing.


Evaluation: Datasets

  • Lauer Set

    • 244 noun compounds (NCs)

      • from Grolier’s encyclopedia

      • inter-annotator agreement: 81.5%

  • Biomedical Set

    • 430 NCs

      • from MEDLINE

      • inter-annotator agreement: 88% (κ = 0.606)



Co-occurrence Statistics

  • Lauer set

  • Bio set



Paraphrase and Surface Features Performance

  • Lauer Set

  • Biomedical Set



Individual Surface Features Performance: Bio





Results Lauer



Results: Comparing with Others



Results Bio



Results for Noun Compound Bracketing

  • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks)

    • surface features

    • paraphrases

  • Obtained new state-of-the-art results on NC bracketing

    • more robust than Lauer (1995)

    • more accurate than Keller & Lapata (2004)



Prepositional Phrase Attachment

Problem:

(a) Peter spent millions of dollars. (noun attach)

(b) Peter spent time with his family. (verb attach)

Which attachment for quadruple:

(v, n1, p, n2)

Results:

Much simpler than other algorithms

As good as or better than the best unsupervised approaches, and better than some supervised approaches



Noun Phrase Coordination

  • (Modified) real sentence:

    • The Department of Chronic Diseases and Health Promotion leads and strengthens global efforts to prevent and control chronic diseases or disabilities and to promote health and quality of life.



NC coordination: ellipsis

  • Ellipsis

    • car and truck production

    • means car production and truck production

  • No ellipsis

    • president and chief executive

  • All-way coordination

    • Securities and Exchange Commission


Results: 428 examples from Penn TB



Semantic Relation Detection

  • Goal: automatically augment a lexical database

  • Many potential relation types:

    • ISA (hypernymy/hyponymy)

    • Part-Of (meronymy)

  • Idea: find unambiguous contexts which (nearly) always indicate the relation of interest (see the pattern sketch below)
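
The best-known instances of such contexts are the "such as" hypernym patterns. A minimal sketch, with a single word standing in for a real noun-phrase chunker:

    import re

    # One classic ISA pattern: "X such as Y1, Y2, ... and Yn".
    PATTERN = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: and \w+)?)")

    def hypernym_candidates(text):
        """Yield (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
        for match in PATTERN.finditer(text):
            hypernym = match.group(1)
            for hyponym in re.split(r", | and ", match.group(2)):
                yield hyponym, hypernym

    # list(hypernym_candidates("fruits such as apples, oranges and pears"))
    # -> [('apples', 'fruits'), ('oranges', 'fruits'), ('pears', 'fruits')]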



Lexico-Syntactic Patterns





Adding a New Relation



Semantic Relation Detection

  • Lexico-syntactic Patterns:

    • Should occur frequently in text

    • Should (nearly) always suggest the relation of interest

    • Should be recognizable with little pre-encoded knowledge.

  • These patterns have been used extensively by other researchers.



Semantic Relation Detection

  • What relationship holds between two nouns?

    • olive oil – oil comes from olives

    • machine oil – oil used on machines

  • Assigning the meaning relations between such terms has been seen as a very difficult problem

  • Our solution:

    • Use clever queries against the web to figure out the relations.



Queries for Semantic Relations

  • Convert the noun-noun compound into a query of the form:

  • noun2 that * noun1

  • “oil that * olive(s)”

  • This returns search-result snippets containing interesting verbs (see the sketch below).

    • In this case:

      • Come from

      • Be obtained from

      • Be extracted from

      • Made from
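
A sketch of the query construction and a deliberately crude extraction of the verb material from the returned snippets; fetch_snippets is hypothetical, and the real system lemmatizes and POS-tags the snippets rather than string-matching:

    def relation_query(noun1, noun2):
        """Build the wildcard query "noun2 that * noun1", e.g. "oil that * olives"."""
        return f'"{noun2} that * {noun1}" OR "{noun2} that * {noun1}s"'

    def extract_verb_phrases(snippets, noun1, noun2):
        """Keep whatever sits between "noun2 that" and the next mention of noun1."""
        phrases = []
        for snippet in snippets:
            s = snippet.lower()
            start = s.find(f"{noun2} that ")
            end = s.find(noun1, start)
            if start != -1 and end > start:
                phrases.append(s[start + len(noun2) + len(" that "):end].strip())
        return phrases

    # snippets = fetch_snippets(relation_query("olive", "oil"))   # hypothetical search call
    # extract_verb_phrases(["Oil that comes from olives ..."], "olive", "oil")
    # -> ['comes from']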



Uncovering Semantic Relations

  • More examples:

    • Migraine drug -> treat, be used for, reduce, prevent

    • Wrinkle drug -> treat, be used for, reduce, smooth

    • Printer tray -> hold, come with, be folded, fit under, be inserted into

    • Student protest -> be led by, be sponsored by, pit, be, be organized by



Application: SAT Analogy Problems



Tackling the SAT Analogy Problem

  • First issue queries to find the relations (features) that hold between each word pair

  • Compare the features for each answer pair to those of the question pair.

    • Weight the features with term counts and document counts

    • Compare the weighted feature sets using the Dice coefficient (see the sketch below)
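
A minimal sketch of the comparison step, assuming each word pair has already been mapped to a dict of weighted features (verbs, verb+preposition, prepositions, conjunctions); the weighting itself is not shown:

    def weighted_dice(a, b):
        """Dice coefficient over weighted feature dicts: 2*overlap / (total_a + total_b)."""
        overlap = sum(min(a[f], b[f]) for f in a if f in b)
        total = sum(a.values()) + sum(b.values())
        return 2.0 * overlap / total if total else 0.0

    def best_answer(question_features, answer_feature_sets):
        """Pick the answer pair whose feature profile best matches the question pair's."""
        return max(answer_feature_sets,
                   key=lambda pair: weighted_dice(question_features, answer_feature_sets[pair]))

    # question_features = {"includes": 12, "consists of": 7, "of": 30}   # e.g. committee:member
    # best_answer(question_features, {"forest:tree": {...}, "car:engine": {...}})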



Queries for SAT Analogy Problem



Extract Features from Retrieved Text

  • Verb

    • The committee includes many members.

    • This is a committee, which includes many members.

    • This is a committee, including many members.

  • Verb + Preposition

    • The committee consists of many members.

  • Preposition

    • He is a member of the committee.

  • Coordinating Conjunction

    • the committee and its members



Most Frequent Features for “committee member”



SAT Results: Nouns Only



Conclusions

  • The enormous size of the web opens new opportunities for text analysis

    • There are many words, but even specific word combinations are likely to appear in a huge dataset

    • This allows us to do word-specific analysis

  • To counter the labeled-data roadblock, we start with unambiguous features that we can find naturally.

    • We’ve applied this to structural and semantic language problems.

    • These are stepping stones towards sophisticated language understanding.



http://biotext.berkeley.edu

Supported in part by NSF DBI-0317510

Thank you!



Using n-grams to make predictions

  • Say we are trying to distinguish:

    [home health] care

    home [health care]

  • Main idea: compare these co-occurrence probabilities

    • “home health” vs

    • “health care”



Using n-grams to make predictions

  • Use search engine page hits as a proxy for n-gram counts

    • compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)

    • Pr(w1→w2 | w2) = #(w1,w2) / #(w2)

      • #(w2): word frequency; query for “w2”

      • #(w1,w2): bigram frequency; query for “w1 w2”



Probabilities: Why? (1)

  • Why should we use:

    • (a) Pr(w1→w2 | w2), rather than

    • (b) Pr(w2→w1 | w1)?

  • Keller & Lapata (2004) calculate:

    • AltaVista queries:

      • (a): 70.49%

      • (b): 68.85%

    • British National Corpus:

      • (a): 63.11%

      • (b): 65.57%



Probabilities: Why? (2)

  • Why should we use:

    • (a) Pr(w1→w2 | w2), rather than

    • (b) Pr(w2→w1 | w1)?

  • Maybe to introduce a bracketing prior.

    • Just like Lauer (1995) did.

  • But otherwise, no reason to prefer either one.

    • Do we need probabilities? (association is OK)

    • Do we need a directed model? (symmetry is OK)



Adjacency & Dependency (2)

  • right bracketing: [ w1 [w2 w3] ]

    • w2 w3 is a compound (modified by w1)

    • w1 and w2 independently modify w3

  • adjacency model

    • Is w2 w3 a compound?

    • (vs. w1 w2 being a compound)

  • dependency model

    • Does w1 modify w3?

    • (vs. w1 modifying w2)



Paraphrases: pattern (1)

  • v n1 p n2 → v n2 n1  (noun)

  • Can we turn “n1 p n2” into a noun compound “n2 n1”?

    • meet/v demands/n1 from/p customers/n2 →

    • meet/v the customer/n2 demands/n1

  • Problem: ditransitive verbs like give

    • gave/v an apple/n1 to/p him/n2

    • gave/v him/n2 an apple/n1

  • Solution:

    • no determiner before n1

    • determiner before n2 is required

    • the preposition cannot be to (see the sketch below)
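
A sketch of how pattern (1) could be instantiated as a query; the determiner constraints above apply when matching the original sentence and are only noted in comments here, and number agreement (customers vs. customer) is ignored:

    def pattern1_query(v, n1, p, n2):
        """Paraphrase "v n1 p n2" as the noun compound "v DET n2 n1" (evidence for noun attachment)."""
        if p == "to":          # rule out ditransitives like "gave an apple to him"
            return None
        # the original sentence must have no determiner before n1 and a determiner before n2
        return f'"{v} the {n2} {n1}"'

    # pattern1_query("meet", "demands", "from", "customer")  ->  '"meet the customer demands"'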


Paraphrases: pattern (2)

  • v n1 p n2 → v p n2 n1  (verb)

  • If “p n2” is an indirect object of v, then it could be switched with the direct object n1.

    • had/v a program/n1 in/p place/n2 →

    • had/v in/p place/n2 a program/n1

  • A determiner before n1 is required to prevent “n2 n1” from forming a noun compound.


Paraphrases: pattern (3)

  • v n1 p n2 → p n2 * v n1  (verb)

  • “*” indicates a wildcard position (up to three intervening words are allowed)

  • Looks for appositions, where the PP has moved in front of the verb, e.g.

    • I gave/v an apple/n1 to/p him/n2 →

    • to/p him/n2 I gave/v an apple/n1


Paraphrases: pattern (4)

  • v n1 p n2 → n1 p n2 v  (noun)

  • Looks for appositions, where “n1 p n2” has moved in front of v

    • shaken/v confidence/n1 in/p markets/n2 →

    • confidence/n1 in/p markets/n2 shaken/v


Paraphrases: pattern (5)

  • v n1 p n2 → v PRONOUN p n2  (verb)

  • n1 is a pronoun → verb attachment (Hindle & Rooth ’93)

  • Pattern (5) substitutes n1 with a dative pronoun (him or her), e.g.

    • put/v a client/n1 at/p odds/n2 →

    • put/v him at/p odds/n2


Paraphrases: pattern (6)

  • v n1 p n2 → BE n1 p n2  (noun)

  • BE is typically used with a noun attachment

  • Pattern (6) substitutes v with a form of to be (is or are), e.g.

    • eat/v spaghetti/n1 with/p sauce/n2 →

    • is spaghetti/n1 with/p sauce/n2

