Presentation Transcript

Unsupervised Approaches to Sequence Tagging, Morphology Induction, and Lexical Resource Acquisition

Reza Bosaghzadeh & Nathan Schneider

LS2 ~ 1 December 2008


Unsupervised Methods

  • Sequence Labeling (Part-of-Speech Tagging)

  • Morphology Induction

  • Lexical Resource Acquisition

    She/pronoun ran/verb to/preposition the/det station/noun quickly/adverb .

    un-supervise-d learn-ing


Part-of-Speech (POS) Tagging

  • Prototype-driven model

    Cluster based on a few examples

    (Haghighi and Klein 2006)

  • Contrastive Estimation

    Shift probability mass to given positive examples at the expense of implicit negative examples

    (Smith and Eisner 2005)


Prototype-Driven Tagging (Haghighi & Klein 2006)

[Diagram: supervised learning starts from annotated data; the prototype-driven approach instead takes unlabeled data plus a prototype list mapping each target label to a few prototype words.]

slide courtesy Haghighi & Klein


Prototype-Driven Tagging (Haghighi & Klein 2006)

Information Extraction: Classified Ads

[Diagram: the ad below is labeled with fields such as Size, Restrict, Terms, Location, and Features, given a prototype list.]

Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.

slide courtesy Haghighi & Klein


Prototype-Driven Tagging (Haghighi & Klein 2006)

English POS

[Diagram: the same ad, now labeled with Penn Treebank POS tags (PUNC, NN, VBN, CC, JJ, CD, IN, DET, NNS, NNP, RB), given a prototype list.]

Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.

slide courtesy Haghighi & Klein


Prototype-Driven Tagging (Haghighi & Klein 2006)

  • 3 prototypes per tag

  • Automatically extracted by frequency

(Haghighi and Klein 2006)


Prototype-Driven Tagging (Haghighi & Klein 2006)

  • Trigram tagger, same features as (Smith & Eisner 2005)

    • Word type, suffixes up to length 3, contains-hyphen, contains-digit, initial capitalization

  • Tie each word to its most similar prototype, using context-based similarity technique (Schütze 1993)

    • SVD dimensionality reduction

    • Cosine similarity between context vectors

(Haghighi and Klein 2006)
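
The similarity step above can be made concrete with a minimal sketch: a toy corpus, an invented three-word prototype list, and plain NumPy for the SVD and cosine steps. For readability each word is simply hard-assigned the label of its nearest prototype, a simplification of how the real model uses prototype similarity as features.

```python
import numpy as np

# Toy corpus and invented prototype list (assumptions for illustration).
corpus = "she ran to the station quickly she walked to the park slowly".split()
prototypes = {"she": "pronoun", "the": "det", "quickly": "adverb"}

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Context vectors: counts of left and right neighbors.
C = np.zeros((len(vocab), 2 * len(vocab)))
for i, w in enumerate(corpus):
    if i > 0:
        C[idx[w], idx[corpus[i - 1]]] += 1              # left neighbor
    if i + 1 < len(corpus):
        C[idx[w], len(vocab) + idx[corpus[i + 1]]] += 1  # right neighbor

# SVD dimensionality reduction to k dimensions.
k = 3
U, S, _ = np.linalg.svd(C, full_matrices=False)
reduced = U[:, :k] * S[:k]

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Tie each word to its most similar prototype (cosine over reduced vectors).
for w in vocab:
    best = max(prototypes, key=lambda p: cosine(reduced[idx[w]], reduced[idx[p]]))
    print(f"{w:8s} -> {prototypes[best]}")
```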


Prototype-Driven Tagging (Haghighi & Klein 2006)

Pros

  • Fairly easy to choose a few prototypes per tag

  • Number of prototypes is flexible

  • Doesn't require a tagging dictionary

Cons

  • Still needs a tag set

  • Chosen prototypes may not work well if infrequent or highly ambiguous


Contrastive Estimation (Smith & Eisner 2005)

  • Already discussed in class

  • Key idea: exploits implicit negative evidence

    • Mutating training examples often gives ungrammatical (negative) sentences

    • During training, shift probability mass from generated negative examples to given positive examples
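
As a toy illustration of the idea (not Smith & Eisner's implementation), the sketch below builds a TRANS1-style neighborhood of single adjacent-word transpositions and computes the contrastive objective under an invented log-linear scoring function.

```python
import math

def trans1_neighborhood(sentence):
    """The observed sentence plus every single adjacent-word transposition."""
    neighbors = [sentence]
    for i in range(len(sentence) - 1):
        mutated = sentence[:]
        mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]
        neighbors.append(mutated)
    return neighbors

def score(sentence, weights):
    """Invented log-linear score over adjacent word pairs."""
    return sum(weights.get((a, b), 0.0) for a, b in zip(sentence, sentence[1:]))

def ce_objective(sentence, weights):
    """log p(observed | neighborhood): probability mass is shifted from the
    mutated (implicitly negative) sentences to the observed positive one."""
    neighborhood = trans1_neighborhood(sentence)
    log_z = math.log(sum(math.exp(score(n, weights)) for n in neighborhood))
    return score(sentence, weights) - log_z

weights = {("she", "ran"): 1.0, ("to", "the"): 1.0}
print(ce_objective("she ran to the station quickly".split(), weights))
```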


Unsupervised POS Tagging: The State of the Art

Best supervised result (CRF): 99.5%!


Unsupervised Methods

  • Sequence Labeling (Part-of-Speech Tagging)

  • Morphology Induction

  • Lexical Resource Acquisition

    She/pronoun ran/verb to/preposition the/det station/noun quickly/adverb .

    un-supervise-d learn-ing


Unsupervised Approaches to Morphology

  • Morphology refers to the internal structure of words

    • A morpheme is a minimal meaningful linguistic unit

    • Morpheme segmentation is the process of dividing words into their component morphemes

      unsupervised learning


Unsupervised Approaches to Morphology

  • Morphology refers to the internal structure of words

    • A morpheme is a minimal meaningful linguistic unit

    • Morpheme segmentation is the process of dividing words into their component morphemes

      un-supervise-d learn-ing

    • Word segmentation is the process of finding word boundaries in a stream of speech or text

      unsupervisedlearningofnaturallanguage


Unsupervised Approaches to Morphology

  • Morphology refers to the internal structure of words

    • A morpheme is a minimal meaningful linguistic unit

    • Morpheme segmentation is the process of dividing words into their component morphemes

      un-supervise-d learn-ing

    • Word segmentation is the process of finding word boundaries in a stream of speech or text

      unsupervised learning of natural language


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • Learns inflectional paradigms from raw text

    • Requires only the vocabulary of a corpus

    • Looks at word counts of substrings, and proposes (stem, suffix) pairings based on type frequency

  • 3-stage algorithm

    • Stage 1: Candidate paradigms based on frequencies

    • Stages 2-3: Refinement of paradigm set via merging and filtering

  • Paradigms can be used for morpheme segmentation or stemming
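
A rough sketch of the stage-1 intuition, under the simplifying assumption that any two suffixes sharing at least two stems form a candidate grouping (ParaMor's actual search is more elaborate):

```python
from collections import defaultdict
from itertools import combinations

vocab = ["hablar", "hablo", "hablamos", "hablan",
         "bailar", "bailo", "bailamos", "bailan"]

# Every cut point of every word proposes a (stem, suffix) pair.
stems_of = defaultdict(set)
for word in vocab:
    for cut in range(1, len(word)):
        stems_of[word[cut:]].add(word[:cut])

# Group suffixes supported by the same stems (type frequency).
for s1, s2 in combinations(sorted(stems_of), 2):
    shared = stems_of[s1] & stems_of[s2]
    if len(shared) >= 2:
        print(f"candidate suffixes {{-{s1}, -{s2}}} with stems {sorted(shared)}")
```

On this toy vocabulary the correct grouping of -ar, -o, -amos, -an emerges, but so do spurious ones such as {-lar, -lo} with stems {hab, bai}; filtering such candidates is exactly the job of stages 2-3.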


ParaMor (Monson et al. 2007, 2008)

  • A sampling of Spanish verb conjugations


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • A proposed paradigm with stems {habl, bail} and suffixes {-ar, -o, -amos, -an}


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • Same paradigm from previous slide, but with stems {habl, bail, compr}


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • From just this list, other paradigm analyses (which happen to be incorrect) are possible


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • Another possibility: stems {hab, bai}, suffixes {-lar, -lo, -lamos, -lan}


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • Spurious segmentations: this paradigm doesn't generalize to comprar (or most verbs)


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • What if not all conjugations were in the corpus?


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • We have two similar paradigms that we want to merge


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • This amounts to smoothing, or “hallucinating” out-of-vocabulary items
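
A hedged sketch of that merge step: two candidate paradigms whose suffix sets overlap strongly are collapsed, "hallucinating" the unseen (stem, suffix) combinations. The Jaccard threshold is an invented stand-in for ParaMor's actual merge criteria.

```python
def suffix_overlap(p1, p2):
    """Jaccard overlap between the suffix sets of two candidate paradigms."""
    s1, s2 = set(p1["suffixes"]), set(p2["suffixes"])
    return len(s1 & s2) / len(s1 | s2)

def merge(p1, p2):
    return {"stems": sorted(set(p1["stems"]) | set(p2["stems"])),
            "suffixes": sorted(set(p1["suffixes"]) | set(p2["suffixes"]))}

# habl- and bail- were seen with all four suffixes; compr- with only three.
a = {"stems": ["habl", "bail"], "suffixes": ["ar", "o", "amos", "an"]}
b = {"stems": ["compr"], "suffixes": ["ar", "o", "an"]}

if suffix_overlap(a, b) > 0.5:          # invented threshold
    print(merge(a, b))                  # now licenses the unseen "compramos"
```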


ParaMor: Morphological Paradigms (Monson et al. 2007, 2008)

  • Heuristic-based, deterministic algorithm can learn inflectional paradigms from raw text

  • Paradigms can be used straightforwardly to predict segmentations (see the sketch after this list)

    • Combining the outputs of ParaMor and Morfessor (another system) won the segmentation task at MorphoChallenge 2008 for every language: English, Arabic, Turkish, German, and Finnish

  • Currently, ParaMor assumes suffix-based morphology
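
A small sketch of the segmentation use mentioned above, assuming a single learned paradigm and a longest-suffix-first rule (details are illustrative, not ParaMor's code):

```python
paradigm = {"stems": {"habl", "bail", "compr"},
            "suffixes": {"ar", "o", "amos", "an"}}

def segment(word, paradigm):
    # Longest suffix first, so "hablamos" becomes habl-amos, not habla-mos.
    for suffix in sorted(paradigm["suffixes"], key=len, reverse=True):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and stem in paradigm["stems"]:
            return f"{stem}-{suffix}"
    return word  # no analysis found

for w in ["hablamos", "compran", "mesa"]:
    print(w, "->", segment(w, paradigm))
```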


Bayesian Word Segmentation (Goldwater et al. 2006; in submission)

  • Word segmentation results – comparison

  • See Narges & Andreas’s presentation for more on this model

[Results table comparing the Goldwater unigram DP and bigram HDP models.]
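
The unigram model's Chinese-restaurant-process intuition can be sketched in a few lines: the next word token repeats a previously generated word with probability proportional to its count, or is a novel word with probability proportional to a concentration parameter alpha. The base distribution below is a toy stand-in; the actual model is fit by Gibbs sampling over segmentations.

```python
import random
from collections import Counter

def base_distribution():
    """P0: a toy generator of short letter strings (stand-in)."""
    return "".join(random.choice("abc") for _ in range(random.randint(1, 4)))

def sample_words(n, alpha=1.0):
    counts, words = Counter(), []
    for i in range(n):
        if random.random() < alpha / (alpha + i):
            word = base_distribution()              # new table: novel word
        else:                                       # old table: reuse a word
            word = random.choices(list(counts), weights=list(counts.values()))[0]
        counts[word] += 1
        words.append(word)
    return words

random.seed(0)
words = sample_words(10)
print(words)             # the "lexicon", with rich-get-richer reuse
print("".join(words))    # the unsegmented stream a learner must segment
```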


Multilingual Morpheme Segmentation (Snyder & Barzilay 2008)

  • Considers parallel phrases and tries to find morpheme correspondences

  • Stray morphemes don’t correspond across languages

  • Abstract morphemes cross languages: (ar, er), (o, e), (amos, ons), (an, ent), (habl, parl)


Morphology Papers: Inputs & Outputs


Unsupervised Methods

  • Sequence Labeling (Part-of-Speech Tagging)

  • Morphology Induction

  • Lexical Resource Acquisition

    She/pronoun ran/verb to/preposition the/det station/noun quickly/adverb .

    un-supervise-d learn-ing


Lexical Resource Acquisition

  • Bilingual Lexicons from Monolingual Corpora (Haghighi et al. 2008)

  • Narrative Event Chains (Chambers and Jurafsky 2008)

  • A Statistical Verb Lexicon for Semantic Role Labeling (Grenager and Manning 2006)


Bilingual Lexicons from Monolingual Corpora (Haghighi et al. 2008)

[Diagram: a matching m links source words s (nombre, estado, política, mundo) drawn from source text to target words t (state, world, name, nation) drawn from target text.]

slide courtesy Haghighi et al.

Bilingual Lexicons from Monolingual Corpora (Haghighi et al. 2008)

Data Representation

Each word is represented by orthographic features (character n-grams, each with weight 1.0) and context features (co-occurrence counts):

  • state: orthographic #st, tat, te#; context world 17.0, politics 10.0, society 10.0

  • estado: orthographic #es, sta, do#; context mundo 20.0, politica 5.0, sociedad 6.0

Used a variant of CCA (Canonical Correlation Analysis)

Trained with Viterbi EM to find the best matching

slide courtesy Haghighi et al.
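
A heavily simplified sketch of the matching idea: score source/target pairs by similarity of their combined orthographic and context features, then greedily extract a one-to-one matching. The real model learns a CCA projection into a shared space and re-estimates it with hard (Viterbi) EM; here raw features and a tiny seed-mapped context space stand in for that.

```python
import math

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy feature vectors: character trigrams plus context counts, with context
# words mapped into a shared space by a tiny seed lexicon (an assumption).
src = {"estado":   {"#es": 1.0, "sta": 1.0, "do#": 1.0, "ctx:world": 20.0},
       "politica": {"#po": 1.0, "oli": 1.0, "ca#": 1.0, "ctx:world": 5.0}}
tgt = {"state":    {"#st": 1.0, "sta": 1.0, "te#": 1.0, "ctx:world": 17.0},
       "politics": {"#po": 1.0, "oli": 1.0, "cs#": 1.0, "ctx:world": 10.0}}

# Greedy one-to-one matching by similarity (a stand-in for the matching m).
scored = sorted(((cosine(fs, ft), s, t) for s, fs in src.items()
                 for t, ft in tgt.items()), reverse=True)
used_s, used_t = set(), set()
for sim, s, t in scored:
    if s not in used_s and t not in used_t:
        print(f"{s} <-> {t}   (similarity {sim:.2f})")
        used_s.add(s)
        used_t.add(t)
```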


Bilingual Lexicons from Monolingual Corpora (Haghighi et al. 2008)

Feature Experiments

  • MCCA: orthographic and context features

[Chart: precision on 4k EN-ES Wikipedia articles.]

(Haghighi et al. 2008)


Narrative Events (Chambers & Jurafsky 2008)

  • Given a corpus, identifies related events that constitute a “narrative” and (when possible) predicts their typical temporal ordering

    • E.g.: criminal prosecution narrative, with verbs: arrest, accuse, plead, testify, acquit/convict

  • Key insight: related events tend to share a participant in a document

    • The common participant may fill different syntactic/semantic roles with respect to verbs: arrest.object, accuse.object, plead.subject
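
The key insight lends itself to a small sketch: given documents already reduced to (participant, verb.role) tuples (an assumption standing in for parsing plus coreference resolution), count events that share a participant and score event pairs by pointwise mutual information.

```python
import math
from collections import Counter
from itertools import combinations

# Documents pre-reduced to (participant, verb.role) tuples (toy data).
docs = [
    [("suspect", "arrest.obj"), ("suspect", "accuse.obj"), ("suspect", "plead.subj")],
    [("defendant", "accuse.obj"), ("defendant", "plead.subj"), ("defendant", "convict.obj")],
]

event_count, pair_count = Counter(), Counter()
for doc in docs:
    by_entity = {}
    for entity, event in doc:
        by_entity.setdefault(entity, set()).add(event)
    for events in by_entity.values():        # events sharing a participant
        event_count.update(events)
        for pair in combinations(sorted(events), 2):
            pair_count[pair] += 1

total_pairs = sum(pair_count.values())
total_events = sum(event_count.values())
for (e1, e2), n in pair_count.most_common():
    pmi = math.log((n / total_pairs) /
                   ((event_count[e1] / total_events) * (event_count[e2] / total_events)))
    print(f"pmi({e1}, {e2}) = {pmi:.2f}")
```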


Narrative Events (Chambers & Jurafsky 2008)

  • A temporal classifier can reconstruct pairwise canonical event orderings, producing a directed graph for each narrative


Statistical Verb Lexicon (Grenager & Manning 2006)

  • From dependency parses, a generative model predicts semantic roles corresponding to each verb’s arguments, as well as their syntactic realizations

    • PropBank-style: arg0, arg1, etc. per verb (do not necessarily correspond across verbs)

    • Learned syntactic patterns of the form: (subj=give.arg0, verb=give, np#1=give.arg2, np#2=give.arg1) or (subj=give.arg0, verb=give, np#2=give.arg1, pp_to=give.arg2)

  • Used for semantic role labeling
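
An illustrative sketch of what the learned patterns buy you at labeling time: the two patterns above map syntactic positions of give to consistent PropBank-style roles across the ditransitive/prepositional alternation. The data structures are assumptions for readability, not Grenager & Manning's representation.

```python
# Two learned linking patterns for "give" (representation is an assumption).
patterns = {
    "ditransitive": {"subj": "give.arg0", "np#1": "give.arg2", "np#2": "give.arg1"},
    "pp_to":        {"subj": "give.arg0", "np#1": "give.arg1", "pp_to": "give.arg2"},
}

# "Mary gave the museum a painting"  vs.  "Mary gave a painting to the museum"
clauses = [
    ("ditransitive", {"subj": "Mary", "np#1": "the museum", "np#2": "a painting"}),
    ("pp_to",        {"subj": "Mary", "np#1": "a painting", "pp_to": "the museum"}),
]

for name, clause in clauses:
    for position, filler in clause.items():
        print(f"{filler!r:16} -> {patterns[name][position]}")
    print()
```

Both realizations assign Mary to arg0, the museum to arg2, and a painting to arg1, which is exactly the point of learning the patterns jointly.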


“Semanticity”: Our Proposed Scale of Semantic Richness

  • text < POS < syntax/morphology/alignments < coreference/semantic roles/temporal ordering < translations/narrative event sequences

  • We score each model’s inputs and outputs on this scale, and call the input-to-output increase “semantic gain”

    • Haghighi et al.’s bilingual lexicon induction wins in this respect, going from raw text to lexical translations
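
Encoded as ranks, the scale makes "semantic gain" a simple difference (the rank numbers below are arbitrary; only the ordering matters):

```python
# Rank encoding of the proposed scale (ordering from the bullet above).
SEMANTICITY = {
    "text": 0,
    "POS": 1,
    "syntax/morphology/alignments": 2,
    "coreference/semantic roles/temporal ordering": 3,
    "translations/narrative event sequences": 4,
}

def semantic_gain(inputs, outputs):
    return SEMANTICITY[outputs] - SEMANTICITY[inputs]

# Haghighi et al. (2008): raw text in, lexical translations out.
print(semantic_gain("text", "translations/narrative event sequences"))  # 4
```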


Robustness to Language Variation

  • About half of the papers we examined had English-only evaluations

  • We considered which techniques were most adaptable to other (esp. resource-poor) languages. Two main factors:

    • Reliance on existing tools/resources for preprocessing (parsers, coreference resolvers, …)

    • Any linguistic specificity in the model (e.g. suffix-based morphology)


Summary

We examined three areas of unsupervised NLP:

  • Sequence tagging: How can we predict POS (or topic) tags for words in sequence?

  • Morphology: How are words put together from morphemes (and how can we break them apart)?

  • Lexical resources: How can we identify lexical translations, semantic roles and argument frames, or narrative event sequences from text?

    In eight recent papers we found a variety of approaches, including heuristic algorithms, Bayesian methods, and EM-style techniques.



Questions?

Thanks to Noah and Kevin for their feedback on the paper; Andreas and Narges for their collaboration on the presentations; and all of you for giving us your attention!
