Slide 1

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Proceedings of ACL 2013

Slide 2

Semi-Supervised Training

HMM with Expectation-Maximization (EM)

Need:

  • Large raw corpus
  • Tag dictionary

[Kupiec, 1992; Merialdo, 1994]
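
To make this concrete, here is a minimal sketch of tag-dictionary-constrained HMM training. For brevity it uses hard (Viterbi) EM rather than the full forward-backward EM of the cited work, and the corpus, tag dictionary, and names are all invented:

```python
# Sketch: hard (Viterbi) EM for an HMM tagger constrained by a tag dictionary.
# Toy data; the cited work uses full forward-backward EM, not this shortcut.
import math
from collections import defaultdict

raw_corpus = [["the", "dog", "walks"], ["the", "thug", "walks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN"}, "thug": {"JJ", "NN"},
            "walks": {"NNS", "VBZ"}}
tags = sorted({t for ts in tag_dict.values() for t in ts})

# Uniform initialization; unseen pairs keep this default (crude smoothing).
trans = defaultdict(lambda: math.log(1.0 / len(tags)))
emit = defaultdict(lambda: math.log(1.0 / len(tags)))

def viterbi(sent):
    """Best tag sequence for sent, restricted to dictionary-allowed tags."""
    layers = [{t: (emit[(t, sent[0])], [t]) for t in tag_dict[sent[0]]}]
    for w in sent[1:]:
        layer = {}
        for t in tag_dict[w]:
            score, path = max((s + trans[(p, t)] + emit[(t, w)], path)
                              for p, (s, path) in layers[-1].items())
            layer[t] = (score, path + [t])
        layers.append(layer)
    return max(layers[-1].values())[1]

for _ in range(10):
    # E-step (hard): decode the raw corpus with the current model.
    tc, tn = defaultdict(int), defaultdict(int)
    ec, en = defaultdict(int), defaultdict(int)
    for sent in raw_corpus:
        path = viterbi(sent)
        for p, t in zip(path, path[1:]):
            tc[(p, t)] += 1; tn[p] += 1
        for t, w in zip(path, sent):
            ec[(t, w)] += 1; en[t] += 1
    # M-step: re-estimate transition/emission probabilities from the counts.
    for (p, t), c in tc.items():
        trans[(p, t)] = math.log(c / tn[p])
    for (t, w), c in ec.items():
        emit[(t, w)] = math.log(c / en[t])

print(viterbi(["the", "dog", "walks"]))  # e.g. ['DT', 'NN', 'VBZ']
```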

Slide 3

Previous Work:

  • Supervised learning
    • Provides high accuracy for POS tagging (Manning, 2011).
    • Performs poorly when little supervision is available.
  • Semi-supervised learning
    • Typically done by training sequence models such as HMMs using the EM algorithm.
    • Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).
Slide 4

Previous Work:

  • Goldberg et al. (2008)
    • Manually constructed a lexicon for Hebrew to train an HMM tagger.
    • The lexicon was developed over a long period of time by expert lexicographers.
  • Täckström et al. (2013)
    • Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.
    • Large parallel corpora are required.
Slide 5

Low-Resource Languages

6,900 languages in the world

~30 have non-negligible quantities of data

No million-word corpus for any endangered language

[Maxwell and Hughes, 2006; Abney and Bird, 2010]

Slide 6

Low-Resource Languages

Kinyarwanda (KIN): Niger-Congo; morphologically rich.

Malagasy (MLG): Austronesian; spoken in Madagascar.

Also, English.

Slide 7

Collecting Annotations

  • Supervised training is not an option.
  • Semi-supervised training:
    • Annotate some data by hand for 4 hours (in 30-minute intervals), for two tasks (illustrated in the sketch after this list):
      • Type supervision.
      • Token supervision.
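
As a concrete illustration of the two annotation forms (the words, tags, and variable names here are invented, not the project's actual data):

```python
# Type supervision: word types paired with their possible POS tags.
type_annotations = {"the": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}

# Token supervision: whole sentences tagged in context.
token_annotations = [[("the", "DT"), ("dog", "NN"), ("walks", "VBZ")]]
```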
Slide 8

Tag Dict Generalization

These annotations are too sparse!

Generalize to the entire vocabulary

Slide 9

Tag Dict Generalization

Haghighi and Klein (2006) do this with a vector space.

We don’t have enough raw data

Das and Petrov (2011) do this with a parallel corpus.

We don’t have a parallel corpus

Slide 10

Tag Dict Generalization

Strategy: Label Propagation

• Connect annotations to raw corpus tokens

• Push tag labels to the entire corpus (a simplified sketch follows)

[Talukdar and Crammer, 2009]
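
A minimal sketch of the propagation step. The approach cited above is Modified Adsorption (Talukdar and Crammer, 2009); this simplified stand-in just iterates neighborhood averaging with annotated seed nodes clamped, and the graph and all names are invented:

```python
# Sketch: simple iterative label propagation over an undirected graph.
from collections import defaultdict

edges = [("TOK_the_1", "TYPE_the"), ("TOK_dog_2", "TYPE_dog"),
         ("TOK_dog_2", "PREV_the"), ("TOK_thug_5", "PREV_the"),
         ("TOK_thug_5", "TYPE_thug")]
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}  # from annotations

nbrs = defaultdict(set)
for u, v in edges:
    nbrs[u].add(v); nbrs[v].add(u)

labels = {n: dict(seeds.get(n, {})) for n in nbrs}
for _ in range(20):
    new = {}
    for n in nbrs:
        if n in seeds:                 # annotated nodes stay clamped
            new[n] = dict(seeds[n])
            continue
        acc = defaultdict(float)       # average the neighbors' distributions
        for m in nbrs[n]:
            for tag, p in labels[m].items():
                acc[tag] += p
        z = sum(acc.values()) or 1.0
        new[n] = {t: p / z for t, p in acc.items()}
    labels = new

print(labels["TOK_thug_5"])  # tag distribution pushed in from its neighbors
```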

Slide 11

Morphological Transducers

  • Finite-state transducers (FSTs) are used for morphological analysis.
  • An FST accepts a word type and produces a set of morphological features.
  • Power of FSTs:
    • They can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word (see the toy sketch below).
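
A toy sketch of that guessing behavior, assuming a hand-made suffix list in place of a real transducer (all data and names are invented):

```python
# Sketch: guessing analyses for out-of-vocabulary words from known affixes.
SUFFIXES = {"s": {"NUM=pl"}, "ed": {"TENSE=past"}, "ing": {"ASP=prog"}}
LEXICON = {"walk": {"POS=VB"}, "dog": {"POS=NN"}}

def analyze(word):
    """Return possible feature sets: lexicon hits, else affix-based guesses."""
    if word in LEXICON:
        return [LEXICON[word]]
    analyses = []
    for suf, feats in SUFFIXES.items():
        if word.endswith(suf):
            stem_feats = LEXICON.get(word[:-len(suf)], {"POS=?"})
            analyses.append(stem_feats | feats)
    return analyses or [{"POS=?"}]

print(analyze("walked"))   # known stem + known suffix
print(analyze("glorped"))  # unknown stem, but '-ed' still yields a guess
```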
Slide 12

Tag Dict Generalization

[Figure: the label-propagation graph. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_dog_2, TOK_thug_5) connect to type nodes (TYPE_the, TYPE_thug, TYPE_dog), context nodes (PREV_<b>, PREV_the, NEXT_walks, NEXT_thug), and affix nodes (PRE1_t, PRE2_th, SUF1_e, PRE1_d, PRE2_do, SUF1_g).]
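
A sketch of how edges like those in the figure could be generated from a raw sentence. The node-naming scheme follows the slide; the function and the token-indexing scheme are invented:

```python
# Sketch: generating LP-graph edges (token node -> feature node) per token.
def token_edges(sent, sent_idx=0):
    edges = []
    for i, w in enumerate(sent):
        tok = f"TOK_{w}_{sent_idx}_{i}"           # unique id per corpus token
        prev_w = sent[i - 1] if i > 0 else "<b>"  # <b> marks a boundary
        next_w = sent[i + 1] if i + 1 < len(sent) else "<b>"
        edges += [(tok, f"TYPE_{w}"),
                  (tok, f"PREV_{prev_w}"), (tok, f"NEXT_{next_w}"),
                  (tok, f"PRE1_{w[:1]}"), (tok, f"PRE2_{w[:2]}"),
                  (tok, f"SUF1_{w[-1:]}")]
    return edges

for edge in token_edges(["the", "dog", "walks"]):
    print(edge)
```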

Slide 13

Tag Dict Generalization

Type Annotations: the/DT, dog/NN

[Figure: the same LP graph, with the type annotations attached as seed labels on TYPE_the (DT) and TYPE_dog (NN).]

Slide 14

Tag Dict Generalization

Type Annotations

[Figure: the seeded labels spread into the graph; TYPE_the now carries DT and TYPE_dog carries NN.]

Slide 15

Tag Dict Generalization

Type Annotations: the/DT

[Figure: the LP graph as on the previous slides.]

Token Annotations: the/DT dog/NN walks/VBZ

Slide 16

Tag Dict Generalization

Type Annotations: the/DT

[Figure: the token annotations are attached as seed labels on individual token nodes: TOK_the_4 carries DT and TOK_dog_2 carries NN.]

Token Annotations: the/DT dog/NN walks/VBZ

Slide 17

Model Minimization

  • The LP graph has a node for each corpus token.
  • Each node is labelled with a distribution over POS tags.
  • The graph thus provides a corpus of sentences labelled with noisy tag distributions.
  • Greedily seek the minimal set of tag bigrams that describe the raw corpus (see the toy sketch below).
  • Then train an HMM with EM.

[Ravi et al., 2010; Garrette and Baldridge, 2012]
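
A toy sketch of the greedy step, assuming a simplified objective (cover every adjacent token pair with at least one chosen tag bigram) rather than the full model-minimization procedure of the cited papers; all data are invented:

```python
# Sketch: greedily pick tag bigrams until every adjacent token pair is covered.
from itertools import product

corpus = [["the", "dog", "walks"], ["the", "thug", "walks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN"}, "thug": {"JJ", "NN"},
            "walks": {"NNS", "VBZ"}}

# Each "slot" is an adjacent token pair; a slot is covered by any tag bigram
# consistent with the two tokens' dictionary-allowed tags.
slots = {}
for s, sent in enumerate(corpus):
    for i in range(len(sent) - 1):
        slots[(s, i)] = set(product(tag_dict[sent[i]], tag_dict[sent[i + 1]]))

chosen, uncovered = set(), set(slots)
while uncovered:
    # Pick the bigram that covers the most still-uncovered slots.
    best = max({b for s in uncovered for b in slots[s]},
               key=lambda b: sum(b in slots[s] for s in uncovered))
    chosen.add(best)
    uncovered = {s for s in uncovered if best not in slots[s]}

print(sorted(chosen))  # a small set of tag bigrams describing the corpus
```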

Overall Accuracy

All of the reported accuracy values were achieved using both FST and affix LP features.

Slide 24

Conclusion

  • Type annotations are the most useful input from a linguist.
  • We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a linguist who is not a native speaker.