
Unsupervised Syntactic Category Induction using Multi-level Linguistic Features



  1. Unsupervised Syntactic Category Induction using Multi-level Linguistic Features Christos Christodoulopoulos Pre-viva talk

  2. What did we ever do for the linguists? • Computational linguistic (or NLP) models • Mostly supervised (until recently) • Mostly on English (English ≈ WSJ sections 02-21) • Mostly on the “fat head” of Zipf’s law

  3. Revolutionaries (without supervision) • NLP is good at spotting patterns • Unsupervised learning • Machine Learning • Not great at looking at the whole picture

  4. The “whole” picture [Diagram: the traditional NLP pipeline, Morphology → Parts of Speech → Syntax → Alignments]

  5. The “whole” picture [Diagram with representative systems per level] • Morphology: Clark (2003); Virpioja et al. (2007); Geertzen & van Zaanen (2004) • Parts of Speech: Klein & Manning (2004); Naseem et al. (2009); Sirts & Alumäe (2012) • Syntax / Alignments: Snyder et al. (2009)

  6. My thesis • Patterns that correspond to syntactic categories or parts of speech (PoS) • Motivated by linguistic theories • Holistic view of NLP • Instead of the pipeline approach • Computationally efficient • Cross-lingual analysis • Might provide linguistic insights

  7. My thesis [Figure: induced cross-lingual word clusters from an English–Greek parallel corpus. Example English clusters include verbs (do, make, give, take, know, bring, eat, see, hear, keep), nouns (man, day, time, city, place, thing, priest, woman, wicked, spirit) and pronouns (he, it, there, she, whosoever, soon, others, Satan, whoso, Elias). Aligned English–Greek clusters include: this/τουτο, what/τι, if/εαν, whoever/τις; subjunctive verbs be/ηναι, can/δυναται, become/γεινη, must/πρεπει; 3rd-person present verbs is/εισθαι, does/καμει, gives/δωσει, brings/φερει; proper nouns Lord/Κυριος, God/Θεος, king/βασιλευς, son/υιος, Jesus/Ιησους, Moses/Μωυσης; and this/αυτο, there/εκει, not/ουχι, fire/πυρ.]

  8. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  9. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  10. Theory of syntactic categories • Not everyone agrees on what they are • Syntactic categories/PoS/word classes? • Most agree that they capture more than one level of language structure • Influence on my work: • Not easy to focus on any particular theory • Multiple sources of information is key [Sidebar of theories: Plato, Aristotle: semantic (noun, verb) & morphological (conjunction) • Dionysius Thrax: 8 parts of speech; semantic, syntactic & morphological • Lindley Murray: ‘school account’ of 9 parts of speech; semantic & syntactic • Ray Jackendoff: feature-based (e.g. ±Subject); purely syntactic • Susan Schmerling: formal semantics; <e,t>: nouns, adjectives, intr. verbs • Paul Schachter: notional (pragmatic) definition; morphological, syntactic & distributional]

  11. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  12. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  13. Evaluation • How will we know whether we have found good clusters? • Intrinsic • Test on existing PoS tagged data (gold-standard) • Extrinsic • Use clusters as input to another task • Both have issues when used with unsupervised methods

  14. Intrinsic evaluation • Clusters might not correspond to PoS • Cluster IDs instead of labels • Different sizes • Gold-standard follows specific linguistic theories • Gold-standard might not help downstream • Annotations are tuned to specific tasks

  15. The (intrinsic) elephant in the room • Cyclical problem • Trying to discover clusters that don’t (necessarily) correspond to gold-standard annotations • Evaluate them on gold-standard annotations • (Compromising) Solution: Test on multiple languages, using multiple systems • There is going to be some overlap

  16. Extrinsic evaluation • “Passing the buck” to the next task • If unsupervised and intrinsically evaluated • Performances might not be correlated • Intrinsic gains on task #1 do not correlate with gains on task #2 (Headden et al., 2008) • Depends on the degree of integration • How much of task #2’s input is the output of #1

  17. Evaluation • Intrinsic evaluation metrics • Mapping • Many-to-one (m-1), one-to-one, cross-validation • Widely used • Sensitive to size of induced tagset • Information-theoretic • Variation of information (vi), V-measure (vm) • Less sensitive • Less intuitive (especially vi)
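The many-to-one mapping mentioned on this slide is easy to make concrete: each induced cluster is mapped to the gold tag it co-occurs with most often, and accuracy is computed under that mapping. A minimal sketch in Python (the function name `many_to_one_accuracy` is my own, not from the talk):

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """Many-to-one (m-1) accuracy: map each induced cluster to its
    most frequent gold tag, then score the corpus under that mapping.
    Several clusters may map to the same gold tag, which is why m-1
    is sensitive to the size of the induced tagset."""
    counts = defaultdict(Counter)
    for c, g in zip(induced, gold):
        counts[c][g] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
    correct = sum(mapping[c] == g for c, g in zip(induced, gold))
    return correct / len(gold)
```

Note how inducing one cluster per gold-tag token would score a perfect 1.0 under m-1, which is exactly the tagset-size sensitivity the slide warns about; one-to-one mapping and the information-theoretic measures (vi, vm) penalise this.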

  18. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  19. What do we need? (tl;dr: Christodoulopoulos et al., 2010; 2011: multiple systems examined) • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS • Most successful properties: • Use of morphology features • One cluster per word type [Charts: average performance on 8 languages (2010 review) and on 22 languages (2011 review)]

  20. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  21. Bayesian Multinomial Mixture Model (BMMM) • Three key properties: • One tag per word • Helpful for unsupervised systems • Mixture Model (instead of HMM) • Easier to handle non-local features • Easy to add multiple features • e.g. morphology, alignments SPOILER ALERT!!! I’ll be adding dependencies

  22. BMMM: Basic model structure [Plate diagram with variables θ, α, z, φ, f, β and plates M, n_j, Z] • For each word type i, choose a class z_i (conditioned on θ) • For each token j of that type, choose a feature f_ij (conditioned on φ_{z_i})
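The generative story on this slide can be sketched as a forward sampler. This is my own illustrative rendering, not the thesis code; hyperparameter names `alpha` and `beta` follow the plate diagram, and the "one tag per word type" property shows up as a single class draw per type:

```python
import random

def dirichlet(alphas):
    """Sample from a Dirichlet via normalised Gamma draws."""
    xs = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(xs)
    return [x / s for x in xs]

def sample_categorical(probs):
    """Draw an index from a categorical distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(num_types, num_classes, num_features,
                    tokens_per_type, alpha=1.0, beta=0.1):
    """Forward sample from the basic BMMM story:
    theta ~ Dir(alpha); for each class z, phi_z ~ Dir(beta);
    each word TYPE i gets one class z_i ~ theta (one tag per type);
    each token of that type emits a feature f ~ phi_{z_i}."""
    theta = dirichlet([alpha] * num_classes)
    phi = [dirichlet([beta] * num_features) for _ in range(num_classes)]
    corpus = []
    for _ in range(num_types):
        z = sample_categorical(theta)
        feats = [sample_categorical(phi[z]) for _ in range(tokens_per_type)]
        corpus.append((z, feats))
    return corpus
```

Because the mixture model has no transition structure (unlike an HMM), adding a new feature type only means adding another φ block, which is what makes morphology, alignment and dependency features cheap to plug in.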

  23. BMMM: Extended model [Plate diagram: as in the basic model, but the class z now conditions T sets of token-level features f(1)…f(T), each with its own distributions φ(1)…φ(T) and priors β(1)…β(T), as well as type-level features with φ(m) and β(m)]

  24. Development results (averaged over 8 languages) [chart]

  25. Final results (using the +morph system) [chart]

  26. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Alignment method based on PoS

  27. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Interdependence of linguistic structure • Alignment method based on PoS

  28. Putting the syntax in syntactic categories • Induced dependencies as BMMM features • DMV (Klein & Manning, 2004) • Basis for most dependency parsers • Uses parts of speech as terminal nodes • Proxy for a joint model • Induce PoS that help DMV induce dependencies… • …that help induce better PoS… • …repeat… For example: Cohen & Smith (2009); Headden et al. (2009); Gillenwater et al. (2010); Blunsom & Cohn (2011); Spitkovsky et al. (2010a,b, 2011a,b,c)

  29. The Iterated Learning (IL) Model [Diagram: BMMM induces PoS from L+R context and morphology features; DMV induces dependencies from the tagged text; the dependencies are fed back into BMMM as an extra feature set, and the loop repeats. Example of retagging across iterations: “This/32 is/12 a/32 tagged/1 sentence/28 ./0” becomes “This/15 is/9 a/15 tagged/1 sentence/28 ./41”] • Experiments on WSJ and CoNLL (≤10 words)
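The loop on this slide reduces to a simple alternation. A sketch, with `induce_pos` and `induce_deps` as stand-ins for the actual BMMM and DMV systems (the function names are mine):

```python
def iterated_learning(corpus, induce_pos, induce_deps, num_iters=5):
    """Iterated Learning between PoS and dependencies:
    the PoS inducer (BMMM) tags the corpus, using dependency
    features from the previous round when available; the dependency
    inducer (DMV) then parses the tagged corpus; its output feeds
    the next round.  A proxy for joint inference over both levels."""
    deps = None
    for _ in range(num_iters):
        pos = induce_pos(corpus, deps)   # BMMM step (deps as extra features)
        deps = induce_deps(corpus, pos)  # DMV step (PoS as terminal nodes)
    return pos, deps
```

In the first round `deps` is `None`, so the PoS inducer falls back on context and morphology alone, matching the diagram's left-to-right bootstrapping.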

  30. IL – Dependencies: WSJ10 results [chart]

  31. IL – Dependencies: average over 9 CoNLL languages (≤10 words) [chart]

  32. IL – Dependencies: Shortcomings • DMV is not state of the art • Best systems surpass it by more than 15% accuracy (for WSJ) • Fix: replace the DMV component with a state-of-the-art system (TSG-DMV) • (Results not shown here; tl;dr: slightly better results on PoS, much worse on deps. Interesting for further discussion.) • Trained/tested only on ≤10-word sentences • Hard to compare the PoS inducer • Not realistic • Fix: use full-length-sentence corpora for training and testing

  33. Using full-length sentences: average over 9 CoNLL languages [chart]

  34. Recap • Both BMMM and DMV improve • Mostly in the first few iterations • Using full-length sentences: • Increase in BMMM above system with gold deps • DMV close to performance with gold PoS (but lower than ≤10-word case)

  35. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Interdependence of linguistic structure • Alignment method based on PoS

  36. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Interdependence of linguistic structure • Alignment method based on PoS

  37. Further IL experiments • Giza++ (Och & Ney, 2000) • Extension of the IBM 1-4 models (Brown et al., 1993) • Uses ‘word classes’ to condition alignment probabilities • These can be replaced with BMMM clusters • Hansards English-French corpus • Manually annotated alignments for 500 sentences • MULTEXT-East corpus • Orwell’s novel 1984 in 8 languages (incl. English)

  38. IL – Alignments: Hansards corpus [chart]

  39. IL – Alignments: MULTEXT-East corpus [chart]

  40. Recap • Iterated learning between PoS and X • X = {Dependencies, Alignments, Morphology} • Effective proxy for joint inference • PoS induction helped by all other levels • A test for theories of PoS • A joint model of NLP

  41. A joint model of NLP [Diagram: Morphology, Parts of Speech, Syntax and Alignments connected to each other rather than arranged in a pipeline]

  42. Induction chains

  43. Induction chains [chart]

  44. What do we need? • Theory of syntactic categories • What are we looking for? • Clustering method • Review of existing methods • Multiple sources of information • Interdependence of linguistic structure • Alignment method based on PoS • And one more thing: • Massively parallel corpus • Bible translations • Collected from online versions of the Bible • Cleaned up and verse-aligned (CES level 1 XML) • 100 languages

  45. Cross-lingual clusters (tl;dr: my thesis) • Unsupervised syntactic category induction • Theory of syntactic categories • Review of systems/evaluation metrics • Iterated learning & induction chains • Holistic view of NLP (no more pipelines!) • Cross-lingual clusters • Tool for linguistic enquiry • Reveal similarities/differences across languages [Figure: English–Greek cross-lingual clusters, as on slide 7, with glossed examples. 3rd-person present tense: “θελει να καμει”, lit. ‘(she) wants to (she) make’, i.e. ‘she wants to make’. Subjunctive mood: “ο δε Κυριος ας καμη το αρεστον”, lit. ‘the and Lord let (he) do the pleasing’, i.e. ‘and the Lord do [that which seemeth him] good’.]

  46. Where can we go from here? • Fully joint models • Preliminary attempts for PoS & Dependencies • Evaluation methods • Non-gold-standard based (Smith, 2012) • “Syntactically aware” categories • CCG type induction (Bisk & Hockenmaier, 2012) • Linguistic analysis • Invite the Romantics back! THE END

  47. A fully joint model • Maximise jointly the distributions over PoS and dependency trees • Run a full training step of DMV every time BMMM samples a new PoS sequence • Intractable • Solution: • Train DMV on partial trees (up to a depth d) • Comparable results with best IL models (also, still quite slow)
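One way to read "partial trees (up to a depth d)" is pruning every arc whose dependent sits deeper than d below the root. The slide does not spell out the pruning, so this is a plausible sketch under that assumption, with heads encoded as indices and the root marked by -1:

```python
def node_depth(i, heads):
    """Depth of token i in a single-rooted dependency tree,
    where heads[i] is the index of i's head and the root has head -1."""
    depth = 0
    while heads[i] != -1:
        i = heads[i]
        depth += 1
    return depth

def partial_tree(heads, d):
    """Keep only arcs whose dependent is at depth <= d;
    deeper tokens are detached (head = None)."""
    return [h if node_depth(i, heads) <= d else None
            for i, h in enumerate(heads)]
```

Training DMV only on the retained shallow arcs bounds the work done inside each BMMM sampling step, which is the tractability trade-off the slide describes.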

  48. TSG-DMV (Blunsom & Cohn, 2010) • Tree Substitution Grammar • CFG subset of LTAG • Lexicalised • Eisner’s (2000) split-head constructions • Allows for modelling longer-range dependencies • Pitman-Yor process (Teh, 2006) over TSG trees

  49. IL – TSG-DMV: average over 9 CoNLL languages (≤10 words) [chart]

  50. Using full-length sentences – TSG-DMV: average over 9 CoNLL languages [chart]
