
John Tinsley

ACL 4 NCLT Seminar Presentation, 7th June 2006. John Tinsley. Morphological Analysis of Spanish Using Finite-State Transducers. Introduction. What is this project about? Provide morphological information on Spanish strings. Generate strings from morphological descriptions.


Presentation Transcript


  1. ACL 4 NCLT Seminar Presentation, 7th June 2006 John Tinsley Morphological Analysis of Spanish Using Finite-State Transducers

  2. Introduction • What is this project about? • Provide morphological information on Spanish strings • Generate strings from morphological descriptions • What were my aims? • Robust, fast application, easily integrated into other systems • 80% token coverage on unrestricted text • 100% coverage of Spanish morphology

  3. Design Methodology • Formalisation • Discovery of Spanish morphological rules • Implementation • Coding of morphological model with Xerox Finite-State Tools • Evaluation • Check for accuracy & well-formedness • Assess language coverage

  4. Formalisation

  5. Spanish Morphology - Verbs • Inflected for person, tense/mood, number • Regular verbs • 3 regular conjugations identified by infinitive endings • ‘-ar’, ‘-er’, and ‘-ir’ • Irregular verbs • 66 distinct irregularities • Varying degrees of irregularity

  6. Spanish Morphology - Nouns • Inflected for number, gender • 7 types of noun • Feminine, masculine, neutral, derivative, profession, number invariant, proper • Irregularities • All arise via pluralisation • Accentuation, character alterations

  7. Spanish Morphology - Adjectives • Inflected for number, gender • 4 types of adjective • Neutral, derivative, profession, irregular • Adverbs derived from adjectives by addition of suffix ‘mente’

  8. Implementation

  9. Xerox Finite-State Tools - lexc • Lexicon compiler • Compiles ‘continuation classes’ into lexical transducers

  10. Xerox Finite-State Tools - xfst • Xerox finite-state tool • Compiles regular expressions into networks • Regular expression replace rules [ String -> Replacement || left-context _ right-context ]

  11. Xerox Finite-State Tool - example • conocer - ‘to know’ • 1st person, pres. ind. ‘conozco’ • Lexical transducer mappings • conoc:conoc • er+Verb:ε • +PresInd:^PresInd • +1P+Sg:o

  12. Xerox Finite-State Tool - example cont. • Composed replace rule [ c -> {zc} || _ %^PresInd ] • Triggered by the ^PresInd tag • Makes the required change; the trigger is then removed
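
The conocer example on slides 11 and 12 can be traced by hand. The following is a minimal Python sketch, not the xfst implementation: it emulates the four lexical mappings and the replace rule with ordinary string operations. The function names and the full lexical string `conocer+Verb+PresInd+1P+Sg` are assumptions made for illustration.

```python
import re

# Hypothetical miniature of the mappings on slide 11: (lexical side, surface side).
MAPPINGS = [
    ("conoc", "conoc"),
    ("er+Verb", ""),            # er+Verb maps to the empty string
    ("+PresInd", "^PresInd"),   # the tag becomes a trigger
    ("+1P+Sg", "o"),
]


def lexical_to_intermediate(lexical: str) -> str:
    """Consume the lexical string left to right, emitting the surface side."""
    out = []
    rest = lexical
    for upper, lower in MAPPINGS:
        if not rest.startswith(upper):
            raise ValueError(f"unexpected input at {rest!r}")
        out.append(lower)
        rest = rest[len(upper):]
    if rest:
        raise ValueError(f"unconsumed input {rest!r}")
    return "".join(out)


def apply_rules(intermediate: str) -> str:
    """Emulate [ c -> {zc} || _ ^PresInd ], then delete the trigger."""
    changed = re.sub(r"c(?=\^PresInd)", "zc", intermediate)
    return changed.replace("^PresInd", "")


intermediate = lexical_to_intermediate("conocer+Verb+PresInd+1P+Sg")
surface = apply_rules(intermediate)
print(intermediate, "->", surface)
```

In the real system the lexicon and the rules are composed into a single transducer, so the intermediate string never appears; the sketch keeps it visible only to show what the trigger does.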

  13. Verb Lexicon • Coded in lexc • Model has 3 regular paths • 66 varieties of irregularity • e.g. poder ‘to be able to’
      LEXICON Irreg43
      0:^UE^VSoue^PRET1^FR    ErV ;
      [ o -> {ue} || _ Consonant^<4 [ %^UE ?* [ [%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl] ] ] ]
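
To make the poder rule concrete, here is a hedged Python sketch that emulates the o -> ue replace rule over a list of symbols, so that multicharacter triggers such as ^UE stay atomic as they do in xfst. The verb endings and the ^1PPl trigger used in the examples are assumptions; the slide shows only the Irreg43 entry and the rule itself.

```python
CONSONANTS = set("bcdfghjklmnñpqrstvwxyz")
TENSE_TRIGGERS = {"^PresInd", "^PresSubj"}
PERSON_TRIGGERS = {"^1PSg", "^2PSg", "^3PSg", "^3PPl"}


def right_context_matches(symbols, start):
    """Check Consonant^<4 ^UE ?* tense ?* person, beginning at index start."""
    i = start
    seen = 0
    # Consonant^<4: skip at most three consonants.
    while i < len(symbols) and symbols[i] in CONSONANTS and seen < 3:
        i += 1
        seen += 1
    if i >= len(symbols) or symbols[i] != "^UE":
        return False
    rest = symbols[i + 1:]
    for j, sym in enumerate(rest):
        if sym in TENSE_TRIGGERS:
            # A person trigger must follow the tense trigger somewhere.
            return any(s in PERSON_TRIGGERS for s in rest[j + 1:])
    return False


def diphthongise(symbols):
    """Rewrite o as ue wherever the right context licenses it."""
    return [
        "ue" if s == "o" and right_context_matches(symbols, k + 1) else s
        for k, s in enumerate(symbols)
    ]


def surface(symbols):
    """Drop every trigger symbol and join what is left."""
    return "".join(s for s in symbols if not s.startswith("^"))


STEM = list("pod") + ["^UE", "^VSoue", "^PRET1", "^FR"]
first_sg = STEM + ["^PresInd", "o", "^1PSg"]           # assumed ending
first_pl = STEM + ["^PresInd"] + list("emos") + ["^1PPl"]  # assumed ending

print(surface(diphthongise(first_sg)))
print(surface(diphthongise(first_pl)))
```

Because the first person plural trigger is not among the four listed in the rule, the stem vowel is left alone in that form, which is the behaviour the rule's right context is designed to give.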

  14. Noun Lexicon
      LEXICON NounFem            ! Feminine Nouns
      !STEM     !CONT. CLASS     ! GLOSS
      acción    fIsNounEs ;      ! action
      LEXICON fIsNounEs          ! feminine pluralised with 'es'
      +Noun:0   fNounPluralES ;
      LEXICON fNounPluralES
      +Sg+Fem:0           # ;
      +Pl+Fem:^NZ^NOes    # ;
      [z -> c || _ %^NZ]
      [ó -> o || _ ?^<5 %^NO ]
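
The two noun rules can be emulated in the same way. This is a minimal Python sketch, assuming the plural path appends the symbols ^NZ ^NO e s to the stem as the lexicon above suggests; the helper names and the second example noun, luz, are hypothetical and not taken from the slides.

```python
PLURAL_SUFFIX = ["^NZ", "^NO", "e", "s"]  # surface side of +Pl+Fem


def apply_z_rule(symbols):
    """Emulate [z -> c || _ ^NZ]: z becomes c directly before the ^NZ trigger."""
    return [
        "c" if s == "z" and k + 1 < len(symbols) and symbols[k + 1] == "^NZ" else s
        for k, s in enumerate(symbols)
    ]


def apply_accent_rule(symbols):
    """Emulate [ó -> o || _ ?^<5 ^NO]: at most four symbols may separate ó and ^NO."""
    return [
        "o" if s == "ó" and "^NO" in symbols[k + 1:k + 6] else s
        for k, s in enumerate(symbols)
    ]


def plural(stem):
    """Append the plural suffix, apply both rules, then drop the triggers."""
    symbols = list(stem) + PLURAL_SUFFIX
    symbols = apply_accent_rule(apply_z_rule(symbols))
    return "".join(s for s in symbols if not s.startswith("^"))


print(plural("acción"))
print(plural("luz"))  # hypothetical stem exercising the z -> c rule
```

Applying one rule after the other mirrors composition of the rule transducers; for these two rules the order does not affect the result, since neither rewrites a symbol the other one looks at.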

  15. Adjective Lexicon • Same process as noun lexicon • Uses the same replace rules • One exception for adverbs
      LEXICON nIsAdjS
      +Adj:0                 nAdjPluralS ;
      +Adj|+Adv:^AAOmente    # ;
      [o -> a || _ %^NAO %^AAO {mente}]

  16. Other Transducers
      • Overgeneration Filter, e.g. llover ‘to rain’
        ~[ $[ {llov} ?* [ [%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
      • Capitalisation
        [ a (->) A || .#. _ ]
      • Trigger Remover
        [ %^IE -> 0 ]
      • Execution script
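
The overgeneration filter for llover excludes every analysis in which the stem is followed by a first or second person tag, or by third person plural. A rough Python equivalent, written as a regular expression over analysis strings, might look like this; the example analysis strings are assumptions about the tag layout, pieced together from the tags shown on other slides.

```python
import re

# Mirrors ~[ $[ {llov} ?* [ [+1P | +2P] [+Sg | +Pl] | [+3P +Pl] ] ] ]:
# reject any string that CONTAINS llov followed later by a disallowed tag pair.
OVERGENERATED = re.compile(r"llov.*(?:(?:\+1P|\+2P)(?:\+Sg|\+Pl)|\+3P\+Pl)")


def keep(analysis: str) -> bool:
    """Return True when the analysis survives the filter."""
    return OVERGENERATED.search(analysis) is None


candidates = [
    "llover+Verb+PresInd+3P+Sg",   # assumed tag layout
    "llover+Verb+PresInd+1P+Sg",
    "llover+Verb+PresInd+3P+Pl",
    "hablar+Verb+PresInd+1P+Sg",
]
print([a for a in candidates if keep(a)])
```

In xfst the filter is composed with the lexical side of the transducer, so the unwanted paths are removed from the network once rather than checked per lookup as they are here.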

  17. Evaluation

  18. Testing • Accuracy • Maintaining integrity of existing rules • Projection • Subtraction • Well-formedness • Ensuring tag order

  19. Assessing Coverage • Aim – 80% on unrestricted text • Statistical predictions (Crystal 1997) • Corpus compilation and processing • Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/) • Phase 1 – augmentation • Phase 2 – 81% coverage • Final assessment – 84.15% coverage
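
Token coverage here is read as the share of corpus tokens that receive at least one analysis. A minimal sketch of that computation follows, assuming an analyser exposed as a function from a token to a list of analyses; the toy lexicon, its tags, and the function names are hypothetical stand-ins for the compiled transducer.

```python
from typing import Callable, Iterable, List


def token_coverage(tokens: Iterable[str],
                   analyse: Callable[[str], List[str]]) -> float:
    """Percentage of tokens for which the analyser returns any analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return 100.0 * covered / len(tokens)


# Hypothetical stand-in for a transducer lookup.
TOY_ANALYSES = {
    "la": ["la+Det+Fem+Sg"],
    "casa": ["casa+Noun+Sg+Fem", "casar+Verb+PresInd+3P+Sg"],
    "es": ["ser+Verb+PresInd+3P+Sg"],
}


def toy_analyse(token: str) -> List[str]:
    return TOY_ANALYSES.get(token.lower(), [])


coverage = token_coverage(["La", "casa", "es", "xyzzy"], toy_analyse)
print(f"{coverage:.2f}% coverage")
```

Counting tokens rather than distinct word types means frequent words weigh more heavily, which is why augmenting the lexicon with common unanalysed forms raises the figure quickly.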

  20. Further Details • Generates approx. 44,000 unique morphological descriptions • Evaluation corpus – 1.26 analyses per input token on average

  21. Possible improvements • Increase coverage • lexicon augmentation • Disambiguation using POS tagger • More derivational morphology • Deal with different dialects of Spanish

  22. References • (Beesley & Karttunen 2003) Beesley, K. and Karttunen, L. Finite State Morphology. CSLI Publications, United States, 2003. • (Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición. Editorial Claret, Barcelona, 2005. • (Crystal 1997) Crystal, D. The Cambridge Encyclopedia of Language (2nd ed.). Cambridge University Press, 1997. • (Europarl) Europarl Parallel Corpus. http://people.csail.mit.edu/koehn/publications/europarl/ (last accessed 19/05/2006). • (Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990. • (Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997. • (Real Academia Española) http://www.rae.es/ (last accessed 25/05/2006).

  23. Conclusions Demonstration

  24. LEXICON ArVerbs
      !STEM      !CONT. CLASS    !GLOSS
      abord      ArV ;           !to approach
      LEXICON ArV
      ar+Verb:0  ArConj ;
      LEXICON ArConj
      !TAGS                  !CONT.CLASS
      +PresInd:^PresInd      ArPresInd ;
      +PretInd:^PretInd      ArPretInd ;
      LEXICON ArPresInd      ! Present Indicative
      +1P+Sg:o^1PSg          # ;
      +2P+Sg:as^2PSg         # ;
      +3P+Sg:a^3PSg          # ;
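
The way continuation classes chain together can be illustrated without lexc. The sketch below is an assumption-laden Python emulation: it stores the entries from this slide in a dictionary and follows each continuation class until the end marker #, collecting the lexical and intermediate sides along the way. The ArPretInd lexicon is not shown on the slide, so it is left empty here.

```python
# Each lexicon maps to a list of (lexical side, surface side, continuation class).
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [
        ("+PresInd", "^PresInd", "ArPresInd"),
        ("+PretInd", "^PretInd", "ArPretInd"),
    ],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
    "ArPretInd": [],  # not shown on the slide; left empty in this sketch
}


def expand(lexicon, upper="", lower=""):
    """Yield every (lexical, intermediate) pair reachable from a lexicon."""
    if lexicon == "#":
        yield upper, lower
        return
    for entry_upper, entry_lower, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + entry_upper, lower + entry_lower)


pairs = dict(expand("ArVerbs"))
for lexical, intermediate in pairs.items():
    print(lexical, "->", intermediate)
```

The intermediate strings still carry triggers such as ^PresInd and ^1PSg; in the full system the replace rules and the trigger remover are composed on top of this lexicon to give the final surface forms.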
