Developing a Persian Part of Speech Tagger

Karine Megerdoomian, University of California, San Diego, karinem@ling.ucsd.edu


Presentation Transcript


  1. Developing a Persian Part of Speech Tagger
     Karine Megerdoomian
     University of California, San Diego
     karinem@ling.ucsd.edu

  2. Part of speech tagging applications
     • Machine translation
     • Information extraction
     • Parsing
     Overview of issues
     • Encoding issues
     • Long-distance dependencies
     • Boundaries of words and phrases
     • Complex tokens and multiword expressions

  3. Types of Taggers
     • Knowledge-based taggers
       • Based on grammar rules
       • Can analyze complex structures
       • Cannot guess unknowns
     • Statistical taggers
       • Trained on a pre-tagged corpus
       • Can guess unknowns
       • Saturation
     • Hybrid taggers
       • Knowledge-based for tagging
       • Statistical for guessing and disambiguating
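The hybrid design on this slide can be sketched as follows. This is a minimal illustration, not the author's system: a plain lexicon lookup stands in for the knowledge-based component, and a suffix-frequency table trained on a tiny pre-tagged corpus stands in for the statistical guesser. The function names, the transliterated toy words, and the tag labels `N` and `V` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_suffix_guesser(tagged_words, n=2):
    """Learn the most frequent tag for each word-final n-character suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word[-n:]][tag] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}

def tag(tokens, lexicon, guesser, n=2, default="N"):
    """Tag known words from the lexicon; guess unknown words by suffix."""
    result = []
    for tok in tokens:
        if tok in lexicon:
            # Stand-in for the knowledge-based component.
            result.append((tok, lexicon[tok]))
        else:
            # Statistical fallback for words the lexicon does not cover.
            result.append((tok, guesser.get(tok[-n:], default)))
    return result

# Toy data in Latin transliteration (hypothetical, for illustration only).
corpus = [("raftand", "V"), ("goftand", "V"), ("ketab", "N")]
lexicon = {"mardom": "N"}
guesser = train_suffix_guesser(corpus)
tagged = tag(["mardom", "dadand"], lexicon, guesser)
```

Here `mardom` is tagged from the lexicon, while the unknown `dadand` is guessed as a verb because its final two characters match the verbs seen in training.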

  4. Tagset Design
     • Tagset = set of annotation tags used to classify analyzed tokens
     • The tagset provides relevant linguistic information about the syntactic and semantic properties of the word
     • Tagset design depends on the final application of the system
       • Part of speech
       • Part of speech types (for information retrieval)
       • Boundaries (for parsing)
       • Semantic information (for word sense disambiguation)
     • Tagset size should be small for probabilistic machines

  5. Script and Encoding
     • Diacritics are optional: پست
     • Multiple encodings
       • Persian Unicode (\u06a9 for ک and \u06cc for ی)
       • Arabic Unicode (\u0643 for ك and \u064a or \u0649 for ي)
     • Control characters: Unicode \u200c (zero-width non-joiner) marks the final form of letters
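The encoding issues above suggest a normalization pass before tagging. The sketch below is one possible way to do it, not the method used in the presentation: it maps the Arabic-block kaf and yeh code points to their Persian counterparts and drops the optional diacritics. Treating U+0649 as a variant of yeh and stripping the whole range U+064B to U+0652 are assumptions; the names `ARABIC_TO_PERSIAN`, `DIACRITICS`, and `normalize` are hypothetical.

```python
# Arabic-block letters mapped to the code points used in Persian text.
ARABIC_TO_PERSIAN = {
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06cc",  # ALEF MAKSURA -> FARSI YEH (assumption: used as yeh)
}

# Optional short-vowel and related marks, U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def normalize(text: str) -> str:
    """Unify kaf/yeh variants and remove optional diacritics."""
    out = []
    for ch in text:
        if ch in DIACRITICS:
            continue  # diacritics are optional, so drop them
        out.append(ARABIC_TO_PERSIAN.get(ch, ch))
    return "".join(out)
```

The zero-width non-joiner (\u200c) is deliberately left untouched, since it carries information about word-internal boundaries.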

  6. Word Boundaries
     • Optional whitespace
       • Can use a post-segmentation script to tokenize: رفتندمردم
     • Separated affixes
       • می دادند vs. میدادند
       • فلسطينی ها vs. فلسطينیها
     • Complex tokens
       • Two different POS categories: بشيوه – اينکار – بهترست – دردفتر
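The separated-affix cases can be handled with a small pre-tokenization step. The following is a rough sketch under stated assumptions: only the prefix می and the plural suffix ها are covered, the input is assumed to use the Persian yeh (\u06cc), and detached affixes are reattached with a zero-width non-joiner. The function name `attach_affixes` is hypothetical. Fused forms such as میدادند are not split here, because a bare prefix test would also match unrelated words beginning with the same letters; that case would need a lexicon or morphological analyzer.

```python
ZWNJ = "\u200c"
MI = "\u0645\u06cc"  # prefix written as a separate token
HA = "\u0647\u0627"  # plural suffix written as a separate token

def attach_affixes(text):
    """Split on whitespace, then rejoin detached affixes to their host word."""
    tokens = []
    attach_next = False
    for raw in text.split():
        if attach_next:
            # The previous token was the detached prefix: join it to this one.
            tokens[-1] = tokens[-1] + ZWNJ + raw
            attach_next = False
        elif raw == HA and tokens:
            # A detached plural suffix: join it to the preceding word.
            tokens[-1] = tokens[-1] + ZWNJ + raw
        else:
            tokens.append(raw)
            attach_next = (raw == MI)
    return tokens
```

With this step, a verb written with a space after the prefix and the same verb written with a zero-width non-joiner end up as the same single token.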

  7. Multiword Expressions
     • Lexical units
       • Usually need to be listed in the lexicon: بنابراين
     • Morphological units
       • Can be analyzed in the morphological analyzer: فروخته بوده اند
     • Phrasal verbs
       • Have syntactic properties, so they are analyzed at an intermediate level between morphology and syntax: اظهار تأسف کردند
     • Further examples: ماشين لباسشوئی – خمرهای سرخ
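For lexical units that must be listed in the lexicon, one simple option is a greedy longest-match lookup over the token stream. The sketch below is an illustration only, not the approach described in the talk: the function name `merge_multiword`, the `max_len` parameter, and the transliterated toy lexicon are assumptions.

```python
def merge_multiword(tokens, lexicon, max_len=3):
    """Greedily merge token sequences that are listed as multiword units."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No listed unit starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out

# Toy lexicon in Latin transliteration (hypothetical entries).
lexicon = {("mashin", "lebasshuyi")}
merged = merge_multiword(["yek", "mashin", "lebasshuyi", "kharid"], lexicon)
```

Phrasal verbs are a poor fit for this kind of fixed lookup, since their parts can be separated and inflected, which is consistent with the slide placing them at a level between morphology and syntax.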

  8. Phrasal Boundaries
     • Noun phrases are highly ambiguous
       • Short vowels are not written
       • Lack of overt boundary markers
       • No particles linking NP elements (اضافه is often not written)
     • Subject-Object-Verb order
     • Very long sentences
     • Relatively free word order
     • Boundary markers
       • Proper names and pronouns: وزير خارجه آينده آمريکا
       • Morphemes: ی، تان/شان، را
     • Lack of boundary (اضافه): انفجارهای اخيرعراق

  9. Long-Distance Dependencies
     • Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix, the person inflection, and the auxiliary forms
     • This is a problem for linear approaches (e.g., two-level morphology)
     • Imperfect: میگريختند
     • Compound imperfect: می گريخته است
     • Perfect: گريخته است
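The three examples on this slide can be told apart only by looking at several features together. The sketch below makes that dependency explicit; it assumes an earlier analysis step has already extracted three boolean features from the verb group, and the `VerbGroup` type, its field names, and the `tense` function are hypothetical.

```python
from typing import NamedTuple

class VerbGroup(NamedTuple):
    has_mi_prefix: bool   # the verb carries the prefix seen in the imperfect
    is_participle: bool   # the main verb is in its participle form
    has_auxiliary: bool   # an auxiliary form follows the main verb

def tense(v: VerbGroup) -> str:
    """Decide the tense from the co-occurrence of prefix, form and auxiliary."""
    if v.is_participle and v.has_auxiliary:
        # Only the prefix separates these two tenses.
        return "compound imperfect" if v.has_mi_prefix else "perfect"
    if v.has_mi_prefix and not v.is_participle:
        return "imperfect"
    return "unknown"
```

No single feature is enough: the prefix alone cannot separate the imperfect from the compound imperfect, and the auxiliary alone cannot separate the compound imperfect from the perfect.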

  10. Phonetics and Phonology
     • Consistent phonological patterns
       • The form of the affix varies with the last character of the word: گدايان – بيگانگان – دانشجويي
       • Phonological rules apply across categories, so there is no need to list all possible forms if rules are used
     • Mismatch between orthography and phonetics
       • Words need to be distinguished based on their pronunciation: دانشجو vs. گاو
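The affix variation and the orthography mismatch can be illustrated with the plural suffix. This is a simplified sketch rather than a full rule set: it assumes normalized Persian code points, assumes a word-final ه is the silent one, and takes a hypothetical flag `final_vav_is_vowel` to supply the pronunciation that the spelling does not show. The function name `plural_an` is also an assumption.

```python
ALEF = "\u0627"
VAV = "\u0648"
HEH = "\u0647"
YEH = "\u06cc"
GAF = "\u06af"
NOON = "\u0646"

def plural_an(word: str, final_vav_is_vowel: bool = False) -> str:
    """Attach the plural suffix, choosing its form from the word ending."""
    if word.endswith(HEH):
        # Assumption: the final heh is silent; it is replaced by gaf.
        return word[:-1] + GAF + ALEF + NOON
    if word.endswith(ALEF) or (word.endswith(VAV) and final_vav_is_vowel):
        # Vowel-final stems take a linking yeh before the suffix.
        return word + YEH + ALEF + NOON
    # Consonant-final stems take the plain suffix.
    return word + ALEF + NOON
```

The flag is what separates the slide's last pair: both words end in the same written letter, but only one of them ends in a vowel, so the spelling alone does not select the right suffix form.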

  11. Conclusion
     • Overview of the main challenges encountered in the development of a POS tagger for Persian
     • Introduced certain criteria to be considered in designing an annotation set for POS tagging
     • Contrasted various approaches and proposed possible methods for resolving these computational and linguistic issues
