Extending a Persian Morphological Analyzer to Blogs


Karine Megerdoomian

University of Maryland,

College Park

[email protected]

University of Tehran (دانشگاه تهران)

Second Research Workshop on the Persian Language and Computers (دومین کارگاه پژوهشی زبان فارسی و رایانه)


Talk Outline

  • Persian Weblogs

    • Persian is the 4th largest blog language in the world (~75,000 sites)

  • Description of a finite-state morphological analyzer for Persian

    • System description

    • Language issues and implementation

  • Computational issues in weblogs


Language of Blogs

  • Contain both formal and informal morphology

  • Morphology

    • Informal text is very different from formal

      مرا گرفته است → گرفتهتم

    • Features that don’t exist in formal

      فروشندهه؛ رفتش

    • Shortened verbal stems and inflection

      می گویند → میگن


Language of Blogs

  • Morphology

    • Colloquial pronunciation

      غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن

      ازشون ؛ خودتون ؛ نگاههایشان ؛ همسایهاشون

    • Spelling errors and non-standard punctuation & spacing

    • Emoticons and hyperlinks


Language of Blogs

  • Lexicon

    • Wordforms follow pronunciation

      اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم

    • Colloquial forms

      تو دانشگاه ؛ واسه استادام

    • New words

      لینکدونی ؛ دوستان کامنتگذار


Language of Blogs

  • Lexicon

    • Loan words

      چتروم ؛ آنلاین ؛ دانلود کنین

    • Interjections

      آاااخ! ؛ والا ؛ وای ؛ اوووه!

    • More idiomatic expressions

      دمشگرم آقا


Language of Blogs

  • Huge amount of variation!!

    • Need for flexible rules

    • Phonological rules to represent colloquial speech

    • Need to disambiguate (statistical component?)

  • Formal blog text is also different from traditional formal text


Language of Blogs

BBC | خوابگرد
موافق‌اند | موافقند
بیننده‌گان | بینندگان
کتاب‌اش | کتابش
کم‌تر | کمتر
کافی‌ست | کافیست
حتا | حتی


Finite-State Transducers (FST)

  • Two-level network or transducer

    • Input = lower-side of arc

    • Output = upper-side of arc

Upper side:  b i r d +Noun +Pl

Lower side:  b i r d s
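
The two-level idea above can be sketched in a few lines of Python. This is only an illustration for the single word on the slide: the arc list, the state numbers, and the choice to pair `+Noun` with an empty lower symbol and `+Pl` with `s` are assumptions, not something the slide specifies, and XFST itself is not used.

```python
# Each arc is (source state, lower symbol, upper symbol, target state).
# An empty lower symbol plays the role of epsilon (nothing is consumed).
# The alignment of +Noun and +Pl with the lower side is an assumption.
ARCS = [
    (0, "b", "b", 1),
    (1, "i", "i", 2),
    (2, "r", "r", 3),
    (3, "d", "d", 4),
    (4, "", "+Noun", 5),
    (5, "s", "+Pl", 6),
]
FINAL_STATES = {6}


def analyze(surface, state=0, pos=0, upper=""):
    """Return every upper-side string whose lower side spells `surface`."""
    results = []
    if pos == len(surface) and state in FINAL_STATES:
        results.append(upper)
    for src, lower_sym, upper_sym, dst in ARCS:
        if src != state:
            continue
        if lower_sym == "":
            # Epsilon on the lower side: emit the upper symbol, consume nothing.
            results.extend(analyze(surface, dst, pos, upper + upper_sym))
        elif surface.startswith(lower_sym, pos):
            results.extend(
                analyze(surface, dst, pos + len(lower_sym), upper + upper_sym)
            )
    return results


print(analyze("birds"))
```

Reading the input on the lower side and collecting the upper side gives analysis; swapping the roles of the two sides would give generation from the same network.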


MA: System Description

  • Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]

  • Components:

    • Lexicon and morphology rules (lexc)

    • Phonological rules (regular expressions)

    • Compiled into a FST (finite-state transducer)

  • An FST is created separately for each part of speech; these are then composed into the final FST for morphological analysis


MA: System Description

[Diagram: the Noun FST, Verb FST, Adverb FST and the phonology rules are combined by COMPOSITION into the final FST for morphology, which maps the input string to the output string.]
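
As a rough illustration of what composition does, the sketch below treats each component as a finite set of string pairs and composes a tiny lexicon relation with a tiny phonology relation. Everything in it is an assumption made for the example: the pairs are hand-written, and reading `_` as a placeholder that phonology turns into a zero-width non-joiner is a guess, not something stated in the talk.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def compose(first, second):
    """Relation composition: keep (a, c) whenever first has (a, b) and second has (b, c)."""
    return {(a, c) for (a, b) in first for (b2, c) in second if b == b2}


# Hypothetical lexicon relation: analysis -> intermediate form.
lexicon = {("ktab+Pl", "ktab_ha"), ("ktab+Sg", "ktab")}

# Hypothetical phonology relation: intermediate form -> surface form.
phonology = {("ktab_ha", "ktab" + ZWNJ + "ha"), ("ktab", "ktab")}

analyzer = compose(lexicon, phonology)
for analysis, surface in sorted(analyzer):
    print(analysis, "->", repr(surface))
```

A real finite-state toolkit composes the networks themselves rather than enumerating pairs, so the result stays compact even when the relations are infinite.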


MA: System Description

  • Coverage: formal Persian language

    • Full verbal conjugation

    • Nonverbal inflection مسافرین ؛ فقرا

    • Productive derivational morphology سرسامآور

    • ~20 phonological rules

    • Proper nouns of people, places, organizations


Inflectional Morphology

LEXICON Root
ktab Noun ;

LEXICON Noun
+Pl:ha # ;     ! کتابها
+Pl:_ha # ;    ! کتاب‌ها
+Sg:0 # ;      ! کتاب
+Pl:a # ;      ! کتابا (colloquial)
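
To make the continuation-class mechanism concrete, here is a small Python sketch that walks the two lexicons above and lists every analysis/surface pair they license. The dictionary layout and the `expand` function are illustrative assumptions; they only mimic how lexc follows continuation classes until it reaches `#`.

```python
# Lexicon name -> list of (upper string, lower string, continuation class).
# "#" marks the end of the word, as in lexc; "" stands for the lexc epsilon 0.
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),
        ("+Pl", "a", "#"),
    ],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


for analysis, surface in expand():
    print(analysis, "->", surface)
```

Adding a colloquial variant is then a matter of adding one more entry to the relevant lexicon, which is the kind of extension the talk proposes for blog text.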


Complex Tokens

  • Two different POS categories

    بعقیدهشما ؛ اینکار؛ بهترست - دردفتر ؛ وگفت

    bh+Prep<eqydh+Noun+Sg    بعقیده

    dr+Prep<dftr+Noun+Sg    دردفتر

    ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl    کتابهایمان

    bradr+Noun+Sg>av+Pron+Pers+Poss+3P+Sg>bvdn+Verb+Ind+Pres+3P+Sg    برادرشه
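
The analysis strings above can be taken apart mechanically. The sketch below assumes that `<` and `>` simply delimit the attached parts of a complex token and that the first `+`-separated field of each part is its lemma; the function name `split_analysis` is hypothetical.

```python
import re


def split_analysis(analysis):
    """Split an analysis on '<' and '>' and return (lemma, tags) for each part."""
    parts = re.split(r"[<>]", analysis)
    result = []
    for part in parts:
        fields = part.split("+")
        result.append((fields[0], fields[1:]))
    return result


for lemma, tags in split_analysis("ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl"):
    print(lemma, tags)
```

Each part keeps its own part-of-speech tag, which is how a single token can carry two different POS categories.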


Verbal Morphology

  • Two different stems


Verbal Morphology

LEXICON PastStem
tvanst Infl1 ;
rft Infl1 ;
xndyd Infl1 ;

LEXICON PresentStem
tvanst:tvan Infl2 ;
rft:rv Infl2 ;
xndyd:xnd Infl2 ;

LEXICON PstStemBlog
tvnst InflBlog1 ;

LEXICON PrStemBlog
tvanst:tvn Infl2 ;
rft:r Infl2 ;
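
One way to picture the effect of the extra blog lexicons is a stem table with a register column. The sketch below copies the stems from the lexicons above into a Python list; the table layout, the `register` labels, and the `lookup_stem` function are assumptions made for illustration.

```python
# (upper-side form, surface stem, tense, register), copied from the lexc entries.
STEMS = [
    ("tvanst", "tvanst", "past", "formal"),
    ("rft", "rft", "past", "formal"),
    ("xndyd", "xndyd", "past", "formal"),
    ("tvanst", "tvan", "present", "formal"),
    ("rft", "rv", "present", "formal"),
    ("xndyd", "xnd", "present", "formal"),
    ("tvnst", "tvnst", "past", "blog"),
    ("tvanst", "tvn", "present", "blog"),
    ("rft", "r", "present", "blog"),
]


def lookup_stem(stem, allow_blog=True):
    """Return (upper form, tense, register) for every entry whose surface stem matches."""
    return [
        (upper, tense, register)
        for upper, surface, tense, register in STEMS
        if surface == stem and (allow_blog or register == "formal")
    ]


print(lookup_stem("r"))
print(lookup_stem("r", allow_blog=False))
```

With the blog entries switched off, the shortened stem `r` receives no analysis, which mirrors the behaviour of a formal-only analyzer on colloquial verb forms.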


Long Distance Dependencies

  • Some verb tenses can only be determined from the co-occurrence of the prefix with the person inflection or auxiliary, which is a problem for linear approaches


Long Distance Dependencies

  • Leads to very complex paths and continuation classes in lexc

  • Using filters greatly increases the size of the FST

  • Use flag diacritics for unification (@U.Feature.Value@)

    • Keeps the FST small

    • Can apply constraints between non-adjacent morphemes
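
The unification check behind `@U.Feature.Value@` can be sketched as follows. This is a simplified model that handles only the U (unify) flag type; the feature name `Agree`, its values, and the placeholder morphemes `pre`, `stem`, `sufA` and `sufB` are hypothetical and do not come from the analyzer described in the talk.

```python
import re

# Matches a unification flag and captures its feature and value.
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def path_is_valid(symbols):
    """Accept a path only if all U flags for the same feature carry the same value."""
    seen = {}
    for symbol in symbols:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, no constraint
        feature, value = match.groups()
        if seen.setdefault(feature, value) != value:
            return False  # conflicting value for an already-set feature
    return True


def surface(symbols):
    """Flags are invisible in the output: drop them and join the rest."""
    return "".join(s for s in symbols if FLAG.fullmatch(s) is None)


good = ["@U.Agree.A@", "pre", "stem", "@U.Agree.A@", "sufA"]
bad = ["@U.Agree.A@", "pre", "stem", "@U.Agree.B@", "sufB"]
print(path_is_valid(good), surface(good))
print(path_is_valid(bad))
```

Because the check happens while a path is traversed, the prefix and the suffix can constrain each other without the network having to duplicate the stem lexicon for every prefix choice.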


Phonology Rules

(Slide callout: optional in informal blog text)

  • Form of affixes may change based on the ending character of the stem

    Formal: کتابش ؛ چشمهایش/صدایش ؛ همسایه‌اش

    Informal: کتابش ؛ چشماش/صداش ؛ همسایش

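# After a consonant, the ^NB boundary is deleted.
define clitic1 [ "^NB" -> 0 || Cons _ ] ;
# After a vowel, the boundary becomes y.
define clitic2 [ "^NB" -> y || Vowel _ ] ;
# After e, the boundary becomes a zero-width non-joiner followed by a.
define clitic3 [ "^NB" -> "\u200c" a || e _ ] ;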

ktab^NBš

Sda^NBš

hmsaye^NBš
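
The effect of the three clitic rules on the example forms can be approximated with regular-expression substitutions in Python. This is a sketch under assumptions: the slide does not define `Cons` and `Vowel`, so the vowel set `a`, `u` and the treatment of every other symbol except `e` as a consonant are guesses made for the example.

```python
import re

ZWNJ = "\u200c"   # zero-width non-joiner
VOWELS = "au"     # assumed vowel set; "e" is handled by its own rule


def apply_clitic_rules(form):
    """Rewrite the ^NB boundary according to the symbol that precedes it."""
    # clitic3: after e, insert ZWNJ followed by a.
    form = re.sub(r"(?<=e)\^NB", ZWNJ + "a", form)
    # clitic2: after a vowel, insert y.
    form = re.sub(rf"(?<=[{VOWELS}])\^NB", "y", form)
    # clitic1: after anything else (assumed consonant), delete the boundary.
    form = re.sub(rf"(?<=[^{VOWELS}e])\^NB", "", form)
    return form


for example in ["ktab^NBš", "Sda^NBš", "hmsaye^NBš"]:
    print(example, "->", repr(apply_clitic_rules(example)))
```

The three contexts are disjoint, so the order of the substitutions does not matter here. Making a rule optional for informal text would mean also keeping the output in which that rule did not apply.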


Evaluation

  • FST: 178,452 states; 928,982 arcs before optimization

  • Speed: 20.84 seconds of CPU time for a 10 MB file on a Sun SPARCstation

  • Coverage = 97.5%; accuracy = 95%

  • Unanalyzed tokens: proper nouns + missing lexicon words

  • No weblog language rules included yet!


Conclusion

  • Challenges in the morphological analysis of formal Persian text → solutions in the XFST system

  • New issues and variance due to blog language

  • Need for a robust system:

    • Lexicon updated with colloquial forms

    • Flexible morphological rules + derivational morphology rules

    • Transliteration component for loan words

    • Statistical approach to disambiguate and to deal with unknowns
