Extending a Persian Morphological Analyzer to Blogs
Extending a Persian Morphological Analyzer to Blogs. Karine Megerdoomian, University of Maryland, College Park, [email protected]. University of Tehran, Second Workshop on Persian Language and Computers.

EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS

Karine Megerdoomian

University of Maryland, College Park

[email protected]

University of Tehran (دانشگاه تهران)

Second Workshop on Persian Language and Computers (دومین کارگاه پژوهشی زبان فارسی و رایانه)


Talk Outline

  • Persian Weblogs

    • Persian is the 4th largest blog language in the world (~75,000 sites)

  • Description of a finite-state morphological analyzer for Persian

    • System description

    • Language issues and implementation

  • Computational issues in weblogs


Language of Blogs

  • Contain both formal and informal morphology

  • Morphology

    • Informal text is very different from formal

      Formal: مرا گرفته است

      Informal: گرفته‌تم

    • Features that don’t exist in formal

      فروشندهه؛ رفتش

    • Shortened verbal stems and inflection

      Formal: می گویند

      Informal: میگن


Language of Blogs

  • Morphology

    • Colloquial pronunciation

      غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن

      ازشون ؛ خودتون ؛ نگاههایشان ؛ همسایهاشون

    • Spelling errors and non-standard punctuation & spacing

    • Emoticons and hyperlinks


Language of Blogs

  • Lexicon

    • Wordforms follow pronunciation

      اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم

    • Colloquial forms

      تو دانشگاه ؛ واسه استادام

    • New words

      لینکدونی ؛ دوستان کامنتگذار


Language of Blogs

  • Lexicon

    • Loan words

      چتروم ؛ آنلاین ؛ دانلود کنین

    • Interjections

      آاااخ! ؛ والا ؛ وای ؛ اوووه!

    • More idiomatic expressions

      دمشگرم آقا


Language of Blogs

  • Huge amount of variation!!

    • Need for flexible rules

    • Phonological rules to represent colloquial speech

    • Need to disambiguate (statistical component?)

  • Formal blog text is also different from traditional formal text


Language of Blogs

BBC vs. خوابگرد (spelling variants of the same word):

موافق‌اند ؛ موافقند

بیننده‌گان ؛ بینندگان

کتاب‌اش ؛ کتابش

کم‌تر ؛ کمتر

کافی‌ست ؛ کافیست

حتا ؛ حتی


Finite-State Transducers (FST)

  • Two-level network or transducer

    • Input = lower-side of arc

    • Output = upper-side of arc

Upper side: b i r d +Noun +Pl

Lower side: b i r d s
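The two-level idea can be sketched in Python rather than XFST. This is a minimal illustration, assuming a hand-built arc list for the single word on the slide; `ARCS`, `FINAL`, and `analyze` are hypothetical names, and putting `+Noun` on an epsilon arc is an assumption about how the upper and lower symbols pair up.

```python
# Toy two-level network for the slide's example (not XFST output).
# Each arc is (source_state, lower_symbol, upper_symbol, target_state).
# An empty lower symbol is an epsilon: it consumes no input.
ARCS = [
    (0, "b", "b", 1),
    (1, "i", "i", 2),
    (2, "r", "r", 3),
    (3, "d", "d", 4),
    (4, "", "+Noun", 5),  # assumption: +Noun pairs with epsilon
    (5, "s", "+Pl", 6),
]
FINAL = {6}


def analyze(surface, state=0, pos=0, out=""):
    """Read `surface` on the lower side and collect every upper-side string."""
    results = []
    if pos == len(surface) and state in FINAL:
        results.append(out)
    for src, lower, upper, dst in ARCS:
        if src != state:
            continue
        if lower == "":
            # Epsilon arc: emit the upper symbol without consuming input.
            results += analyze(surface, dst, pos, out + upper)
        elif surface.startswith(lower, pos):
            results += analyze(surface, dst, pos + len(lower), out + upper)
    return results


print(analyze("birds"))
```

Here `analyze("birds")` walks the lower side and returns `["bird+Noun+Pl"]`; `bird` alone is rejected because this toy network has no singular path.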


MA: System Description

  • Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]

  • Components:

    • Lexicon and morphology rules (lexc)

    • Phonological rules (regular expressions)

    • Compiled into a FST (finite-state transducer)

  • An FST is built separately for each part of speech; these are then composed into the final FST for morphological analysis


MA: System Description

[Diagram: the input string passes through the final FST for morphology, built by COMPOSITION of the phonology rules with the Noun FST, Verb FST, and Adverb FST, and yields the output string.]
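How composition chains the components can be illustrated with finite relations in Python. This is only a sketch under stated assumptions: relations are plain sets of (upper, lower) pairs, the per-category lexicons are merged by union before being composed with phonology, the `+Verb` tag and the rule turning `_` into a zero-width non-joiner are hypothetical, and the transcript does not spell out how the real system wires these pieces together.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def compose(upper_rel, lower_rel):
    """Pair (a, c) whenever (a, b) is in upper_rel and (b, c) is in lower_rel."""
    return {(a, c) for a, b in upper_rel for b2, c in lower_rel if b == b2}


# Toy lexicon relations: analysis string on top, intermediate form below.
NOUN = {("ktab+Noun+Sg", "ktab"), ("ktab+Noun+Pl", "ktab_ha")}
VERB = {("rft+Verb", "rft")}  # hypothetical tag, for illustration only
LEXICON = NOUN | VERB

# Toy phonology relation (assumption): "_" is realized as ZWNJ.
PHONOLOGY = {(mid, mid.replace("_", ZWNJ)) for _, mid in LEXICON}

FINAL = compose(LEXICON, PHONOLOGY)


def analyze(surface):
    """Look up a surface form on the lower side of the composed relation."""
    return sorted(upper for upper, lower in FINAL if lower == surface)


print(analyze("ktab" + ZWNJ + "ha"))
```

The composed relation hides the intermediate form: a lookup goes straight from the surface string to the analysis.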


MA: System Description

  • Coverage: formal Persian language

    • Full verbal conjugation

    • Nonverbal inflection, e.g. مسافرین ؛ فقرا

    • Productive derivational morphology, e.g. سرسام‌آور

    • ~20 phonological rules

    • Proper nouns of people, places, organizations


Inflectional Morphology

Multichar_Symbols +Pl +Sg

LEXICON Root
ktab Noun ;

LEXICON Noun
+Pl:ha # ;     ! کتابها
+Pl:_ha # ;    ! کتاب‌ها
+Sg:0 # ;      ! کتاب
+Pl:a # ;      ! کتابا
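The continuation-class mechanism of lexc can be mimicked in a few lines of Python. This is a sketch, not lexc itself: `LEXICONS` and `expand` are hypothetical names, each entry is an (upper, lower, continuation) triple copied from the slide, and `#` marks the end of a word as it does in lexc.

```python
# Each lexicon maps to a list of (upper, lower, continuation) entries.
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),   # lexc writes the empty lower side as 0
        ("+Pl", "a", "#"),  # colloquial plural
    ],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (analysis, surface) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


for analysis, surface in expand():
    print(analysis, surface)
```

Following the continuation from `Root` into `Noun` yields four pairs, with three different surface forms sharing the `ktab+Pl` analysis.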


Complex Tokens

  • Two different POS categories

    بعقیدهشما ؛ اینکار ؛ بهترست ؛ دردفتر ؛ وگفت

    bh+Prep<eqydh+Noun+Sg (بعقیده)

    dr+Prep<dftr+Noun+Sg (دردفتر)

    ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl (کتابهایمان)

    bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl>bvdn+Verb+Ind+Pres+3P+Sg (برادرشه)
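Analyses of complex tokens like those above can be split back into their parts with a short Python helper. This is a sketch: `parse_analysis` is a hypothetical name, and reading `<` and `>` as boundaries between the attached words is an inference from the examples, not a documented convention.

```python
import re


def parse_analysis(analysis):
    """Split an analysis string into (lemma, tags) pairs, one per attached word."""
    segments = []
    # Assumption: "<" and ">" separate the words fused into one token.
    for part in re.split(r"[<>]", analysis):
        lemma, *tags = part.split("+")
        segments.append((lemma, tags))
    return segments


print(parse_analysis("dr+Prep<dftr+Noun+Sg"))
```

For `dr+Prep<dftr+Noun+Sg` this gives a preposition segment followed by a singular noun segment, which is the "two different POS categories" case named on the slide.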


Verbal Morphology

  • Two different stems


Verbal Morphology

! Past stems
LEXICON PastStem
tvanst Infl1 ;
rft Infl1 ;
xndyd Infl1 ;

! Present stems, paired with the past stem on the upper side
LEXICON PresentStem
tvanst:tvan Infl2 ;
rft:rv Infl2 ;
xndyd:xnd Infl2 ;

! Shortened colloquial stems found in blogs
LEXICON PstStemBlog
tvnst InflBlog1 ;

LEXICON PrStemBlog
tvanst:tvn Infl2 ;
rft:r Infl2 ;
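The stem lexicons above amount to a table from surface stems to a lemma and a stem type, which can be sketched in Python. `STEMS` and `lookup_stem` are hypothetical names, and filing the blog past stem `tvnst` under the lemma `tvanst` is an assumption: the slide lists it without an upper-side form.

```python
# lemma (formal past stem) -> {stem kind: surface stem}, copied from the slide
STEMS = {
    "tvanst": {
        "past": "tvanst",
        "present": "tvan",
        "past_blog": "tvnst",  # assumption: variant of tvanst
        "present_blog": "tvn",
    },
    "rft": {"past": "rft", "present": "rv", "present_blog": "r"},
    "xndyd": {"past": "xndyd", "present": "xnd"},
}


def lookup_stem(surface):
    """Return every (lemma, stem kind) whose surface stem equals `surface`."""
    return sorted(
        (lemma, kind)
        for lemma, forms in STEMS.items()
        for kind, form in forms.items()
        if form == surface
    )


print(lookup_stem("tvn"))
```

A shortened blog stem such as `tvn` then resolves to the same lemma as its formal counterpart `tvan`, which is the effect the extra blog lexicons are meant to achieve.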


Long Distance Dependencies

  • Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix and the person inflection or auxiliary, which is a problem for linear approaches


Long Distance Dependencies

  • Leads to very complex paths and continuation classes in lexc

  • Using filters greatly increases the size of the FST

  • Use flag diacritics for unification (@U.Feature.Value@)

    • Keeps the FST small

    • Can apply constraints between non-adjacent morphemes
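The behaviour of unification flags can be simulated in Python. This is a sketch of the idea only, not of the XFST runtime: `accept` is a hypothetical helper, a path is modelled as a list of symbols, and the feature name `Agr`, its values, and the morphemes are placeholders rather than actual Persian morphology.

```python
import re

# Matches unification flags of the form @U.Feature.Value@
FLAG = re.compile(r"^@U\.([^.@]+)\.([^.@]+)@$")


def accept(path):
    """Return the path's string with flags removed, or None if two flags clash."""
    features = {}
    output = []
    for symbol in path:
        match = FLAG.match(symbol)
        if match is None:
            output.append(symbol)  # ordinary symbol: keep it
            continue
        feature, value = match.groups()
        # Unification: the first flag sets the value, later ones must agree.
        if features.setdefault(feature, value) != value:
            return None
    return "".join(output)


print(accept(["@U.Agr.X@", "pre-", "stem", "@U.Agr.X@", "-suf"]))
print(accept(["@U.Agr.X@", "pre-", "stem", "@U.Agr.Y@", "-suf"]))
```

The prefix and suffix are checked against each other even though the stem sits between them, and no separate path per combination is needed in the network.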


Phonology Rules

  • Optional in informal blog text

  • Form of affixes may change based on the ending character of the stem

    Formal: کتابش ؛ چشمهایش/صدایش ؛ همسایهاش

    Informal: کتابش ؛ چشماش/صداش ؛ همسایش

define clitic1 [ "^NB" -> 0 || Cons _ ] ;

define clitic2 [ "^NB" -> y || Vowel _ ] ;

define clitic3 [ "^NB" -> "\u200c" a || e _ ] ;

ktab^NBš

Sda^NBš

hmsaye^NBš
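The effect of the three clitic rules on these intermediate forms can be approximated with regular-expression substitutions in Python. This is a sketch under assumptions: the vowel set `av` is a guess at what `Vowel` covers in this transliteration, every other symbol except `e` is treated as `Cons`, and the rules are applied one after another, which may differ from how they are combined in the actual system.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner
VOWEL = "av"     # assumption: transliterated vowels other than final e


def apply_clitic_rules(form):
    """Rewrite the ^NB boundary according to the symbol preceding it."""
    # clitic3: after e, the boundary becomes ZWNJ followed by a
    form = re.sub(r"(?<=e)\^NB", ZWNJ + "a", form)
    # clitic2: after a vowel, the boundary becomes y
    form = re.sub(r"(?<=[" + VOWEL + r"])\^NB", "y", form)
    # clitic1: after anything else (treated as a consonant), it is deleted
    form = re.sub(r"(?<=[^e" + VOWEL + r"])\^NB", "", form)
    return form


for word in ["ktab^NBš", "Sda^NBš", "hmsaye^NBš"]:
    print(apply_clitic_rules(word))
```

The three inputs come out as `ktabš`, `Sdayš`, and `hmsaye` + ZWNJ + `aš`, matching the formal spellings listed on the slide.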


Evaluation

  • FST: 178,452 states; 928,982 arcs before optimization

  • Speed: 20.84 seconds of CPU time for a 10 MB file on a Sun SPARCstation

  • Coverage=97.5%; Accuracy=95%

  • Unanalyzed tokens: proper nouns + missing lexicon words

  • No weblog language rules included yet!


Conclusion

  • Challenges in the morphological analysis of formal Persian text, with solutions in the XFST system

  • New issues and variance due to blog language

  • Need robust system:

    • Lexicon updated with colloquial forms

    • Flexible morphological rules + derivational morphology rules

    • Transliteration component for loan words

    • Statistical approach to disambiguate and to deal with unknowns

