
NOOJ Conference Inalco, Paris June 16th, 2012

Russian Module for NooJ: design and implementation. Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s Nooj software. NOOJ Conference Inalco, Paris June 16th, 2012. Vincent BÉNET INALCO CREE Recherche assistée par ordinateur.


Presentation Transcript


  1. Russian Module for NooJ: design and implementation. Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software. NOOJ Conference, Inalco, Paris, June 16th, 2012. Vincent BÉNET, INALCO, CREE, Recherche assistée par ordinateur

  2. Russian Module for NooJ: design and implementation • Design of linguistic resources • Description of the realization: dictionaries / paradigms / grammars • Job left to be done…

  3. Writing lexical resources for the Russian language • Build dictionaries from texts • Create one « small » dictionary and many grammars for derivational forms: раб + а (slave), раб + от + а + ть (to work), за + раб + от + к + а (salary) • Complete one « big » existing dictionary and create many grammars

  4. Writing lexical resources for the Russian language. ZALIZNIAK’s grammatical dictionary: a complete dictionary of 96,000 entries, in inverted alphabetical order, with full grammatical annotation. Example entry, « to obtain, to reach »: Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд. Gloss: Dostigat’ ipf (imperfective) nt (intransitive) 1a$3 (dostignut’/dostich’) has a passive form

  5. Writing lexical resources for the Russian language. Problems encountered: • The classification is complete, but some tags are absent (V, N…) • The classification is based on accent markers • There are many informal, unclassified added annotations. The problem of accent markers was postponed. Zalizniak’s dictionary was re-sorted, and its classification was modified, simplified and completed for computer use

  6. The design of lexical resources for the Russian language consisted of:
     1. creating grammatical tags
     2. recoding the dictionary with these tags
     3. sorting the dictionary (inverted alphabetical order for each word)
     4. fixing a paradigm model list (karta instead of zh1a)
     5. writing paradigms
     6. handling the problem of ё / е
     7. allocating models to the words
     8. verifying the results
     9. testing with texts
     10. correcting and proofreading

  7. Writing lexical resources for Russian 1. Creating tags and properties: N, A, V, ADV …
     V_Pers = 1 | 2 | 3 ;
     V_Asp = Ipf | Pf ;
     V_Type = Mvt ;
     V_Morph = Pvb | Simp | Sufx | PvbSufx ;
     V_SsAsp = Det | Indet ;
     V_Temps = Pre | Pa | Fu ;
     V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
     V_Voix = Act | Pss ;
     V_Genre = m | f | n ;
     V_Nombre = s | p ;
     V_Constr = intr | tr | sja ;
     V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
     A_Forme = fc | fl | adv ;
     A_Genre = m | f | n ;
     A_SGenr = an | inan ;
     A_Nombre = s | p ;
     A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
     A_Deg = Comp | Sup ;
     ADV_Deg = Comp ;
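Property declarations of this kind lend themselves to a mechanical consistency check. The following is a minimal sketch, assuming the declarations are available as plain text; the function names `parse_properties` and `unknown_values` are hypothetical helpers and are not part of NooJ. It reads a subset of the verb properties listed on the slide and reports any value in a tag string that was never declared.

```python
# Subset of the property declarations shown on the slide.
# The parsing and checking code is an illustrative assumption, not NooJ's own logic.
PROPERTIES = """
V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
"""


def parse_properties(text):
    """Return {property_name: set_of_values} from 'Name = a | b ;' statements."""
    props = {}
    for statement in text.split(";"):
        if "=" not in statement:
            continue  # skip the empty remainder after the last ';'
        name, values = statement.split("=", 1)
        props[name.strip()] = {v.strip() for v in values.split("|")}
    return props


def unknown_values(tag_string, props, category="V"):
    """List the '+'-separated values that no property of the category declares."""
    allowed = set()
    for name, values in props.items():
        if name.startswith(category + "_"):
            allowed |= values
    return [v for v in tag_string.split("+") if v not in allowed]


props = parse_properties(PROPERTIES)
print(unknown_values("Prtp+Pa+Pss+m+s+Tv", props))
print(unknown_values("Prtp+Pa+Pss+mo+s+Tv", props))  # 'mo' is a deliberately mistyped value
```

Running such a check over every paradigm line is one possible way to catch mistyped tag values before the dictionary is compiled.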

  8. Writing lexical resources for Russian 2. Recoding the dictionary 3. Sorting the dictionary to get inverted alphabetical ordering
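Inverted alphabetical ordering means that words are compared from their last letter backwards, so that words sharing an ending (and often a paradigm) end up next to each other. A minimal sketch, assuming the dictionary is simply a list of lemmas; `reverse_sort` is a hypothetical name:

```python
def reverse_sort(words):
    """Sort words by their spelling read from right to left."""
    return sorted(words, key=lambda w: w[::-1])


words = ["карта", "книга", "корова", "собака", "улица"]
print(reverse_sort(words))
# -> ['корова', 'книга', 'собака', 'карта', 'улица']
```

The comparison here uses Unicode code points, which matches the Russian alphabet except for ё (it sorts after я); a real implementation would need a proper collation key for that letter.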

  9. Writing lexical resources for Russian 4. Paradigm model list
     #j1a=karta   #jo1a=korova
     #j2a=nedelja #jo2a=boginja
     #j3a=kniga   #jo3a=sobaka
     #j4a=tuča    #jo4a=kassirša
     #j5a=ulica   #jo5a=volčica
     #j6a=statuja #jo6a=feja
     #j7a=linija  #jo7a=furija
     5. Writing paradigms
     карта = <E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;
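In this notation `<E>` stands for the empty string (the lemma is kept unchanged) and `<B>` deletes the last character of the lemma before the ending is appended; `<B4>` deletes four characters. The sketch below illustrates how the карта paradigm expands into word forms. It is a simplified, hypothetical interpreter handling only these two operators, and is in no way NooJ's own engine; `apply_command` and `inflect` are names introduced for illustration.

```python
import re

# Paradigm copied from the slide. The interpreter is an illustrative sketch only.
KARTA = ("<E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + "
         "<B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + "
         "<B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p")

OPERATOR = re.compile(r"<(E|B)(\d*)>")


def apply_command(lemma, command):
    """Apply one command such as '<E>', '<B>у' or '<B4>озьму' to a lemma."""
    match = OPERATOR.match(command)
    if not match:
        return lemma + command  # assumption: a bare string is a plain suffix
    op, count = match.group(1), match.group(2)
    suffix = command[match.end():]
    if op == "E":
        return lemma + suffix
    n = int(count) if count else 1  # <B> alone deletes one character
    return lemma[:len(lemma) - n] + suffix


def inflect(lemma, paradigm, separator=" + "):
    """Return a list of (form, tags) pairs for every alternative of the paradigm."""
    forms = []
    for item in paradigm.split(separator):
        command, tags = item.strip().split("/", 1)
        forms.append((apply_command(lemma, command), tags))
    return forms


for form, tags in inflect("карта", KARTA):
    print(form, tags)
```

The same function covers paradigms written with ` | ` as the separator, for example `apply_command("взять", "<B4>озьму")` yields `возьму`.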

  10. Writing lexical resources for Russian 5. Paradigm for verbs
     взять = <E>/Inf |
     <B4>озьму/1+s+Pre | <B4>озьмешь/2+s+Pre | <B4>озьмет/3+s+Pre | <B4>озьмем/1+p+Pre | <B4>озьмете/2+p+Pre |
     <B4>озьмёшь/2+s+Pre | <B4>озьмёт/3+s+Pre | <B4>озьмём/1+p+Pre | <B4>озьмёте/2+p+Pre | <B4>озьмут/3+p+Pre |
     <B2>л/m+s+Pa | <B2>ла/f+s+Pa | <B2>ло/n+s+Pa | <B2>ли/p+Pa |
     <B4>озьми/2+s+Imp | <B4>озьмите/2+p+Imp |
     <B2>в/Ger | <B2>вши/Ger |
     <B2>вший/Prtp+Pa+Act+m+s+Im | <B2>вший/Prtp+Pa+Act+m+s+Vi | <B2>вшего/Prtp+Pa+Act+m+an+s+Vi | <B2>вшего/Prtp+Pa+Act+m+s+Ro | <B2>вшему/Prtp+Pa+Act+m+s+Da | <B2>вшим/Prtp+Pa+Act+m+s+Tv | <B2>вшем/Prtp+Pa+Act+m+s+Pr |
     <B2>вшая/Prtp+Pa+Act+f+s+Im | <B2>вшую/Prtp+Pa+Act+f+s+Vi | <B2>вшей/Prtp+Pa+Act+f+s+Ro | <B2>вшей/Prtp+Pa+Act+f+s+Da | <B2>вшей/Prtp+Pa+Act+f+s+Tv | <B2>вшею/Prtp+Pa+Act+f+s+Tv | <B2>вшей/Prtp+Pa+Act+f+s+Pr |
     <B2>вшее/Prtp+Pa+Act+n+s+Im | <B2>вшее/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Ro | <B2>вшему/Prtp+Pa+Act+n+s+Da | <B2>вшим/Prtp+Pa+Act+n+s+Tv | <B2>вшем/Prtp+Pa+Act+n+s+Pr |
     <B2>вшие/Prtp+Pa+Act+p+Im | <B2>вшие/Prtp+Pa+Act+p+Vi | <B2>вших/Prtp+Pa+Act+an+p+Vi | <B2>вших/Prtp+Pa+Act+p+Ro | <B2>вшим/Prtp+Pa+Act+p+Da | <B2>вшими/Prtp+Pa+Act+p+Tv | <B2>вших/Prtp+Pa+Act+p+Pr |
     <B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тый/Prtp+Pa+Pss+m+s+Vi | <B2>того/Prtp+Pa+Pss+m+an+s+Vi | <B2>того/Prtp+Pa+Pss+m+s+Ro | <B2>тому/Prtp+Pa+Pss+m+s+Da | <B2>тым/Prtp+Pa+Pss+m+s+Tv | <B2>том/Prtp+Pa+Pss+m+s+Pr |
     <B2>тая/Prtp+Pa+Pss+f+s+Im | <B2>тую/Prtp+Pa+Pss+f+s+Vi | <B2>той/Prtp+Pa+Pss+f+s+Ro | <B2>той/Prtp+Pa+Pss+f+s+Da | <B2>той/Prtp+Pa+Pss+f+s+Tv | <B2>тою/Prtp+Pa+Pss+f+s+Tv | <B2>той/Prtp+Pa+Pss+f+s+Pr |
     <B2>тое/Prtp+Pa+Pss+n+s+Im | <B2>тое/Prtp+Pa+Pss+n+s+Vi | <B2>того/Prtp+Pa+Pss+n+s+Ro | <B2>тому/Prtp+Pa+Pss+n+s+Da | <B2>тым/Prtp+Pa+Pss+n+s+Tv | <B2>том/Prtp+Pa+Pss+n+s+Pr |
     <B2>тые/Prtp+Pa+Pss+p+Im | <B2>тые/Prtp+Pa+Pss+p+Vi | <B2>тых/Prtp+Pa+Pss+an+p+Vi | <B2>тых/Prtp+Pa+Pss+p+Ro | <B2>тым/Prtp+Pa+Pss+p+Da | <B2>тыми/Prtp+Pa+Pss+p+Tv | <B2>тых/Prtp+Pa+Pss+p+Pr |
     <B2>т/Prtp+Pa+Pss+m+s+fc | <B2>та/Prtp+Pa+Pss+f+s+fc | <B2>то/Prtp+Pa+Pss+n+s+fc | <B2>ты/Prtp+Pa+Pss+p+fc ;

  11. Writing lexical resources for Russian 6. Problem of the letter ё / е (partially solved: two entries or two paradigms)
     ёжик,N+m+an+FLX=бульдог
     ёж,N+m+an+FLX=богач
     ежик,N+m+an+FLX=бульдог
     еж,N+m+an+FLX=богач
     жевать = <E>/Inf | <B5>ую/1+s+Pre | <B5>уёшь/2+s+Pre | <B5>уёт/3+s+Pre | <B5>уём/1+p+Pre | <B5>уёте/2+p+Pre | <B5>уешь/2+s+Pre | <B5>ует/3+s+Pre | <B5>уем/1+p+Pre | <B5>уете/2+p+Pre | <B5>уют/3+p+Pre
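Since Russian texts often print е where ё is meant, each ё-form needs a twin spelled with е. The slide lists both spellings by hand; the sketch below shows one way such twins could be derived mechanically from a list of forms. The function name `with_e_variants` is hypothetical, and the NFC normalization step is an added assumption to cope with ё typed as е plus a combining diaeresis.

```python
import unicodedata


def with_e_variants(forms):
    """Add, after every form containing 'ё', the same form spelled with 'е'."""
    result = []
    for form in forms:
        form = unicodedata.normalize("NFC", form)  # compose е + U+0308 into ё
        if form not in result:
            result.append(form)
        plain = form.replace("ё", "е").replace("Ё", "Е")
        if plain != form and plain not in result:
            result.append(plain)
    return result


print(with_e_variants(["жую", "жуёшь", "жуёт"]))
# -> ['жую', 'жуёшь', 'жуешь', 'жуёт', 'жует']
```

Doubling forms this way increases ambiguity in some cases, which is presumably why the slide describes the problem as only partially solved.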

  12. Writing lexical resources for Russian 7. Allocating models to words 8. Verifying paradigms
     abažur,N+m+inan+FLX=zavod
     abazinec,N+m+an+FLX=ukrainec
     abazin,N+m+an+FLX=artist
     abaz,N+m+inan+FLX=zavod
     abak,N+m+inan+FLX=čajnik
     abbat,N+m+an+FLX=artist
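Each entry has the shape `lemma,CATEGORY+feature+…+FLX=model`. A minimal sketch of how such lines could be parsed and checked against the list of paradigm models that actually exist follows; `parse_entry` and `missing_models` are hypothetical helpers, and the set of known models is invented for the example.

```python
def parse_entry(line):
    """Split 'lemma,CAT+feat+...+FLX=model' into its parts."""
    lemma, info = line.split(",", 1)
    fields = info.split("+")
    entry = {"lemma": lemma, "category": fields[0], "features": [], "paradigm": None}
    for field in fields[1:]:
        if field.startswith("FLX="):
            entry["paradigm"] = field[len("FLX="):]
        else:
            entry["features"].append(field)
    return entry


def missing_models(lines, known_models):
    """Return the lemmas whose FLX model is absent from known_models."""
    return [e["lemma"] for e in map(parse_entry, lines)
            if e["paradigm"] not in known_models]


lines = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abak,N+m+inan+FLX=čajnik",
]
# Hypothetical set of models already written; 'čajnik' is left out on purpose.
print(missing_models(lines, {"zavod", "ukrainec", "artist"}))
# -> ['abak']
```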

  13. Writing lexical resources for Russian 9. Testing with Russian texts: « The Nose » by Gogol • « The Gambler » by Dostoevsky • « The Prisoner of the Caucasus » by Tolstoy • « The Lady with the Dog » by Chekhov • « Short stories » by Kharms

  14. Writing lexical resources for Russian 10. Correcting errors: • bad encoding (mixed Latin / Cyrillic letters), typically with the look-alike letters A B E K M H O P C y X, e.g. MOCKBA • errors in paradigms • bad allocation of models to words • mobile vowel / palatalization
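Mixed-script words are invisible to the eye but easy to detect programmatically, because every character's Unicode name starts with its script. The sketch below is one possible check using Python's standard `unicodedata` module; `scripts_in` and `is_mixed` are hypothetical helper names.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts ('Cyrillic', 'Latin', 'other') used by the letters of word."""
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        else:
            scripts.add("other")
    return scripts


def is_mixed(word):
    """True when a word contains both Cyrillic and Latin letters."""
    found = scripts_in(word)
    return "Cyrillic" in found and "Latin" in found


# Escapes make the otherwise invisible difference explicit:
# Cyrillic capital М (U+041C) followed by Latin 'OCKBA'.
print(is_mixed("\u041cOCKBA"))  # -> True
print(is_mixed("MOCKBA"))       # -> False (all Latin)
```

Running `is_mixed` over every lemma and every token of a test text would flag the entries that need re-typing.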

  15. Improving lexical resources • Increase the number of different models? This would avoid generating unexpected or incongruous forms, or failing to recognize existing forms: Читав? (Čitav?) Пиша? (Piša?) Счастие? (Sčastie?) • Remove word entries and / or forms? Useless words are a source of unnecessary ambiguities: the names of letters (а, б, в, и, к, о, с, у, я) and archaic, unused words. So are repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)

  16. Available lexical resources for Russian • 1 COMPILED BASIC DICTIONARY containing: 1 dictionary of 45,000 nouns (350 paradigms); 1 dictionary of 20,000 adjectives (50 paradigms); 1 dictionary of 25,000 verbs (600 paradigms); 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1,600 adverbs, parenthetical words etc. • COMPILED ADDITIONAL DICTIONARIES (optional use): 1 dictionary of proper nouns (cities, countries, rivers…, first names with diminutives); 1 dictionary of substantivized adjectives

  17. Writing Russian grammars for NooJ. Designing disambiguation grammars for: • grammatical agreement between adjectives & nouns • case usage with numerals • case usage with prepositions • case usage with verbs. Designing grammars to locate syntagms: • date and time expressions • adverbial phrases of time, place… • idiomatic structures (my name is, I’m … old) • verbs of motion

  18. Writing Russian grammars for NooJ: syntactic grammar for Russian

  19. Writing Russian grammars for NooJ: syntactic grammar for Russian

  20. Grammar to locate the verbs of motion

  21. Grammar to locate the verbs of motion

  22. The prepositions in Russian

  23. The disambiguation of « NA » (on, onto)

  24. Annotating and disambiguating texts: the text with its ambiguities

  25. Verifying grammars: the text after disambiguation with the grammar of « NA »

  26. The disambiguation of « V » (in, into)

  27. Russian grammars for NooJ. All these grammars need improvement: • They are very sensitive to syntactic order: they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard) • There are no grammars (yet): to disambiguate adverbs / adjectives; to disambiguate adjectives / nouns; to disambiguate conjunctions / interjections

  28. To get reliable resources for the Russian language, the job left to be done is to design and implement: • A data bank of verified and annotated texts • Efficient syntactic grammars • Semantic tagging • Unified or harmonized tags for Slavic, Romance, Germanic etc. languages, to allow further multilingual processing

  29. Russian Module for NooJ http://www.nooj4nlp.net/pages/russian.html

  30. Russian Module for NooJ: design and implementation Спасибо за внимание Thank you for your attention Merci de votre attention NOOJ Conference Inalco June 16th, 2012 vincent.benet@inalco.fr INALCO
