Application of INTEX in refinement and validation of Serbian WordNet

Application of INTEX in refinement and validation of Serbian WordNet. Ivan Obradović, Ranka Stanković, Cvetana Krstev, Gordana Pavlović-Lažetić, University of Belgrade.


Presentation Transcript


  1. Application of INTEX in refinement and validation of Serbian WordNet. Ivan Obradović, Ranka Stanković, Cvetana Krstev, Gordana Pavlović-Lažetić, University of Belgrade

  2. WordNet (WN) • a semantic network of concepts represented by synsets, i.e. sets of synonymous words (nouns, verbs, adjectives and adverbs) • contains explicitly coded descriptions of semantic relations • inspired by research in the field of psycholinguistics • initially developed at Princeton for the English language. Reference: Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database, The MIT Press

  3. Multilingual WordNets • Featuring: the InterLingual Index (ILI) • EuroWordNet (EWN): Dutch, Italian, Spanish, German, French, Czech and Estonian • BalkaNet (BWN): five Balkan languages (Greek, Turkish, Bulgarian, Romanian and Serbian), as well as Czech. References: Vossen, P. (ed.) (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Dordrecht. Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufis, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M. (2002) BALKANET: A Multilingual Semantic Network for Balkan Languages, 1st International Wordnet Conference, Mysore, India, January 2002 (http://www.ceid.upatras.gr/Balkanet/files/balkanet-elsnet-ko-accept.pdf)

  4. The WN semantic network • based on a grouping of synonyms into synsets, which represent the network nodes • nodes are interconnected by arcs that describe particular semantic relations (hyperonymy, hyponymy, antonymy etc.) • in general, every synset is accompanied by a definition (gloss) and examples of usage that specify the meaning of the concept represented by the synset • the semantic network itself is an XML document with a precisely established set of entities

  5. The Serbian version of WN • developed starting from the base concepts of the English WN using existing English/Serbian dictionaries in paper form • synset elements represented as the elements in DELAS or DELAC dictionaries without any additional morphosyntactic information • lexical meanings in Serbian coded with reference to the dictionary of Matica Srpska

  6. XML representation of a synset in Serbian WN (demonstrate, establish, prove, show)
     <SYNSET>
       <ID>ENG171-00528591-v</ID>
       <SYNONYM>
         <LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
         <LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
         <LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
         <LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
       </SYNONYM>
       <DEF> Utvrditi valxanost necyega, primerom, objasxnxenxem ili eksperimentom. (Establish the validity of something by example, explanation or experiment)</DEF>
       <USAGE> Anketa je pokazala da u tako nesxto veruje mali broj ispitanih. (The poll showed that few people believe in this)</USAGE>
       <POS>v</POS>
       <ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
       <BCS>1</BCS>
       <STAMP>Dusko 2003/04/21</STAMP>
     </SYNSET>
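As a rough illustration of how such a synset could be read programmatically, the sketch below parses a shortened copy of the slide's XML with Python's standard `xml.etree.ElementTree`. The function name `parse_synset` and the returned dictionary layout are assumptions made for this example; they are not part of the Serbian WN tooling.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the synset from the slide (DEF, USAGE and STAMP omitted).
SYNSET_XML = """
<SYNSET><ID>ENG171-00528591-v</ID>
  <SYNONYM>
    <LITERAL> dokazati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> dokazivati <SENSE> 1 </SENSE> </LITERAL>
    <LITERAL> pokazati <SENSE> 3 </SENSE> </LITERAL>
    <LITERAL> pokazivati <SENSE> 3 </SENSE> </LITERAL>
  </SYNONYM>
  <POS>v</POS>
  <ILR>ENG171-00529622-v <TYPE>hypernym</TYPE></ILR>
  <BCS>1</BCS>
</SYNSET>
"""


def parse_synset(xml_text):
    """Return the ID, part of speech, literals and relations of one synset.

    Hypothetical helper: the dictionary layout is chosen for this sketch only.
    """
    root = ET.fromstring(xml_text.strip())
    literals = []
    for lit in root.iter("LITERAL"):
        # The literal's own text precedes its <SENSE> child element.
        word = (lit.text or "").strip()
        sense = int(lit.findtext("SENSE").strip())
        literals.append((word, sense))
    relations = []
    for ilr in root.iter("ILR"):
        # The target synset ID precedes the <TYPE> child element.
        target = (ilr.text or "").strip()
        relations.append((ilr.findtext("TYPE").strip(), target))
    return {
        "id": root.findtext("ID").strip(),
        "pos": root.findtext("POS").strip(),
        "literals": literals,
        "relations": relations,
    }


synset = parse_synset(SYNSET_XML)
print(synset["id"], synset["literals"], synset["relations"])
```

Each `(relation type, target ID)` pair corresponds to one arc of the semantic network, and the synset itself to one node.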

  7. Problems in Serbian WN that might be solved using INTEX • lack of morphological and syntactic information related to lexemes • absence of precise criteria for the selection of lexemes for a particular synset • lack of information on relative relevance of each lexeme in a synset in terms of its lexical frequency

  8. Incorporation of morphosyntactic information into synsets using INTEX: the DictWNSrp program • matches literals in WN with literals in selected DELAS dictionaries and extracts morphosyntactic information from the dictionaries • assigns morphosyntactic information to WN literals in cases of a 1-1 match • offers the user the option to confirm or alter the assigned information and to resolve cases of homography (i.e. multiple matches) • transfers the confirmed morphosyntactic information into the WN using the LNOTE element
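The matching step described above can be sketched as follows. This is not the DictWNSrp implementation, only a minimal illustration that assumes dictionary entries of the form `lemma,CODE+marker+...`. The two verb entries reuse codes that appear as LNOTE values elsewhere in this presentation; the two entries for `kosa` and their codes `NX1` and `NX2` are invented placeholders that stand in for a homograph.

```python
from collections import defaultdict

# Assumed DELAS-style lines. The "kosa" entries and their codes are made up
# purely to illustrate a multiple match (homography).
DELAS_LINES = [
    "dokazati,V122+Perf+Tr+Iref+Ref",
    "dokazivati,V18+Imperf+Tr+Iref",
    "kosa,NX1",
    "kosa,NX2",
]


def load_delas(lines):
    """Index dictionary entries by lemma; a lemma may have several entries."""
    index = defaultdict(list)
    for line in lines:
        lemma, _, info = line.partition(",")
        index[lemma.strip()].append(info.strip())
    return index


def assign_info(literals, index):
    """Split literals into 1-1 matches, homographs and unmatched literals."""
    assigned, ambiguous, missing = {}, {}, []
    for literal in literals:
        matches = index.get(literal, [])
        if len(matches) == 1:
            assigned[literal] = matches[0]       # safe to propose automatically
        elif matches:
            ambiguous[literal] = list(matches)   # left for the user to resolve
        else:
            missing.append(literal)              # not found in the dictionary
    return assigned, ambiguous, missing


index = load_delas(DELAS_LINES)
assigned, ambiguous, missing = assign_info(["dokazati", "kosa", "bivstvo"], index)
print(assigned, ambiguous, missing)
```

In this sketch only the `assigned` entries would be proposed as LNOTE values directly; the `ambiguous` ones correspond to the homography cases that the program hands to the user.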

  9. Resolving homography with the DictWNSrp program

  10. XML representation of a synset with assigned morphosyntactic information
      <SYNONYM>
        <LITERAL>dokazati <SENSE>1</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
        <LITERAL>dokazivati <SENSE>1</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
        <LITERAL>pokazati <SENSE>3</SENSE> <LNOTE>V122+Perf+Tr+Iref+Ref</LNOTE></LITERAL>
        <LITERAL>pokazivati <SENSE>3</SENSE> <LNOTE>V18+Imperf+Tr+Iref</LNOTE></LITERAL>
      </SYNONYM>

  11. Validation of lexemes from a synset on a corpus • Phase One, the IntexWN program: selects and displays all synsets from WN for a given lexeme; constructs INTEX graphs for all lexemes from the selected synsets • Phase Two, INTEX: produces concordances from a chosen corpus for the graphs constructed by IntexWN • Phase Three, the user: checks the validity of the synonymy relations between lexemes on the concordances; decides whether to remove lexemes from the synset or add new ones
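The first two phases can be approximated in a few lines of Python. This is only a sketch under strong assumptions: a regular expression over a hand-written list of word forms stands in for an INTEX graph (which would recognise inflected forms through the dictionaries), the "corpus" is the single usage sentence quoted in the synset on slide 6, and the names `SYNSETS`, `FORMS`, `synsets_for`, `build_pattern` and `concordances` are all hypothetical.

```python
import re

# Toy WordNet: synset ID -> literals named on the slides (not the full synsets).
SYNSETS = {
    "ENG171-00528591-v": ["dokazati", "dokazivati", "pokazati", "pokazivati"],
    "ENG171-11771798": ["postojanxe", "zxivot", "bivstvo"],
}

# Hand-listed word forms per lemma; a stand-in for dictionary-based inflection.
FORMS = {"pokazati": ["pokazati", "pokazala"]}

CORPUS = "Anketa je pokazala da u tako nesxto veruje mali broj ispitanih."


def synsets_for(lexeme, synsets):
    """Phase One (a): select every synset that contains the lexeme."""
    return {sid: lits for sid, lits in synsets.items() if lexeme in lits}


def build_pattern(lexemes, forms):
    """Phase One (b): one pattern covering all known forms of all lexemes."""
    words = {form for lex in lexemes for form in forms.get(lex, [lex])}
    ordered = sorted(words, key=lambda w: (-len(w), w))
    return re.compile(r"\b(" + "|".join(map(re.escape, ordered)) + r")\b")


def concordances(pattern, text, width=20):
    """Phase Two: (left context, match, right context) for every hit."""
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(1), right))
    return rows


selected = synsets_for("pokazati", SYNSETS)
for synset_id, literals in selected.items():
    pattern = build_pattern(literals, FORMS)
    for left, hit, right in concordances(pattern, CORPUS):
        print(f"{synset_id}: {left}[{hit}]{right}")
```

Phase Three remains manual: the user reads each concordance line and judges whether the matched lexeme denotes the synset's concept.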

  12. Constructing a graph for all lexemes from a synset with the IntexWN program

  13. Validation results for synset ENG171-11771798 (being, beingness, existence) • Comments: • the lexemes in the synset denote the given concept in 24% of the concordances • the lexeme most frequently used to denote the given concept is postojanxe • although zxivot is the most frequent lexeme in the synset, it denotes the given concept in only 10% of cases • bivstvo does not occur in the corpus; its exclusion from the synset could be considered if a similar result is obtained on a larger corpus
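Percentages of this kind can be derived from the user's per-concordance judgments. The sketch below shows one possible way to compute them; the function `validation_summary`, the placeholder lexemes `lex_a`, `lex_b` and `lex_c`, and all counts are invented for illustration and do not reproduce the figures reported on the slide.

```python
from collections import Counter


def validation_summary(synset_lexemes, judgments):
    """Summarise manual judgments.

    judgments: one (lexeme, denotes_concept) pair per concordance line.
    Returns per-lexeme statistics, the overall share of relevant
    concordances, and the lexemes that never occur in the corpus.
    """
    total = Counter(lex for lex, _ in judgments)
    relevant = Counter(lex for lex, ok in judgments if ok)
    per_lexeme = {}
    for lex in synset_lexemes:
        n = total[lex]
        per_lexeme[lex] = {
            "occurrences": n,
            "relevant": relevant[lex],
            "share": relevant[lex] / n if n else None,
        }
    all_n = sum(total[lex] for lex in synset_lexemes)
    all_relevant = sum(relevant[lex] for lex in synset_lexemes)
    overall = all_relevant / all_n if all_n else None
    absent = [lex for lex in synset_lexemes if total[lex] == 0]
    return per_lexeme, overall, absent


# Invented toy judgments: 8 concordance lines for two of three lexemes.
toy_judgments = (
    [("lex_a", True)] * 3 + [("lex_a", False)]
    + [("lex_b", True)] + [("lex_b", False)] * 3
)
per_lexeme, overall, absent = validation_summary(
    ["lex_a", "lex_b", "lex_c"], toy_judgments
)
print(per_lexeme, overall, absent)
```

A lexeme reported in `absent` is in the situation the slide describes for bivstvo: with no occurrences, no share can be computed, so the decision has to wait for a larger corpus.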

  14. Further developments • definition of more precise criteria for the validation of lexemes in a synset, based on their occurrence in corpora • investigation of possibilities for introducing relevance information into synsets • further development of the IntexWN program to include semantic relations such as hyponymy/hyperonymy • introduction of near-synonym information into the Serbian WN using INTEX dictionaries (e.g. augmentatives/diminutives) • investigation of possibilities for introducing multilingual features into INTEX using the WN (to be used for parallel corpora)
