Shyrokov Volodymyr, Bugakov Oleg, Krygin Maxim, Sydorchuk Nadiia. Ukrainian Lingua-Information Fund, NASU

Ukrainian National Linguistic Corpus and its application

Presentation Transcript


  1. Shyrokov Volodymyr, Bugakov Oleg, Krygin Maxim, Sydorchuk Nadiia. Ukrainian Lingua-Information Fund, NASU. Ukrainian National Linguistic Corpus and its application

  2. The main results of the theoretical studies and an overview of the practical implementations developed at ULIF-NASU are presented in the collective monograph “Corpus Linguistics”: Корпусна лінгвістика [Corpus Linguistics] / Shyrokov V.A., Bugakov O.V., Hriaznukhina T.O., Kostyshyn O.M., Krygin M.Yu., Liubchenko T.P., Rabulets O.H., Sydorenko O.O., Sydorchuk N.M., Shevchenko I.V., Shypnivska O.O., Yakymenko K.M. Kyiv: Dovira, 2005. 471 p.

  3. UNLC statistics
     General Corpus:
     • 4868 storage objects;
     • 1013 MB of texts for indexing;
     • more than 62 million tokens.
     Legislation Corpus:
     • 5757 storage objects;
     • 151 MB of texts for indexing;
     • more than 18 million tokens.

  4. Technological principles for creating UNLC
     The information architecture and functionality of UNLC are designed on the systems-engineering principles of virtual lexicographic laboratories. In accordance with this concept, UNLC is built using a Service-Oriented Architecture (SOA) and Web-service technology, with the Internet serving as the communication infrastructure.
     The following technology standards are used:
     • XML for data description;
     • SOAP for exchanging structured messages in distributed systems;
     • WSDL for service description;
     • UDDI for storing WSDL descriptions and providing them on request.
     Windows Communication Foundation (WCF) is used for interaction between the different levels of UNLC. It is a service-oriented system for data and message exchange that lets software components interact locally or remotely through a simplified, unified programming model for cross-platform interaction. A necessary condition for the software complex to function is the availability of strong mechanisms for security and data integrity.
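
To make the role of SOAP in such an architecture more concrete, the following minimal Python sketch builds a SOAP 1.1 request envelope. The operation name `FullTextSearch`, its parameters and the namespace `urn:example:unlc` are assumptions made up for illustration; the slides do not describe the actual UNLC service contract, which is implemented with WCF.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
UNLC_NS = "urn:example:unlc"  # hypothetical service namespace, not from the source


def build_search_request(phrase: str, max_distance: int) -> bytes:
    """Serialize a SOAP envelope for a hypothetical 'FullTextSearch' operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The operation element and its children are illustrative names only.
    request = ET.SubElement(body, f"{{{UNLC_NS}}}FullTextSearch")
    ET.SubElement(request, f"{{{UNLC_NS}}}Phrase").text = phrase
    ET.SubElement(request, f"{{{UNLC_NS}}}MaxDistance").text = str(max_distance)
    return ET.tostring(envelope, encoding="utf-8")
```

In a real deployment the element names and types would come from the service's WSDL description rather than being written by hand.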

  5. The general scheme of the linguistic corpus (diagram labels: L_C, E_LIB, E_LING, MDI, B_D, G_O, Index, MC_B)
     • E_LIB – bibliographic subsystem (electronic library);
     • E_LING – linguistic subsystem;
     • MDI – subsystem for constructing the multidimensional index;
     • Index – multidimensional index base, i.e. the database holding the results of the MDI's work;
     • MC_B – microcontext base. This item is virtual and is generated dynamically on the user's query; it returns the set of microcontexts that match the user's search specification.

  6. The bibliographic subsystem serves as a multipurpose information system that accumulates information of different kinds: it is a tool for collecting, storing, modelling and using natural-language information in digital form. The objects stored in the bibliographic system may be electronic objects in any data format, so the library can hold manuscripts, audio, video and other multimedia information in addition to ordinary printed texts.
     Functions of the bibliographic subsystem:
     • forming a brief bibliographic description according to bibliographic rules, based on the metadata elements of the storage object recorded in the database;
     • forming a detailed bibliographic description of the storage object;
     • editing the metadata set of a bibliographic description in accordance with the changes made by a bibliographer;
     • analysing changes in the bibliographic record;
     • working with file system objects;
     • editing, inserting and deleting profiles, specifications, vocabularies and their elements.

  7. Search by the bibliographic parameters
     • The user independently selects a search field from those included in the search profile. If it is a text field, the user types in the value; if the field has a limited set of values, the user selects the search value from a dictionary.
     • For advanced search, combinations of the logical operators “and” and “or” are used.
     • Search results are presented as a list of bibliographic descriptions.
     • For each object, the user can view the complete list of bibliographic parameters, view the resource (the full text) and save the search results to a file.
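
The combination of search fields with “and” and “or” can be sketched in a few lines of Python. The record layout (plain dictionaries), the field names and the substring matching rule are assumptions for illustration only; the slides do not specify how UNLC stores or matches bibliographic metadata.

```python
def matches(record, field, value):
    """Case-insensitive substring match on one field; absent fields do not match."""
    return field in record and value.lower() in str(record[field]).lower()


def search(records, criteria, operator="and"):
    """Return the records that satisfy all ('and') or any ('or') of the criteria.

    criteria is a list of (field, value) pairs, as selected in a search profile.
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    combine = all if operator == "and" else any
    return [
        record
        for record in records
        if combine(matches(record, field, value) for field, value in criteria)
    ]


# Hypothetical bibliographic descriptions used only to exercise the sketch.
records = [
    {"author": "Шевченко Т. Г.", "title": "Гайдамаки", "genre": "poetry"},
    {"author": "Верховна Рада України", "title": "Конституція України", "genre": "legislation"},
]
```

For example, `search(records, [("author", "шевченко"), ("genre", "poetry")])` keeps only the first record, while the same call with two different `genre` values and `operator="or"` keeps both.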

  8. The linguistic corpus provides full-text information processing and serves as a tool for retrieving contexts in response to users' search queries, taking certain linguistic parameters into account.
     Functions of the linguistic subsystem:
     • creating the full-text index;
     • clearing the full-text index;
     • adding an object for indexing;
     • indexing objects;
     • removing an indexed object from the full-text index;
     • full-text search for words and phrases in all sources, or in sources selected by bibliographic description, with the ability to set the distance between the search words;
     • providing statistics;
     • viewing the microcontexts;
     • saving the microcontexts of words and phrases to a file;
     • service (maintenance) functions.
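
Several of the functions listed above (adding, removing, clearing, statistics) can be illustrated with a small in-memory positional index. This is only a sketch under simple assumptions: the class name, the tokenization rule and the statistics reported are hypothetical and are not taken from the UNLC implementation.

```python
import re
from collections import defaultdict

# Letters only, optionally joined by an apostrophe or hyphen (assumed tokenization).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


class FullTextIndex:
    """Positional index: token -> {object_id: [token positions]}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_object(self, object_id, text):
        """Index one storage object, remembering where each token occurs."""
        for position, match in enumerate(TOKEN_RE.finditer(text)):
            token = match.group().lower()
            self.postings[token].setdefault(object_id, []).append(position)

    def remove_object(self, object_id):
        """Remove an indexed object; drop tokens that no longer occur anywhere."""
        for token in list(self.postings):
            self.postings[token].pop(object_id, None)
            if not self.postings[token]:
                del self.postings[token]

    def clear(self):
        """Empty the whole index."""
        self.postings.clear()

    def statistics(self):
        """Count distinct tokens and total token occurrences in the index."""
        total = sum(
            len(positions)
            for by_object in self.postings.values()
            for positions in by_object.values()
        )
        return {"distinct_tokens": len(self.postings), "tokens": total}
```

Storing positions rather than only document identifiers is what makes distance-limited phrase search possible later on.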

  9. Marking the structural elements
     • Structuring by text divisions: “section”, “part”, “paragraph”, “title”, “conclusions”, “summary”, “abstract”.
     • Marking the paragraphs.
     • Marking words written in letters of a non-Ukrainian alphabet.
     • Dividing the text into sentences, indicating the beginning and end of each one.
     • Marking the words whose grammatical codes are defined by special rules. This concerns:
       a) hyphenated words whose first part is an abbreviation made of Ukrainian and Latin uppercase letters;
       b) abbreviations;
       c) proper names unambiguously identified by the context.
     • Marking non-author text (quotations, direct speech).
     • Identifying text units that have no morphological status and cannot be interpreted by the rules of the morphological analyzer.
     • Marking words or text fragments written with letter-spacing.
     • Marking places in the text that need to be edited later.
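
One of the marking steps above, flagging words written in letters outside the Ukrainian alphabet, can be sketched as follows. The tag name `foreign` and the XML-like output are assumptions for illustration; the slides do not show the markup that UNLC actually uses.

```python
import re

# The 33 letters of the modern Ukrainian alphabet (lowercase).
UKRAINIAN_LETTERS = set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters in any script


def mark_foreign_words(text, tag="foreign"):
    """Wrap every word containing a non-Ukrainian letter in <tag>...</tag>."""

    def wrap(match):
        word = match.group()
        if all(letter in UKRAINIAN_LETTERS for letter in word.lower()):
            return word  # purely Ukrainian letters: leave unmarked
        return f"<{tag}>{word}</{tag}>"

    return WORD_RE.sub(wrap, text)
```

Note that this simple rule also flags mixed-script words, which can be useful for catching Latin letters typed in place of look-alike Cyrillic ones.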

  10. Search by the linguistic parameters is carried out using the full-text index. The user enters a search phrase, sets the desired maximum number of words between the search words and selects additional full-text search options, namely:
     • search in a certain subset of objects;
     • use of word order;
     • use of the distance between words;
     • use of lemmatization;
     • use of synonymy.
     The result of the full-text search is a list of bibliographic descriptions. Unlike the bibliographic search, however, the user gets direct access to each occurrence of the search item in the text, i.e. to all the contexts that contain it. After choosing a source, the user can view the contexts, in which the search item is highlighted in red. The size (length) of the context can be changed.
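
The word-order and distance options described above can be illustrated with a short Python sketch that returns microcontexts for a two-word query. The function name, its parameters and the tokenization are assumptions; lemmatization and synonymy are deliberately left out, since they would require dictionaries that the slides do not describe.

```python
import re

TOKEN_RE = re.compile(r"[^\W\d_]+")  # assumed tokenization: runs of letters


def find_contexts(text, first, second, max_gap, ordered=True, width=3):
    """Find places where two words occur with at most max_gap words between them.

    ordered=True requires `first` to precede `second`; width is the number of
    words of context kept on each side of the matched pair.
    """
    tokens = TOKEN_RE.findall(text)
    lowered = [token.lower() for token in tokens]
    first_positions = [i for i, t in enumerate(lowered) if t == first.lower()]
    second_positions = [j for j, t in enumerate(lowered) if t == second.lower()]
    contexts = []
    for i in first_positions:
        for j in second_positions:
            if i == j or (ordered and j < i):
                continue
            gap = abs(j - i) - 1  # number of words strictly between the pair
            if gap <= max_gap:
                start = max(0, min(i, j) - width)
                end = min(len(tokens), max(i, j) + width + 1)
                contexts.append(" ".join(tokens[start:end]))
    return contexts
```

Changing `width` corresponds to changing the size of the displayed context.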

  11. For further processing, all the contexts, or the contexts from a particular source, can be saved to an HTML file that records the source of the contexts, the time of creation and the search phrases.
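
A minimal sketch of such an export is shown below. The page layout and the function name are assumptions; only the three recorded items (source, creation time, search phrase) come from the slide.

```python
import html
from datetime import datetime


def contexts_to_html(source, phrase, contexts, created=None):
    """Render contexts as a small HTML page with source, phrase and creation time."""
    created = created or datetime.now()
    # Escape every piece of text so that markup characters in contexts stay literal.
    items = "\n".join(f"    <li>{html.escape(context)}</li>" for context in contexts)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="uk">\n<head><meta charset="utf-8">'
        f"<title>{html.escape(phrase)}</title></head>\n<body>\n"
        f"  <p>Source: {html.escape(source)}</p>\n"
        f"  <p>Search phrase: {html.escape(phrase)}</p>\n"
        f"  <p>Created: {created.isoformat(timespec='seconds')}</p>\n"
        f"  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"
    )
```

The returned string can then be written to disk with an ordinary `open(path, "w", encoding="utf-8")` call.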

  12. Applying UNLC
     • A source base of linguistic information for creating the fundamental academic multivolume lexicographic system “Ukrainian Language Dictionary”;
     • A database for linguistic research aimed at identifying new linguistic phenomena and formalizing existing ones;
     • A system for grammatical marking;
     • Statistical analysis of text data;
     • An environment for accumulating and processing information objects of different kinds;
     • An environment for interaction with systems of grammatical, synonym and explanatory dictionaries;
     • Creation of various linguistic information systems (LIS) based on corpus technologies: LIS “The Constitution of Ukraine”; LIS “T. G. Shevchenko Electronic Encyclopedia”;
     • Linguistic expertise.

  13. The explanatory “Ukrainian Language Dictionary”

  14. Editing system of the dictionary entry

  15. LIS “The Constitution of Ukraine”

  16. T. G. Shevchenko Electronic Encyclopedia

  17. LIS “Haidamaks”

  18. Linguistic expertise
     The principle of applying statistical methods in linguistic expertise: text → preliminary processing → statistical portrait → parameters of comparison or analysis → analysis → result.
     The program for checking students' works for plagiarism:
     • forms a linguistic corpus of abstracts;
     • compares any text with the abstracts in the corpus by various criteria;
     • creates and visualizes the result of the comparison.
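
The slides do not say which comparison criteria the program uses. As one possible illustration of the pipeline (preliminary processing, statistical portrait, comparison), the sketch below uses the share of word trigrams of the examined text that also occur in a corpus text; this criterion and all names in the code are assumptions, not the authors' method.

```python
import re


def shingles(text, n=3):
    """Statistical portrait: the set of word n-grams of a lowercased text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def match_percent(examined, reference, n=3):
    """Percentage of the examined text's n-grams that also occur in the reference."""
    examined_shingles = shingles(examined, n)
    if not examined_shingles:
        return 0.0
    shared = examined_shingles & shingles(reference, n)
    return 100.0 * len(shared) / len(examined_shingles)


def rank_against_corpus(examined, corpus, n=3):
    """Compare one text with every corpus text; highest match first.

    corpus maps a text name to its content.
    """
    scores = [(name, match_percent(examined, text, n)) for name, text in corpus.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A ranked list of this kind is the sort of result that can then be visualized, for instance as a bar per corpus text.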

  19. The window of the linguistic expertise program

  20. Selecting topics for comparison

  21. The result of text analysis. When the text of the abstract was compared with the texts from the corpus of abstracts by one of the criteria, two texts were found that match the abstract under examination by 63% and 53%, respectively.

  22. Visualization of the program work results

  23. Comparison of the texts of the 20-volume and 11-volume explanatory dictionaries

  24. The analysis of the political parties’ platforms

  25. The concordance statistics

  26. The most frequent lexemes in the programs of parties (blocs)

  27. Relative intensities of the key concepts in the election programs of the political parties in 2002

  28. Disambiguation in the text using statistical methods
     Lexical homonymy: КОСА
     1. braided hair (a braid);
     2. an agricultural tool for mowing grass, grain, etc., in the form of a narrow curved blade attached to a handle (a scythe);
     3. a narrow alluvial strip of land in a sea, river, etc., joined to the shore at one end (a spit).
     Grammatical homonymy: ПРАВ
     1. право (“right, law”): noun, neuter gender, genitive case, singular;
     2. правити (“to rule”): verb, perfective aspect, imperative mood, second person, singular;
     3. прати (“to wash, to launder”): verb, imperfective aspect, past tense, masculine gender, singular.

  29. The scheme of the disambiguation algorithm
     • Manual marking of the initial training text $T_0$: obtaining the marking $M(T_0)$.
     • Obtaining the statistics of grammatical chains $S_0$.
     • Disambiguation by the statistical method in the training text $T_i$ (obtaining the marking).
     • Checking of the obtained marking by a specialist, corrective actions, additional marking $M(T_i)$.
     • Obtaining the statistics $S_i$.
     • Combining the statistics $S_i$ and $S_{i-1}$.
     • Disambiguation in a text of a certain genre.

  30. $T_i = \{(w_1)\, r_1\, (w_2)\, r_2\, (w_3) \ldots (w_N)\}$, where $w_i$ are the word forms, $r_i$ are the delimiters between word forms, and $N$ is the number of word forms in the text. $M: T \to M(T) = \{(v_1, g_1)\,(v_2, g_2)\,(v_3, g_3) \ldots (v_N, g_N)\}$, where $v_i$ defines the part of speech of the word form and $g_i$ defines its grammatical meaning.

  31. $S(T) = \{([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}]),\; p([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])\}$, $i = 1, 2, \ldots, N$, where $i$ is the ordinal number of the word form in the text and $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$ is a chain of grammatical meanings.
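
Collecting such chain statistics from a marked text can be sketched as follows. The slide does not define $p$; this sketch simply uses absolute counts, in line with the later slide that lists the absolute frequency of each triple. The function names and the tuple representation of a marking are assumptions for illustration.

```python
from collections import Counter


def chain_statistics(marking):
    """Count chains of three consecutive (part_of_speech, grammatical_meaning) pairs.

    marking is the list [(v1, g1), (v2, g2), ...] for one marked text.
    A text of N word forms yields N - 2 complete triples.
    """
    return Counter(tuple(marking[i:i + 3]) for i in range(len(marking) - 2))


def combine_statistics(previous, current):
    """Merge the statistics of two training texts by adding their counts."""
    return previous + current
```

Because `Counter` objects add element-wise, merging the statistics of a new training text with those accumulated so far is a single addition.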

  32. Disambiguation. $M': T \to M'(T)$, $M'(T) = \{(v_1, g_1)'\,(v_2, g_2)' \ldots (v_N, g_N)'\}$, where $M'(T)$ $M(T)$:
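
The slides do not spell out how the statistics are applied to choose among homonymous readings. One simple possibility, shown here purely as an assumption, is to pick the sequence of readings whose triples have the highest total count in the training statistics. The exhaustive search below is exponential in the number of ambiguous word forms and is meant only for short sentences.

```python
from collections import Counter
from itertools import product


def disambiguate(candidates, stats):
    """Choose one reading per word form so that the summed triple counts are maximal.

    candidates: for each word form, a list of possible (v, g) readings.
    stats: Counter mapping triples of readings to their training frequency.
    """

    def score(path):
        # Counter returns 0 for triples never seen in training.
        return sum(stats[tuple(path[i:i + 3])] for i in range(len(path) - 2))

    # Ties are resolved in favour of the first combination generated.
    return list(max(product(*candidates), key=score))
```

For a word form such as ПРАВ, `candidates` would hold its noun and verb readings, and the neighbouring word forms decide which one scores best.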

  33. Software. The grammatical marking program interface

  34. Obtaining $S(T_i)$
     • The first column contains the triples $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$;
     • the second column contains information about punctuation within the chain;
     • the third column gives the position of the chain relative to the beginning of the sentence;
     • the fourth column gives the absolute frequency of the triple $([v_i, g_i]\,[v_{i+1}, g_{i+1}]\,[v_{i+2}, g_{i+2}])$ in the text.

  35. Marking the unambiguous word forms, using the Commercial Code as an example

  36. Disambiguation in the Commercial Code text

  37. Results of automatic disambiguation

  38. Thank you for your attention
