
Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences

Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus) Польсько-Український паралельний корпус Polsko-Ukraiński Korpus Równoległy http://corpus.domeczek.pl.


  1. Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus; Польсько-Український паралельний корпус; Polsko-Ukraiński Korpus Równoległy), http://corpus.domeczek.pl
     Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences
     Olga Shypnivska, ULIF, Ukrainian Academy of Sciences
     Magdalena Turska, Warsaw University

  2. Main objectives and expected applications
     • at least 3 million tokens; representative
     • sentence-level alignment
     • morphological annotation with a common tagset
     • public access; user-friendly
     • linguistic material for:
       • (independent) language learning
       • bilingual dictionaries
       • research on grammar and lexis
       • translation memory for humans and machines

  3. Statistics (prototype version)

  4. Search (present)
     • based on Perl regular expressions
     • any search string has to be enclosed in "/", e.g. /Холодна війна/ ("Cold War")
     • special characters:
       |     alternative
       ( )   beginning and end of a subchain (group)
       [ ]   beginning and end of a defined character class
       ?     1 or 0 occurrences
       *     0 or more occurrences
       +     1 or more occurrences
       \s    any whitespace character
       \w    any letter, digit or underscore
       \b    word boundary
       \     escape

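  5. Examples of search formulae
     • /jako/ finds the string "jako" anywhere, including inside longer words ("jako", "niejako", "dwojako", "jakość")
     • /jako\s/ finds "jako" followed by whitespace, i.e. at the end of a word ("jako", "niejako", "dwojako")
     • /\bjako/ finds "jako" at the beginning of a word ("jako", "jakość")
     • /norma\./ finds "norma" immediately before a full stop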
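
The search formulae on this slide can be tried outside the corpus interface. The following is a minimal sketch in Python, whose `re` module treats these particular constructs in essentially the same way as Perl. The sample text and the helpers `strip_slashes` and `matching_words` are illustrative assumptions and not part of PolUKR.

```python
import re

# Made-up sample text; it is not taken from the corpus.
SAMPLE = "niejako dwojako jako jakość norma. normalny"

def strip_slashes(query):
    """Turn a slash-enclosed query such as /jako/ into a bare pattern."""
    if len(query) < 2 or not (query.startswith("/") and query.endswith("/")):
        raise ValueError("query must be enclosed in slashes")
    return query[1:-1]

def matching_words(query, text=SAMPLE):
    """Return the words of `text` in which the query matches.

    Each word is tested together with the whitespace character that
    follows it (if any), so that \\s can match at the end of a word.
    """
    pattern = re.compile(strip_slashes(query))
    hits = []
    for chunk in re.finditer(r"\S+\s?", text):
        if pattern.search(chunk.group()):
            hits.append(chunk.group().strip())
    return hits

for q in (r"/jako/", r"/jako\s/", r"/\bjako/", r"/norma\./"):
    print(q, matching_words(q))
```

Note that the last word of a text has no following whitespace, so /jako\s/ would miss a text-final "jako"; whether the corpus search behaves the same way is not stated in the slides.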

  6. Sources of morphological information
     • Polish: IPI PAN corpus + …
     • Ukrainian:
       • grammatical dictionary by ULIF, UAS (Igor Shevchenko): lemma <> wordform
       • morphological analyzer (its information is slightly different; it was built for homonymy disambiguation)
       • no lemmatization (so far)

  7. Types of tagsets
     • SYMBOLS (English (BNC), Ukrainian): all possible grammatical characteristics of a wordform are encoded in one symbol; this takes little machine memory but puts a heavy load on human memory
     • CHAINS: contain codes corresponding to particular grammatical categories and/or their values; the morphological characteristics of a wordform are represented by a sequence of such codes. Chains can be even more economical than symbols if a query concerns morphological categories shared by several lexico-grammatical classes
       • positional (Czech): every category (and its values) has a fixed position in the chain
       • flexemic (Polish, Russian): every category has its own subtagset
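
The difference between the two chain types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-slot positional layout is a simplification loosely modelled on Czech positional tags (which have more positions), the colon-separated tags follow the general shape of IPI PAN tags, and the tables `POSITIONAL_SLOTS` and `FLEXEMIC_CATEGORIES` as well as both parser functions are hypothetical names.

```python
# Assumption: a simplified positional layout; real Czech tags have more slots.
POSITIONAL_SLOTS = ["pos", "subpos", "gender", "number", "case"]

def parse_positional(tag):
    """Read a positional tag: slot i always holds category i ('-' = not applicable)."""
    if len(tag) != len(POSITIONAL_SLOTS):
        raise ValueError("positional tag has the wrong length")
    return {slot: ch for slot, ch in zip(POSITIONAL_SLOTS, tag) if ch != "-"}

# Assumption: an illustrative subset of classes and their categories.
FLEXEMIC_CATEGORIES = {
    "subst": ["number", "case", "gender"],
    "adj": ["number", "case", "gender", "degree"],
}

def parse_flexemic(tag):
    """Read a colon-separated tag: the class decides which categories follow."""
    flexeme_class, *values = tag.split(":")
    names = FLEXEMIC_CATEGORIES[flexeme_class]
    if len(values) != len(names):
        raise ValueError("wrong number of values for class " + flexeme_class)
    parsed = {"class": flexeme_class}
    parsed.update(zip(names, values))
    return parsed

# A query on one category (here: dative case) works across classes.
tags = ["subst:sg:dat:n", "adj:sg:dat:n:pos", "subst:pl:nom:f"]
datives = [t for t in tags if parse_flexemic(t).get("case") == "dat"]
print(parse_positional("NNMS1"))
print(datives)
```

The last query shows the point made on the slide: with chains, a single condition on a shared category covers several lexico-grammatical classes, whereas a symbol tagset would require listing every symbol that happens to encode that value.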

  8. Multext-East tagset for En Ro Sl Cz Bg Et Hu Hr Sr Re
     • chain-like; criticised
     • 14 PoS: N 10, V 15, A 12, P(ron) 17, Det 10, T(he) 6, adveRb 6, S(adposition) 4, C(onj) 7, nuMeral 12, Intjn 2, X(residual), Y(abbr) 5, Q(particle) 3
     • only Bg and Hu do not have modal verbs and copulas
     • En and Ro have determiners; Ro, Hu and Re have articles; Bg has neither (analyticity, segmentation)
     • Is a Bg noun formally indefinite if the article is attached to the adjective? (cf. agglutinativity of Pl być)
     • negation as a morphological category
     • Cz transgressive (adverbial participle)

  9. Treatment of participles
     • Polish: no aspectual characteristics (here and below cited after Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish)
     • Ukrainian: aspect and tense (here and below cited after Широков В.А. et al., Корпусна лінгвістика):
       • VW прочитавши ("having read"): verb, adverbial participle, perfective aspect, past tense, active voice
       • UQ читаючи ("reading"): verb, adverbial participle, imperfective aspect, present tense, active voice
     • PolUKR: participle I (doing/having done) is characterised by aspect
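
The step from the Ukrainian symbol tags to a common description can be sketched as a lookup followed by a projection. This is a rough illustration only: the two symbols and their meanings come from the slide, but the attribute names, the function `to_common` and, in particular, the choice of which categories the common tagset keeps are assumptions made for the example.

```python
# Symbol tags and their meanings as listed on the slide.
UKRAINIAN_SYMBOLS = {
    "VW": {"pos": "verb", "form": "adverbial participle",
           "aspect": "perfective", "tense": "past", "voice": "active"},
    "UQ": {"pos": "verb", "form": "adverbial participle",
           "aspect": "imperfective", "tense": "present", "voice": "active"},
}

# Assumption: the common tagset describes participle I by aspect only,
# so tense and voice are dropped here. The slides do not spell this out.
COMMON_CATEGORIES = ("pos", "form", "aspect")

def to_common(symbol):
    """Expand a symbol tag and keep only the assumed common categories."""
    full = UKRAINIAN_SYMBOLS[symbol]
    return {name: full[name] for name in COMMON_CATEGORIES}

print(to_common("VW"))
print(to_common("UQ"))
```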

  10. Treatment of pronouns
     • the notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
     • Polish: division into 1st-2nd person, 3rd person and siebie (ów, jak?)
     • Ukrainian: pro-noun, pro-adjective
     • Russian: also pro-predicative and pro-adverb
     • Czech: many subcategories on the level of SubPoS
     • PolUKR: the Ukrainian approach plus the Polish division into 1st-2nd and 3rd person

  11. Treatment of predicatives
     • Polish: adverbs with modal semantics like można "(it is) allowed / one can", trzeba "(it is) necessary", ?to
     • Ukrainian (code X0): includes adverbs of state like жарко "(it is) hot", шкода, жаль "(it is) a pity"
     • PolUKR: the category is moved from the morphological level to the semantic one

  12. http://www.ruscorpora.ru/search-main.html

  13. Search engine for PolUKR
     • choose the direction of the search (Ua>Pl or Pl>Ua)
     • search conditions for both languages (RvonW)
     • 3 levels of search:
       • exact form
       • (lemma) with the morphological choice
       • using Poliqarp-like tag formulas (for advanced users)
     • idea of subcategories: either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS (cf. aliases in the IPI PAN corpus)
     • alternatives are provided through tick-off boxes, so that one can choose EITHER "VERB finite past" OR "NOUN dative neuter" OR something else, etc.
     • restrictions on choice within 1 of 10 POS
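
The selection rules and the EITHER/OR logic described on this slide could be modelled roughly as follows. This is a minimal sketch, not the PolUKR implementation: the POS/SUBPOS inventory in `SUBPOS`, the attribute names and the functions `validate_selection` and `matches` are all hypothetical.

```python
# Hypothetical POS -> SUBPOS inventory; the real one is not given in the slides.
SUBPOS = {
    "VERB": {"finite", "infinitive", "participle"},
    "NOUN": {"common", "proper"},
}

def validate_selection(pos_ticked, subpos_ticked):
    """Check the tick-box rules.

    pos_ticked:    set of POS names, e.g. {"NOUN"}
    subpos_ticked: set of (POS, SUBPOS) pairs, e.g. {("VERB", "finite")}
    """
    for pos, sub in subpos_ticked:
        if sub not in SUBPOS.get(pos, set()):
            raise ValueError("unknown subcategory %s/%s" % (pos, sub))
        if pos in pos_ticked:
            raise ValueError("select either %s or its subcategories, not both" % pos)
    for pos, all_subs in SUBPOS.items():
        chosen = {s for p, s in subpos_ticked if p == pos}
        if chosen == all_subs:
            raise ValueError("all subcategories of %s selected; select %s itself" % (pos, pos))

def matches(token, alternatives):
    """True if the token satisfies every condition of at least one alternative."""
    return any(all(token.get(k) == v for k, v in alt.items()) for alt in alternatives)

# EITHER "VERB finite past" OR "NOUN dative neuter"
query = [
    {"pos": "VERB", "subpos": "finite", "tense": "past"},
    {"pos": "NOUN", "case": "dative", "gender": "neuter"},
]
token = {"pos": "NOUN", "subpos": "common", "case": "dative", "gender": "neuter"}
print(matches(token, query))
```

Each ticked combination becomes one dictionary of conditions; conditions inside a dictionary are joined by AND, and the dictionaries themselves by OR.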

  14. Built-in restrictions on search

  15. Literature
     • INTERA unified tagset project, www.elda.org/intera
     • Tomaž Erjavec et al. Multext-East specifications for Slavic languages. Budapest, 2003.
     • Jan Hajič. Positional Tags: Quick Reference (Czech "HM" Morphology), 2000.
     • Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL 2003. http://nlp.ipipan.waw.pl/~adamp/Papers/2003-eacl-ws12/ws12.pdf
     • Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: The Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003.
     • Широков В.А. et al. Корпусна лінгвістика. Київ: Довіра, 2005.
