From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo


  1. From RuN to RuN-Euro: a multilingual parallel corpus at the University of Oslo. Atle Grønn, Kjetil Rå Hauge, Elizaveta Khachaturyan, Ljiljana Šarić

  2. Corpus history at Oslo University: The University of Oslo, and notably its Department of Literature, Area Studies and European Languages, has a strong corpus tradition going back to the late Stig Johansson’s English-Norwegian Parallel Corpus, initiated in 1994. Department of Literature, Area Studies and European Languages

  3. Corpus history at Oslo University: The English-Norwegian corpus has been continued as the Oslo Multilingual Corpus (OMC), with subcorpora in Norwegian, English, French, and German, and smaller sections for Dutch and Portuguese. In addition, related parallel corpora for English-Swedish and English-Finnish, compiled in Sweden and Finland, are accessible from the same site.

  4. RuN history: "Where Russian meets Norwegian - languages at the interfaces". This three-year project, led by Atle Grønn, was started in 2008 with funding from the Norwegian Centre for International Cooperation in Higher Education (SIU) through its cooperation programme with Russia. Its objective is to bridge the gap between research and education in the field of advanced second language learning of Russian and Norwegian.

  5. RuN activities: Activities at RuN include new bachelor's and master's courses, three major international conferences, teaching materials, research articles, MA and PhD theses, and the RuN Corpus, a parallel Norwegian-Russian-English corpus.

  6. RuN-Euro: In 2010 the project expanded to include Bulgarian, French, Italian, Croatian, and Polish, with additional members Kjetil Rå Hauge, Elizaveta Khachaturyan, and Ljiljana Šarić, and technical assistants Vladislav Dorochin and Boris Orechov. The expansion was made possible through funding from the Faculty of Humanities at the University of Oslo.

  7. Profile: The RuN-Euro corpus, like the OMC, is heavily biased towards fiction. Priority is given to contemporary texts. Older texts can nevertheless be included without difficulty, since the search interface allows searches to be restricted by date of publication, author, genre, and other parameters.

  8. New acquisitions: The language distribution of this year’s additions leans heavily towards Russian and Bulgarian, as e-texts are more readily available in these languages. Texts have also been exchanged through cooperation with the Institute for Bulgarian Language at the Bulgarian Academy of Sciences. Next year will hopefully see more texts in the other languages, made available through OCR.

  9. Production 2010: More than fifty additional texts will have been added by the end of 2010 (originals were marked in red on the slide): Antoine de Saint-Exupéry, Le petit prince (Bg-BKS-En-Fr-It-No-Ru); Michail Bulgakov, Master i Margarita (Bg-En-It-No-Ru); Michail Bulgakov, Sobač´e serdce (Bg-En-No-Ru); Jostein Gaarder, Sofies verden (Bg-It-No-Ru); Il´ja Il´f and Evgenij Petrov, Dvenadcat´ stul´ev (Bg-BKS-En-Ru); Ernest Hemingway, The old man and the sea (Bg-BKS-En-Ru); Vladimir Nabokov, Lolita (Bg-En-No-Ru); Viktor Pelevin, Generation П (Bg-En-No-Ru); Anton Čechov, Rasskazy (Bg-En-No-Ru).

  10. Production 2010, cont.: Boris Akunin, Koronacija, ili Poslednij iz Romanov (En-No-Ru); Michail Bulgakov, Rokovye jajca (En-No-Ru); Ivo Andrić, Prokleta avlija (Bg-BKS-Ru); Ivan Vazov, Pod igoto (Bg-En-Fr).

  11. Free tool: A web-based sentence splitter has been built by project assistant Boris Orechov, based on Perl code by Jarle Ebeling of the OMC. The splitter has built-in lists of non-splitting abbreviations (Mr., Dr., ...) for English, Russian, and Bulgarian (the latter courtesy of the Institute for Bulgarian Language, Bulgarian Academy of Sciences). Source code is available. The output is either XML organised according to the specifications of the RuN-Euro project, or plain return-delimited chunks that can be used as input to Hunalign or other aligners. http://nevmenandr.net/run/tools/
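The idea of a splitter with a list of non-splitting abbreviations can be illustrated with a short Python sketch. This is not the project's Perl-based code: the abbreviation list, the function names `split_sentences` and `to_return_delimited`, and the whitespace tokenisation are all assumptions made for illustration, and only the plain return-delimited output is shown, not the RuN-Euro XML format.

```python
# Hypothetical abbreviation list for illustration only; the project's
# real lists for English, Russian, and Bulgarian are not reproduced here.
NON_SPLITTING = {"mr.", "mrs.", "dr.", "prof."}


def split_sentences(text):
    """Split after a token ending in ., ! or ? unless it is a listed abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token.lower() not in NON_SPLITTING:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences


def to_return_delimited(sentences):
    """Join sentences one per line, the plain-text form an aligner could read."""
    return "\n".join(sentences)


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He met Mr. Jones! Was it late?"
    print(to_return_delimited(split_sentences(sample)))
```

Without the abbreviation check, the sample would be cut after "Dr." and "Mr.", which is the kind of error the built-in lists are meant to prevent.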

  12. Aligner (1): [Figure: the aligner interface, with input panes, testing panes, and output panes.]

  13. Aligner (2): The aligner, programmed in Java, uses language-specific information: bilingual word lists of frequent words with reasonably straightforward translations, for example almost/nesten; alone, single/alene; already/allerede; .... Reference: Hofland, Knut and Stig Johansson. 1998. "The Translation Corpus Aligner: A program for automatic alignment of parallel texts." In Stig Johansson and Signe Oksefjell (eds.), Corpora and Crosslinguistic Research: Theory, Method, and Case Studies. Amsterdam: Rodopi, 87-100.
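How a bilingual word list can support alignment may be sketched as follows. This is an assumption-laden illustration in Python, not the Java aligner or the Translation Corpus Aligner algorithm itself: the data structure `ANCHORS`, the helper names, and the simple count-based score are hypothetical, and only the three word pairs quoted on the slide are used.

```python
# Word pairs taken from the slide; each entry maps a set of English
# words to a set of Norwegian words (structure is an assumption).
ANCHORS = [
    ({"almost"}, {"nesten"}),
    ({"alone", "single"}, {"alene"}),
    ({"already"}, {"allerede"}),
]

PUNCTUATION = ".,;:!?\"'()"


def tokens(sentence):
    """Lower-case the words of a sentence and strip surrounding punctuation."""
    return {word.strip(PUNCTUATION).lower() for word in sentence.split()}


def anchor_score(english, norwegian):
    """Count list entries with a word present on both sides of the pair."""
    en_tokens = tokens(english)
    no_tokens = tokens(norwegian)
    return sum(
        1
        for en_words, no_words in ANCHORS
        if en_tokens & en_words and no_tokens & no_words
    )


def best_match(english, candidates):
    """Return the candidate Norwegian sentence with the highest anchor score."""
    return max(candidates, key=lambda candidate: anchor_score(english, candidate))


if __name__ == "__main__":
    print(anchor_score("She was already almost alone.",
                       "Hun var allerede nesten alene."))
```

A real aligner would combine such lexical evidence with other cues, such as sentence length and position in the text; the sketch isolates only the word-list component.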

  14. Database of texts: Full list of texts (incompatible with Internet Explorer): http://www.nevmenandr.net/run/

  15. Glossa front end: RuN-Euro and the OMC share a common web interface, “Glossa”, developed by the Text Laboratory at the Department for Linguistics and Nordic Languages. Glossa is a user-friendly graphical interface built on top of the IMS Corpus Workbench query system. Morphosyntactic tagging of the texts is provided by the Text Laboratory. http://www.hf.uio.no/iln/tjenester/sprak/glossa/index.html

  16. Other parallel corpora: In our selection of texts, we try not to duplicate files that are already included in other parallel corpora, at least for the time being. What we would like to see in the future, however, is a "marketplace" for parallel corpora.

  17. The parallel corpora bazaar?: XML files; XSL transformations (for transforming files from project A into files for project B); lists of non-splitting abbreviations; bilingual word lists (for aligners).
