
ScanLex: Automatic generation of bilingual wordlists. Netordbogen Oslo meeting, 9 January 2006




  1. ScanLex: Automatic generation of bilingual wordlists. Netordbogen Oslo meeting, 9 January 2006. Anders Nøklestad, Janne Bondi Johannessen

  2. Procedure
     • Use parallel corpora
     • Use Uplug (http://uplug.sourceforge.net/) to automatically align sentences
     • Use the Wortschatz project system, Univ. of Leipzig, to find words that co-occur in language pairs (http://wortschatz.uni-leipzig.de/)
       - (With lots of help from Chris Biemann, UL)
     • For each word, select the translation with the most significant co-occurrence
     • Use Viggo Kann’s program to convert lists to VK’s Netordbog XML-format
     • + convert lists by XML stylesheet to HTML

  3. Languages
     • Norwegian Bokmål
     • Norwegian Nynorsk
     • Swedish
     • Danish
     • Icelandic
     • English
     • Not enough data for Faroese

  4. Example: Sentence pair from the Opus corpus
     • (src)="s26"> Dette værktøj vil hjælpe dig med grafisk at indstille serveren for CUPS-printsystemet . De tilgængelige muligheder er grupperet i relaterede emner og du kan hurtigt få adgang til dem gennem ikonvisningen til venstre .
     • (trg)="s26"> Dette verktøyet hjelper deg å konfigurere CUPS-printersystemet på en grafisk måte . De mulige valgene er klassifisert hierarkisk , og kan finnes fort ved hjelp av treet til venstre .
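Reading such aligned lines back into sentence pairs can be sketched as follows. This is a minimal illustration that assumes only the `(src)="id"> text` and `(trg)="id"> text` shape visible in the example above; the function names are hypothetical, and real Uplug or OPUS output may carry additional markup that this pattern does not handle.

```python
import re

# Pattern inferred from the two example lines only (an assumption, not a format spec).
LINE_RE = re.compile(r'\((src|trg)\)="([^"]+)">\s*(.*)')


def parse_alignment_line(line):
    """Return (side, sentence_id, text) or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    side, sent_id, text = match.groups()
    return side, sent_id, text.strip()


def pair_sentences(lines):
    """Join each src line with the trg line that carries the same id."""
    pending = {}
    pairs = []
    for line in lines:
        parsed = parse_alignment_line(line)
        if parsed is None:
            continue
        side, sent_id, text = parsed
        if side == "src":
            pending[sent_id] = text
        elif sent_id in pending:
            pairs.append((sent_id, pending.pop(sent_id), text))
    return pairs


example = [
    '(src)="s26"> Dette værktøj vil hjælpe dig med grafisk at indstille serveren for CUPS-printsystemet .',
    '(trg)="s26"> Dette verktøyet hjelper deg å konfigurere CUPS-printersystemet på en grafisk måte .',
]
pairs = pair_sentences(example)
```

Each resulting tuple holds the sentence id, the source text, and the target text, which is the unit the later co-occurrence step works on.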

  5. Wortschatz-format
     • Dette@da værktøj@da vil@da hjælpe@da dig@da med@da grafisk@da at@da indstille@da serveren@da for@da CUPS-printsystemet@da .@da De@da tilgængelige@da muligheder@da er@da grupperet@da i@da relaterede@da emner@da og@da du@da kan@da hurtigt@da få@da adgang@da til@da dem@da gennem@da ikonvisningen@da til@da venstre@da .
     • Dette@nb verktøyet@nb hjelper@nb deg@nb å@nb konfigurere@nb CUPS-printersystemet@nb på@nb en@nb grafisk@nb måte@nb .@nb De@nb mulige@nb valgene@nb er@nb klassifisert@nb hierarkisk@nb ,@nb og@nb kan@nb finnes@nb fort@nb ved@nb hjelp@nb av@nb treet@nb til@nb venstre@nb .
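The tagging shown above, where every token gets a language suffix so that identical word forms in two languages stay distinct, can be sketched in a few lines. The function names are hypothetical, and the sketch tags every whitespace-separated token, including the final period that the slide leaves untagged.

```python
def tag_tokens(sentence, lang):
    """Append '@lang' to every whitespace-separated token."""
    return " ".join(f"{token}@{lang}" for token in sentence.split())


def make_bisentence(src, src_lang, trg, trg_lang):
    """Concatenate the tagged source and target sentence into one bi-sentence."""
    return tag_tokens(src, src_lang) + " " + tag_tokens(trg, trg_lang)


tagged = tag_tokens("Dette værktøj vil hjælpe dig", "da")
bisentence = make_bisentence("du kan", "da", "og kan", "nb")
```

With the suffix in place, `kan@da` and `kan@nb` are counted as different words, which is what lets a monolingual co-occurrence tool be used on a language pair.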

  6. Corpora
     • For all languages:
       - The KDE subcorpus of the OPUS corpus (http://logos.uio.no/opus)
     • For Norwegian Bokmål, Icelandic, and English:
       - The EEA agreement
       - Gathered from various web sites

  7. Corpus sizes
     • Norwegian Bokmål: 378,308 tokens
     • Norwegian Nynorsk: 299,944 tokens
     • Swedish: 1,070,290 tokens
     • Danish: 726,964 tokens
     • Icelandic: 208,143 tokens
     • English: 1,610,779 tokens

  8. Number of aligned sentence pairs

  9. Co-occurrence computation
     For words A and B, occurring a and b times, respectively, in a corpus of n bi-sentences, and occurring together in k of those bi-sentences, the significance of their co-occurrence is given as a function of k and λ, where λ = ab/n is the number of joint occurrences expected if A and B were independent.
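The formula itself is not reproduced in the slide text. The Leipzig Wortschatz group has published a Poisson-based measure that uses exactly these symbols, sig(A,B) = -log(1 - e^(-λ) · Σ_{i=0}^{k-1} λ^i / i!) / log n. Assuming that this is the measure meant here, a minimal sketch of the counting and the computation might look as follows. The function names are hypothetical, and counting a and b as the number of bi-sentences that contain the word (rather than raw token counts) is also an assumption.

```python
import math
from collections import Counter
from itertools import product


def count_cooccurrences(bisentences):
    """Count per-word and per-pair bi-sentence frequencies.

    bisentences: iterable of (src_tokens, trg_tokens), tokens already tagged
    with their language (e.g. 'kan@da'), so the two sides never collide.
    """
    word_freq, pair_freq, n = Counter(), Counter(), 0
    for src_tokens, trg_tokens in bisentences:
        n += 1
        src_set, trg_set = set(src_tokens), set(trg_tokens)
        word_freq.update(src_set)
        word_freq.update(trg_set)
        pair_freq.update(product(src_set, trg_set))
    return word_freq, pair_freq, n


def poisson_significance(a, b, k, n):
    """-log P(X >= k) / log n for X ~ Poisson(a*b/n) (assumed measure)."""
    if a <= 0 or b <= 0 or n <= 1:
        raise ValueError("need a > 0, b > 0 and n > 1")
    if k <= 0:
        return 0.0  # P(X >= 0) = 1, so the significance is zero
    lam = a * b / n
    # Work in log space: log P(X = k), then the rest of the tail relative to it.
    log_first = -lam + k * math.log(lam) - math.lgamma(k + 1)
    series, term, i = 1.0, 1.0, k
    while term > 1e-17 * series:
        i += 1
        term *= lam / i
        series += term
    # If k is far below lam the series can overflow to inf; the tail is then
    # practically 1 and the clamp below returns 0.0.
    log_tail = log_first + math.log(series)
    return max(0.0, -log_tail / math.log(n))


toy = [
    (["indstille@da", "du@da"], ["konfigurere@nb", "du@nb"]),
    (["indstille@da"], ["konfigurere@nb"]),
    (["du@da"], ["du@nb"]),
]
word_freq, pair_freq, n = count_cooccurrences(toy)
pair = ("indstille@da", "konfigurere@nb")
sig = poisson_significance(word_freq[pair[0]], word_freq[pair[1]], pair_freq[pair], n)
```

Summing the Poisson tail directly, instead of computing one minus the cumulative sum, avoids losing all precision when k is much larger than λ, which is the case for the pairs of interest.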

  10. Word pairs prior to uniqification
     • 13968 scanlex_dais.txt
     • 37032 scanlex_danb.txt
     • 60554 scanlex_dann.txt
     • 76385 scanlex_dasv.txt
     • 73009 scanlex_enda.txt
     • 60014 scanlex_enis.txt
     • 99003 scanlex_ennb.txt
     • 62525 scanlex_ennn.txt
     • 69164 scanlex_ensv.txt
     • 56532 scanlex_isnb.txt
     • 15105 scanlex_isnn.txt
     • 17878 scanlex_issv.txt
     • 338349 scanlex_nbnn.txt
     • 108687 scanlex_nbsv.txt
     • 57473 scanlex_nnsv.txt

  11. Most significant candidates for "indstille"
      +--------------+------------------------+-----+
      | w1           | w2                     | sig |
      +--------------+------------------------+-----+
      | indstille@da | konfigurere@nb         | 807 |
      | indstille@da | Her@nb                 | 779 |
      | indstille@da | kan@nb                 | 657 |
      | indstille@da | du@nb                  | 522 |
      | indstille@da | sesjonsbehandleren@nb  | 287 |
      | indstille@da | utseendet@nb           | 225 |
      | indstille@da | panelet@nb             | 189 |
      | indstille@da | panelets@nb            | 168 |
      | indstille@da | oppgavelinje@nb        | 153 |
      | indstille@da | sette@nb               | 108 |
      | indstille@da | -konfigurasjonen@nb    |  61 |
      | indstille@da | Windows-filsystemer@nb |  61 |
      | indstille@da | cookie@nb              |  61 |
      | indstille@da | lines@nb               |  61 |
      | indstille@da | mode@nb                |  61 |
      | indstille@da | videomodusen@nb        |  61 |
      | indstille@da | SMB@nb                 |  60 |
      | indstille@da | XFree86@nb             |  60 |
      | indstille@da | opp@nb                 |  58 |
      | indstille@da | til@nb                 |  57 |
      +--------------+------------------------+-----+
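The selection step from the procedure, keeping only the candidate with the highest significance for each word, can be sketched as below using the top rows of the table above. The function name is hypothetical, and the slides do not say how ties are broken; this sketch keeps the first candidate seen.

```python
def best_translations(rows):
    """Keep, for each source word, the candidate with the highest significance.

    rows: iterable of (w1, w2, sig). On ties the first row seen wins
    (an assumption; the slides do not specify tie handling).
    """
    best = {}
    for w1, w2, sig in rows:
        if w1 not in best or sig > best[w1][1]:
            best[w1] = (w2, sig)
    return best


candidates = [
    ("indstille@da", "konfigurere@nb", 807),
    ("indstille@da", "Her@nb", 779),
    ("indstille@da", "kan@nb", 657),
    ("indstille@da", "du@nb", 522),
]
selected = best_translations(candidates)
```

For "indstille" this keeps `konfigurere@nb` and discards frequent function words such as `Her@nb` and `kan@nb`, which score high only because they appear in many of the same sentences.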

  12. Number of word pairs after uniqification

  13. The word pairs with the lowest significance values

  14. Results
     • The lists show the most significantly co-occurring word pairs
     • Word lists located at http://omilia.uio.no/scanlex/ws/

  15. Future
     • Convert bilingual word pair lists to a multilingual list
     • Supplement with word pairs from Lexin
     • Expand and improve wordlists through more and bigger parallel corpora
       - (The Bible? ENPC?)
     • Faroese
