
Free Swedish Word Lists or Hackers’ BLARK

Viggo Kann, KTH, Stockholm. GSLT meeting, January 26, 2008.


Presentation Transcript


  1. Free Swedish Word Lists or Hackers’ BLARK • Viggo Kann, KTH, Stockholm • GSLT meeting, January 26, 2008

  2. What is a free language resource? • Anyone can use it in an application • Anyone can study it and modify it • Anyone can take a copy of it • Anyone can improve it and release the improvements to the public, so that the whole community benefits (based on the four freedoms of free software, Richard Stallman)

  3. Strong free software culture • GNU project • FSF – Free Software Foundation • GPL – GNU General Public License • OSI – Open Source Initiative • Linux, TeX, Emacs, GCC, MySQL, PHP, Java, Python, Firefox

  4. First meeting of the Free Swedish Words group at KTH, January 16 • 11 persons from around Sweden • Lars Aronsson: Project Runeberg and Swedish Wikipedia (Wiktionary) • Lars Törnquist and Sven Lange: Swedish thesaurus built on Bring (1930) • Christian Mattson: Lexin dictionaries

  5. Niklas Johansson: Spelling error detection and correction in OpenOffice • Göran Andersson: DSSO – The large Swedish word list • Viggo Kann: Stava, Granskatagger, Synlex, Tvärslå Nordic dictionary • Per Starrbäck, Leif-Jöran Olsson, Tomas Padron-McCarthy, Erik Geijer

  6. Plans for more free words • Swedish synonyms in OpenOffice (Niklas) • Extending DSSO with synonyms, associations, etc. (Göran) • Building a free Swedish-English dictionary (Viggo) • Testing Swedish grammar checking in LanguageTool/OpenOffice (Viggo & Niklas)

  7. Typical ways to construct a resource • …if you are a language technologist: Get funding; use resources that are free to use for researchers; hire linguists to do the heavy jobs • …if you are a free software hacker: Use other free resources; collect data from lots of people using e.g. a wiki or a web form

  8. Example: Synlex • Construct a Swedish dictionary of synonyms as a list of synonymous pairs • I don’t want to work a lot • I don’t want to pay anyone to work • The resulting list should become free

  9. Ideas • Automatically construct a large set of word pairs that might be synonyms • Use tens of thousands of people, each willing to make a small contribution without payment, to check the word pairs

  10. More ideas • Use the Lexin on-line Swedish-English dictionary web site, which had 9 million lookups each month (now 25 million) • Users visit Lexin to translate words, and are thus probably motivated to help me • Each time a user makes a lookup, give her the opportunity to decide whether two words are synonyms or not

  11. My plan • Construct lots of possible synonyms • Sort out bad synonym pairs automatically • Ask lots of users if the rest of the pairs are good synonyms • Analyze the gradings done by the users and decide which pairs to keep

  12. Step 1: Construct lots of possible synonyms • If we have access to a Swedish-English dictionary SE and an English-Swedish dictionary ES, try to translate each word to English and back again to Swedish • $\{(w,v) : \exists y : y \in SE(w) \land v \in ES(y)\}$ or $\{(w,v) : \exists y : y \in SE(w) \land y \in SE(v)\}$ • 616 000 word pairs were generated
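The two candidate sets in Step 1 can be sketched in a few lines of Python. This is only an illustration, not the code used for Synlex: the function names and the toy dictionaries are invented, and the sketch treats pairs as unordered and drops pairs of a word with itself, which the slide does not specify.

```python
from collections import defaultdict

def round_trip_pairs(se, es):
    """Pairs (w, v) where some English word y is in SE(w) and v is in ES(y)."""
    pairs = set()
    for w, english in se.items():
        for y in english:
            for v in es.get(y, ()):
                if v != w:  # assumption: a word is not its own synonym candidate
                    pairs.add(tuple(sorted((w, v))))
    return pairs

def shared_translation_pairs(se):
    """Pairs (w, v) where SE(w) and SE(v) share at least one English word."""
    by_english = defaultdict(set)
    for w, english in se.items():
        for y in english:
            by_english[y].add(w)
    pairs = set()
    for words in by_english.values():
        ordered = sorted(words)
        for i, w in enumerate(ordered):
            for v in ordered[i + 1:]:
                pairs.add((w, v))
    return pairs

# Toy dictionaries, invented for illustration only.
SE = {"klass": {"class", "grade"}, "årskurs": {"grade"},
      "sort": {"kind", "sort"}, "slag": {"kind"}}
ES = {"class": {"klass"}, "grade": {"klass", "årskurs", "rang"},
      "kind": {"slag", "sort"}, "sort": {"sort"}}

candidates = round_trip_pairs(SE, ES) | shared_translation_pairs(SE)
```

The round trip through ES can reach words that SE does not list as headwords (here "rang"), which is one reason to use both constructions.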

  13. Step 2: Remove bad synonym pairs automatically • Use RI (Random Indexing) [Kanerva, Kristoferson, Holst 2000] to measure the distance between words represented in a large vector space • Keep pairs that have small enough distance in the vector space
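A minimal sketch of the idea behind Step 2, under stated assumptions: each word gets a sparse random index vector, a word's context vector is the sum of the index vectors of the words it co-occurs with, and candidate pairs are kept when their context vectors are close. The dimensionality, the number of nonzero entries, the whole-sentence context window, the cosine measure and the threshold are all assumed values chosen for illustration; the slides do not state the settings actually used.

```python
import math
import random
from collections import defaultdict

DIM = 300      # dimensionality of the vector space (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)

def index_vector(word):
    """Sparse random ternary vector, reproducible because it is seeded by the word."""
    rng = random.Random(word)
    vec = [0] * DIM
    for k, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1 if k % 2 == 0 else -1
    return vec

def context_vectors(sentences):
    """Sum, for each word, the index vectors of the other words in its sentences."""
    ctx = defaultdict(lambda: [0] * DIM)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, other in enumerate(sentence):
                if i != j:
                    iv = index_vector(other)
                    ctx[word] = [a + b for a, b in zip(ctx[word], iv)]
    return ctx

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_close_pairs(pairs, ctx, threshold=0.5):
    """Keep pairs whose cosine similarity is high, i.e. whose distance is small."""
    return [(w, v) for w, v in pairs
            if w in ctx and v in ctx and cosine(ctx[w], ctx[v]) >= threshold]
```

Words that occur in the same contexts end up with similar context vectors, so a pair such as "katt" and "hund" in the sentences "katt äter fisk" and "hund äter fisk" would pass the filter, while a pair containing a word never seen in the corpus is dropped.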

  14. Step 3: Ask lots of users if the rest of the pairs are good synonyms • When a user has sent a word to the Lexin dictionary, he receives the translation followed by a question like: Are 'spread' and 'lengthen' synonyms? Answer using a scale from 0 to 5, where 0 means 'I don’t agree' and 5 means 'I do fully agree', or answer 'I don’t know'

  15. Step 4: Analyzing the gradings done by the users • 1.2 million gradings were made in less than 2 months • Grading statistics were analyzed on several occasions • Some users sent comments
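One plausible way to turn raw gradings into a keep-or-drop decision is to average the grades per pair and require a minimum number of gradings. The sketch below assumes this; the minimum count is a hypothetical value, and the mean threshold of 2 is taken from the statistics slide, which reports pairs "graded ≥ 2" as being in the dictionary. 'I don’t know' answers are represented as `None` and ignored.

```python
from collections import defaultdict

MIN_GRADINGS = 5   # hypothetical: gradings required before a pair is judged
KEEP_MEAN = 2.0    # the statistics slide reports pairs "graded >= 2" as kept

def mean_gradings(gradings):
    """gradings: iterable of (word_a, word_b, grade); grade is 0..5 or None."""
    by_pair = defaultdict(list)
    for a, b, grade in gradings:
        if grade is not None:  # skip 'I don't know' answers
            by_pair[tuple(sorted((a, b)))].append(grade)
    return {pair: (sum(g) / len(g), len(g)) for pair, g in by_pair.items()}

def pairs_to_keep(gradings, min_gradings=MIN_GRADINGS, keep_mean=KEEP_MEAN):
    """Return {pair: mean grade} for pairs with enough gradings and a high enough mean."""
    stats = mean_gradings(gradings)
    return {pair: mean for pair, (mean, n) in stats.items()
            if n >= min_gradings and mean >= keep_mean}
```

Sorting the two words before counting means that gradings of (klass, rang) and (rang, klass) are pooled, which assumes synonymy is treated as symmetric.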

  16. More and more interesting gradings as time goes by

  17. Distribution of mean gradings of word pairs

  18. Some statistics (January 2008) • 2.8 M user gradings done • 75 000 pairs (graded ≥ 2) in dictionary • 108 000 pairs suggested by users • 62 000 unique pairs suggested • 20 000 of them have been accepted

  19. Example: Synonyms to klass (class) • 5: rang (grade), rank (rank), slag (kind) • 4: kategori (category), stånd (social class), årskurs (grade) • 3: fack (sphere), grad (degree), grupp (group), kvalitet (quality), nivå (level), sort (sort), standard (standard), stil (style) • 2: skikt (layer), storleksordning (magnitude), typ (type) • 1: poäng (point), stadga (stability) • 0: uppdrag (mission), utbilda (educate)

  20. How to prevent abuse? • Many gradings of a word pair are needed before it’s considered to be good • The pair to be graded is randomly picked from a very large list • Word pairs suggested by users are spell checked before they are added to the very large list
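Two of these safeguards can be sketched as follows, with loudly stated assumptions: the spell check is reduced to membership in a known word list (a stand-in for a real spell checker, not how the Synlex site does it), and both function names are hypothetical.

```python
import random

def pick_pair_to_grade(candidate_pairs, rng=random):
    """Pick the next pair uniformly at random, so users cannot choose what to grade."""
    return rng.choice(candidate_pairs)

def accept_suggestion(pair, known_words, candidate_pairs):
    """Add a user-suggested pair only if both words pass the toy spell check."""
    a, b = (w.strip().lower() for w in pair)
    if a == b or a not in known_words or b not in known_words:
        return False
    normalized = tuple(sorted((a, b)))
    if normalized not in candidate_pairs:
        candidate_pairs.append(normalized)
    return True
```

With a very large candidate list, a single user is unlikely to be shown the same pair often enough to sway its mean grade, which is why random selection and a required number of gradings work together.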

  21. People's definition of synonymy • Exact meaning of 'synonym' wasn’t defined • Users will grade using their intuitive understanding of the concept of synonymy and the words in the pair • The produced dictionary will use the people's own definition of synonymy Hopefully this is exactly what they want!

  22. Links • www.dsso.se: The large Swedish word list • www.nada.kth.se/stava: Spell checker • lexin.nada.kth.se/synlex.html: 75 000 synonyms • sv.wiktionary.org: 50 000 word dictionary • www.thesauruslex.com: Hyperlexicon
