
Knowledge Center for Processing Hebrew

Alon Itai, CS, Technion. Tools for underrepresented languages: computer tools, and especially the Internet, are Anglophile, and search engines are not tooled for morphologically rich languages.

Presentation Transcript


  1. Knowledge Center for Processing Hebrew Alon Itai – CS Technion

  2. Tools for underrepresented languages • Computer tools, and especially the Internet, are Anglophile. • Search engines are not tooled for morphologically rich languages.

  3. Search “dog” “dogs” “and dogs” כלבים מאולפים מחפשים בית רוני אילוף כלבים אתרי קטגורית כלבים הב-הב אתר חיות המחמד של ישראל! קובי חזן אילוף כלבים היחידה המיוחדת לאילוף כלבים • כלב • כלב - ויקיפדיה • כלבים מאולפים מחפשים בית • כלב (יונק) • כלבים | כלב • אוגר זהב כלב הבית מכונה בלשון המדע – כלב זאב ביתי • עמותת SOS חיות - בחירת כלב מתאים • לוח חיות מחמד - כלבים חתולים דגים תוכים לאימוץ ומסירה - כלב • כלבים | כלב אתר המציע שידוכים בין גזעים, בייביסיטרים, תזונה וטיפוח, וטרינרים, פנסיונים, מאלפים ולוח מודעות. • Dog אתר הכלבים מכיל הרבה מידע, מאמרים, קורסים, תמונות וקטעי וידאו של כלבים וכל הקשור בהם • dog גזעי כלבים · תמונת החודש · הכלב והחוק · רפואה וטיפול · קורסים · מאמרים · לוח מודעות · כלבי הצלה · קטעי וידאו · תמונת השנה · פינת האימוץ ... זולו משחקים פאזלים - משחק לגיל הרך - פאזל חתול עם כלב על אלמנה וכלב ניופאונדלנד, כלבי רועים וכלב רועים בלגי - PETNET.co.il ליווי, עזרת זולת רפואית וכלב נחייה

  4. Tools for underrepresented languages • Computer tools, and especially the Internet, are Anglophile. • Search engines are not tooled for morphologically rich languages. • Email and chats do not cope well with strange alphabets. • As a result, people use (pidgin) English for communication, … • The local language is used less and less. אבגדהוזחטיכלמנסעפצקרשת

  5. The problem • Because of the small number of speakers, there is little economic incentive for commercial companies to develop tools. • Even when tools are available, they are not open source. • Tools developed at universities are not fit for general use: not robust enough, no standard interface, lack of documentation.

  6. Duplication of Effort • Every researcher has to redevelop her own tools before conducting original research. • For example, Hebrew has many morphological analyzers: • Choueka and Shapira 1964, • Ornan 1987, Lavie et al. 1988, • Bentur et al. 1992, • Segal 1999, • HSPELL, • Yona and Wintner 2005.

  7. The Knowledge Center • In 2003, the Israeli Ministry of Science and Technology established a Knowledge Center for Processing Hebrew. • Its aim is to develop products (software and databases) for processing Hebrew and make them available to the public, both in academia and industry. • Researchers from four universities are involved in the Center's activities.

  8. The researchers • Yoad Winter (Technion) • Shuly Wintner (Haifa University) • Michael Elhadad (Ben Gurion University) • Arnon Cohen (Ben Gurion University) • Yoram Singer (Hebrew University) • Eli Shamir (Hebrew University) • Alon Itai (Technion)

  9. The model • The ministry provides initial funds. • The Center should be self-sustaining: it should finance itself by selling products. The problems: • The market is too small; had it been large, there would have been no need for the Center. • Selling products contradicts our philosophy of open research and open code.

  10. Licensing Policy • Available under the GPL (GNU General Public License): you get it for free if all products derived from it are also under the GPL. • Payment only for special services. • A non-exclusive license can be obtained for commercial use.

  11. XML • All products are represented in XML. • Readable both by machines and by humans. • Enables using off-the-shelf tools for on-screen presentation and validation. • Example (info for the morphological parser):
      <item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
        <noun gender="masculine" number="singular" plural="im">
          <replace gender="masculine" number="plural" script="formal" transliterated="bqarim" undotted="בקרים"/>
        </noun>
      </item>
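A lexicon item like the one on this slide can be read with any standard XML library. The sketch below uses Python's `xml.etree.ElementTree`. The function name `read_item` and the reading of the `plural` attribute as a plural suffix and of `<replace>` as an irregular-plural override are assumptions made for illustration, not documented behaviour of the Center's tools.

```python
import xml.etree.ElementTree as ET

# The lexicon item from the slide, with straight quotes.
ITEM_XML = """
<item id="17580" script="formal" transliterated="bwqr" undotted="בוקר" dotted="בֹּקֶר">
  <noun gender="masculine" number="singular" plural="im">
    <replace gender="masculine" number="plural" script="formal"
             transliterated="bqarim" undotted="בקרים"/>
  </noun>
</item>
"""

def read_item(xml_text):
    """Return a flat dict describing one noun entry (hypothetical helper)."""
    item = ET.fromstring(xml_text)
    noun = item.find("noun")
    replace = noun.find("replace")  # None when the entry has no <replace>
    return {
        "id": item.get("id"),
        "singular": item.get("transliterated"),
        "gender": noun.get("gender"),
        # Assumption: "plural" names the plural suffix.
        "plural_suffix": noun.get("plural"),
        # Assumption: <replace> supplies an irregular plural form.
        "plural_override": replace.get("transliterated") if replace is not None else None,
    }

entry = read_item(ITEM_XML)
print(entry["singular"], entry["plural_override"])
```

Because the entry is plain XML, the same file could equally be validated or rendered with other off-the-shelf tools.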

  12. XML (2) • Facilitates interface between tools: • For example, the output of the morphological analyzer is the input for the morphological disambiguator. • Thus one can match different morphological analyzers with different disambiguators and compare their results
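The mix-and-match idea can be sketched as follows. Every name here (`analyzer_a`, `analyzer_b`, `pick_first`, `pick_noun`, the `<token>` and `<analysis>` elements) is hypothetical, and the analyses are placeholder values rather than real linguistic output. The only point is that a shared XML format lets any analyzer feed any disambiguator.

```python
import xml.etree.ElementTree as ET
from itertools import product

def _to_xml(token, analyses):
    """Serialise candidate analyses in a made-up shared XML format."""
    root = ET.Element("token", surface=token)
    for lemma, pos in analyses:
        ET.SubElement(root, "analysis", lemma=lemma, pos=pos)
    return ET.tostring(root, encoding="unicode")

def analyzer_a(token):
    # Toy analyzer: two placeholder candidates.
    return _to_xml(token, [("lemma1", "verb"), ("lemma2", "noun")])

def analyzer_b(token):
    # Toy analyzer: one placeholder candidate.
    return _to_xml(token, [("lemma2", "noun")])

def pick_first(xml_text):
    # Toy disambiguator: keep the first candidate.
    return dict(ET.fromstring(xml_text).findall("analysis")[0].attrib)

def pick_noun(xml_text):
    # Toy disambiguator: prefer a noun reading, else the first candidate.
    analyses = ET.fromstring(xml_text).findall("analysis")
    nouns = [a for a in analyses if a.get("pos") == "noun"]
    return dict((nouns or analyses)[0].attrib)

def compare(token):
    """Run every analyzer/disambiguator pairing on one token."""
    return {
        (an.__name__, dis.__name__): dis(an(token))
        for an, dis in product([analyzer_a, analyzer_b], [pick_first, pick_noun])
    }

for pairing, choice in compare("byt").items():
    print(pairing, choice)
```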

  13. Products • Morphological analyzers • Morphological disambiguators • Lexicon • Corpora • Speech database • Tools for editing lexicons and tagging corpora • PR: forum, …

  14. The lexicon by part of speech • Total: 21,417

  15. GUI for editing the lexicon

  16. Morphological disambiguators • Roy Bar-Haim constructed an HMM-based parser that partitions each word in a corpus into morphemes (success rate 96%). • Erel Segal combined a Brill-like method with a priori occurrence probabilities. • Meni Adler used an HMM on whole words. • All three disambiguators are available at the Center.

  17. Corpora

  18. Corpora (2) • 6000 sentences of manually tagged corpus (12,000 tokens).

  19. Tree bank • 6000 syntactically parsed sentences. • Used for automatic parsing.

  20. Conclusions • The Center is an example of cooperation between researchers in several universities. • Many users have downloaded the products. • 10 companies have purchased licenses.

  21. Conclusions (2) • Money is running out, … • The model requires money, experts, and commitment. • It is not suitable for languages with very few speakers, or for poor communities.

  22. Modern Hebrew • Official language of the State of Israel • Spoken by 7 million people • Related to, but linguistically distinct from, Biblical Hebrew • Morphologically rich

  23. Semitic Word Formation • root + pattern → word • Patterns: CaCaC, yiCCoC • Root ktb: katab (he wrote), yiktob (he will write) • Root šbr: šabar (he broke), yišbor (he will break)
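Root-and-pattern formation can be illustrated with a few lines of code. This is a minimal sketch in the slide's transliteration, assuming each `C` in a pattern stands for the next root consonant. The function name `apply_pattern` is made up for the example, and real Hebrew morphology involves further sound changes that the sketch ignores.

```python
import unicodedata

def apply_pattern(root: str, pattern: str) -> str:
    """Fill each 'C' slot of the pattern with the next root consonant."""
    # Normalise so that a letter such as š counts as a single character.
    root = unicodedata.normalize("NFC", root)
    if len(root) != pattern.count("C"):
        raise ValueError("root length must match the number of C slots")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for root in ("ktb", "šbr"):
    for pattern in ("CaCaC", "yiCCoC"):
        print(root, "+", pattern, "->", apply_pattern(root, pattern))
```

The loop reproduces the four forms on the slide: katab, yiktob, šabar, yišbor.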

  24. Writing System • Most vowels are omitted. • Particles are prepended to words. Example: h (definite article), b (preposition “in”), w (conjunction “and”). • wbbyt = w + b + ha + byt, “and in the house”
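Because particles are written attached to the word, a tool has to guess where the prefixes end. The sketch below enumerates prefix splits of a transliterated word against a toy lexicon. The names `PREFIXES` and `segmentations` and the one-word lexicon are assumptions for illustration. The sketch only strips letters that are actually written, so it does not recover the unwritten article in w + b + ha + byt.

```python
PREFIXES = ("w", "b", "h")  # the three particles named on the slide

def segmentations(word, lexicon):
    """Return every split of `word` into prefix particles plus a lexicon entry."""
    results = []

    def strip(prefixes, rest):
        if rest in lexicon:
            results.append(prefixes + [rest])
        for p in PREFIXES:
            # Try peeling one more particle, leaving a non-empty remainder.
            if rest.startswith(p) and len(rest) > len(p):
                strip(prefixes + [p], rest[len(p):])

    strip([], word)
    return results

toy_lexicon = {"byt"}  # toy lexicon holding only "house"
print(segmentations("wbbyt", toy_lexicon))
```

With a larger lexicon the same word can yield several splits, which is one source of the ambiguity discussed on the next slide.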

  25. Morphological Ambiguity • Most words are morphologically ambiguous • Example: šbth שבתה • šavta = šbt + CaCCa = stopped working • šavta = šbh + CaCCa = took prisoner • šabatah = her Saturday • še-b-te = that in tea • še-b-ha-te = that in the tea • še-bit-h = that her daughter…
