DIAC+: A Professional Diacritics Recovering System

Dan Tufiş, Alexandru Ceauşu (tufis@racai.ro, aceausu@racai.ro). Research Institute for Artificial Intelligence, Romanian Academy, 13, Calea "13 Septembrie", 050711, Bucharest.


Presentation Transcript


  1. DIAC+: A Professional Diacritics Recovering System. Dan Tufiş, Alexandru Ceauşu (tufis@racai.ro, aceausu@racai.ro). Research Institute for Artificial Intelligence, Romanian Academy, 13, Calea "13 Septembrie", 050711, Bucharest

  2. Outlook • Motivations • Related work and different approaches • Diacritics in Romanian • DIAC+ Architecture • Evaluation • Implementation

  3. Motivations • Almost all European languages use diacritics • In most languages that use diacritical characters, the diacritics are not merely decorative; they may carry grammatical and/or semantic meaning • Missing or misused diacritics are extremely annoying, especially in texts meant for publication • Why are diacritics still missing nowadays? • reuse of older texts • ergonomic factors (non-localized keyboards, multiple keystrokes for a diacritical character) • inappropriate authoring tools or character-set converters • typos

  4. Different approaches & related work • Word-based (dictionary-supported) approaches: • El-Bèze et al. (1994), Yarowsky (1994), Spriet & El-Bèze (1997), Simard (1998), Tufiş & Chiţu (1999) etc. • Character-based approaches: • Mihalcea (2002), Bobiceva (2008), Zweigenbaum and Grabar (2002), Wagacha et al. (2006), De Pauw et al. (2007) etc.

  5. Diacritics in Romanian (I) • Romanian has 5 diacritical characters: ă, â, î, ş and ţ (plus their uppercase variants) • Two categories of words may contain diacritics: • U-words (Unambiguous words): legal words of Romanian which, once their diacritics are stripped off, are no longer words of the language: • padure (pădure - forest), tufis (tufiş - bush), cantar (cântar - balance), carare (cărare - pathway), casmir (caşmir - cashmere), macar (măcar - at least), fara (fără - without), cati (câţi - how many). Their recovery is trivial when a back-up lexicon is available.

  6. Diacritics in Romanian (II) • A-words (Ambiguous words): legal words of Romanian which, once their diacritics are stripped off, are still words of the language; a traditional spell-checker never flags these words. For instance, the string fata could mean any of the following: • fata – the girl, fată – a girl; or (about animals) gives birth, fâţa – the quick-swimming little fish/the coquette, fâţă – a quick-swimming little fish/a coquette, faţa – the face, faţă – a face, făta – (about animals) to give birth; gave birth, fătă – (about animals) just gave birth.

  7. Diacritics in Romanian (III) • Most A-words can be disambiguated based on grammatical information; those that cannot are called S-words (Semantically ambiguous words). • The proper treatment of S-words (which share the same morpho-syntactic properties) requires semantic disambiguation. • For the previous example, knowing the morpho-syntactic properties (Ncfsry: common noun, feminine, definite form, direct case) still leaves three diacritics restoration possibilities with very different meanings: • fata (Ncfsry) – the girl, • faţa (Ncfsry) – the face, • fâţa (Ncfsry) – the quick-swimming little fish/the coquette. A text may: • be completely diacritics-free (Tufiş and Chiţu 1999) or • partially contain diacritics (and not always correctly); this is the harder case
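The U-word / A-word / S-word distinction above can be illustrated with a minimal Python sketch. It is not the DIAC+ implementation: the toy lexicon, the function names `strip_diacritics` and `classify`, and the indefinite-form tag `Ncfsrn` are assumptions made for illustration, and any stripped form with more than one diacritical spelling is treated as ambiguous, which is a simplification of the slide definitions.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (ă -> a, ţ -> t)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def classify(lexicon):
    """Map each stripped form to 'U', 'A' or 'S'.

    lexicon: iterable of (surface_form, msd) pairs (hypothetical toy format).
    """
    entries = defaultdict(set)
    for surface, msd in lexicon:
        entries[strip_diacritics(surface)].add((surface, msd))

    classes = {}
    for stripped, pairs in entries.items():
        surfaces = {surface for surface, _ in pairs}
        if surfaces == {stripped}:
            continue  # no diacritical spelling exists: nothing to recover
        if len(surfaces) == 1:
            classes[stripped] = "U"  # exactly one possible restoration
            continue
        by_msd = defaultdict(set)
        for surface, msd in pairs:
            by_msd[msd].add(surface)
        # S-word: at least two spellings share the same morpho-syntactic tag
        is_s_word = any(len(group) > 1 for group in by_msd.values())
        classes[stripped] = "S" if is_s_word else "A"
    return classes

# Toy lexicon; the tags other than Ncfsry are illustrative assumptions.
LEXICON = [
    ("pădure", "Ncfsrn"),
    ("fata", "Ncfsry"), ("faţa", "Ncfsry"), ("fată", "Ncfsrn"),
    ("casa", "Ncfsry"), ("casă", "Ncfsrn"),
    ("drum", "Ncmsrn"),
]
CLASSES = classify(LEXICON)
```

With this toy lexicon, `padure` has a single restoration, `casa` is resolved by its tag alone, and `fata` remains ambiguous even once the tag `Ncfsry` is known.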

  8. Diacritics in Romanian (IV) In an ideal setting, with a full-coverage dictionary and a text with no typographical errors other than the missing diacritics, about 25% (#A-words/#Words) of the total number of words would remain ambiguous. Our supposedly error-free texts nevertheless contained 72,722 (1.09%) typing errors (journalism texts) and 29,387 (0.84%) typing errors (juridical texts).

  9. DIAC+ Architecture (diagram) • Input text • Processing steps: (i) Tokenization, (ii) Hypotheses generation, (iii) Tiered tagging, (iv) Candidate selection, (v) Unknown words processing • Resources shown: tokenizer resources; D0, D1, D2 dictionaries; language model; character model • Output text & spelling alternatives

  10. Dictionaries D0, D1, D2 and Hypotheses Generation • LEX dictionary – normative lexicon with entries of the form <wordform> <tag> <lemma>; about 1 million entries • The D0 dictionary is the subset of LEX containing all the words with at least one diacritical character; • The D1 dictionary is the diacritics-stripped version of LEX; • The D2 dictionary contains words in the current text which are neither in D0 nor in D1 and which are suspected of being typing errors; they are derived from the words in D0 ∪ D1 by adding or removing one character or by switching two consecutive characters (additionally, the switched characters should be neighbors on the keyboard) • In the hypotheses generation step, a word is first searched for in D0 ∪ D1 • If the word cannot be found in D0 ∪ D1, it is searched for in the D2 dictionary. A word which is not found in any of the system's lexicons is considered unknown and irrecoverable by the word-based approach, and its processing is left to a character-based recovery module. • A word W occurring in the current text may be associated with several entries <surface-form_k, MSD_k> in the LEX word-form lexicon; the tagging step is used to filter this set and eventually select the single contextually correct <surface-form_i>.
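The lookup order described on this slide could be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the system's code: `build_index`, `one_edit_variants` and `hypotheses` are hypothetical names, the index merges the roles of D0 and D1 into one mapping from stripped forms to LEX spellings, near-miss candidates are computed on the fly instead of being stored in a D2 dictionary, and the keyboard-neighbor constraint on transpositions is omitted.

```python
import string
import unicodedata

def strip_diacritics(word):
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(lex_words):
    """Map every diacritics-free key to the LEX spellings it may stand for."""
    index = {}
    for word in lex_words:
        index.setdefault(strip_diacritics(word), set()).add(word)
    return index

def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """Strings one insertion, one deletion or one adjacent swap away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return (deletes | inserts | swaps) - {word}

def hypotheses(token, index):
    """Return candidate spellings; an empty list means 'unknown word'."""
    key = strip_diacritics(token.lower())
    if key in index:  # direct hit (the D0/D1 case)
        return sorted(index[key])
    found = set()  # suspected typing error (the D2 case)
    for variant in one_edit_variants(key):
        found |= index.get(variant, set())
    return sorted(found)

INDEX = build_index(["pădure", "fata", "faţa", "fată"])
```

Here `hypotheses("padure", INDEX)` yields the single spelling `pădure`, `hypotheses("paudre", INDEX)` reaches the same entry through a transposition, and a token with no match at all is left for the character-based module.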

  11. Tiered Tagging & Candidate selection • The tagger uses a special HMM language model in which the transition probabilities were computed from the regular training corpora (i.e. with diacritics) and the emission probabilities were computed from the diacritics-stripped training corpora. • TT = a two-step tagging process • Tagging with a reduced tagset LM (92 tags) • Recovering left-out information from the lexical tagset (615 tags) • Candidate selection. The U-words are replaced with their diacritical counterparts. The A-words which are not S-words are replaced by the surface form identified by the MSD that the tagger assigned to the respective A-word. For the S-words, either the user is presented with a list of contextually meaningful choices or the replacement is done automatically based on lexical probabilities or some probabilistic preferences.
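The candidate selection rule can be sketched in a few lines of Python. The function name `select_candidate`, the frequency table and the tag `Ncfsrn` are hypothetical; the sketch assumes the tagger has already assigned an MSD to the token and uses raw frequency as a stand-in for the "lexical probabilities" mentioned on the slide.

```python
def select_candidate(candidates, assigned_msd, freq=None):
    """Pick a surface form for one token.

    candidates: list of (surface_form, msd) hypotheses from the lexicon.
    assigned_msd: the tag chosen by the tagger for this token.
    freq: optional surface-form frequencies, used only for S-words.
    Returns (choice, alternatives); alternatives are left for user review.
    """
    matching = sorted({s for s, msd in candidates if msd == assigned_msd})
    if not matching:
        return None, []  # tag not licensed by the lexicon: nothing to pick
    if len(matching) == 1:
        return matching[0], []  # U-word or grammatically resolved A-word
    # S-word: several spellings share the tag; fall back on frequency
    freq = freq or {}
    ranked = sorted(matching, key=lambda s: -freq.get(s, 0))
    return ranked[0], ranked[1:]

# Illustrative hypotheses for the stripped token "fata".
CANDS = [("fata", "Ncfsry"), ("faţa", "Ncfsry"), ("fată", "Ncfsrn")]
FREQ = {"faţa": 10, "fata": 5}  # invented counts, for illustration only
```

Returning the alternatives alongside the choice mirrors the two modes on the slide: a supervised run can show them to the user, an unsupervised run can ignore them.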

  12. Character Model and Unknown Words Processing (I) • Unknown word processing is used as a backup for the candidate selection stage when no equivalent word form was found in the lexicon. This case is quite rare: very few words are not covered by our lexicon of almost 1,000,000 entries. The unknown word processing can be designed to work in parallel with the candidate selection phase. For processing unknown words, we used a character-based N-gram model similar to the one used in (Mihalcea, 2002). • We used SRILM - SRI Language Modeling Toolkit (Stolcke, 2002) to train several character models. The training corpus contained 5,124,277 characters (including spaces) in 48,308 sentences and the test corpus contained 613,234 characters in 6,411 sentences.

  13. Character Model and Unknown Words Processing (II) We used Viterbi estimation with a 5-gram character model to find the most probable string for the unknown word.
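As a rough illustration of the idea, the following Python sketch runs a Viterbi search over the possible diacritical variants of each character. It deliberately swaps in a character bigram model with crude add-one smoothing, trained on three toy words, in place of the 5-gram SRILM model mentioned above; `CANDIDATES`, `train_bigram` and `restore` are names made up for this example.

```python
import math
from collections import Counter

# Each plain letter may stand for itself or for a diacritical variant.
CANDIDATES = {"a": "aăâ", "i": "iî", "s": "sş", "t": "tţ"}

def train_bigram(words):
    """Return a log-probability function log P(cur | prev) over characters."""
    bigrams, context, vocab = Counter(), Counter(), set()
    for word in words:
        chars = "^" + word + "$"  # explicit word boundaries
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[prev, cur] += 1
            context[prev] += 1
    size = len(vocab)

    def logp(prev, cur):
        # add-one smoothing keeps unseen bigrams finite
        return math.log((bigrams[prev, cur] + 1) / (context[prev] + size))

    return logp

def restore(word, logp):
    """Viterbi search: most probable diacritical spelling of `word`."""
    # best[c] = (score, text) of the best path ending in character c
    best = {"^": (0.0, "")}
    for ch in word:
        options = CANDIDATES.get(ch, ch)
        best = {
            c: max((score + logp(prev, c), text + c)
                   for prev, (score, text) in best.items())
            for c in options
        }
    final = max((score + logp(prev, "$"), text)
                for prev, (score, text) in best.items())
    return final[1]

MODEL = train_bigram(["pădure", "păduri", "pădurar"])
RESTORED = restore("padure", MODEL)
```

Because only the best path per final character is kept at each position, the search stays linear in the word length; a 5-gram model would need the state to hold the last four characters instead of one.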

  14. Evaluation (I) • Word-based vs. character-based evaluations • The evaluation scenario • R = reference corpus, tokenized, tagged and lemmatized; hand-validated (approx. 118,000 words and about 502,000 characters). • TT = the diacritics-stripped version of R • RT = the tag- and lemma-stripped version of TT • Baseline system: from the Agenda Corpus (10 million words) we derived a dictionary in which the head entries are non-diacritical forms of words and the body of each entry is the list of diacritical counterparts, each with its frequency in the corpus; the baseline system replaces a head word from this lexicon with its most frequent diacritical counterpart
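The baseline described above amounts to a frequency table keyed by the diacritics-free form. A minimal Python sketch, with a four-token stand-in for the corpus and hypothetical function names `build_baseline` and `restore_baseline`:

```python
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_baseline(corpus_tokens):
    """Head entry: stripped form; body: counts of diacritical counterparts."""
    table = defaultdict(Counter)
    for token in corpus_tokens:
        table[strip_diacritics(token)][token] += 1
    return table

def restore_baseline(tokens, table):
    """Replace each known head word by its most frequent counterpart."""
    return [
        table[tok].most_common(1)[0][0] if tok in table else tok
        for tok in tokens
    ]

# Tiny stand-in for a training corpus (illustrative only).
TABLE = build_baseline("faţa faţa fata pădure".split())
OUTPUT = restore_baseline(["fata", "padure", "xyz"], TABLE)
```

The baseline ignores context entirely, so every occurrence of `fata` receives the same restoration; this is the gap that tagging-based candidate selection is meant to close.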

  15. Word-based Evaluation

  16. Character-based Evaluation Evaluations in terms of characters always look much better (approx. 4 times better) than evaluations in terms of words!
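The gap between the two views follows from the arithmetic: one wrong character spoils a whole word but only one of its several characters. A small Python sketch of both measures, assuming the reference and the output are aligned token by token and use precomposed characters so that corresponding tokens have equal length (`error_rates` is a hypothetical helper, and the three-word sample is invented):

```python
def error_rates(reference, output):
    """Return (word_error_rate, char_error_rate) for aligned token lists."""
    pairs = list(zip(reference, output))
    word_errors = sum(ref != out for ref, out in pairs)
    char_errors = sum(a != b for ref, out in pairs for a, b in zip(ref, out))
    char_total = sum(len(ref) for ref in reference)
    return word_errors / len(reference), char_errors / char_total

REF = ["fata", "pădure", "mare"]
OUT = ["faţa", "pădure", "mare"]
WORD_ERR, CHAR_ERR = error_rates(REF, OUT)
```

In this toy sample one of three words is wrong, but only one of fourteen characters, so the character-level error rate is several times lower for the same output.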

  17. Implementation • Two versions: • Standalone (everything packed in one executable; rather slow for large MS Office documents) • Web service (distributed among various programs and machines; much faster) • In both versions DIAC+ may work under user supervision (like classical spell-checkers) or independently • DIAC+ generates a log file documenting each correction (initial word form, possible replacements and the one actually chosen). Optionally, the log file can include, for each replacement, the sentence in which it was made. • The system can correct a few typographical errors such as transposed, mistyped or omitted characters. • The MS spell-checker underlines all the unknown words, thus allowing the user to further inspect spelling errors which are out of reach for DIAC+.

  18. Figure 2. Diacritics recovery in Microsoft Word 2003

  19. Thank you! Q/A ?
