IKTA-27/2000
Development of a Part-of-Speech (POS) Tagging Method for Hungarian Using Machine Learning Algorithms
Project duration: July 1, 2000 - June 30, 2002
Consortium: University of Szeged, Department of Computer Science; MorphoLogic Ltd., Budapest
Coordinator: Tibor Gyimóthy, PhD, University of Szeged, e-mail: firstname.lastname@example.org
Contributors
University of Szeged:
Zoltán Alexin, e-mail: email@example.com
Károly Bibok, e-mail: firstname.lastname@example.org
János Csirik, e-mail: email@example.com
Tibor Gyimóthy (coordinator), e-mail: firstname.lastname@example.org
+ 25 linguistics students
MorphoLogic Ltd.:
Gábor Prószéky, e-mail: email@example.com
László Tihanyi, e-mail: firstname.lastname@example.org
Hungarian Part-of-Speech Tags
The word „házaitokkal” (1): noun, plural, instrumental case, with owner (2nd person, plural).
The word „fognak” (2) may have several POS tags, e.g. auxiliary verb (future tense, 3rd person, plural), noun in genitive case, or noun in dative case.
Attributes of a particular word (number, person, case, owner number, owner person, degree, …) are encoded. Words can be labeled using their possible codes.
The task of disambiguation is to segment running text into sentences and words, then assign to each word its contextually correct POS tag.
(1) with your houses
(2) (they) will do something, or something that belongs to a tooth
Motivation
• Ambiguous words cause problems in many natural language processing systems.
• Promising results have been published on part-of-speech tagging for European languages.
• Encouraging IT research and development related to the Hungarian language. POS tagging and morphological parsing are only two of the open problems. Other tasks include NP recognition (NP-chunking), shallow parsing, syntactic parsing, semantic encoding, text understanding, the co-reference problem, etc.
• Using the newest tools of Artificial Intelligence (machine learning algorithms), trying existing algorithms or designing new ones.
• Establishing a learning database for further studies.
• Developing real disambiguation software (a POS-tagger) based on the prototype elaborated in the current project; enhancing existing software versions and testing them against the learning database.
Work phases
• Work phase 1: July 1, 2000 - December 31, 2000
• Work phase 2: January 1, 2001 - December 31, 2001
• Work phase 3: January 1, 2002 - June 30, 2002
The goals of the complete project
• Establish a medium-sized (at least 1 million words), manually annotated Hungarian learning corpus that can also be used for natural language processing tasks beyond the scope of this project.
• Build a separate POS-tagger module that can be embedded in different applications and can efficiently disambiguate texts from different domains.
• Write a continuously adapting system that can follow temporal changes in the language.
• The chosen technology should be applicable to other European languages.
• The system should use a general encoding scheme that covers the properties of all European languages. The encoding should also support the representation of the rare features of Hungarian, a Finno-Ugric language.
• The solution's per-word accuracy, i.e. the ratio of the number of correctly annotated words to the number of all words, should reach at least 97-98%.
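The per-word accuracy target can be made concrete with a minimal sketch. The function name and the five-word example are hypothetical illustrations, not project data; only the tag strings are borrowed from elsewhere in these slides.

```python
def per_word_accuracy(predicted_tags, gold_tags):
    """Ratio of correctly annotated words to all words."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    # Count the positions where the tagger agrees with the manual annotation.
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)


# Hypothetical five-word example: one conjunction is tagged incorrectly.
gold = ["Tf", "Nc-sn", "Ccsp", "Pd3-sn", "Rx"]
predicted = ["Tf", "Nc-sn", "Ccsw", "Pd3-sn", "Rx"]
accuracy = per_word_accuracy(predicted, gold)  # 4 of 5 words correct
```

On a real test set the same ratio would be computed over all words of the manually disambiguated corpus and compared against the 97-98% goal.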
The Results of Work Phase 1
• The text of the 1-million-word corpus has been collected.
• Both MSD and HuMor codes are used to encode the morpho-syntactic features of the words.
• The corpus (the text augmented with annotations carrying the disambiguation information) is stored in an XML database. The TEIXLite DTD was used.
• The software needed for Work Phase 2 is ready.
• A case study was written that discusses the application of machine learning algorithms in detail.
Current State of the Project
• Collection of texts (completed)
• Setting up a lexicon containing all words and their possible encodings (completed)
• Manual disambiguation of the corpus (by the end of 2001)
• Part-of-speech tagger prototype (by summer 2002)
The Hungarian Learning Corpus
• Building a better training data set than the so-called Orwell corpus (1997)
• Conformance to the „Multext-East” Hungarian MSD specification, i.e. complete subclassing of pronouns, numerals and adverbs (which is missing from the Orwell corpus)
• Conformance to the Hungarian Academic Dictionary
Comparison of the IKTA and the Orwell corpora

               IKTA corpus                          Orwell corpus
Size           1,000,000 words (plus punctuation)   100,000 tokens (words and punctuation)
Text           Selected modern, recent texts        Single novel (special written language)
Technology     XML                                  SGML
MSD encoding   Full                                 Partial implementation of the Hungarian MSD
MSD Lexical Encoding I.
• Can be applied to represent the morpho-syntactic features of many European languages, such as English, German, French, ..., and also Hungarian, Slovak, Czech, Romanian, Russian, etc.
• Each language is an instance of the general encoding scheme.
• Codes are strings whose first character denotes the main category; the individual attributes or features are encoded by letters at fixed positions within the code string. Unused attributes are denoted by a ‘-’.
• The main categories: A - adjective, C - conjunction, D - determiner, I - interjection, M - numeral, N - noun, P - pronoun, Q - particle, R - adverb, S - adposition, T - article, V - verb, X - residual (unknown), Y - abbreviation.
MSD Lexical Encoding II.
• The code of the word ablakokban is Nc-p2: Noun, common, - (gender), plural, inessive.
• The code of the word legeslegnagyobbaké is Afe-pn------s: Adjective, type, degree, - (gender), plural, nominative, - (definiteness), - (clitic), - (animate), - (formation), - (owner number), - (owner person), owned number.
• The HuMor codes are converted to MSD by the morphological parsing software using a conversion table.
• Information that MSD does not encode (combinations of words, suffixes) is lost.
• For Hungarian there are about 10,000 different MSD codes, but only a fraction of them occur in everyday written language. A large share of the codes virtually never occurs.
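The positional encoding can be illustrated with a small decoder. This is a sketch under stated assumptions: the attribute name lists contain only the positions named on the slides for nouns and adjectives (the full Hungarian MSD specification defines more), and `decode_msd` is a hypothetical helper, not part of the project software.

```python
# Main categories as listed on the slides.
MAIN_CATEGORIES = {
    "A": "adjective", "C": "conjunction", "D": "determiner",
    "I": "interjection", "M": "numeral", "N": "noun", "P": "pronoun",
    "Q": "particle", "R": "adverb", "S": "adposition", "T": "article",
    "V": "verb", "X": "residual", "Y": "abbreviation",
}

# Attribute names per position, restricted to what the two examples show.
# Assumption: further positions and other categories are left unnamed.
ATTRIBUTE_NAMES = {
    "N": ["type", "gender", "number", "case"],
    "A": ["type", "degree", "gender", "number", "case", "definiteness",
          "clitic", "animate", "formation", "owner_number",
          "owner_person", "owned_number"],
}


def decode_msd(code):
    """Split an MSD code into its main category and its set attributes."""
    if not code or code[0] not in MAIN_CATEGORIES:
        raise ValueError(f"not an MSD code: {code!r}")
    names = ATTRIBUTE_NAMES.get(code[0], [])
    attributes = {}
    for index, letter in enumerate(code[1:]):
        if letter == "-":  # '-' marks an unused attribute
            continue
        # Fall back to a generic label where no name is known.
        name = names[index] if index < len(names) else f"position_{index + 2}"
        attributes[name] = letter
    return MAIN_CATEGORIES[code[0]], attributes


noun = decode_msd("Nc-p2")
adjective = decode_msd("Afe-pn------s")
```

The decoder returns the raw letter for each attribute; mapping letters to values (for example `2` to inessive) would need the value tables of the MSD specification.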
Some Data
• The lexicon contained 162,332 words. The preliminary morphological parsing was done by the HuMor (Hungarian Morphology) parser system; the labels were then checked individually and corrected where necessary.
• A 200,000-word part of the corpus was separated, and preliminary studies and statistics were carried out on this part.
• The proportion of ambiguous words is dramatically higher than in the Orwell corpus: 55-60% of the words proved to be ambiguous (115,542 of 202,604, i.e. 57.03%), whereas in the Orwell corpus the ratio was 25,526 of 80,708 (31.62%). Ambiguous words have 2.92 labels on average; 1,499 ambiguity-classes were found (versus 566 classes in the Orwell corpus).
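Statistics of this kind can be derived from a tokenized text and a lexicon of possible codes. The sketch below is only an illustration: `ambiguity_statistics` is a hypothetical helper, the token list is a toy example rather than a sentence from the corpus, and an ambiguity-class is taken to be the set of labels a word can carry.

```python
def ambiguity_statistics(tokens, lexicon):
    """Ratio of ambiguous words, mean labels per ambiguous word, class count."""
    if not tokens:
        raise ValueError("token list is empty")
    # A word is ambiguous when the lexicon lists more than one code for it.
    ambiguous = [word for word in tokens if len(lexicon[word]) > 1]
    # Words with the same set of codes share one ambiguity-class.
    classes = {frozenset(lexicon[word]) for word in ambiguous}
    label_total = sum(len(lexicon[word]) for word in ambiguous)
    return {
        "ambiguous_ratio": len(ambiguous) / len(tokens),
        "mean_labels": label_total / len(ambiguous) if ambiguous else 0.0,
        "ambiguity_classes": len(classes),
    }


# Toy lexicon; the code sets are the ones quoted on the slides.
lexicon = {
    "a": ["I", "Nc-sn", "Pd3-sn", "Tf"],
    "az": ["Pd3-sn", "Tf"],
    "aztán": ["Ccsp", "Ccsw", "Rx"],
    "ablakokban": ["Nc-p2"],
}
tokens = ["az", "ablakokban", "aztán", "a", "az"]  # not a real sentence
stats = ambiguity_statistics(tokens, lexicon)
```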
Hungarian POS-tagger
• It is written in Prolog.
• For each ambiguity-class*, a separate learning task is generated. The learned rule-sets are tested and refined. The rule-sets obtained by different learning tools can be combined.
• It is based on context rules of the form:
choose_C1_…_Cn(-Predicted, +CurrentWord, +LeftContext, +RightContext)
• Ambiguities within a sentence are resolved in a non-deterministic way.
* Words having the same set of morphological annotations belong to the same ambiguity-class.
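The shape of such a context rule can be sketched in Python (the tagger itself is written in Prolog). Everything specific here is an assumption made for illustration: the rule body is invented rather than learned, contexts are represented as lists of (word, candidate codes) pairs with the nearest word first, and the sketch is deterministic, so it does not model the non-deterministic resolution mentioned above.

```python
def choose_pd3sn_tf(current_word, left_context, right_context):
    """Hypothetical rule for the ambiguity-class ['Pd3-sn', 'Tf']."""
    if right_context:
        _next_word, next_codes = right_context[0]
        # Invented heuristic: before a possible noun or adjective, pick the article.
        if any(code[0] in "NA" for code in next_codes):
            return "Tf"
    return "Pd3-sn"


# One rule per ambiguity-class, keyed by the set of candidate codes.
RULES = {frozenset({"Pd3-sn", "Tf"}): choose_pd3sn_tf}


def tag_sentence(sentence):
    """Tag a list of (word, candidate_codes) pairs; None means no rule applies."""
    result = []
    for index, (word, codes) in enumerate(sentence):
        if len(codes) == 1:  # unambiguous word: nothing to choose
            result.append(codes[0])
            continue
        rule = RULES.get(frozenset(codes))
        left = sentence[:index][::-1]  # nearest neighbour first
        right = sentence[index + 1:]
        result.append(rule(word, left, right) if rule else None)
    return result


tagged = tag_sentence([("az", ["Pd3-sn", "Tf"]), ("ablakokban", ["Nc-p2"])])
```

Keying the rule table by the code set mirrors the one-learning-task-per-ambiguity-class design: a new class only needs a new entry in `RULES`.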
Software components
• Lexicon maintenance utility for the HuMor parser
• Text/XML conversion utilities
• XSLT scripts for maintaining the XML databases
• Morphological parser program (HuMor: Hungarian Morphology)
• Disambiguation program
• Learning task generator programs
• Parsers that read back the output of the learning tools and convert the results to a standardized Prolog rule-set
• Programs for maintaining rule-sets, POS-taggers, test utilities
[Figure: a sample screen of the disambiguation program]
Sample Data from the First 200,000-Word Section of the Corpus
• Number of ambiguity-classes: 1,499
• Most frequent ambiguity-class: A, a - ['I', 'Nc-sn', 'Pd3-sn', 'Tf'], although the correct label was ‘Tf’ in 12,185 of all 12,194 occurrences.
• Some other frequent classes: Az, az - ['Pd3-sn', 'Tf'] (annotated with ‘Tf’ in 5,447 cases of 6,207); Mármint, Márpedig - ['Ccsp', 'Ccsw'] (‘Ccsp’ in 3,978 of all 5,548 occurrences).
• The most problematic class was Aztán, Csakhogy - ['Ccsp', 'Ccsw', 'Rx'], where ‘Ccsp’ was correct in 2,457 cases of 4,853. With a probabilistic tagger (one that always chooses the most probable tag), this class alone causes 2,396 errors, that is, 2.08% of the approx. 100,000 ambiguous words.
• In the case of 855 classes only one label was ever chosen; in 736 cases the number of occurrences was less than 10; 430 classes occurred only once in the text.
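The error count quoted for the most-probable-tag strategy follows directly from the per-class figures. A minimal sketch, using only the counts given on this slide; `baseline_errors` is a hypothetical helper name.

```python
# (occurrences of the most frequent label, all occurrences) per class,
# as quoted on the slide. In each class the quoted label covers more
# than half of the occurrences, so it is the most probable tag.
CLASS_COUNTS = {
    "A, a": (12185, 12194),
    "Az, az": (5447, 6207),
    "Mármint, Márpedig": (3978, 5548),
    "Aztán, Csakhogy": (2457, 4853),
}


def baseline_errors(most_frequent, total):
    """Errors made by always choosing the most frequent label of a class."""
    return total - most_frequent


errors = {
    name: baseline_errors(most_frequent, total)
    for name, (most_frequent, total) in CLASS_COUNTS.items()
}
# Share of occurrences within each class that the baseline gets wrong.
error_rates = {
    name: errors[name] / total
    for name, (_most_frequent, total) in CLASS_COUNTS.items()
}
```

The comparison shows why a frequency baseline is nearly perfect for A, a but gets almost half of the Aztán, Csakhogy occurrences wrong, which is where context rules have the most to gain.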