
Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine LINA University of Nantes GSCM_LRIT University of Rabat

A multi-word term extraction program for Arabic language. LREC, 28-30 May 2008, Marrakech.





Presentation Transcript


  1. A multi-word term extraction program for Arabic language • LREC, 28-30 May 2008, Marrakech • Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine • LINA, University of Nantes; GSCM_LRIT, University of Rabat

  2. Outline • Multi-word term • Motivation • Approach • Comparing statistical methods • Conclusion and future work

  3. Terms • Refer to a defined concept ... (ISO 704). • Belong to a limited number of parts of speech: nouns, verbs, adjectives, and adverbs. • Belong to a given subject domain

  4. Multi-word terms • تتكون أكاسيد النيتروجين كناتج لجميع عمليات الاحتراق التي تتم في درجات الحرارة العالية [Wikipedia] • « Nitrogen oxides are formed as a product of all combustion processes that take place at high temperatures » • MWTs extracted • أكاسيد النيتروجين « nitrogen oxides » • درجات الحرارة العالية « high temperatures » • عمليات الاحتراق « combustion processes »

  5. Motivation • Frequent MWTs • Applications • Building indexes from unstructured documents • Enhancing document retrieval systems

  6. MWT extraction system • Concept extraction • Corpus • Identification of term candidates: linguistic filtering (shallow parsing) • Filtering of term candidates: statistical significance (LLR, FLR, MI3, t-score) • Candidate list

  7. MWT evaluation • Unithood: measures the strength of association between the constituents of a multi-word unit (MWU) • United Nations [environment domain]: unithood • Termhood: measures relatedness to existing domain-specific concepts • Soil degradation [environment domain]: termhood and unithood

  8. MWT patterns

  9. MWT variations • Multiple forms for the same concept • Variation types • Inflexional morphology • Number • N1 N2 / N1 N2 + suffix (ات, ون) • تلوث المحيط « ocean pollution » • تلوث المحيطات « oceans pollution » • Definite form • N Adj / prefix (ال) + N prefix (ال) + Adj • تلوث كيميائي « chemical pollution » • التلوث الكيميائي « the chemical pollution » • Derivational morphosyntactic phenomena • N1 ADJ / N1 PREP N2 • بئر نفطي => بئر من النفط « oil well » • Syntactic variation (post-modification) • N1 N2 / N1 N2 ADJ • درجة الحرارة « degree of temperature » • درجة الحرارة العالية « high degree of temperature »
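The inflexional variations on this slide (number suffixes and the definite article) can be conflated with very light normalisation. The following is a minimal sketch for illustration only, not the authors' system: the names `normalize_word` and `normalize_term` and the minimum stem length are assumptions; only the suffixes ات / ون and the prefix ال come from the slide.

```python
# Hypothetical light normaliser for MWT variants (illustration only).
DEFINITE_PREFIX = "\u0627\u0644"                      # the article "al" (alef + lam)
PLURAL_SUFFIXES = ("\u0627\u062A", "\u0648\u0646")    # the two plural suffixes on the slide
MIN_STEM = 3  # assumed guard so that short words are not over-stripped


def normalize_word(word: str) -> str:
    """Strip the definite article and one plural suffix, if the stem stays long enough."""
    if word.startswith(DEFINITE_PREFIX) and len(word) - len(DEFINITE_PREFIX) >= MIN_STEM:
        word = word[len(DEFINITE_PREFIX):]
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[: -len(suffix)]
            break
    return word


def normalize_term(term: str) -> str:
    """Normalise every word of a whitespace-separated multi-word term."""
    return " ".join(normalize_word(w) for w in term.split())
```

With this sketch, the definite and indefinite forms of "chemical pollution" map to the same key, as do the singular and plural forms of "ocean pollution". Affix stripping this crude will also merge unrelated words, so it is only a starting point for variant grouping.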

  10. Comparing statistical filtering • Mutual Information (MI3) (Daille, 1994) as baseline • Log-likelihood (Dunning, 1993) • t-score (Church, 1991) • FLR (Nakagawa and Mori, 2003)
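Three of the measures compared here can be computed from the same bigram counts. The sketch below uses common textbook formulations, which may differ in detail from the ones used in the presented system: the function and argument names are assumptions, MI3 is written in its count form (variants that include the corpus size rank candidates identically for a fixed corpus), and FLR is left out.

```python
import math


def association_scores(joint, fx, fy, n):
    """Score a two-word candidate (x, y).

    joint: frequency of the pair; fx, fy: frequencies of x and y;
    n: total number of pairs in the corpus. Requires joint > 0.
    """
    expected = fx * fy / n  # expected pair frequency under independence

    # Cubic mutual information (count form, assumed formulation).
    mi3 = math.log2(joint ** 3 / (fx * fy))

    # t-score: observed minus expected, scaled by sqrt(observed).
    t_score = (joint - expected) / math.sqrt(joint)

    # Log-likelihood ratio over the 2x2 contingency table.
    observed = [joint, fx - joint, fy - joint, n - fx - fy + joint]
    rows = [fx, fx, n - fx, n - fx]
    cols = [fy, n - fy, fy, n - fy]
    llr = 2.0 * sum(
        o * math.log(o * n / (r * c))
        for o, r, c in zip(observed, rows, cols)
        if o > 0  # empty cells contribute nothing
    )
    return {"mi3": mi3, "t": t_score, "llr": llr}
```

When the pair occurs exactly as often as independence predicts, the t-score and the log-likelihood ratio are both zero; the more the observed count exceeds the expected one, the higher the candidate ranks.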

  11. Experiment Data • Arabic domain-specific corpus on the environment • Compiled from the web: “Al-Khat Alakhdar” and “Akhbar Albiae”, 2004-2006 • 475,148 words • Motivation • Lack of available Arabic domain-specific corpora

  12. Gold standard • Reference list • Arabic environment terminology: Agrovoc • Total: 65,000 unique known terms (single-word terms and MWTs) • Dynamic search • Eurodicautom

  13. Preprocessing • Removing diacritics • Buckwalter’s transliteration • Diab’s parsing (Diab, 2004) • Input • wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw. • Output • w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC
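The preprocessing steps on this slide can be sketched as follows. This is an assumption-laden illustration, not the pipeline that was actually used: the function names are hypothetical, the transliteration table covers only the basic Arabic letters of the Buckwalter scheme, and the tagger itself is not reproduced; only its `token/TAG` output format is parsed.

```python
import re

# Short vowels and related marks (U+064B..U+0652), dagger alef, and tatweel.
DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Basic Buckwalter table: U+0621..U+063A and U+0641..U+064A, in code-point order.
_ARABIC = [chr(c) for c in range(0x0621, 0x063B)] + [chr(c) for c in range(0x0641, 0x064B)]
_LATIN = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "fqklmnhwYy"
BUCKWALTER = str.maketrans(dict(zip(_ARABIC, _LATIN)))


def strip_diacritics(text: str) -> str:
    """Remove vowel marks and elongation characters."""
    return DIACRITICS.sub("", text)


def to_buckwalter(text: str) -> str:
    """Transliterate Arabic letters; other characters pass through unchanged."""
    return text.translate(BUCKWALTER)


def parse_tagged(line: str) -> list:
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in line.split()]
```

Splitting on the last slash keeps tokens such as `./PUNC` intact, and the resulting (token, tag) pairs are the natural input for a pattern-based candidate filter.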

  14. Evaluation and results • For each association score • Examine the top-ranked candidate terms • Compute precision (termhood) over the first 100 candidate terms • Precision (termhood) is the number of attested MWTs divided by the number of extracted sequences. • Log-likelihood is the best-performing measure
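The precision figure described above can be computed as in the following sketch. The function name and the toy data in the usage note are assumptions; in the experiment the reference list would be the gold standard and the cutoff 100.

```python
def precision_at_k(ranked_candidates, reference_terms, k=100):
    """Share of the top-k ranked candidates that are attested in the reference list."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    attested = sum(1 for candidate in top if candidate in reference_terms)
    return attested / len(top)
```

For a ranked list of four candidates of which two appear in the reference set, `precision_at_k(ranked, reference, k=4)` gives 0.5. Candidates missing from the reference list count as errors even when they are valid terms, so the figure is a lower bound on true precision.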

  15. Summary & future work • Developed MWT extraction for Arabic • Defined MWT patterns and variations • Obtained better results than for European languages • Improvement of the system • Adding new variations • Improving lemmatisation

  16. Introduction • MWTs are sufficiently informative to help human readers get a feel for the essential topics • Used in many text-related applications: • Text clustering • Document similarity • Document summarization

  17. Related Work • Linguistic Approach • Based on linguistic pre-processing and annotations (result of taggers, shallow parsers) • Detect recurrent syntactic term formation patterns • Noun + Noun • (Adj | Noun) + Noun

  18. Systems based on linguistics • Ananiadou, S. (1994) recognises single-word terms from the domain of immunology based on morphological analysis of term formation patterns (internal term make-up) • Justeson & Katz (1995, TERMS) extract complex terms based on two characteristics that distinguish them from non-terms • the syntactic patterns are restricted • terms appear in the same form throughout the text; omissions of modifiers are avoided

  19. Systems based on linguistics • The text is tagged; a filter is applied to extract terms: ((A|N)+ | ((A|N)* (N P)?) (A|N)*) N • Resulting patterns: AN / NN / AAN / ANN / NAN / NNN / NPN • Filtering based on simple POS patterns • A pattern must occur above a certain threshold to be considered a valid term pattern. • Recall: 71%; precision: 71% to 96% • LEXTER (Bourigault, 1994) • Extracts French compound terms based on surface syntactic analysis and text heuristics • Terms are identified according to certain syntactic patterns
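The part-of-speech filter quoted on this slide can be applied by encoding each token as one letter and running the pattern as a regular expression. The sketch below is an illustration under assumptions: the Penn-style tag mapping, the function names, and the two-word minimum are not from the slides, and the pattern is the English, head-final one quoted above (Arabic patterns such as N ADJ or N1 PREP N2 would need a different expression).

```python
import re

# The filter from the slide, with one letter per token: A = adjective,
# N = noun, P = preposition.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")


def coarse_tag(tag: str) -> str:
    """Map a Penn-style tag to A, N, P, or X (anything else). Assumed mapping."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("JJ"):
        return "A"
    if tag == "IN":
        return "P"
    return "X"


def candidate_terms(tagged, min_len=2):
    """Return token sequences whose tags match the pattern and span at least min_len tokens."""
    letters = "".join(coarse_tag(tag) for _, tag in tagged)
    candidates = []
    for match in TERM_PATTERN.finditer(letters):
        if match.end() - match.start() >= min_len:
            tokens = [tok for tok, _ in tagged[match.start():match.end()]]
            candidates.append(" ".join(tokens))
    return candidates
```

Because every token maps to exactly one letter, match offsets in the tag string are also token offsets, which keeps the lookup of the matched words trivial. A frequency threshold, as mentioned on the slide, would be applied to the returned candidates afterwards.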

  20. LEXTER (continued) • Uses a boundary method to identify the extent of terms • Categories or sequences of categories that are not found in term patterns form the boundaries, e.g. verbs, or any preposition (except de and à) followed by a determiner. Non-productive sequences become boundaries. • Precision: 95%, although tests have shown that a lot of noise is generated

  21. Approaches using statistical information • Main measures used: • Frequency of occurrence • Mutual Information • C/NC-value • Experiments also with the log-likelihood coefficient [Dunning, 1993]

  22. Frequency of occurrence • Simplest and most popular method • Domain-independent; requires no external resources • Some filtering is applied in the form of syntactic patterns • Systems using frequency of occurrence • Dagan & Church (TERMIGHT, 1994) • Enguehard & Pantera (1994) • Lauriston (TERMINO, 1996)

  23. Mutual Information • ‘The amount of information provided by the occurrence of the event represented by y_i about the occurrence of the event represented by x_k is defined as’ $I(x_k, y_i) \equiv \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$, Fano (1961:27-28) • This measure expresses how much one word tells us about the other. • Problems for MI come from data sparseness • Damerau (1993) and Daille (1994) used MI for the extraction of candidate terms (only for two-word candidate terms)
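The data-sparseness problem can be made concrete with the definition above. In this sketch the probabilities are maximum-likelihood estimates from counts; the function name and the counts are invented purely for illustration.

```python
import math


def pmi(joint, fx, fy, n):
    """log2( P(x,y) / (P(x) P(y)) ) estimated from raw counts."""
    return math.log2((joint / n) / ((fx / n) * (fy / n)))


N = 1_000_000  # hypothetical corpus size

# Two words that each occur once, and occur together that one time.
rare = pmi(1, 1, 1, N)

# A pair seen 1000 times, each word occurring 2000 times.
frequent = pmi(1000, 2000, 2000, N)
```

The single co-occurrence of two one-off words scores far higher than the well-attested pair, even though one observation is weak evidence. This is why frequency-sensitive variants such as MI3, or measures such as the log-likelihood ratio, are often preferred for ranking term candidates.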

  24. C/NC-value (Frantzi & Ananiadou) • C-value depends on: • total frequency of occurrence of the string in the corpus • frequency of the string as part of longer candidate terms • number of these longer candidate terms • length of the string (in number of words)
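The four ingredients listed above combine as in the following sketch, which follows the commonly cited C-value formulation; the slide itself does not give the formula, so treat the exact form, the function name, and the argument names as assumptions.

```python
import math


def c_value(length, freq, nested_in_freqs=()):
    """C-value of a candidate string.

    length: number of words in the candidate (2 or more).
    freq: total frequency of the candidate in the corpus.
    nested_in_freqs: frequencies of the longer candidate terms that contain it.
    """
    weight = math.log2(length)
    if not nested_in_freqs:
        # Not nested anywhere: length-weighted frequency.
        return weight * freq
    # Nested: discount by the average frequency of the containing terms.
    nested_mean = sum(nested_in_freqs) / len(nested_in_freqs)
    return weight * (freq - nested_mean)
```

A two-word candidate seen 10 times scores 10 when it never appears inside a longer candidate, and 7 when it is nested in two longer candidates with frequencies 4 and 2. The discount penalises strings that mostly occur as fragments of longer terms.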

  25. NC-value • NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a), where a is the candidate term, C-value(a) is the C-value of a, and CF(a) is the context factor of a • CF(a) is obtained by summing the weights of the term context words of a, each multiplied by its frequency of appearing with a.
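The combination on this slide is a weighted sum and can be sketched directly. The function names, the input layout, and the example numbers are assumptions; the weights of the context words are taken as given rather than derived.

```python
def context_factor(context):
    """Sum of weight * co-occurrence frequency over the term context words.

    context maps each context word to a (weight, frequency) pair.
    """
    return sum(weight * freq for weight, freq in context.values())


def nc_value(c_val, context):
    """NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a), as given on the slide."""
    return 0.8 * c_val + 0.2 * context_factor(context)
```

For a candidate with C-value 7.0 and two context words weighted 0.5 and 0.25 that co-occur with it 4 and 2 times, the context factor is 2.5 and the NC-value is 0.8 * 7.0 + 0.2 * 2.5 = 6.1.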

  26. Hybrid approaches • Combination of linguistic information (filters), shallow parsing results and statistical measures • Daille, B., Frantzi & Ananiadou

  27. Thank You
