Rule-based approach in Arabic NLP: Tools, Systems and Resources

CITALA'09. Rule-based approach in Arabic NLP: Tools, Systems and Resources. Dr Khaled Shaalan Professor, Faculty of Computers & Information, Cairo University On Secondment to BUiD, UAE [email protected]{buid.ac.ae, gmail.com}. Agenda. Objective Language Tasks

CITALA'09

Rule-based approach in Arabic NLP: Tools, Systems and Resources

Dr Khaled Shaalan

Professor, Faculty of Computers & Information, Cairo University

On Secondment to BUiD, UAE

[email protected]{buid.ac.ae, gmail.com}

CITALA2009 - Morocco

Agenda
  • Objective
  • Language Tasks
  • NLP Approaches
  • Rule-based Arabic Analysis and generation tools
  • Rule-based Arabic NLP applications
  • Some Arabic NLP Free Resources
  • Major and Arabic mailing lists
  • Conclusion
Objective
  • To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.
Separating Language Tasks
  • English vs. French vs. Arabic vs. ...
  • Spoken language (dialogue) vs. written text vs. handwritten script
  • Genuine script vs. transliterated (Romanized) script
  • Vocalized (vowelized) vs. non-vocalized
  • Understanding vs. generation
  • First language learner vs. second language learner
  • Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)
  • Stem-based vs root-based
Rules
  • Situation/Action
    • If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS)
    • If match(stem.definiteness, indefinite) then morph_gen(stem.definiteness, Stem_FS)
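
The two situation/action rules above can be sketched in Python as condition/action functions applied to a feature structure. This is only an illustration under assumptions: the dictionary layout of the feature structure, the function names, and the `generate` list that stands in for `morph_gen` are hypothetical and do not come from the slides.

```python
# Hypothetical feature structure: a plain dict describing one stem.
DEF_ARTICLE = "ال"

def rule_remove_article(stem_fs):
    """Situation: the prefix is the definite article. Action: remove it."""
    if stem_fs.get("prefix") == DEF_ARTICLE:
        stem_fs = {k: v for k, v in stem_fs.items() if k != "prefix"}
    return stem_fs

def rule_generate_indefinite(stem_fs):
    """Situation: the stem is indefinite. Action: queue the feature for generation."""
    if stem_fs.get("definiteness") == "indefinite":
        queued = stem_fs.get("generate", []) + ["definiteness"]
        stem_fs = {**stem_fs, "generate": queued}
    return stem_fs

def apply_rules(stem_fs, rules):
    # Each rule is tried once, in order; there is no conflict resolution step.
    for rule in rules:
        stem_fs = rule(stem_fs)
    return stem_fs

RULES = [rule_remove_article, rule_generate_indefinite]
fs = apply_rules({"stem": "كتاب", "prefix": "ال", "definiteness": "indefinite"}, RULES)
```

Each rule is independent of the others, which is what distinguishes this style from the expert-system cycle described next.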
Common Mistake
  • The rule-based approach is not the same as a rule-based expert system!
  • Both consist of rules.
  • A rule-based expert system solves the problem through a Recognize-Act Cycle:
    • Loop
    • Conflict resolution strategy
Recognize-Act Cycle

[Figure: Recognize-Act Cycle. Domain knowledge (a rule base holding rules 1..n, plus new rules) and working memory (a fact base, plus new facts) feed a loop of Match, Conflict Resolution, and Execute.]

loop

  • Match: Rules are compared to working memory to determine matches. if no rule matches then stop
  • Conflict Resolution: Select or enable a single rule for execution
  • Execute: Fire the selected rule
    • Add new fact, or
    • Learn a new rule

end loop
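
The loop above can be sketched as a small forward-chaining interpreter. This is a minimal sketch under assumptions: rules are represented as hypothetical `(name, premises, conclusion)` tuples, conflict resolution simply picks the first matching rule, and only the "add new fact" branch of Execute is covered (learning new rules is left out).

```python
def recognize_act(rules, facts, max_cycles=100):
    """rules: list of (name, premises, conclusion); facts: iterable of known facts."""
    facts = set(facts)
    for _ in range(max_cycles):
        # Match: rules whose premises all hold and whose conclusion is still new.
        conflict_set = [r for r in rules
                        if set(r[1]) <= facts and r[2] not in facts]
        if not conflict_set:
            break  # no rule matches: stop
        # Conflict resolution: select a single rule (here, the first one listed).
        _, _, conclusion = conflict_set[0]
        # Execute: fire the selected rule by adding its conclusion as a new fact.
        facts.add(conclusion)
    return facts

# Hypothetical rules and facts, for illustration only.
RULES = [
    ("r1", ["has_definite_article"], "is_definite"),
    ("r2", ["is_definite", "is_noun"], "can_start_nominal_sentence"),
]
result = recognize_act(RULES, {"has_definite_article", "is_noun"})
```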

NLP Approaches
  • Rule-based
  • Statistical-based
NLP Approaches (1)
  • Rule-based
    • Relies on hand-constructed rules acquired from language specialists
    • Requires only a small amount of training data
    • Development can be very time-consuming
  • Statistical-based
    • Developers do not need language specialists' expertise
    • Requires a large amount of annotated training data (very large corpora)
    • Automated

NLP Approaches (2)
  • Rule-based
    • Some changes may be hard to accommodate
    • Not easy to obtain high coverage of the linguistic knowledge
    • Useful for limited domains
    • Can be used with both well-formed and ill-formed input
    • High quality, based on solid linguistics
  • Statistical-based
    • Some changes may require re-annotation of the entire training corpus
    • Coverage depends on the training data
    • Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
    • Lower quality: does not explicitly deal with syntax

Rule-based Arabic NLP tools
  • Morphological Analyzers
  • Morphological Generators
  • Syntactic Analyzers
  • Syntactic Generators
Morphological Analysis
  • Break down the inflected Arabic word into a root/stem, affixes, and features.
  • Example: sa- ‘uEty- kumA (ﺳﺄﻋﻂﯾﻜﻤﺎ) - ‘will I give you…’
Rules - Augmented Transition Network (ATN) technique
  • Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and inflections.
  • More than one rule may be associated with one arc.
  • Conditions associated with the arcs are placed in such a way that the arc to be traversed first is the one that leads to the most probable solution.
Types of Rules
  • Remove Prefix or Suffix
  • Remove doubled letter
  • Add/change Hamza, Weak letter,…
Analysis of the verb "شاهدتك" (I saw you): Remove suffixes

[Figure: ATN over states S0, S1, S2, S3, S10. شاهدتك → (last1 = “ك”) → شاهدت → (last2 = “ت”) → شاهد]

  • stem: "شاهد" (saw)
  • perfect
  • 1st person sg pronoun: "ت"
  • 2nd person sg pronoun "ك"
Analysis of the verb “يلعبون” (they are playing): Remove prefix & suffix

[Figure: ATN over states S0, S1, S2, S3, S10. يلعبون → (Begin2 = “ي”) → لعبون → (last2 = “ون”) → لعب]

  • stem: “لعب" (played)
  • imperfect
  • Plural subject
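
The two analyses above can be approximated with a simple affix-stripping loop. This is a sketch, not the ATN technique itself: the affix tables and the stem list are tiny hypothetical stand-ins for a real lexicon, and arcs, states, and rule ordering by probability are not modeled.

```python
# Hypothetical mini-tables; a real analyzer would use a full lexicon and an ATN.
PREFIXES = [("ي", "imperfect")]
SUFFIXES = [
    ("ك", "2nd person sg pronoun"),
    ("ون", "plural subject"),
    ("ت", "1st person sg pronoun"),
]
STEMS = {"شاهد", "لعب"}

def analyze(word):
    """Return (stem, features), or None if no known stem is reached."""
    rest, features = word, []
    for prefix, feature in PREFIXES:
        if rest not in STEMS and rest.startswith(prefix):
            rest = rest[len(prefix):]
            features.append(feature)
            break
    if not features:
        features.append("perfect")  # no imperfect prefix was removed
    stripped = True
    while rest not in STEMS and stripped:
        stripped = False
        for suffix, feature in SUFFIXES:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                rest = rest[:-len(suffix)]
                features.append(feature)
                stripped = True
                break
    return (rest, features) if rest in STEMS else None
```

Calling `analyze("شاهدتك")` removes “ك” and then “ت”, mirroring the last1/last2 steps in the first figure.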
Issues in the morphological analysis
  • Overgeneration (too many outputs)
  • Ambiguity
  • Reconstruction of vowels
  • Multiword/compound expressions
  • Out-of-Vocabulary (OOV)
  • Handling ill-formed input
    • Detection (spell checking)
    • Correction by relaxation, e.g., accepting “ه” written instead of “ة”
  • Preventing ill-formed output
    • Check prefix compatibility (e.g., the prefix “ف” cannot come after the prefix “ب” or “ك”).
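
The relaxation and compatibility checks in the list above could look roughly like this. The table and function names are hypothetical, and the lexicon is passed in as a plain set for illustration.

```python
# Hypothetical table: for each prefix, the prefixes that may not directly follow it.
INCOMPATIBLE_AFTER = {"ب": {"ف"}, "ك": {"ف"}}

def prefixes_compatible(prefixes):
    """Return False if any prefix directly follows one it cannot come after."""
    return all(nxt not in INCOMPATIBLE_AFTER.get(prev, set())
               for prev, nxt in zip(prefixes, prefixes[1:]))

def relax_final_ha(word, lexicon):
    """If an unknown word ends in ه, try the same word ending in ة instead."""
    if word not in lexicon and word.endswith("ه"):
        candidate = word[:-1] + "ة"
        if candidate in lexicon:
            return candidate
    return word
```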
Morphological generation
  • Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include:
    • definiteness (definite article “ال”),
    • gender (masculine, feminine),
    • number (singular, dual, plural),
    • case (nominative, genitive, accusative,…),
    • person (first, second, third)
Types of Rules
  • Synthesis of inflected:
    • Noun
    • Verb
    • particle
Synthesis of inflected Nouns
  • definite noun
  • feminine noun
  • pluralize noun
  • dual noun
  • attach a prefix preposition
  • attach a suffix pronoun
  • end case
  • ….
Synthesis of feminine noun
  • If noun.gender = masculine Then attach suffix feminine letter
  • Example:
    • “زوج” (husband) → “زوجة” (wife)
Synthesis of suffix pronoun
  • If pronoun.person = first and pronoun.number = singular Then attach first person singular suffix pronoun
  • Example:
    • “زوجة” (wife) → “زوجتي” (my wife)
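
The feminine and suffix-pronoun rules above can be sketched as two small string functions. The function names and the argument layout are assumptions, and only the first person singular pronoun is covered.

```python
TA_MARBUTA = "ة"

def feminine(noun):
    """Attach the feminine suffix letter to a masculine noun."""
    return noun + TA_MARBUTA

def attach_suffix_pronoun(noun, person, number):
    """Attach a suffix pronoun; only the first person singular is sketched."""
    if (person, number) != ("first", "singular"):
        raise NotImplementedError("only the first person singular is sketched")
    # Orthographic adjustment: a final ة is written ت before a suffix.
    if noun.endswith(TA_MARBUTA):
        noun = noun[:-1] + "ت"
    return noun + "ي"
```

Chaining the two calls on “زوج” reproduces both examples from the slides.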
Synthesis of inflected Verbs (very complex: rich in form and meaning)
  • conjugate a verb with tense
  • conjugate a verb with number
  • conjugate a verb with prefix pronoun
  • conjugate a verb with suffix pronoun
  • ….
Rule: synthesize first person plural of assimilated verbs

Input: first person singular past verb

Output: inflected verb

Example: نصل- سنصل - وصلنا

If verb.tense = future
    then remove first weak & attach_prefix("سن")
else if verb.tense = present
    then remove first weak & attach_prefix("ن")
else attach_suffix(verb.stem, "نا")
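
The rule above translates almost line by line into Python. As an assumption, the sketch takes the bare past stem (for example وصل) as input and treats any initial ا, و, or ي as the weak letter to drop; the function name is hypothetical.

```python
WEAK_LETTERS = "اوي"

def first_person_plural(stem, tense):
    """stem: past stem of an assimilated verb, e.g. وصل; tense: future/present/past."""
    def drop_first_weak(s):
        return s[1:] if s and s[0] in WEAK_LETTERS else s

    if tense == "future":
        return "سن" + drop_first_weak(stem)
    if tense == "present":
        return "ن" + drop_first_weak(stem)
    return stem + "نا"  # past tense keeps the weak letter
```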

Issues in the morphological generation
  • Multiword/compound expressions
  • Out-of-Vocabulary (OOV)
  • Some forms need special handling:
    • Substitution: This man → هذا الرجل
    • Literal numbers (complex nouns)
    • Arabic script
      • ‘ل’ + ‘ال’ → ‘للـ’
      • “زملاء” + “ي” → ‘زملاءي’ → ‘زملائي’
      • “غرفة” → “غرفتان”
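
The three Arabic-script adjustments listed above can be sketched as string rewrites. The function names are hypothetical, and further cases (for instance, nouns whose stem itself begins with ل after the article) are not handled.

```python
def attach_preposition_l(noun):
    """ل + ال: the alif of the article is dropped, giving لل."""
    if noun.startswith("ال"):
        return "لل" + noun[2:]
    return "ل" + noun

def attach_my(noun):
    """A final hamza on the line (ء) moves to a ya seat (ئ) before the suffix ي."""
    if noun.endswith("ء"):
        noun = noun[:-1] + "ئ"
    return noun + "ي"

def dual(noun):
    """A final ة is written ت before the dual ending ان."""
    if noun.endswith("ة"):
        noun = noun[:-1] + "ت"
    return noun + "ان"
```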
Types of Rules
  • Grammatical rules:
    • Describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence.
  • Parsing
    • Accepts the input and generates the sentence structure (parse tree)
Parsing of the sentence “الطالبة مجتهدة”: The student (sg, f) is diligent (sg, f)

[Figure: parse tree. الطالبة is a noun (definite, fem, sg) acting as the inchoative; مجتهدة is a noun (indefinite, fem, sg) acting as the enunciative; together they form a nominal sentence.]

  • Agreement:
    • Number
    • Gender

Nominal sentence -> definite_inchoative(Number, Gender) indefinite_enunciative(Number, Gender)
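
The grammar rule above, with its shared Number and Gender variables, can be sketched as a check over two annotated nouns. The `Noun` class, its field values, and the tuple-shaped parse tree are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Noun:
    form: str
    definite: bool
    gender: str  # "fem" or "masc"
    number: str  # "sg", "dual" or "pl"

def parse_nominal_sentence(inchoative, enunciative):
    """Return a parse tree as nested tuples, or None if the rule does not apply."""
    if not inchoative.definite or enunciative.definite:
        return None  # the rule needs a definite inchoative and an indefinite enunciative
    if (inchoative.number, inchoative.gender) != (enunciative.number, enunciative.gender):
        return None  # number/gender agreement is violated
    return ("nominal_sentence",
            ("inchoative", inchoative.form),
            ("enunciative", enunciative.form))

student = Noun("الطالبة", True, "fem", "sg")
diligent = Noun("مجتهدة", False, "fem", "sg")
tree = parse_nominal_sentence(student, diligent)
```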

Issues in the syntactic analysis
  • Ambiguity (more than one parse tree)
    • Disambiguation techniques
  • Handling ill-formed input
    • Detection (grammar checking)
    • Recovery (partial parsing: the partial parses are chunks to be related)
Types of Rules
  • Determine phrase structures
  • Determine syntactic structure
  • Ensure the agreement relations between various elements in the sentence.
Rule: verb-subject agreement

Input: verb and inflected subject (a pre-verbal NP )

Output: inflected verb agreed with its inflected subject

synthesize_verb(Subject.number,verb.stem)

synthesize_verb(Subject.gender,verb.stem)

An agreement example:

الأولاد زاروا خمس متاحف قديمة

the-boys visited-they five museum old

The boys visited five old museums

[Figure: agreement links in the sentence. Verb-subject: number and gender (N, G); number-counted noun: gender (G); adjective-noun: gender (G).]
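
Verb-subject agreement with a pre-verbal subject, as in the example above, can be sketched with a small suffix table. The table is a hypothetical fragment covering only a few perfect-tense third-person forms, and stem changes (such as those hollow verbs undergo before some suffixes) are not handled.

```python
# Hypothetical fragment: perfect-tense third-person subject suffixes.
PERFECT_SUFFIX = {
    ("sg", "masc"): "",
    ("sg", "fem"): "ت",
    ("pl", "masc"): "وا",
}

def synthesize_verb(stem, number, gender):
    """Inflect a perfect-tense stem to agree with its pre-verbal subject."""
    return stem + PERFECT_SUFFIX[(number, gender)]
```

For the plural masculine subject الأولاد, `synthesize_verb("زار", "pl", "masc")` yields the verb form used in the example sentence.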

Issues in the syntactic generation
  • Word order (VSO,SVO, etc.)
  • Agreement (full/partial)
  • dropping the subject pronoun (called Pro-drop), i.e., to have a null subject, when the inflected verb includes subject affixes.
  • Syntax that captures the source/intended meaning
    • My son is 8 = ابني عمره ثماني سنوات
    • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP applications
  • Named Entity Recognition
  • Machine translation
  • Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
What is entity recognition?
  • Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies.
  • Makes unstructured data more structured
Politics of Ukraine

In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies.

[Figure: the text above is fed to an Entity Extractor, which outputs entities of type Person, Date, and Location.]

Person Entity Recognition (1)

Example: ‘الملك الأردني عبدالله الثاني’ The Jordanian king Abdullah II

  • We want a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.
Person Entity Recognition (2)

The rule components of this example:

  • Name entity: عبدالله [Abdullah]
  • Indicator pattern:
    • an honorific such as "الملك" [The king]
    • Nasab: (optional) inflected from a location name "الأردني" [Jordanian].
  • The rule also matches an optional ordinal number appearing at the end of some names, such as "الثاني" [II].
Person Entity Recognition (3)

((honorific+(location(ية|ي))?)+first_Name(last_Name)?+(number)?)

  • This (regular expression) rule can recognize:
    • الملك عبدالله
    • الملك الأردني عبدالله
    • الملك الأردني عبدالله الثاني
    • الملكة الأردنية رانيا
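
The rule above can be approximated with Python's `re` module. This is a sketch under assumptions: the word lists are tiny hypothetical gazetteers, the named groups are illustrative, and whitespace between components is made explicit.

```python
import re

# Hypothetical mini-gazetteers; a real system would load much larger lists.
HONORIFICS = ["الملك", "الملكة"]
LOCATIONS = ["الأردن"]
FIRST_NAMES = ["عبدالله", "رانيا"]
LAST_NAMES = ["الحسين"]
ORDINALS = ["الثاني"]

def alt(words):
    # Longest alternatives first, so that الملكة is tried before الملك.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))

PERSON = re.compile(
    rf"(?P<honorific>{alt(HONORIFICS)})"
    rf"(?:\s+(?P<nisba>(?:{alt(LOCATIONS)})(?:ية|ي)))?"
    rf"\s+(?P<first>{alt(FIRST_NAMES)})"
    rf"(?:\s+(?P<last>{alt(LAST_NAMES)}))?"
    rf"(?:\s+(?P<ordinal>{alt(ORDINALS)}))?"
)

def find_person(text):
    """Return the matched components as a dict, or None if nothing matches."""
    match = PERSON.search(text)
    return match.groupdict() if match else None
```

Optional components that are absent come back as `None` in the result, so a caller can tell which parts of the indicator pattern fired.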
Issues in the Arabic NER
  • Complex Morphological System (inflections)
  • Non-casing language (No initial capital for proper nouns)
  • Non-standardization and inconsistency in Arabic written text (typos, and spelling variants)
  • Ambiguity
Machine Translation
  • Direct
  • Transfer
  • Interlingua
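MT Approaches: MT Pyramid

[Figure: MT pyramid. Analysis climbs the source side from source word to source syntax; generation descends the target side from target syntax to target word. Direct translation links the word level, Transfer links the syntax level, and Interlingua sits at the apex.]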

English-to-Arabic Transfer based Approach

[Figure: processing pipeline.]
  • Source sentence (English)
  • Sentence Analysis: morphological & syntactic analysis, using the rules of English and an English dictionary → English parse tree
  • Transfer: English-to-Arabic transformation rules and a bilingual dictionary → Arabic parse tree
  • Sentence Synthesis: morphological generation & synthesis rules of Arabic and an Arabic dictionary → target sentence (Arabic)

Transfer approach
  • Involves analysis, transfer, and generation components
  • If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component
Simple Transfer

(1) [w_i:$1, w_{i+1}:$2, …, w_k:$k] (1 ≤ i ≤ k)

→ [w_k:$k, w_{k-1}:$k-1, …, w_i:$i] (1 ≤ i ≤ k)

[Figure: transfer between parse trees. English np: networks (noun, pl), performance (noun, sg), evaluation (noun, sg). Arabic np: تقييم (noun, sg), أداء (noun, sg), شبكة (noun, pl).]

Networks performance evaluation → تقييم أداء شبكة
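
The simple transfer step for this noun phrase amounts to reversing the constituent order and looking each word up in a bilingual dictionary. The sketch below assumes a hypothetical three-entry dictionary and carries the number feature along unchanged, leaving inflection to the morphological generator.

```python
# Hypothetical bilingual dictionary: English lemma -> Arabic lemma.
BILINGUAL = {"network": "شبكة", "performance": "أداء", "evaluation": "تقييم"}

def transfer_np(english_nouns):
    """english_nouns: list of (lemma, number) pairs in English word order."""
    # Simple transfer: reverse the constituent order, then map each lemma.
    return [(BILINGUAL[lemma], number)
            for lemma, number in reversed(english_nouns)]

arabic = transfer_np([("network", "pl"), ("performance", "sg"), ("evaluation", "sg")])
```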

Issues in the Transfer-based MT approach
  • Synonyms of a word
    • Acquisition → “اكتساب” or “استخلاص”.
  • Agreement
    • intelligent tutoring systems → “نظم التعليم الذكية” or “نظم التعليم الذكي”
  • Problems with prepositions
    • did you do fungal analysis? → “هل قمت بـتحليل الفطر؟”

Interlingua MT – Multilingual translation
  • Interlingua = Semantic Representation
  • Deep analysis
    • No need for a transfer component
    • Only analysis and generation components
  • Add Arabic analyzer to translate to other languages
  • Add Arabic generator to translate from other languages
Analysis of Arabic to Interlingua

[Figure: analysis pipeline.]
  • Input sentence: العميل: أنا أرغب في حجز غرفة في الفندق (The client: I would like to reserve a room in the hotel)
  • Preprocessor
  • Sentence Analyzer: Parser (Arabic Grammar Rules, Arabic Lexicon) and Morphological Analyzer (Arabic Morphology Rules) → Parse Tree
  • Mapper (Map Lexicon, Ontology) → Interlingua (IF)

Interlingua (IF):

c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

Generating Arabic from Interlingua

[Figure: generation pipeline.]
  • Input: Interlingua (IF)
  • Mapper (Map Lexicon, Ontology, Map Rules) → Feature Structure
  • Sentence Generator: Generator (Arabic Grammar Rules, Arabic Lexicon) and Morphological Generator (Arabic Morphology Rules)
  • Output sentence: العميل: أنا أرغب في حجز غرفة في الفندق (The client: I would like to reserve a room in the hotel)

Interlingua (IF):

c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

Issues in the interlingua approach
  • Interlingua:
    • language-neutral representation
    • captures the intended meaning of the source sentence
  • Requires a fully-disambiguating parser
Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
  • Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.
  • Facilitate the communication with colloquial Arabic speakers
  • Restore the Arabic dialect to the standard language in use nowadays.
A one-to-one transfer example

امتي؟ → (mapping) → متي؟ ‘when?’

A one-to-many transfer example

عال ‘on-the’ → (mapping) → علي ‘on’ + ال ‘the’

A complete sentence example

جيت امتي؟ ‘You-came when?’

  • Step (1): mapping
    • جيت → جئت
    • امتي → متي
    • Result: جئت متي؟
  • Step (2): reordering
    • The new segment position for the word “امتى” is start of sentence (SoS)
    • Result: متي جئت؟ ‘When did-you-come?’
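
The mapping and reordering steps can be sketched with a small transfer lexicon. The lexicon layout, the `position` field, and the function name are assumptions; the three entries are the examples from the slides above.

```python
# Hypothetical transfer lexicon: colloquial word -> MSA words (+ optional position).
LEXICON = {
    "جيت": {"msa": ["جئت"]},
    "امتي": {"msa": ["متي"], "position": "SoS"},  # moves to start of sentence
    "عال": {"msa": ["علي", "ال"]},                # one-to-many mapping
}

def to_msa(sentence):
    """Map each colloquial word to MSA, then move SoS words to the front."""
    mark = "؟" if sentence.endswith("؟") else ""
    front, rest = [], []
    for word in sentence.rstrip("؟").split():
        entry = LEXICON.get(word, {"msa": [word]})  # unknown words pass through
        target = front if entry.get("position") == "SoS" else rest
        target.extend(entry["msa"])
    return " ".join(front + rest) + mark
```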

Issues in the transfer to MSA
  • More investigations are needed
Arabic Morphological Analyzers
  • Tim Buckwalter Morphological Analyzer
    • http://www.qamus.org/
    • http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49
  • Xerox
      • http://www.cis.upenn.edu/~cis639/arabic/input/keyboard_input.html
  • Aramorph
    • http://www.nongnu.org/aramorph/english/index.html
Arabic spell checker
  • Aspell
    • http://aspell.net/
    • http://www.freshports.org/arabic/aspell
Arabic Morphological Generation
  • Sarf
    • http://sourceforge.net/projects/sarf
Tokenization & POS tagging
  • ArabicSVMTools: The tools utilize the Yamcha SVM tools to tokenize, POS tag and Base Phrase Chunk Arabic text
    • http://www1.cs.columbia.edu/~mdiab/
    • http://www1.cs.columbia.edu/~mdiab/software/AMIRA-1.0.tar.gz
  • MADA: a full morphological tagger for Modern Standard Arabic.
    • http://www1.cs.columbia.edu/~rambow/software-downloads/MADA_Distribution.html
POS tagging
  • Stanford Log-linear Part-Of-Speech Tagger
    • http://nlp.stanford.edu/software/tagger.shtml
    • http://nlp.stanford.edu/software/stanford-arabic-tagger-2008-09-28.tar.gz
Tokenization & POS tagging
  • Attia's Finite State Tools for Modern Standard Arabic
    • http://www.attiaspace.com/getrec.asp?rec=htmFiles/fsttools
Arabic Parsers
  • Dan Bikel’s Parser
    • http://www.cis.upenn.edu/~dbikel/
    • http://www.cis.upenn.edu/~dbikel/software.html
  • Attia Arabic Parser
    • http://www.attiaspace.com/
    • http://decentius.aksis.uib.no/logon/xle.xml
Arabic wordnet
  • Arabic WordNet
    • http://www.globalwordnet.org/AWN/
    • http://personalpages.manchester.ac.uk/staff/paul.thompson/AWNBrowser.zip
Translation resources
  • Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU
      • http://www.statmt.org/
  • APIs:
    • http://code.google.com/apis/ajax/playground/#translate
    • http://code.google.com/apis/ajax/playground/#batch_translate
Transliterate
  • Transliterate
    • http://code.google.com/apis/ajax/playground/#transliterate_arabic
Mailing Lists – just to be connected to the NLP community
Conclusion (1)
  • Arabic requires the treatment of the language constituents at all levels: morphology, syntax, and semantics.
  • Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.
Conclusion (2)
  • Arabic NLP in general is significantly underdeveloped.
  • To bridge this gap and help Arabic NLP research catch up with the many recent advances for Latin languages, we need collaborative efforts from the Arabic research community.
Conclusion (3)
  • We need public-domain resources (in electronic form):
    • Linguistic resources such as large Arabic (bilingual) Corpora and treebanks.
    • Machine readable (bilingual) dictionaries
    • Morphological Analyzers
    • Parsers
Conclusion (4)
  • We need to secure funding for:
    • Exchanging visits (experience Expert Network)
    • Buying software
    • Securing dedicated RAs and/or PhD students for the NLP task.
References (1) - Journals
  • Khaled Shaalan, Hafsa Raza, NERA: Named Entity Recognition for Arabic, the Journal of the American Society for Information Science and Technology (JASIST), John Wiley & Sons, Inc., NJ, USA, 60(7):1–12, July 2009.
  • Shaalan, K., Monem, A. A., Rafea, A., Arabic Morphological Generation from Interlingua: A Rule-based Approach, in IFIP International Federation for Information Processing, Vol. 228, Intelligent Information Processing III, eds. Z. Shi, Shimohara K., Feng D., (Boston: Springer), PP. 441-451, 2006.
  • Shaalan, K., Talhami H., and Kamel I., Morphological Generation for Indexing Arabic Speech Recordings, The International Journal of Computer Processing of Oriental Languages (IJCPOL), World Scientific Publishing Company, 20(1):1-14, 2007.
References (2) - Journals
  • Shaalan K. An Intelligent Computer Assisted Language Learning System for Arabic Learners, Computer Assisted Language Learning: An International Journal, Taylor & Francis Group Ltd., 18(1 & 2): 81-108, February 2005.
  • Shaalan K. Arabic GramCheck: A Grammar Checker for Arabic, Software Practice and Experience, John Wiley & sons Ltd., UK, 35(7):643-665, June 2005.
  • Shaalan K., Rafea, A., Abdel Monem, A., Baraka, H., Machine Translation of English Noun Phrases into Arabic, The International Journal of Computer Processing of Oriental Languages (IJCPOL), World Scientific Publishing Company, 17(2):121-134, 2004.
  • Rafea A., Shaalan K., Lexical Analysis of Inflected Arabic words using Exhaustive Search of an Augmented Transition Network, Software Practice and Experience, John Wiley & sons Ltd., UK,23(6):567-588, June 1993.
References (3) – workshops & conferences
  • Hosny, A., Shaalan, K., Fahmy, A., Automatic Morphological Rule Induction for Arabic, In the Proceedings of The LREC'08 workshop on HLT & NLP within the Arabic world: Arabic Language and local languages processing: Status Updates and Prospects, 31st May, PP. 97-101, 2008.
  • Shaalan, K., Abo Bakr, H., Ziedan, I., Transferring Egyptian Colloquial into Modern Standard Arabic, International Conference on Recent Advances in Natural Language Processing (RANLP – 2007) , Borovets, Bulgaria, PP. 525-529, September 27-29, 2007.
  • Shaalan, K., Abdel Monem, A., Rafea, A., Baraka, H., Generating Arabic Text from Interlingua, In the Proceedings of the 2nd Workshop on Computational Approaches to Arabic Script-based Languages, CAASL-2, Linguistic Institute, Stanford, California, USA, PP. 137-144, July 21-22, 2007.
References (4) – workshops & conferences
  • Othman E., Shaalan K., and Rafea A., Towards Resolving Ambiguity in Understanding Arabic Sentence, In the Proceedings of the International Conference on Arabic Language Resources and Tools, NEMLAR, PP. 118-122, 22nd–23rd Sept., Egypt, 2004.
  • Othman E., Shaalan K., and Rafea A. A Chart Parser for Analyzing Modern Standard Arabic Sentence, In proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages: Issues and Approaches, New Orleans, Louisiana, USA., September, 2003.
Thank you! Merci!

Shukran!

شكرا
