Rule-based approach in Arabic NLP: Tools, Systems and Resources

CITALA'09. Rule-based approach in Arabic NLP: Tools, Systems and Resources. Dr Khaled Shaalan, Professor, Faculty of Computers & Information, Cairo University, on secondment to BUiD, UAE.


Dr Khaled Shaalan

Professor, Faculty of Computers & Information, Cairo University

On Secondment to BUiD, UAE

CITALA2009 - Morocco

Agenda
  • Objective
  • Language Tasks
  • NLP Approaches
  • Rule-based Arabic analysis and generation tools
  • Rule-based Arabic NLP applications
  • Some free Arabic NLP resources
  • Major and Arabic mailing lists
  • Conclusion
Objective
  • To show how the rule-based approach has been used successfully to develop Arabic natural language processing tools and applications.
Separating Language Tasks
  • English vs. French vs. Arabic vs. ...
  • Spoken language (dialogue) vs. written text vs. handwritten script
  • Genuine script vs. transliterated (Romanized) script
  • Vocalized (vowelized) vs. non-vocalized
  • Understanding vs. generation
  • First language learner vs. second language learner
  • Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)
  • Stem-based vs. root-based
  • Situation/Action rules
    • If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS)
    • If match(stem.definiteness, indefinite) then morph_gen(stem.definiteness, Stem_FS)
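The Situation/Action form can be sketched in Python as a list of (situation, action) pairs applied to a feature structure. This is only an illustration: the dictionary layout standing in for Stem_FS, the `Rule` class, and the choice to record `definiteness` when the article is removed are all assumptions, not details given on the slide.

```python
from dataclasses import dataclass
from typing import Callable

DEF_ARTICLE = "ال"  # the Arabic definite article

@dataclass
class Rule:
    situation: Callable[[dict], bool]   # when does the rule apply?
    action: Callable[[dict], dict]      # what does it do to the feature structure?

RULES = [
    # If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS)
    Rule(
        situation=lambda fs: fs.get("prefix") == DEF_ARTICLE,
        # Assumption: removing the article also records that the word was definite.
        action=lambda fs: {**fs, "prefix": "", "definiteness": "definite"},
    ),
]

def apply_rules(fs, rules=RULES):
    """Apply every rule whose situation matches, in order."""
    for rule in rules:
        if rule.situation(fs):
            fs = rule.action(fs)
    return fs

# Hypothetical feature structure for the word "الكتاب" (the book).
stem_fs = {"prefix": "ال", "stem": "كتاب"}
result = apply_rules(stem_fs)
```

Each rule is tried once, in the order written, with no looping and no conflict resolution.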
Common Mistake
  • The rule-based approach is not a rule-based expert system!
  • Both consist of rules.
  • A rule-based expert system solves the problem with a Recognize-Act Cycle
    • Loop
    • Conflict resolution strategy
Recognize-Act Cycle
[Figure: Domain Knowledge and Working Memory]
loop
  • Match: Rules are compared to working memory to determine matches. If no rule matches, then stop.
  • Conflict Resolution: Select or enable a single rule for execution
  • Execute: Fire the selected rule
    • Add new fact, or
    • Learn a new rule
end loop
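For contrast with the plain rule-based approach, a minimal sketch of the recognize-act cycle is given here. The rule format (name, premises, conclusion), the "first matching rule wins" conflict-resolution strategy, and the example facts are assumptions chosen for illustration; only the "add new fact" branch of Execute is shown.

```python
def recognize_act(rules, working_memory):
    """rules: list of (name, premises, conclusion) tuples in priority order."""
    memory = set(working_memory)
    fired = []
    while True:
        # Match: rules whose premises all hold and whose conclusion is new.
        conflict_set = [r for r in rules
                        if r[1] <= memory and r[2] not in memory]
        if not conflict_set:
            break  # no rule matches: stop
        # Conflict resolution (assumed strategy): take the first matching rule.
        name, _, conclusion = conflict_set[0]
        # Execute: fire the rule by adding a new fact to working memory.
        memory.add(conclusion)
        fired.append(name)
    return memory, fired

# Hypothetical domain knowledge and initial facts.
rules = [
    ("r1", {"has_prefix_al"}, "is_definite"),
    ("r2", {"is_definite", "is_noun"}, "can_be_inchoative"),
]
memory, fired = recognize_act(rules, {"has_prefix_al", "is_noun"})
```

Rule r2 can only fire after r1 has added its fact, which is what the loop and the working memory provide.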

NLP Approaches
  • Rule-based
  • Statistical-based
NLP Approaches (1)
  • Rule-based
    • Relies on hand-constructed rules that are acquired from language specialists
    • Requires only a small amount of training data
    • Development can be very time consuming
  • Statistical-based
    • Developers do not need language specialists' expertise
    • Requires a large amount of annotated training data (very large corpora)



NLP Approaches (2)
  • Rule-based
    • Some changes may be hard to accommodate
    • Not easy to obtain high coverage of the linguistic knowledge
    • Useful for a limited domain
    • Can be used with both well-formed and ill-formed input
    • High quality, based on solid linguistic knowledge
  • Statistical-based
    • Some changes may require re-annotation of the entire training corpus
    • Coverage depends on the training data
    • Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable
    • Lower quality: does not explicitly deal with syntax



Rule-based Arabic NLP tools
  • Morphological Analyzers
  • Morphological Generators
  • Syntactic Analyzers
  • Syntactic Generators
Morphological Analysis
  • Break down the inflected Arabic word into a root/stem, affixes, and features.
  • Example: sa- ‘uEty- kumA (سأعطيكما) - ‘will I give you…’
Rules - Augmented Transition Network (ATN) technique
  • Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and inflections.
  • More than one rule may be associated with one arc.
  • Conditions associated with the arcs are placed in such a way that the arc to be traversed first is the one that leads to the most probable solution.
Types of Rules
  • Remove Prefix or Suffix
  • Remove doubled letter
  • Add/change Hamza, Weak letter,…
Analysis of the verb “شاهدتك” (I saw you): Remove suffixes
  • last1 = “ك”
  • last2 = “ت”
  • stem: “شاهد” (saw)
  • perfect
  • 1st person sg pronoun: “ت”
  • 2nd person sg pronoun: “ك”
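The suffix-removal steps for this verb can be sketched as follows. The two one-entry suffix tables and the function name are assumptions made for the example; a real analyzer would use the ATN and much larger tables.

```python
# Hypothetical suffix tables; each holds only the entry needed for the example.
OBJECT_PRONOUNS = {"ك": "2nd person sg pronoun"}
SUBJECT_PRONOUNS = {"ت": "1st person sg pronoun"}

def analyze_perfect_verb(word):
    """Strip an object pronoun, then a subject pronoun, from a perfect verb."""
    features = []
    if word[-1:] in OBJECT_PRONOUNS:        # last1 = "ك"
        features.append(OBJECT_PRONOUNS[word[-1]])
        word = word[:-1]
    if word[-1:] in SUBJECT_PRONOUNS:       # last2 = "ت"
        features.append(SUBJECT_PRONOUNS[word[-1]])
        word = word[:-1]
    return {"stem": word, "aspect": "perfect", "features": features}

analysis = analyze_perfect_verb("شاهدتك")
```

In unvocalized text the suffix "ت" is ambiguous between several persons; the table above simply follows the reading given on the slide.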
Analysis of the verb “يلعبون” (they are playing): Remove prefix & suffix
  • Begin2 = “ي”
  • last2 = “ون”
  • stem: “لعب” (played)
  • imperfect
  • Plural subject
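A comparable sketch for removing both a prefix and a suffix is shown here. The prefix and suffix sets contain only the entries from this example, and the function name and return format are assumptions.

```python
# Hypothetical affix sets with only the entries used in the example.
IMPERFECT_PREFIXES = {"ي"}
PLURAL_SUBJECT_SUFFIXES = {"ون"}

def analyze_imperfect_verb(word):
    """Return an analysis if the word has an imperfect prefix and a plural suffix."""
    prefix, suffix = word[:1], word[-2:]
    if prefix in IMPERFECT_PREFIXES and suffix in PLURAL_SUBJECT_SUFFIXES:
        return {
            "stem": word[1:-2],        # what remains after stripping both affixes
            "aspect": "imperfect",
            "subject": "plural",
        }
    return None  # no rule applies

analysis = analyze_imperfect_verb("يلعبون")
```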
Issues in the morphological analysis
  • Overgeneration (too many outputs)
  • Ambiguity
  • Reconstruction of vowels
  • Multiword/compound expressions
  • Out-of-Vocabulary (OOV)
  • Handling ill-formed input
    • Detection (spell checking)
    • Correction by relaxation: accept “ه” instead of “ة”
  • Prevent ill-formed output
    • Check the compatibility of affixes (the prefix “ف” cannot come after the prefix “ب” or “ك”).
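The two checks named above, relaxation of “ه” to “ة” and prefix compatibility, might look like this in code. The forbidden-order table, the toy lexicon, and both function names are assumptions for illustration.

```python
# Hypothetical table of (earlier prefix, later prefix) pairs that are not allowed.
FORBIDDEN_PREFIX_ORDER = {("ب", "ف"), ("ك", "ف")}

def prefixes_compatible(prefixes):
    """Reject a prefix sequence in which "ف" follows "ب" or "ك"."""
    return all((a, b) not in FORBIDDEN_PREFIX_ORDER
               for a, b in zip(prefixes, prefixes[1:]))

def relax_final_ha(word, lexicon):
    """Accept a word written with final "ه" if the "ة" spelling is in the lexicon."""
    if word in lexicon:
        return word
    if word.endswith("ه"):
        candidate = word[:-1] + "ة"
        if candidate in lexicon:
            return candidate
    return None  # still out of vocabulary

lexicon = {"مدرسة"}  # toy lexicon: "school"
corrected = relax_final_ha("مدرسه", lexicon)
```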
Morphological generation
  • Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include:
    • definiteness (definite article “ال”),
    • gender (masculine, feminine),
    • number (singular, dual, plural),
    • case (nominative, genitive, accusative,…),
    • person (first, second, third)
Types of Rules
  • synthesis of inflected
    • Noun
    • Verb
    • particle
Synthesis of inflected Nouns
  • definite noun
  • feminine noun
  • pluralize noun
  • dual noun
  • attach a prefix preposition
  • attach a suffix pronoun
  • end case
  • ….
Synthesis of feminine noun
  • If noun.gender = masculine Then attach suffix feminine letter
  • Example:
    • “زوج” (husband) → “زوجة” (wife)
Synthesis of suffix pronoun
  • If pronoun.person = first and pronoun.number = singular Then attach first person singular suffix pronoun
  • Example:
    • “زوجة” (wife) → “زوجتي” (my wife)
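The feminine-noun and suffix-pronoun rules can be chained, as in this sketch. The function names and argument layout are assumptions; the change of final “ة” to “ت” before a suffix is standard Arabic orthography and is visible in the slide's own example.

```python
TA_MARBUTA = "ة"

def synthesize_feminine(noun, gender):
    """If noun.gender = masculine then attach the feminine letter."""
    return noun + TA_MARBUTA if gender == "masculine" else noun

def attach_suffix_pronoun(noun, person, number):
    """Attach the first person singular suffix pronoun "ي"."""
    if (person, number) != ("first", "singular"):
        raise NotImplementedError("only the first person singular rule is sketched")
    if noun.endswith(TA_MARBUTA):
        noun = noun[:-1] + "ت"   # ة is written ت when a suffix follows
    return noun + "ي"

wife = synthesize_feminine("زوج", "masculine")
my_wife = attach_suffix_pronoun(wife, "first", "singular")
```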
Synthesis of inflected Verbs (very complex, rich in form and meaning)
  • conjugate a verb with tense
  • conjugate a verb with number
  • conjugate a verb with prefix pronoun
  • conjugate a verb with suffix pronoun
  • ….
Rule: synthesize first person plural of assimilated verbs

Input: first person singular past verb

Output: inflected verb

Example: نصل - سنصل - وصلنا

If verb.tense = future
then remove first weak & attach_prefix("سن")
else if verb.tense = present
then remove first weak & attach_prefix("ن")
else attach_suffix(verb.stem, "نا")
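A direct transcription of this rule into Python is sketched below. It assumes the stem is passed as an unvocalized string beginning with the weak letter “و” (as in the example “وصل”); the function name and the string-valued `tense` parameter are illustrative choices.

```python
def first_person_plural(stem, tense):
    """Synthesize the first person plural of an assimilated verb."""
    # Assumption: the sketch only handles stems whose first letter is the weak "و".
    if not stem.startswith("و"):
        raise ValueError("expected an assimilated verb starting with و")
    if tense == "future":
        return "سن" + stem[1:]   # remove first weak letter, attach prefix "سن"
    if tense == "present":
        return "ن" + stem[1:]    # remove first weak letter, attach prefix "ن"
    return stem + "نا"           # past: attach suffix "نا"

forms = [first_person_plural("وصل", t) for t in ("present", "future", "past")]
```

The three forms correspond to the three words in the slide's example line.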

Issues in the morphological generation
  • Multiword/compound expressions
  • Out-of-Vocabulary (OOV)
  • Some forms need special handling:
    • Substitution: This man – هذا الرجل
    • Literal numbers (complex nouns)
    • Arabic script
      • ‘ل’ + ‘ال’ → ‘للـ’
      • “زملاء” + “ي” → ‘زملاءي’ → ‘زملائي’
      • “غرفة” → “غرفتان”
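The three Arabic-script adjustments listed above can be written as small string rewrites. The function names are assumptions, and each function covers only the single case shown on the slide.

```python
def attach_preposition_li(noun):
    """ل + ال becomes لل: the alif of the article is dropped."""
    if noun.startswith("ال"):
        return "لل" + noun[2:]
    return "ل" + noun

def attach_first_person_pronoun(noun):
    """زملاء + ي becomes زملائي: a final hamza moves onto a ya seat."""
    if noun.endswith("ء"):
        noun = noun[:-1] + "ئ"
    return noun + "ي"

def synthesize_dual(noun):
    """غرفة becomes غرفتان: ة is written ت before the dual suffix."""
    if noun.endswith("ة"):
        noun = noun[:-1] + "ت"
    return noun + "ان"

results = (
    attach_preposition_li("الرجل"),
    attach_first_person_pronoun("زملاء"),
    synthesize_dual("غرفة"),
)
```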
Types of Rules
  • Grammatical rules:
    • Describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence.
  • Parsing
    • Accepts the input and generates the sentence structure (parse tree)
Parsing of the sentence “الطالبة مجتهدة”: The student (sg, f) is diligent (sg, f)
  • الطالبة: noun (definite, fem, sg), analysed as the inchoative (definite, fem, sg)
  • مجتهدة: noun (indefinite, fem, sg), analysed as the enunciative (indefinite, fem, sg)
  • Inchoative + enunciative form a nominal sentence
  • Agreement:
    • Number
    • Gender

Nominal sentence -> definite_Inchoative(Number,Gender) indefinite_enunciative(Number,Gender)
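The grammar rule amounts to a definiteness check plus unification of Number and Gender. A minimal sketch, assuming each word arrives from the morphological analyzer as a dictionary with hypothetical keys `word`, `definiteness`, `number`, and `gender`:

```python
def parse_nominal_sentence(first, second):
    """Nominal sentence -> definite inchoative + indefinite enunciative,
    with the same Number and Gender on both."""
    agrees = (
        first["definiteness"] == "definite"
        and second["definiteness"] == "indefinite"
        and first["number"] == second["number"]
        and first["gender"] == second["gender"]
    )
    if not agrees:
        return None  # the rule does not apply
    return ("nominal_sentence",
            ("inchoative", first["word"]),
            ("enunciative", second["word"]))

student = {"word": "الطالبة", "definiteness": "definite",
           "number": "sg", "gender": "fem"}
diligent = {"word": "مجتهدة", "definiteness": "indefinite",
            "number": "sg", "gender": "fem"}
tree = parse_nominal_sentence(student, diligent)
```

Returning `None` on a mismatch is one simple way to signal an agreement failure; a grammar checker could instead report which feature disagreed.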

Issues in the syntactic analysis
  • Ambiguity (more than one parse tree)
    • Disambiguation techniques
  • Handling ill-formed input
    • Detection (grammar checking)
    • Recovering (Partial parsing - parses = chunks to be related)
Types of Rules
  • Determine phrase structures
  • Determine syntactic structure
  • Ensure the agreement relations between various elements in the sentence.
Rule: verb-subject agreement

Input: verb and inflected subject (a pre-verbal NP)

Output: inflected verb agreed with its inflected subject

An agreement example:

الأولاد زاروا خمس متاحف قديمة

the-boys visited-they five museum old

The boys visited five old museums

  • Verb-Subject agreement (N, G)
  • Counted-Num agreement (G)
  • Adj-noun agreement (G)

Issues in the syntactic generation
  • Word order (VSO,SVO, etc.)
  • Agreement (full/partial)
  • dropping the subject pronoun (called Pro-drop), i.e., to have a null subject, when the inflected verb includes subject affixes.
  • Syntax that captures the source/intended meaning
    • My son is 8 = ابني عمره ثماني سنوات
    • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة
Rule-based Arabic NLP applications
  • Named Entity Recognition
  • Machine translation
  • Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
What is entity recognition?
  • Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies.
  • Makes unstructured data more structured
Politics of Ukraine

In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies.

[Figure: Entity Extractor applied to the text above]




Person Entity Recognition (1)

Example: ‘الملك الأردني عبدالله الثاني’ The Jordanian king Abdullah II

  • We want to have a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.
Person Entity Recognition (2)

The Rule component of this example:

  • Name Entity: عبدالله [Abdullah]
  • Indicator pattern:
    • an honorific such as "الملك" [The king]
    • Nasab: (optional) inflected from a location name "الأردني" [Jordanian].
  • The rule also matches an optional ordinal number appearing at the end of some names such as "الثاني" [II].
Person Entity Recognition (3)
  • This (Regular Expression) rule can recognize:
    • الملك عبدالله
    • الملك الأردني عبدالله
    • الملك الأردني عبدالله الثاني
    • الملكة الأردنية رانيا
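The slide's actual regular expression did not survive in the transcript, so the following is only a guess at its shape, built with Python's `re` module. The four tiny word lists stand in for gazetteers, and the group names are assumptions. The sketch does not check gender agreement between the honorific and the nasab.

```python
import re

# Hypothetical gazetteers with just the words from the examples.
HONORIFICS = ["الملكة", "الملك"]
NASABS = ["الأردنية", "الأردني"]
FIRST_NAMES = ["عبدالله", "رانيا"]
ORDINALS = ["الثاني"]

def alternation(words):
    """Build a non-capturing alternation from a word list."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

PERSON_RULE = re.compile(
    rf"(?P<honorific>{alternation(HONORIFICS)})\s+"     # person indicator
    rf"(?:(?P<nasab>{alternation(NASABS)})\s+)?"        # optional nasab
    rf"(?P<name>{alternation(FIRST_NAMES)})"            # the name entity
    rf"(?:\s+(?P<ordinal>{alternation(ORDINALS)}))?"    # optional ordinal
)

match = PERSON_RULE.search("الملك الأردني عبدالله الثاني")
```

The same pattern also matches the shorter variants listed on the slide, because the nasab and ordinal groups are optional.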
Issues in the Arabic NER
  • Complex Morphological System (inflections)
  • Non-casing language (No initial capital for proper nouns)
  • Non-standardization and inconsistency in Arabic written text (typos, and spelling variants)
  • Ambiguity
Machine Translation
  • Direct
  • Transfer
  • Interlingua
MT Approaches: MT Pyramid
[Figure: MT pyramid with the levels Source word / Target word and Source syntax / Target syntax]



English-to-Arabic Transfer based Approach
[Figure: pipeline from source sentence to target sentence]
  • Source sentence
  • Sentence Analysis (… & syntactic Analysis; Rules of English; English Dic.)
  • English Parse Tree
  • Transformation Rules (Bi-ling Dic.)
  • Arabic Parse Tree
  • Sentence Synthesis (Morphological Gen. & Synthesis Rules of …; Arabic Dic.)
  • Target sentence


Transfer approach
  • Involves analysis, transfer, and generation components
  • If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component
Simple Transfer

(1) [w_i:$1, w_(i+1):$2, …, w_k:$k] → [w_k:$k, w_(k-1):$(k-1), …, w_i:$1] (1 ≤ i ≤ k)

Networks performance evaluation → تقييم أداء شبكة
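Rule (1) reverses the order of the words in an English noun phrase and translates each one. A minimal sketch, assuming a hypothetical three-entry bilingual dictionary keyed on lowercase English words:

```python
# Hypothetical bilingual dictionary covering only the example phrase.
BILINGUAL_DICT = {
    "networks": "شبكة",
    "performance": "أداء",
    "evaluation": "تقييم",
}

def simple_transfer(english_words, dictionary=BILINGUAL_DICT):
    """Translate each word and emit the translations in reverse order."""
    return [dictionary[w.lower()] for w in reversed(english_words)]

arabic = " ".join(simple_transfer(["Networks", "performance", "evaluation"]))
```

Agreement and definiteness, which a full transfer component would also handle, are left out of this sketch.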


Issues in the Transfer-based MT approach
  • Synonyms of a word
    • Acquisition → “اكتساب” or “استخلاص”.
  • Agreement
    • intelligent tutoring systems → “نظم التعليم الذكية” or “نظم التعليم الذكي”
  • Problems with prepositions
    • did you do fungal analysis? → “هل قمت بـتحليل الفطر؟”

Interlingua MT – Multilingual translation
  • Interlingua = Semantic Representation
  • Deep analysis
    • No need for a transfer component
    • Only analysis and generation components
  • Add Arabic analyzer to translate to other languages
  • Add Arabic generator to translate from other languages
Analysis of Arabic to Interlingua
[Figure: analysis pipeline]
  • Input: العميل: أنا أرغب في حجز غرفة في الفندق
  • Morphological Analyzer (Arabic Morphology Rules)
  • Arabic Grammar Rules
  • Parse Tree
  • Output: c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

Generating Arabic from Interlingua
[Figure: generation pipeline]
  • Input: c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))
  • Map Rules
  • Feature Structure
  • Arabic Grammar Rules
  • Morphological Generator (Arabic Morphology Rules)
  • Output: العميل: أنا أرغب في حجز غرفة في الفندق

Issues in the interlingua approach
  • Interlingua:
    • language-neutral representation
    • captures the intended meaning of the source sentence
  • Requires a fully-disambiguating parser
Transferring Egyptian Colloquial Dialect into Modern Standard Arabic
  • Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.
  • Facilitate the communication with colloquial Arabic speakers
  • Restore the Arabic dialect to the standard language in use nowadays.
A one-to-one transfer example





A one-to-many transfer example








A complete sentence example

جيت امتي؟

You-came when?

  • Step (1)
    • جيت → جئت
    • امتي → متي
  • Step (2)
    • The New Segment Position for the word “امتى” is start of sentence (SoS)

جئت متي؟ → متي جئت؟

When did-you-come ?
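The two steps can be sketched as a lexical lookup followed by a reordering pass. The lexicon format, pairing each colloquial word with its MSA form and an optional new position tag, is an assumption made for this example, and the spellings follow the slide.

```python
# Hypothetical transfer lexicon: colloquial word -> (MSA word, new position or None).
TRANSFER_LEXICON = {
    "جيت": ("جئت", None),
    "امتي": ("متي", "SoS"),   # SoS = start of sentence
}

def transfer_to_msa(words, lexicon=TRANSFER_LEXICON):
    # Step (1): replace each colloquial word with its MSA counterpart.
    transferred = [lexicon.get(w, (w, None)) for w in words]
    # Step (2): move words tagged SoS to the start of the sentence.
    front = [w for w, position in transferred if position == "SoS"]
    rest = [w for w, position in transferred if position != "SoS"]
    return front + rest

msa = transfer_to_msa(["جيت", "امتي"])
```

Words missing from the lexicon pass through unchanged, which lets the remaining MSA tools still see the rest of the sentence.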

Issues in the transfer to MSA
  • More investigations are needed
Arabic Morphological Analyzers
  • Tim Buckwalter Morphological Analyzer
  • Xerox
  • Aramorph
Arabic spell checker
  • Aspell
Arabic Morphological Generation
  • Sarf
Tokenization & POS tagging
  • ArabicSVMTools: the tools use the Yamcha SVM tools to tokenize, POS tag and Base Phrase Chunk Arabic text
  • MADA: a full morphological tagger for Modern Standard Arabic
  • Attia's Finite State Tools for Modern Standard Arabic
POS tagging
  • Stanford Log-linear Part-Of-Speech Tagger
Arabic Parsers
  • Dan Bikel’s Parser
  • Attia Arabic Parser
Arabic wordnet
  • Arabic WordNet
Translation resources
  • Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU
  • APIs:
  • Transliterate
Mailing Lists – just to be connected to the NLP community
Conclusion (1)
  • Arabic requires the treatment of the language constituents at all levels: morphology, syntax, and semantics.
  • Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.
Conclusion (2)
  • Arabic NLP in general is significantly underdeveloped.
  • In order to bridge this gap and help Arabic NLP research catch up with the many recent advances for Latin languages, we need collaborative efforts from the Arabic research community.
Conclusion (3)
  • We need public-domain resources (in electronic form):
    • Linguistic resources such as large Arabic (bilingual) corpora and treebanks
    • Machine-readable (bilingual) dictionaries
    • Morphological analyzers
    • Parsers
Conclusion (4)
  • We need to secure funds for:
    • Exchanging visits (experience, Expert Network)
    • Buying software
    • Securing dedicated RAs and/or PhD students for the NLP task
Thank you! Merci!