
Rule-based approach in Arabic NLP: Tools, Systems and Resources


CITALA'09. Rule-based approach in Arabic NLP: Tools, Systems and Resources. Dr Khaled Shaalan, Professor, Faculty of Computers & Information, Cairo University. On secondment to BUiD, UAE. [email protected]

Presentation Transcript


Rule-based approach in Arabic NLP: Tools, Systems and Resources

Dr Khaled Shaalan

Professor, Faculty of Computers & Information, Cairo University

On Secondment to BUiD, UAE

[email protected]

CITALA 2009 - Morocco

Agenda

  • Objective

  • Language Tasks

  • NLP Approaches

  • Rule-based Arabic Analysis and generation tools

  • Rule-based Arabic NLP applications

  • Some Arabic NLP Free Resources

  • Major and Arabic mailing lists

  • Conclusion

Objective

  • To show how the rule-based approach has been successfully used to develop Arabic natural language processing tools and applications.

Separating Language Tasks

  • English vs. French vs. Arabic vs . . .

  • Spoken language (dialogue) vs. written text vs. handwritten script

  • Genuine Script vs transliterated (Romanized) script

  • Vocalized (vowelized) vs non-vocalized

  • Understanding vs. generation

  • First language learner vs second language learner

  • Classical or Qur'anic Arabic vs. Modern Standard Arabic vs. colloquial (dialects)

  • Stem-based vs root-based

Rules

  • Situation/Action

    • If match(stem.prefix, def_article) then remove(stem.prefix, Stem_FS)

    • If match(stem.definiteness, indefinite) then morph_gen(stem.definiteness, Stem_FS)
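
A situation/action rule of this kind can be sketched in Python as a pair of functions over a feature structure. This is only an illustration: the dictionary layout of the feature structure, the DEF_ARTICLE constant, and the choice to record a "definiteness" feature when the article is removed are assumptions, not part of the rule notation on the slide.

```python
# Minimal sketch of situation/action rules over a feature structure (a dict).
# The field names and the recorded "definiteness" feature are assumptions.
DEF_ARTICLE = "ال"


def has_definite_article(stem_fs):
    """Situation: the stem's prefix matches the definite article."""
    return stem_fs.get("prefix") == DEF_ARTICLE


def remove_prefix(stem_fs):
    """Action: drop the prefix and note that the word was definite."""
    updated = dict(stem_fs)  # leave the caller's structure untouched
    updated["prefix"] = ""
    updated["definiteness"] = "definite"
    return updated


# Each rule is a (situation, action) pair; every rule is tried once, in order.
RULES = [(has_definite_article, remove_prefix)]


def apply_rules(stem_fs):
    for situation, action in RULES:
        if situation(stem_fs):
            stem_fs = action(stem_fs)
    return stem_fs


analysis = apply_rules({"prefix": "ال", "stem": "كتاب"})
```

Note that the rules are applied in a single pass, with no loop and no conflict resolution.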

Common Mistake

  • The rule-based approach is not a rule-based expert system!

  • Both consist of rules.

  • A rule-based expert system solves the problem through a Recognize-Act cycle:

    • Loop

    • Conflict resolution strategy

Recognize-Act Cycle

[Figure: Domain Knowledge (rules) matched against Working Memory]


  • Loop

    • Match: rules are compared to working memory to determine matches. If no rule matches, stop.

    • Conflict resolution: select or enable a single rule for execution.

    • Execute: fire the selected rule, which may

      • add a new fact, or

      • learn a new rule

  • End loop
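
For contrast with the rule-based NLP approach, the cycle can be sketched as a small forward-chaining loop. The rule format (a set of required facts plus one fact to add), the first-match conflict resolution, the max_cycles guard, and the fact names are all illustrative assumptions; learning new rules is left out.

```python
def recognize_act(rules, working_memory, max_cycles=100):
    """Run a toy Recognize-Act cycle and return the final working memory."""
    memory = set(working_memory)
    for _ in range(max_cycles):
        # Match: rules whose conditions hold and whose fact is not yet known.
        conflict_set = [
            rule for rule in rules
            if rule["if"] <= memory and rule["then"] not in memory
        ]
        if not conflict_set:
            break  # no rule matches: stop
        # Conflict resolution: pick a single rule (here, the first listed).
        selected = conflict_set[0]
        # Execute: fire the selected rule by adding its fact.
        memory.add(selected["then"])
    return memory


# Hypothetical facts, used only to exercise the loop.
rules = [
    {"if": {"has_prefix_al"}, "then": "is_definite"},
    {"if": {"is_definite"}, "then": "is_noun"},
]
final_memory = recognize_act(rules, {"has_prefix_al"})
```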

NLP Approaches

  • Rule-based

  • Statistical-based

NLP Approaches (1)

  • Rule-based

    • Relies on hand-constructed rules acquired from language specialists

    • Requires only a small amount of training data

    • Development can be very time-consuming

  • Statistical

    • Developers do not need language specialists' expertise

    • Requires a large amount of annotated training data (very large corpora)



NLP Approaches (2)

  • Rule-based

    • Some changes may be hard to accommodate

    • Not easy to obtain high coverage of the linguistic knowledge

    • Useful for limited domains

    • Can be used with both well-formed and ill-formed input

    • High quality, based on solid linguistic knowledge

  • Statistical

    • Some changes may require re-annotation of the entire training corpus

    • Coverage depends on the training data

    • Not easy to work with ill-formed input, as both well-formed and ill-formed input are still probable

    • Lower quality: does not explicitly deal with syntax



Rule-based Arabic NLP tools

  • Morphological Analyzers

  • Morphological Generators

  • Syntactic Analyzers

  • Syntactic Generators

Morphological Analysis

  • Break down the inflected Arabic word into a root/stem, affixes, and features.

  • Example: sa-'uEty-kumA (سأعطيكما), 'I will give you…'

Rules - Augmented Transition Network (ATN) technique

  • Rules associated with arcs represent the context-sensitive knowledge about the relation between a root and inflections.

  • More than one rule may be associated with one arc.

  • Conditions associated with the arcs are placed in such a way that the arc to be traversed first is the one that leads to the most probable solution.

Types of Rules

  • Remove Prefix or Suffix

  • Remove doubled letter

  • Add/change Hamza, Weak letter,…

Analysis of the verb "شاهدتك" (I saw you): Remove suffixes

[Figure: ATN arcs labelled last1 = “ك” and last2 = “ت”]

  • stem: "شاهد" (saw)

  • perfect

  • 1st person sg pronoun: "ت"

  • 2nd person sg pronoun "ك"
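
The suffix-removal steps for this verb can be sketched as below. The two one-entry suffix tables and the feature names are toy assumptions chosen to reproduce the example; in unvocalized text the suffix "ت" is ambiguous, and a real analyzer would return several candidate analyses.

```python
# Toy suffix tables; real analyzers use much larger, context-sensitive rule sets.
OBJECT_PRONOUNS = {"ك": "2nd person sg"}
SUBJECT_PRONOUNS = {"ت": "1st person sg"}


def analyze_perfect_verb(word):
    """Strip an object pronoun, then a subject pronoun, from the end of a verb."""
    features = {"aspect": "perfect"}
    last1 = word[-1:]
    if last1 in OBJECT_PRONOUNS:
        features["object"] = OBJECT_PRONOUNS[last1]
        word = word[:-1]
    last2 = word[-1:]
    if last2 in SUBJECT_PRONOUNS:
        features["subject"] = SUBJECT_PRONOUNS[last2]
        word = word[:-1]
    features["stem"] = word  # whatever remains is taken as the stem
    return features


result = analyze_perfect_verb("شاهدتك")
```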

Analysis of the verb “يلعبون” (they are playing): Remove prefix & suffix

[Figure: ATN arcs labelled Begin2 = “ي” and last2 = “ون”]

  • stem: "لعب" (played)

  • imperfect

  • Plural subject
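
A much-simplified stand-in for the ATN traversal is sketched below: ordered lists of prefix and suffix arcs are tried in turn, and an analysis is accepted only when the remaining stem is in the lexicon. The arc tables, the two-entry stem lexicon, and the feature names are assumptions made for illustration, not the analyzer's actual network.

```python
# Arcs are tried in list order, so the preferred arc is listed first.
# An empty affix means "no prefix" or "no suffix".
PREFIX_ARCS = [("ي", {"aspect": "imperfect"}), ("", {"aspect": "perfect"})]
SUFFIX_ARCS = [("ون", {"subject_number": "plural"}), ("", {})]
STEMS = {"لعب": "played", "شاهد": "saw"}  # toy lexicon


def analyze(word):
    """Return every analysis whose stem is found in the toy lexicon."""
    analyses = []
    for prefix, prefix_features in PREFIX_ARCS:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, suffix_features in SUFFIX_ARCS:
            if not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in STEMS:
                analyses.append({"stem": stem, **prefix_features, **suffix_features})
    return analyses


analyses = analyze("يلعبون")
```

Checking the stem against a lexicon is one simple way to limit overgeneration; without it, every prefix and suffix combination would produce an output.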

Issues in the morphological analysis

  • Overgeneration (too many outputs)

  • Ambiguity

  • Reconstruction of vowels

  • MultiWord/compound Expressions

  • Out-of-Vocabulary (OOV)

  • Handling ill-formed input

    • Detection (spell checking)

    • Correction (relaxation): accept “ه” where “ة” is intended

  • Prevent ill-formed output

    • Check affix compatibility, e.g. the prefix “ف” cannot come after the prefix “ب” (or “ك”).
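
The compatibility check can be sketched as a lookup of forbidden prefix sequences. The table holds only the constraint stated above, and checking adjacent pairs alone is a simplifying assumption.

```python
# For each prefix, the prefixes that must not directly follow it.
FORBIDDEN_AFTER = {"ب": {"ف"}, "ك": {"ف"}}


def prefixes_compatible(prefixes):
    """Return False if any prefix directly follows one it is incompatible with."""
    for earlier, later in zip(prefixes, prefixes[1:]):
        if later in FORBIDDEN_AFTER.get(earlier, set()):
            return False
    return True
```

For example, prefixes_compatible(["ف", "ب"]) accepts the order, while prefixes_compatible(["ب", "ف"]) rejects it.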

Morphological generation

  • Synthesis of an inflected Arabic word from a given root/stem according to a combination of morphological properties that include:

    • definiteness (definite article “ال”),

    • gender (masculine, feminine),

    • number (singular, dual, plural),

    • case (nominative, genitive, accusative,…),

    • person (first, second, third)

Types of Rules

  • Synthesis of inflected

    • Noun

    • Verb

    • Particle

Synthesis of inflected Nouns

  • definite noun

  • feminine noun

  • pluralize noun

  • dual noun

  • attach a prefix preposition

  • attach a suffix pronoun

  • end case

  • ….

Synthesis of feminine noun

  • If noun.gender = masculine then attach the feminine suffix letter (ة)

  • Example:

    • “زوج” (husband) → “زوجة” (wife)

Synthesis of suffix pronoun

  • If pronoun.person = first and pronoun.number = singular Then attach first person singular suffix pronoun

  • Example:

    • “زوجة” (wife) → “زوجتي” (my wife)
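
The feminine and suffix-pronoun rules can be chained as in the sketch below. The function names and signatures are hypothetical, and the step that opens the final "ة" to "ت" before the pronoun is an orthographic assumption added so that the output matches the example.

```python
FEMININE_SUFFIX = "ة"
FIRST_SG_PRONOUN = "ي"


def make_feminine(noun, gender):
    """Attach the feminine suffix letter to a masculine noun."""
    if gender == "masculine":
        return noun + FEMININE_SUFFIX
    return noun


def attach_suffix_pronoun(noun, person, number):
    """Attach the first person singular suffix pronoun (only case covered here)."""
    if person == "first" and number == "singular":
        # Assumed orthographic step: a final "ة" becomes "ت" before a suffix.
        base = (noun[:-1] + "ت") if noun.endswith(FEMININE_SUFFIX) else noun
        return base + FIRST_SG_PRONOUN
    raise NotImplementedError("only first person singular is sketched")


wife = make_feminine("زوج", "masculine")
my_wife = attach_suffix_pronoun(wife, "first", "singular")
```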

Synthesis of inflected Verbs (very complex: rich in form and meaning)

  • conjugate a verb with tense

  • conjugate a verb with number

  • conjugate a verb with prefix pronoun

  • conjugate a verb with suffix pronoun

  • ….

Rule: synthesize first person plural of assimilated verbs

Input: first person singular past verb

Output: inflected verb

Example: نصل- سنصل - وصلنا

If verb.tense = future

then remove first weak letter & attach_prefix("سن")

else if verb.tense = present

then remove first weak letter & attach_prefix("ن")

else attach_suffix(verb.stem, "نا")
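
The rule translates almost directly into Python. In this sketch the stem is passed as a plain string and the tense as one of three labels; the set of weak letters and the function name are assumptions.

```python
WEAK_LETTERS = {"و", "ي", "ا"}


def first_person_plural(stem, tense):
    """Inflect an assimilated verb (weak first letter) for first person plural."""
    if tense in ("future", "present"):
        # Remove the first weak letter, then attach the tense prefix.
        body = stem[1:] if stem[:1] in WEAK_LETTERS else stem
        prefix = "سن" if tense == "future" else "ن"
        return prefix + body
    # Past tense: keep the stem and attach the suffix.
    return stem + "نا"


forms = [first_person_plural("وصل", t) for t in ("present", "future", "past")]
```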

Issues in the morphological generation

  • MultiWord/compound Expressions

  • Out-of-Vocabulary (OOV)

  • Some forms need special handling:

    • Substitution: This man – هذا الرجل

    • literal numbers (complex nouns)

    • Arabic script

      • ‘ل’ + ‘ال’ → ‘للـ’

      • “زملاء” + “ي” → ‘زملاءي’ → ‘زملائي’

      • “غرفة” → “غرفتان”
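
The three script adjustments can be sketched as small string rewrites. The function names are hypothetical, and each function covers only the case illustrated above.

```python
def attach_preposition_li(noun):
    """Prefix the preposition "ل"; the article's alif is dropped (ل + ال → لل)."""
    if noun.startswith("ال"):
        return "لل" + noun[2:]
    return "ل" + noun


def attach_first_person_pronoun(noun):
    """Suffix "ي"; a final hamza is rewritten on a yaa seat (ء → ئ)."""
    if noun.endswith("ء"):
        noun = noun[:-1] + "ئ"
    return noun + "ي"


def make_dual(noun):
    """Suffix the dual ending "ان"; a final "ة" opens to "ت" first."""
    if noun.endswith("ة"):
        noun = noun[:-1] + "ت"
    return noun + "ان"
```

Keeping these adjustments in separate functions lets the generator apply them after the affixes have been chosen, so the morphological rules themselves stay free of spelling details.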

Types of Rules

  • Grammatical rules:

    • Describe sentence and phrase structures, and ensure the agreement relations between various elements in the sentence.

  • Parsing

    • Accepts the input and generates the sentence structure (parse tree)

Parsing of the sentence “الطالبة مجتهدة”: The student (sg, f) is diligent (sg, f)

  • الطالبة: noun (definite, fem, sg), acting as the inchoative (definite, fem, sg)

  • مجتهدة: noun (indefinite, fem, sg), acting as the enunciative (indefinite, fem, sg)

  • Inchoative + enunciative form a nominal sentence

  • Agreement:

    • Number

    • Gender

Nominal sentence -> definite_inchoative(Number,Gender) indefinite_enunciative(Number,Gender)
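
The grammar rule, with its shared Number and Gender variables, can be sketched as a function that builds a parse only when definiteness and agreement constraints hold. The Word class and the tuple-shaped parse tree are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    definite: bool
    gender: str
    number: str


def parse_nominal_sentence(inchoative, enunciative):
    """Return a parse tree if the rule applies, otherwise None."""
    # The inchoative must be definite and the enunciative indefinite.
    if not inchoative.definite or enunciative.definite:
        return None
    # Both constituents must share the same Number and Gender.
    if (inchoative.number, inchoative.gender) != (enunciative.number, enunciative.gender):
        return None
    return (
        "nominal_sentence",
        ("inchoative", inchoative.text),
        ("enunciative", enunciative.text),
    )


student = Word("الطالبة", definite=True, gender="fem", number="sg")
diligent_f = Word("مجتهدة", definite=False, gender="fem", number="sg")
diligent_m = Word("مجتهد", definite=False, gender="masc", number="sg")
tree = parse_nominal_sentence(student, diligent_f)
```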

Issues in the syntactic analysis

  • Ambiguity (more than one parse tree)

    • Disambiguation techniques

  • Handling ill-formed input

    • Detection (grammar checking)

    • Recovery (partial parsing: the partial parses are chunks to be related)

Types of Rules

  • Determine phrase structures

  • Determine syntactic structure

  • Ensure the agreement relations between various elements in the sentence.

Rule: verb-subject agreement

Input: verb and inflected subject (a pre-verbal NP )

Output: inflected verb agreed with its inflected subject



An agreement example:

الأولاد زاروا خمس متاحف قديمة

the-boys visited-they five museum old

The boys visited five old museums

  • Verb-subject agreement: number and gender (N, G)

  • Counted noun-numeral agreement: gender (G)

  • Adjective-noun agreement: gender (G)

Issues in the syntactic generation

  • Word order (VSO,SVO, etc.)

  • Agreement (full/partial)

  • Dropping the subject pronoun (pro-drop), i.e., having a null subject, when the inflected verb includes subject affixes.

  • Syntax that captures the source/intended meaning

    • My son is 8 = ابني عمره ثماني سنوات

    • I did not understand the last sentence = أنا لم أفهم الجملة الأخيرة

Rule-based Arabic NLP applications

  • Named Entity Recognition

  • Machine translation

  • Transferring Egyptian Colloquial Dialect into Modern Standard Arabic

What is entity recognition?

  • Identifying, extracting, and normalizing entities from documents such as names of people, locations, or companies.

  • Makes unstructured data more structured


Politics of Ukraine

In July 1994, Leonid Kuchma was elected as Ukraine's second president in free and fair elections. Kuchma was reelected in November 1999 to another five-year term, with 56 percent of the vote. International observers criticized aspects of the election, especially slanted media coverage; however, the outcome of the vote was not called into question. In March 2002, Ukraine held its most recent parliamentary elections, which were characterized by the Organization for Security and Cooperation in Europe (OSCE) as flawed, but an improvement over the 1998 elections. The pro-presidential For a United Ukraine bloc won the largest number of seats, followed by the reformist Our Ukraine bloc of former Prime Minister Viktor Yushchenko, and the Communist Party. There are 450 seats in parliament, with half chosen from party lists by proportional vote and half from individual constituencies.

[Figure: the passage above is passed to an Entity Extractor]




Person Entity Recognition (1)

Example: ‘الملك الأردني عبدالله الثاني’ The Jordanian king Abdullah II

  • We want a rule that recognizes a person name composed of a first name followed by optional last names, based on a preceding person indicator pattern.

Person Entity Recognition (2)

The rule components for this example:

  • Name entity: عبدالله [Abdullah]

  • Indicator pattern:

    • an honorific such as "الملك" [The king]

    • Nasab (optional): an adjective inflected from a location name, such as "الأردني" [Jordanian].

  • The rule also matches an optional ordinal number appearing at the end of some names, such as "الثاني" [II].

Person Entity Recognition (3)

  • This (regular expression) rule can recognize:

    • الملك عبدالله

    • الملك الأردني عبدالله

    • الملك الأردني عبدالله الثاني

    • الملكة الأردنية رانيا
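
A rule of this shape can be sketched with Python's re module. The four word lists are tiny stand-ins for the gazetteers a real system would use, the group names are made up for the sketch, and optional last names are not handled.

```python
import re

# Toy gazetteers; a real system would load much larger lists.
HONORIFICS = ["الملك", "الملكة"]
NASABS = ["الأردني", "الأردنية"]
FIRST_NAMES = ["عبدالله", "رانيا"]
ORDINALS = ["الثاني"]


def alternation(words):
    """Build a regex alternation, longest entries first."""
    ordered = sorted(words, key=len, reverse=True)
    return "|".join(re.escape(word) for word in ordered)


# honorific, optional nasab, first name, optional ordinal number
PERSON_RULE = re.compile(
    rf"(?P<honorific>{alternation(HONORIFICS)})"
    rf"(?:\s+(?P<nasab>{alternation(NASABS)}))?"
    rf"\s+(?P<name>{alternation(FIRST_NAMES)})"
    rf"(?:\s+(?P<ordinal>{alternation(ORDINALS)}))?"
)


def recognize_person(text):
    """Return the matched rule components as a dict, or None if no match."""
    match = PERSON_RULE.search(text)
    return match.groupdict() if match else None


entity = recognize_person("الملك الأردني عبدالله الثاني")
```

A bare name with no preceding indicator, such as "عبدالله" on its own, is not matched, which reflects the reliance on the indicator pattern in the absence of capitalization.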

Issues in the Arabic NER

  • Complex Morphological System (inflections)

  • Non-casing language (No initial capital for proper nouns)

  • Non-standardization and inconsistency in Arabic written text (typos, and spelling variants)

  • Ambiguity

Machine Translation

  • Direct

  • Transfer

  • Interlingua

MT Approaches: MT Pyramid

[Figure: MT pyramid with the levels Source word / Target word and Source syntax / Target syntax]


English-to-Arabic Transfer-based Approach

[Figure: pipeline from source sentence to target sentence]

  • Source sentence

  • Sentence Analysis, using the morphological & syntactic analysis rules of English and an English dictionary, produces the English parse tree

  • Transformation rules and a bilingual dictionary map it to the Arabic parse tree

  • Sentence Synthesis, using the morphological generation & synthesis rules of Arabic and an Arabic dictionary, produces the target sentence


Transfer approach

  • Involves analysis, transfer, and generation components

  • If you have an Arabic parser and an Arabic syntactic generator, all you need is to acquire the transfer rules and build the transfer component

Simple Transfer

(1) [w_i:$1, w_i+1:$2, …, w_k:$k] → [w_k:$k, w_k-1:$k-1, …, w_i:$1] (1 ≤ i ≤ k)

Example: Networks performance evaluation → تقييم أداء شبكة
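
The reordering rule and this example can be sketched as follows. The three-entry bilingual dictionary is a toy assumption chosen to reproduce the example; it is not the system's lexicon, and no morphological generation is performed.

```python
# Toy bilingual dictionary (hypothetical entries for this example only).
BILINGUAL = {"networks": "شبكة", "performance": "أداء", "evaluation": "تقييم"}


def simple_transfer(words):
    """Apply the simple transfer rule: reverse the order of the words."""
    return list(reversed(words))


def translate_noun_phrase(phrase):
    """Reorder an English noun phrase, then look each word up in the dictionary."""
    source_words = phrase.lower().split()
    reordered = simple_transfer(source_words)
    return " ".join(BILINGUAL[word] for word in reordered)


arabic = translate_noun_phrase("Networks performance evaluation")
```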


Issues in the Transfer-based MT approach

  • Synonyms of a word

    • Acquisition → “اكتساب” or “استخلاص”.

  • Agreement

    • intelligent tutoring systems → “نظم التعليم الذكية” or “نظم التعليم الذكي”

  • Problems with prepositions

    • did you do fungal analysis? → “هل قمت بـتحليل الفطر؟”

Interlingua MT: Multilingual translation

  • Interlingua = Semantic Representation

  • Deep analysis:

    • No need for a transfer component

    • Only analysis and generation components

  • Add Arabic analyzer to translate to other languages

  • Add Arabic generator to translate from other languages

Analysis of Arabic to Interlingua

[Figure components: Morphological Analyzer, Arabic Morphology Rules, Arabic Grammar Rules, Parse Tree]

Input: العميل: أنا أرغب في حجز غرفة في الفندق

Output: c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

Generating Arabic from Interlingua

[Figure components: Map Rules, Feature Structure, Arabic Grammar Rules, Morphological Generator, Arabic Morphology Rules]

Input: c:introduce-topic+reservation+disposition+room (room-spec=(room, specifier=hote,identifiability=yes),disposition=(desire,who=i))

Output: العميل: أنا أرغب في حجز غرفة في الفندق

Issues in the interlingua approach

  • Interlingua:

    • language-neutral representation

    • captures the intended meaning of the source sentence

  • Requires a fully-disambiguating parser

Transferring Egyptian Colloquial Dialect into Modern Standard Arabic

  • Be able to reuse MSA processing tools with colloquial Arabic by transferring colloquial Arabic words into their corresponding MSA words.

  • Facilitate the communication with colloquial Arabic speakers

  • Restore the Arabic dialect to the standard language in use nowadays.

A one-to-one transfer example

A one-to-many transfer example


A complete sentence example

جيت امتي؟

You-came when?

  • Step (1)

    • جيت → جئت

    • امتي → متي

    • Result: جئت متي؟

  • Step (2)

    • The new segment position for the word “امتى” is start of sentence (SoS)

    • Result: متي جئت؟

When did-you-come?
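
The two steps can be sketched as a lexical lookup followed by a reordering pass. The two-entry lexicon, the set of words that move to the start of the sentence, and the handling of the question mark are assumptions made for this example.

```python
# Toy colloquial-to-MSA lexicon and reordering table (hypothetical entries).
LEXICON = {"جيت": "جئت", "امتي": "متي"}
MOVE_TO_START = {"امتي"}  # colloquial words whose MSA form goes to SoS


def transfer_words(tokens):
    """Step 1: map each colloquial word to its MSA word, keeping the source."""
    return [(token, LEXICON.get(token, token)) for token in tokens]


def reorder(pairs):
    """Step 2: move flagged words to the start of the sentence."""
    front = [msa for source, msa in pairs if source in MOVE_TO_START]
    rest = [msa for source, msa in pairs if source not in MOVE_TO_START]
    return front + rest


def to_msa(sentence):
    is_question = sentence.endswith("؟")
    tokens = sentence.rstrip("؟").split()
    result = " ".join(reorder(transfer_words(tokens)))
    return result + "؟" if is_question else result


msa_sentence = to_msa("جيت امتي؟")
```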

Issues in the transfer to MSA

  • More investigations are needed

Arabic NLP Free Resources

Arabic Morphological Analyzers

  • Tim Buckwalter Morphological Analyzer

  • Xerox

  • Aramorph

Arabic spell checker

  • Aspell



Arabic Morphological Generation

  • Sarf


Tokenization & POS tagging

  • ArabicSVMTools: The tools utilize the Yamcha SVM tools to tokenize, POS tag and Base Phrase Chunk Arabic text




  • MADA: a full morphological tagger for Modern Standard Arabic.


POS tagging

  • Stanford Log-linear Part-Of-Speech Tagger



Tokenization & POS tagging

  • Attia's Finite State Tools for Modern Standard Arabic


Arabic Parsers

  • Dan Bikel’s Parser



  • Attia Arabic Parser



Arabic WordNet

  • Arabic WordNet



Translation resources

  • Tools: GIZA++, MOSES, Pharaoh, Rewrite and BLEU

  • APIs:

Transliterate

  • Transliterate

Mailing Lists: just to be connected to the NLP community

Conclusion (1)

  • Arabic requires the treatment of the language constituents at all levels: morphology, syntax, and semantics.

  • Most research in Arabic NLP concentrates on the analysis part, aiming at automated understanding of the Arabic language.

Conclusion (2)

  • Arabic NLP in general is significantly underdeveloped.

  • To bridge this gap and help Arabic NLP research catch up with the many recent advances in Latin languages, we need collaborative efforts from the Arabic research community.

Conclusion (3)

  • We need public-domain resources (in electronic form):

    • Linguistic resources such as large Arabic (bilingual) corpora and treebanks

    • Machine-readable (bilingual) dictionaries

    • Morphological analyzers

    • Parsers

Conclusion (4)

  • We need to secure funds for:

    • Exchanging visits (experience, expert network)

    • Buying software

    • Securing dedicated RAs and/or PhD students for the NLP task


Thank you! Merci!