Morphology 2 A case study of developing Bengali morph analyzer and generator


Morphology 2 A case study of developing Bengali morph analyzer and generator. Sudeshna Sarkar IIT Kharagpur. Two level morphology. PC-KIMMO, a morphological parser based on Kimmo Koskenniemi's model of two-level morphology ( Koskenniemi 1983 ).


Morphology 2 A case study of developing Bengali morph analyzer and generator

Sudeshna Sarkar

IIT Kharagpur

Two level morphology
  • PC-KIMMO, a morphological parser based on Kimmo Koskenniemi's model of two-level morphology ( Koskenniemi 1983).
  • Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between
    • morphotactics, which enumerates the inventory of morphemes and specifies in what order they can occur, and
    • morphophonemics, which accounts for alternate forms or "spellings" of morphemes according to the phonological context in which they occur.
For example, the word chased is analyzed morphotactically as the stem chase followed by the suffix -ed.
  • However, the addition of the suffix -ed apparently causes the loss of the final e of chase; thus chase and chas are allomorphs or alternate forms of the same morpheme.
  • Koskenniemi's model is "two-level" in the sense that a word is represented as a direct, letter-for-letter correspondence between its lexical or underlying form and its surface form. For example, the word chased is given this two-level representation (where + is a morpheme boundary symbol and 0 is a null character):

Lexical form: c h a s e + e d

Surface form: c h a s 0 0 e d
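
The letter-for-letter correspondence above can be sketched in a few lines of Python. This is only an illustration of the representation, not PC-KIMMO code; the function names `align` and `surface_string` are made up for this sketch.

```python
NULL = "0"  # null character used in the two-level representation

def align(lexical: str, surface: str) -> list[tuple[str, str]]:
    """Pair each lexical character with its surface character."""
    if len(lexical) != len(surface):
        raise ValueError("two-level forms must have equal length")
    return list(zip(lexical, surface))

def surface_string(pairs: list[tuple[str, str]]) -> str:
    """Spell out the surface word by dropping null characters."""
    return "".join(s for _, s in pairs if s != NULL)

pairs = align("chase+ed", "chas00ed")
print(pairs[4], pairs[5])     # ('e', '0') ('+', '0')
print(surface_string(pairs))  # chased
```

The lexical e and the boundary symbol + are both paired with the null character, which is how the loss of the final e of chase is recorded.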

Main components of Karttunen's KIMMO parser
  • the rules component: two-level rules that account for regular phonological or orthographic alternations, such as chase versus chas.
  • the lexical component: lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.
Englex: a two-level description of English morphology
  • Englex consists of a set of orthographic rules, a 20,000-entry lexicon of roots and affixes, and a word grammar. With Englex and PC-KIMMO, you can morphologically parse English words and text.
Generative rules and 2-level rules
  • Two-level rules are similar to the rules of standard generative phonology, but differ in several crucial ways. Rule R1 is an example of a generative rule.

R1 t ---> c / ___ i

Rule R2 is the analogous two-level rule.

R2 t:c => ___ i

Generative rules

    • Transformational rules
    • Sequential application
    • Unidirectional

Two-level rules

    • Declarative – talk about correspondences
    • They apply in parallel
    • Bidirectional
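
The declarative reading of R2 can be sketched as a check on an aligned lexical/surface pair rather than as a rewriting step. The sketch below assumes that the context `i` stands for the pair i:i, and the test strings `tati`, `taci` and `caci` are illustrative inputs chosen for this example; the function name is hypothetical.

```python
def satisfies_r2(lexical: str, surface: str) -> bool:
    """Check R2 (t:c => ___ i): a t:c pair may occur only immediately before i."""
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            # "=>" restricts where t:c may appear; it does not force t to become c
            if k + 1 >= len(pairs) or pairs[k + 1] != ("i", "i"):
                return False
    return True

print(satisfies_r2("tati", "taci"))  # True: t:c stands before i
print(satisfies_r2("tati", "tati"))  # True: "=>" does not force the change
print(satisfies_r2("tati", "caci"))  # False: the first t:c stands before a
```

Because the rule only constrains correspondences, the same check serves both when the lexical form is known (generation) and when the surface form is known (analysis).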
Hindi noun analysis

A. Noun analysis

Nouns are categorised into 20 different paradigms based on the following criteria:

1. Vowel ending.

2. Valid suffix of a word.

3. Gender, Number, Person and Case information.

A snapshot of the analysis is shown in Table 2.1.

There are 20,000 Nouns classified in 20 such paradigms.

Hindi verb analysis

B. Verb Analysis

The Verb Group represents the following grammatical properties:

1. Tense: Present, Past and Future.

2. Aspect: Durative, Stative, Infinitive, Habitual, Perfective etc.

3. Modal: Abilitive, Deontic, Probabilitative etc.

4. Gender: Male, Female, Dual.

5. Person: 1st, 2nd and 3rd.

These values formed the basis to list Verb Groups according to their TAM-GNP values. A TAM-GNP matrix having all possible VGs is developed.

IITB morph analyzer: Presently there are 622 unique paradigms in the TAM-GNP matrix.

Morphology: Verb

Attribute 1: Root

Val 0: root word of the given surface form of the word

Attribute 2: Category

Val 0: verb (v)

Attribute 3: Person

Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4: formal (second/third)

Attribute 4: Tense

Val 0: Present, Val 1: Past, Val 2: Future

Attribute 5: Aspect

Val 0: simple, Val 1: continuous, Val 2: perfect

Attribute 6: Modality

Attribute 8: Specificity

Val 0: non-specific, Val 1: specific

Attribute 9: Emphasizer

Val 0: none, Val 1: only, Val 2: also

Attribute 10: Polarity

Val 0: positive, Val 1: negative

Attributes & Values (Verb) :


  • First Person-(1),Ami
  • Second Formal-(2),Apani
  • Second Normal-(3),tumi
  • Second Familiar-(4),tui
  • Third Normal-(5),se
  • Third Formal-(6),tini
  • Unspecified
Attributes & Values (Verb) :


  • Indicative-(1),kara
  • Imperative-(2),kar
  • Subjunctive-(3),karale
Information (Verbs)
  • Total number of categories (based on syllabic structure): 20
  • Rules: 214 per category
  • Total number of rules: 214 × 20 = 4280 (approx.)
Classification : Nouns
  • Morphological Classification Based on Different Types of Nouns:
  • 1. Animate (example: mAnuSha)
  • 2. Inanimate (example: mATi)
  • 3. Abstract/Qualitative (example: daYA)
  • 4. Verbal (example: bhojana)
  • 5. Collective (example: pAla)
  • 6. The Singular (example: chandra)
  • 7. Compounded (example: riksAoYAlA)
Sub Classification :Nouns
  • Sub Classification based on “Root Endings”:
  • 1. a-ending root (animate “mAnusha”)
  • 2. A-ending root (animate “bAlikA”)
  • 3. i-ending root (animate “pAkhi”)
  • 4. I-ending root (animate “khukI”)
  • 5. e-ending root (animate “chhele”)
  • 6. o-ending root (animate “myA;o”)
  • 7. u-ending root (animate “shishu”)
  • 8. U-ending root (animate “badhU”)
Classification :Pronouns

Morphological Analysis Based on Different Natures of Pronouns:

1.Personal (Ami,Apani,-)

2.Inclusive (saba,sakala,ubhaYa,-)



5.Denoting Others (anya,para,-)

6.Near Demonstrative (e,ihA,-)

7.Far Demonstrative (o,uhA,-)

8.Reflexive (nija,nijenije,-)

9.Indefinite (keu,kichhu,-)

Morphology : Pronoun


  • Number
    • Val 0: singular, Val 1: plural, Val 2: honorary plural
  • Form
    • Val 0: direct, Val 1: oblique
  • Specificity
    • Val 0: non-specific, Val 1: specific
  • Case
    • Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
  • Emphatic Marker
    • Val 0: none, Val 1: only, Val 2: also
  • Ellipses
    • Val 0: false, Val 1: true
  • Nature
  • Types
Bengali POS Categories (Noun)

Bengali Noun has the following attributes:

Number, Specificity, Ellipses, Form, Case and Emphasizer

  • Number has 2 values (Singular and Plural)
  • Specificity has 2 values (Specific and non_specific)
  • Ellipses has 2 values (Elliptic and non_elliptic)
  • Form has 2 values (Direct and Oblique)
  • Case has 5 values (Nominative, Accusative, Genitive, Locative, Instrumental)
  • Emphasizer has 3 values (None, Only, Also)
Adjective Morphology


  • Val 0: root word of the given surface form of the word


  • Val 0: non-specific, Val 1: specific


  • Val 0: none, Val 1: only, Val 2: also


  • Val 0: normal, Val 1: superlative, Val 2: Comparative


  • Val 0: masculine, Val 1: feminine, Val 2: neuter
Adverb Morphology


  • Val 0: root word of the given surface form of the word


  • Val 0: none, Val 1: only, Val 2: also


  • Val 0: normal, Val 1: superlative, Val 2: Comparative
Postposition Morphology


  • Val 0: root word of the given surface form of the word


  • Val 0: none, Val 1: only, Val 2: also

Morphological Generator

Developed at

IIT Kharagpur


Morphological Generator uses certain linguistic resources and generates the surface form from a given input.

The following linguistic resources are required

  • Root Dictionary
  • Morphological Rules
    • Rule/Attribute Type Declaration (RATD)
    • Morphotactics
    • Paradigm Tables
    • Orthographic Rewrite Rules
  • Exception List
Format of the root dictionary

<root_word>:<category, paradigm_no;>+

  • root_word: The root word in UTF-8
  • category: Part-of-speech category
  • paradigm_no: A specific non-negative number referring to the paradigm table to be used for generation of the surface form for the root_word, when used as a particular POS-category.
  • +: denotes one or more occurrence of the <category, paradigm_no;>

Example for Hindi:

  • कर: NN,0; VM,1;
  • आम: NN,1; JJ, 0;
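
A minimal sketch of how one line in this format could be read, assuming one entry per line; the function name `parse_root_entry` is hypothetical and not part of the generator described here.

```python
def parse_root_entry(line: str) -> tuple[str, list[tuple[str, int]]]:
    """Parse '<root_word>:<category, paradigm_no;>+' into (root, [(category, paradigm_no), ...])."""
    root, _, rest = line.partition(":")
    entries = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:  # skip the empty piece after the final ';'
            continue
        category, paradigm_no = (part.strip() for part in chunk.split(","))
        entries.append((category, int(paradigm_no)))
    if not entries:
        raise ValueError("at least one <category, paradigm_no;> is required")
    return root.strip(), entries

print(parse_root_entry("कर: NN,0; VM,1;"))   # ('कर', [('NN', 0), ('VM', 1)])
print(parse_root_entry("आम: NN,1; JJ, 0;"))  # ('आम', [('NN', 1), ('JJ', 0)])
```

The same root can therefore select a different paradigm table depending on the POS category it is used as.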
The first line of the RATD is

<#categories> <cat_tag >+

#categories: The total number of distinct categories, for which morphological generation is required.

cat_tag: The category tag as used in the root dictionary, for which the generation is required.




This is followed by the declarations related to the #categories categories. The declaration for each category consists of meta declaration line followed by #morphotactics lines specifying the morphotactic rules. The meta declaration for a category is as follows:

<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values_for_attribute>+

  • cat_tag: As defined above
  • file_name: The name of the file that contains the morphotactics, paradigm tables and rewrite rules of the particular category.
  • #paradigms: Total number of paradigms for the category
  • #morphotactics: Total number of linear morphotactic rules for the category
  • #attributes: Total number of attributes that govern the morphology
  • #values_for_attribute: The number of values for each of the attributes.


NN nn.txt 5 1 2 2 2
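
Read against the field order given above, the sample line declares 5 paradigms, 1 morphotactic rule and 2 attributes with 2 values each. The sketch below spells this out; the class and function names are hypothetical, and it follows only the field order stated above.

```python
from dataclasses import dataclass

@dataclass
class CategoryDecl:
    cat_tag: str
    file_name: str
    n_paradigms: int
    n_morphotactics: int
    n_attributes: int
    values_per_attribute: list[int]

def parse_meta_declaration(line: str) -> CategoryDecl:
    """Split a meta declaration line into its named fields."""
    fields = line.split()
    cat_tag, file_name = fields[0], fields[1]
    n_paradigms, n_morphotactics, n_attributes = (int(x) for x in fields[2:5])
    values = [int(x) for x in fields[5:]]
    if len(values) != n_attributes:
        raise ValueError("expected one value count per attribute")
    return CategoryDecl(cat_tag, file_name, n_paradigms,
                        n_morphotactics, n_attributes, values)

decl = parse_meta_declaration("NN nn.txt 5 1 2 2 2")
print(decl.n_paradigms, decl.n_morphotactics, decl.values_per_attribute)  # 5 1 [2, 2]
```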


The morphotactics are specified linearly in the following format

{ ‘(’ { attribute_id, }+ ‘)’ }+

  • For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking for the features 0 and 2 is followed by the suffix marking feature 3 and then the suffix marking the features 1 and 4.
  • We assume a linear morphology
  • We assume that inflections are in the form of suffixes only (i.e. no prefix or infix)
  • In the above example, it is not possible to split the suffixes marking for features 0 and 2, and 1 and 4. In other words, the suffixes for these features are fusional as far as (0,2) or (1,4) feature combinations are considered, but the morphology is agglutinative in general.
  • There can be more than one morphotactic rule for a category in a language. In that case, the first rule is taken as the default one, whereas the other rules are triggered only under special circumstances, which are to be specified with the rule by assigning some specific value to the feature, like (0, 2=5)(3)(1, 4) implies that the rule is triggered only when Attribute 2 has a value of 5.
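
The linear format, including the optional `attribute=value` trigger, can be parsed with a short regular expression. This is a sketch under the assumption that a rule is written exactly as in the examples above; `parse_morphotactic_rule` is a hypothetical helper name.

```python
import re
from typing import Optional

def parse_morphotactic_rule(rule: str) -> list[list[tuple[int, Optional[int]]]]:
    """Parse e.g. '(0, 2=5)(3)(1, 4)' into slots of (attribute_id, required_value or None)."""
    slots = []
    for body in re.findall(r"\(([^()]*)\)", rule):
        slot = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            attr, _, value = item.partition("=")
            slot.append((int(attr), int(value) if value else None))
        slots.append(slot)
    return slots

print(parse_morphotactic_rule("(0, 2)(3)(1, 4)"))
# [[(0, None), (2, None)], [(3, None)], [(1, None), (4, None)]]
print(parse_morphotactic_rule("(0, 2=5)(3)(1, 4)"))
# [[(0, None), (2, 5)], [(3, None)], [(1, None), (4, None)]]
```

Each inner list is one suffix slot, so a slot with more than one attribute corresponds to a fused suffix.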
Morphotactics example
  • Bengali noun morphology
  • Attribute 0: Number Val 0: singular, Val 1: plural
  • Attribute 1: Obliqueness Val 0: direct, Val 1: oblique
  • Attribute 2: Specificity Val 0: non-specific, Val 1: specific
  • Attribute 3: Case Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative
  • Attribute 4: Emphasizer Val 0: none, Val 1: only, Val 2: also
  • Attribute 5: Ellipses Val 0: false, Val 1: true

Bengali nouns follow one of the following two morphotactics

    • (0,1,2)(3)(4)
    • (0,1,2)(5=1)(0,1,2)(3)(4)

The second rule is triggered only in the case of ellipses.
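
A sketch of how the choice between the two noun rules might be made: the first rule is the default, and the second is selected only when Attribute 5 (Ellipses) has the value 1. The data layout and the names `RULES`, `conditions` and `select_rule` are assumptions made for this illustration.

```python
# Each rule is a list of suffix slots; each slot lists (attribute_id, required_value or None).
DEFAULT_RULE = [[(0, None), (1, None), (2, None)], [(3, None)], [(4, None)]]
ELLIPSIS_RULE = [[(0, None), (1, None), (2, None)], [(5, 1)],
                 [(0, None), (1, None), (2, None)], [(3, None)], [(4, None)]]
RULES = [DEFAULT_RULE, ELLIPSIS_RULE]

def conditions(rule):
    """Collect the attribute=value triggers written inside a rule."""
    return [(attr, value) for slot in rule for attr, value in slot if value is not None]

def select_rule(rules, features):
    """Return the first non-default rule whose triggers all hold, else the default (first) rule."""
    for rule in rules[1:]:
        conds = conditions(rule)
        if conds and all(features.get(attr) == value for attr, value in conds):
            return rule
    return rules[0]

plain = {0: 0, 1: 0, 2: 1, 3: 2, 4: 0, 5: 0}   # Ellipses = false
elliptic = {**plain, 5: 1}                     # Ellipses = true

print(select_rule(RULES, plain) is DEFAULT_RULE)      # True
print(select_rule(RULES, elliptic) is ELLIPSIS_RULE)  # True
```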

Paradigm Table
  • The category specific files (e.g. nn.txt in the earlier example) store the paradigm tables and orthographic rewrite rules.
  • There are paradigm tables corresponding to every paradigm number for each feature/feature-combination in the morphotactics. Thus, if there are #paradigms paradigms for Bengali nouns, then there are 4*#paradigms paradigm tables. The 4 tables per paradigm correspond to (0,1,2), (3), (4), and (5).
  • However, several paradigms might share some of the tables. Therefore, in the declaration, a particular table can stand for more than one paradigm.
A paradigm table contains the list of suffixes for a particular combination of attributes.


<Attributes a1, a2>

<ParadigmNumber x1, x2, x3>

<Suffixes s11, s12, s13,…, s21, s22, s23,…>

The number of suffixes in a table is equal to the product of the numbers of values of the attributes in that combination.

Example: If the combination is (0,1) and the 1st attribute has 10 values and the 2nd attribute has 3 values, the table for the combination (0,1) will contain 10×3 = 30 suffixes (some of them may be NULL).
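
The size calculation is a plain product, sketched below; `table_size` is a hypothetical helper name.

```python
from math import prod

def table_size(value_counts):
    """Number of suffix entries in a table for one attribute combination."""
    return prod(value_counts)

print(table_size([10, 3]))    # 30, the example above
print(table_size([2, 2, 2]))  # 8, a combination of three two-valued attributes
```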

Orthographic Rules

Orthographic rules are specified as rewrite rules of the following forms

input  output / left_context, right_context

We also have provisions to specify two layer rules, where on the top layer specifies the rule on strings, and on the bottom layer, the features are indicated.

Thus, a rule of type

input  output / left_context, right_context

[att1] [root], [att2]

means that when the suffix corresponding to the attribute att1 has the pattern input, and it is immediately preceded by the pattern left_context, which belongs to the root and followed by the pattern right_context, which belongs to another suffix corresponding to some attribute att2, then input should be replaced by the pattern output.

RATD for Bengali
  • NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3
  • QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3
  • VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2
  • PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3
  • AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3
  • AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3
  • PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3
  • OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1
  • UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3
  • QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3
  • QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3
  • symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY
Orthographic Rules

The format is similar to two level morphological rules. Each rule has 4 parts


Here input is changed to output provided it is preceded by left_context and followed by right_context. The suffix is terminated by #.


“give^ing# = giving” can be written by the rule

Rule 1: e^:NULL/giv,ing#

If we say all “e-ending” words are inflected like “give” then we can write the rule

Rule 2: e^:NULL/*,ing#

If we say all “a-ending” and “o-ending” words are simply concatenated when added with “ing#” we can write

Rule 3: ^:NULL/*~,ing# (Where ~ symbol means either ‘a’ or ‘o’)

Orthographic Rules Contd..

The orthographic rules are best implemented as deterministic FSMs.

The FSM helps to decide whether a rule is satisfied by the input word. If it is, finding the portion to be replaced is not very tricky.

If no orthographic rule is triggered, the suffix is simply concatenated.

If, following the FSM, the input word reaches the final state, we say the rule is triggered.
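
The trigger-or-concatenate behaviour can be sketched for Rules 1 to 3. Note that this sketch uses Python regular expressions (a backtracking matcher) as a stand-in for a hand-built deterministic FSM, and it covers only the symbols used in those three rules (`*`, `~`, `^`, `#`, `NULL`). The names `apply_rule` and `generate` are hypothetical.

```python
import re

WILDCARDS = {"*": ".*", "~": "[ao]"}  # '*': any string; '~': 'a' or 'o', as in Rule 3

def to_regex(pattern: str) -> str:
    """Translate a context pattern into a regular expression."""
    return "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)

def apply_rule(rule: str, word: str):
    """Apply 'input:output/left,right' to word; return the surface form, or None if not triggered."""
    source, _, rest = rule.partition(":")
    output, _, contexts = rest.partition("/")
    left, _, right = contexts.partition(",")
    match = re.fullmatch(f"({to_regex(left)}){re.escape(source)}({to_regex(right)})", word)
    if match is None:
        return None
    replacement = "" if output == "NULL" else output
    return (match.group(1) + replacement + match.group(2)).replace("^", "").rstrip("#")

def generate(word: str, rules: list[str]) -> str:
    """Use the first triggered rule; otherwise simply concatenate root and suffix."""
    for rule in rules:
        result = apply_rule(rule, word)
        if result is not None:
            return result
    return word.replace("^", "").rstrip("#")

RULES = ["e^:NULL/giv,ing#", "e^:NULL/*,ing#", "^:NULL/*~,ing#"]
print(generate("give^ing#", RULES))  # giving
print(generate("go^ing#", RULES))    # going
print(generate("walk^ing#", RULES))  # walking
```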

Building FSM

Example FSM for Rule 2:


Orthographic Rules for Bengali Verb
  • oYA^L:o/*A,*
  • no^e:Ya/*A,K
  • oYA^e:Ya/*A,K
  • AoYA^:eY/X,echh*
  • yAoYA^:giY/*,echh*
  • AoYA^:e/X,M*
  • oYA^:NULL/*A,b*
  • AoYA^:e/y,t*
  • yAoYA^:ge/*,l*
  • eoYA^:iY/*,echh*
  • eoYA^:i;/*,iK
  • eoYA^:ich/*,chh*
  • eoYA^:i/*,P*
  • oYA^:NULL/*e,Q*
  • eoYA^u:i/*,*
  • eoYA^:NULL/*,R*
  • eoYA^:A/*,o*
  • eoYA^a:Ao/*,ni
  • eoYA^e:eYa/*,K
  • oYA^:uY/*$,echh*
  • oYA^:uch/*$,chh*
  • oYA^:u/*$,V*
  • oYA^i:u/*$,sa*
  • YA^:;/*$o,o#
  • YA^a:;o/*$o,ni*
  • YA^e:NULL/*$o,naK
  • YA^u:NULL/*$o,*
  • A^e:a/*$oY,K
  • ^y:;i/*,#
  • ^a:;o/*,#
  • AWA^aie:eWe/*,#
  • yAoYA^aie:giYe/*,*
  • AoYA^aie:eYe/X,#
  • eoYA^aie:iYe/*,#
  • Ano^aie:iYe/*,#
  • oYA^aie:uYe/*$,#
  • ^y:;i/*,#
  • ^a:;o/*,#
  • AWA^aie:eWe/*,#
  • A^aie:e/*,#
  • yAoYA^aie:giYe/*,*
  • AoYA^aie:eYe/X,#
  • eoYA^aie:iYe/*,#
  • Ano^aie:iYe/*,#
  • oYA^aie:uYe/*$,#
  • AWA^aie:eWe/*,#
  • A^:NULL/B,~*
  • A^:a/B,$*
  • no^:ch/*A,chh*
  • oYA^:ch/*A,chh*
  • Ano^:iY/*,echh*
  • no^:NULL/*A,E*
  • no^F:NULL/*A,G*
  • oYA^F:NULL/*A,G*
  • no^:NULL/*A,iK
  • oYA^:NULL/*A,iK
  • no^L:o/*A,*
Input Format

Input to the Morphological Generator starts with the root of the word, followed by the POS category and the attribute names with their values.


karA VM Person 3 Tense 2 Emp 2

In Bengali, Person and Tense combine to give a suffix which is added first, and the Emphasizer gives another suffix which is added next.

See Morphotactic for Bengali Verb.
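
A minimal sketch of reading such an input line, assuming whitespace-separated tokens and integer attribute values; `parse_generator_input` is a hypothetical name.

```python
def parse_generator_input(line: str):
    """Split 'root category (attribute value)*' into its parts."""
    tokens = line.split()
    if len(tokens) < 2 or len(tokens) % 2 != 0:
        raise ValueError("expected: root category (attribute value)*")
    root, category = tokens[0], tokens[1]
    features = {name: int(value) for name, value in zip(tokens[2::2], tokens[3::2])}
    return root, category, features

print(parse_generator_input("karA VM Person 3 Tense 2 Emp 2"))
# ('karA', 'VM', {'Person': 3, 'Tense': 2, 'Emp': 2})
```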

Input Format Contd.

In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10 values. The suffixes in the paradigm table are arranged in the following way.

First entry is Person 0 Tense 0

Second entry is Person 0 Tense 1

Third entry is Person 0 Tense 2 …

10th entry is Person 0 Tense 9

11th entry is Person 1 Tense 0

So Person 3 Tense 2 will be the entry number

(Person input) × (number of TAM values) + (TAM input) + 1

= 3 × 10 + 2 + 1 = 33

Get the 33rd entry from the paradigm table for (0,1) and use the orthographic rules to get the correct word.
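
The entry-number arithmetic can be checked with a one-line function; `entry_number` is a hypothetical name, and the default of 10 is the number of TAM values stated above.

```python
def entry_number(person: int, tam: int, n_tam: int = 10) -> int:
    """1-based position in a table ordered by Person first, then TAM."""
    return person * n_tam + tam + 1

print(entry_number(0, 9))  # 10
print(entry_number(1, 0))  # 11
print(entry_number(3, 2))  # 33
```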

Bengali Verb Paradigms and Morphotactics


<Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */


i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani



<Attributes 3 > /*Case*/


NULL i o


Morphotactic rule

  • (0,1)(2)(3)
  • (3=2)(2)
Bengali Noun Paradigms and Morphotactics


<Attributes 0 1 2 > /* Number, Specificity, Ellipses 2×2×2 = 8 entries*/





<Attributes 3 4 > /* Form, Case 2 × 5 = 10 entries */





<Attributes 5 > /* Emphasizer 3 entries */


NULL i o


Morphotactic rule


Example (Bengali Verb)

Example: the Input is

balA Verb Person 1 TAM 1 Case 0

First Morphotactic rule is triggered.

Person can have 6 values and TAM can have 10 values. So the extracted suffix number from the paradigm table 1,2 is

10×(Person value) +(TAM value) + 1 = 10×1 + 1 + 1 = 12

i.e., chhisa is to be added first.

From the paradigm table (3) extracted suffix is NULL.

i.e., NULL is to be added next.
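
The two lookups in this example can be reproduced with the suffix lists shown in the Bengali verb paradigm tables. The sketch assumes the tables are stored as flat, 1-based lists and that NULL stands for the empty suffix; the variable and function names are hypothetical.

```python
# Person x TAM suffixes, one row per Person value (6 rows of 10 entries).
PERSON_TAM_SUFFIXES = """
i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini
isa chhisa echhisa li chhili echhili bi tisa NULL isani
o chha echha le chhile echhile be te NULL ani
ena chhena echhena lena chhilena echhilena bena tena una enani
e chhe echhe la chhila echhila be ta uka eni
ena chhena echhena lena chhilena echhilena bena tena una enani
""".split()

THIRD_TABLE = "NULL i o".split()  # the table for attribute (3)

def lookup(table, entry_number):
    """Return the suffix at a 1-based entry number; NULL means no suffix."""
    suffix = table[entry_number - 1]
    return "" if suffix == "NULL" else suffix

entry = 10 * 1 + 1 + 1                  # Person 1, TAM 1
first = lookup(PERSON_TAM_SUFFIXES, entry)
second = lookup(THIRD_TABLE, 0 + 1)     # value 0 of attribute (3)

print(entry, first, repr(second))       # 12 chhisa ''
print("balA^" + first + second + "#")   # balA^chhisa#
```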

Example Contd.

Now balA^chhisa# is the input, for which a suitable orthographic rule is searched.

Suppose there is an orthographic rule

A^:a/B,$* Where B:*-Y and $: consonant

Then the FSM for this rule will bring the input to the final state, i.e., the rule is triggered. Now “A^” is replaced by “a” and the output is “balachhisa”.
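
The same step can be sketched with a regular expression standing in for the FSM. Two readings are assumed here and are not confirmed by the slides: "B:*-Y" is taken to mean any string not ending in Y, and a consonant is taken to be any letter other than a, A, e, i, I, o, u, U. The second test word is a constructed input used only to show a non-triggering case.

```python
import re

VOWELS = "aAeiIouU"                     # assumption: vowel letters of the transliteration
LEFT_B = r".*(?<!Y)"                    # assumption: "B:*-Y" = any string not ending in Y
CONSONANT = rf"(?![{VOWELS}])[A-Za-z]"  # assumption: "$" = any non-vowel letter

# A^:a/B,$*  ->  left context B, input "A^", right context "$*" up to the final "#"
RULE = re.compile(rf"({LEFT_B})A\^({CONSONANT}.*#)")

def apply_a_rule(word: str):
    """Replace 'A^' by 'a' when the rule's contexts match; return None otherwise."""
    match = RULE.fullmatch(word)
    if match is None:
        return None
    return match.group(1) + "a" + match.group(2).rstrip("#")

print(apply_a_rule("balA^chhisa#"))  # balachhisa
print(apply_a_rule("balA^i#"))       # None: the suffix starts with a vowel
```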

Exception List:

Words whose orthographic changes do not match those of other words, or which change completely when inflected, are treated as exceptions.

Adding these words to the orthographic rules would require a large number of highly complex rules.

We handle these words in a separate file, which lists each exception word along with all of its inflections.