
Morphology 2: A case study of developing Bengali morph analyzer and generator



Morphology 2 A case study of developing Bengali morph analyzer and generator. Sudeshna Sarkar IIT Kharagpur. Two level morphology. PC-KIMMO, a morphological parser based on Kimmo Koskenniemi's model of two-level morphology ( Koskenniemi 1983 ).


Morphology 2: A case study of developing Bengali morph analyzer and generator

Sudeshna Sarkar

IIT Kharagpur



Two level morphology

  • PC-KIMMO is a morphological parser based on Kimmo Koskenniemi's model of two-level morphology (Koskenniemi 1983).

  • Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between

    • morphotactics, which enumerates the inventory of morphemes and specifies in what order they can occur, and

    • morphophonemics, which accounts for alternate forms or "spellings" of morphemes according to the phonological context in which they occur.



  • For example, the word chased is analyzed morphotactically as the stem chase followed by the suffix -ed.

  • However, the addition of the suffix -ed apparently causes the loss of the final e of chase; thus chase and chas are allomorphs or alternate forms of the same morpheme.

  • Koskenniemi's model is "two-level" in the sense that a word is represented as a direct, letter-for-letter correspondence between its lexical or underlying form and its surface form. For example, the word chased is given this two-level representation (where + is a morpheme boundary symbol and 0 is a null character):

    Lexical form: c h a s e + e d

    Surface form: c h a s 0 0 e d
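The letter-for-letter correspondence above can be sketched in a few lines of Python. This is only an illustration of the representation, not PC-KIMMO code; the function names `align`, `surface_word` and `morphemes` are made up for this example.

```python
# Two-level representation of "chased" as aligned lexical:surface pairs.
# "+" is the morpheme boundary and "0" is the null character, as on the slide.
LEXICAL = "chase+ed"
SURFACE = "chas00ed"

def align(lexical, surface):
    """Pair the two levels letter for letter; both forms must have equal length."""
    if len(lexical) != len(surface):
        raise ValueError("two-level forms must have the same length")
    return list(zip(lexical, surface))

def surface_word(pairs):
    """Read off the surface word by dropping null characters."""
    return "".join(s for _, s in pairs if s != "0")

def morphemes(pairs):
    """Read off the lexical morphemes by splitting at the boundary symbol."""
    lexical = "".join(l for l, _ in pairs if l != "0")
    return lexical.split("+")

pairs = align(LEXICAL, SURFACE)
print(surface_word(pairs))  # chased
print(morphemes(pairs))     # ['chase', 'ed']
```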



Main components of Karttunen's KIMMO parser

  • the rules component: two-level rules that accounted for regular phonological or orthographic alternations, such as chase versus chas.

  • the lexical component: lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.



  • Englex: a two-level description of English morphology

  • Englex consists of a set of orthographic rules, a 20,000-entry lexicon of roots and affixes, and a word grammar. With Englex and PC-KIMMO, you can morphologically parse English words and text.



Generative rules and 2-level rules

  • Two-level rules are similar to the rules of standard generative phonology, but differ in several crucial ways. Rule R1 is an example of a generative rule.

    R1 t ---> c / ___ i

    Rule R2 is the analogous two-level rule.

    R2 t:c => ___ i

    Generative rules

    • Transformational rules

    • Sequential application

    • Unidirectional

    Two-level rules

    • Declarative: they describe correspondences

    • They apply in parallel

    • Bidirectional
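To make the declarative reading of R2 concrete, here is a minimal Python sketch that treats the rule as a constraint on aligned lexical:surface pairs rather than as a rewriting step. The test strings (ati, aci, ata, aca) are invented for illustration, and the context i is assumed to stand for the default pair i:i.

```python
def satisfies_r2(pairs):
    """Check R2 (t:c => ___ i) on a list of (lexical, surface) pairs.

    The => operator only restricts where t:c may occur: every t:c pair must be
    immediately followed by i:i. It does not force t to be realized as c.
    """
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            following = pairs[k + 1] if k + 1 < len(pairs) else None
            if following != ("i", "i"):
                return False
    return True

# The same check serves analysis and generation, because it only inspects
# a correspondence; nothing is rewritten in either direction.
print(satisfies_r2(list(zip("ati", "aci"))))  # True: t:c before i
print(satisfies_r2(list(zip("ata", "aca"))))  # False: t:c before a
print(satisfies_r2(list(zip("ati", "ati"))))  # True: t:t is not restricted by R2
```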



Hindi Morphology



Hindi noun analysis

A. Noun analysis

Nouns are categorised into 20 different paradigms based on the following criteria:

1. Vowel ending.

2. Valid suffix of a word.

3. Gender, Number, Person and Case information.

A snapshot of the analysis is shown in Table 2.1.

There are 20,000 Nouns classified in 20 such paradigms.



Hindi verb analysis

B. Verb Analysis

The Verb Group represents the following grammatical properties:

1. Tense: Present, Past and Future.

2. Aspect: Durative, Stative, Infinitive, Habitual, Perfective, etc.

3. Modal: Abilitive, Deontic, Probabilitative, etc.

4. Gender: Male, Female, Dual.

5. Person: 1st, 2nd and 3rd.

These values form the basis for listing Verb Groups (VGs) according to their TAM-GNP values. A TAM-GNP matrix containing all possible VGs has been developed.

IITB morph analyzer: presently there are 622 unique paradigms in the TAM-GNP matrix.



Bengali Morphology



Morphology: Verb

Attribute 1: Root

Val 0: root word of the given surface form of the word

Attribute 2: Category

Val 0: verb (v)

Attribute 3: Person

Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4: formal (second/third)

Attribute 4: Tense

Val 0: Present, Val 1: Past, Val 2: Future

Attribute 5: Aspect

Val 0: simple, Val 1: continuous, Val 2: perfect

Attribute 6: Modality

Attribute 8: Specificity

Val 0: non-specific, Val 1: specific

Attribute 9: Emphasizer

Val 0: none, Val 1: only, Val 2: also

Attribute 10: Polarity

Val 0: positive, Val 1: negative



Attributes & Values (Verb) :

Person:

  • First Person-(1),Ami

  • Second Formal-(2),Apani

  • Second Normal-(3),tumi

  • Second Familiar-(4),tui

  • Third Normal-(5),se

  • Third Formal-(6),tini

  • Unspecified



Attributes & Values (Verb) :

Tense:

Present-(1),kari

Past-(2),karalAma

Future-(3),karaba

Overall-(4)



Attributes & Values (Verb) :

Aspect:

Simple-(1),karalAma

Habitual-(2),karatAma

Continuous-(3),karachhe

Perfect-(4),karechhi

Indefinite-(5),kari



Attributes & Values (Verb) :

Modality:

  • Indicative-(1),kara

  • Imperative-(2),kar

  • Subjunctive-(3),karale



Attributes & Values (Verb) :

Polarity:

Positive-(1),kari

Negative-(2),karini


INFORMATION: VERBS

  • Total number of categories (based on syllabic structure): 20

  • Rules: 214 per category

  • Total number of rules: 214 × 20 = 4280 (approx.)



Bengali Verb Paradigms



Bengali Verb morphology for one of the paradigms



Classification : Nouns

  • Morphological Classification Based on Different Types of Nouns:

  • 1.Animate (example: mAnuSha)

  • 2.Inanimate(example: mATi)

  • 3.Abstract/Qualitative(example: daYA)

  • 4.Verbal(example : bhojana)

  • 5.Collective(example: pAla)

  • 6.The Singular (example: chandra)

  • 7.Compounded(example: riksAoYAlA)



Sub Classification :Nouns

  • Sub Classification based on “Root Endings”:

  • 1.a-ending root (animate “mAnusha”)

  • 2.A- ending root (animate “bAlikA”)

  • 3.i- ending root (animate “pAkhi”)

  • 4.I- ending root (animate “khukI”)

  • 5.e- ending root (animate “chhele”)

  • 6.o- ending root (animate “myA;o”)

  • 7.u-ending root (animate “shishu”)

  • 8.U- ending root (animate “badhU”)



Classification :Pronouns

Morphological Analysis Based on Different Natures of Pronouns:

1.Personal (Ami,Apani,-)

2.Inclusive (saba,sakala,ubhaYa,-)

3.Relative(ye,yAhA,-)

4.Interrogative(ke,ki,-)

5.Denoting Others (anya,para,-)

6.Near Demonstrative (e,ihA,-)

7.Far Demonstrative (o,uhA,-)

8.Reflexive (nija,nijenije,-)

9.Indefinite (keu,kichhu,-)



Morphology : Pronoun

Attributes:

  • Number

    • Val 0: singular, Val 1: plural, Val 2: honorary plural

  • Form

    • Val 0: direct, Val 1: oblique

  • Specificity

    • Val 0: non-specific, Val 1: specific

  • Case

    • Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

  • Emphatic Marker

    • Val 0: none, Val 1: only, Val 2: also

  • Ellipses

    • Val 0: false, Val 1: true

  • Nature

  • Types



Bengali POS Categories (Noun)

Bengali Noun has the following attributes:

Number, Specificity, Ellipses, Form, Case and Emphasizer

  • Number has 2 values (Singular and Plural)

  • Specificity has 2 values (Specific and non_specific)

  • Ellipses has 2 values (Elliptic and non_elliptic)

  • Form has 2 values (Direct and Oblique)

  • Case has 5 values (Nominative, Accusative, Genitive, Locative, Instrumental)

  • Emphasizer has 3 values (None, Only, Also)



Adjective Morphology

Root

  • Val 0: root word of the given surface form of the word

    Specificity

  • Val 0: non-specific, Val 1: specific

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also

    Degree

  • Val 0: normal, Val 1: superlative, Val 2: Comparative

    Gender

  • Val 0: masculine, Val 1: feminine, Val 2: neuter



Adverb Morphology

Root

  • Val 0: root word of the given surface form of the word

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also

    Degree

  • Val 0: normal, Val 1: superlative, Val 2: Comparative



Postposition Morphology

Root

  • Val 0: root word of the given surface form of the word

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also



Morphological Generator

Developed at

IIT Kharagpur



Introduction

Morphological Generator uses certain linguistic resources and generates the surface form from a given input.

The following linguistic resources are required

  • Root Dictionary

  • Morphological Rules

    • Rule/Attribute Type Declaration (RATD)

    • Morphotactics

    • Paradigm Tables

    • Orthographic Rewrite Rules

  • Exception List



Format of the root dictionary

<root_word>:<category, paradigm_no;>+

  • root_word: The root word in UTF-8

  • category: Part-of-speech category

  • paradigm_no: A specific non-negative number referring to the paradigm table to be used for generation of the surface form for the root_word, when used as a particular POS-category.

  • +: denotes one or more occurrences of <category, paradigm_no;>

    Example for Hindi:

  • कर: NN,0; VM,1;

  • आम: NN,1; JJ, 0;
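A minimal Python sketch of reading one line in this format is given below. The function name `parse_root_entry` is hypothetical, and the sketch assumes that whitespace around the separators is insignificant, as the two Hindi examples suggest.

```python
def parse_root_entry(line):
    """Parse '<root_word>:<category, paradigm_no;>+' into (root, [(category, paradigm_no), ...])."""
    root, sep, rest = line.partition(":")
    if not sep:
        raise ValueError("missing ':' after the root word")
    entries = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:  # text after the final ';' is empty
            continue
        category, paradigm_no = (part.strip() for part in chunk.split(","))
        entries.append((category, int(paradigm_no)))
    if not entries:
        raise ValueError("at least one <category, paradigm_no;> pair is required")
    return root.strip(), entries

print(parse_root_entry("कर: NN,0; VM,1;"))  # ('कर', [('NN', 0), ('VM', 1)])
print(parse_root_entry("आम: NN,1; JJ, 0;"))  # ('आम', [('NN', 1), ('JJ', 0)])
```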


RATD

The first line of the RATD is

<#categories> <cat_tag>+

#categories: The total number of distinct categories for which morphological generation is required.

cat_tag: The category tag, as used in the root dictionary, of a category for which generation is required.

Example:

3 NN QC VM



RATD

This is followed by the declarations for the #categories categories. The declaration for each category consists of a meta declaration line followed by #morphotactics lines specifying the morphotactic rules. The meta declaration for a category is as follows:

<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values_for_attribute>+

  • cat_tag: As defined above

  • file_name: The name of the file that contains the morphotactics, paradigm tables and rewrite rules of the particular category.

  • #paradigms: Total number of paradigms for the category

  • #morphotactics: Total number of linear morphotactic rules for the category

  • #attributes: Total number of attributes that govern the morphology

  • #values_for_attribute: The number of values for each of the attributes.

    Example

    NN nn.txt 5 1 2 2 2
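The meta declaration can be read field by field, as in the following Python sketch. The class and function names are hypothetical, and the sketch assumes whitespace-separated fields exactly in the order given above.

```python
from dataclasses import dataclass

@dataclass
class CategoryDecl:
    cat_tag: str
    file_name: str
    n_paradigms: int
    n_morphotactics: int
    n_attributes: int
    values_per_attribute: list

def parse_meta_declaration(line):
    """Parse '<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values>+'."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("meta declaration is too short")
    cat_tag, file_name = fields[0], fields[1]
    n_paradigms, n_morphotactics, n_attributes = (int(x) for x in fields[2:5])
    values = [int(x) for x in fields[5:]]
    if len(values) != n_attributes:
        raise ValueError("expected one value count per attribute")
    return CategoryDecl(cat_tag, file_name, n_paradigms,
                        n_morphotactics, n_attributes, values)

decl = parse_meta_declaration("NN nn.txt 5 1 2 2 2")
print(decl.n_paradigms, decl.n_morphotactics, decl.values_per_attribute)  # 5 1 [2, 2]
```

In the example line, NN has 5 paradigms, 1 morphotactic rule and 2 attributes with 2 values each.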



Morphotactics

The morphotactics are specified linearly in the following format

{ ‘(’ { attribute_id, }+ ‘)’ }+

  • For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking for the features 0 and 2 is followed by the suffix marking feature 3 and then the suffix marking the features 1 and 4.

  • We assume a linear morphology

  • We assume that inflections are in the form of suffixes only (i.e. no prefix or infix)

  • In the above example, it is not possible to split the suffixes marking for features 0 and 2, and 1 and 4. In other words, the suffixes for these features are fusional as far as (0,2) or (1,4) feature combinations are considered, but the morphology is agglutinative in general.

  • There can be more than one morphotactic rule for a category in a language. In that case, the first rule is taken as the default one, whereas the other rules are triggered only under special circumstances, which are specified in the rule by assigning a specific value to a feature. For example, (0, 2=5)(3)(1, 4) means that the rule is triggered only when Attribute 2 has the value 5.
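The linear format and the default/triggered distinction described above could be handled along the following lines. This is a sketch under the assumption that a trigger is always written as attribute=value inside a group; `parse_morphotactic` and `select_rule` are names invented for the example.

```python
import re

def parse_morphotactic(rule):
    """Parse e.g. '(0, 2=5)(3)(1, 4)' into groups of (attribute_id, required_value or None)."""
    groups = []
    for body in re.findall(r"\(([^()]*)\)", rule):
        group = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            if "=" in item:
                attr, value = item.split("=")
                group.append((int(attr), int(value)))
            else:
                group.append((int(item), None))
        groups.append(group)
    return groups

def select_rule(rules, features):
    """Return the first non-default rule whose conditions all hold, else the default (first) rule."""
    parsed = [parse_morphotactic(r) for r in rules]
    for groups in parsed[1:]:
        conditions = [(a, v) for group in groups for a, v in group if v is not None]
        if conditions and all(features.get(a) == v for a, v in conditions):
            return groups
    return parsed[0]

RULES = ["(0, 2)(3)(1, 4)", "(0, 2=5)(3)(1, 4)"]
print(parse_morphotactic(RULES[1]))
# [[(0, None), (2, 5)], [(3, None)], [(1, None), (4, None)]]
print(select_rule(RULES, {2: 5}) == parse_morphotactic(RULES[1]))  # True
```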



Morphotactics example

  • Bengali noun morphology

  • Attribute 0: Number Val 0: singular, Val 1: plural

  • Attribute 1: Obliqueness Val 0: direct, Val 1: oblique

  • Attribute 2: Specificity Val 0: non-specific, Val 1: specific

  • Attribute 3: Case Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

  • Attribute 4: Emphasizer Val 0: none, Val 1: only, Val 2: also

  • Attribute 5: Ellipses Val 0: false, Val 1: true

    Bengali nouns follow one of the following two morphotactics

    • (0,1,2)(3)(4)

    • (0,1,2)(5=1)(0,1,2)(3)(4)

      The second rule is triggered only in the case of ellipses.



Paradigm Table

  • The category specific files (e.g. nn.txt in the earlier example) store the paradigm tables and orthographic rewrite rules.

  • There is a paradigm table for every paradigm number and for each feature or feature combination in the morphotactics. Thus, if Bengali nouns have #paradigms paradigms, there are 4 × #paradigms paradigm tables. The 4 tables per paradigm correspond to (0,1,2), (3), (4), and (5).

  • However, several paradigms might share some of the tables. Therefore, in the declaration, a particular table can stand for more than one paradigm.


A paradigm table contains the list of suffixes for a particular combination of attributes.

<ParadigmTable

<Attributes a1, a2>

<ParadigmNumber x1, x2, x3>

<Suffixes s11, s12, s13,…, s21, s22, s23,…>

The number of suffixes in a table is equal to the product of the numbers of values of the attributes in that combination.

Example: If the combination is (0,1), the 1st attribute has 10 values and the 2nd attribute has 3 values, then the table for the combination (0,1) will contain 10 × 3 = 30 suffixes (some of which may be NULL).
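The size check implied by this rule is a one-line product. The sketch below assumes the per-attribute value counts are available as a mapping from attribute id to count; the function name is hypothetical.

```python
from math import prod

def expected_suffix_count(combination, values_per_attribute):
    """Number of suffixes a table for this attribute combination must contain."""
    return prod(values_per_attribute[attr] for attr in combination)

# Attribute 0 has 10 values and attribute 1 has 3 values, as in the example.
values_per_attribute = {0: 10, 1: 3}
print(expected_suffix_count((0, 1), values_per_attribute))  # 30
```

A loader could compare this number with the length of the suffix list it reads and reject tables that are too short or too long.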


Orthographic Rules

Orthographic rules are specified as rewrite rules of the following form:

input → output / left_context, right_context

We also have provisions to specify two-layer rules, where the top layer specifies the rule on strings and the bottom layer indicates the features.

Thus, a rule of type

input → output / left_context, right_context

[att1] [root], [att2]

means that when the suffix corresponding to the attribute att1 has the pattern input, and it is immediately preceded by the pattern left_context, which belongs to the root, and followed by the pattern right_context, which belongs to another suffix corresponding to some attribute att2, then input should be replaced by the pattern output.



RATD for Bengali

  • 11 NN QC VM PN AV AJ PS OT UT QF QO

  • NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3

  • QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3

  • VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2

  • PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3

  • AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3

  • AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3

  • PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3

  • OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1

  • UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3

  • QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3

  • QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3

  • symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY


Orthographic Rules

The format is similar to two-level morphological rules. Each rule has 4 parts:

input:output/left_context,right_context

Here input is changed to output provided that input is preceded by left_context and followed by right_context. The suffix is terminated by #.

Example:

“give^ing# = giving” can be written as the rule

Rule 1: e^:NULL/giv,ing#

If we say that all “e-ending” words are inflected like “give”, then we can write the rule

Rule 2: e^:NULL/*,ing#

If we say that all “a-ending” and “o-ending” words are simply concatenated with “ing#”, we can write

Rule 3: ^:NULL/*~,ing# (where the ~ symbol means either ‘a’ or ‘o’)
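As a rough illustration of how Rules 1 to 3 behave, the Python sketch below translates each rule into a regular expression. This is a swapped-in technique for the example only (the slides use FSMs), and it assumes that * matches any string, ~ matches a or o, rules are tried in order, and only the first triggered rule is applied.

```python
import re

def parse_rule(rule):
    """Split 'input:output/left_context,right_context' into its 4 parts."""
    inp, rest = rule.split(":", 1)
    out, contexts = rest.split("/", 1)
    left, right = contexts.split(",", 1)
    return inp, ("" if out == "NULL" else out), left, right

def context_to_regex(context):
    """Translate the context notation: '*' is any string, '~' is either 'a' or 'o'."""
    parts = []
    for ch in context:
        if ch == "*":
            parts.append(".*")
        elif ch == "~":
            parts.append("[ao]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def apply_rules(word, rules):
    """Apply the first triggered rule, then drop the ^ and # markers."""
    for rule in rules:
        inp, out, left, right = parse_rule(rule)
        pattern = "({})({})({})".format(
            context_to_regex(left), re.escape(inp), context_to_regex(right))
        match = re.search(pattern, word)
        if match:
            word = word[:match.start(2)] + out + word[match.end(2):]
            break
    return word.replace("^", "").replace("#", "")

RULES = ["e^:NULL/giv,ing#", "e^:NULL/*,ing#", "^:NULL/*~,ing#"]
print(apply_rules("give^ing#", RULES))  # giving
print(apply_rules("go^ing#", RULES))    # going
```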


Orthographic Rules Contd.

Orthographic rules are best implemented as deterministic finite-state machines (FSMs).

The FSM decides whether a rule is satisfied by the input word. If it is, finding the portion to be replaced is straightforward.

If, following the FSM, the input word reaches the final state, we say that the rule is triggered.

If no orthographic rule is triggered, the suffix is simply concatenated.
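The triggering test can be sketched as a small deterministic machine for Rule 2 (e^:NULL/*,ing#). The state layout here is an assumption made for the example, not the author's machine: state k means that the last k symbols read match the first k symbols of e^ing#.

```python
PATTERN = "e^ing#"  # input "e^" followed by the right context "ing#"; the left context * is unconstrained

def build_dfa(pattern):
    """Build a transition table where state k means: the last k symbols match pattern[:k]."""
    alphabet = set(pattern)
    table = []
    for state in range(len(pattern) + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            # Fall back to the longest prefix of the pattern that is a suffix of what was seen.
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table

def is_triggered(word, pattern=PATTERN):
    """True if the machine is in its final state after reading the whole word."""
    table = build_dfa(pattern)
    state = 0
    for ch in word:
        state = table[state].get(ch, 0)  # symbols outside the pattern reset the machine
    return state == len(pattern)

print(is_triggered("give^ing#"))  # True
print(is_triggered("go^ing#"))    # False
```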


Building FSM

Example FSM for Rule 2:

e^:NULL/*,ing#

[Figure: state diagram of the FSM with states S, A, B, C, D, E, F, G and H; transitions are labelled e, ^, i, n, g, #, *, *-e, *-e-^, *-i, *-n, *-g and *-#.]



Orthographic Rules for Bengali Verb

  • oYA^L:o/*A,*

  • no^e:Ya/*A,K

  • oYA^e:Ya/*A,K

  • AoYA^:eY/X,echh*

  • yAoYA^:giY/*,echh*

  • AoYA^:e/X,M*

  • oYA^:NULL/*A,b*

  • AoYA^:e/y,t*

  • yAoYA^:ge/*,l*

  • eoYA^:iY/*,echh*

  • eoYA^:i;/*,iK

  • eoYA^:ich/*,chh*

  • eoYA^:i/*,P*

  • oYA^:NULL/*e,Q*

  • eoYA^u:i/*,*

  • eoYA^:NULL/*,R*

  • eoYA^:A/*,o*

  • eoYA^a:Ao/*,ni

  • eoYA^e:eYa/*,K

  • oYA^:uY/*$,echh*

  • oYA^:uch/*$,chh*

  • oYA^:u/*$,V*

  • oYA^i:u/*$,sa*

  • YA^:;/*$o,o#

  • YA^a:;o/*$o,ni*

  • YA^e:NULL/*$o,naK

  • YA^u:NULL/*$o,*

  • A^e:a/*$oY,K

  • ^y:;i/*,#

  • ^a:;o/*,#

  • AWA^aie:eWe/*,#

  • yAoYA^aie:giYe/*,*

  • AoYA^aie:eYe/X,#

  • eoYA^aie:iYe/*,#

  • Ano^aie:iYe/*,#

  • oYA^aie:uYe/*$,#

  • ^y:;i/*,#

  • ^a:;o/*,#

  • AWA^aie:eWe/*,#

  • A^aie:e/*,#

  • yAoYA^aie:giYe/*,*

  • AoYA^aie:eYe/X,#

  • eoYA^aie:iYe/*,#

  • Ano^aie:iYe/*,#

  • oYA^aie:uYe/*$,#

  • AWA^aie:eWe/*,#

  • A^:NULL/B,~*

  • A^:a/B,$*

  • no^:ch/*A,chh*

  • oYA^:ch/*A,chh*

  • Ano^:iY/*,echh*

  • no^:NULL/*A,E*

  • no^F:NULL/*A,G*

  • oYA^F:NULL/*A,G*

  • no^:NULL/*A,iK

  • oYA^:NULL/*A,iK

  • no^L:o/*A,*


Input Format

The input to the morphological generator starts with the root of the word, followed by the POS category and the attribute names with their values.

Example:

karA VM Person 3 Tense 2 Emp 2

In Bengali, Person and Tense combine to give a suffix, which is added first; Emphasizer gives another suffix, which is added next.

See the morphotactics for the Bengali verb.
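Reading an input line of this shape is straightforward. The following Python sketch assumes whitespace-separated tokens and integer attribute values; `parse_generator_input` is a hypothetical name.

```python
def parse_generator_input(line):
    """Split '<root> <POS> (<attribute> <value>)*' into (root, category, {attribute: value})."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected at least a root and a POS category")
    root, category, rest = tokens[0], tokens[1], tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError("every attribute name needs a value")
    attributes = {name: int(value) for name, value in zip(rest[0::2], rest[1::2])}
    return root, category, attributes

print(parse_generator_input("karA VM Person 3 Tense 2 Emp 2"))
# ('karA', 'VM', {'Person': 3, 'Tense': 2, 'Emp': 2})
```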


Input Format Contd.

In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10 values. The suffixes in the paradigm table are arranged in the following way.

First entry is Person 0 Tense 0

Second entry is Person 0 Tense 1

Third entry is Person 0 Tense 2 …

10th entry is Person 0 Tense 9

11th entry is Person 1 Tense 0

So Person 3 Tense 2 will be the entry number

(Person input) × (number of TAM values) + (TAM input) + 1

= 3 × 10 + 2 + 1 = 33

Get the 33rd entry from the paradigm table for (0,1) and apply the orthographic rules to get the correct word.
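The entry-number arithmetic generalizes to any number of attributes in a combination: the table is laid out with the last attribute varying fastest. A minimal Python sketch, with a hypothetical function name:

```python
def entry_number(values, sizes):
    """1-based position of a suffix in a table whose last attribute varies fastest.

    values: the input value of each attribute in the combination
    sizes:  the number of possible values of each attribute
    """
    index = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError("attribute value out of range")
        index = index * size + value
    return index + 1

# Person has 6 values and TAM has 10 values.
print(entry_number([0, 0], [6, 10]))  # 1  (Person 0 Tense 0)
print(entry_number([1, 0], [6, 10]))  # 11 (Person 1 Tense 0)
print(entry_number([3, 2], [6, 10]))  # 33 (Person 3 Tense 2)
```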



Bengali Verb Paradigms and Morphotactics

<ParadigmTable

<Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */

<suffixes

i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini isa chhisa echhisa li chhili echhili bi tisa NULL isani o chha echha le chhile echhile be te NULL ani

ena chhena echhena lena chhilena echhilena bena tena una enani e chhe echhe la chhila echhila be ta uka eni ena chhena echhena lena chhilena echhilena bena tena una enani

>>

<ParadigmTable

<Attributes 3 > /*Case*/

<suffixes

NULL i o

>>

Morphotactic rule

  • (0,1)(2)(3)

  • (3=2)(2)



Bengali Noun Paradigms and Morphotactics

<ParadigmTable

<Attributes 0 1 2 >/* Number, Specificity, Ellipses 2×2×2 = 8 entries*/

<suffixes

NULL eraTA TA NULL gulo guloraTA NULL NULL

>>

<ParadigmTable

<Attributes 3 4 >/* Form, Case 2 × 5 = 10 entries */

<suffixes

NULL ke NULL ete ete NULL NULL era NULL NULL

>>

<ParadigmTable

<Attributes 5 >/* Emphasizer 3 entries */

<suffixes

NULL i o

>>

Morphotactic rule

(0,1,2)(3,4)(5)



Example (Bengali Verb)

Example: the Input is

balA Verb Person 1 TAM 1 Case 0

First Morphotactic rule is triggered.

Person can have 6 values and TAM can have 10 values. So the extracted suffix number from the paradigm table 1,2 is

10×(Person value) +(TAM value) + 1 = 10×1 + 1 + 1 = 12

i.e., chhisa is to be added first.

From the paradigm table (3) extracted suffix is NULL.

i.e., NULL is to be added next.


Example Contd.

Now balA^chhisa# is the input, for which a suitable orthographic rule is sought.

Suppose there is an orthographic rule

A^:a/B,$* where B: *-Y and $: consonant

Then the FSM for this rule brings the input to the final state, i.e., the rule is triggered. Now “A^” is replaced by “a” and the output is “balachhisa”.
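The whole worked example can be replayed in a short Python sketch. The suffix table is copied from the Bengali verb paradigm slide; the regular expression stands in for the FSM, and the definition of a consonant as any ASCII letter outside a small vowel set is an assumption made for this illustration. The second suffix is NULL in the example and is therefore omitted.

```python
import re
import string

# Person x TAM suffix table from the verb paradigm slide (6 persons x 10 TAM values).
SUFFIXES = (
    "i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini "
    "isa chhisa echhisa li chhili echhili bi tisa NULL isani "
    "o chha echha le chhile echhile be te NULL ani "
    "ena chhena echhena lena chhilena echhilena bena tena una enani "
    "e chhe echhe la chhila echhila be ta uka eni "
    "ena chhena echhena lena chhilena echhilena bena tena una enani"
).split()
N_TAM = 10

# Assumption: a consonant is any ASCII letter that is not in this vowel set.
VOWELS = "aAiIuUeo"
CONSONANTS = "".join(c for c in string.ascii_letters if c not in VOWELS)

# Rule A^:a/B,$* with B = any symbol except Y and $ = a consonant.
RULE = re.compile(r"(?<=[^Y])A\^(?=[" + CONSONANTS + "])")

def generate(root, person, tam):
    """Look up the Person/TAM suffix, attach it and apply the orthographic rule."""
    entry = N_TAM * person + tam + 1          # 1-based entry number
    suffix = SUFFIXES[entry - 1]
    suffix = "" if suffix == "NULL" else suffix
    word = root + "^" + suffix + "#"          # e.g. balA^chhisa#
    word = RULE.sub("a", word)                # replace "A^" by "a" when triggered
    return word.replace("^", "").replace("#", "")

print(generate("balA", 1, 1))  # balachhisa
```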


Exception List:

Words whose orthographic changes do not match those of other words, or which change completely when inflected, are said to be exceptions.

Handling such words through orthographic rules would require a large number of highly complex rules.

We handle these words by listing them in a separate file, which contains each exception word along with all of its inflections.



Morph Analyzer

