Morphology 2: A case study of developing Bengali morph analyzer and generator

Morphology 2 A case study of developing Bengali morph analyzer and generator. Sudeshna Sarkar IIT Kharagpur. Two level morphology. PC-KIMMO, a morphological parser based on Kimmo Koskenniemi's model of two-level morphology ( Koskenniemi 1983 ).


Morphology 2: A case study of developing Bengali morph analyzer and generator

Sudeshna Sarkar

IIT Kharagpur


Two level morphology

  • PC-KIMMO is a morphological parser based on Kimmo Koskenniemi's model of two-level morphology (Koskenniemi 1983).

  • Koskenniemi's model of two-level morphology was based on the traditional distinction that linguists make between

    • morphotactics, which enumerates the inventory of morphemes and specifies in what order they can occur, and

    • morphophonemics, which accounts for alternate forms or "spellings" of morphemes according to the phonological context in which they occur.


  • For example, the word chased is analyzed morphotactically as the stem chase followed by the suffix -ed.

  • However, the addition of the suffix -ed apparently causes the loss of the final e of chase; thus chase and chas are allomorphs or alternate forms of the same morpheme.

  • Koskenniemi's model is "two-level" in the sense that a word is represented as a direct, letter-for-letter correspondence between its lexical or underlying form and its surface form. For example, the word chased is given this two-level representation (where + is a morpheme boundary symbol and 0 is a null character):

    Lexical form: c h a s e + e d

    Surface form: c h a s 0 0 e d
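The letter-for-letter correspondence above can be illustrated with a minimal Python sketch. The function name `surface_string` and the `NULL` constant are illustrative assumptions, not part of PC-KIMMO; the sketch only pairs up the two levels and spells out the surface word by dropping the null character.

```python
NULL = "0"  # null character, as used in the two-level representation above

def surface_string(lexical: str, surface: str) -> str:
    """Check that both levels align symbol by symbol, then spell the surface word."""
    lex = lexical.split()
    sur = surface.split()
    if len(lex) != len(sur):
        raise ValueError("lexical and surface forms must have the same length")
    # Null characters occupy a position in the alignment but are not written.
    return "".join(ch for ch in sur if ch != NULL)

# Each pair is one lexical:surface correspondence, e.g. ('e', '0') and ('+', '0').
pairs = list(zip("c h a s e + e d".split(), "c h a s 0 0 e d".split()))
print(pairs)
print(surface_string("c h a s e + e d", "c h a s 0 0 e d"))
```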


Main components of Karttunen's KIMMO parser

  • the rules component: two-level rules that account for regular phonological or orthographic alternations, such as chase versus chas.

  • the lexical component: lists all morphemes (stems and affixes) in their lexical form and specifies morphotactic constraints.


Generative rules and 2-level rules

  • Two-level rules are similar to the rules of standard generative phonology, but differ in several crucial ways. Rule R1 is an example of a generative rule.

    R1 t ---> c / ___ i

    Rule R2 is the analogous two-level rule.

    R2 t:c => ___ i

    Generative rules

    • Transformational rules

    • Sequential application

    • Unidirectional

    Two-level rules

    • Declarative: they state correspondences

    • Applied in parallel

    • Bidirectional
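To make the declarative reading of R2 concrete, here is a small Python sketch. It assumes the usual two-level interpretation of `=>`: the correspondence t:c may occur only before i, but lexical t is not forced to surface as c there. It also assumes that the bare context `i` stands for the pair i:i. The function name `satisfies_r2` is hypothetical.

```python
def satisfies_r2(lexical: str, surface: str) -> bool:
    """Return True if every t:c pair in the alignment is followed by i:i."""
    if len(lexical) != len(surface):
        raise ValueError("the two levels must be aligned symbol by symbol")
    pairs = list(zip(lexical, surface))
    for k, pair in enumerate(pairs):
        if pair == ("t", "c"):
            # The rule constrains a correspondence; it does not rewrite anything.
            if k + 1 >= len(pairs) or pairs[k + 1] != ("i", "i"):
                return False
    return True

print(satisfies_r2("tati", "taci"))  # t:c occurs before i
print(satisfies_r2("tati", "caci"))  # first t:c occurs before a
```

Because the function only inspects an alignment, the same check serves both analysis (surface to lexical) and generation (lexical to surface), which is the sense in which two-level rules are bidirectional.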



Hindi noun analysis

A. Noun analysis

Nouns are categorised into 20 different paradigms based on the following criteria:

1. Vowel ending.

2. Valid suffix of a word.

3. Gender, Number, Person and Case information.

A snapshot of the analysis is shown in Table 2.1.

There are 20,000 Nouns classified in 20 such paradigms.


Hindi verb analysis

B. Verb Analysis

The Verb Group represents the following grammatical properties:

1. Tense: Present, Past and Future.

2. Aspect: Durative, Stative, Infinitive, Habitual, Perfective etc.

3. Modal: Abilitive, Deontic, Probabilitative etc.

4. Gender: Male, Female, Dual.

5. Person: 1st, 2nd and 3rd.

These values formed the basis for listing Verb Groups (VGs) according to their TAM-GNP values. A TAM-GNP matrix containing all possible VGs was developed.

IITB morph analyzer: presently there are 622 unique paradigms in the TAM-GNP matrix.



Morphology: Verb

Attribute 1: Root

Val 0: root word of the given surface form of the word

Attribute 2: Category

Val 0: verb (v)

Attribute 3: Person

Val 0: first, Val 1: second normal, Val 2: second familiar, Val 3: third normal, Val 4: formal (second/third)

Attribute 4: Tense

Val 0: Present, Val 1: Past, Val 2: Future

Attribute 5: Aspect

Val 0: simple, Val 1: continuous, Val 2: perfect

Attribute 6: Modality

Attribute 8: Specificity

Val 0: non-specific, Val 1: specific

Attribute 9: Emphasizer

Val 0: none, Val 1: only, Val 2: also

Attribute 10: Polarity

Val 0: positive, Val 1: negative


Attributes & Values (Verb) :

Person:

  • First Person-(1),Ami

  • Second Formal-(2),Apani

  • Second Normal-(3),tumi

  • Second Familiar-(4),tui

  • Third Normal-(5),se

  • Third Formal-(6),tini

  • Unspecified


Attributes & Values (Verb) :

Tense:

Present-(1),kari

Past-(2),karalAma

Future-(3),karaba

Overall-(4)


Attributes & Values (Verb) :

Aspect:

Simple-(1),karalAma

Habitual-(2),karatAma

Continuous-(3),karachhe

Perfect-(4),karechhi

Indefinite-(5),kari


Attributes & Values (Verb) :

Modality:

  • Indicative-(1),kara

  • Imperative-(2),kar

  • Subjunctive-(3),karale


Attributes & Values (Verb) :

Polarity:

Positive-(1),kari

Negative-(2),karini


INFORMATION: VERBS

  • Total number of categories (based on syllabic structure): 20

  • Rules: 214 per category

  • Total number of rules: 214 × 20 = 4280 (approx.)




Classification : Nouns

  • Morphological Classification Based on Different Types of Nouns:

  • 1.Animate (example: mAnuSha)

  • 2.Inanimate(example: mATi)

  • 3.Abstract/Qualitative(example: daYA)

  • 4.Verbal(example : bhojana)

  • 5.Collective(example: pAla)

  • 6.The Singular (example: chandra)

  • 7.Compounded(example: riksAoYAlA)


Sub Classification :Nouns

  • Sub Classification based on “Root Endings”:

  • 1.a-ending root (animate “mAnusha”)

  • 2.A- ending root (animate “bAlikA”)

  • 3.i- ending root (animate “pAkhi”)

  • 4.I- ending root (animate “khukI”)

  • 5.e- ending root (animate “chhele”)

  • 6.o- ending root (animate “myA;o”)

  • 7.u-ending root (animate “shishu”)

  • 8.U- ending root (animate “badhU”)


Classification :Pronouns

Morphological Analysis Based on Different Natures of Pronouns:

1.Personal (Ami,Apani,-)

2.Inclusive (saba,sakala,ubhaYa,-)

3.Relative(ye,yAhA,-)

4.Interrogative(ke,ki,-)

5.Denoting Others (anya,para,-)

6.Near Demonstrative (e,ihA,-)

7.Far Demonstrative (o,uhA,-)

8.Reflexive (nija,nijenije,-)

9.Indefinite (keu,kichhu,-)


Morphology : Pronoun

Attributes:

  • Number

    • Val 0: singular, Val 1: plural, Val 2: honorary plural

  • Form

    • Val 0: direct, Val 1: oblique

  • Specificity

    • Val 0: non-specific, Val 1: specific

  • Case

    • Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

  • Emphatic Marker

    • Val 0: none, Val 1: only, Val 2: also

  • Ellipses

    • Val 0: false, Val 1: true

  • Nature

  • Types


Bengali POS Categories (Noun)

Bengali Noun has the following attributes:

Number, Specificity, Ellipses, Form, Case and Emphasizer

  • Number has 2 values (Singular and Plural)

  • Specificity has 2 values (Specific and non_specific)

  • Ellipses has 2 values (Elliptic and non_elliptic)

  • Form has 2 values (Direct and Oblique)

  • Case has 5 values (Nominative, Accusative, Genitive, Locative, Instrumental)

  • Emphasizer has 3 values (None, Only, Also)


Adjective Morphology

Root

  • Val 0: root word of the given surface form of the word

    Specificity

  • Val 0: non-specific, Val 1: specific

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also

    Degree

  • Val 0: normal, Val 1: superlative, Val 2: Comparative

    Gender

  • Val 0: masculine, Val 1: feminine, Val 2: neuter


Adverb Morphology

Root

  • Val 0: root word of the given surface form of the word

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also

    Degree

  • Val 0: normal, Val 1: superlative, Val 2: Comparative


Postposition Morphology

Root

  • Val 0: root word of the given surface form of the word

    Emphasizer

  • Val 0: none, Val 1: only, Val 2: also



Morphological Generator

Developed at

IIT Kharagpur


Introduction

The Morphological Generator uses a set of linguistic resources to generate the surface form from a given input.

The following linguistic resources are required

  • Root Dictionary

  • Morphological Rules

    • Rule/Attribute Type Declaration (RATD)

    • Morphotactics

    • Paradigm Tables

    • Orthographic Rewrite Rules

  • Exception List


Format of the root dictionary

<root_word>:<category, paradigm_no;>+

  • root_word: The root word in UTF-8

  • category: Part-of-speech category

  • paradigm_no: A specific non-negative number referring to the paradigm table to be used for generation of the surface form for the root_word, when used as a particular POS-category.

  • +: denotes one or more occurrences of <category, paradigm_no;>

    Example for Hindi:

  • कर: NN,0; VM,1;

  • आम: NN,1; JJ, 0;
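A line in this format can be read with a few lines of Python. The sketch below is only an illustration of the format described on this slide; the function name `parse_root_entry` is an assumption and is not taken from the IIT Kharagpur tool.

```python
def parse_root_entry(line: str):
    """Split '<root_word>:<category, paradigm_no;>+' into a root and its readings."""
    root, sep, rest = line.partition(":")
    if not sep:
        raise ValueError("missing ':' between root word and categories")
    entries = []
    for chunk in rest.split(";"):
        chunk = chunk.strip()
        if not chunk:          # skip the empty piece after the final ';'
            continue
        category, paradigm_no = (part.strip() for part in chunk.split(","))
        entries.append((category, int(paradigm_no)))
    if not entries:
        raise ValueError("at least one <category, paradigm_no;> is required")
    return root.strip(), entries

print(parse_root_entry("कर: NN,0; VM,1;"))
print(parse_root_entry("आम: NN,1; JJ, 0;"))
```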


RATD

The first line of the RATD is

<#categories> <cat_tag>+

#categories: The total number of distinct categories for which morphological generation is required.

cat_tag: The category tag, as used in the root dictionary, for which generation is required.

Example:

3 NN QC VM


RATD

This is followed by the declarations for each of the #categories categories. The declaration for each category consists of a meta declaration line followed by #morphotactics lines specifying the morphotactic rules. The meta declaration for a category is as follows:

<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values_for_attribute>+

  • cat_tag: As defined above

  • file_name: The name of the file that contains the morphotactics, paradigm tables and rewrite rules of the particular category.

  • #paradigms: Total number of paradigms for the category

  • #morphotactics: Total number of linear morphotactic rules for the category

  • #attributes: Total number of attributes that govern the morphology

  • #values_for_attribute: The number of values for each of the attributes.

    Example

    NN nn.txt 5 1 2 2 2
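As a rough illustration, the meta declaration can be parsed and checked for internal consistency as follows. The class and function names are hypothetical, and the sketch follows only the six-field layout described on this slide.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CategoryDecl:
    cat_tag: str
    file_name: str
    n_paradigms: int
    n_morphotactics: int
    values_per_attribute: List[int]  # one count per attribute

def parse_meta(line: str) -> CategoryDecl:
    """Parse '<cat_tag> <file_name> <#paradigms> <#morphotactics> <#attributes> <#values>+'."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("meta declaration is too short")
    cat_tag, file_name = fields[0], fields[1]
    n_paradigms, n_morphotactics, n_attributes = (int(f) for f in fields[2:5])
    values = [int(f) for f in fields[5:]]
    # The number of value counts must equal the declared number of attributes.
    if len(values) != n_attributes:
        raise ValueError("expected one value count per attribute")
    return CategoryDecl(cat_tag, file_name, n_paradigms, n_morphotactics, values)

print(parse_meta("NN nn.txt 5 1 2 2 2"))
```

For the example line, this reads 5 paradigms, 1 morphotactic rule and 2 attributes with 2 values each.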


Morphotactics

The morphotactics are specified linearly in the following format

{ ‘(’ { attribute_id, }+ ‘)’ }+

  • For example, the morphotactic rule (0, 2)(3)(1, 4) means that the suffix marking for the features 0 and 2 is followed by the suffix marking feature 3 and then the suffix marking the features 1 and 4.

  • We assume a linear morphology

  • We assume that inflections are in the form of suffixes only (i.e. no prefix or infix)

  • In the above example, it is not possible to split the suffixes marking for features 0 and 2, and 1 and 4. In other words, the suffixes for these features are fusional as far as (0,2) or (1,4) feature combinations are considered, but the morphology is agglutinative in general.

  • There can be more than one morphotactic rule for a category in a language. In that case, the first rule is taken as the default one, whereas the other rules are triggered only under special circumstances, which are to be specified with the rule by assigning some specific value to the feature, like (0, 2=5)(3)(1, 4) implies that the rule is triggered only when Attribute 2 has a value of 5.
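The linear rule notation is simple enough to parse with a regular expression. The following Python sketch is one possible reading of the format, with `parse_morphotactic` as an assumed name: it returns the ordered attribute groups and any trigger conditions of the form attribute=value.

```python
import re

def parse_morphotactic(rule: str):
    """Parse e.g. '(0, 2=5)(3)(1, 4)' into ([(0, 2), (3,), (1, 4)], {2: 5})."""
    groups = []
    triggers = {}
    for body in re.findall(r"\(([^()]*)\)", rule):
        group = []
        for item in body.split(","):
            item = item.strip()
            if not item:
                continue
            attr, eq, value = item.partition("=")
            attr_id = int(attr)
            group.append(attr_id)
            if eq:  # 'attribute=value' marks a triggering condition
                triggers[attr_id] = int(value)
        groups.append(tuple(group))
    return groups, triggers

print(parse_morphotactic("(0, 2)(3)(1, 4)"))
print(parse_morphotactic("(0, 2=5)(3)(1, 4)"))
```

Each tuple corresponds to one suffix slot; attributes inside the same tuple are marked by a single fused suffix.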


Morphotactics example

  • Bengali noun morphology

  • Attribute 0: Number Val 0: singular, Val 1: plural

  • Attribute 1: Obliqueness Val 0: direct, Val 1: oblique

  • Attribute 2: Specificity Val 0: non-specific, Val 1: specific

  • Attribute 3: Case Val 0: Nom., Val 1: Acc., Val 2: Genitive, Val 3: Locative

  • Attribute 4: Emphasizer Val 0: none, Val 1: only, Val 2: also

  • Attribute 5: Ellipses Val 0: false, Val 1: true

    Bengali nouns follow one of the following two morphotactics

    • (0,1,2)(3)(4)

    • (0,1,2)(5=1)(0,1,2)(3)(4)

      The second rule is triggered only in the case of ellipses.
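Choosing between the two noun rules can be sketched as below. The representation of a rule as a pair (groups, triggers) and the name `select_rule` are assumptions made for this illustration; the first rule acts as the default and a later rule is used only when all its trigger values match the input features.

```python
# Each rule: (ordered attribute groups, trigger conditions {attribute_id: value}).
NOUN_RULES = [
    ([(0, 1, 2), (3,), (4,)], {}),                         # default rule
    ([(0, 1, 2), (5,), (0, 1, 2), (3,), (4,)], {5: 1}),    # triggered by ellipses
]

def select_rule(features, rules=NOUN_RULES):
    """Return the attribute groups of the rule that applies to these features."""
    for groups, triggers in rules[1:]:
        if triggers and all(features.get(a) == v for a, v in triggers.items()):
            return groups
    return rules[0][0]

print(select_rule({0: 1, 5: 0}))  # no ellipsis: default rule
print(select_rule({0: 1, 5: 1}))  # ellipsis: second rule
```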


Paradigm Table

  • The category specific files (e.g. nn.txt in the earlier example) store the paradigm tables and orthographic rewrite rules.

  • There is a paradigm table for every paradigm number and for each feature or feature combination in the morphotactics. Thus, if there are #paradigms paradigms for Bengali nouns, then there are 4*#paradigms paradigm tables. The 4 tables per paradigm correspond to (0,1,2), (3), (4), and (5).

  • However, several paradigms might share some of the tables. Therefore, in the declaration, a particular table can stand for more than one paradigm.


A paradigm table contains the list of suffixes for a particular combination of attributes.

<ParadigmTable

<Attributes a1, a2>

<ParadigmNumber x1, x2, x3>

<Suffixes s11, s12, s13,…, s21, s22, s23,…>

The number of suffixes in a table is equal to the product of the numbers of values of the attributes in that combination.

Example: If the combination is (0,1), the 1st attribute has 10 values and the 2nd attribute has 3 values, the table for the combination (0,1) will contain 10 × 3 = 30 suffixes (some of which may be NULL).
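The size constraint lends itself to a simple sanity check when loading a table. In this sketch the helper names `table_size` and `check_table` are hypothetical; the second call uses the eight-entry table that appears later in the deck for Bengali nouns.

```python
from math import prod

def table_size(values_per_attribute):
    """Number of suffix slots for one attribute combination."""
    return prod(values_per_attribute)

def check_table(suffixes, values_per_attribute):
    """Return True if the suffix list has exactly one entry per value combination."""
    return len(suffixes) == table_size(values_per_attribute)

print(table_size([10, 3]))
print(check_table("NULL eraTA TA NULL gulo guloraTA NULL NULL".split(), [2, 2, 2]))
```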


Orthographic Rules

Orthographic rules are specified as rewrite rules of the following form

input → output / left_context, right_context

We also have provisions to specify two-layer rules, where the top layer specifies the rule on strings and the bottom layer indicates the features.

Thus, a rule of type

input → output / left_context, right_context

[att1] [root], [att2]

means that when the suffix corresponding to the attribute att1 has the pattern input, and it is immediately preceded by the pattern left_context, which belongs to the root, and followed by the pattern right_context, which belongs to another suffix corresponding to some attribute att2, then input should be replaced by the pattern output.


RATD for Bengali

  • 11 NN QC VM PN AV AJ PS OT UT QF QO

  • NN nn.txt nn_rule.txt mean_noun.txt 1 1 6 2 2 2 2 5 3

  • QC qc.txt qc_rule.txt mean_card.txt 1 1 4 4 2 2 3

  • VM vm.txt vm_rule.txt mean_verb.txt 1 2 5 6 10 3 2 2

  • PN pn.txt pn_rule.txt mean_pron.txt 1 2 7 2 2 2 2 2 5 3

  • AV av.txt av_rule.txt mean_adv.txt 1 1 2 3 3

  • AJ aj.txt aj_rule.txt mean_adj.txt 1 1 2 3 3

  • PS ps.txt ps_rule.txt mean_psp.txt 1 1 1 3

  • OT ot.txt ot_rule.txt mean_oth.txt 1 1 1 1

  • UT ut.txt ut_rule.txt mean_quot.txt 1 1 1 3

  • QF qf.txt qf_rule.txt mean_quan.txt 1 1 2 2 3

  • QO qo.txt qo_rule.txt mean_ord.txt 1 1 1 3

  • symbols: aAbcdDeghiIjklmn.;NoprsStTuUyY


Orthographic Rules

The format is similar to two-level morphological rules. Each rule has 4 parts

input:output/left_context,right_context

Here input is changed to output provided input is preceded by left_context and followed by right_context. The suffix is terminated by #.

Example:

“give^ing# = giving” can be written by the rule

Rule 1: e^:NULL/giv,ing#

If we say all “e-ending” words are inflected like “give” then we can write the rule

Rule 2: e^:NULL/*,ing#

If we say all “a-ending” and “o-ending” words are simply concatenated when added with “ing#” we can write

Rule 3: ^:NULL/*~,ing# (Where ~ symbol means either ‘a’ or ‘o’)
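Rules 1 and 2 can be tried out with a small regular-expression based sketch. This is not the generator's actual implementation: the function names are hypothetical, only literal contexts and the `*` wildcard are handled (class symbols such as `~` are not), and an unmatched word is simply concatenated.

```python
import re

def _to_regex(context: str) -> str:
    """Turn a context into a regex: '*' matches any string, the rest is literal."""
    return "".join(".*" if ch == "*" else re.escape(ch) for ch in context)

def apply_rule(word: str, rule: str):
    """Apply 'input:output/left_context,right_context'; return None if not triggered."""
    change, _, context = rule.partition("/")
    source, _, target = change.partition(":")
    left, _, right = context.partition(",")
    target = "" if target == "NULL" else target
    pattern = re.compile(
        f"^({_to_regex(left)}){re.escape(source)}({_to_regex(right)})$"
    )
    match = pattern.match(word)
    if match is None:
        return None
    return match.group(1) + target + match.group(2)

def generate(word: str, rules) -> str:
    """Use the first rule that is triggered, otherwise plain concatenation."""
    result = None
    for rule in rules:
        result = apply_rule(word, rule)
        if result is not None:
            break
    if result is None:
        result = word
    # Drop the boundary marker '^' and the end-of-suffix marker '#'.
    return result.replace("^", "").replace("#", "")

print(generate("give^ing#", ["e^:NULL/giv,ing#"]))  # Rule 1
print(generate("make^ing#", ["e^:NULL/*,ing#"]))    # Rule 2
print(generate("go^ing#", ["e^:NULL/*,ing#"]))      # no rule triggered
```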


Orthographic Rules Contd.

The orthographic rules are best implemented as deterministic FSMs.

The FSM decides whether a rule is satisfied by the input word. If it is, finding the portion to be replaced is not difficult.

If, following the FSM, the input word reaches the final state, we say the rule is triggered.

If no orthographic rule is triggered, the suffix is simply concatenated.


Building FSM

Example FSM for Rule 2:

e^:NULL/*,ing#

[Figure: state diagram of the FSM for Rule 2, with states S and A to H and transitions labelled e, ^, i, n, g, #, * and complements such as *-e, *-e-^, *-i, *-n, *-g, *-#.]
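A deterministic recognizer for Rule 2 can be sketched in a few lines of Python. This is a hand-built approximation, not a transcription of the slide's diagram: states are numbered 0 to 6 instead of lettered, and the fallback transition relies on the fact that `e` occurs only at the start of the pattern `e^ing#`.

```python
PATTERN = "e^ing#"  # input 'e^' followed by the right context 'ing#'

def step(state: int, ch: str) -> int:
    """One deterministic transition; state k means k pattern symbols are matched."""
    if state < len(PATTERN) and ch == PATTERN[state]:
        return state + 1
    # On a mismatch, restart. A fresh 'e' already matches the first symbol.
    return 1 if ch == PATTERN[0] else 0

def rule_triggered(word: str) -> bool:
    """The rule is triggered if the whole word leaves the FSM in the final state."""
    state = 0
    for ch in word:
        state = step(state, ch)
    return state == len(PATTERN)

print(rule_triggered("give^ing#"))
print(rule_triggered("go^ing#"))
```

The unconstrained left context `*` is covered by the initial state, which loops on any symbol until an `e` is seen.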


Orthographic Rules for Bengali Verb

  • oYA^L:o/*A,*

  • no^e:Ya/*A,K

  • oYA^e:Ya/*A,K

  • AoYA^:eY/X,echh*

  • yAoYA^:giY/*,echh*

  • AoYA^:e/X,M*

  • oYA^:NULL/*A,b*

  • AoYA^:e/y,t*

  • yAoYA^:ge/*,l*

  • eoYA^:iY/*,echh*

  • eoYA^:i;/*,iK

  • eoYA^:ich/*,chh*

  • eoYA^:i/*,P*

  • oYA^:NULL/*e,Q*

  • eoYA^u:i/*,*

  • eoYA^:NULL/*,R*

  • eoYA^:A/*,o*

  • eoYA^a:Ao/*,ni

  • eoYA^e:eYa/*,K

  • oYA^:uY/*$,echh*

  • oYA^:uch/*$,chh*

  • oYA^:u/*$,V*

  • oYA^i:u/*$,sa*

  • YA^:;/*$o,o#

  • YA^a:;o/*$o,ni*

  • YA^e:NULL/*$o,naK

  • YA^u:NULL/*$o,*

  • A^e:a/*$oY,K

  • ^y:;i/*,#

  • ^a:;o/*,#

  • AWA^aie:eWe/*,#

  • yAoYA^aie:giYe/*,*

  • AoYA^aie:eYe/X,#

  • eoYA^aie:iYe/*,#

  • Ano^aie:iYe/*,#

  • oYA^aie:uYe/*$,#

  • ^y:;i/*,#

  • ^a:;o/*,#

  • AWA^aie:eWe/*,#

  • A^aie:e/*,#

  • yAoYA^aie:giYe/*,*

  • AoYA^aie:eYe/X,#

  • eoYA^aie:iYe/*,#

  • Ano^aie:iYe/*,#

  • oYA^aie:uYe/*$,#

  • AWA^aie:eWe/*,#

  • A^:NULL/B,~*

  • A^:a/B,$*

  • no^:ch/*A,chh*

  • oYA^:ch/*A,chh*

  • Ano^:iY/*,echh*

  • no^:NULL/*A,E*

  • no^F:NULL/*A,G*

  • oYA^F:NULL/*A,G*

  • no^:NULL/*A,iK

  • oYA^:NULL/*A,iK

  • no^L:o/*A,*


Input Format

Input to the Morphological Generator starts with the root of the word, followed by the POS category and the attribute names with their values.

Example:

karA VM Person 3 Tense 2 Emp 2

In Bengali, Person and Tense combine to give a suffix, which is added first; the Emphasizer gives another suffix, which is added next.

See Morphotactic for Bengali Verb.
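An input line of this shape can be split into its parts as follows. The function name `parse_generator_input` is an assumption for illustration; the sketch expects the root, the category, and then alternating attribute names and integer values.

```python
def parse_generator_input(line: str):
    """Split 'root CATEGORY Name value Name value ...' into its three parts."""
    tokens = line.split()
    if len(tokens) < 2 or len(tokens) % 2 != 0:
        raise ValueError("expected root, category and name/value pairs")
    root, category, rest = tokens[0], tokens[1], tokens[2:]
    # Pair each attribute name with the value that follows it.
    features = {name: int(value) for name, value in zip(rest[0::2], rest[1::2])}
    return root, category, features

print(parse_generator_input("karA VM Person 3 Tense 2 Emp 2"))
```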


Input Format Contd.

In Bengali, Person can have 6 values and Tense (which is actually TAM) can have 10 values. The suffixes in the paradigm table are arranged in the following way.

First entry is Person 0 Tense 0

Second entry is Person 0 Tense 1

Third entry is Person 0 Tense 2 …

10th entry is Person 0 Tense 9

11th entry is Person 1 Tense 0

So Person 3 Tense 2 will be the entry number

(Person input) × (number of TAM values) + (TAM input) + 1

= 3 × 10 + 2 + 1 = 33

Get the 33rd entry from the paradigm table for (0,1) and apply the orthographic rules to get the correct word.
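The entry-number arithmetic generalizes to any attribute combination in which the last attribute varies fastest. The Python sketch below assumes that ordering, as in the Person/TAM listing above; `entry_number` is an illustrative name.

```python
def entry_number(values, sizes):
    """1-based position of a value combination in a row-major suffix table.

    values: the input value of each attribute, e.g. [person, tam]
    sizes:  how many values each attribute can take, e.g. [6, 10]
    """
    index = 0
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError("attribute value out of range")
        index = index * size + value
    return index + 1  # the slides count entries from 1

print(entry_number([3, 2], [6, 10]))  # Person 3, Tense 2
```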


Bengali Verb Paradigms and Morphotactics

<ParadigmTable

<Attributes 1 2 > /* 1 indicates Person and 2 indicates TAM */

<suffixes

i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini isa chhisa echhisa li chhili echhili bi tisa NULL isani o chha echha le chhile echhile be te NULL ani

ena chhena echhena lena chhilena echhilena bena tena una enani e chhe echhe la chhila echhila be ta uka eni ena chhena echhena lena chhilena echhilena bena tena una enani

>>

<ParadigmTable

<Attributes 3 > /*Case*/

<suffixes

NULL i o

>>

Morphotactic rule

  • (0,1)(2)(3)

  • (3=2)(2)


Bengali Noun Paradigms and Morphotactics

<ParadigmTable

<Attributes 0 1 2 > /* Number, Specificity, Ellipses 2×2×2 = 8 entries*/

<suffixes

NULL eraTA TA NULL gulo guloraTA NULL NULL

>>

<ParadigmTable

<Attributes 3 4 > /* Form, Case 2 × 5 = 10 entries */

<suffixes

NULL ke NULL ete ete NULL NULL era NULL NULL

>>

<ParadigmTable

<Attributes 5 > /* Emphasizer 3 entries */

<suffixes

NULL i o

>>

Morphotactic rule

(0,1,2)(3,4)(5)


Example (Bengali Verb)

Example: the input is

balA Verb Person 1 TAM 1 Case 0

The first morphotactic rule is triggered.

Person can have 6 values and TAM can have 10 values. So the number of the suffix extracted from the paradigm table (1,2) is

10 × (Person value) + (TAM value) + 1 = 10 × 1 + 1 + 1 = 12

i.e., chhisa is to be added first.

From the paradigm table (3) the extracted suffix is NULL,

i.e., NULL is to be added next.


Example Contd.

Now balA^chhisa# is the input, which is matched against the orthographic rules.

Suppose there is an orthographic rule

A^:a/B,$* where B: *-Y and $: consonant

Then the FSM for this rule will bring the input to the final state, i.e., the rule is triggered. Now “A^” is replaced by “a” and the output is “balachhisa”.
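The whole worked example can be replayed in Python. This is a sketch under stated assumptions rather than the generator's code: the suffix lists are copied from the verb paradigm tables shown earlier, B is read as "any symbol except Y", the vowel set `aAeiIouU` is an assumption about the transliteration, and the function names are hypothetical.

```python
# Person/TAM table from the verb paradigm slide: 6 persons x 10 TAM values.
PERSON_TAM_SUFFIXES = (
    "i chhi echhi lAma chhilAma echhilAma ba tAma NULL ini "
    "isa chhisa echhisa li chhili echhili bi tisa NULL isani "
    "o chha echha le chhile echhile be te NULL ani "
    "ena chhena echhena lena chhilena echhilena bena tena una enani "
    "e chhe echhe la chhila echhila be ta uka eni "
    "ena chhena echhena lena chhilena echhilena bena tena una enani"
).split()
THIRD_TABLE = ["NULL", "i", "o"]  # the table the slide labels with attribute 3
VOWELS = set("aAeiIouU")           # assumed vowel symbols of the transliteration

def apply_a_rule(form: str):
    """Rule A^:a/B,$* : 'A^' becomes 'a' after a non-Y symbol and before a consonant."""
    i = form.find("A^")
    if i <= 0:
        return None
    left, right = form[i - 1], form[i + 2:i + 3]
    if left != "Y" and right.isalpha() and right not in VOWELS:
        return form[:i] + "a" + form[i + 2:]
    return None

def generate_verb(root: str, person: int, tam: int, third: int) -> str:
    first = PERSON_TAM_SUFFIXES[10 * person + tam]   # 0-based version of entry 12
    second = THIRD_TABLE[third]
    suffix = "".join(s for s in (first, second) if s != "NULL")
    form = f"{root}^{suffix}#"
    changed = apply_a_rule(form)
    if changed is None:                # no rule triggered: plain concatenation
        changed = form.replace("^", "")
    return changed.rstrip("#")

print(generate_verb("balA", 1, 1, 0))
```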


Exception List

Words whose orthographic changes do not match those of other words, or which change completely when inflected, are treated as exceptions.

Adding such words to the orthographic rules would produce a large number of highly complex rules.

We handle them in a separate file, which lists each exception word along with all its inflections.
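The lookup order this implies (exception list first, rules second) can be sketched as below. The file format of the exception list is not given in the slides, so the dictionary layout, the English example entry and both function names are purely illustrative assumptions.

```python
# Hypothetical in-memory form of the exception file: (root, feature) -> surface form.
EXCEPTIONS = {("go", "past"): "went"}

def regular_generate(root: str, feature: str) -> str:
    """Stand-in for the rule-based path; here just suffix concatenation."""
    suffix = {"present": "", "past": "ed"}[feature]
    return root + suffix

def generate(root: str, feature: str) -> str:
    """Consult the exception list before falling back to the regular rules."""
    listed = EXCEPTIONS.get((root, feature))
    if listed is not None:
        return listed
    return regular_generate(root, feature)

print(generate("go", "past"))
print(generate("walk", "past"))
```

Keeping irregular forms in a lookup table leaves the rule set small, at the cost of listing every inflection of each exception explicitly.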


Morph Analyzer