The Motivation- Statements by Prof Raj Reddy PowerPoint Presentation
The Motivation-Statements by Prof Raj Reddy
  • Information will be read by both humans and machines - more so by machines.
  • If you are not in Google, you are not there!

What does Google do?

The Google
  • It removes the stop words
  • It stems
  • It does not disambiguate
  • It makes you wonder why you did such a beautiful Translation
The Machine Translation
  • Often follows the law of diminishing returns (quality improves only asymptotically)
  • Requires huge human, material and computer resources
  • Assumes that the user is either unaware of the context or has no intelligence
  • Almost impossible to attain perfect human-like translation
  • Lexical, syntactic and semantic errors
Our Experience
  • Migrant workers in India pick up the local language, which is foreign to them, in less than a month
  • The Butler English
  • Learning by Experience and usage
What is a Good Translation
  • If the user does not get irritated when reading the translated text
  • Intelligent Human Beings have more resistance to irritation
  • If we design a Machine Translation system that assumes intelligent users, then the resources and time required would be significantly less
  • Intelligent users would be more tolerant to syntactical and semantic errors
Good Enough Translation
  • Lexical Errors are easy to handle since bilingual dictionaries can be built easily
  • Mike Shamos’s concept of Universal Dictionary and Disambiguation
  • Collocation frequencies have been exploited by us in automatic summarization; their manifestation is the phrase dictionary
  • Add to this simple aligned corpora of human translated frequently used sentences
  • Mine the sentences for new phrases and mine the phrases for new words
  • Use the Wikipedia Approach to enhancing the learning
  • First prototype built by Hemant, Madhavi, Raj and me for Hindi
  • Later extended to Kannada and Tamil by Rashmi, Sravan, Sheik, Anand, Vivek and Vinodini
  • Now we have the ability to make EBMT good enough in 30 Days
Universal Dictionary
  • Mike Shamos’s contribution to UDL
  • A collection of dictionaries in various languages.
  • Contains many European languages.
  • Given a word in English we can get the meaning of the word in various languages at one click
  • A total of five Indian languages were added to the Universal dictionary:
      • Kannada, Telugu, Tamil, Malayalam, Hindi
  • Microsoft Access Database.
Example-Based Machine Translation (EBMT): A Good Enough Translation
  • Requires:
    • A set of sentences in the source language and their corresponding translation in the target language.
    • A set of phrases in the source language and their corresponding Translation in the target language
    • A Bilingual Dictionary (English-Hindi)
  • It looks for the longest match to learn
  • The best part of our EBMT system is that it keeps on learning day by day
  • Currently available for: English-Hindi-Telugu-Kannada-Tamil
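The longest-match idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the system described in these slides: the `TABLE` contents and the function name `translate_longest_match` are hypothetical, the entries are read off examples elsewhere in the deck, and target-side reordering is ignored.

```python
from typing import Dict

# Toy table mixing phrase-level and word-level entries (hypothetical stand-in
# for the sentence, phrase and dictionary resources listed above).
TABLE: Dict[str, str] = {
    "they did not": "avaru maadalilla",
    "my house": "nanna mane",
    "my": "nanna",
    "house": "mane",
    "he": "avanu",
}


def translate_longest_match(sentence: str, table: Dict[str, str]) -> str:
    """Greedily translate the longest source span that has a table entry."""
    tokens = sentence.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest remaining span first, shrinking until one matches.
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in table:
                out.append(table[span])
                i = j
                break
        else:
            # No entry at all: pass the source word through unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

With this table, "my house" is matched as one phrase before the single words "my" and "house" are considered, which is the point of preferring the longest match.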

Problem statement

Aim:

  • To obtain a “good enough” translation

Constraints:

  • Limited Data
  • Limited Processing Time
Corpora Table (Indian Languages)

[Table with columns "Languages" and "Corpora"; the rows were not preserved in this transcript.]

G-EBMT
  • Similar examples are tokenized to show equivalence classes and stored as a generalized example. [Brown 99]
  • A database of sentence and phrase rules + bilingual dictionary

Format:

Source sentence rule -> Target sentence rule

Source phrase rule -> Target phrase rule

she brought a <noun> -> aval’u <noun> than’dal’u

Input sentence: she brought a dog

Output sentence: aval’u naayi tandal’u
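A generalized rule of this kind can be applied by binding each slot to a word of the right category and substituting its translation on the target side. The sketch below assumes a hypothetical lexicon that maps a word to a (category, translation) pair, and a hypothetical function name `apply_rule`; the verb is spelled as in the output sentence above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lexicon: source word -> (category, target word).
LEXICON: Dict[str, Tuple[str, str]] = {"dog": ("noun", "naayi")}


def apply_rule(source_rule: str, target_rule: str, sentence: str,
               lexicon: Dict[str, Tuple[str, str]]) -> Optional[str]:
    """Return the translation if the sentence fits the rule, else None."""
    rule_tokens = source_rule.split()
    words = sentence.split()
    if len(rule_tokens) != len(words):
        return None
    bindings: Dict[str, str] = {}
    for rule_tok, word in zip(rule_tokens, words):
        if rule_tok.startswith("<") and rule_tok.endswith(">"):
            # A slot such as <noun>: the word must belong to that category.
            category, translation = lexicon.get(word, (None, None))
            if category != rule_tok[1:-1]:
                return None
            bindings[rule_tok] = translation
        elif rule_tok != word:
            # A literal token of the rule must match exactly.
            return None
    # Fill the slots on the target side; literal target tokens are kept.
    return " ".join(bindings.get(tok, tok) for tok in target_rule.split())


result = apply_rule("she brought a <noun>", "aval’u <noun> tandal’u",
                    "she brought a dog", LEXICON)
```

A sentence that does not fit the rule, or whose slot word is missing from the lexicon, returns `None`, which is where a back-off strategy would take over.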

Word Order Free: A feature of Indian languages

  • Root words take different forms according to their role in the sentence, so the sequential order of the words is not important, unlike in English. For example,
  • I am Going Home
  • Home Going I am
  • may mean the same thing in Indian languages
Krishna told Yudhisthira that Drona would finish his entire army if not checked.

The only way to check Drona was to make him lay down his arms.

That was possible only if Drona was told that his son Ashwathama had died.

Telling a lie is not a good practice even in wars. Krishna came up with an idea, and Yudhisthira agreed reluctantly.

Bhima killed an elephant named ‘Ashwathama’.

Then he loudly announced for all to hear,

‘Ashwathama Hathah Kunjarah’

(Ashwathama, an elephant, is killed)

Lord Krishna blew his conch and made the word Kunjarah (elephant) inaudible on the battlefield.

Drona turned to Yudhisthira and asked if that was true.

Yudhisthira said, “Ashwathama is killed,” and added in a low voice, “an elephant.”

Dronacharya heard only the first two words, ‘Ashwathama Hathah’, and presumed that his son Ashwathama had been killed.

He gave up his weapons and sat in prayer.

Dhrishtadyumna took advantage of the opportunity and killed Dronacharya.

This story illustrates that Indian languages like Sanskrit are word-order-free languages.

Hence good lexical corpora would help in producing a nearly good enough translation.

Generalization for Indian Languages-Phrase level vs. Sentence level
  • Should we generalize
    • at the phrase level?
    • at the sentence level?

Input English Sentence : they did not follow the rules

Phrase-Level Generalization:

<pron> did not -> <pron> maadalilla

they did not -> avaru maadalilla

follow the <noun> -> <noun> annu paalisi

follow the rules -> niyamagal’l’annu paalisi

Output sentence: avaru maadalilla niyamagal’l’annu paalisi (BLEU score: 0)

(Word order is very important)

Sentence-Level Generalization:

<pron> did not follow the <noun> -> <pron> <noun> annu paalisalilla

Output sentence: avaru niyamagal’l’annu paalisalilla (BLEU score: 1)

Motivation for Linguistic Rules

What happens if the input sentence doesn’t match any of the rules?

When will this happen? It will surely happen, since we can’t have an infinite set of examples.

What do we do if it does? The most obvious thing is to fall back on word-to-word translation, which is not a good idea on its own because the words will need to be rearranged.

Can we add linguistic rules?

Managing Idioms
  • The meaning of an idiom cannot be inferred from the meanings of the words that make it up
  • Idioms are stored separately in a file in the following format

bite the hand that feeds 

un'd'a manege erad'u bage

Stage1: Tagger and Stemmer (1)

Input sentence:

he is playing in my house

Output of the tagger*:

he_PP is_VBZ play_VBG in_IN my_PP$ house_NN

PP: Personal pronoun

VBZ: Verb, 3rd person singular present

VBG: Verb, gerund or present participle

IN: Preposition or subordinating conjunction

PP$: Possessive pronoun

NN: Noun, singular or mass

*Helmut Schmid, “Probabilistic Part-of-Speech Tagging using Decision Trees”, Proceedings of International Conference on New Methods in Language Processing, September 1994.

Stage1: Retagging (2)

Why?

Auxiliary verbs behave differently from other verbs in Indian languages, so they are given their own tag (auxv):

he_PP is_auxv play_VBG in_IN my_PP$ house_NN
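The retagging step can be sketched as a small pass over the tagger output. The set `AUXILIARIES` below is a hypothetical, deliberately short list, and the function names are illustrative; the slides only show the case of "is".

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]

# Hypothetical list of auxiliary word forms; the real list is not given.
AUXILIARIES = {"is", "am", "are", "was", "were"}


def parse_tagged(text: str) -> Tagged:
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def retag_auxiliaries(tagged: Tagged) -> Tagged:
    """Replace the verb tag of auxiliary verbs with the tag 'auxv'."""
    return [
        (word, "auxv") if tag.startswith("VB") and word.lower() in AUXILIARIES
        else (word, tag)
        for word, tag in tagged
    ]


def format_tagged(tagged: Tagged) -> str:
    """Join (word, tag) pairs back into 'word_TAG word_TAG ...'."""
    return " ".join(f"{word}_{tag}" for word, tag in tagged)


retagged = format_tagged(
    retag_auxiliaries(parse_tagged("he_PP is_VBZ play_VBG in_IN my_PP$ house_NN"))
)
```

Only tokens that already carry a verb tag are considered, so a noun that happens to be spelled like an auxiliary would be left alone.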

Stage2: Splitting based on Prepositions and Conjunctions and Reordering (1)

Original sentence with a preposition:

he is playing in my house

P1: in my house

P2: he is play

Reordering the preposition (so that it becomes a postposition):

P1: my house in

P2: he is play

Original sentence with the conjunction AND connecting two PPs:

She is free to develop her ideas and to distribute it

E1: She is free

E2: develop her ideas to

E3: and

E4: distribute it to


Stage2: Splitting based on Prepositions and Conjunctions and Reordering (2)

Special case: two or more prepositions

I am going to school with Mary

After splitting and reordering:

P1: Mary with (Mary jote)

P2: school to (shaale ge)

P3: I am go (naanu hooguttidene)

Mary jote shaalege naanu hooguttidene
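The preposition case of Stage 2 can be sketched as follows. It assumes stemmed, tagged input in which every `IN` token is a preposition (the tag also covers subordinating conjunctions, which this sketch does not distinguish), and it does not handle the conjunction case. The function name `split_and_reorder` is hypothetical.

```python
from typing import List, Tuple


def split_and_reorder(tagged: List[Tuple[str, str]]) -> List[str]:
    """Split at prepositions, turn them into postpositions, main clause last."""
    parts: List[List[str]] = [[]]
    for word, tag in tagged:
        if tag == "IN":
            parts.append([word])      # a preposition starts a new part
        else:
            parts[-1].append(word)
    main, prepositional = parts[0], parts[1:]
    # Move each preposition to the end of its part, and emit the
    # prepositional parts in reverse order of appearance.
    reordered = [part[1:] + part[:1] for part in reversed(prepositional)]
    if main:
        reordered.append(main)
    return [" ".join(part) for part in reordered]


sentence = [("I", "PP"), ("am", "auxv"), ("go", "VBG"), ("to", "IN"),
            ("school", "NN"), ("with", "IN"), ("Mary", "NP")]
parts = split_and_reorder(sentence)
```

For the sentence above this yields the three parts shown on the slide, in the same order (P1, P2, P3).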

Stage3: Reorder Interrogations and Verbs

Place the verbs at the end:

P1: my house in

P2: he is play (play is placed at the end of P2)

If a verb and a particle are present together in any of the parts, place the verb along with the particle at the end of that part (explained in Stage 5).

Stage 4: Reorder Auxiliary/Modal verbs

P1: my house in

P2: he play is ("is" is placed at the end of P2)

Special Case:

He is not playing in my house

E1: my house in

E2: he play is not
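Stages 3 and 4 amount to two stable reorderings of each part: first the main verbs move to the end, then the auxiliary or modal verbs (together with "not") move behind them. A minimal sketch, assuming the `auxv` tag from Stage 1 and the Penn-style `MD` tag for modals; the function names are hypothetical.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def verbs_last(part: Tagged) -> Tagged:
    """Stage 3: move main verbs (tags starting with VB) to the end."""
    others = [tok for tok in part if not tok[1].startswith("VB")]
    verbs = [tok for tok in part if tok[1].startswith("VB")]
    return others + verbs


def auxiliaries_last(part: Tagged) -> Tagged:
    """Stage 4: move auxiliaries, modals and 'not' after the main verb."""
    def is_aux(tok: Tuple[str, str]) -> bool:
        return tok[1] in {"auxv", "MD"} or tok[0].lower() == "not"

    return [tok for tok in part if not is_aux(tok)] + \
           [tok for tok in part if is_aux(tok)]


def words(part: Tagged) -> str:
    return " ".join(word for word, _ in part)


p2 = [("he", "PP"), ("is", "auxv"), ("play", "VBG")]
e2 = [("he", "PP"), ("is", "auxv"), ("not", "RB"), ("play", "VBG")]
reordered_p2 = words(auxiliaries_last(verbs_last(p2)))
reordered_e2 = words(auxiliaries_last(verbs_last(e2)))
```

Because both passes keep the relative order within each group, "is not" stays in that order after the verb, as in the special case above.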

Stage5: word-word translation(1)

Dictionary format:

english-word category kannada-word

Sample:

in IN alli

in RB o’lage

to IN ge (if “to” is followed by a noun)

to IN alu

put on V-P haakiko

(where RB = adverb and V-P = verb with particle)

Join the parts (P1 and P2) obtained from Stage 4, and translate word to word.

my house in he play is

nanna mane alli avanu aad’u ide

Actual Kannada translation:

nanna mane alli avanu aad’uttiddaane
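Because the same English word can have several entries, the lookup has to be keyed on both the word and its category. The sketch below assumes a hypothetical in-memory dictionary; the entries for "in" come from the sample above, the others are read off the word-to-word output, and the context rule for "to" is not modeled.

```python
from typing import Dict, List, Tuple

# (english word, category) -> kannada word
DICTIONARY: Dict[Tuple[str, str], str] = {
    ("in", "IN"): "alli",
    ("in", "RB"): "o’lage",
    ("my", "PP$"): "nanna",
    ("house", "NN"): "mane",
    ("he", "PP"): "avanu",
    ("play", "VBG"): "aad’u",
    ("is", "auxv"): "ide",
}


def translate_words(tagged: List[Tuple[str, str]],
                    dictionary: Dict[Tuple[str, str], str]) -> str:
    """Translate each tagged word; unknown words are passed through."""
    return " ".join(
        dictionary.get((word.lower(), tag), word) for word, tag in tagged
    )


joined = [("my", "PP$"), ("house", "NN"), ("in", "IN"),
          ("he", "PP"), ("play", "VBG"), ("is", "auxv")]
word_to_word = translate_words(joined, DICTIONARY)
```

The result is the uninflected word-to-word output; the verb ending still has to be fixed in a later step.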

Stage5: word-word translation(2)

Special care needs to be taken with sentences containing particles: the verb and the particle have to be translated together.

Hence an entry is made in the dictionary as,

put on V-P haakiko

E.g.:

Put on your shoes.

If particles are not taken care of, the reordered sentence is "your shoes on put", which translates to

ninna chappali meile id’u

(meaning: put your shoes on top)
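One way to keep a verb and its particle together is to merge them into a single V-P token before any reordering, so that the pair is moved and looked up as a unit. This sketch assumes the tagger marks particles with the Penn-style tag `RP`; that tag and the function name `merge_particles` are assumptions, not something the slides state.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def merge_particles(tagged: Tagged) -> Tagged:
    """Merge a verb immediately followed by a particle into one V-P token."""
    merged: Tagged = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        following = tagged[i + 1] if i + 1 < len(tagged) else None
        if tag.startswith("VB") and following is not None and following[1] == "RP":
            merged.append((f"{word} {following[0]}", "V-P"))
            i += 2   # skip the particle, it is now part of the verb token
        else:
            merged.append((word, tag))
            i += 1
    return merged


tokens = [("put", "VB"), ("on", "RP"), ("your", "PP$"), ("shoes", "NNS")]
merged_tokens = merge_particles(tokens)
```

After merging, the first token is `("put on", "V-P")`, which matches the dictionary entry "put on V-P haakiko" directly.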

Sandhi(1)
  • South Indian languages (Dravidian) are rich in Sandhis
  • The tense, gender and number expressed in a sentence inflect the verbs
Sandhi(2)

Word-to-word translation gave the following output:

nanna mane alli avanu aad’u ide

Actual Kannada translation:

nanna mane alli avanu aad’uttiddaane

To solve this problem, the following is stored in a file,

Sample:

he_is_verb ttiddaane

she_is_verb ttiddaal’e

he_is_adj yavanu

she_is_adj yaval’u

*_is_not_verb ttilla

they_is_verb ttidaare

he_will_verb ttaane

she_will_verb ttaal’e

they_will_verb ttaare

he_will_be_verb ttirutaane

*independent of gender

A part of the [auxiliary/modal]-[verb]-translation list

Sandhi(2)

Input: he is playing in my house (a)

Word-to-word translation: nanna maneyalli avanu aad’u ide (b)

The translations of any auxiliary verb, modal verb and “not” present in the sentence are removed from (b) to get,

nanna maneyalli avanu aad’u (c)

“he_is_playing” matches the sequence “he_is_verb” in the list, so the suffix ttiddaane is attached to the verb:

nanna maneyalli avanu aad’uttiddaane
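The suffix lookup can be sketched as follows. The representation of the input as (source word, role, target word) triples, the role names, and the function name `apply_sandhi` are all assumptions made for illustration; only a few entries of the list are included.

```python
from typing import Dict, List, Tuple

# Part of the [auxiliary/modal]-[verb] list; "*" is independent of gender.
SANDHI: Dict[str, str] = {
    "he_is_verb": "ttiddaane",
    "she_is_verb": "ttiddaal’e",
    "*_is_not_verb": "ttilla",
}

Token = Tuple[str, str, str]  # (source word, role, target word)


def apply_sandhi(tokens: List[Token], table: Dict[str, str]) -> str:
    """Drop auxiliary translations and attach the matching verb suffix."""
    subject = next((src for src, role, _ in tokens if role == "subject"), "*")
    aux = [src for src, role, _ in tokens if role == "aux"]
    suffix = table.get("_".join([subject, *aux, "verb"]))
    if suffix is None:
        # Fall back to the gender-independent entry, if there is one.
        suffix = table.get("_".join(["*", *aux, "verb"]))
    if suffix is None:
        return " ".join(tgt for _, _, tgt in tokens)
    out = []
    for _, role, tgt in tokens:
        if role == "aux":
            continue                      # step (b) -> (c): remove auxiliaries
        out.append(tgt + suffix if role == "verb" else tgt)
    return " ".join(out)


sentence_b: List[Token] = [
    ("my", "other", "nanna"), ("house in", "other", "maneyalli"),
    ("he", "subject", "avanu"), ("play", "verb", "aad’u"), ("is", "aux", "ide"),
]
inflected = apply_sandhi(sentence_b, SANDHI)
```

If no entry matches, the sketch simply returns the word-to-word output unchanged rather than guessing a suffix.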


Evaluation

Input sentence:

he is going home for lunch

Output sentence from the system:

Ootakke avanu mane hooguttidaane (BLEU (N=3) = 0)

(Should have been: Ootakke avanu manege hooguttidaane)

Although the translation conveys the meaning, BLEU returns a score of 0 because no trigram (N=3) of the output matches the reference.

Give a fair score

The words with sandhis were first split by a human.

Reference sentence: Oota kke avanu mane ge hooguttidaane

The output from the system was split in the same way.

Candidate sentence: Oota kke avanu mane hooguttidaane

BLEU Score: 0.6237
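The effect of splitting can be checked with a plain single-reference BLEU. The sketch below is a textbook variant (clipped n-gram precision up to N=3, geometric mean, brevity penalty, no smoothing) and is an assumption about the metric's details; it reproduces the zero score for the unsplit sentence, but its value for the split sentence need not equal the figure quoted above, which depends on the exact implementation used.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 3) -> float:
    """Single-reference BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0   # any zero precision makes the geometric mean zero
        log_precision_sum += math.log(matched / total)
    # Brevity penalty for candidates shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_precision_sum / max_n)


unsplit = bleu("Ootakke avanu mane hooguttidaane",
               "Ootakke avanu manege hooguttidaane")
split = bleu("Oota kke avanu mane hooguttidaane",
             "Oota kke avanu mane ge hooguttidaane")
```

In the unsplit case no trigram of the candidate occurs in the reference, so the score collapses to zero; once the suffixes are separate tokens, two of the three candidate trigrams match and the score becomes positive.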

Results

BLEU Score Results for Kannada

Where,

1: Performance of EBMT system without rules

2: Performance of EBMT system with 582 rules

BLEU Score Evaluation for Kannada Corpora

With Rules 500

Without Rules 500

Number of sentences taken for evaluation: 100

Effect of number of rules (from G-EBMT) on accuracy
  • substantial improvement in the average score when Module1 (G-EBMT) and Module2 (language specific-linguistic rules) are combined.
  • errors mainly due to ambiguities in the meaning of the words
How can we improve further
  • Wikipedia approach
  • Human Evaluation rather than BLEU
  • Use linguistic expertise to generate more Linguistic rules
  • We are also using data mining to infer rules from the corpora
BLEU Score Evaluation for Tamil Corpora

With Rules 500

Without Rules 500

Number of sentences taken for evaluation: 100


Results

BLEU Score Results for Tamil

Where,

1: Performance of EBMT system without rules

2: Performance of EBMT system with 700 rules

Human Evaluation of Machine Translation

  • Because of the problems with word ordering, BLEU gives a minimal score. Evaluation is therefore done by human beings in a test bed, and the results are as follows:
  • A sample of this application can be found at the following URL:

http://ashwini.dli.ernet.in/humanmt/

Post Editing of Translated sentences

  • The EBMT system possesses all the words necessary to yield a perfect translation, yet it achieves only 80% accuracy.
  • What is lacking is the choice of the correct word at the correct position. This can be fixed by editing the translated sentence, for which AJAX (Asynchronous JavaScript and XML) is used.
  • With a conventional deployment, the whole page would have to be refreshed each time the user changes a word, which is annoying.
  • AJAX has been deployed because it does not need a page refresh.
  • With this effort the translation accuracy increased drastically, nearing human quality. A sample has been hosted at http://ashwini.dli.ernet.in/ebmtpe/
Wikipedia approach to EBMT

  • A community portal has been created, and anyone who is willing to donate a word, phrase or sentence can do so there. Contributors can also evaluate the EBMT translation quality.
  • A moderator takes care of the added or corrected words, phrases and sentences, and decides whether to omit each entry or add it to the database
  • The URLs of the community portal are as follows:-
  • http://ashwini.dli.ernet.in/community/wordfinder.html
  • http://ashwini.dli.ernet.in/community/phrasefinder.html
  • http://ashwini.dli.ernet.in/community/sentencefinder.html
  • http://ashwini.dli.ernet.in/mod.php
Wikipedia approach to EBMT: Public Interface Flow Chart

START -> Main Interface -> Word, Phrase or Sentence

  • Word: Enter word -> Word Availability -> Word found: Correct word; Word not found: Add word
  • Phrase: Enter phrase -> Phrase Availability -> Phrase found: Correct phrase; Phrase not found: Add phrase
  • Sentence: Enter sentence -> Sentence Availability -> Sentence found: Correct sentence; Sentence not found: Add sentence

Conclusion
  • Many problems in Indian languages are looking more and more intractable, like the legendary Indian language OCR!
  • What is successful in English and European languages need not be successful for Indian languages
  • A whole new way of thinking may be needed
  • Overall, UDL will turn out to be a very fertile ground for research, with many unsolved problems for which we do not even know the directions
  • Good Enough Technologies are part of the new way of thinking
The Websites to watch
  • http://www.new.dli.ernet.in/
  • http://www.dli.ernet.in/
  • http://dli.iiit.ac.in/
  • http://swati.dli.ernet.in/om
  • http://bharani.dli.ernet.in/ebmt/
  • http://revati.dli.ernet.in/SearchTamil.html
Acknowledgements
  • Prof Raj Reddy for his great vision and guidance
  • Madhavi, Hemant, Eric, Krishna, Kiran, Srini, Sravan, Sheik, Mini, Rashmi, Pradeepa, Tina, Malar, Anand, Jiju, Ravi, Vamshi, Vivek, Vinodini and Kishore