Machine Translation, Language Divergence and Lexical Resources
Pushpak Bhattacharyya

Computer Science and Engineering Department

IIT Bombay


Acknowledgement

  • NLP-AI members, CSE Dept, IIT Bombay.


What is MT

Conversion of source language text to target language text by a computer program:

Document in L1 → Computer Program → Document in L2


Kinds of MT Systems (How much of Human Participation)

  • Fully Automatic

  • Semi Automatic

    • Human Aided MT (HAMT)

      • Pre-editing

      • Post-editing


    • Machine Aided HT (MAHT)

      • On-line Dictionaries

      • Terminology Data Banks

      • Translation Memories



Kinds of MT Systems (domain coverage)

  • General Purpose

    (SYSTRAN in Europe)

  • Domain Specific

    (TAUM-METEO in Canada; translates weather reports between French and English)


Kinds of MT Systems (point of entry from source to the target text)


Why is MT Difficult? Classical NLP Problems

  • Ambiguity

    • Lexical

    • Structural

  • Ellipsis

  • Co-reference

    • Anaphora

    • Hypernymic



Why is MT Difficult? Language Divergence

  • Lexico-Semantic Divergence

  • Structural Divergence


Language Divergence (English-Hindi: Noun to Adjective)

  • The demands on sportsmen today can lead to burnout at an early age.

    (noun – the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)

  • खिलाड़ियों से जो आज अपेक्षाएं हैं, वे उन्हें कम उम्र में ही अक्रियाशील कर सकती हैं।


Language Divergence (English-Hindi: Noun to Verb)

  • Every concert they gave us was a sell-out.

    (an event for which all the tickets have been sold)

  • उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे।


Language Divergence (English-Hindi: Adjective to Adverb)

  • The children watched in wide-eyed amazement.

    (with eyes fully open because of fear, great surprise, etc)

  • बच्चे आश्चर्य से आँखें फाड़े देख रहे थे।


Language Divergence (English-Hindi: Adjective to Verb)

  • He was in a bad mood at breakfast and wasn't very communicative.

    (able and willing to talk and give information to other people)

  • नाश्ते के समय वह खराब मूड में था और ज्यादा बात-चीत नहीं कर रहा था।


Language Divergence (English-Hindi: Preposition to Adverb)

  • It gets cooler toward evening.

    (near a point in time)

  • शाम होते-होते ठंडक बढ़ जाती है।


Language Divergence (English-Hindi: Idiomatic Usage)

  • Given her interest in children, teaching seems the right job for her.

    (when you consider something)

  • बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है।


Language Divergence (Marathi-Hindi-English: case marking and postpositions transfer: works!)

  • प्रथम ताख्यात

  • वर्तमान (simple present)

    • तो जातो.

    • वह जाता है।

    • He goes.

  • स्थिरसत्य (universal truth)

    • पृथ्वी सूर्याभोवती फिरते.

    • पृथ्वी सूर्य के चारों ओर घूमती है।

    • The earth revolves round the sun.


Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)

  • ऐतिहासिक सत्य (historical truth)

    • कृष्ण अर्जुनास सांगतो...

    • कृष्ण अर्जुन से कहते हैं...

    • Krushna says to Arjuna…

  • अवतरण (quoting)

    • दामले म्हणतात, ...

    • दामले कहते हैं, ...

    • Damle says,...


Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)

  • संनिहित भूत (immediate past)

    • कधी आलास? हा येतो इतकाच !

    • कब आये? बस अभी आया ।

    • When did you come? Just now (I came).

  • निःसंशय भविष्य (certainty in future)

    • आता तो मार खातो खास !

    • अब वह मार खायगा ही !

    • He is in for a thrashing.

  • आश्वासन (assurance)

    • मी तुम्हाला उद्या भेटतो.

    • मैं आप से कल मिलता हूँ।

    • I will see you tomorrow.


Language Divergence Theory: Lexico-Semantic Divergences

  • Conflational divergence

  • Structural divergence

  • Categorial divergence

  • Head swapping divergence

  • Lexical divergence


Language Divergence Theory: Syntactic Divergences

  • Constituent Order divergence

  • Adjunction Divergence

  • Preposition-Stranding divergence

  • Null Subject Divergence

  • Pleonastic Divergence


MT approaches

Vauquois Triangle:

  • Direct

  • Transfer Based

  • Interlingua Based


Interlingua Methodology

  • Directly obtain the meaning of the source sentence.

  • Generate the target sentence from the meaning representation.

John gave the book to Mary.

Meaning representation:

    give-action:

      agent: John

      object: the book

      receiver: Mary

  • The ATLAS system at Fujitsu was a precursor to the worldwide project on UNL.
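
The two steps above (analysis into a meaning representation, then generation from it) can be illustrated with a small sketch. This is only a toy under stated assumptions: the `GiveAction` class and both function names are hypothetical, the analyzer handles just the pattern "<agent> gave <object> to <receiver>.", and a real interlingua such as UNL is far richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GiveAction:
    """Hypothetical language-neutral frame for a 'give' event."""
    agent: str
    obj: str
    receiver: str


def analyze_english(sentence: str) -> GiveAction:
    """Toy analyzer for '<agent> gave <object> to <receiver>.' only."""
    words = sentence.rstrip(".").split()
    gave = words.index("gave")
    to = words.index("to")
    return GiveAction(
        agent=" ".join(words[:gave]),
        obj=" ".join(words[gave + 1:to]),
        receiver=" ".join(words[to + 1:]),
    )


def generate_passive(frame: GiveAction) -> str:
    """Toy generator: builds a sentence from the frame, not from the source words."""
    return f"{frame.receiver} was given {frame.obj} by {frame.agent}."


frame = analyze_english("John gave the book to Mary.")
print(frame)
print(generate_passive(frame))
```

The generator never sees the source word order, which is the point of the interlingua approach: any surface form can be produced once the frame is available.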


Competing approaches

Direct

Transfer based


Direct approach

  • Word replacements

    I like mangoes

    मैं अच्छा लग आम

    I like (root) mangoes

  • Morphology

    मैं अच्छा लगता आम

    I like mangoes

  • Syntactic re-arrangement

    मैं आम अच्छा लगता है

    I mangoes like

  • Idiomatization

    मुझे आम अच्छा लगता है

    I (dative) mangoes like
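
The four stages of the direct approach can be sketched as a pipeline of simple list transformations. This is a minimal illustration for the single example sentence only: the lookup tables and function names are hypothetical, the Hindi words are written in Devanagari (assumed to be the intended forms), and the re-arrangement step assumes exactly three tokens in subject-verb-object order.

```python
# Hypothetical toy tables covering only the example "I like mangoes".
LEXICON = {"I": "मैं", "like": "अच्छा लग", "mangoes": "आम"}
INFLECTED = {"अच्छा लग": "अच्छा लगता"}   # verb root -> inflected form
DATIVE = {"मैं": "मुझे"}                   # nominative -> dative subject


def replace_words(words):
    """Stage 1: word-for-word dictionary replacement."""
    return [LEXICON[w] for w in words]


def apply_morphology(words):
    """Stage 2: inflect the verb root."""
    return [INFLECTED.get(w, w) for w in words]


def rearrange(words):
    """Stage 3: SVO -> SOV, adding the auxiliary; assumes three tokens."""
    subject, verb, obj = words
    return [subject, obj, verb + " है"]


def idiomatize(words):
    """Stage 4: the 'like' construction takes a dative subject in Hindi."""
    return [DATIVE.get(words[0], words[0])] + words[1:]


def direct_translate(sentence):
    words = sentence.split()
    for stage in (replace_words, apply_morphology, rearrange, idiomatize):
        words = stage(words)
    return " ".join(words)


print(direct_translate("I like mangoes"))
```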


Transfer Based

Source sentence processed for parsing, chunking etc.

[S [NP I] [VP [V like] [NP mangoes]]]


Transfer Based

Transfer structures obtained for the target sentence.

[S [NP I] [VP [NP mangoes] [V like]]]


Transfer Based

Morphology and language specific modifications

[S [NP मुझे] [VP [NP आम] [V अच्छा लगता है]]]
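
The structural transfer step in this example (verb-object order in the English VP becoming object-verb order for Hindi) can be sketched on a small tree. The nested-tuple tree encoding and the function names below are assumptions made for illustration, and only this one rule is implemented; a real transfer component would hold many such rules plus lexical transfer.

```python
def label_of(node):
    """Return the category label of a tree node, or None for a leaf word."""
    return node[0] if isinstance(node, tuple) else None


def transfer(tree):
    """Apply one structural rule everywhere: VP -> V NP becomes VP -> NP V."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [transfer(child) for child in children]
    if label == "VP" and [label_of(c) for c in children] == ["V", "NP"]:
        children.reverse()
    return (label, *children)


def leaves(tree):
    """Read the words off the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


source = ("S", ("NP", "I"), ("VP", ("V", "like"), ("NP", "mangoes")))
target = transfer(source)
print(leaves(target))
```

Morphology and language-specific modifications would then be applied to the leaves of the transferred tree.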


Relation Between the Transfer and the Interlingua Models

[Figure: source language words are parsed into a source language parse tree; transfer maps it to a target language parse tree, from which target language words are generated. Alternatively, interpretation maps the source side to the interlingua, and generation proceeds from the interlingua to the target side.]


State of Affairs

  • Systran reports 19 different language pairs.

  • Only 8 are adequate for their intended use.

  • Even fewer are capable of quality written or spoken text translation.


Notable Systems in India

  • Anusaaraka (IITK and IIIT Hyderabad: information access: one of the earliest systems)

  • Angla-Hindi (IITK: Transfer Based)

  • Shakti and Shiva (IIIT Hyderabad: Use of simple modules to create complex and high level performance)

  • UNL Based system (IIT Bombay- part of the UN effort: emphasis on semantics)

  • Hindi-Tamil system (AU-KBC, Chennai: based on the approach at IIIT Hyderabad)


Semantics: use of Lexical Resources

  • WordNet

  • Word Sense Disambiguation


Wordnet

  • A lexical knowledgebase based on conceptual lookup

    • Organizing concepts in a semantic network.

  • Organize lexical information in terms of word meaning, rather than word form

    • Wordnet can also be used as a thesaurus.


Lexical Matrix


The Structure of Hindi Wordnet

  • 30,000 unique words

  • 13,000 synsets

  • Wordnet Relations

    1. Lexical Relations (between word forms)

      • Synonymy

      • Antonymy

    2. Semantic Relations (between word meanings)

      • Hyponymy/Hypernymy

      • Meronymy/Holonymy

      • Entailment/Troponymy
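
To make the idea of organizing by meaning rather than by word form concrete, here is a minimal sketch of synsets linked by hypernymy, with an index from word forms to synsets. The `Synset` class, the helper functions, and the English toy entries are all hypothetical; they do not reflect the actual Hindi Wordnet data or file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Synset:
    """One meaning: a set of synonymous word forms plus a gloss."""
    words: list
    gloss: str
    hypernym: Optional["Synset"] = None  # semantic relation to a more general synset


def hypernym_chain(synset):
    """Follow hypernym links from a synset up to the root."""
    chain = []
    current = synset.hypernym
    while current is not None:
        chain.append(current)
        current = current.hypernym
    return chain


def build_index(synsets):
    """Map each word form to every synset (meaning) it can express."""
    index = {}
    for synset in synsets:
        for word in synset.words:
            index.setdefault(word, []).append(synset)
    return index


# Toy entries, invented for illustration.
entity = Synset(["entity"], "something that exists")
fruit = Synset(["fruit"], "the edible product of a plant", hypernym=entity)
mango = Synset(["mango"], "a tropical fruit with sweet yellow flesh", hypernym=fruit)
bat_animal = Synset(["bat"], "a flying nocturnal mammal", hypernym=entity)
bat_club = Synset(["bat", "club"], "a stick for hitting a ball", hypernym=entity)

index = build_index([entity, fruit, mango, bat_animal, bat_club])
print([s.words[0] for s in hypernym_chain(mango)])
print([s.gloss for s in index["bat"]])
```

A word form that maps to more than one synset ("bat" here) is polysemous, which is exactly the case word sense disambiguation has to resolve.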


A small part of Hindi Wordnet


Hindi WordNet APIs

[Figure: API functions built around the Hindi data: findtheinfo, getindex, in_wn, index_lookup, read_synset, free_synset, free_index, morphstr]


The Hindi WSD System


Approach to WSD

[Figure: a context bag built from the Hindi document and a semantic bag built from the Hindi Wordnet are compared by intersection similarity.]


WSD Algorithm

  • For a polysemous word w needing disambiguation, a set of context words in its surrounding window is collected. Let this collection be C, the context bag. The window is the current sentence together with the preceding and the following sentences.

  • For each sense s of w, do the following:

    Let B be the bag of words obtained from the

    • Synonyms in the synsets

    • Glosses of the synsets

    • Example Sentences of the synsets

    • Hypernyms (recursively up to the roots)

    • Glosses of Hypernyms

    • Example Sentences of Hypernyms


WSD Algorithm (continued)

    • Hyponyms (recursively up to the leaves)

    • Glosses of Hyponyms

    • Example Sentences of Hyponyms

    • Meronyms (recursively up to the beginner synset)

    • Glosses of Meronyms

    • Example Sentences of Meronyms

    Measure the overlap between C and B using intersection similarity.

  • Output as the winner the sense that has the maximum overlap similarity value.
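
The algorithm above can be sketched in a few lines. This is a simplified illustration, not the system's implementation: senses are plain dictionaries with invented English entries, the semantic bag is built only from synonyms, glosses, examples, and hypernyms (hyponyms and meronyms are omitted for brevity), no stop words are removed, and "intersection similarity" is assumed here to mean the size of the set intersection.

```python
def semantic_bag(sense):
    """Collect words from a sense and, recursively, from its hypernyms."""
    bag = set()
    while sense is not None:
        bag.update(sense["synonyms"])
        bag.update(sense["gloss"].split())
        bag.update(sense["example"].split())
        sense = sense.get("hypernym")
    return bag


def overlap(context_bag, sense):
    """Assumed similarity measure: number of shared words."""
    return len(context_bag & semantic_bag(sense))


def disambiguate(context_words, senses):
    """Return the sense whose semantic bag overlaps most with the context bag."""
    context_bag = set(context_words)
    return max(senses, key=lambda sense: overlap(context_bag, sense))


# Hypothetical toy senses for the word "bank".
financial = {
    "name": "financial",
    "synonyms": ["bank"],
    "gloss": "an institution that accepts money deposits",
    "example": "she opened an account at the bank",
    "hypernym": {"synonyms": ["institution"], "gloss": "an organization", "example": ""},
}
river = {
    "name": "river",
    "synonyms": ["bank", "shore"],
    "gloss": "sloping land beside a river",
    "example": "they sat on the river bank",
    "hypernym": {"synonyms": ["slope"], "gloss": "an inclined surface", "example": ""},
}

context = "he deposited money in the bank account".split()
winner = disambiguate(context, [financial, river])
print(winner["name"])
```

In practice the context would span the current, preceding, and following sentences, as described above, and function words would usually be filtered out before counting overlap.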


Evaluation

  • Only Nouns

  • Test corpora from CIIL, Mysore.

  • Corpus from 8 domains, each containing around 2000 words on average.


Result


Conclusions (Knowledge Based MT)

  • Language Divergence is the bottleneck

    • Not only for languages from distant families (English-Japanese)

    • But also for siblings within a family (Hindi-Marathi)

  • Solution lies in creating and exploiting knowledge structures


Conclusions (Statistical MT)

  • Complementary (not really competing) approach

  • Example: IBM approach to translation from/to English and other languages (French, Chinese, and currently Hindi)

  • Needs vast amounts of aligned text corpora

  • Basic idea is to maximize P(T|S) over all target sentences T: needs language modeling (P(T)) and translation modeling (P(S|T))
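
The reason both a language model and a translation model are needed follows from Bayes' rule. The derivation below uses the slide's symbols S and T; the hat notation for the chosen translation is an added convention.

```latex
\hat{T} \;=\; \arg\max_{T} P(T \mid S)
        \;=\; \arg\max_{T} \frac{P(T)\, P(S \mid T)}{P(S)}
        \;=\; \arg\max_{T} P(T)\, P(S \mid T)
```

The denominator P(S) is dropped in the last step because the source sentence is fixed, so it does not affect which T attains the maximum.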


Pre Editing

The inspection team appointed by the United Nations visited Iraq early July, 2003.

The <cnp> inspection team </cnp> {which was} appointed by the <org> United Nations </org> visited Iraq {in} early <date>July, 2003</date>.


Post Editing

  • I want to eat well today

    मैं आज अच्छा खाना चाहता हूँ

    मुझे आज अच्छा खाना चाहिए


Terminology DB and Translation Memory

  • Special lexicon containing the domain terms and their translations

    • Nuclear Energy: आणविक ऊर्जा

  • Memories of previous translations

    • Apply fragments of previous translations to new translation situations

      Available

    • He bought a pen

    • उसने एक कलम खरीदा

    • All ministers have huge houses

    • सभी मंत्रियों के पास बहुत बड़े घर हैं

      New

    • He bought a huge house

    • उसने एक बहुत बड़ा घर खरीदा


Pitfall of Translation Memory

  • German:

    Ein Messer ist im Schrank; er mißt Elektrizität.

    • TM1: Ein Messer ist im Schrank ->

      A meter is in the cabinet.

    • TM2: er mißt Elektrizität ->

      It measures electricity.

  • New situation

    Ein Messer ist im Schrank; er ist sehr scharf.

    • A meter is in the cabinet; it is very sharp (?).

    • Messer in German: Meter/Knife in English.
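
The pitfall can be reproduced with a naive segment-level translation memory. This is a toy sketch: the memory is a plain dictionary keyed by exact segment text, the function name is hypothetical, and the third entry ("er ist sehr scharf") is an assumed addition so that the new sentence can be fully covered.

```python
# Stored segment pairs; the first two correspond to TM1 and TM2.
TRANSLATION_MEMORY = {
    "Ein Messer ist im Schrank": "A meter is in the cabinet",
    "er mißt Elektrizität": "it measures electricity",
    "er ist sehr scharf": "it is very sharp",  # hypothetical extra entry
}


def translate_with_memory(sentence):
    """Translate by exact lookup of ';'-separated segments, ignoring context."""
    segments = [segment.strip() for segment in sentence.rstrip(".").split(";")]
    translated = [
        TRANSLATION_MEMORY.get(segment, f"<no match: {segment}>")
        for segment in segments
    ]
    return "; ".join(translated) + "."


print(translate_with_memory("Ein Messer ist im Schrank; er ist sehr scharf."))
```

Because each segment is retrieved independently, the stored sense "meter" is reused even though the second clause makes "knife" the sensible reading; the memory has no way to notice the clash.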


Ambiguity

Chair


Co-reference Resolution

  • Pronoun

    • Sequence of commands to a robot:

      • Place the wrench on the table.

      • Then paint it.

        • What does "it" refer to? (anaphora: back reference)

    • Learning of his intentions, Shivaji went to meet Afjal Khan, prepared with concealed weapons.

      • Who does "his" refer to? (cataphora: forward reference)

  • Hypernymic

    • Children love to see lions. These animals, however, are becoming extinct.


Ellipsis

Sequence of commands to a robot:

Move the table to the corner.

Also the chair.

The second command must be completed using the first part of the previous command.

