LDMT MURI Data Collection and Linguistic Annotations
November 4, 2011. Jason Baldridge (UT Austin) and Ulf Hermjakob (USC/ISI). Purpose: collect and build data (monolingual text, bilingual text, linguistic annotations) to support work on machine translation for Kinyarwanda-English and Malagasy-English.

LDMT MURI Data Collection and Linguistic Annotations

November 4, 2011

Jason Baldridge, UT Austin

Ulf Hermjakob, USC/ISI


Purpose

Collect and build data

  • Monolingual text

  • Bilingual text

  • Linguistic annotations

    to support work on machine translation for

  • Kinyarwanda-English

  • Malagasy-English


Overview

  • Source, type and size of data

  • Language consultants

  • Kinyarwanda data

  • Malagasy data

  • Annotation

  • An idea

  • Accomplishments, challenges, future releases


Text sources

  • Bible (highly multilingual parallel corpus)

  • Dictionaries, phrasebooks

  • Interview transcripts

  • Newspapers


Kinyarwanda Data Resources (word counts)

  • Columns: ENG treebank, ENG text, word align, KIN text, KIN treebank

  • ENGLISH monolingual (huge): PTB (1m), GWord (8b)

  • BILINGUAL (16k): KGMC (5.8k), KGMC (4.8k), NONE, Dict (9k), Dict (8k), Pbook (0.9k), Pbook (0.7k)

  • KINYARWANDA monolingual (7m): News (7m)

  • Legend: 1.0 Release, 2.0 Release


Kinyarwanda Data Resources (word counts)

  • Columns: ENG treebank, ENG text, word align, KIN text, KIN treebank

  • NOTE: no gold morph-split text

  • ENGLISH monolingual (huge): PTB (1m), GWord (8b)

  • BILINGUAL (285k): KGMC (270k), KGMC (225k), KGMC (3.8k), KGMC (2.9k), KGMC (5.8k), KGMC (4.8k), NONE, Dict (9k), Dict (8k), Pbook (0.9k), Pbook (0.7k), BBC (0.3k), BBC (0.3k), BBC (0.3k), BBC (0.3k), IGT (0.1k), IGT (0.1k), IGT (0.06k), IGT (0.06k)

  • KINYARWANDA monolingual (7m): News (7m)

  • Legend: 1.0 Release, 2.0 Release


Malagasy Data Resources

  • Columns: ENG treebank, ENG text, word align, MLG text, MLG treebank

  • ENGLISH monolingual (huge): PTB (1m), GWord (8b)

  • BILINGUAL (730k): Bible (730k), Bible (725k), NONE

  • MALAGASY monolingual (zero): none

  • Legend: 1.0 Release, 2.0 Release


Malagasy Data Resources

  • Columns: ENG treebank, ENG text, word align, MLG text, MLG treebank

  • NOTE: no gold morph-split text

  • ENGLISH monolingual (huge): PTB (1m), GWord (8b)

  • BILINGUAL (732k): Bible (730k), Bible (725k), NONE, News (2.1k), News (2.1k), News (2.3k), News (2.3k)

  • MALAGASY monolingual (zero): none

  • Legend: 1.0 Release, 2.0 Release


Quality of Original Texts

  • Perfectly clean: English Bible

  • Reasonably edited: Newspapers (kin/mlg)

  • Uneven editing: Genocide protocols

    • Spelling errors

    • Missing/sloppy punctuation

    • Untranslated text (missing or still in source language)

    • Mistranslations: the Kinyarwanda word ikaragiro (which means dairy) is repeatedly translated as diary, e.g. “... over there, the houses that belong to the diary.”


Native speaker consultants

  • UT reached out to speakers of both languages

  • Kinyarwanda

    • Several speakers near Austin

    • Most would like some payment

    • One has helped with translation and consultation

  • Malagasy speakers

    • Many speakers from around US and Canada

    • Most would like some payment

    • Two have helped with translations


Native speaker consultants

  • At this point, UT needs access to paid informants.

    • Need texts from other genres translated

    • Need to ask questions about meanings of some sentences for linguistic analysis

  • The CMU-Rwanda initiative may provide us with a further avenue for obtaining consultants for Kinyarwanda.

    • Also a potential source of data



KGMC Transcripts

  • Collaboration between Kigali Genocide Memorial Center and the Human Rights Documentation Initiative at UT Austin Library

    • http://www.lib.utexas.edu/hrdi/

    • http://www.kigalimemorialcentre.org

  • Transcriptions of survivor testimonies filmed for the Genocide Archive Rwanda

    http://www.genocidearchiverwanda.org.rw/index.php/Welcome_to_Genocide_Archive_Rwanda

    http://www.genocidearchiverwanda.org.rw/index.php/Kmc00005-sub2-eng-glifos


KGMC Data

  • 48 translated transcripts

    • all translated into English

    • 33 into French

  • 41 untranslated transcripts (only Kinyarwanda)


KGMC Data

  • Original format: Microsoft Word, in tables


KGMC Data normalization

  • Converted to XML using a semi-automatic process

  • Each language represented side-by-side

  • Script to process the MS Word format

    • Iteratively modified based on output and error detection

    • Needed to handle missing data and misalignments between time spans across translations

  • Final manual verification and correction of each file.
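
The normalization steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for the release: the row layout (a time span plus one text cell per language), the element and attribute names (`transcript`, `segment`, `span`, `text`, `lang`, `missing`), and the helper `rows_to_xml` are all assumptions, and the sample rows are placeholders. It uses only Python's standard `xml.etree.ElementTree` and records missing translations explicitly so they can be reviewed during manual verification.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one dict per table row extracted from the Word file.
# Keys, language codes, and texts are placeholders, not real transcript data.
ROWS = [
    {"span": "00:00:05-00:00:12", "kin": "kin sentence 1",
     "eng": "eng sentence 1", "fre": "fre sentence 1"},
    {"span": "00:00:12-00:00:20", "kin": "kin sentence 2",
     "eng": "eng sentence 2", "fre": ""},
]

LANGS = ("kin", "eng", "fre")


def rows_to_xml(rows, langs=LANGS):
    """Build one <segment> per row, with the languages side by side."""
    root = ET.Element("transcript")
    missing = []  # (span, lang) pairs to flag for manual checking
    for row in rows:
        seg = ET.SubElement(root, "segment", span=row["span"])
        for lang in langs:
            text = (row.get(lang) or "").strip()
            node = ET.SubElement(seg, "text", lang=lang)
            if text:
                node.text = text
            else:
                # Keep the empty slot so the languages stay aligned.
                node.set("missing", "true")
                missing.append((row["span"], lang))
    return root, missing


root, missing = rows_to_xml(ROWS)
xml_string = ET.tostring(root, encoding="unicode")
```

Keeping an explicit empty element for a missing translation, rather than dropping it, is one way to keep the side-by-side layout intact while still making gaps easy to list.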


Example XML



Malagasy Bible

  • Online version of 1865 Malagasy Bible

    • http://www.madapourchrist.org/

  • Preparation:

    • Convert HTML to text

    • Align with the NET Bible (New English Translation) using verses

    • Currently have 686 chapters aligned

  • Obvious problem: 150-year-old Malagasy text
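
Verse-level alignment as described in the preparation steps can be sketched as below. This is only an illustration under stated assumptions: each Bible is assumed to be already parsed into a dict keyed by `(book, chapter, verse)`, the function name `align_by_verse` is hypothetical, and the verse texts are placeholders. Verses present on only one side are reported rather than silently dropped.

```python
def align_by_verse(mlg, eng):
    """Pair verses that share a (book, chapter, verse) key in both texts."""
    shared = sorted(set(mlg) & set(eng))
    pairs = [(key, mlg[key], eng[key]) for key in shared]
    only_mlg = sorted(set(mlg) - set(eng))  # verses missing on the English side
    only_eng = sorted(set(eng) - set(mlg))  # verses missing on the Malagasy side
    return pairs, only_mlg, only_eng


# Placeholder data; keys and texts are illustrative only.
mlg_bible = {
    ("GEN", 1, 1): "mlg verse text 1",
    ("GEN", 1, 2): "mlg verse text 2",
    ("GEN", 1, 3): "mlg verse text 3",
}
eng_bible = {
    ("GEN", 1, 1): "eng verse text 1",
    ("GEN", 1, 2): "eng verse text 2",
    ("GEN", 1, 4): "eng verse text 4",
}

pairs, only_mlg, only_eng = align_by_verse(mlg_bible, eng_bible)
```

Differences in verse numbering between translations would show up in `only_mlg` and `only_eng`, which gives a starting point for checking chapters that fail to align.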


Malagasy Dictionary

  • Online dictionary of Malagasy

    • http://malagasyworld.org

  • 63k words

    • English definitions for 8000 words

    • French definitions for 10,000 words

  • Includes parts-of-speech, mostly coarse-grained (noun, verb, adjective, etc.)


Malagasy Dictionary

  • Scraped and processed to produce clean XML


Malagasy texts

  • Texts from six webpages

    • 3 from Lakroa: http://www.lakroa.mg/

    • 3 from Lagazette: http://www.lagazette-dgi.com/

  • Translated by native speakers to English to create small parallel corpus for initial analysis and annotation.



Morphological analysis

  • UT Austin obtained and adapted the XFST analyzer created by Dalrymple, Liakata and Mackie (2006).

  • Applied it to the Malagasy website texts from Lakroa and Lagazette, hand-selecting the correct analysis for each word.

  • These need to be integrated with the standard tokenization and data organization.

  • Kinyarwanda morph analyzer in development.


Syntactic annotations

  • Did initial pilot annotations with example sentences from the linguistics literature.

  • Annotated KGMC (kin) and Lagazette and Lakroa (mlg) texts with phrase structures.

    • Used a fairly standard set of labels and structures

    • Trees created for both the source language sentences and their English translations


Example KGMC tree



Syntactic annotations

  • Phrase structures were created before standardizing the tokenization; had to be grafted back onto correct tokens.

  • Current trees are still pilot annotations! Need to do many things, including:

    • reconsider the choice of node labels

    • add head markers (enable easy conversion to dependency analyses)

    • review and incorporate feedback from others

    • graft some existing trees to standard tokenization
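
To illustrate why head markers make conversion to dependency analyses easy, here is a small sketch. Everything about the representation is an assumption for illustration: trees are nested tuples of the form `(label, child, ...)`, leaves are `(POS, word)`, and the head child of each phrase carries a `-H` suffix on its label. The function names are hypothetical, and words are used instead of token indices only for brevity.

```python
def lexical_head(tree):
    """Return the head word of a tree by following head-marked children."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]  # leaf: (POS, word)
    heads = [c for c in children if c[0].endswith("-H")]
    if len(heads) != 1:
        raise ValueError(f"{label}: expected exactly one head child")
    return lexical_head(heads[0])


def to_dependencies(tree, deps=None):
    """Collect (dependent, head) word pairs from a head-marked tree."""
    if deps is None:
        deps = []
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return deps  # leaves contribute no arcs of their own
    head_word = lexical_head(tree)
    for child in children:
        if not child[0].endswith("-H"):
            # Each non-head child attaches to the head word of its parent.
            deps.append((lexical_head(child), head_word))
        to_dependencies(child, deps)
    return deps


# English toy sentence; "-H" marks the head child of each phrase.
tree = ("S",
        ("NP", ("DT", "the"), ("NN-H", "dog")),
        ("VP-H", ("VBD-H", "barked")))

deps = to_dependencies(tree)
```

In this sketch a phrase without exactly one head-marked child raises an error, which doubles as a consistency check on the head annotation itself.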



Data-driven Dictionary Development

  • Current dictionary size is moderate

    • 6,632 entries with 3,890 distinct Kin. words/phrases

    • Many relatively common words not covered

  • Idea: increase dictionary size using translators

    • Based on data analysis of monolingual corpora

    • Using NLP techniques to leverage the process

  • Goals

    • Additional bitext for direct use in MT training

    • Improved resource for morphological analyzers


Data-driven Dict. Dev. (Example)

  • Monolingual Kinyarwanda corpus contains

    • ikinini (43 occ.), ibinini (96 occ.); not in dictionary

  • Automatically predict lexical form(s), POS

    • ikinini (noun, plural: ibinini)

  • Elicit English translation: pill, tablet

    • providing examples from corpus in context

  • Generate dictionary entry as well as MT bitext

    • ikinini=pill, ikinini=tablet, ibinini=pills, ibinini=tablets
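
The example above could be prototyped along the following lines. This is a hedged sketch rather than the project's method: it assumes a single singular/plural prefix pair (`iki`/`ibi`), an empty placeholder dictionary, hypothetical function names, and a naive English pluralization that simply appends "s". The word counts are the ones given in the example.

```python
from collections import Counter

# Word counts from the example; the dictionary contents are a placeholder.
corpus_counts = Counter({"ikinini": 43, "ibinini": 96})
dictionary = set()

# Assumed singular/plural prefix pair; a real system would need to cover
# all noun classes and their phonological variants.
SG_PREFIX, PL_PREFIX = "iki", "ibi"


def predict_noun_pairs(counts, known):
    """Propose (singular, plural) noun candidates missing from the dictionary."""
    pairs = []
    for word in sorted(counts):
        if word.startswith(SG_PREFIX) and word not in known:
            plural = PL_PREFIX + word[len(SG_PREFIX):]
            if plural in counts:  # both forms attested in the corpus
                pairs.append((word, plural))
    return pairs


def make_entries(singular, plural, translations):
    """Expand elicited English translations into bitext entries."""
    entries = [f"{singular}={t}" for t in translations]
    # Naive English pluralization (adds "s"); irregular nouns need a lookup.
    entries += [f"{plural}={t}s" for t in translations]
    return entries


candidates = predict_noun_pairs(corpus_counts, dictionary)
entries = make_entries("ikinini", "ibinini", ["pill", "tablet"])
```

Requiring both forms to be attested in the corpus is a conservative design choice: it reduces spurious candidates at the cost of missing nouns that occur in only one number.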



Accomplishments

  • Released monolingual, bilingual, and tree-banked data for Kinyarwanda and Malagasy

    • Data release v1.0 in February 2011

    • Data release v2.0 in October 2011

  • Tools that can be shared

    • Tokenizer for Kinyarwanda and Malagasy

    • Diagnostic tools to check encoding, character sets, tokenization, tree well-formedness etc.
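
As an indication of what such diagnostic checks might look like, here is a minimal sketch. It is not the released tooling: it assumes trees are stored as parenthesized bracket strings, and both function names are hypothetical. The first check tests tree well-formedness at the bracket level; the second flags characters outside the Unicode letter, number, punctuation, and separator categories, which tends to catch control characters and replacement characters left by encoding problems.

```python
import unicodedata


def check_brackets(tree_string):
    """Return True if parentheses are balanced and form a single tree."""
    depth = 0
    closed_roots = 0
    for ch in tree_string:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing bracket without an opening one
            if depth == 0:
                closed_roots += 1
    return depth == 0 and closed_roots == 1


def unexpected_characters(text, allowed_extra=""):
    """List characters that are not letters, numbers, punctuation, or separators."""
    bad = []
    for ch in text:
        category = unicodedata.category(ch)  # e.g. "Lu", "Nd", "Po", "Zs"
        if category[0] in "LNPZ" or ch in allowed_extra:
            continue
        bad.append(ch)
    return bad
```

For example, `check_brackets("(S (NP (NN dog)) (VP (VBD barked)))")` accepts a single balanced tree, while a string with an unclosed bracket or with two separate top-level trees is rejected.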


Challenges

  • Need for more and better annotation tools to annotate faster and assure consistency

    • sentence segmentation, treebanking, ...

  • Need guidelines, workflow for data acquisition and annotation process

  • Need reliable language experts for Kinyarwanda and Malagasy

  • Need more data: Wikipedia, LDS, mlg/fre


Data release v2.1 (target: Dec. 2011)

  • Full sentence-level segmentation on Kinyarwanda-English text

  • Release tokenizers, morph analyzers, diagnostic tools


Data release v3.0 (target: May 2012)

Highest priority

  • In-domain bilingual test sets

    • 500 sentences (300 newswire, 200 conversation)

    • Naturally occurring, source texts on both sides

    • Multiple translations if possible

  • Large, modern Malagasy monolingual corpora

  • Head markings (syntactic)

  • Word alignment


Data release v3.0 (target: May 2012)

Next priority

  • Increase size of Kinyarwanda-English dictionary

  • More Malagasy-English news bitext

  • Typo correction

  • Bible in Kinyarwanda (?)

  • Malagasy-English dictionary

  • Morphological gold standard

