Spoken language corpus project

SPOKEN LANGUAGE CORPUS PROJECT

SPOKEN CORPORA FOR THE 9 OFFICIAL SOUTH AFRICAN AFRICAN LANGUAGES


Workshop overview

The Asmara Declaration – Rusandre

What’s the point of spoken language corpora? – Jens

Overview of the project and its phases – Rusandre

The recording phase – Jens/Mmem

The transcription phase – Jens

The checking phase – Jens

The tagging phase – Leif/Rusandre

Research output – Jens



The Asmara Declaration, 2000

Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.

All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.



The Asmara Declaration, cont.

Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.

The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.



What’s the point of spoken language corpora?

Jens Allwood

  • Corpus linguistics / Armchair linguistics


Project management


Objectives

  • To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of South Africa

  • The resources will be in the form of

    • archived audio-visual recordings of activity-based natural language use;

    • machine-readable transcriptions of recordings for corpus-driven searches;

    • morphologically tagged corpora for corpus-based searches.


Project phases, 2002–2004

  • Ongoing audio-video recordings of activity-based spoken language use (min. 200 hours per language).

  • Transcriptions (enriched with comment lines) of recordings in machine-readable text format.

  • Checking and editing of transcriptions.

  • Manual morphological tagging of corpora.

  • Automated tagging of corpora.

  • Research outputs.


The recording phase

  • What to record

    • Activity types

    • What to think about when recording natural language dialogues

    • Keep it natural

  • The video camera, microphone, etc

    • Keep the camera fixed!


Recording and transcription

Practical exercise!

  • A short recording

  • Transcribe together


Transcription structure

  • Header (background information about transcription and recorded activity)

  • Body (the actual transcription consisting of two kinds of elements)

    • Contributions (transcribed utterances of participants in the recorded activity)

    • Information lines (mark particular aspects of the contributions and the recorded activity)


Example of a header

@ Recorded activity ID: V010501

@ Activity type: Informal conversation

@ Recorded activity title: Getting to know each other

@ Recorded activity date: 20020725

@ Recorder: Britta Zawada

@ Participant: A = F2 (Lunga)

@ Participant: B = F1 (Bukiwe)

@ Transcriber: Mvuyisi Siwisa

@ Transcription date: 20020805

@ Checker: Rusandre Hendrikse

@ Checking date: 20020912

@ Anonymised: No

@ Activity Medium: face-to-face

@ Activity duration: 00:44:30

@ Other time coding: Each section

@ Tape: V0105

@ Section: Family affairs

@ Section: Crime

@ Section: Unemployment

@ Section: Closing

@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga



Transcription header

@ Recorded activity ID: V010501

  • V = Video, 01 = project number

  • 05 = Tape number within this project

  • 01 = Recording number

@ Activity type: Informal conversation

@ Recorded activity title: Getting to know each other

@ Recorded activity date: 20020725

@ Recorder: Britta Zawada
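
The ID breakdown above can be sketched in code. This is only an illustration: it assumes every ID is one medium letter followed by three two-digit fields, as in V010501, and the names `ActivityID`, `parse_activity_id` and `tape_id` are hypothetical, not part of any project tool.

```python
import re
from typing import NamedTuple


class ActivityID(NamedTuple):
    medium: str     # "V" = video (other letters are not documented on the slide)
    project: str    # project number, e.g. "01"
    tape: str       # tape number within the project, e.g. "05"
    recording: str  # recording number on that tape, e.g. "01"


def parse_activity_id(activity_id: str) -> ActivityID:
    """Split an ID such as 'V010501' into its four parts (assumed layout)."""
    match = re.fullmatch(r"([A-Z])(\d{2})(\d{2})(\d{2})", activity_id)
    if match is None:
        raise ValueError(f"unexpected activity ID format: {activity_id!r}")
    return ActivityID(*match.groups())


def tape_id(parsed: ActivityID) -> str:
    """Rebuild the value of the '@ Tape:' header line from a parsed ID."""
    return parsed.medium + parsed.project + parsed.tape


parsed = parse_activity_id("V010501")
print(parsed, tape_id(parsed))
```

The tape value rebuilt this way matches the `@ Tape: V0105` line of the example header.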


Transcription header, cont.

@ Participant: A = F2 (Lunga)

@ Participant: B = F1 (Bukiwe)

  • F stands for female

  • F1 is unique for Bukiwe in the entire corpus

  • A and B are IDs for the participants


Transcription header, cont.

@ Transcriber: Mvuyisi Siwisa

@ Transcription date: 20020805

@ Checker: Rusandre Hendrikse

@ Checking date: 20020912


Transcription header, cont.

@ Anonymised: No

  • Indicates whether personal names, etc. have been changed to pseudonyms (Yes) or not (No) – both in the header and in the conversation

@ Activity Medium: face-to-face

  • Normally spoken, face to face, but could also have other values, like telephone conversations.


Transcription header, cont.

@ Activity duration: 00:44:30

  • Duration in hours, minutes and seconds

@ Other time coding: Each section

  • There is a time line for each section

@ Tape: V0105

  • This is a part of the recorded activity ID


Transcription header, cont.

@ Section: Family affairs

@ Section: Crime

@ Section: Unemployment

@ Section: Closing

@ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga

  • Any relevant information that is not covered by any of the required headings
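
As a rough sketch of how such a header could be read by a program: each line is treated as `@ key: value`, and repeated keys (Participant, Section) are collected into lists. The function names `parse_header` and `duration_seconds` are hypothetical, and the sketch is meant for header lines only, since information lines in the body also start with `@`.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ key: value' header lines into a dict of lists."""
    header = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # skip blank lines and anything that is not a header line
        # Split on the first colon only, so '00:44:30' stays intact.
        key, sep, value = line[1:].partition(":")
        if sep:
            header[key.strip()].append(value.strip())
    return dict(header)


def duration_seconds(hhmmss):
    """Convert an 'hh:mm:ss' duration to seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


example = [
    "@ Recorded activity ID: V010501",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Activity duration: 00:44:30",
    "@ Section: Family affairs",
    "@ Section: Crime",
]
header = parse_header(example)
print(header["Participant"], duration_seconds(header["Activity duration"][0]))
```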


The body

  • This is the actual transcription - the background information is in the header

  • Four kinds of lines:

    • Contribution: $A: uyakhonza kanene

    • Information line: @ < nod >

    • Section line: § At office

    • Time line: # 00:10:00
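
Since each kind of body line starts with its own symbol, a program can tell them apart from the first character. The sketch below assumes exactly the four prefixes listed above; `classify_line` and `split_contribution` are hypothetical helper names.

```python
LINE_KINDS = {
    "$": "contribution",
    "@": "information",
    "§": "section",
    "#": "time",
}


def classify_line(line):
    """Return the kind of a body line, judged by its first character."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    return LINE_KINDS.get(stripped[0], "other")


def split_contribution(line):
    """Split '$A: text' into the speaker ID and the transcribed text."""
    speaker, _, text = line.strip().lstrip("$").partition(":")
    return speaker.strip(), text.strip()


for sample in ["$A: uyakhonza kanene", "@ < nod >", "§ At office", "# 00:10:00"]:
    print(classify_line(sample), "|", sample)
```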


Sections

§ Family affairs

$B: sibabini kuphela esibabalwe sada safunda ke noko sakwazi ukuphangela sikwazi ke noko kuba ndinobhuti wam osebenzayo

...

§ Religion

$B: uyakhonza kanene

$A: ndiyakhonza owu ndiyamthand{a} [4 uthixo ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela

$B: [4 nantso ke sisi e: e: ]4

$B: nantso ke into efunekayo uthixo ulithemba lethu [5 uthixo ulithemba lethu ulixhadi lethu ]5 uligwiba

$A: [5 ulixhadi lethu ulixhadi lethu]5

$B: [6 uligwiba andazi ukuba ndingangendithini ngendiphi na xa uthixo heyi ]6

§ Situation on their arrival at Medunsa

$A: [6 ucinga ukuba ngesiphi na ngesisemedunsa ]6

$B: uye wasithatha khona waza kusibeka kule ndawo

...


Contributions

§ Religion

$B: uyakhonza kanene

$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela

@ < name: God's name >


Overlaps

§ Religion

$B: uyakhonza kanene

$A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela

$B: [4 nantso ke sisi // e: e: ]4

@ < name >
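
The numbered brackets make overlaps machine-searchable. A minimal sketch, assuming an overlap always opens with `[n` and closes with `]n` inside a single contribution; the function name `overlaps` is hypothetical.

```python
import re
from collections import defaultdict

# '[4 ... ]4': the closing bracket must carry the same number as the opening one.
OVERLAP = re.compile(r"\[(\d+)\s*(.*?)\s*\]\1(?!\d)")


def overlaps(contributions):
    """Group overlapped stretches of speech by overlap number."""
    groups = defaultdict(list)
    for line in contributions:
        speaker, _, text = line.strip().lstrip("$").partition(":")
        for number, span in OVERLAP.findall(text):
            groups[int(number)].append((speaker.strip(), span))
    return dict(groups)


lines = [
    "$A: ndiyamthand{a} [4 uthixo ndiyamthanda ]4 kuphela",
    "$B: [4 nantso ke sisi ]4",
]
print(overlaps(lines))
```

Each overlap number then maps to the stretches that were spoken simultaneously, one per participant.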


Contrastive stress, pauses and lengthening

$B: abanyeke bazihlalele nje:/abanyeABAZANGE bafune sikolo //uyayiqonda ke la meko yokungabikho mzali uqhubayo /uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi //kodwa ke //andigxeki nto kuba ke /ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2

$A: [2 ya /m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
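
A small sketch of how these marks could be pulled out of a contribution. It assumes, from the example rather than from a stated rule, that capitals mark contrastive stress, that `/` and `//` mark pauses of different length, and that `:` after a sound marks lengthening. The speaker prefix (`$B:`) should be removed first; `prosody_marks` is a hypothetical name.

```python
import re


def prosody_marks(utterance):
    """Find pause, stress and lengthening marks in the text of a contribution."""
    return {
        # One or more slashes in a row count as a single pause mark.
        "pauses": re.findall(r"/+", utterance),
        # Runs of two or more capitals are taken as contrastive stress.
        "stressed": re.findall(r"[A-Z]{2,}", utterance),
        # A colon directly after a word is taken as lengthening.
        "lengthened": re.findall(r"(\w+):", utterance),
    }


text = "bazihlalele nje:/abanyeABAZANGE bafune sikolo //uyayiqonda"
print(prosody_marks(text))
```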


Unclear speech and glottal stop

$M: loo nto ke njengo{ku}ba sekunyanzeleke ukuba ndiye phaya nje (...) ndikwazi ukuncedisa phaya ndiyiphushile ukwenzela ukuba ndibe neclaim endizakuba nayo that is why ndithole because ndiyaclaimer so that at least uba ndiclayimile ndikwazi ukuhamba

$T: ke ngoku ke yenye yezinto endifuna ukuyoyenza

$M: ngolwesithathu (what she said to me ngoku bendiphaya ngecawe) besingcwaba umfazi kasicaka jama

$T: e’e andekufuni ukutya


Comment lines

$A: kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa

@ < loan English: bag >

@ < gesture: hand wipes >

$B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu< >

@ < name: clan name >

@ < comment: A drops her book >
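
In these examples each angle-bracketed stretch in a contribution is explained, in order, by one of the information lines that follow it. The sketch below pairs them up on that assumption; `pair_comments` is a hypothetical name, and if the counts differ the extra items are simply dropped.

```python
import re

ANGLE = re.compile(r"<([^<>]*)>")


def pair_comments(contribution, info_lines):
    """Pair each <...> stretch in a contribution with its information line."""
    spans = [span.strip() for span in ANGLE.findall(contribution)]
    notes = []
    for line in info_lines:
        match = ANGLE.search(line)
        if match:
            notes.append(match.group(1).strip())
    # zip stops at the shorter list, so unmatched items are left out.
    return list(zip(spans, notes))


pairs = pair_comments(
    "$A: kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa",
    ["@ < loan English: bag >", "@ < gesture: hand wipes >"],
)
print(pairs)
```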


Research output

Jens Allwood

  • A distributed database (corpus)

  • Networks (homepages)

  • Spoken language corpus activities (seminars, workshops)


Tagging spoken language samples

Problematic issues, conventions and standards

A P Hendrikse – 16/03/04


Problematic issues

  • Loans and codeswitching

  • Fixed expressions

  • Spoken language reductions

  • Morphophonological issues

  • Designing a tag set

  • Manual tagging

  • A drag-and-drop tagger

  • Automated tagging


Loans and codeswitching

  • Non-indigenised codeswitching

    ndifuna <fish and chips>

  • Indigenised but non-standardised codeswitching – loans

    <ndiyaclaimisha> > ndiyakleyimisha?

    ndiyaklayimisha?

    <Ndiyaphonisha> ndiyafonisha?

    ndiyafowunisha?


Fixed expressions

  • A continuum:

    Idioms/proverbs – prefabricated expressions – collocations

  • How fixed is fixed?

    Into yokuba (*izinto zokuba)

    Nantso ke (*nantsi ke?)

    (Ke) kaloku (ke)

    Bafondini/mfondini

    Undincedile

    Ungadinwa nangomso


Fixed expressions, cont.

  • Flagging fixed phrases

    Into_yokuba

    Ke_kaloku_ke

  • Morphosyntactic tagging or not?

    Ke<<adv>>_kaloku<<adv>>_ke<<adv>><<adv>>

    Or

    Ke_kaloku_ke<<adv>>
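
Flagging with underscores is easy to automate once a list of fixed phrases exists. The sketch below is one possible approach, not the project's tool: it assumes whitespace-separated tokens and a hand-made phrase list, prefers the longest phrase at each position, and uses the hypothetical name `flag_fixed`.

```python
def flag_fixed(tokens, fixed_phrases):
    """Join the tokens of known fixed phrases with underscores."""
    # Try longer phrases first so 'ke kaloku ke' wins over 'kaloku ke'.
    phrases = sorted(
        (p.lower().split() for p in fixed_phrases if p.split()),
        key=len,
        reverse=True,
    )
    flagged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            window = tokens[i:i + len(phrase)]
            if [t.lower() for t in window] == phrase:
                flagged.append("_".join(window))
                i += len(phrase)
                break
        else:
            # No fixed phrase starts here: keep the token unchanged.
            flagged.append(tokens[i])
            i += 1
    return flagged


print(flag_fixed("ke kaloku ke into yokuba".split(),
                 ["ke kaloku ke", "into yokuba", "kaloku ke"]))
```

Whether the flagged unit then gets one tag or one tag per part is the open question raised on the slide; the sketch does not decide it.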


Spoken language reductions

  • Standardised reductions

    Ngokuba > ngoba

    Written standard reduction: reconstruction convention {} not used, i.e. *ngo{ku}ba

  • Non-standardised reductions

    Musa ukuhamba > sukuhamba (written standard reduction) > suhamba (non-standardised)


Spoken reductions, cont.

  • Reconstruction convention

    S{uku}hamba

  • Tagged

    S<<aux>>{uku<<inf>>}hamb<<vstem>>a<<basicv>><<v>>
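
A rough sketch of how the reconstruction convention could be used in searches, assuming the braces enclose material that belongs to the full form but was not pronounced (as in njengo{ku}ba and ndiyamthand{a} from the transcription examples). The function names are hypothetical. Note that S{uku}hamba does not reduce to the spoken form suhamba by simple deletion, so `spoken_form` should be treated as approximate.

```python
import re

BRACES = re.compile(r"\{([^{}]*)\}")


def reconstructed_form(token):
    """Keep the material in braces: the full (reconstructed) form."""
    return BRACES.sub(r"\1", token)


def spoken_form(token):
    """Drop the material in braces: an approximation of what was said."""
    return BRACES.sub("", token)


for token in ["njengo{ku}ba", "ndiyamthand{a}", "S{uku}hamba"]:
    print(token, "->", reconstructed_form(token), "/", spoken_form(token))
```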


Morphophonological issues

  • Coalescence

    Nenkomo > ne<<ass>>n<<n9>>komo<<n>>

    Neenkomo > ne<<ass>>en<<n9>>komo<<n>>

  • Syllabification

    Ngasendl{w}ini > nga<<inscon>>se<<locgen>>n<<n9>>dl{w}<<nstem>>ini<<locsuf>>

    Ayikafiki > ayi<<nind9>>ka<<excl>>fik<<vstem>>i<<negv>>


Morphophonological issues, cont.

  • Elision

    Andinamoto > andi<<nindI>>na<<posscop>><<n9>>moto<<nstem>><<cop>>

  • Stem modifications

    Emlanjeni > e<<locgen>>m<<n3>>lanj<<nstem>>eni<<locsuf>><<adv>>


Designing a tag set

  • Granularity

  • Lexical categories

    N, V (Tagging lexical categories is problematic in an agglutinating language)

  • Syntagmatic morphological slots

    amadodana > a<<pp>>ma<<gnp>>dod<<n>>ana<<suf>>
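
The slot-by-slot tagging shown above can be read back into morph and tag pairs. A minimal sketch, assuming tags are always written as `<<tag>>` directly after the morph they describe; consecutive tags are grouped with the preceding morph, and `parse_tagged` and `surface` are hypothetical names.

```python
import re

# A morph (possibly empty) followed by one or more <<tag>> markers.
SEGMENT = re.compile(r"([^<>]*)((?:<<[^<>]+>>)+)")
TAG = re.compile(r"<<([^<>]+)>>")


def parse_tagged(word):
    """Return a list of (morph, [tags]) pairs for a tagged word."""
    return [(morph, TAG.findall(tags)) for morph, tags in SEGMENT.findall(word)]


def surface(word):
    """Strip all tags to recover the untagged word."""
    return TAG.sub("", word)


tagged = "a<<pp>>ma<<gnp>>dod<<n>>ana<<suf>>"
print(surface(tagged), parse_tagged(tagged))
```

For a word such as e<<locgen>>m<<n3>>lanj<<nstem>>eni<<locsuf>><<adv>>, the last pair carries both the suffix tag and the word-level tag.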


Designing a tag set, cont.

  • Paradigmatic instantiations within a syntagmatic slot

    gnp = <<n1>>---<<n15>>

  • Word categories

  • nje (wenjenje) <<adv>>

    nje<<demvI>>; njalo<<demvII>>; njeya<<demvIII>>

  • ke <<adv>>

  • ke<<adv>> kaloku<<adv>> ke<<adv>>

  • ke<<res>> kaloku<<adv>> ke<<res>>

  • ke_kaloku_ke<<adv>>

  • e<<locgen>>m<<n3>>lanj<<nstem>>eni<<locsuf>>??


Designing a tag set, cont.

  • Spoken language expressions

  • Non-word-like expressions – 2 problems

    • Standardising orthographic representation

    • Tags

      e:<<feedb>> mh:<<feedb>>

      uh_uh_uh<<ocm>>


Designing a tag set, cont.

  • Word-like expressions

    <<n1a>>thixo<<n>>

    Thixo<<ocm>>

    Thixo<<feedb>>

    Heyi_wethu

    Nantso_ke

    Suka_(wena)


Manual tagging

  • Manual tagging necessary for 3 reasons

    • Identifying tagging problems and problematic phenomena and revising the tag set

    • Developing a training corpus

    • Correcting automated tagging errors

  • Manual (typing) tagging not ideal

    • Tedious

    • Error-prone

  • Solution: Drag-and-drop tagger


Drag-and-drop tagger

  • Demonstration of drag-and-drop tagger

