
Spoken Language Corpora for the Official African Languages of South Africa



Presentation Transcript


  1. Spoken Language Corpora for the Official African Languages of South Africa
     Jens Allwood: Göteborg University, Department of Linguistics
     Leif Grönqvist: Växjö University, School of Mathematics and Systems Engineering; Göteborg University, Department of Linguistics
     Allwood & Grönqvist

  2. Background
     • Corpus work in Gothenburg
     • A project cooperation with UNISA (University of South Africa) in Pretoria
     • Financed by SIDA and NRF
       • Travel money for Göteborg
       • Some more money for Pretoria covering practical corpus work

  3. Why?
     • Creating support for survival of endangered languages
     • Linguistic corpora are very important resources for a language
     • Spoken language corpora
       • Unexplored
       • Speech recognition/synthesis
       • Language learning
       • Standardization

  4. Who?
     • African Languages: Ncedile, Mmemesi
     • Linguistics (UNISA): Rusandré Hendrikse, Mvuyesi
     • Linguistics (Göteborg): Jens, Leif

  5. Allwood & Grönqvist

  6. THE ASMARA DECLARATION – 2000 (UNESCO)
     • Dialogue among African languages is essential: African languages must use the instrument of translation to advance communication among all people, including the disabled.
     • All African children have the inalienable right to attend school and learn in their mother tongues. All effort should be made to develop African languages at all levels of education.

  7. THE ASMARA DECLARATION – 2000 (UNESCO), cont’d
     • Promoting research on African languages is vital for their development, while the advancement of African research and documentation will be best served by the use of African languages.
     • The effective and rapid development of science and technology in Africa depends on the use of African languages and modern technology must be used for the development of African languages.

  8. OBJECTIVES
     • To develop a platform of computer-supported basic linguistic resources for the previously disadvantaged languages of SA
     • The resources will be in the form of
       • Archived audio-visual recordings of activity-based natural language use
       • Machine-readable transcriptions of recordings for corpus-driven searches
       • Morphologically tagged corpora for corpus-based searches
       • Other kinds of analysis, manual or automatic

  9. Spoken language corpora for:
     • Xhosa
     • Zulu
     • Ndebele
     • Siswati
     • Southern Sotho
     • Tswana, Tsonga, Venda
     • Northern Sotho (Pedi)
     • Afrikaans
     • English

  10. PROJECT MANAGEMENT

  11. PROJECT PHASES: 2002-2004
      • Ongoing audio-video recordings of activity-based spoken language use (min. 200 hrs p/l).
      • Transcriptions (enriched with comment lines) of recordings in machine-readable text format.
      • Checking and editing of transcriptions.
      • Manual morphological tagging of corpora.
      • Automated tagging of corpora.
      • Research outputs.

  12. Workshop overview
      • The Asmara Declaration - Ncedile
      • What’s the point of spoken language corpora? – Jens
      • Overview of the project and its phases – Rusandré
      • The recording phase – Jens/Mmemesi
      • The transcription phase – Jens/Mvuyesi
      • The checking phase – Jens/Ncedile
      • The tagging phase – Leif/Rusandré
      • Research output - Jens

  13. The workshops, etc.
      • Seminars at UNISA, Pretoria
      • Rhodes University, Grahamstown
      • University of the Transkei, Umtata
      • Natal University, Durban
      • Other places

  14. Contacts from the workshops
      • Durban
        • IsiZulu Programme, University of Durban: NN Gumede, CT Gumede, NP Ndimande, NN Mathonsi
        • IsiZulu Programme, University of Natal: NS Turner, S Naidoo, CNT Ntshangase, MP Kufa, SE Ximba
      • Grahamstown
        • African Languages, Rhodes University: Bulelwa Nosilela, John Claughton, Ntosh Mazwi
        • ISEA, Rhodes University: Prof Laurence Wright, Ms Cossie Rasana
        • Vista, Port Edward: Prof BB Mkonto
        • SAUL, Fort Hare: Mr Zandisile Wilberforce
        • Dept. Sport, Arts & Culture, Grahamstown: Vaugham Japtha
      • Umtata
        • UNITRA, African Languages: RM Nakin, N Vapi

  15. The transcription header
      @ Recorded activity ID: V010501
      @ Activity type: Informal conversation
      @ Recorded activity title: Getting to know each other
      @ Recorded activity date: 20020725
      @ Recorder: Britta Zawada
      @ Participant: A = F2 (Lunga)
      @ Participant: B = F1 (Bukiwe)
      @ Transcriber: Mvuyisi Siwisa
      @ Transcription date: 20020805
      @ Checker: Rusandre Hendrikse
      @ Checking date: 20020912
      @ Anonymised: No
      @ Activity Medium: face-to-face
      @ Activity duration: 00:44:30
      @ Other time coding: Each section
      @ Tape: V0105
      @ Section: Family affairs
      @ Section: Crime
      @ Section: Unemployment
      @ Section: Closing
      @ Comment: Medunsa open ended conversation between two adult speech therapy students Bukiwe and Lunga
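The header fields on this slide all follow an `@ key: value` pattern, and some keys (Participant, Section) occur more than once. A minimal Python sketch of reading such a header is given here; the function name `parse_header` and the choice to store every key as a list are assumptions made for illustration, not part of the project's own tooling.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect '@ key: value' lines into a dict of lists (keys may repeat)."""
    fields = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue  # not a header line
        # Split on the first colon only, so values such as 00:44:30 stay intact.
        key, sep, value = line[1:].partition(":")
        if not sep:
            continue  # no colon, so not a key/value line
        fields[key.strip()].append(value.strip())
    return dict(fields)


# A few header lines copied from the slide.
header = parse_header([
    "@ Recorded activity ID: V010501",
    "@ Activity type: Informal conversation",
    "@ Participant: A = F2 (Lunga)",
    "@ Participant: B = F1 (Bukiwe)",
    "@ Activity duration: 00:44:30",
    "@ Section: Family affairs",
    "@ Section: Crime",
])
print(header["Participant"])
```

Keeping repeated keys as lists preserves the order of participants and sections, which a plain dictionary assignment would overwrite.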

  16. Contrastive stress, pauses and lengthening
      $B: abanyeke bazihlalele nje:/abanyeABAZANGE bafune sikolo //uyayiqonda ke la meko yokungabikho mzali uqhubayo /uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi //kodwa ke //andigxeki nto kuba ke /ndibakhona ngethuba le ngxaki nobhuti ke [2 abeyinkxaso kakhulu ]2
      $A: [2 ya /m: ewe ]2 hayi izinto zikuthixo azikho kuthi nam obu bushuman bam ndiseza kutshata ndiseza kutshata
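The slide title suggests how to read the markup in this sample: words in capitals appear to mark contrastive stress, `/` and `//` pauses, and a colon after a sound lengthening. Under exactly those assumptions (they are inferred from the example, not stated on the slide), a short Python sketch can pull the three kinds of marks out of an utterance. The function name `prosody_marks` is hypothetical.

```python
import re


def prosody_marks(utterance):
    """Return (stressed words, pause markers, lengthened words) for one utterance."""
    # Drop overlap brackets such as "[2" and "]2" so they are not mistaken for text.
    text = re.sub(r"[\[\]]\d+", " ", utterance)
    stressed = re.findall(r"\b[A-Z]{2,}\b", text)   # assumed: capitals = contrastive stress
    pauses = re.findall(r"/+", text)                # assumed: "/" and "//" = pauses
    lengthened = re.findall(r"\b\w+:", text)        # assumed: trailing ":" = lengthening
    return stressed, pauses, lengthened


# Excerpts copied from the slide.
marks_b = prosody_marks(
    "uthi aba baza emva kwam bobabini ABAZANGE bafunde kuyaphi //kodwa ke //andigxeki nto"
)
marks_a = prosody_marks("[2 ya /m: ewe ]2 hayi izinto zikuthixo")
print(marks_b)
print(marks_a)
```

The pattern for stressed words only finds capitals that form a whole word, so a run-together form such as `abanyeABAZANGE` would need separate handling.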

  17. Overlaps
      § Religion
      $B: uyakhonza kanene
      $A: ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela
      $B: [4 nantso ke sisi // e: e: ]4
      @ < name >
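In this example the overlapping stretches of the two speakers carry the same index, `[4 ... ]4`. Assuming that an opening `[n` is always closed by a `]n` with the same number inside one utterance, the overlapped text can be grouped by index with a small Python sketch. The function name `find_overlaps` and the (speaker, text) input format are illustrative assumptions.

```python
import re

# "[n", the overlapped text, then "]n" with the same index n.
OVERLAP = re.compile(r"\[(\d+)\s+(.*?)\s*\]\1(?!\d)", re.S)


def find_overlaps(utterances):
    """Map each overlap index to a list of (speaker, overlapped text) pairs."""
    found = {}
    for speaker, text in utterances:
        for match in OVERLAP.finditer(text):
            index, span = match.group(1), match.group(2)
            found.setdefault(index, []).append((speaker, span))
    return found


# Utterances copied from the slide.
overlaps = find_overlaps([
    ("A", "ndiyakhonza owu ndiyamthand{a} [4 < uthixo > ndiyamthanda andisoze "
          "ndimlahle undibonisile ukuba mkhulu nantso ke into efunekayo qha ]4 kuphela"),
    ("B", "[4 nantso ke sisi // e: e: ]4"),
])
print(overlaps["4"])
```

An index that appears for only one speaker would be a candidate for the checking phase, since an overlap needs at least two participants.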

  18. Comment Lines
      $A: kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa
      @ < loan English: bag >
      @ < gesture: hand wipes >
      $B: esingazi lo mntwana ngoba kaloku siza apha asazi mntu < wakwandungwana > ukuba wayengekho ngesasitheni na asazi mntu< >
      @ < name: clan name >
      @ < comment: A drops her book >
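In both utterances on this slide, the number of `< ... >` spans matches the number of `@ < ... >` comment lines that follow. Assuming the comment lines are listed in the same order as the spans they describe (an inference from the example, not a stated rule), they can be paired up as in this Python sketch; `pair_comments` is a hypothetical helper name.

```python
import re

ANGLE = re.compile(r"<([^<>]*)>")


def pair_comments(utterance, comment_lines):
    """Pair each '< ... >' span in an utterance with the comment line at the same position."""
    spans = [span.strip() for span in ANGLE.findall(utterance)]
    notes = []
    for line in comment_lines:
        match = ANGLE.search(line)
        if line.lstrip().startswith("@") and match:
            notes.append(match.group(1).strip())
    return list(zip(spans, notes))


# Speaker A's utterance and its comment lines, copied from the slide.
pairs = pair_comments(
    "kunetha imvula sinemithwalo engaka < yebhegi >< yho yho yho >nako sisa",
    ["@ < loan English: bag >", "@ < gesture: hand wipes >"],
)
print(pairs)
```

Because `zip` stops at the shorter list, a mismatch between spans and comment lines is silently truncated; a checking tool would probably want to compare the two lengths first.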

  19. Current status
      • 20 hours of Xhosa recordings and transcriptions
      • A preliminary coding scheme for morphology
      • Ongoing work on recording, transcription and manual coding of morphology

  20. Things to do
      • Make transcription standards with examples for each of the nine languages
      • Hand-tag some transcriptions for morphology to train an experimental tagger
      • A frequency dictionary and/or a thesaurus for Xhosa
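A frequency dictionary of the kind listed here starts from word counts over the transcribed utterances. The following Python sketch counts word forms after stripping the transcription markup seen in the earlier examples; the function name `word_frequencies`, the lowercasing, and the decision to keep letters inside `{}` are assumptions for illustration, not the project's method.

```python
import re
from collections import Counter


def word_frequencies(utterances):
    """Count word forms across utterances, ignoring transcription markup."""
    counts = Counter()
    for text in utterances:
        # Remove braces and lengthening colons without splitting the word.
        text = re.sub(r"[{}:]", "", text)
        # Everything that is not a letter (brackets, indices, slashes) separates words.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


# Utterance fragments copied from the transcription slides.
freq = word_frequencies([
    "ndiseza kutshata ndiseza kutshata",
    "[4 nantso ke sisi // e: e: ]4",
    "ndiyakhonza owu ndiyamthand{a}",
])
print(freq.most_common(3))
```

Lowercasing merges stressed forms written in capitals with their unstressed counterparts, which suits a frequency list but discards the stress information.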

  21. Last slide
      • Summary
      • Long-term plans
