The CareGiver corpus

Toomas Altosaar, L. ten Bosch, G. Aimetti, C. Koniaris, K. Demuynck, H. van den Heuvel


Presentation Transcript


  1. The CareGiver corpus
     Toomas Altosaar, L. ten Bosch, G. Aimetti, C. Koniaris, K. Demuynck, H. van den Heuvel

  2. Overview
     • Background of the ACORNS project
     • A speech corpus
     • Rationale
     • Design
     • A few details
     • Public availability

  3. Background of the ACORNS project
     • Acquisition of COmmunication and RecogNition Skills
     • FP6 FET Project, 2006-2009
     • www.acorns-project.org
     • Aim: to investigate language acquisition by young infants
     • By simulating this learning process by designing and testing a computational model
     • Focus on word discovery
     • Improve ASR
     • To that end, a speech corpus was created

  4. The ACORNS corpus - rationale
     • ACORNS model takes part in a caregiver-learner interaction loop
     • Corpus is required for testing various computational approaches for language learning
     • Utterances in corpus ‘simulate’ the caregiver
     • Corpus keeps the balance in complexity between
         • Real-life recordings of caretaker utterances in real-life noisy child-caretaker interactions (CHILDES)
         • Lab-fabricated speech-like stimuli (NEWPORT)

  5. ACORNS-corpus - design (1)
     • Four languages (FIN, SWE, UK, NL)
     • In total 10 speakers for FIN, UK, NL
     • 4 speakers for SWE
     • Speech from primary and secondary caregivers
     • Speakers read aloud sentences
         • Simple grammatical structure
         • Limited number of keywords
     • Two speaking styles
         • Infant directed style (IDS) - adult directed style (ADS)

  6. Design (2)
     • Utterances across languages are highly comparable with respect to utterance length, syntactic structure, choice of keywords
     • Allows a cross-linguistic comparison of computational approaches of word discovery
     • Keyword selection was inspired by information about communicative development inventories (CDI)
     • E.g. the MacArthur Bates CDI http://www.sci.sdsu.edu/cdi/

  7. Examples of Y1-utterances (UK)
     • Where is Miriam now ?
     • Do you see the shoe ?
     • Show me the book !
     • That is the bottle
     • The telephone is here
     • Look, Daddy
     • Here is the diaper
     • That is a telephone
     • Show me a shoe

  8. Examples of Y2-utterances (UK)
     • I see a green turtle
     • Can you hear the red square and the airplane?
     • 50 keywords
     • Up to 4 keywords per sentence
     • Semantically free
     • But inconsistencies were avoided:
         • * Look at the big small car, * red green ball
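
The Y2 constraints on this slide (at most 4 keywords per sentence, no contradictory modifier pairs such as "big small" or "red green") can be illustrated with a small checker. This is only a sketch under assumptions: the keyword subset, the attribute classes, and the rule "two adjacent words from the same attribute class are inconsistent" are hypothetical and are not taken from the corpus specification.

```python
import re

MAX_KEYWORDS = 4  # from the slide: up to 4 keywords per sentence

# Hypothetical subset of the 50 keywords, for illustration only.
KEYWORDS = {"turtle", "square", "airplane", "car", "ball",
            "green", "red", "big", "small"}

# Hypothetical attribute classes used to detect contradictory modifiers.
ATTRIBUTE_CLASS = {"green": "colour", "red": "colour",
                   "big": "size", "small": "size"}


def tokenize(sentence):
    """Lower-case the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def is_valid(sentence):
    """Return True if the sentence satisfies both assumed constraints."""
    words = tokenize(sentence)
    # Constraint 1: keyword budget per sentence.
    if sum(w in KEYWORDS for w in words) > MAX_KEYWORDS:
        return False
    # Constraint 2: no two adjacent modifiers from the same class.
    for first, second in zip(words, words[1:]):
        cls_first = ATTRIBUTE_CLASS.get(first)
        if cls_first is not None and cls_first == ATTRIBUTE_CLASS.get(second):
            return False
    return True


for s in ["I see a green turtle", "Look at the big small car"]:
    print(s, "->", is_valid(s))
```

The adjacency rule is deliberately narrow: a sentence such as "a red ball and a green car" still passes, because the two colour words modify different nouns.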

  9. Number of utterances

  10. Format
     • Each utterance is available as single wav file
         • 44.1 kHz, mono
     • … and is accompanied by an xml file, with
         • Speaker information (gender)
         • Speech style (IDS, ADS)
         • Orthographic annotation (checked)
         • Keyword(s)
         • Duration
         • And for FIN some more information about syntax (see paper)
     • Total 12 GB
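
As an illustration of how the per-utterance metadata listed on this slide might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The slide does not give the XML schema, so every element and attribute name below (`utterance`, `speaker`, `gender`, `style`, `orthography`, `keyword`, `duration`) and the sample values are assumptions made up for the example; the real files distributed via ELRA may look different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical metadata file; the real CareGiver XML layout may differ.
SAMPLE_XML = """<utterance>
  <speaker gender="female"/>
  <style>IDS</style>
  <orthography>Do you see the shoe ?</orthography>
  <keywords><keyword>shoe</keyword></keywords>
  <duration>1.42</duration>
</utterance>"""


@dataclass
class UtteranceMeta:
    gender: str          # speaker information
    style: str           # "IDS" or "ADS"
    text: str            # orthographic annotation
    keywords: List[str]  # keyword(s) in the utterance
    duration_s: float    # duration, assumed here to be in seconds


def parse_utterance(xml_text: str) -> UtteranceMeta:
    """Parse one metadata document in the assumed layout."""
    root = ET.fromstring(xml_text)
    return UtteranceMeta(
        gender=root.find("speaker").get("gender"),
        style=root.findtext("style").strip(),
        text=root.findtext("orthography").strip(),
        keywords=[k.text.strip() for k in root.iter("keyword")],
        duration_s=float(root.findtext("duration")),
    )


meta = parse_utterance(SAMPLE_XML)
print(meta)
```

Collecting the fields into one record per utterance makes it straightforward to filter by speaking style or keyword before loading the matching wav file.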

  11. Research purposes
     • Simulation of word detection/word spotting
     • Acquisition of word-like units
     • Acquisition of (simple) syntax
     • Across morphologically + syntactically different European languages

  12. Public availability
     • Corpus made available via ELRA
     • Interested parties must contact ELRA

  13. Conclusion
     • Corpus available with cross-language compatible utterances
     • Speech based
     • IDS & ADS modes
     • Utterances have lexical and syntactic structure inspired by infant-directed speech
     • Primary & secondary caregivers
     • Ideal for testing models of language acquisition and word detection
     • Made available through ELRA
     • More information at www.acorns-project.org
     • Also software available – see website
