The TBALL Project Data Collection: Making a Young Children's Speech Corpus

Project Overview: Technology Based Assessment of Language and Literacy The TBALL Project Data Collection: Making a Young Children's Speech Corpus Abe Kazemzadeh*, Hong You+, Markus Iseli+, Barbara Jones+, Xiaodong Cui+, Margaret Heritage+, Patti Price^, Elaine Anderson*, Shrikanth Narayanan*, and Abeer Alwan+ * University of Southern California, + University of California Los Angeles, and ^ PPRICE Speech and Language Techology Results and Observations Data Collection Motivation • Establish a corpus for studying child and non-native speech. • Build speech applications for under-represented populations. • Analyze pronunciation variation. • Test bed for our target child-computer interface. • Test hardware, animations, timing, vocabulary, etc. • Measure children's ability with respect to grade level and other factors Project Goals • Automation of literacy assessment measures using speech and language technology. • Development of standards and methods for reliable, objective assessment. • One-on-one interaction with child, which leaves teachers with more time for teaching. • Focused on fair assessment robust to dialect variation including nonnative speakers. • Support for teacher feedback and database records. Native Language Distribution of Recorded Subjects Transcriptions Wizard of Oz Interface • Enhanced ARPABET symbols to represent phenomena peculiar to non-native and children's speech: • Dental stops • Unaspirated voiceless stops • Negative VOT (prevoiced) stops • Lispy /s/ • Glottalized /t/ • Long frication of /f/ • Trill • Syllabic Consonants • With a convention to represent vowel space with respect to English vowels • Non-native vowels defined by the two nearest English vowels, with the highest vowel first (e.g., /iyih/). • 82% phone-level transcriber agreement. • Transcribers started with 100% overlap (each file transcribed twice), 25% after agreement was established. • Sentences are transcribed by word-level alignments, with phone-level detail if there was pronunciation variation. “techie” “tester” Language Background Effects • Difficulty associating words with pictures. • Sometimes reading sentences was performed better than individual words by children who could read in Spanish but not English. • Sounding out words with Spanish letter-to-sound rules. Corpus Statistics • 256 Children recorded. • ~ 30,000 utterances. • 40 hours of speech. • 13 GB of speech data sampled at 44.1 kHz. child Age/Grade Effects • Position in school year is important (children learn a lot between the beginning and end of the school year). • Younger children are more timid. • Less social and reading experience. • Less exposure to computers. Reading Tactics • Sounding out words generally helped children. • Mispronunciations when a subword portion is confused with another word (e.g., once, using). • Confusion with the different sounds an orthographic symbol may have (e.g., “now” pronounced as “no”). Our recording setup was similar to our target application: • Two visible operators: • The “techie” controls the presentation of stimuli and monitors recording quality. • The “tester” instructs and guides the child. • The target application would include just the child and computer. • 20 min. per child maximum. • A secure database stores child demographic info and speech recordings: • Age, grade, English level, • native language, language used at home, language used with friends, • parents' native language, parents' birthplaces. • Accommodations for children. • ~15 children recorded per day (~1.9 hours of recorded speech per day). • Stimuli: pictures, colors, alphabet, numbers, words, and sentences. Acknowledgements Higher Level Phenomena • Using “a, an, some” in picture naming. • Perhaps due to grammatical differences in English and Spanish. • Verb tense changes when reading sentences. • Formation of contractions from long forms (but not vice-versa). • Reanalysis of sentence after the child realizes a mistake he/she has made. Pronunciation Variation • Read speech is slower. • Long breaks in fricatives followed by stops (e.g., s-tart). • Long liquids, nasals, and fricatives. • Syllables spoken slowly (e.g., a-long). • Final consonants may be delayed or dropped (e.g., par-t or par- ). • Difficulty with “am” and “an” in isolation. • At times, children speak in an exaggerated voice. This project is supported in part by the NSF. In addition, this work would not be possible with out the hard work of transcribers Daylen Riggs and Nathan Go; the patience and bilingualism of Kimberly Reynolds and Blanca Martinez; the careful recordings of Erdem Unal, Vivek Rangarajan, Shiva Sundaram, Yirong Yang, Jinjin Ye, and Yijian Bai; and the planning of Larry Casey and Christy Boscardin.

The TBALL Project Data Collection: Making a Young Children's Speech Corpus

The TBALL Project Data Collection: Making a Young Children's Speech Corpus

Presentation Transcript

Training Objectives

BRONCHIOLITIS

corpus linguistics: a general introduction

The Development of Language

A toolkit for speech recognition research

CLASSROOM INSTRUCTION for CHILDREN WITH LANDAU-KLEFFNER SYNDROME

Corpus Linguistics: Introduction

Special Education Data Collection in 2010-11 School Year

Data and Data Collection

Data Demand & Use (DDU)

Occupational and Speech Therapy: Treating children with ASD

Why Inner Speech?

Special Education Data Collection

Using ACT with Children, Adolescents, and Parents

Emanuela Cresti LABLITA- University of Florence Lisbon, May 5 2003

Statistical Methods for Corpus Analysis

California “In The Know” Inpatient Data Collection, Reporting, and Validation

GOT DATA? Step-by-Step Guide to Making Data Work for You

Identifying Young Children with Language Impairments: Measurement Issues Mabel L. Rice

Data-Driven Decision Making

Objectives

START Orientation Young Children with Autism Spectrum Disorders

The TBALL Project Data Collection: Making a Young Children's Speech Corpus