Automatic Assessment of Spoken Modern Standard Arabic

Jian Cheng, Jared Bernstein, Ulrike Pado, Masa Suzuki
Pearson Knowledge Technologies, Palo Alto, California
NAACL, Boulder, Colorado, 5 June 2009

Outline
• Pearson Knowledge Technologies
• How Versant tests operate
• Versant Arabic Test (development)
• Validation evidence
• Predictive accuracy

Pearson Knowledge Tech. (PKT)
• KAT + Ordinate are now PKT
• KAT ≈ {LSA, Essay Scoring, Write-to-Learn, PTE, etc.}
• Ordinate ≈ {Versant, ORF for NCES, VersaReader, PTE, etc.}
• PKT is part of Pearson
• Pearson ≈ {FT, Economist, Penguin, Longman, PsychCorp, … etc.}
• PearsonKT is in Boulder, Colorado and Palo Alto, California

[Figure: system diagram. Labels: Test delivery (Anywhere); Delivery Interface; Communication Network; speech; report; Scoring system (California); Database of tests, prompts, responses; languages: English, Arabic, Dutch, Spanish.]

How Versant tests operate
[Figure: the spoken prompt “The train’s been delayed by one hour” passes through Test Delivery, Server, Versant Database, and Scoring.]

Versant Arabic Test
• DLI purpose
  • ~1000 students at DLI need predictive speaking tests
• Requirements
  • Accurate test of Arabic listening & speaking
  • Convenient to use at DLI and worldwide (ILR is costly)
  • Suitable for repeated formative testing
  • High peak capacity for mass screening

Construct Comparison
• OPI construct: oral proficiency, as manifest in an Oral Proficiency Interview, is compatible with communicative competence as reflected in the functional level and/or complexity of content accurately produced.
• Versant construct: facility in spoken language, that is, the ability to understand spoken language and speak appropriately in response at a conversational pace on everyday topics.

Versant Arabic Test: Test Structure
• Part A: Reading
• Part B: Repeat 1
• Part C: Short Answers
• Part D: Sentence Builds
• Part E: Repeat 2
• Part F: Passage Retelling

[Figure: scoring diagram. Item types: Read, Repeat Sentence 1, SAQ, Sent Build, Repeat Sentence 2, Passage. Subscores: Fluency, Sentence Mastery, Vocabulary, Pronunciation. Weights shown: 20%, 30%, 30%, 20%. Scoring labels: Human Scoring, Versant Scoring.]
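The four percentages on this slide suggest that the overall score is a weighted combination of the four subscores. The sketch below shows that kind of weighted sum in Python. The pairing of weights to subscores is an assumption, because the extracted slide text does not preserve which percentage belongs to which subscore. The names `WEIGHTS` and `overall_score` and the example values are also hypothetical.

```python
# Minimal sketch of a weighted composite score.
# ASSUMPTION: this weight-to-subscore pairing is illustrative only; the slide
# lists 20%, 30%, 30%, 20% but the pairing is not recoverable from the text.
WEIGHTS = {
    "sentence_mastery": 0.30,
    "vocabulary": 0.20,
    "fluency": 0.30,
    "pronunciation": 0.20,
}


def overall_score(subscores):
    """Return the weighted sum of the four subscores."""
    missing = set(WEIGHTS) - set(subscores)
    if missing:
        raise ValueError(f"missing subscores: {sorted(missing)}")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Made-up subscores for illustration, not data from the test.
example = {
    "sentence_mastery": 60.0,
    "vocabulary": 50.0,
    "fluency": 70.0,
    "pronunciation": 40.0,
}
print(overall_score(example))
```

Keeping the weights in one mapping makes it easy to check that they sum to 1 and to swap in the actual pairing once it is known.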

How Versants are developed (1)
[Figure: development flow diagram for the Versant Arabic Test. Labels: Test Spec; Native Test Developers; Item Text; Recorded Items; Arabic Natives; Arabic Learners; Native Scribes (transcripts); Native Judges (scale scores); Scale Estimates; Ordinate System; Versant Scores; Concurrent ILR Interviews; ILR Scores; Validation against Internal and External Criteria.]

Arabic Challenges: Voweling
• kutubu al-waladi: the books of the boy
• kataba al-waladu: wrote the boy (subj.)
• No disambiguating short vowels written
• Vowels carry phonetic information
• Vowels carry grammar information

Complex Morphology
• li + ziyaarat + naa: for + visit + of us, "for our visit"
• Complicates lexicon lookup, frequency estimates…
• “Short” Arabic items are harder than English items with the same number of words

Development & Run-time Processes
[Figure: compilation of expectation and run-time flow]

Training data sources
[Table or figure: Prompt Voices and Training Samples]

Validation Criteria
• Reliability: scores are consistent
• Validity:
  • Native and non-native speakers should be clearly distinct
  • MSA and dialect speakers should be distinct (since we’re testing MSA)
  • Machine scores should predict human scores

Reliability

Native ~ Non-Native Scores

Natives by Countries

Educated ~ Uneducated Speakers
[Figure: Cumulative Density plotted against Arabic Overall Score]

Machine – Human Comparison

How Versants Compare to OPIs
[Figure: ILR OPI Score (logits) plotted against Versant Arabic Overall Score; N = 118, r = 0.87]

Spanish & English: Versant ~ Human
[Figure: two panels. Spanish: ILR OPI Score (logits) plotted against Versant Spanish Score; N = 37, r = 0.92. English: N = 151, r = 0.86.]
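The r values on these comparison slides are correlation coefficients between machine scores and human ratings. As a minimal sketch, assuming r denotes the Pearson product-moment correlation (the slides do not name the statistic), it could be computed as follows. The function name `pearson_r` and the six score pairs are made up for illustration and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical machine scores and human ratings in logits (illustrative only).
machine = [32.0, 41.0, 47.0, 55.0, 63.0, 70.0]
human_logits = [-2.1, -1.0, -0.4, 0.3, 1.2, 1.9]
r = pearson_r(machine, human_logits)
print(f"r = {r:.2f}")
```

A value near 1 means the machine scores rank and space the candidates much as the human raters do, which is the sense in which the slides report r = 0.87 for Arabic.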

Summary
• Versant Arabic Test (VAT) is in operation
• Based on a large and wide body of transcribed spoken material
• VAT is available on demand
• Returns consistent, accurate scores that reflect real-time skills with MSA
• VAT can triage or screen for OPI tests

النهاية (The End)
Thanks to Waheed Samy, Naima Bousofara Omar, Eli Andrews, Mohamed Al-Saffar, Nazir Kikhia, Rula Kikhia, and Linda Istanbulli for item development and data collection/transcription in Arabic, and to Andy Freeman for providing diacritic markings.
