Module U1: Speech in the Interface 4: User-centered Design and Evaluation

Jacques Terken SAI User-System Interaction U1, Speech in the Interface: 4. Evaluation

Contents • Methodological issues: design • Evaluation methodology

The design process • Requirements • Specifications of prototype • Evaluation 1: Wizard-of-Oz experiments “bionic wizards” • Redesign and implementation: V1 • Evaluation 2: Objective and subjective measurements (laboratory tests) • Redesign and implementation: V2 • Evaluation 3: Lab tests, field tests

Requirements • Source of requirements: • you yourself • potential end users • customer • manufacturer • Checklist: • consistency • feasibility (with respect to performance and price)

Interface design • success of design depends on consideration of • task demands • knowledge, needs and expectations of user population • capabilities of technology

Task demands • exploit structure in task to make interaction more transparent • E.g. form-filling metaphor
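
The form-filling metaphor can be sketched as follows. This is a minimal illustration only: the travel-booking slots, their prompts and the function name `next_prompt` are hypothetical and not taken from the slides. The task structure (a fixed set of slots) determines what the system asks next, which is what makes the interaction predictable for the user.

```python
from typing import Optional

# Hypothetical slots for a travel-booking task; the slot names and prompts
# are illustrative assumptions, not part of the original material.
PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where do you want to go?",
    "date": "On what date do you want to travel?",
}


def next_prompt(form: dict) -> Optional[str]:
    """Return the prompt for the first unfilled slot, or None if the form is complete."""
    for slot, prompt in PROMPTS.items():
        if form.get(slot) is None:
            return prompt
    return None


# The user has only given the origin so far, so the system asks for the destination.
form = {"origin": "Boston", "destination": None, "date": None}
print(next_prompt(form))
```

Because the system always asks for the first open slot, the user can anticipate the remaining questions from the task itself.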

User expectations • Users may bring advance knowledge of the domain • Users may have overly high expectations of the system's communicative capabilities, especially if the quality of the output speech is high; this leads to user utterances that the system cannot handle • Instruction is of limited value • An interactive tutorial is more useful (Kamm et al., ICSLP 98) • It can also include training on how to speak to the system • Edutainment approach (Weevers, 2004)

Capabilities of technology • Awareness of ASR and NLP limitations • Necessary modelling of domain knowledge through an ontology • Understanding of needs with respect to cooperative communication: rationality; inferencing • Understanding of needs with respect to conversational dynamics, including mechanisms for graceful recovery from errors

Specifications: check UI design principles • Shneiderman (1986): • continuous representation of objects and actions of interest (transparency) • rapid, incremental, reversible operations with immediately visible impact • physical actions or labelled button presses, not complex syntax ~NL

Application to speech interfaces • Kamm & Walker (1997) • Continuous representation: • may be impossible or undesirable as such in speech interfaces • open question – pause – options (zooming) • subset of vocabulary with consistent meaning throughout (“help me out”, “cancel”) • Immediate impact: Agent: “Anny here, what can I do for you?” User: “Call Lyn Walker.” Agent: “Calling Lyn Walker.”

Incrementality: User: “I want to go from Boston to San Francisco.” Agent: “San Francisco has two airports: …..” • Reversibility • “cancel” • NB Discussion topic • Shneiderman heuristic 7: locus of control vs. mixed control dialogue

Contents • Methodological issues: design • Evaluation methodology

Aim of evaluation • Diagnostic test/formative evaluation: • To inform the design team • To ensure that the system meets the expectations and requirements of end users • To improve the design where possible • Benchmarking/summative evaluation: • To inform the manufacturer about the quality of the system relative to that of competitors or previous releases

Benchmarking • Requires an accepted, standardised test • No accepted solution for benchmarking of complete spoken dialogue systems • Stand-alone tests of separate components, both for diagnostic and benchmarking purposes (glass box approach)

Glass box / black box • Black box: system evaluation (e.g. “how will it perform in an application”) • Glass box: performance of individual modules (both for benchmarking and diagnostic purposes) • with perfect input from previous modules • or with real input (always imperfect!) • evaluation methods: statistical, performance-based (objective/subjective)

problem of componentiality: • relation between performance of individual components and performance of whole system
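
A small sketch of why this relation is hard to pin down. It assumes, purely for illustration, that the modules form a serial pipeline and fail independently, so that end-to-end success is the product of the per-module accuracies. The accuracy figures are made up, and real systems typically violate the independence assumption (for example, dialogue-level repair can compensate for recognition errors), which is exactly the problem.

```python
from math import prod


def pipeline_success(component_accuracies):
    """Naive end-to-end success rate for a serial pipeline.

    Assumes independent failures, which is a simplification: it ignores
    both error recovery and correlated errors between modules.
    """
    return prod(component_accuracies)


# Made-up glass-box scores for three hypothetical modules
# (recognition, understanding, dialogue management).
scores = [0.9, 0.9, 0.9]
print(round(pipeline_success(scores), 3))
```

Under this assumption, three modules that each look acceptable in isolation already give a noticeably lower system-level figure, so glass-box scores alone do not predict black-box performance.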

Anchoring: choosing the right contrast condition • In the absence of validated standards: need for a reference condition to evaluate the performance of the test system(s) • Speech output: natural speech is often used as reference • this will lead to compression effects for experimental systems when evaluation is conducted by means of rating scales • anchoring preferably in the context of objective evaluation and with preference judgements

Evaluation tools/frameworks • Hone and Graham: SASSI, a questionnaire tuned towards evaluation of speech interfaces • Walker et al.: PARADISE, establishing connections between objective and subjective measures • Extension of PARADISE to multimodal interfaces: PROMISE

SASSI • Subjective Assessment of Speech System Interfaces • http://people.brunel.ac.uk/~csstksh/sassi.html and pdf • Likert-type questions • Factors: • Response accuracy • Likeability • Cognitive demand • Annoyance • Habitability (match between mental model and actual system) • Speed

Examples of questions (n = 36) • The system is accurate • The system is unreliable • The interaction with the system is unpredictable • The system is pleasant • The system is friendly • I was able to recover easily from errors • I enjoyed using the system • It is clear how to speak to the system • The interaction with the system is frustrating • The system is too inflexible • I sometimes wondered if I was using the right word • I always knew what to say to the system • It is easy to lose track of where you are in an interaction with the system • The interaction with the system is fast • The system responds too slowly
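
Because the questionnaire mixes positively and negatively worded statements, responses are usually reverse-scored before being averaged per factor. The sketch below shows one way this could be done. The 7-point scale, the choice of which statements count as negative, and the grouping of the two example statements under one factor are assumptions made for illustration, not the published SASSI scoring key.

```python
# Agreement scale from 1 (strongly disagree) to SCALE_MAX (strongly agree).
# The scale width is an assumption for this sketch.
SCALE_MAX = 7

# Statements quoted from the slide; treating them as negatively worded
# (and therefore reverse-scored) is an illustrative assumption.
NEGATIVE_ITEMS = {
    "The system is unreliable",
    "The system responds too slowly",
}


def item_score(statement: str, response: int) -> int:
    """Reverse negatively worded items so that a higher score is always better."""
    if not 1 <= response <= SCALE_MAX:
        raise ValueError("response outside the scale")
    if statement in NEGATIVE_ITEMS:
        return SCALE_MAX + 1 - response
    return response


def factor_score(responses: dict) -> float:
    """Mean of the (possibly reversed) item scores belonging to one factor."""
    scores = [item_score(s, r) for s, r in responses.items()]
    return sum(scores) / len(scores)


# Hypothetical responses from one participant for two accuracy-related items.
print(factor_score({"The system is accurate": 6, "The system is unreliable": 2}))
```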

PARADISE • User satisfaction (subjective) is related to task success and costs (objective measures) (pdf) • Users perform scenario-based tasks • Measure task success for scenarios, correcting for chance on the basis of attribute value matrices denoting the number of possible options (measure: kappa; κ = 1 if all scenarios were successfully completed) • Obtain objective measures of costs: • Efficiency measures (number of utterances, dialogue time, …) • Qualitative measures (repair ratio, inappropriate utterance ratio, …) • Normalize task success and cost measures across subjects by taking the z-scores
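
The two computations named above can be sketched as follows. Kappa is computed from a confusion matrix over attribute values as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of agreement on the diagonal and P(E), the chance agreement, is estimated here from the column totals. The function names, the choice of the population standard deviation for the z-scores and the example numbers are assumptions of this sketch.

```python
from statistics import mean, pstdev


def kappa(confusion):
    """Chance-corrected task success from a square confusion matrix.

    Rows and columns both range over attribute values; cell [i][j] counts
    how often one value was paired with the other across the dialogues.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_agree = sum(confusion[i][i] for i in range(n)) / total
    column_totals = [sum(row[j] for row in confusion) for j in range(n)]
    p_chance = sum((t / total) ** 2 for t in column_totals)
    return (p_agree - p_chance) / (1 - p_chance)


def z_scores(values):
    """Normalize a measure across subjects (undefined if all values are equal)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Made-up data: all attribute values conveyed correctly, so kappa equals 1.
print(kappa([[10, 0], [0, 10]]))
# Made-up dialogue durations in seconds for three subjects.
print(z_scores([120, 150, 180]))
```

Normalizing with z-scores puts task success and the various cost measures on a common scale, so that they can be combined in a single function.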

• Measure user satisfaction (Mean Opinion Scores across one or more scales) • Estimate the performance function: $\text{performance} = \alpha \cdot z_{\kappa} - \sum_i w_i \cdot z_{\text{cost}_i}$ • Compute the values of $\alpha$ and $w_i$ by multiple linear regression • $w_i$ indicates the relative weight of the individual cost component $\text{cost}_i$ • The $w_i$ give information about what the primary cost factors are, i.e. which factors have the most influence on (the lack of) usability of the system
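
The regression step can be sketched in plain Python as an ordinary least-squares fit of user satisfaction on the normalized task success and cost measures. The weight on task success is then the coefficient of the kappa z-score, and each cost weight is the negated coefficient of the corresponding cost z-score. The helper names, the inclusion of an intercept and the data are assumptions of this sketch. The data are synthetic and constructed so that the fit recovers known weights.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_performance(z_kappa, z_costs, satisfaction):
    """Least-squares fit; returns (intercept, alpha, cost_weights).

    z_costs holds one list per dialogue, with one z-score per cost measure.
    """
    rows = [[1.0, k] + list(c) for k, c in zip(z_kappa, z_costs)]
    p = len(rows[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, satisfaction)) for i in range(p)]
    coef = solve(xtx, xty)
    # Costs enter the performance function with a minus sign.
    return coef[0], coef[1], [-b for b in coef[2:]]


# Synthetic data: satisfaction generated from known weights (0.40 and 0.78).
z_kappa = [1.0, -1.0, 0.5, -0.5, 0.0]
z_costs = [[-0.5], [1.0], [0.0], [0.5], [-1.0]]
satisfaction = [3.0 + 0.40 * k - 0.78 * c[0] for k, c in zip(z_kappa, z_costs)]

intercept, alpha, weights = fit_performance(z_kappa, z_costs, satisfaction)
print(round(alpha, 2), [round(w, 2) for w in weights])
```

With real data the fit will not be exact, and the size of each weight relative to the others is what identifies the dominant cost factors.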

Case study: $\text{performance} = 0.40 \cdot z_{\kappa} - 0.78 \cdot z_{\text{cost}_2}$, where $\text{cost}_2$ is the number of repetitions • Once the weights have been established and validated, user satisfaction can be predicted from objective data • The typical finding is that user satisfaction as measured by the questionnaire is primarily determined by the quality of the speech recognition (which is not very informative) • Concerns: • “Conservative” scoring on semantic scales • Not all cost functions may be linear
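
As a small worked example, the case-study function can be applied to new objective data. The sketch assumes that both inputs are z-scores, as in the normalization step of PARADISE; the function name and the input values are hypothetical.

```python
def predicted_performance(z_kappa: float, z_repetitions: float) -> float:
    """Case-study performance function with the weights quoted on the slide.

    Both arguments are assumed to be z-scores computed across subjects.
    """
    return 0.40 * z_kappa - 0.78 * z_repetitions


# A dialogue one standard deviation above average in task success,
# but also one standard deviation above average in repetitions.
print(round(predicted_performance(1.0, 1.0), 2))
```

In this example the repetition weight is roughly twice the task-success weight, so an above-average number of repetitions outweighs an equally above-average task success.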

PROMISE • Evaluation of multimodal interfaces • References: pdf1 and pdf2 • The basic idea is the same as for PARADISE, but there are differences in the way task success is calculated and the correlations are computed

Where to evaluate: Laboratory tests • Use of scenarios gives some degree of experimental control • Objective and subjective measurements aimed at identifying problem sources and testing potential solutions • Interviews • BUT: Scenarios implicitly specify the domain • AND: subjects may be co-operative or overly non-co-operative (exploring the limits of the system)

Where: Field tests • Advantage: gives information about performance of system with actual end users with self-defined, real goals in realistic situations • Mainly diagnostic (how does the system perform in realistic conditions) • BUT: no information about reasons for particular actions in the dialogue

Additional considerations • Evaluation also in terms of suitability of the system given the technological and cost constraints imposed by the application: • CPU consumption, real-time performance • bandwidth, memory consumption • cost

Project • Wizard-of-Oz • usual assumption is that subjects are made to believe that they are interacting with a real system • most suited when system to be developed is very complex, or when performance of individual modules strongly affects overall performance • Full vs bionic wizard

WOZ: General set-up • [Diagram labels: subject, scenarios, user interface, data collection (logging), wizard interface, simulation tools, assistant wizard]
