Modeling the Cost of Misunderstandings in the CMU Communicator System
Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

0. Abstract
We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between the error types and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.

1. Motivation. Problem Formulation.
• Intro
• In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task, and trained multiple classifiers for this purpose:
  • Training corpus: 131 dialogs, 4550 utterances
  • 12 features from the recognition, parsing, and dialog levels
  • 7 classifiers: Decision Tree, ANN, Bayesian Network, AdaBoost, Naïve Bayes, SVM, Logistic Regression
• Results (mean classification error rates in 10-fold cross-validation): most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes); the logistic regression model obtained much better performance on a soft metric. A sketch of this classification setup is given below.
• Question: Is classification error rate the right way to evaluate performance?
• CER as a measure of performance implicitly assumes that the costs of false positives and false negatives are the same. Intuitively, this assumption does not hold in most dialog systems:
  • On a false positive (FP), the system incorporates and will act on invalid information;
  • On a false negative (FN), the system rejects a valid user utterance.
• So, optimally, we want to build an error function that takes these costs into account, and optimize for that.
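As an illustration only (this is not the authors' code), the sketch below casts utterance-level confidence annotation as binary classification and measures the mean classification error rate of a logistic regression model in 10-fold cross-validation. The feature matrix and labels are random placeholders standing in for the 12 real recognition-, parsing-, and dialog-level features.

```python
# Minimal sketch: a logistic regression confidence annotator evaluated
# by mean classification error rate in 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(4550, 12))    # placeholder for the 12 real features
y = rng.integers(0, 2, size=4550)  # placeholder labels: 1 = understood OK

clf = LogisticRegression(max_iter=1000)
# Mean classification error rate over 10 folds, the protocol above.
cer = 1.0 - cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
print(f"10-fold CV classification error rate: {cer:.3f}")
```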
• Problem Formulation
  • Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors
  • Use these costs to pick an optimal point on the classifier's operating characteristic

2. Cost Models: The Approach
• To model the impact of FPs and FNs on system performance, we:
  • Identify a suitable dialog performance metric (P) that we want to optimize for
  • Build a statistical regression model over whole sessions, using P as the response variable and the counts of FPs and FNs as predictors:
    P = f(FP, FN)
    P = k + Cost_FP · FP + Cost_FN · FN (linear regression)
• Performance metrics:
  • User satisfaction (5-point scale): subjective, hard to obtain
  • Completion (binary): too coarse
  • Concept transmission efficiency:
    • CTC = correctly transferred concepts / turn
    • ITC = incorrectly transferred concepts / turn
    • REC = relevantly expressed concepts / turn
• The Dataset
  • 134 dialogs, collected using mostly 4 different scenarios: 2561 utterances
  • User satisfaction scores obtained for only 35 dialogs
  • Corpus manually labeled at the concept level, with 4 labels: OK / RBAD / PBAD / OOD
  • Aggregate utterance labels generated from the concept labels
  • Confidence annotator decisions available in the logs
  • We could therefore compute the counts of FPs and FNs, as well as CTC and ITC, for each session
• An Example
  • User: I want to fly from Pittsburgh to Boston
  • Decoder: I want to fly from Pittsburgh to Austin
  • Parse: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]
  • Only 2 relevantly expressed concepts
  • If accepted: CTC = 1, ITC = 1, REC = 2
  • If rejected: CTC = 0, ITC = 0, REC = 2
A sketch of fitting this kind of session-level cost model is given below.
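Purely as an illustration, the following sketch fits the linear cost model P = k + Cost_FP · FP + Cost_FN · FN over whole sessions. The per-session counts and efficiency values are invented placeholders, not the Communicator corpus; since errors depress efficiency, the fitted slopes come out negative, and their negated values play the role of the per-error costs.

```python
# Minimal sketch: least-squares fit of a session-level cost model.
import numpy as np

# Hypothetical per-session data: columns are (FP, FN) counts,
# P is that session's concept transfer efficiency.
counts = np.array([[3, 1], [0, 4], [2, 2], [1, 0], [5, 3]], dtype=float)
P = np.array([0.55, 0.60, 0.58, 0.90, 0.35])

# Design matrix with an intercept column for k.
A = np.column_stack([np.ones(len(P)), counts])
k, slope_fp, slope_fn = np.linalg.lstsq(A, P, rcond=None)[0]

# Negated slopes = estimated per-error costs, in units of P per error.
print(f"k={k:.2f}, Cost_FP={-slope_fp:.2f}, Cost_FN={-slope_fn:.2f}")
```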

3. Cost Models: The Results
• Cost Models Targeting Efficiency
  • Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit of these models (indicated by R²) was measured both on the training data and in 10-fold cross-validation.
  • Model 1: CTC = k + a·FP + b·FN + c·TN
  • Model 2: CTC − ITC = k + a·REC + b·FP + c·FN + d·TN
    • The ITC term was added so that we also minimize the number of incorrectly transferred concepts
    • REC captures a prior on the verbosity of the user
    • Both changes further improve performance
  • Model 3: CTC − ITC = k + a·REC + b·FPC + c·FPNC + d·FN + e·TN
    • The FP term was split in two, since there are two different types of false positives in the system, which intuitively should have very different costs:
      • FPC = false positives containing relevant concepts
      • FPNC = false positives containing no relevant concepts
    • The resulting coefficients for model 3, together with their 95% confidence intervals, were estimated from the data; they yield the cost function used in Section 4 below.
• Other Models
  • Targeting completion (binary): a logistic regression model; the estimated model does not indicate a good fit
  • Targeting user satisfaction (5-point scale): based on only 35 dialogs; R² = 0.61, similar to the literature (Walker et al.)
    • Explanation: subjectivity of the metric plus the limited dataset

4. Fine-Tuning the Annotator
• We want to find the optimal trade-off point on the operating characteristic of the classifier. Minimizing the classification error rate (FP + FN) implicitly assumes equal costs; instead, the problem translates to locating the point on the operating characteristic (by moving the classification threshold) that minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than minimizing the classification error rate.
• The cost, according to model 3, is:
  Cost = 0.48 · FPNC + 2.12 · FPC + 1.33 · FN + 0.56 · TN
• The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs to FNs that the system makes. (A sketch of this threshold sweep is given after the conclusions.)

5. Further Analysis
• Is CTC − ITC an adequate metric?
  • Mean = 0.71; standard deviation = 0.28
  • Mean for completed dialogs = 0.82; mean for uncompleted dialogs = 0.57
  • The difference is statistically significant at a very high level of confidence (p = 7.23 × 10⁻⁹)
• Can we reliably extrapolate the model to other areas of the ROC?
  • The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the other areas of the ROC.
• How about the impact of the baseline error rate?
  • Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at threshold 0 (i.e., no confidence annotator).
  • Explanation: incorrectly captured information is easy to overwrite in the CMU Communicator, and the baseline error rates are relatively low.

6. Conclusions
• We proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
• Models based on efficiency fit the data well, and the obtained costs confirm our intuitions.
• For the CMU Communicator, the models predict that the total cost stays roughly the same across a large range of the confidence annotator's operating characteristic.

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 2001.
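To make the fine-tuning step concrete, here is a minimal sketch (with randomly generated placeholder data, not the Communicator logs) of sweeping the acceptance threshold and scoring each setting with the model-3 cost function above:

```python
# Minimal sketch: choose the acceptance threshold that minimizes the
# model-3 cost rather than the classification error rate.
import numpy as np

rng = np.random.default_rng(1)
n = 2561
ok = rng.random(n) < 0.75                       # placeholder: utterance understood OK
conf = np.clip(ok * 0.3 + rng.random(n), 0, 1)  # placeholder confidence scores
has_concepts = rng.random(n) < 0.8              # placeholder: carries relevant concepts

def cost(th):
    accept = conf >= th
    fpc  = np.sum(accept & ~ok & has_concepts)   # accepted bad, with concepts
    fpnc = np.sum(accept & ~ok & ~has_concepts)  # accepted bad, no concepts
    fn   = np.sum(~accept & ok)                  # rejected good utterance
    tn   = np.sum(~accept & ~ok)                 # rejected bad (user must retry)
    # Coefficients taken from cost model 3 above.
    return 0.48 * fpnc + 2.12 * fpc + 1.33 * fn + 0.56 * tn

thresholds = np.linspace(0, 1, 101)
best = min(thresholds, key=cost)
print(f"optimal threshold: {best:.2f}, cost: {cost(best):.1f}")
```

With real confidence scores and labels, a near-flat cost curve over a wide range of thresholds would reproduce the trade-off behavior reported in the conclusions.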
