Modeling the Cost of Misunderstandings in the CMU Communicator System
Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

0. Abstract
We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between the error types and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.

1. Motivation. Problem Formulation.
• Intro
• In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task, and trained multiple classifiers for this purpose:
  • Training corpus: 131 dialogs, 4550 utterances
  • 12 features from the recognition, parsing, and dialog levels
  • 7 classifiers: Decision Tree, ANN, Bayesian Network, AdaBoost, Naïve Bayes, SVM, Logistic Regression
• Results (mean classification error rates in 10-fold cross-validation): most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes); the logistic regression model obtained much better performance on a soft metric. A sketch of this classification setup is given below.
• Question: Is classification error rate the right way to evaluate performance?
• CER as a measure of performance implicitly assumes that the costs of false positives and false negatives are the same. Intuitively, this assumption does not hold in most dialog systems:
  • On a false positive (FP), the system incorporates and will act on invalid information;
  • On a false negative (FN), the system rejects a valid user utterance.
• So, optimally, we want to build an error function that takes these costs into account, and optimize for that.
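As an illustration only (this is not the authors' code), the sketch below casts utterance-level confidence annotation as binary classification and measures the mean classification error rate of a logistic regression model in 10-fold cross-validation. The feature matrix and labels are random placeholders standing in for the 12 real recognition-, parsing-, and dialog-level features.

```python
# Minimal sketch: a logistic regression confidence annotator evaluated
# by mean classification error rate in 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(4550, 12))    # placeholder for the 12 real features
y = rng.integers(0, 2, size=4550)  # placeholder labels: 1 = understood OK

clf = LogisticRegression(max_iter=1000)
# Mean classification error rate over 10 folds, the protocol above.
cer = 1.0 - cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
print(f"10-fold CV classification error rate: {cer:.3f}")
```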
• Problem Formulation
  • Develop a cost model that allows us to quantitatively assess the costs of FP and FN errors
  • Use these costs to pick an optimal point on the classifier's operating characteristic

2. Cost Models: The Approach
• To model the impact of FPs and FNs on system performance, we:
  • Identify a suitable dialog performance metric (P) that we want to optimize for
  • Build a statistical regression model over whole sessions, using P as the response variable and the counts of FPs and FNs as predictors:
    P = f(FP, FN)
    P = k + Cost_FP · FP + Cost_FN · FN (linear regression)
• Performance metrics:
  • User satisfaction (5-point scale): subjective, hard to obtain
  • Completion (binary): too coarse
  • Concept transmission efficiency:
    • CTC = correctly transferred concepts / turn
    • ITC = incorrectly transferred concepts / turn
    • REC = relevantly expressed concepts / turn
• The Dataset
  • 134 dialogs, collected using mostly 4 different scenarios: 2561 utterances
  • User satisfaction scores obtained for only 35 dialogs
  • Corpus manually labeled at the concept level, with 4 labels: OK / RBAD / PBAD / OOD
  • Aggregate utterance labels generated from the concept labels
  • Confidence annotator decisions available in the logs
  • We could therefore compute the counts of FPs and FNs, as well as CTC and ITC, for each session
• An Example
  • User: I want to fly from Pittsburgh to Boston
  • Decoder: I want to fly from Pittsburgh to Austin
  • Parse: [I_want/OK] [Depart_Loc/OK] [Arrive_Loc/RBAD]
  • Only 2 relevantly expressed concepts
  • If accepted: CTC = 1, ITC = 1, REC = 2
  • If rejected: CTC = 0, ITC = 0, REC = 2
A sketch of fitting this kind of session-level cost model is given below.
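Purely as an illustration, the following sketch fits the linear cost model P = k + Cost_FP · FP + Cost_FN · FN over whole sessions. The per-session counts and efficiency values are invented placeholders, not the Communicator corpus; since errors depress efficiency, the fitted slopes come out negative, and their negated values play the role of the per-error costs.

```python
# Minimal sketch: least-squares fit of a session-level cost model.
import numpy as np

# Hypothetical per-session data: columns are (FP, FN) counts,
# P is that session's concept transfer efficiency.
counts = np.array([[3, 1], [0, 4], [2, 2], [1, 0], [5, 3]], dtype=float)
P = np.array([0.55, 0.60, 0.58, 0.90, 0.35])

# Design matrix with an intercept column for k.
A = np.column_stack([np.ones(len(P)), counts])
k, slope_fp, slope_fn = np.linalg.lstsq(A, P, rcond=None)[0]

# Negated slopes = estimated per-error costs, in units of P per error.
print(f"k={k:.2f}, Cost_FP={-slope_fp:.2f}, Cost_FN={-slope_fn:.2f}")
```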

3. Cost Models: The Results
• Cost Models Targeting Efficiency
  • Three successively refined cost models were developed, targeting efficiency as the response variable. The goodness of fit of these models (indicated by R²) was measured both on the training data and in 10-fold cross-validation.
  • Model 1: CTC = k + a·FP + b·FN + c·TN
  • Model 2: CTC − ITC = k + a·REC + b·FP + c·FN + d·TN
    • The ITC term was added so that we also minimize the number of incorrectly transferred concepts
    • REC captures a prior on the verbosity of the user
    • Both changes further improve performance
  • Model 3: CTC − ITC = k + a·REC + b·FPC + c·FPNC + d·FN + e·TN
    • The FP term was split in two, since there are two different types of false positives in the system, which intuitively should have very different costs:
      • FPC = false positives containing relevant concepts
      • FPNC = false positives containing no relevant concepts
    • The resulting coefficients for model 3, together with their 95% confidence intervals, were estimated from the data; they yield the cost function used in Section 4 below.
• Other Models
  • Targeting completion (binary): a logistic regression model; the estimated model does not indicate a good fit
  • Targeting user satisfaction (5-point scale): based on only 35 dialogs; R² = 0.61, similar to the literature (Walker et al.)
    • Explanation: subjectivity of the metric plus the limited dataset

4. Fine-Tuning the Annotator
• We want to find the optimal trade-off point on the operating characteristic of the classifier. Minimizing the classification error rate (FP + FN) implicitly assumes equal costs; instead, the problem translates to locating the point on the operating characteristic (by moving the classification threshold) that minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than minimizing the classification error rate.
• The cost, according to model 3, is:
  Cost = 0.48 · FPNC + 2.12 · FPC + 1.33 · FN + 0.56 · TN
• The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs to FNs that the system makes. (A sketch of this threshold sweep is given after the conclusions.)

5. Further Analysis
• Is CTC − ITC an adequate metric?
  • Mean = 0.71; standard deviation = 0.28
  • Mean for completed dialogs = 0.82; mean for uncompleted dialogs = 0.57
  • The difference is statistically significant at a very high level of confidence (p = 7.23 × 10⁻⁹)
• Can we reliably extrapolate the model to other areas of the ROC?
  • The distribution of FPs and FNs across dialogs indicates that, although the data was obtained with the confidence annotator running at a threshold of 0.5, we have enough samples to reliably estimate the other areas of the ROC.
• How about the impact of the baseline error rate?
  • Cost models constructed from sessions with a low baseline error rate indicate that the optimal point is at threshold 0 (i.e., no confidence annotator).
  • Explanation: incorrectly captured information is easy to overwrite in the CMU Communicator, and the baseline error rates are relatively low.

6. Conclusions
• We proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
• Models based on efficiency fit the data well, and the obtained costs confirm our intuitions.
• For the CMU Communicator, the models predict that the total cost stays roughly the same across a large range of the confidence annotator's operating characteristic.

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 2001.
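To make the fine-tuning step concrete, here is a minimal sketch (with randomly generated placeholder data, not the Communicator logs) of sweeping the acceptance threshold and scoring each setting with the model-3 cost function above:

```python
# Minimal sketch: choose the acceptance threshold that minimizes the
# model-3 cost rather than the classification error rate.
import numpy as np

rng = np.random.default_rng(1)
n = 2561
ok = rng.random(n) < 0.75                       # placeholder: utterance understood OK
conf = np.clip(ok * 0.3 + rng.random(n), 0, 1)  # placeholder confidence scores
has_concepts = rng.random(n) < 0.8              # placeholder: carries relevant concepts

def cost(th):
    accept = conf >= th
    fpc  = np.sum(accept & ~ok & has_concepts)   # accepted bad, with concepts
    fpnc = np.sum(accept & ~ok & ~has_concepts)  # accepted bad, no concepts
    fn   = np.sum(~accept & ok)                  # rejected good utterance
    tn   = np.sum(~accept & ~ok)                 # rejected bad (user must retry)
    # Coefficients taken from cost model 3 above.
    return 0.48 * fpnc + 2.12 * fpc + 1.33 * fn + 0.56 * tn

thresholds = np.linspace(0, 1, 101)
best = min(thresholds, key=cost)
print(f"optimal threshold: {best:.2f}, cost: {cost(best):.1f}")
```

With real confidence scores and labels, a near-flat cost curve over a wide range of thresholds would reproduce the trade-off behavior reported in the conclusions.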
