
Spoken Dialogue Systems: A Tutorial



  1. Spoken Dialogue Systems: A Tutorial Gautam Varma Mantena, Dr. Kishore Prahallad Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  2. Introduction • Interface between a human and a machine for information access. • Some applications for information access: • Search engines. Example: Google search, Bing, etc. • IVR applications. Example: Most customer care services • Dialogue Systems. Example: Google chat bots (guru@google.com) • Dialogue Systems: • Dialogue is a much freer mode of communication. • Speech being the natural mode of communication, there is a need to build systems which communicate with users via speech. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  3. Introduction (contd.) • Modes of communication in a dialogue system • Text • Speech • Multimodal features like gestures, touch screen, etc. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  4. Dialogue • How is spoken dialogue different from other modes of communication? • Input: The input utterance can contain disfluencies (like ‘uh’ and ‘hmm’), word repetitions, etc. • Barge-ins: Interrupting the system before allowing it to finish. • Turn taking: The system should know when to speak (i.e. when to take control of the conversation). • Prosodic information. • Grounding with the user. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  5. Applications of Spoken Dialogue Systems • ARISE – Automatic Railway Information System for Europe • RoomLine – Conference room scheduling and reservation. • Let’s Go – Bus information system • TeamTalk – Command and control interface to a team of robots • Communicator – Air travel planning • Jupiter – Automated weather service • MIS – Mandi Information System and many more Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  6. Characteristics of a Spoken Dialogue System • Ability to understand the user’s goal and to reach an appropriate and satisfactory solution. • Ability to carry out sub-dialogues to achieve sub-goals. • Ability to pass control from one sub-dialogue to another. • Ability to vary the dialogue initiative modes from system initiative to user initiative. • Use of a user model to anticipate the user’s utterances and act appropriately. • Directing the user towards task completion. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  7. Components of a Spoken Dialogue System • Automatic Speech Recognition (ASR) • Convert user input (utterance/speech) to text • Natural Language Understanding (NLU) • Structure the text into a format which can be understood by the system • Dialogue Manager (DM) • Decide the actions to be performed based on the structured text. • Natural Language Generator (NLG) • Generate the text that is to be conveyed to the user. • Text to Speech Synthesis (TTS) • Convert the text to speech Speech and Vision Lab, International Institute of Information Technology - Hyderabad
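
To make the pipeline concrete, below is a minimal Python sketch of how the five components can be chained for a single turn. The component classes, the commodity-price slots and the canned recognizer output are hypothetical placeholders for illustration, not the implementation of any system described in this tutorial.

# Hypothetical single-turn pipeline: ASR -> NLU -> DM -> NLG -> TTS.

class ASR:
    def transcribe(self, audio):
        # A real system would run a speech recognizer on the audio here.
        return "price of tomato in guntur"

class NLU:
    def parse(self, text):
        # Map free text to a structured representation (slot-value pairs).
        slots = {}
        for value, slot in (("tomato", "commodity"), ("guntur", "district")):
            if value in text:
                slots[slot] = value
        return slots

class DialogueManager:
    def next_action(self, slots):
        # Decide what to do based on which slots are still missing.
        missing = [s for s in ("commodity", "district", "market") if s not in slots]
        return ("ask", missing[0]) if missing else ("lookup", slots)

class NLG:
    def realize(self, action):
        kind, arg = action
        return f"Please tell me the {arg}." if kind == "ask" else f"Fetching prices for {arg}."

class TTS:
    def speak(self, text):
        print("SYSTEM:", text)  # stand-in for speech synthesis

def one_turn(audio):
    text = ASR().transcribe(audio)
    slots = NLU().parse(text)
    action = DialogueManager().next_action(slots)
    TTS().speak(NLG().realize(action))

one_turn(audio=None)  # SYSTEM: Please tell me the market.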

  8. Architecture of a Spoken Dialogue System Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  9. Approaches to build a Dialogue Manager • Classification can be done based on: • How the conversation is controlled: • System initiative: System has complete control of the conversation. • User initiative: User has complete dialogue control and the system only responds to user requests. • Mixed initiative: Both the user and the system take turns in controlling the conversation. • Functioning of the DM: • Finite State • Self Organizing Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  10. Finite State Dialogue Managers • Dialogue is structured in a sequence of predetermined utterances. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  11. Finite State Dialogue Managers (contd.) • The user is expected to answer only the system queries. Providing more information would be redundant. • FS-DM systems are system initiative as they restrict user responses. • FS-DM systems are very robust as the system already knows what state it is in and what the possible user responses are. • Useful when dealing with well structured data like small enquiry systems, questionnaires, etc. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
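
A minimal sketch of a finite-state dialogue manager, assuming a simple three-question enquiry flow; the state names and prompts are illustrative, not taken from any particular system.

# Each state has a fixed prompt and a fixed successor: the dialogue path
# is fully predetermined, as in a finite-state dialogue manager.
STATES = {
    "ask_commodity": {"prompt": "Which commodity?", "next": "ask_district"},
    "ask_district":  {"prompt": "Which district?",  "next": "ask_market"},
    "ask_market":    {"prompt": "Which market?",    "next": "done"},
}

def run_dialogue(get_user_input):
    state, answers = "ask_commodity", {}
    while state != "done":
        print("SYSTEM:", STATES[state]["prompt"])
        answers[state] = get_user_input()   # the user only answers the current query
        state = STATES[state]["next"]
    return answers

# Example run with canned user replies:
replies = iter(["tomato", "guntur", "guntur market"])
print(run_dialogue(lambda: next(replies)))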

  12. Self Organizing Models • Dialogue path is not predetermined; it evolves along with the user responses to the system. • Mixed initiative. • Provide a much freer form of communication as the user is not restricted. • Types of self organizing models: • Frame based • Agent based • Information State based Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  13. Frame Based Dialogue Manager • Frame (or form) based models function like a form-filling application: the frame keeps track of the information present and the information still to be acquired. • The system chooses the required question to be asked at every instant. The dialogue path is not predetermined. • The state of the DM is defined as the content of the frame. • A frame is a data structure which holds the concepts and the necessary actions to be performed. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  14. Frame Based Dialogue Manager (contd) • Preconditions are the conditions under which the actions are to be performed. • Given all the questions and the preconditions, the system determines what actions are to be performed. • Example: Speech and Vision Lab, International Institute of Information Technology - Hyderabad
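
As a rough illustration of the above, the sketch below shows a frame with three slots and precondition-action rules; the slot names and rules are assumptions in the spirit of a commodity-price enquiry, not the actual frame definition from the slides.

# The state of the DM is the content of the frame; each rule pairs a
# precondition on the frame with the action to perform when it holds.
frame = {"commodity": None, "district": None, "market": None}

rules = [
    (lambda f: f["commodity"] is None, ("ask", "commodity")),
    (lambda f: f["district"] is None,  ("ask", "district")),
    (lambda f: f["market"] is None,    ("ask", "market")),
    (lambda f: all(f.values()),        ("query_database", None)),
]

def next_action(frame):
    # The first rule whose precondition is satisfied determines the next action,
    # so the dialogue path is not predetermined: it depends on what is filled.
    for precondition, action in rules:
        if precondition(frame):
            return action

print(next_action(frame))                           # ('ask', 'commodity')
frame.update(commodity="tomato", district="guntur", market="guntur market")
print(next_action(frame))                           # ('query_database', None)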

  15. Agent based Systems • An agent is one which perceives its environment and performs the necessary actions. • Information obtained from the environment by an agent is called a percept. • An agent’s behavior is dependent on the sequence of percepts it receives from the environment. • The goal of an intelligent agent is to effectively map percept sequences to the corresponding action sequences via a program called the agent program. • An example model is the plan based model. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  16. Agent based Systems (contd) • In a plan based model, given an initial state and a goal state, the system constructs a plan, or a sequence of operations to be performed. • The series of actions to be performed is defined by an action schema. • An example action schema consists of the following parameters and constraints: • Preconditions: Necessary conditions for successful execution of the procedure Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  17. Agent based Systems (contd) • Effect: Conditions that become true after execution of the procedure. • Body: A set of goal states that are to be achieved in executing the procedure. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
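
A minimal sketch of an action schema with the three parts just listed (preconditions, effect, body); the obtain_commodity schema and the set-of-facts state representation are illustrative assumptions, not the schema used in any particular system.

from dataclasses import dataclass, field

@dataclass
class ActionSchema:
    name: str
    preconditions: list                          # facts that must hold before execution
    effect: list                                 # facts that become true after execution
    body: list = field(default_factory=list)     # sub-goals achieved while executing

obtain_commodity = ActionSchema(
    name="obtain_commodity",
    preconditions=["user_connected"],
    effect=["commodity_known"],
    body=["prompt_for_commodity", "recognize_commodity", "confirm_commodity"],
)

def applicable(schema, state):
    return all(p in state for p in schema.preconditions)

def execute(schema, state):
    # Executing the schema pursues its body sub-goals and adds its effects.
    return state | set(schema.effect)

state = {"user_connected"}
if applicable(obtain_commodity, state):
    state = execute(obtain_commodity, state)
print(state)   # {'user_connected', 'commodity_known'}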

  18. Information State based Dialogue Systems • Dialogue management operations are all in terms of the information state. • The information state of a dialogue represents: • Current information present. • Information obtained from previous dialogue moves. • Future actions to be performed. • The information state approach for dialogue modeling consists of: • A description and representation of the information components that constitute an information state. In general, these can be sub-divided into static and dynamic components. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  19. Information State based Dialogue Systems (contd) • Static components are those information components that do not change during a conversation with a user. Static components can be domain knowledge, update rules, etc. • Set of dialogue moves that will update the information state. • Set of update rules that control the update process of the information state. Rules are governed by the current information state and the performed dialogue move. • An update strategy for deciding which update rule to be executed from a set of applicable ones. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
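
The sketch below puts these pieces together: an information state with static and dynamic components, update rules triggered by dialogue moves, and a simple first-applicable-rule update strategy. All names and the move format are illustrative assumptions.

# Information state with a static component (domain) and dynamic components
# (information gathered so far, and future actions still on the agenda).
info_state = {
    "domain": "commodity_prices",
    "filled": {},
    "agenda": ["commodity", "district", "market"],
}

def rule_fill_slot(state, move):
    # Applies when the user answers the question currently on the agenda.
    if move["type"] == "answer" and state["agenda"]:
        slot = state["agenda"][0]
        state["filled"][slot] = move["value"]
        state["agenda"] = state["agenda"][1:]
        return True
    return False

def rule_ignore(state, move):
    return True   # fallback rule: the move leaves the state unchanged

UPDATE_RULES = [rule_fill_slot, rule_ignore]

def update(state, move):
    # Update strategy: execute the first applicable rule.
    for rule in UPDATE_RULES:
        if rule(state, move):
            return state

update(info_state, {"type": "answer", "value": "tomato"})
print(info_state["filled"], info_state["agenda"])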

  20. Stochastic Methods for learning Dialogue Strategies • Dialogue systems like information state based systems are complex to build. • The difficulty comes in incorporating update moves and strategies. • An approach to overcome the complexity is to model the dialogue manager as a stochastic model and train it using a preliminary dialogue corpus. • In the case of real-time systems, it is difficult to foresee all possible dialogue scenarios that could occur between a user and a system. To overcome such problems we can approximate a dialogue system by a stochastic model. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  21. Stochastic Methods for learning Dialogue Strategies (contd.) • Some of the approaches used for modeling dialogue managers are as follows: • Markov Decision Process (MDP) • Partially Observable Markov Decision Process (POMDP) Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  22. Markov Decision Process • A Markov decision process or MDP is characterized by: • A set of states S an agent can be in. • A set of actions A the agent can take. • A reward R(s,a) that the agent receives for taking action a in state s. • The goal is to find a policy which specifies the action to be taken in each state to receive the best cumulative reward. • The expected cumulative reward Q for a sequence of states and actions is Q[s0, a0, s1, a1, …] = R(s0, a0) + αR(s1, a1) + α²R(s2, a2) + … Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  23. Markov Decision Process • α is called the discount factor and lies between 0 and 1. • It gives more importance to the current reward than to future rewards. • The further in the future a reward is, the more discounted its value. • Bellman equation: Q(s,a) = R(s,a) + α Σs' P(s'|s,a) maxa' Q(s',a') • We require hand-labeled data to estimate R(s,a) and P(s'|s,a). Speech and Vision Lab, International Institute of Information Technology - Hyderabad
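
As an illustration of the Bellman equation above, the sketch below runs value iteration on a toy two-state dialogue MDP; the states, actions, rewards and transition probabilities are made up for the example.

STATES = ["need_info", "have_info"]
ACTIONS = ["ask", "confirm"]
ALPHA = 0.9   # discount factor

# R[s][a] and P[s][a][s'] for the toy problem.
R = {"need_info": {"ask": -1, "confirm": -2},
     "have_info": {"ask": -1, "confirm": +10}}
P = {"need_info": {"ask": {"need_info": 0.3, "have_info": 0.7},
                   "confirm": {"need_info": 1.0, "have_info": 0.0}},
     "have_info": {"ask": {"need_info": 0.1, "have_info": 0.9},
                   "confirm": {"need_info": 1.0, "have_info": 0.0}}}

Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
for _ in range(100):
    for s in STATES:
        for a in ACTIONS:
            # Bellman update: Q(s,a) = R(s,a) + α Σ_s' P(s'|s,a) max_a' Q(s',a')
            Q[s][a] = R[s][a] + ALPHA * sum(
                P[s][a][s2] * max(Q[s2].values()) for s2 in STATES)

policy = {s: max(Q[s], key=Q[s].get) for s in STATES}
print(policy)   # {'need_info': 'ask', 'have_info': 'confirm'}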

  24. Partially Observable Markov Decision Process (POMDP) • Problems with MDP: • The system might never know what state the user is in. • The system state might be misdirected by recognition errors. • POMDPs are used to overcome the above problems. • The environment is not completely observable for the agent. • A POMDP provides a framework to overcome such errors by explicitly modeling the recognition errors. • It models the recognized user output as an observed signal generated from a hidden variable. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
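
A minimal sketch of the hidden-variable idea: the user's intent is hidden, the ASR hypothesis is a noisy observation of it, and the system maintains a belief (a distribution over hidden states) instead of a single state. The confusion probabilities are made up, and a full POMDP belief update would also fold in the transition model P(s'|s,a); only the observation step is shown.

HIDDEN = ["wants_tomato", "wants_potato"]

# P(observation | hidden state): the recognizer sometimes confuses the words.
OBS_MODEL = {"wants_tomato": {"tomato": 0.8, "potato": 0.2},
             "wants_potato": {"tomato": 0.3, "potato": 0.7}}

def update_belief(belief, observation):
    # Bayes rule: b'(s) is proportional to P(o|s) * b(s)
    unnorm = {s: OBS_MODEL[s][observation] * belief[s] for s in HIDDEN}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {s: 1.0 / len(HIDDEN) for s in HIDDEN}   # uniform prior over intents
belief = update_belief(belief, "tomato")
print(belief)   # belief shifts towards 'wants_tomato' but keeps the uncertainty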

  25. Evaluation of Spoken Dialogue System • Dialogue metrics can be classified as: • Objective measures: calculated automatically by machine without any considerations towards human judgment. • Some objective measures: • Percentage of correct answers given by a system to a user. • Number of turns taken to complete the task. • Mean user response time. • Mean system response time. • Percentage of errors recognized by the dialogue system. • Time taken to complete the task. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  26. Evaluation of Spoken Dialogue System (contd) • Subjective measures: require human analysis and a set of ground rules for user evaluation. • Some subjective measures: • System cooperation. • User satisfaction. • Percentage of correct and partially correct answers. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  27. Issues in evaluating Spoken Dialogue Systems • These metrics only work when the system response is a single correct answer. They cannot be used to compare systems whose outputs are summarized texts. • The different metrics cannot be clearly demarcated. Some of the given metrics can be correlated and hence redundant. • Given all the metrics, how do we combine them for evaluation? Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  28. Mandi Information System Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  29. Mandi Information System • A conversation system for accessing prices of agricultural commodities • Target users: farmers in rural and semi-urban areas • Commodities: Vegetables, fruits, pulses, spices, etc. • Markets: Andhra Pradesh, India • Language: Telugu Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  30. Issues/Challenges involved • Noisy environment: • Target users are primarily farmers in rural and semi-urban areas. • Quality of speech is affected by the surrounding noise and also by the distance of the mobile/telephone handset or the microphone. • Dialect/Pronunciation variation: • Four distinct dialects of Telugu in Andhra Pradesh • Dialectal variation is a continuum • Unstructured conversation: • Target audience has no familiarity with computers • Conversation is unstructured with disfluencies like repeats and false starts • Personalization: • Useful to guide the user in obtaining the required information as quickly as possible. • Requires mining the user's previous calls to predict his/her preferred query during the next call. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  31. Building a baseline system • Data collection • Users were asked to read out the names of the commodities, place names, etc. from the text provided to them. • Users were shown a series of pictures of agricultural commodities and were asked to say the names of the commodities shown in the picture. This is to collect the dialectal variations of the commodities. • Users were asked a series of questions related to agriculture and the places around their locality. This is to record conversational speech. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  32. Building a baseline system (contd) • Dialogue flow is as follows Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  33. Building a baseline system (contd) • Finite state dialogue manager • Limited domain and only three inputs are required. • Well structured. • Real time application, and the system is expected to be very robust. • A sample conversation of the system Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  34. Building a baseline system (contd) • Sample action schema for MIS • Obtaining commodity name Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  35. Building a baseline system (contd) • Obtain district name Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  36. Characteristics of MIS baseline system • MIS expects three concepts or inputs from the user: commodity, district and market names. • When a user provides some information to MIS, the goodness of the recognition hypothesis is checked using explicit confirmation. This is to make sure that the input query is right, as recognition is error prone. • The strategy of explicit confirmation is hardly convenient, as the user has to confirm all the information provided in response to the MIS queries. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  37. Mandi Information System II • A spoken dialogue system should provide accurate information to a user in as few turns (or interactions) as possible. • Speech recognition being error prone, it is difficult to avoid confirmations from users. However, the objective is to limit these confirmations. • One approach is to associate a confidence score with the recognition output of the ASR. • Based on the confidence measure we decide the dialogue flow of the system. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
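
One simple way to let the confidence score drive the dialogue flow is a thresholded decision, sketched below; the threshold values and the three actions are illustrative assumptions, not the policy used in MIS-II.

ACCEPT_THRESHOLD = 0.85   # accept the hypothesis without confirmation
CONFIRM_THRESHOLD = 0.50  # below this, re-prompt instead of confirming

def decide(hypothesis, confidence):
    if confidence >= ACCEPT_THRESHOLD:
        return ("accept", hypothesis)          # move to the next dialogue state
    if confidence >= CONFIRM_THRESHOLD:
        return ("confirm", hypothesis)         # explicit yes/no confirmation
    return ("reprompt", None)                  # ask the user to repeat

print(decide("tomato", 0.92))   # ('accept', 'tomato')
print(decide("tomato", 0.60))   # ('confirm', 'tomato')
print(decide("tomato", 0.30))   # ('reprompt', None)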

  38. Multiple Decoders and Contextual Information • One can build multiple ASR decoders, where each decoder tries to capture complementary information about the speech data. This can be done by training multiple decoders using: • Different training datasets • Different features such as Mel-frequency cepstral coefficients and linear prediction cepstral coefficients • If a majority of these decoders agree on a hypothesis, the recognized output is considered reliable and the system can choose to avoid an explicit confirmation from the user. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  39. Multiple Decoders and Contextual Information (contd) • Given a set of decoders {d1, d2} and a given acoustic signal, let their corresponding hypotheses be {h1, h2}. • Let C[i] be the contextual information for the dialogue state i. • The following are the possible cases: • Case 1: h1 = h2 and h1 belongs to C[i] • Action: The recognition output is most likely correct and the system jumps to the subsequent dialogue states. • Case 2: h1 = h2 and h1 does not belong to C[i] • Action: The recognition output is most likely correct, but the input is of no help as it is not present in the contextual information. The system prompts the user, saying that no information is available for the given input, and asks the user to provide some other query. Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  40. Multiple Decoders and Contextual Information (contd) • Case 3: h1 ≠ h2 and one of h1, h2 belongs to C[i] • Action: The system considers the hypothesis that is present in the contextual information and discards the other. To make sure that the recognition is correct, the system asks for an explicit confirmation from the user. • Case 4: h1 ≠ h2 and neither h1 nor h2 belongs to C[i] • Action: A mis-recognition might have occurred and the system prompts the user to provide the information again. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
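
The four cases can be written directly as a decision function over the two hypotheses and the contextual information C[i]; the function name, return values and the example context below are illustrative, not taken from the MIS implementation.

def decide(h1, h2, context):
    in_ctx1, in_ctx2 = h1 in context, h2 in context
    if h1 == h2 and in_ctx1:
        return ("accept", h1)          # Case 1: decoders agree and hypothesis is in context
    if h1 == h2 and not in_ctx1:
        return ("no_information", h1)  # Case 2: agree, but no data for this query
    if h1 != h2 and (in_ctx1 or in_ctx2):
        keep = h1 if in_ctx1 else h2
        return ("confirm", keep)       # Case 3: keep an in-context hypothesis, confirm it
    return ("reprompt", None)          # Case 4: likely mis-recognition, ask again

context = {"tomato", "potato", "onion"}        # commodities for which prices exist
print(decide("tomato", "tomato", context))     # ('accept', 'tomato')
print(decide("tomato", "tamata", context))     # ('confirm', 'tomato')
print(decide("tamata", "tamatar", context))    # ('reprompt', None)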

  41. Multiple Decoders and Contextual Information (contd) Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  42. Multiple Decoders and Contextual Information (contd) • Sample conversation logged using MIS II Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  43. MIS performance • Performance was measured based on whether the user was able to retrieve information or not. • Evaluation was conducted on 8 users, where each user was asked to retrieve information for any 5 commodities. The performance of the MIS baseline system and MIS-II is as follows: Speech and Vision Lab, International Institute of Information Technology - Hyderabad

  44. Personalization • A frequent caller expects his/her preferred query to be automatically answered. • Such personalization requires identifying a user and mining the user's previous calls to predict his/her preferred query during the next call. • Currently the system logs user calls based on the caller id. • If the user is a frequent caller, the system prompts a yes/no question asking about the most recent transaction they made. • Example: Do you want information on Rice in Vishakhapatnam? • The dialogue flow will be user specific and based on user responses. Speech and Vision Lab, International Institute of Information Technology - Hyderabad
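
A minimal sketch of caller-id based personalization: log each caller's queries and, for a returning caller, open with a yes/no prompt about the most recent transaction. The call-log structure, caller ids and prompt wording are illustrative assumptions.

from collections import defaultdict

call_log = defaultdict(list)   # caller id -> list of past queries

def log_call(caller_id, query):
    call_log[caller_id].append(query)

def opening_prompt(caller_id):
    history = call_log.get(caller_id, [])
    if history:   # frequent caller: offer the most recent transaction as a yes/no question
        last = history[-1]
        return f"Do you want information on {last['commodity']} in {last['market']}?"
    return "Which commodity do you want information on?"

log_call("caller_1", {"commodity": "Rice", "market": "Vishakhapatnam"})
print(opening_prompt("caller_1"))   # personalized yes/no prompt
print(opening_prompt("caller_2"))   # generic opening for a new caller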

  45. MIS Demo Number: 66150320 extension 2

  46. Thank You
