**Tony Jebara, Columbia University** Dynamic Bayesian Networks for Multimodal Interaction Tony Jebara Machine Learning Lab Columbia University joint work with A. Howard and N. Gu

Outline
• Introduction: Multi-Modal and Multi-Person
• Bayesian Networks and the Junction Tree Algorithm
• Maximum Likelihood and Expectation Maximization
• Dynamic Bayesian Networks (HMMs, Kalman Filters)
• Hidden ARMA Models
• Maximum Conditional Likelihood and Conditional EM
• Two-Person Visual Interaction (Gesture Games)
• Input-Output Hidden Markov Models
• Audio-Visual Interaction (Conversation)
• Intractable DBNs, Minimum Free Energy, Generalized EM
• Dynamical System Trees
• Multi-Person Visual Interaction (Football Plays)
• Haptic-Visual Modeling (Surgical Drills)
• Ongoing Directions

Introduction
• Simplest dynamical systems (a single Markovian process)
• Hidden Markov Model and Kalman Filter
• But multi-modal data (audio, video and haptics) have:
• Processes on different time scales
• Processes on different amplitude scales
• Processes with different noise characteristics
• Also, multi-person data (multi-limb, two-person, group) are:
• Weakly coupled
• Conditionally dependent
• Dangerous to slam all time data into one single series
• Find new ways to zipper multiple interacting processes

Bayesian Networks
• Also called Graphical Models
• Marry graph theory & statistics
• Directed graph which efficiently encodes a large p(x1,…,xN) as a product of conditionals of each node given its parents
• Avoids storing a huge hypercube over all variables x1,…,xN
• Here, xi is discrete (multinomial) or continuous (Gaussian)
• Split BNs over sets of hidden XH and observed XV variables
• Three basic operations for BNs:
• 1) Infer marginals/conditionals of hidden variables (JTA)
• 2) Compute likelihood of the data (JTA)
• 3) Maximize likelihood of the data (EM)
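
A minimal sketch of the "product of conditionals" idea, assuming a hypothetical three-node binary chain x1 -> x2 -> x3 with made-up probability tables (none of these numbers come from the talk). It shows that the factored joint is a valid distribution and how much smaller the stored tables are than the full hypercube.

```python
import itertools

# Hypothetical 3-node chain x1 -> x2 -> x3 over binary variables.
# Each table is the conditional of a node given its parent.
p_x1 = [0.6, 0.4]                           # p(x1)
p_x2_given_x1 = [[0.7, 0.3], [0.2, 0.8]]    # p(x2 | x1), rows indexed by x1
p_x3_given_x2 = [[0.9, 0.1], [0.5, 0.5]]    # p(x3 | x2), rows indexed by x2


def joint(x1, x2, x3):
    """Joint probability as the product of node-given-parent conditionals."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]


# The factored form sums to one over all 2**3 configurations.
total = sum(joint(*cfg) for cfg in itertools.product([0, 1], repeat=3))


def table_sizes(n, k=2):
    """Entries stored by a full hypercube vs. a first-order chain of n k-ary nodes."""
    full = k ** n
    chain = k + (n - 1) * k * k
    return full, chain
```

For a chain of 20 binary nodes, `table_sizes(20)` compares the 2^20 entries of the full table with the handful of entries the conditionals need.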

Bayes Nets to Junction Trees
• Workhorse of BNs is the Junction Tree Algorithm
• Conversion steps: 1) Bayes net 2) Moral graph 3) Triangulated graph 4) Junction tree

Junction Tree Algorithm
• The JTA sends messages from cliques through separators (these are just tables or potential functions)
• Ensures that the various tables in the junction tree agree (are consistent) over shared variables (via marginals)
• If neighboring cliques V and W already agree: done
• Else: send message from V to W, then send message from W to V; then the cliques agree
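
A hedged sketch of the two messages on this slide, assuming the standard Hugin-style update (marginalize one clique onto the separator, then rescale the other clique by the ratio of new to old separator). The cliques V = {A, B} and W = {B, C}, the table sizes and the random potentials are illustrative assumptions.

```python
import numpy as np

# Hypothetical cliques V = {A, B} and W = {B, C} sharing separator S = {B}.
rng = np.random.default_rng(0)
psi_V = rng.random((2, 3))   # potential over (A, B)
psi_W = rng.random((3, 4))   # potential over (B, C)
phi_S = np.ones(3)           # separator potential over B


def send_V_to_W(psi_V, psi_W, phi_S):
    """Marginalize V onto the separator, then rescale W by new/old separator."""
    phi_new = psi_V.sum(axis=0)                       # sum out A
    psi_W_new = psi_W * (phi_new / phi_S)[:, None]
    return psi_W_new, phi_new


def send_W_to_V(psi_V, psi_W, phi_S):
    """Marginalize W onto the separator, then rescale V by new/old separator."""
    phi_new = psi_W.sum(axis=1)                       # sum out C
    psi_V_new = psi_V * (phi_new / phi_S)[None, :]
    return psi_V_new, phi_new


psi_W1, phi_S1 = send_V_to_W(psi_V, psi_W, phi_S)
psi_V2, phi_S2 = send_W_to_V(psi_V, psi_W1, phi_S1)

# After both messages the two cliques give the same table over B.
marg_from_V = psi_V2.sum(axis=0)
marg_from_W = psi_W1.sum(axis=1)

# The product of cliques divided by the separator is unchanged by the messages.
joint_before = psi_V[:, :, None] * psi_W[None, :, :] / phi_S[None, :, None]
joint_after = psi_V2[:, :, None] * psi_W1[None, :, :] / phi_S2[None, :, None]
```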

Junction Tree Algorithm
• On trees, JTA is guaranteed: 1) Init 2) Collect 3) Distribute
• Ends with potentials as marginals or conditionals of hidden variables given data: p(Xh1|Xv), p(Xh2|Xv), p(Xh1, Xh2|Xv)
• And the likelihood p(Xv) is the potential normalizer

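Maximum Likelihood with EM
• We wish to maximize the likelihood over θ for learning (observed x, hidden z): $l(\theta) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$
• EM instead iteratively maximizes a lower bound on the log-likelihood: $L(q,\theta) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} \le l(\theta)$
• E-step: $q^{t+1} = \arg\max_q L(q, \theta^{t})$
• M-step: $\theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)$
• [Figure: l(θ) plotted against θ; the bound L(q,θ) plotted against q(z) and θ]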
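
A minimal sketch of the E-step / M-step alternation, assuming a two-component 1-D Gaussian mixture on synthetic data (the model, data and initial values are illustrative, not from the talk). The recorded log-likelihood should never decrease from one iteration to the next, which is the property the bound argument gives.

```python
import numpy as np


def weighted_densities(x, w, mu, var):
    """w_k * N(x_i | mu_k, var_k) for every sample i and component k."""
    return w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)


def log_likelihood(x, w, mu, var):
    """Incomplete-data log-likelihood of the mixture."""
    return np.log(weighted_densities(x, w, mu, var).sum(axis=1)).sum()


def em_step(x, w, mu, var):
    # E-step: q(z) is the posterior responsibility of each component.
    comp = weighted_densities(x, w, mu, var)
    q = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the bound over the parameters with q held fixed.
    nk = q.sum(axis=0)
    w_new = nk / len(x)
    mu_new = (q * x[:, None]).sum(axis=0) / nk
    var_new = (q * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return w_new, mu_new, var_new


# Synthetic data from two well-separated Gaussians (assumed for illustration).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

history = [log_likelihood(x, w, mu, var)]
for _ in range(30):
    w, mu, var = em_step(x, w, mu, var)
    history.append(log_likelihood(x, w, mu, var))
```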

Dynamic Bayes Nets
• Dynamic Bayesian Networks are BNs unrolled in time
• Simplest and most classical examples are:
• Linear Dynamical System: state transition model and emission model
• Hidden Markov Model: state transition model and emission model
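
A small sketch of likelihood computation in the discrete case, assuming the textbook HMM parameterization: an initial distribution, a transition table p(s_t | s_{t-1}) and an emission table p(y_t | s_t). The numbers are made up. The forward recursion is the chain-structured special case of message passing, and the brute-force sum over hidden paths is included only as a check.

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                 # p(s_1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])    # A[i, j] = p(s_t = j | s_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1],            # B[i, k] = p(y_t = k | s_t = i)
              [0.1, 0.3, 0.6]])


def forward_likelihood(obs):
    """p(y_1..y_T) by passing messages forward along the chain."""
    alpha = pi * B[:, obs[0]]
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
    return alpha.sum()


def brute_force_likelihood(obs):
    """Same quantity by summing the joint over every hidden path."""
    total = 0.0
    for path in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total


obs = [0, 2, 1, 2]
# Probabilities of all length-3 observation sequences add up to one.
total_mass = sum(forward_likelihood(seq)
                 for seq in itertools.product(range(3), repeat=3))
```

The forward pass costs time linear in the sequence length, while the brute-force sum grows exponentially, which is why the unrolled network is handled with message passing.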

Two-Person Interaction
• Learn from two interacting people (person Y and person X) to mimic the interaction via a simulated person Y
• One hidden Markov model for each user… no coupling!
• One time series for both users… too rigid!
• Learn from two users to get p(y|x)
• Interact with a single user via p(y|x)

DBN: Hidden ARMA Model
• Learn to imitate behavior by watching a teacher exhibit it
• E.g. unsupervised observation of 2-agent interaction
• E.g. track lip motion
• Discover correlations between past action & subsequent reaction
• Estimate p(Y | past X, past Y)

DBN: Hidden ARMA Model
• Focus on predicting person Y from the past of both X and Y
• Have multiple linear models mapping the past to the future
• Use a window for the moving average (compressed with PCA)
• But, select among them using S (nonlinear)
• Here, we show only a 2nd-order moving average to predict the next Y given the past two Y’s, the past two X’s, the current X, and a random choice of ARMA linear model
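
A rough sketch of the prediction step only, assuming scalar X and Y series and two hand-picked weight vectors. The weights, the function names and the omission of the PCA-compressed window are all simplifying assumptions for illustration. The switch `s` plays the role of the hidden variable S that selects which linear model is active.

```python
import numpy as np


def arma_features(x, y, t):
    """Stack the past two y's, the past two x's and the current x."""
    return np.array([y[t - 1], y[t - 2], x[t - 1], x[t - 2], x[t]])


def predict(x, y, t, weights, s):
    """Linear prediction of y[t] using the model selected by switch s."""
    return weights[s] @ arma_features(x, y, t)


# Two hypothetical linear models; the hidden switch S picks one per time step.
weights = np.array([
    [2.0, -1.0, 0.0, 0.0, 0.0],   # model 0: constant-velocity extrapolation of y
    [0.0, 0.0, 0.0, 0.0, 1.0],    # model 1: copy the partner's current x
])

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 0.0])   # y[3] is the value to be predicted

pred_model_0 = predict(x, y, 3, weights, s=0)
pred_model_1 = predict(x, y, 3, weights, s=1)
```

Each individual model is linear, but choosing among them with S makes the overall predictor nonlinear.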

Hidden ARMA Features:
• Model skin color as a mixture of RGB Gaussians
• Track person as a mixture of spatial Gaussians
• But, want to predict only Y from X… be discriminative
• Use maximum conditional likelihood (CEM)

Conditional EM
• Only need a conditional?
• Then maximize conditional likelihood
• EM: divide & conquer
• CEM: discriminative divide & conquer
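
For reference, the two objectives can be written side by side. This is a sketch in assumed notation (inputs x_i, outputs y_i, hidden mixture component c, parameters θ), not a transcription of the slide's equations. The conditional objective is the joint objective minus the marginal log-likelihood of the inputs, so modeling effort is not spent on p(x).

```latex
\begin{align}
l(\theta) &= \sum_{i} \log p(x_i, y_i \mid \theta)
           = \sum_{i} \log \sum_{c} p(c, x_i, y_i \mid \theta) \\
l^{c}(\theta) &= \sum_{i} \log p(y_i \mid x_i, \theta)
           = \sum_{i} \Big[ \log \sum_{c} p(c, x_i, y_i \mid \theta)
                          - \log \sum_{c} p(c, x_i \mid \theta) \Big]
\end{align}
```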

Conditional EM
• CEM vs. EM
• [Figure panels: CEM p(y|x); EM p(y|x); p(c|x,y)]
• CEM accuracy = 100%
• EM accuracy = 51%

Conditional EM for hidden ARMA
• Estimate prediction discriminatively/conditionally: p(future|past)
• 2 users gesture to each other for a few minutes
• Model: mix of 25 Gaussians, STM: T=120, Dims=22+15
• Nearest Neighbor: 1.57% RMS
• Constant Velocity: 0.85% RMS
• Hidden ARMA: 0.64% RMS

Hidden ARMA on Gesture
[Figure panels: SCARE, WAVE, CLAP]

DBN: Input-Output HMM
• Similarly, learn a person’s response to audio-video stimuli to predict Y (or agent A) from X (or world W)
• Wearable collects audio & video (A, W)
- Sony Picturebook laptop
- 2 cameras (7 Hz) (USB & analog)
- 2 microphones (USB & analog)
- 100 Megs per hour ($10/Gig)

DBN: Input-Output HMM
• Consider simulating the agent given the world
• A hidden Markov model on its own is insufficient since it does not distinguish between the input role the world has and the output we need to generate
• Instead, form an input-output HMM
• One IOHMM predicts the agent’s audio using all 3 past channels
• One IOHMM predicts the agent’s video
• Use CEM to learn the IOHMM discriminatively
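
A hedged sketch of how an input-output HMM differs from a plain HMM, assuming a small discrete model in which both the transition table and the emission table are indexed by the current input. All tables are invented, and the talk's models use continuous audio and video features rather than binary symbols. The forward pass returns the conditional likelihood p(y_1..y_T | x_1..x_T), which is the quantity a conditional learner cares about.

```python
import itertools
import numpy as np

# Hypothetical IOHMM with 2 hidden states, binary input x and binary output y.
pi = np.array([0.5, 0.5])                 # p(s_1), assumed independent of the input
# A[x, i, j] = p(s_t = j | s_{t-1} = i, x_t = x): the input drives the transitions.
A = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.2, 0.8], [0.6, 0.4]]])
# B[x, i, k] = p(y_t = k | s_t = i, x_t = x): the input also shapes the output.
B = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])


def conditional_likelihood(xs, ys):
    """p(y_1..y_T | x_1..x_T) via a forward pass with input-dependent tables."""
    alpha = pi * B[xs[0], :, ys[0]]
    for x, y in zip(xs[1:], ys[1:]):
        alpha = (alpha @ A[x]) * B[x, :, y]
    return alpha.sum()


xs = [0, 1, 1, 0]
# For a fixed input sequence, the outputs form a proper distribution.
total = sum(conditional_likelihood(xs, ys)
            for ys in itertools.product([0, 1], repeat=len(xs)))
```

Nothing in the model assigns a probability to the input sequence itself, which matches the goal of mapping world signals to agent signals without modeling the world.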

Input-Output HMM Data
Video
- Histogram lighting correction
- RGB mixture of Gaussians to detect skin
- Face: 2000 pixels at 7 Hz (X, Y, Intensity)
Audio
- Hamming window, FFT, equalization
- Spectrograms at 60 Hz
- 200 bands (Amplitude, Frequency)
Very noisy data set!
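
A minimal sketch of the windowing and FFT part of the audio pipeline, assuming a synthetic 440 Hz test tone, an 8 kHz sample rate and a frame length of 400 samples (all illustrative choices). The equalization step and the 200-band layout from the slide are not reproduced here.

```python
import numpy as np


def spectrogram(signal, frame_len, hop):
    """Magnitude spectrogram: Hamming-windowed frames followed by a real FFT."""
    window = np.hamming(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))


# Hypothetical test tone: 1 second of a 440 Hz sine sampled at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = spectrogram(tone, frame_len=400, hop=200)      # one row per frame
freqs = np.fft.rfftfreq(400, d=1.0 / rate)            # center frequency of each bin
peak_hz = freqs[spec.mean(axis=0).argmax()]           # strongest bin on average
```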

Video Representation
- Principal Components Analysis: linear, on vectors in Euclidean space
- Images, spectrograms and time series are treated as vectors
- Vectorization is bad (nonlinear)
- Images = collections of (X,Y,I) tuples (“pixels”)
- Spectrograms = collections of (A,F) tuples
…therefore…
- Corresponded Principal Components Analysis: X = … where the M are soft permutation matrices
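
For comparison, a short sketch of ordinary PCA compression to 20 dimensions via the SVD. This is the plain vectorized baseline, not the corresponded variant (CPCA), whose permutation step is not recoverable from the slide. The data is a synthetic stand-in chosen to lie in a low-dimensional subspace so that reconstruction is exact.

```python
import numpy as np


def pca_fit(data, n_components):
    """Return the mean and the top principal directions of row-vector data."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_encode(data, mean, components):
    """Project centered rows onto the principal directions."""
    return (data - mean) @ components.T


def pca_decode(codes, mean, components):
    """Map low-dimensional codes back to the original space."""
    return codes @ components + mean


# Hypothetical stand-in data: 50 vectors of length 300 lying in a 5-dim subspace.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 300))

mean, components = pca_fit(data, n_components=20)
codes = pca_encode(data, mean, components)
recon = pca_decode(codes, mean, components)
```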

Video Representation
2000 XYI pixels: compress to 20 dims
[Figure panels: Original, PCA, CPCA]

Input-Output HMM
Estimate hidden trellis from partial data
For agent and world: 1 loudness scalar, 20 spectrogram coefficients, 20 face coefficients

Input-Output HMM with CEM
Conditionally model:
p(Agent Audio | World Audio, World Video)
p(Agent Video | World Audio, World Video)
Don’t care how well we can model world audio and video, just as long as we can map it to agent audio or agent video
Avoids temporal scale problems too (Video 5 Hz, Audio 60 Hz)
Audio IOHMM with CEM: 60-state, 82-dim HMM; diagonal Gaussian emissions; 90,000 samples train / 36,000 test

Input-Output HMM with CEM
TRAINING & TESTING
Audio: EM (red) 99.61, CEM (blue) 100.58
Video: EM (red) -122.46, CEM (blue) -121.26
[Plot labels: Joint Likelihood, Conditional Likelihood]
RESYNTHESIS
Spectrograms from eigenspace
KD-Tree on video coefficients to closest image in training (point-cloud too confusing)

Input-Output HMM Results
[Figure panels: Train, Test]

Intractable Dynamic Bayes Nets
Factorial Hidden Markov Model: interaction through output
Coupled Hidden Markov Model: interaction through hidden states

Intractable DBNs: Generalized EM
• As before, we use a bound on the likelihood
• But the best q over hidden variables, the one that minimizes the KL, is intractable!
• Thus, restrict q to only explore factorized distributions
• EM still converges under partial E steps & partial M steps
• [Figure: -L(q,θ) plotted against q(z) and θ; l(θ) plotted against θ]
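
The role of the KL term can be made explicit with the standard decomposition of the log-likelihood. This is a sketch in assumed notation (observed x, hidden z, parameters θ, bound L); the factorization index j is illustrative. Because the KL term is non-negative, any q gives a valid lower bound, and restricting q to a factorized family only loosens the bound rather than breaking it.

```latex
\begin{align}
l(\theta) &= \log \sum_{z} p(x, z \mid \theta) \\
          &= \underbrace{\sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}}_{L(q,\theta)}
           + \underbrace{\mathrm{KL}\big(q(z) \,\|\, p(z \mid x, \theta)\big)}_{\ge 0} \\
q(z) &= \prod_{j} q_j(z_j) \qquad \text{(restricted, factorized family)}
\end{align}
```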

Intractable DBNs: Variational EM
• Now, the q distributions are limited to be chains
• Tractable as an iterative method
• Also known as variational EM (structured mean-field)
• [Figure panels: Factorial Hidden Markov Model, Coupled Hidden Markov Model]

Dynamical System Trees
• How to handle more people and a hierarchy of coupling?
• DSTs consider coupling, e.g. university staff: students -> department -> school -> university
• Interaction through aggregated community state
• Internal nodes are states. Leaf nodes are emissions.
• Any subtree is also a DST.
• [Figure: DST unrolled over 2 time steps]

Dynamical System Trees
• Also apply a generalization of EM and do variational structured mean field for the q distribution
• Becomes formulaic for any DST topology!
• Code available at http://www.cs.columbia.edu/~jebara/dst

DSTs and Generalized EM
• Structured mean field: use a tractable distribution Q to approximate P
• Introduce variational parameters (v.p.)
• Find min KL(Q||P)
• [Figure: alternate between inference and introducing v.p.]
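
A toy sketch of minimizing KL(Q||P) by coordinate updates, assuming a fully factorized Q over two binary variables and a made-up target table P. This is the plain (naive) mean-field case, swapped in for brevity. The structured version on the slide keeps whole chains inside Q, but the alternating pattern of updating one factor while the others are fixed is the same.

```python
import numpy as np


def kl(q1, q2, p):
    """KL(Q || P) for a factorized Q(z1, z2) = q1(z1) q2(z2)."""
    q = np.outer(q1, q2)
    return float((q * (np.log(q) - np.log(p))).sum())


def mean_field(p, n_iters=20):
    """Coordinate descent on KL(Q || P) over the two factors of Q."""
    log_p = np.log(p)
    q1 = np.array([0.5, 0.5])
    q2 = np.array([0.6, 0.4])
    trace = [kl(q1, q2, p)]
    for _ in range(n_iters):
        q1 = np.exp(log_p @ q2)      # log q1(z1) = E_q2[log p(z1, z2)] + const
        q1 /= q1.sum()
        q2 = np.exp(q1 @ log_p)      # log q2(z2) = E_q1[log p(z1, z2)] + const
        q2 /= q2.sum()
        trace.append(kl(q1, q2, p))
    return q1, q2, trace


# Hypothetical target: a correlated joint over two binary variables.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
q1, q2, trace = mean_field(p)
```

Each update is the exact minimizer over one factor, so the recorded KL values never increase.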

DSTs for American Football
[Figure panels: initial frame of a typical play; trajectories of players]

DSTs for American Football
~20 time series of two types of plays (wham and digs)
Likelihood ratio of models used as classifier
DST1 puts all players into 1 game state
DST2 combines players into two teams and then into game
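
A minimal sketch of classifying by likelihood ratio, assuming one generative model per class. Simple i.i.d. Gaussian models stand in for the trained DSTs here, and the parameters are invented. Only the decision rule (compare the log-likelihoods of the two class models) reflects the slide.

```python
import numpy as np


def gaussian_loglik(series, mean, var):
    """Log-likelihood under an i.i.d. Gaussian stand-in model (not a DST)."""
    series = np.asarray(series, dtype=float)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - 0.5 * (series - mean) ** 2 / var))


def classify(series, model_a, model_b):
    """Return 'a' when the log-likelihood ratio favors model_a, else 'b'."""
    log_ratio = gaussian_loglik(series, *model_a) - gaussian_loglik(series, *model_b)
    return "a" if log_ratio > 0 else "b"


# Hypothetical per-class parameters (mean, variance), one model per play type.
model_a = (0.0, 1.0)
model_b = (3.0, 1.0)

label_near_a = classify([0.1, -0.2, 0.3], model_a, model_b)
label_near_b = classify([2.9, 3.2, 3.1], model_a, model_b)
```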

DSTs for Gene Networks
• Time series of cell cycle
• Hundreds of gene expression levels over time
• Use given hierarchical clustering
• DST with hierarchical clustering structure gives best test likelihood

Robotic Surgery, Haptics & Video
• da Vinci laparoscopic robot
• Used in hundreds of hospitals
• Surgeon works on console
• Robot mimics movement on (local) patient
• Captures all actuator/robot data as 300 Hz time series
• Multi-channel video of cameras inside patient


Robotic Surgery, Haptics & Video
64-dimensional time series @ 300 Hz
Console and actuator parameters
[Figure panels: Expert, Novice (suturing)]

Robotic Surgical Drills Results
• Compress haptic & video data with PCA to 60 dims
• Collected data from novices and experts and built several DBNs (IOHMMs, DSTs, etc.) of expert and novice for 3 different drills (6 models total)
• Preliminary results on three drills: Minefield, Russian Roulette, Suture

Conclusion
• Dynamic Bayesian networks are a natural upgrade to HMMs.
• Relevant for structured, multi-modal and multi-person temporal data.
• Several examples of dynamic Bayesian networks for:
• audio, video and haptic channels
• single, two-person and multi-person activity.
• DBNs: HMMs, Kalman Filters, hidden ARMA, input-output HMMs.
• Use max likelihood (EM) or max conditional likelihood (CEM).
• Intractable DBNs: switched Kalman filters, dynamical system trees.
• Use min free energy (GEM) and structured mean field.
• Examples of applications:
• gesture interaction (gesture games)
• audio-video interaction (social conversation)
• multi-person game playing (American football)
• haptic-video interaction (robotic laparoscopy).
• Funding provided in part by the National Science Foundation, the Central Intelligence Agency, Alphastar and Microsoft.