Bayesian Machine Learning for Signal Processing

Bayesian Machine Learning for Signal Processing Hagai T. Attias Golden Metallic, Inc. San Francisco, CA Tutorial 6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006

ICA / BSS is 15 Years Old • First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991 • First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997 • First conference on ICA/BSS: Helsinki, 2000 • Lesson drawn by many: ICA is a cool problem. Let’s find many approaches to it and many places where it’s useful. • Lesson drawn by some: statistical machine learning is a cool framework. Let’s use it to transform adaptive signal processing. ICA is a good start.

Background Microphone ANC TV Noise Cancellation

From Noise Cancellation to ICA Background Microphone ICA TV TV

Noise Cancellation: Derivation • y = sensor, x = sources, n = time point • y1(n) = x1(n) + w(n) * x2(n) y2(n) = x2(n) • Joint probability distribution of observed sensor data p(y) = px(x1 = y1 – w*y2, x2 = y2) • Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions v1, v2 • Observed data likelihood L = log p(y) = -0.5 v1 (y1 – w*y2)2 + const. • dL / dw = 0  linear equation for w

Noise Cancellation  ICA: Derivation • y = sensor, x = sources, n = time point • y(n) = A x(n) , A = square mixing matrix • x(n) = G y(n) , G = square unmixing matrix • Probability distribution of observed sensor data p(y) = |G| px(G y) • Assume the sources are i.i.d. non-Gaussians • Observed data likelihood L = log p(y) = log |G| + log p(x1) + log p(x2) • dL / dG = 0  non-linear equation for G

Sensor Noise and Hidden Variables • y = sensor, x = sources, u = noise, n = time point • y(n) = A x(n) + u(n) • x are now hidden variables: even if A is known, one cannot obtain x exactly from y • However, one can compute the posterior probability of x conditioned on y p(x|y) = p(y|x) p(x) / p(y) where p(y|x) = pu(y – A x) • To learn A from data, one must use an expectation maximization (EM) algorithm (and often approximate it)

Probabilistic Graphical Models • Model the distribution of observed data • Graph structure determines the probabilistic dependence between variables • We focus on DAGs = directed acyclic graphs • Node = variable Arrow = probabilistic dependence x x y p(y,x) = p(y|x) p(x) p(x)

Linear Classification • c = class label; discrete, multinomial • y = data; continuous, Gaussian • p(c) = πc , p(y|c) = N( y | μc, νc ) • Training set: pairs {y,c} Learn parameters by maximum likelihood L = log p(y,c) = log p(y|c) + log p(c) • Test set: {y}, classify using p(c|y) = p(y,c) / p(y) c p(y,c) = p(y|c) p(c) y

Linear Regression x • x = predictor; continuous, Gaussian • y = dependent; continuous, Gaussian • p(x) = N(x | μ, ν ) , p(y|x) = N( y | Ax, Λ ) • Training set: pairs {y,x} Learn parameters by maximum likelihood L = log p(y,x) = log p(y|x) + log p(x) • Test set: {x}, predict using p(y|x) p(y,x) = p(y|x) p(x) y

Clustering • c = class label; discrete, multinomial • y = data; continuous, Gaussian • p(c) = πc , p(y|c) = N( y | μc, νc ) • Training set: {y} p(y) is a mixture of Gaussians (MoG) Learn parameters by expectation maximization (EM) • Test set: {y}, cluster using p(c|y) = p(y,c) / p(y) • Limit of zero variance: vector quantization (VQ) c p(y,c) = p(y|c) p(c) But now c is hidden y

x y Factor Analysis • x = factors; continuous, Gaussian • y = data; continuous, Gaussian • p(x) = N( x | 0, I ) , p(y|x) = N( y | Ax, Λ ) • Training set: {y} p(y) is Gaussian with covariance AA’+Λ-1 Learn parameters by expectation maximization (EM) • Test set: {y}, obtain factors by p(x|y) = p(y,x) / p(x) • Limit of zero noise: principal component analysis (PCA) p(y,x) = p(y|x) p(x) But now x is hidden

x(3) x(1) x(2) x(2) x(3) s(1) s(2) s(3) y(1) y(2) y(3) y(1) y(2) y(3) y(3) y(2) s(1) s(2) s(3) x(1) y(1) Dynamical Models Hidden Markov Model (Baum-Welch) State Space Model (Kalman Smoothing) Switching State Space Model (Intractable)

Probabilistic Inference Factor analysis model p(x) = N(x|0,I) p(y|x,A,Λ) = N(y|Ax,Λ) A • Nodes inside frame: variables, vary in time • Nodes outside frame: parameters, constant in time • Parameters have prior distributions p(A), p(Λ) • Bayesian Inference: compute full posterior distribution p(x,A,Λ|y) over all hidden nodes conditioned on observed nodes • Bayes’ rule: p(x,A,Λ|y)=p(y|x,A,Λ)p(x)p(A)p(Λ)/p(y) • In hidden variable models, joint posterior can generally not be computed exactly. The normalization factor p(y) is instractable x Λ y

MAP and Maximum Likelihood • MAP = maximum aposteriori; consider only the parameter values the maximize the posterior p(x,A,Λ|y) • This is the maximum likelihood method: compute A,Λ that maximize L = log p(y|A,Λ) • However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient based techniques which are slow • Solution: the EM algorithm • Iterative algorithm, each iteration has an E-step and an M-step • E-step: compute posterior over hidden variables p(x|y) • M-step: maximize complete data likelihood E log p(y,x,A,Λ) w.r.t. the parameters A,Λ; E = posterior average over x

Derivation of the EM Algorithm Instead of the likelihood L = log p(y), consider F(q) = E log p(y,x) – E log q(x|y) where q(x|y) is a ‘trial’ posterior and E averaged over x w.r.t. q • Can show: F(q) = L – KL [ q(x|y) || p(x|y) ] <= L • Hence F is upper bounded by L, and F=L when q=true posterior EM performs an alternate maximization of F • The E-step maximizes F w.r.t. the posterior q • The M-step maximizes F w.r.t. the parameters A,Λ Hence EM performs maximum likelihood

MoG1 s1 s2 MoG2 x1 x2 y1 y2 A ICA by EM: MoG Sources • Each source distribution p(x) is a 1-dim mixture of Gaussians • The Gaussian labels s are hidden variables • The data y = A x, hence x = G y are not hidden • Likelihood: L = log |G| + log p(x) • F(q) = log |G| + E log p(x,s) • – E log q(s|y) • E-step: q(s|y) = p(x,s) / z • M-step: G  G + ε(I-Φ(x)x’)G (natural gradient) Φ(x) is linear in x and q • Can also learn the source parameters MoG1, MoG2 at the M-step

MoG1 s1 s2 MoG2 x1 x2 y1 y2 A,Λ Noisy, Non-Square ICA: Independent Factor Analysis • The Gaussian labels s are hidden variables • The data y = A x + u, hence x are also hidden • p(y|x) = N( y | Ax, Λ ) • Likelihood: L = log p(y) must marginalize over x,s • F(q) = E log p(y,x,s) • – E log q(x,s|y) • E-step: q(x,s|y) = q(x|s,y)q(s|y) • M-step: linear eqs for A, Λ • Can also learn the source parameters MoG1, MoG2 at the M-step • Convergence problem in low noise

x(2) x(3) y(2) y(3) s(1) s(2) s(3) x(1) y(1) Intractability of Inference • In many models of interest the E-step is computationally intractable • Switching state space model: posterior over discrete state p(s|y) is exponential in time • Independent factor analysis: posterior over Gaussian labels is exponential in number of sources • Approximations must be made • MAP approximation: consider only the most likely state configuration(s) • Markov Chain Monte Carlo: convergence may be quite slow and hard to determine

Variational EM • Idea: use an approximate posterior which has a factorized form • Example: switching state space model factorize the continuous states from the discrete states p(x,s|y) ~ q(x,s|y) = q(x|y) q(s|y) make no other assumptions (e.g., functional forms) • To derive, consider F(q) from the derivation of EM F(q) = E log p(y,x,s) - E log q(x|y) – E log q(s|y) E performs posterior averaging w.r.t. q • Maximize F alternately w.r.t. q(x|y) and q(s|y) q(x|y) = Es p(y,x,s) / zs q(s|y) = Ex p(y,x,s) / zx • This adds an internal loop in the E-step; M-step is unchanged • Convergence is guaranteed since F(q) is upper bounded by L

x(2) x(3) y(2) y(3) s(1) s(2) s(3) x(1) y(1) Switching Model: 2 Variational Approximations Model: Variational approximation I: s(1) s(2) s(3) x(1) x(2) x(3) Variational approximation II: s(1) s(2) s(3) I: Baum-Welch, Kalman; Gaussian, smoothed II: Baum-Welch, MoG; Multimodal, not smoothed x(1) x(2) x(3)

s1 s2 s1 s2 x1 x2 x1 x2 y1 y2 s1 s2 x1 x2 IFA: 2 Variational Approximations Model: Variational approximation I: Variational approximation II: I: Source posterior is Gaussian, correlated II: Source posterior is multimodal, independent

Model Order Selection • How does one determine the optimal number of factors in FA? • Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit • The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood L = log p(y) = E log p(y,A) – E log p(A|y) Compute L for each number of factors. Choose the number that maximizes L • An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A – its precision α. Optimize the precisions by maximizing L. Unnecessary columns will have α infinity • Both approaches require computing the parameter posterior p(A|y), which is usually intractable

Variational Bayesian EM • Idea: use an approximate posterior which factorizes the parameters from the hidden variables • Example: factor analysis p(x,A|y) ~ q(x,A|y) = q(x|y) q(A|y) make no other assumptions (e.g., functional forms) • To derive, consider F(q) from the derivation of EM F(q) = E log p(y,x,A) - E log q(x|y) – E log q(A|y) E performs posterior averaging w.r.t. q • Maximize F alternately w.r.t. q(x|y) and q(A|y) E-step: q(x|y) = EA p(y,x,A) / zA M-step: q(A|y) = Ex p(y,x,A) / zx • Plus, maximize F w.r.t. the noise precision Λ and hyperparameters α (MAP approximation)

s1 s2 x1 x2 y1 y2 A VB Approximation for IFA Model: VB approximation: s1 s2 x1 x2 A

Conjugate Priors • Which form should one choose for prior distributions? • Conjugate prior idea: Choose a prior such that the resulting posterior distribution would have the same functional form as the prior • Single Gaussian: posterior over mean is p(μ|y) = p(y|μ) p(μ) / p(y) conjugate prior is Gaussian • Single Gaussian: posterior over mean + precision is p(μ,ν|y) = p(y|μ,ν) p(μ,ν) conjugate prior is Normal-Wishart • Factor analysis: VB posterior over mixing matrix is q(A|y) = Ex p(y,x|A) p(A) / z conjugate prior is Gaussian

Separation of Convoluted Mixtures of Speech Sources • Blind separation methods use extremely simple models for source distributions • Speech signals have a rich structure. Models that capture aspects of it could result in improved separation, deconvolution, and noise robustness • One such model: work in the windowed FFT domain x(n,k) = G(k) y(n,k) where n=frame index, k=frequency Train a MoG model on the x(n,k) such that different components capture different speech spectra • Plus this model into IFA and use EM to obtain separation of convoluted mixtures

MoG1 s1 s2 MoG2 x1 x2 y1 y2 A,Λ Noise Suppression in Speech Signals • Existing methods based on, e.g., spectral subtraction and array processing, often produce unsatisfactory noise suppression • Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from noisy data (not just silent segments) (3) can work with one or more microphones • Use speech model in the windowed FFT domain • Λ(k) = noise precision per frequency (inverse spectrum)

Interference Suppression and Evoked Source Separation in MEG data • y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise • Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset pre-stimulus: y(n) = B u(n) + v(n) post-stimulus: y(n) = A x(n) + B u(n) + v(n) • SEIFA is an extension of IFA to this case: model x by MoG, model u by Gaussians N(0,I), model v by Gaussian N(0,Λ) • Use pre-stimulus to learn interference mixing matrix B and noise precision Λ; use post-stimulus to learn evoked mixing matrix A • Use VB-EM to infer from data the optimal number of interference factors u and of evoked factors x; also protect from overfitting • Cleaned data: y = A x ; Contribution of factor j: yi = Aij xj • Advantages over ICA: no need to discard information by dim reduction; can exploit stimulus onset information; superior noise suppression

MoG s x u y A,B Stimulus Evoked Independent Factor Analysis Pre-stimulus Post-stimulus u y B

Brain Source Localization using MEG • Problem: localize brain sources that respond to a stimulus • Response model is simple: y(n) = F s(n) + v(n) F = lead field (known), s = brain voxel activity • However, the number of voxels (~3000-1000) is much larger than the number of sensors (~100-300) • One approach: fit multiple dipole sources; cost is exponential in the number of sources • Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model y(n) = F’ z(n) + A x(n) + v(n) where F’ = lead field for that voxel, z = voxel activity, x = response from all other active voxels • Obtain a localization map by plotting <z(n)2> per voxel • Superior results to exising (beamforming based) methods; can handle correlated sources

MEG Localization Model Pre-stimulus Post-stimulus u z x u y y B F’ A,B

Conclusion • Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems • Process: (1) design a probabilistic model that corresponds to the problem (2) use machinery for exact and approximate inference to learn the model from data, including model order (3) extend the model, by e.g. incorporating rich signal models, to improve performance • Problems treated here: noise suppression, source separation, source localization Domains: Speech, audio, biomedical data • Domains outside this tutorial: image, video, text, coding, ….. • Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing

Bayesian Machine Learning for Signal Processing

Bayesian Machine Learning for Signal Processing

Presentation Transcript

Signal Processing

CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes

Machine Learning for Natural Language Processing

Signal Processing

Budgeted Machine Learning of Bayesian Networks

Bayesian Machine learning and its application

Signal Processing

Signal Processing

Machine Learning Chapter 6. Bayesian Learning

General Signal Processing and Machine Learning Tools for BCI Analysis (Part-2)

Information Theoretic Signal Processing and Machine Learning

Brain Machine Interfaces: Modeling Strategies for Neural Signal Processing

General Signal Processing and Machine Learning Tools for BCI Analysis (Part-1)

When Signal Processing Meets Machine Learning

Machine Learning for Signal Processing Fundamentals of Linear Algebra - 2

Machine Learning for Signal Processing Principal Component Analysis

Machine Learning for Signal Processing Sparse and Overcomplete Representations

Machine Learning for Signal Processing Linear Gaussian Models

Machine Learning for Signal Processing Eigenfaces and Eigen Representations