**CS b553: Algorithms for Optimization and Learning** Continuous Probability Distributions and Bayesian Networks with Continuous Variables

**Agenda** • Continuous probability distributions • Common families: • The Gaussian distribution • Linear Gaussian Bayesian networks

**Continuous probability distributions** • Let $X$ be a random variable in $\mathbb{R}$, and $P(X)$ a probability distribution over $X$ • $P(x) \ge 0$ for all $x$, and $P$ “sums to 1” • Challenge: (most of the time) $P(X=x) = 0$ for any $x$

**CDF and PDF** • Probability density function (pdf) $f(x)$ • Nonnegative, $\int_{-\infty}^{\infty} f(x)\,dx = 1$ • Cumulative distribution function (cdf) $g(x)$ • $g(x) = P(X \le x)$ • $g(-\infty) = 0$, $g(\infty) = 1$, $g(x) = \int_{-\infty}^{x} f(y)\,dy$, monotonically nondecreasing • $f(x) = g'(x)$ • Both cdfs and pdfs are complete representations of the probability space over $X$, but pdfs are usually more intuitive to work with. • [Figure: a pdf $f(x)$ and the corresponding cdf $g(x)$, which rises to 1]
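
The relation between the cdf and the pdf can be checked numerically. The sketch below uses the standard normal as an illustrative choice and compares a finite-difference slope of the cdf with the density; the function names are hypothetical.

```python
import math

def std_normal_pdf(x):
    """Density f(x) of the standard normal N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Cumulative distribution g(x) = P(X <= x) for N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_slope(x, eps=1e-5):
    """Central finite-difference estimate of g'(x)."""
    return (std_normal_cdf(x + eps) - std_normal_cdf(x - eps)) / (2.0 * eps)
```

The probability of the interval `[x - eps, x + eps]` shrinks to 0 with `eps`, which is why `P(X = x) = 0`, while the ratio to the interval width converges to the density.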

**Caveats** • pdfs may exceed 1 • Deterministic values, or ones taking on a few discrete values, can be represented in terms of the Dirac delta function $\delta_a(x)$, a pdf that is an improper function • $\delta_a(x) = 0$ if $x \ne a$ • $\delta_a(x) = \infty$ if $x = a$ • $\int \delta_a(x)\,dx = 1$

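**Common Distributions** • Uniform distribution $U(a,b)$ • $p(x) = 1/(b-a)$ if $x \in [a,b]$, 0 otherwise • $P(X \le x) = 0$ if $x < a$, $(x-a)/(b-a)$ if $x \in [a,b]$, 1 otherwise • [Figure: two uniform densities, $U(a_1,b_1)$ and $U(a_2,b_2)$] • Gaussian (normal) distribution $N(\mu,\sigma)$ • $\mu$ = mean, $\sigma$ = standard deviation • $P(X \le x)$ has no closed form: it equals $(1+\operatorname{erf}(x/\sqrt{2}))/2$ for $N(0,1)$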
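
A minimal sketch of these two families in plain Python, using only the standard `math` module. The function names are illustrative. It also shows a density exceeding 1: a uniform distribution on an interval of width 0.5 has density 2.

```python
import math

def uniform_pdf(x, a, b):
    """Density of U(a, b)."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """P(X <= x) for X ~ U(a, b)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```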

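**Multivariate Continuous Distributions** • Consider the c.d.f. $g(x,y) = P(X \le x, Y \le y)$ • $g(-\infty,y) = 0$, $g(x,-\infty) = 0$ • $g(\infty,\infty) = 1$ • $g(x,\infty) = P(X \le x)$, $g(\infty,y) = P(Y \le y)$ • $g$ is monotonic in each argument • Its joint density is given by the p.d.f. $f(x,y)$ iff • $g(p,q) = \int_{-\infty}^{p} \int_{-\infty}^{q} f(x,y)\,dy\,dx$ • i.e. $P(a_x \le X \le b_x, a_y \le Y \le b_y) = \int_{a_x}^{b_x} \int_{a_y}^{b_y} f(x,y)\,dy\,dx$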

**Marginalization works over PDFs** • Marginalizing $f(x,y)$ over $y$: • If $h(x) = \int_{-\infty}^{\infty} f(x,y)\,dy$, then $h(x)$ is a p.d.f. for $X$ • Proof: • $P(X \le a) = P(X \le a, Y \le \infty) = g(a,\infty) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f(x,y)\,dy\,dx$ • $h(a) = \frac{d}{da} P(X \le a)$ (definition) $= \frac{d}{da} \int_{-\infty}^{a} \int_{-\infty}^{\infty} f(x,y)\,dy\,dx = \int_{-\infty}^{\infty} f(a,y)\,dy$ (fundamental theorem of calculus) • So, the joint density contains all information needed to reconstruct the density of each individual variable
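
Marginalization can be illustrated numerically. The sketch below assumes a bivariate Gaussian with zero means, unit variances, and correlation `rho` (an illustrative choice), integrates out `y` with the trapezoidal rule, and compares the result with the standard normal density that is the known marginal in this case.

```python
import math

def bivariate_normal_pdf(x, y, rho):
    """Joint density of two zero-mean, unit-variance Gaussians with correlation rho."""
    norm = 2.0 * math.pi * math.sqrt(1.0 - rho * rho)
    quad = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm

def marginal_pdf_x(x, rho, lo=-10.0, hi=10.0, steps=4000):
    """Approximate h(x) = integral of f(x, y) dy with the trapezoidal rule."""
    dy = (hi - lo) / steps
    total = 0.5 * (bivariate_normal_pdf(x, lo, rho) + bivariate_normal_pdf(x, hi, rho))
    for i in range(1, steps):
        total += bivariate_normal_pdf(x, lo + i * dy, rho)
    return total * dy

def std_normal_pdf(x):
    """Density of N(0, 1), the exact marginal of X in this example."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
```

The finite integration range stands in for the whole real line, which is reasonable here because the Gaussian tails are negligible beyond it.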

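**Conditional densities** • We might want to represent the density $P(Y \mid X=x)$… but how? • Naively, $P(a \le Y \le b \mid X=x) = P(a \le Y \le b, X=x)/P(X=x)$, but the denominator is 0! • Given a joint pdf $p(x,y)$, consider the limit of $P(a \le Y \le b \mid x \le X \le x+\epsilon)$ as $\epsilon \to 0$ • So $p(y \mid x) = p(x,y)/p(x)$ is the conditional density, where $p(x) = \int p(x,y)\,dy$ is the marginal density of $X$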
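
The limit argument can be spelled out as follows. This is a sketch that assumes $p$ is continuous near $x$ and that the marginal $p(x) = \int p(x,y)\,dy$ is positive; the conditioning interval $[x, x+\epsilon]$ is one possible choice.

```latex
\begin{align*}
P(a \le Y \le b \mid x \le X \le x+\epsilon)
  &= \frac{\int_{x}^{x+\epsilon} \int_{a}^{b} p(u,y)\,dy\,du}
          {\int_{x}^{x+\epsilon} p(u)\,du} \\
  &\approx \frac{\epsilon \int_{a}^{b} p(x,y)\,dy}{\epsilon\, p(x)}
  \;\xrightarrow{\;\epsilon \to 0\;}\;
  \int_{a}^{b} \frac{p(x,y)}{p(x)}\,dy .
\end{align*}
```

Since this holds for every interval $[a,b]$, the integrand $p(x,y)/p(x)$ plays the role of a density over $y$.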

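**Transformations of continuous random variables** • Suppose we want to compute the distribution of $f(X)$, where $X$ is a random variable distributed w.r.t. $p_X(x)$ • Assume $f$ is monotonically increasing, differentiable, and invertible • Consider $Y=f(X)$ as a random variable • $P(Y \le y) = \int I[f(x) \le y]\, p_X(x)\,dx = P(X \le f^{-1}(y))$ • $p_Y(y) = \frac{d}{dy} P(Y \le y) = \frac{d}{dy} P(X \le f^{-1}(y))$ • $= p_X(f^{-1}(y))\, \frac{d}{dy} f^{-1}(y)$ by the chain rule • $= p_X(f^{-1}(y)) / f'(f^{-1}(y))$ by the inverse function derivative • For a decreasing $f$, the same argument gives $p_X(f^{-1}(y)) / |f'(f^{-1}(y))|$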
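
As a worked instance of the change-of-variables rule, the sketch below takes X as a standard normal and the increasing map exp, both illustrative choices, so that Y = exp(X). It builds the density of Y from the formula and provides a trapezoidal integrator to check it against P(X <= log y).

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def transformed_pdf(y):
    """Density of Y = exp(X) for X ~ N(0, 1): p_X(f^-1(y)) / f'(f^-1(y))."""
    if y <= 0.0:
        return 0.0
    x = math.log(y)                         # f^-1(y)
    return std_normal_pdf(x) / math.exp(x)  # f'(x) = exp(x)

def integrate(fn, lo, hi, steps=20000):
    """Trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.5 * (fn(lo) + fn(hi))
    for i in range(1, steps):
        total += fn(lo + i * dx)
    return total * dx
```

Integrating `transformed_pdf` from 0 to some `y0` should agree with `std_normal_cdf(math.log(y0))`, which is the statement P(Y <= y0) = P(X <= f^-1(y0)).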

**Notes:** • In general, continuous multivariate distributions are hard to handle exactly • But, there are specific classes that lead to efficient exact inference techniques • In particular, Gaussians • Other distributions usually require resorting to Monte Carlo approaches

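**Multivariate Gaussians** • $X \sim N(\mu,\Sigma)$ • Multivariate analog in $N$-D space • Mean (vector) $\mu$, covariance (matrix) $\Sigma$ • $P(x) = \frac{1}{Z} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$ • With a normalization factor $Z = \sqrt{|2\pi\Sigma|}$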

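**Independence in Gaussians** • If $X \sim N(\mu_X,\Sigma_X)$ and $Y \sim N(\mu_Y,\Sigma_Y)$ are independent, then $(X,Y) \sim N\left(\begin{bmatrix}\mu_X\\ \mu_Y\end{bmatrix}, \begin{bmatrix}\Sigma_X & 0\\ 0 & \Sigma_Y\end{bmatrix}\right)$ • Moreover, if $X \sim N(\mu,\Sigma)$, then $\Sigma_{ij}=0$ iff $X_i$ and $X_j$ are independent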

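**Linear Transformations** • Linear transformations of Gaussians • If $X \sim N(\mu,\Sigma)$ and $y = Ax + b$ • Then $Y \sim N(A\mu+b, A\Sigma A^T)$ • In fact, $X$ and $Y$ are jointly Gaussian: $(X,Y) \sim N\left(\begin{bmatrix}\mu\\ A\mu+b\end{bmatrix}, \begin{bmatrix}\Sigma & \Sigma A^T\\ A\Sigma & A\Sigma A^T\end{bmatrix}\right)$ • Consequence: • If $X \sim N(\mu_x,\Sigma_x)$ and $Y \sim N(\mu_y,\Sigma_y)$ are independent, and $Z=X+Y$ • Then $Z \sim N(\mu_x+\mu_y,\Sigma_x+\Sigma_y)$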
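
The transformation rule is easy to exercise on small matrices. The sketch below uses plain nested lists rather than a matrix library, and all helper names are illustrative. The sum of two independent Gaussians is obtained by applying the rule to the stacked vector with the matrix [I I], which assumes the two vectors are independent so that the stacked covariance is block diagonal.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def linear_transform_gaussian(mu, Sigma, A, b):
    """Mean and covariance of Y = A X + b when X has mean mu and covariance Sigma."""
    mean = [sum(a * m for a, m in zip(row, mu)) + bi for row, bi in zip(A, b)]
    cov = matmul(matmul(A, Sigma), transpose(A))
    return mean, cov

def sum_of_independent(mu_x, Sigma_x, mu_y, Sigma_y):
    """Mean and covariance of Z = X + Y for independent Gaussian vectors X and Y."""
    n = len(mu_x)
    mu = list(mu_x) + list(mu_y)
    # Block-diagonal covariance of the stacked vector (X, Y).
    Sigma = [list(row) + [0.0] * n for row in Sigma_x] + \
            [[0.0] * n + list(row) for row in Sigma_y]
    # A = [I I] adds the two halves of the stacked vector.
    A = [[1.0 if j in (i, i + n) else 0.0 for j in range(2 * n)] for i in range(n)]
    return linear_transform_gaussian(mu, Sigma, A, [0.0] * n)
```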

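**Marginalization and Conditioning** • If $(X,Y) \sim N\left(\begin{bmatrix}\mu_X\\ \mu_Y\end{bmatrix}, \begin{bmatrix}\Sigma_{XX} & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_{YY}\end{bmatrix}\right)$, then: • Marginalization • Summing out $Y$ gives $X \sim N(\mu_X, \Sigma_{XX})$ • Conditioning: • On observing $Y=y$, we have $X \sim N(\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y-\mu_Y),\ \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX})$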
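
For two scalar variables the conditioning rule reduces to a few arithmetic operations. The sketch below is restricted to that scalar case, an assumption made for brevity, and the function name is illustrative. The conditional mean moves toward the observation in proportion to the covariance, and the conditional variance never exceeds the prior variance.

```python
def condition_bivariate(mu_x, mu_y, s_xx, s_xy, s_yy, y):
    """Mean and variance of X given Y = y for jointly Gaussian scalars X and Y.

    s_xx and s_yy are variances, s_xy is the covariance.
    """
    gain = s_xy / s_yy                 # Sigma_XY * Sigma_YY^-1
    mean = mu_x + gain * (y - mu_y)    # shift toward the observation
    var = s_xx - gain * s_xy           # variance can only shrink
    return mean, var
```

With zero covariance the function returns the marginal of X unchanged, which matches the independence property of Gaussians.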

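**Linear Gaussian Models** • A conditional linear Gaussian model has: • $P(Y \mid X=x) = N(\mu_0 + Ax, \Sigma_0)$ • With parameters $\mu_0$, $A$, and $\Sigma_0$ • If $X \sim N(\mu_X,\Sigma_X)$, then the joint distribution over $(X,Y)$ is given by: $(X,Y) \sim N\left(\begin{bmatrix}\mu_X\\ \mu_0 + A\mu_X\end{bmatrix}, \begin{bmatrix}\Sigma_X & \Sigma_X A^T\\ A\Sigma_X & \Sigma_0 + A\Sigma_X A^T\end{bmatrix}\right)$ • (Recall the linear transformation rule: if $X \sim N(\mu,\Sigma)$ and $y=Ax+b$, then $Y \sim N(A\mu+b, A\Sigma A^T)$)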
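
A scalar sketch of how a Gaussian prior on X and a conditional linear Gaussian for Y combine into a joint Gaussian. The parameter names (`mu0`, `a`, `var0`) are illustrative, and variances are used in place of covariance matrices.

```python
def clg_joint(mu_x, var_x, mu0, a, var0):
    """Joint mean and covariance of (X, Y).

    Assumes X ~ N(mu_x, var_x) and Y | x ~ N(mu0 + a * x, var0), all scalars.
    """
    mean = [mu_x, mu0 + a * mu_x]
    cov = [[var_x, a * var_x],
           [a * var_x, var0 + a * a * var_x]]
    return mean, cov
```

As a consistency check, conditioning Y on X in this joint recovers the original model: the regression slope `cov[0][1] / cov[0][0]` equals `a`, and the residual variance `cov[1][1] - cov[0][1] ** 2 / cov[0][0]` equals `var0`.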

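**CLG Bayesian Networks** • If all variables in a Bayesian network have Gaussian or CLG conditional distributions, inference can be done efficiently! • Example network: $X_1 \to Y$, $X_2 \to Y$, $X_1 \to Z$, $Y \to Z$ • $P(X_1) = N(\mu_1,\sigma_1)$ • $P(X_2) = N(\mu_2,\sigma_2)$ • $P(Y \mid x_1,x_2) = N(a x_1 + b x_2, \sigma_y)$ • $P(Z \mid x_1,y) = N(c + d x_1 + e y, \sigma_z)$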
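
For this four-node example, the marginals of Y and Z can be written in closed form by substituting each linear model into the next. The sketch below assumes the second argument of each N(., .) in the example is a standard deviation, and that X1 and X2 are independent roots; the function name is illustrative.

```python
def clg_network_marginals(mu1, s1, mu2, s2, a, b, sy, c, d, e, sz):
    """Exact (mean, variance) of Y and of Z in the example network.

    s1, s2, sy, sz are treated as standard deviations (an assumption).
    """
    mean_y = a * mu1 + b * mu2
    var_y = a * a * s1 ** 2 + b * b * s2 ** 2 + sy ** 2
    # Z = c + (d + e*a) X1 + (e*b) X2 + e * noise_y + noise_z
    mean_z = c + d * mu1 + e * mean_y
    var_z = ((d + e * a) ** 2 * s1 ** 2 + (e * b) ** 2 * s2 ** 2
             + e * e * sy ** 2 + sz ** 2)
    return (mean_y, var_y), (mean_z, var_z)
```

Note that the variance of Z is not simply a sum over its parents' variances, because X1 influences Z both directly and through Y.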

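**Canonical Representation** • All factors in a CLG Bayes net can be represented as $C(x;K,h,g)$ with $C(x;K,h,g) = \exp\left(-\frac{1}{2} x^T K x + h^T x + g\right)$ • Ex: if $P(Y \mid x) = N(\mu_0+Ax,\Sigma_0)$ then $P(y \mid x) = \frac{1}{Z} \exp\left(-\frac{1}{2}(y-Ax-\mu_0)^T \Sigma_0^{-1} (y-Ax-\mu_0)\right) = \frac{1}{Z} \exp\left(-\frac{1}{2}(y,x)^T [I\ \ {-A}]^T \Sigma_0^{-1} [I\ \ {-A}](y,x) + \mu_0^T \Sigma_0^{-1} [I\ \ {-A}](y,x) - \frac{1}{2}\mu_0^T \Sigma_0^{-1} \mu_0\right)$, where $(y,x)$ denotes the stacked vector • This is of the form $C((y,x);K,h,g)$ with • $K = [I\ \ {-A}]^T \Sigma_0^{-1} [I\ \ {-A}]$ • $h = [I\ \ {-A}]^T \Sigma_0^{-1} \mu_0$ • $g = \log(1/Z) - \frac{1}{2}\mu_0^T \Sigma_0^{-1} \mu_0$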
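
The conversion can be checked in the scalar case, where the block row [I -A] becomes the pair (1, -a). The sketch below uses illustrative names, takes the variable order (y, x), and assumes the normalizer of a scalar Gaussian with variance `var0`, so that log(1/Z) = -0.5 * log(2 * pi * var0).

```python
import math

def clg_to_canonical(mu0, a, var0):
    """Canonical parameters (K, h, g) of P(y | x) = N(mu0 + a x, var0), scope (y, x)."""
    row = [1.0, -a]                                     # scalar version of [I  -A]
    K = [[ri * rj / var0 for rj in row] for ri in row]
    h = [ri * mu0 / var0 for ri in row]
    g = -0.5 * math.log(2.0 * math.pi * var0) - 0.5 * mu0 * mu0 / var0
    return K, h, g

def canonical_eval(v, K, h, g):
    """Evaluate exp(-1/2 v^T K v + h^T v + g) at the point v."""
    n = len(v)
    quad = sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))
    lin = sum(hi * vi for hi, vi in zip(h, v))
    return math.exp(-0.5 * quad + lin + g)

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
```

Evaluating the canonical form at a point `(y, x)` should reproduce the Gaussian density of `y` with mean `mu0 + a * x`.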

**Product Operations** • $C(x;K_1,h_1,g_1)\,C(x;K_2,h_2,g_2) = C(x;K,h,g)$ with • $K=K_1+K_2$ • $h=h_1+h_2$ • $g = g_1+g_2$ • If the scopes of the two factors differ, extend each $K$ with zero rows and columns, and each $h$ with zero entries, so that the rows and columns of both factors refer to the same variables
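
A minimal sketch of the product with scope extension. A factor is represented here as a tuple `(vars, K, h, g)`, where `vars` is a list of variable names; this representation and the helper names are assumptions made for illustration.

```python
def extend(factor, scope):
    """Re-express a canonical factor over a larger scope, padding with zeros."""
    vars_, K, h, g = factor
    idx = {v: i for i, v in enumerate(vars_)}
    K2 = [[K[idx[u]][idx[v]] if u in idx and v in idx else 0.0 for v in scope]
          for u in scope]
    h2 = [h[idx[u]] if u in idx else 0.0 for u in scope]
    return scope, K2, h2, g

def multiply(f1, f2):
    """Product of two canonical factors: add K, h, and g over the union scope."""
    scope = list(f1[0]) + [v for v in f2[0] if v not in f1[0]]
    _, K1, h1, g1 = extend(f1, scope)
    _, K2, h2, g2 = extend(f2, scope)
    K = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(K1, K2)]
    h = [p + q for p, q in zip(h1, h2)]
    return scope, K, h, g1 + g2
```

For example, multiplying a factor over `["x"]` by a factor over `["y", "x"]` yields a factor over `["x", "y"]`, with the second factor's rows and columns reordered to match.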

**Sum Operation** • $\int C((x,y);K,h,g)\,dy = C(x;K',h',g')$ with • $K' = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}$ • $h' = h_X - K_{XY}K_{YY}^{-1}h_Y$ • $g' = g + \frac{1}{2}\left(\log|2\pi K_{YY}^{-1}| + h_Y^T K_{YY}^{-1} h_Y\right)$ • Using these two operations we can implement inference algorithms developed for discrete Bayes nets: • Top-down inference, variable elimination (exact) • Belief propagation (approximate)
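
The marginalization formulas can be verified against brute-force integration when x and y are both scalars. The sketch below makes that scalar assumption, expects a symmetric K with a positive lower-right entry, and uses illustrative function names.

```python
import math

def marginalize_y(K, h, g):
    """Integrate y out of a canonical form over (x, y), with x and y scalar."""
    (kxx, kxy), (kyx, kyy) = K
    hx, hy = h
    K_new = kxx - kxy * kyx / kyy
    h_new = hx - kxy * hy / kyy
    g_new = g + 0.5 * (math.log(2.0 * math.pi / kyy) + hy * hy / kyy)
    return K_new, h_new, g_new

def canonical_2d(x, y, K, h, g):
    """Evaluate the two-variable canonical form at (x, y)."""
    quad = K[0][0] * x * x + (K[0][1] + K[1][0]) * x * y + K[1][1] * y * y
    return math.exp(-0.5 * quad + h[0] * x + h[1] * y + g)

def integrate_out_y_numeric(x, K, h, g, lo=-15.0, hi=15.0, steps=6000):
    """Trapezoidal approximation of the integral over y, for comparison."""
    dy = (hi - lo) / steps
    total = 0.5 * (canonical_2d(x, lo, K, h, g) + canonical_2d(x, hi, K, h, g))
    for i in range(1, steps):
        total += canonical_2d(x, lo + i * dy, K, h, g)
    return total * dy
```

The closed form is what makes variable elimination practical here: each elimination step costs a few matrix operations instead of a numerical integral.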

**Monte Carlo with Gaussians** • Assume a sample $X \sim N(0,1)$ is given as a primitive RandN() • To sample $X \sim N(\mu,\sigma^2)$, simply set $x \leftarrow \mu + \sigma\,$RandN() • How to generate a random multivariate Gaussian variable $N(\mu,\Sigma)$? • Take the Cholesky decomposition $\Sigma^{-1}=LL^T$; $L$ is invertible if $\Sigma$ is positive definite • Let $y = L^T(x-\mu)$ • $P(y) \propto \exp\left(-\frac{1}{2}(y_1^2 + \dots + y_N^2)\right)$ is isotropic, and each $y_i$ is independent • Sample each component of $y$ independently with RandN() • Set $x \leftarrow L^{-T}y+\mu$
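
A sketch of multivariate Gaussian sampling in plain Python. Note that it uses a common variant of the procedure above: it factors the covariance itself as L L^T and sets x = mean + L z with z standard normal, which avoids inverting the covariance and yields the same distribution. `random.gauss(0, 1)` stands in for the RandN() primitive, and the function names are illustrative.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_mvn(mu, S, rng=random):
    """Draw one sample x = mu + L z, where z has independent N(0, 1) components."""
    L = cholesky(S)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]
```

When many samples are needed from the same distribution, the factor `L` would normally be computed once and reused; it is recomputed per call here only to keep the sketch short.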

**Monte Carlo With Likelihood Weighting** • Monte Carlo with rejection has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used • Example network: $X \to Y$, with $P(X)=N(\mu_X,\Sigma_X)$, $P(Y \mid x)=N(Ax+\mu_Y,\Sigma_Y)$, and evidence $Y=y$ • Step 1: Sample $x \sim N(\mu_X,\Sigma_X)$ • Step 2: Weight the sample by $P(y \mid x)$
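
The two steps can be sketched for scalar X and Y. The code below estimates the posterior mean of X given the evidence as a weighted average of prior samples, and includes the closed-form Gaussian answer for comparison. All names are illustrative; `b` plays the role of the offset in the conditional mean, and `sd_x`, `sd_y` are standard deviations.

```python
import math
import random

def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_mean_lw(mu_x, sd_x, a, b, sd_y, y_obs, n=20000, rng=random):
    """Likelihood-weighted estimate of E[X | Y = y_obs].

    Assumes X ~ N(mu_x, sd_x^2) and Y | x ~ N(a * x + b, sd_y^2).
    """
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(mu_x, sd_x)                 # Step 1: sample from the prior
        w = normal_pdf(y_obs, a * x + b, sd_y)    # Step 2: weight by P(y | x)
        num += w * x
        den += w
    return num / den

def posterior_mean_exact(mu_x, sd_x, a, b, sd_y, y_obs):
    """Closed-form E[X | Y = y_obs] from Gaussian conditioning."""
    var_x = sd_x ** 2
    var_y = a * a * var_x + sd_y ** 2
    return mu_x + (a * var_x / var_y) * (y_obs - (a * mu_x + b))
```

The estimate degrades when the evidence is far from what the prior predicts, because most samples then receive negligible weight.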

**Hybrid Networks** • Hybrid networks combine both discrete and continuous variables • Exact inference techniques are hard to apply • They result in Gaussian mixtures • NP-hard even in polytree networks • Monte Carlo techniques apply in a straightforward way • Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians)
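
One common way to collapse a Gaussian mixture to a single Gaussian is moment matching, which keeps the mixture's overall mean and variance. The sketch below assumes that approach, which the slide does not spell out, and handles one-dimensional mixtures only; the function name is illustrative.

```python
def collapse_mixture(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture with a single Gaussian.

    Returns the mean and variance of the mixture as a whole.
    """
    total = sum(weights)
    ws = [w / total for w in weights]           # normalize the weights
    mean = sum(w * m for w, m in zip(ws, means))
    # Within-component variance plus spread of the component means.
    var = sum(w * (v + (m - mean) ** 2) for w, m, v in zip(ws, means, variances))
    return mean, var
```

The collapsed Gaussian is wider than any single component whenever the component means differ, and it loses any multimodality in the mixture.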

**Issues** • Non-Gaussian distributions • Nonlinear dependencies • More in future lectures on particle filtering