
Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs


Presentation Transcript


  1. Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs Affiliation: Kyoto University Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin Date: Nov 04, 2011

  2. Terminologies (for understanding distributions)

  3. Terminologies • Schur complement: relates the blocks of a partitioned matrix to the blocks of its inverse. • Completing the square: converting a quadratic of the form ax^2 + bx + c to a(x + …)^2 + const, used to match quadratic terms against a standard Gaussian exponent and identify unknown parameters, or to solve the quadratic. • Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as an expectation, i.e. E[N(x)] = M(x).
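A minimal sketch of the Robbins-Monro idea, with a made-up regression function M(x) = x - 2 observed through Gaussian noise and 1/n step sizes (all numbers are illustrative):

```python
import numpy as np

# Robbins-Monro sketch: find the root of M(x) = E[N(x)] from noisy observations N(x).
# Here M(x) = x - 2 with Gaussian noise; the function, noise level and steps are made up.
rng = np.random.default_rng(7)

def noisy_observation(x):
    return (x - 2.0) + rng.normal(scale=0.5)   # N(x), with E[N(x)] = M(x) = x - 2

x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n                    # step sizes: sum a_n diverges, sum a_n^2 converges
    x -= a_n * noisy_observation(x)  # step against the observed value

print(x)   # converges toward the root x = 2
```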

  4. Terminologies (cont.) • [Stochastic approximation, Wikipedia, 2011] • Conditions on the Robbins-Monro step sizes a_N: lim a_N = 0, the sum of a_N diverges, and the sum of a_N^2 converges. • Trace Tr(W) is the sum of the diagonal elements. • Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.

  5. Distributions (Gaussian distributions and motives)

  6. Conditional Gaussian Distribution • Partition the Gaussian variable x into x_a and x_b (the slide writes y = x_a, x = x_b). • Derivation of the conditional mean and covariance, noting the Schur complement. • Linear-Gaussian model: observations are a weighted sum of underlying latent variables; the conditional mean is linear in x_b and the conditional covariance is independent of x_b.
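A minimal numpy sketch of conditioning (and marginalizing) a partitioned Gaussian; the partition and the numbers below are illustrative, not from the slides:

```python
import numpy as np

# Joint Gaussian over x = (x_a, x_b); example numbers are illustrative only.
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
a, b = [0], [1, 2]              # partition: x_a = x[0], x_b = x[1:3]

mu_a, mu_b = mu[a], mu[b]
S_aa = Sigma[np.ix_(a, a)]
S_ab = Sigma[np.ix_(a, b)]
S_bb = Sigma[np.ix_(b, b)]

x_b = np.array([0.5, -0.5])     # observed value of x_b

# Conditional p(x_a | x_b): mean is linear in x_b, covariance does not depend on x_b.
K = S_ab @ np.linalg.inv(S_bb)          # "regression" matrix Sigma_ab Sigma_bb^{-1}
mu_cond = mu_a + K @ (x_b - mu_b)
Sigma_cond = S_aa - K @ S_ab.T          # Schur complement of Sigma_bb

# Marginal p(x_b) is simply N(mu_b, S_bb): just read off the sub-blocks.
print(mu_cond, Sigma_cond)
```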

  7. Marginal Gaussian Distribution • The goal is again to identify the mean and covariance by 'completing the square'. • Solve the marginalization integral, noting the Schur complement, and compare components with the standard Gaussian form.

  8. Bayesian relationship with Gaussian distr. (quick view) • Consider a multivariate Gaussian over (x, y). • By Bayes' equation p(y|x) = p(x, y) / p(x), the conditional Gaussian must have an exponent equal to the difference between the exponents of p(x, y) and p(x). • Completing the square on that difference then gives the conditional mean and covariance.

  9. Bayesian relationship with Gaussian distr. • Starting from p(x) and p(y|x): p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior. • Mean and variance of the joint Gaussian distribution p(x, y). • Mean and variance of p(x|y).
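A sketch of these relationships for a linear-Gaussian model, using the standard prior/marginal/posterior formulas; the matrices and the observed y below are illustrative:

```python
import numpy as np

# Linear-Gaussian model: prior p(x) = N(x | mu, inv(Lam)), likelihood p(y|x) = N(y | A x + b, inv(L)).
# All matrices, vectors and the observed y are illustrative.
mu  = np.array([0.0, 1.0])
Lam = np.array([[1.0, 0.2], [0.2, 2.0]])                 # prior precision
A   = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
b   = np.array([0.1, 0.0, -0.1])
L   = 4.0 * np.eye(3)                                    # observation (noise) precision
y   = np.array([0.2, 1.1, 0.9])                          # observed value

# Marginal p(y) = N(y | A mu + b, inv(L) + A inv(Lam) A^T)
m_y = A @ mu + b
S_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T

# Posterior p(x|y) = N(x | S (A^T L (y - b) + Lam mu), S),  with S = inv(Lam + A^T L A)
S_post  = np.linalg.inv(Lam + A.T @ L @ A)
mu_post = S_post @ (A.T @ L @ (y - b) + Lam @ mu)

print(m_y, mu_post)
```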

  10. Bayesian relationship with Gaussian distr., sequential est. • Estimate the mean from (N-1)+1 observations: mu_ML^(N) = mu_ML^(N-1) + (1/N)(x_N - mu_ML^(N-1)). • The Robbins-Monro algorithm has the same form, so the maximum likelihood mean can be found sequentially. • Solve for the mean by Robbins-Monro.
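A minimal sketch of this sequential update on illustrative synthetic data; the online estimate matches the batch sample mean:

```python
import numpy as np

# Sequential (online) ML estimate of a Gaussian mean:
# mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1}).  The synthetic data are illustrative;
# the 1/N step sizes satisfy the Robbins-Monro conditions.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

mu = 0.0
for n, x in enumerate(data, start=1):
    mu += (x - mu) / n

print(mu, data.mean())   # the online and batch estimates coincide (up to rounding)
```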

  11. Bayesian relationship with Univariate Gaussian distr. • The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution. • The conjugate prior for the mean and precision jointly is the Gaussian-gamma distribution.
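A minimal sketch of the conjugate gamma update for the precision when the mean is known; the prior hyperparameters and data are illustrative:

```python
import numpy as np

# Conjugate update for the precision of a univariate Gaussian with known mean mu:
# prior Gamma(lambda | a0, b0)  ->  posterior Gamma(lambda | a0 + N/2, b0 + 0.5*sum((x-mu)^2)).
# The prior hyperparameters and the synthetic data below are illustrative.
rng = np.random.default_rng(1)
mu_known = 0.0
x = rng.normal(loc=mu_known, scale=2.0, size=500)   # true precision = 1/4

a0, b0 = 1.0, 1.0
aN = a0 + x.size / 2.0
bN = b0 + 0.5 * np.sum((x - mu_known) ** 2)

print(aN / bN)   # posterior mean of the precision, close to 0.25
```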

  12. Bayesian relationship with Multivariate Gaussian distr. • The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution. • The conjugate prior for the mean and precision jointly is the Gaussian-Wishart distribution.

  13. Distributions (variations on the Gaussian)

  14. Student's t-distr • Used in analysis of variance to test whether an effect is real and statistically significant, via the t-distribution with n-1 degrees of freedom. • If the X_i are normal random variables, the resulting standardized statistic follows a t-distribution. • The t-distribution has a lower peak and longer tails than the Gaussian (it tolerates more outliers and is therefore more robust). • It is obtained by summing (integrating over) an infinite number of univariate Gaussians with the same mean but different precisions.
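A sketch of that infinite mixture: draw a gamma-distributed precision, then a Gaussian given that precision, and compare with the exact t-distribution (the degrees of freedom, sample size and quantiles are illustrative; numpy and scipy are assumed available):

```python
import numpy as np
from scipy import stats

# Student's t as an infinite mixture of Gaussians: draw a precision tau ~ Gamma(nu/2, rate=nu/2),
# then x ~ N(mu, 1/tau).  Degrees of freedom nu, mean mu and sample size are illustrative.
rng = np.random.default_rng(2)
nu, mu, n = 3.0, 0.0, 100_000

tau = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)   # numpy uses scale = 1/rate
x = mu + rng.normal(scale=1.0 / np.sqrt(tau))             # one Gaussian draw per precision

# The mixture should match t_nu: compare a few empirical and exact quantiles.
qs = [0.05, 0.5, 0.95]
print(np.quantile(x, qs))
print(stats.t.ppf(qs, df=nu))
```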

  15. Student's t-distr (cont.) • For a multivariate Gaussian there is a corresponding multivariate t-distribution. • It is written in terms of the Mahalanobis distance. • Its mean and variance follow from the parameters.

  16. Gaussian with periodic variables • To avoid the mean being dependent on the choice of origin, use polar coordinates. • Solve for theta. • The von Mises distribution is a special case of the von Mises-Fisher distribution on the N-dimensional sphere: it is the stationary distribution of a drift process on the circle.

  17. Gaussian with periodic variables (cont.) • Transform the Gaussian from Cartesian to polar coordinates. • The result is the von Mises distribution p(theta | theta_0, m) = exp{m cos(theta - theta_0)} / (2*pi*I_0(m)). • Mean theta_0. • Precision (concentration) m.

  18. Gaussian with periodic variables: mean and variance • Solving the log likelihood gives the mean theta_0^ML = atan2(sum_n sin theta_n, sum_n cos theta_n). • The precision (concentration) 'm' is found by noting A(m) = I_1(m)/I_0(m) and solving A(m_ML) = (1/N) sum_n cos(theta_n - theta_0^ML).
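A minimal sketch of that maximum-likelihood fit, solving for m numerically; the synthetic sample is illustrative (scipy assumed available):

```python
import numpy as np
from scipy import special, optimize

# Maximum-likelihood fit of the von Mises parameters from angles theta_n (radians):
# theta0_ML = atan2(sum sin, sum cos);  m_ML solves I1(m)/I0(m) = mean cos(theta - theta0_ML).
# The synthetic sample (true mean 1.0, true concentration 4.0) is illustrative.
rng = np.random.default_rng(3)
theta = rng.vonmises(mu=1.0, kappa=4.0, size=5000)

theta0_ml = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

r_bar = np.mean(np.cos(theta - theta0_ml))
A = lambda m: special.i1e(m) / special.i0e(m)             # I1(m)/I0(m), overflow-safe
m_ml = optimize.brentq(lambda m: A(m) - r_bar, 1e-6, 1e6)

print(theta0_ml, m_ml)   # should be close to 1.0 and 4.0
```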

  19. Mixture of Gaussians • In Part 1 we already saw one limitation of the Gaussian: it is unimodal. • Solution: a linear combination (superposition) of Gaussians. • The mixing coefficients sum to 1. • The posterior probability of each component given a data point is known as its 'responsibility'. • Log likelihood: ln p(X) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k); the sum inside the logarithm prevents a closed-form maximum likelihood solution.
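A minimal sketch computing responsibilities and the log likelihood for a one-dimensional two-component mixture; all parameters and data points are illustrative:

```python
import numpy as np
from scipy import stats

# Responsibilities and log likelihood for a 1-D Gaussian mixture; all parameters
# and data points are illustrative.
x      = np.array([-2.1, -1.9, 0.2, 1.8, 2.2])
pis    = np.array([0.5, 0.5])            # mixing coefficients (sum to 1)
mus    = np.array([-2.0, 2.0])
sigmas = np.array([0.5, 0.5])

# joint[n, k] = pi_k * N(x_n | mu_k, sigma_k^2)
joint = pis * stats.norm.pdf(x[:, None], loc=mus, scale=sigmas)

# responsibilities: gamma(z_nk) = p(component k | x_n)
resp = joint / joint.sum(axis=1, keepdims=True)

# log likelihood: sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2)
log_lik = np.log(joint.sum(axis=1)).sum()
print(resp)
print(log_lik)
```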

  20. Exponential family • Natural form p(x|eta) = h(x) g(eta) exp{eta^T u(x)}, where g(eta) normalizes the distribution. • 1) Bernoulli: the natural parameter is eta = ln(mu / (1 - mu)), and mu is recovered by the logistic sigmoid, mu = sigma(eta). • 2) Multinomial: the natural parameters are eta_k = ln mu_k, and the normalized form uses the softmax function.
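A small sketch of the Bernoulli case, checking that the natural-parameter form sigma(-eta) exp(eta x) reproduces the usual density; the value of mu is illustrative:

```python
import numpy as np

# Bernoulli in exponential-family (natural) form: eta = log(mu / (1 - mu)),
# p(x | eta) = sigmoid(-eta) * exp(eta * x).  The value of mu is illustrative.
def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1.0 - mu))       # natural parameter
print(eta, sigmoid(eta))            # sigmoid(eta) recovers mu = 0.3

for x in (0, 1):
    p_natural = sigmoid(-eta) * np.exp(eta * x)
    p_direct  = mu**x * (1 - mu)**(1 - x)
    print(x, p_natural, p_direct)   # the two forms agree
```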

  21. Exponential family (cont.) • 3) Univariate Gaussian: rewrite it in natural form and solve for the natural parameters eta = (mu/sigma^2, -1/(2*sigma^2)), with sufficient statistics u(x) = (x, x^2). • From maximum likelihood, the expected sufficient statistics are matched to their sample averages.

  22. Parameters of Distributions (and interesting methodologies)

  23. Uninformative priors • 'Subjective Bayesian' concern: avoid incorrect assumptions by using an uninformative prior (e.g. a uniform distribution). • Improper prior: the prior need not integrate to 1 for the posterior to integrate to 1, as per Bayes' equation. • 1) A location parameter calls for a prior with translation invariance. • 2) A scale parameter calls for a prior with scale invariance.

  24. Nonparametric methods • Instead of assuming a form for the distribution, use nonparametric methods. • 1) Histogram with constant bin width: good for sequential data; problems are discontinuities at bin edges and an exponential increase in bins with dimensionality. • 2) Kernel estimators: a sum of Parzen windows. If K of the N observations fall in a region R of volume V, the density estimate becomes p(x) ≈ K / (N V).
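A minimal sketch of the histogram estimator with constant bin width; the data and number of bins are illustrative:

```python
import numpy as np

# Histogram density estimate with constant bin width Delta: p_i = n_i / (N * Delta).
# The synthetic data and the number of bins are illustrative.
rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

counts, edges = np.histogram(data, bins=40)     # 40 equal-width bins over the data range
delta = edges[1] - edges[0]                     # constant bin width Delta
density = counts / (data.size * delta)          # p_i = n_i / (N * Delta)

print(density.sum() * delta)                    # integrates to 1 by construction
```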

  25. Nonparametric method: Kernel estimators • 2) Kernel estimators: fix V and determine K from the data. • The kernel function k(u) indicates which points fall in the region R. • h > 0 is a fixed bandwidth parameter that controls smoothing. • This gives the Parzen estimator; k(u) can be chosen freely (e.g. a Gaussian).
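A minimal sketch of the Parzen estimator with a Gaussian kernel; the data and the bandwidth h are illustrative:

```python
import numpy as np

# Parzen (kernel density) estimator with a Gaussian kernel in 1-D:
# p(x) = (1/N) * sum_n N(x | x_n, h^2).  Data and bandwidth h are illustrative.
rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

def parzen_gaussian(x, samples, h):
    u = (x[:, None] - samples[None, :]) / h                    # (x - x_n) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi * h**2)  # N(x | x_n, h^2)
    return kernels.mean(axis=1)

grid = np.linspace(-5.0, 5.0, 201)
p = parzen_gaussian(grid, data, h=0.3)
print(p.sum() * (grid[1] - grid[0]))   # approximately 1
```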

  26. Nonparametric method: Nearest-neighbor • 3) Nearest neighbor: this time fix K and use the data to grow V. • Class priors: p(C_k) = N_k / N. • As with the kernel estimator, the training set is stored as the knowledge base. • 'k' is the number of neighbors; a larger 'k' gives a smoother, less complex boundary with fewer regions. • For classifying N points, of which N_k belong to class C_k, the Bayesian approach maximizes the posterior p(C_k | x) = K_k / K.

  27. Nonparametric method: Nearest-neighbor (cont.) • 3) Nearest neighbor: assign a new point to class C_k by majority vote of its k nearest neighbors. • For k = 1 and N -> ∞, the error rate is bounded above by twice the Bayes error rate [k-nearest neighbor algorithm, Wikipedia, 2011].
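A minimal sketch of a k-nearest-neighbor classifier with majority voting; the two-class synthetic data and the choice of k are illustrative:

```python
import numpy as np

# A minimal k-nearest-neighbor classifier (majority vote); data are illustrative.
def knn_predict(X_train, y_train, X_test, k=3):
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distances
        nearest = np.argsort(dists)[:k]                    # indices of the k nearest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])            # majority vote
    return np.array(preds)

rng = np.random.default_rng(6)
X0 = rng.normal([-1, -1], 0.7, size=(50, 2))   # class 0 cluster
X1 = rng.normal([+1, +1], 0.7, size=(50, 2))   # class 1 cluster
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

X_test = np.array([[-1.0, -0.8], [1.2, 0.9], [0.0, 0.0]])
print(knn_predict(X_train, y_train, X_test, k=5))
```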

  28. Basic Graph Concepts (from David Barber's book, Ch. 2)

  29. Directed and undirected graphs • A graph G has vertices and edges that are directed or undirected. • In a directed graph, if A -> B but not B -> A, then A is a parent (and more generally an ancestor) of B, and B is a child of A. • Directed Acyclic Graph (DAG): a directed graph with no cycles (no path revisits a vertex). • Connected undirected graph: there is a path between every pair of vertices. • Clique: a fully connected subset of vertices of an undirected graph (every pair in the subset is joined by an edge).

  30. Representations of Graphs • Singly connected graph (tree): only one path from A to B. • Spanning tree of an undirected graph: a singly connected subgraph covering all vertices. • Numerical graph representations: • Edge list: e.g. a list of vertex pairs. • Adjacency matrix A: for N vertices, an N x N matrix with Aij = 1 if there is an edge from i to j; for an undirected graph this matrix is symmetric.
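A minimal sketch building an adjacency matrix from an edge list; the 4-vertex graph is illustrative, chosen to match the two-clique example on the next slide:

```python
import numpy as np

# Adjacency matrix from an edge list.  The 4-vertex graph (edges 1-2, 1-3, 2-3, 2-4, 3-4)
# is illustrative, with the two maximal cliques {1,2,3} and {2,3,4}.
N = 4
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]   # 1-based vertex labels

A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i - 1, j - 1] = 1
    A[j - 1, i - 1] = 1        # undirected graph: symmetric matrix

print(A)
```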

  31. Representations of Graphs (cont.) • Directed graph: if the vertices are labeled in ancestral order (parents before children), then the adjacency matrix is strictly upper triangular, provided there are no edges from a vertex to itself. • An undirected graph with K maximal cliques can be represented by an N x K clique matrix, where each column c_k expresses which vertices form clique k. • Example: 2 cliques, vertices {1,2,3} and {2,3,4}.
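A minimal sketch of the clique matrix for that two-clique example:

```python
import numpy as np

# N x K clique matrix for the two maximal cliques {1,2,3} and {2,3,4}:
# column k marks the vertices belonging to clique k.
N = 4
cliques = [{1, 2, 3}, {2, 3, 4}]

C = np.zeros((N, len(cliques)), dtype=int)
for k, clique in enumerate(cliques):
    for v in clique:
        C[v - 1, k] = 1

print(C)
# [[1 0]
#  [1 1]
#  [1 1]
#  [0 1]]
```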

  32. Incidence Matrix • Adjacency matrix A and incidence matrix Z_inc. • Maximal clique incidence (clique) matrix Z. • Property: off the diagonal, Z_inc Z_inc^T coincides with the adjacency matrix A (the diagonal entries give each vertex's degree). • Note: the columns of Z_inc denote edges and its rows denote vertices.
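A minimal sketch of the incidence matrix for the same illustrative 4-vertex graph, checking that property numerically:

```python
import numpy as np

# Incidence matrix Z_inc for the illustrative 4-vertex graph: rows are vertices,
# columns are edges, with a 1 marking each endpoint of an edge.
N = 4
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

Z_inc = np.zeros((N, len(edges)), dtype=int)
for e, (i, j) in enumerate(edges):
    Z_inc[i - 1, e] = 1
    Z_inc[j - 1, e] = 1

# Off the diagonal, Z_inc @ Z_inc.T equals the adjacency matrix; the diagonal holds degrees.
M = Z_inc @ Z_inc.T
A = M - np.diag(np.diag(M))    # strip the degree diagonal
print(np.diag(M))              # vertex degrees
print(A)                       # matches the adjacency matrix from the earlier sketch
```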

  33. Additional Information • Graphs and equations excerpted from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127. • Graphs and equations excerpted from [Bayesian Reasoning and Machine Learning, David Barber], pages 19-23. • Slides uploaded to the Google group. Use with reference.
