
An introduction to machine learning and probabilistic graphical models

Kevin Murphy

MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003

Overview
  • Supervised learning
  • Unsupervised learning
  • Graphical models
  • Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Supervised learning

Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs.

(Figure: example objects labeled “yes” / “no”.)

Supervised learning

(Figure: training data → learner → hypothesis; the hypothesis is then applied to testing data to make predictions.)
Key issue: generalization

(Figure: labeled “yes” / “no” examples plus new, unlabeled “?” cases.)

Can’t just memorize the training set (overfitting).

Hypothesis spaces
  • Decision trees
  • Neural networks
  • K-nearest neighbors
  • Naïve Bayes classifier
  • Support vector machines (SVMs)
  • Boosted decision stumps
Kernel trick

(Figure: 2D points (x1, x2) that are not linearly separable, mapped to 3D features (z1, z2, z3) where they are.)

The kernel implicitly maps from 2D to 3D, making the problem linearly separable.
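To make this concrete, here is a minimal sketch (my own illustration, not code from the talk) of the degree-2 polynomial kernel: an explicit feature map phi lifts the 2D points into 3D, and the kernel computes the same feature-space inner product without ever forming phi.

```python
# Minimal sketch of the kernel trick (illustrative, not from the talk):
# the quadratic feature map phi lifts 2D points into 3D, and the polynomial
# kernel k(x, y) = (x . y)^2 equals <phi(x), phi(y)> without computing phi.
import numpy as np

def phi(x):
    """Explicit 2D -> 3D quadratic feature map."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel on the original 2D points."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))   # inner product in 3D feature space
print(poly_kernel(x, y))        # same number, computed in the 2D input space
```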

Support Vector Machines (SVMs)
  • Two key ideas:
    • Large margins
    • Kernel trick
Boosting

Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations.

Boosting maximizes the margin
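As an illustration of such weighted combinations (a hedged sketch under my own assumptions, not the talk’s code), here is AdaBoost with decision “stumps”, using scikit-learn for the stumps:

```python
# Hedged sketch of boosting decision stumps (AdaBoost); illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must be in {-1, +1}. Returns the stumps and their weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```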

Supervised learning success stories
  • Face detection
  • Steering an autonomous car across the US
  • Detecting credit card fraud
  • Medical diagnosis
Unsupervised learning
  • What if there are no output labels?
K-means clustering
  • Guess the number of clusters, K
  • Guess initial cluster centers, μ1, μ2, …
  • Assign data points xi to the nearest cluster center
  • Re-compute cluster centers based on the assignments
  • Iterate the assignment and re-computation steps
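A minimal sketch of these steps (illustrative, not code from the talk):

```python
# Minimal K-means sketch; mirrors the steps listed above.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]     # initial guesses
    for _ in range(n_iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers from the assignments
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```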
AutoClass (Cheeseman et al, 1986)
  • EM algorithm for mixtures of Gaussians
  • “Soft” version of K-means
  • Uses Bayesian criterion to select K
  • Discovered new types of stars from spectral data
  • Discovered new classes of proteins and introns from DNA/protein sequence databases
Principal Component Analysis (PCA)

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.

PCA seeks a projection that best represents the data in a least-squares sense.
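A minimal sketch of PCA via the singular value decomposition (illustrative, not from the talk):

```python
# Minimal PCA sketch via the SVD.
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal directions."""
    Xc = X - X.mean(axis=0)                 # center the cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # directions of greatest scatter
    return Xc @ components.T, components    # low-dimensional coordinates
```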

Discovering rules (data mining)

Find the most frequent patterns (association rules)

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
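As a toy illustration (records and field names made up for this sketch, not taken from the talk), the support and confidence of one candidate rule can be computed by simple counting:

```python
# Hedged sketch: support and confidence of one association rule on toy records.
records = [
    {"household": 1, "children": 0, "language": "English"},
    {"household": 1, "children": 0, "language": "English"},
    {"household": 2, "children": 1, "language": "Spanish"},
    {"household": 1, "children": 0, "language": "French"},
]

def rule_stats(records, antecedent, consequent):
    """antecedent / consequent are dicts of field -> required value."""
    match_a = [r for r in records if all(r[k] == v for k, v in antecedent.items())]
    match_both = [r for r in match_a if all(r[k] == v for k, v in consequent.items())]
    support = len(match_both) / len(records)
    confidence = len(match_both) / len(match_a) if match_a else 0.0
    return support, confidence

print(rule_stats(records,
                 {"household": 1, "children": 0},
                 {"language": "English"}))   # (0.5, 0.666...)
```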

Unsupervised learning: summary
  • Clustering
  • Hierarchical clustering
  • Linear dimensionality reduction (PCA)
  • Non-linear dimensionality reduction
  • Learning rules
Discovering networks

(Figure: a network whose structure is unknown, marked “?”.)

From data visualization to causal discovery.

Networks in biology
  • Most processes in the cell are controlled by networks of interacting molecules:
    • Metabolic Network
    • Signal Transduction Networks
    • Regulatory Networks
  • Networks can be modeled at multiple levels of detail/ realism
    • Molecular level
    • Concentration level
    • Qualitative level

(The three levels are listed in order of decreasing detail.)

Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

  • 5 genes, 67 parameters based on 50 years of research
  • Stochastic simulation required supercomputer
Concentration level: metabolic pathways

(Figure: a small gene network g1 … g5 with weighted interactions such as w12, w23, w55.)

  • Usually modeled with differential equations
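As a toy illustration of this kind of model (my own sketch, with a made-up weight matrix, not from the talk), a linear ODE system dg_i/dt = Σ_j w_ij g_j for the concentrations can be simulated numerically:

```python
# Toy sketch (made-up numbers): modeling concentrations with ODEs,
# d g_i / dt = sum_j w_ij * g_j, solved numerically.
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[-0.5,  0.2,  0.0],     # illustrative interaction weights w_ij
              [ 0.1, -0.3,  0.4],
              [ 0.0,  0.2, -0.6]])
g0 = np.array([1.0, 0.5, 0.2])        # initial concentrations

sol = solve_ivp(lambda t, g: W @ g, t_span=(0.0, 10.0), y0=g0)
print(sol.y[:, -1])                   # concentrations at t = 10
```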
Probabilistic graphical models
  • Supports graph-based modeling at various levels of detail
  • Models can be learned from noisy, partial data
  • Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
  • But can also model deterministic, causal processes.

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities."

-- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to

calculation." -- Pierre Simon Laplace

Graphical models: outline
  • What are graphical models?
  • Inference
  • Structure learning
Simple probabilistic model: linear regression

Deterministic (functional) relationship

Y = α + βX + noise

(Figure: scatter plot of (X, Y) pairs with a fitted line.)

Simple probabilistic model: linear regression

Deterministic (functional) relationship

Y = α + βX + noise

“Learning” = estimating the parameters α, β, σ from (x, y) pairs.

The fitted line gives the (empirical) mean of Y given X; α and β can be estimated by least squares; σ² is the residual variance.

(Figure: scatter plot of (X, Y) pairs with the fitted line.)
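A minimal sketch of the least-squares estimates and the residual variance (illustrative, not the talk’s code):

```python
# Least-squares fit of Y = alpha + beta * X + noise.
import numpy as np

def fit_linear(x, y):
    A = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    alpha, beta = coef
    residuals = y - (alpha + beta * x)
    sigma2 = residuals.var()                        # residual variance
    return alpha, beta, np.sqrt(sigma2)
```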

Piecewise linear regression

Latent “switch” variable – hidden process at work

Probabilistic graphical model for piecewise linear regression

(Figure: a graphical model with input X, hidden switch Q, and output Y.)

  • A hidden variable Q chooses which set of parameters to use for predicting Y.
  • The value of Q depends on the value of the input X.
  • This is an example of “mixtures of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (c.f. K-means).

Classes of graphical models

  • Probabilistic models ⊃ graphical models
    • Directed: Bayes nets, DBNs
    • Undirected: MRFs

Bayesian Networks

Compact representation of probability distributions via conditional independence.

Qualitative part: a directed acyclic graph (DAG)
  • Nodes - random variables
  • Edges - direct influence

Example (“family of Alarm”): Burglary and Earthquake are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Call.

Quantitative part: a set of conditional probability distributions, e.g. P(A | E, B):

  E    B     P(a)    P(¬a)
  e    b     0.9     0.1
  e    ¬b    0.2     0.8
  ¬e   b     0.9     0.1
  ¬e   ¬b    0.01    0.99

Together they define a unique distribution in a factored form:

  P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
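A small sketch of using this factored form for inference by brute-force enumeration. The structure and the P(A | E, B) values follow the table above; the remaining CPT numbers are made up purely for illustration.

```python
# Factored joint for the burglary network and inference by enumeration.
# P(A|E,B) follows the table above; other CPT numbers are illustrative only.
import itertools

P_B = {1: 0.01, 0: 0.99}                                       # illustrative
P_E = {1: 0.02, 0: 0.98}                                       # illustrative
P_A = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01}    # P(A=1 | E, B)
P_R = {1: 0.9, 0: 0.01}                                        # P(R=1 | E), illustrative
P_C = {1: 0.8, 0: 0.05}                                        # P(C=1 | A), illustrative

def joint(b, e, a, r, c):
    """P(B,E,A,R,C) = P(B) P(E) P(A|E,B) P(R|E) P(C|A)."""
    pa = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    pr = P_R[e] if r else 1 - P_R[e]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * pa * pr * pc

def posterior_burglary_given_call():
    """P(B=1 | C=1), summing the joint over the hidden variables."""
    num = sum(joint(1, e, a, r, 1) for e, a, r in itertools.product([0, 1], repeat=3))
    den = sum(joint(b, e, a, r, 1) for b, e, a, r in itertools.product([0, 1], repeat=4))
    return num / den

print(posterior_burglary_given_call())
```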

Example: “ICU Alarm” network

Domain: monitoring intensive-care patients.

(Figure: the 37-node ALARM Bayesian network, with nodes such as MINVOLSET, INTUBATION, VENTLUNG, SAO2, CATECHOL, HR, BP, ….)

  • 37 variables
  • 509 parameters …instead of 2^54

Success stories for graphical models
  • Multiple sequence alignment
  • Forensic analysis
  • Medical and fault diagnosis
  • Speech recognition
  • Visual tracking
  • Channel coding at Shannon limit
  • Genetic pedigree analysis
Graphical models: outline
  • What are graphical models? ✓
  • Inference
  • Structure learning
Probabilistic Inference

(Figure: the Burglary / Earthquake / Radio / Alarm / Call network.)

  • Posterior probabilities
    • Probability of any event given any evidence: P(X | E)

Viterbi decoding

Compute the most probable explanation (MPE) of the observed data.

Hidden Markov Model (HMM)

(Figure: hidden states X1 → X2 → X3 with observations Y1, Y2, Y3, e.g. the acoustic signal for the word “Tomato”.)
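A minimal sketch of the Viterbi algorithm in log space (my own illustration, not the talk’s code):

```python
# Viterbi: most probable hidden state sequence of an HMM given observations.
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi[i]: initial log-prob; log_A[i, j]: transition i -> j;
    log_B[i, k]: emission of symbol k from state i; obs: symbol indices."""
    n_states, T = len(log_pi), len(obs)
    delta = np.empty((T, n_states))              # best log-prob ending in each state
    back = np.zeros((T, n_states), dtype=int)    # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # trace the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```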

Inference: computational issues

Easy → Hard: chains, trees, grids, dense loopy graphs.

(Figure: example graphs of each kind, with the densely connected ALARM network at the hard end.)

Many different inference algorithms exist, both exact and approximate.

Bayesian inference
  • Bayesian probability treats parameters as random variables
  • Learning / parameter estimation is replaced by probabilistic inference P(θ | D)
  • Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

Parameters are tied (shared) across repetitions of the data.

(Figure: the parameters θ connected to every (Xi, Yi) pair, i = 1, …, n.)

Bayesian inference
  • + Elegant – no distinction between parameters and other hidden variables
  • + Can use priors to learn from small data sets (c.f., one-shot learning by humans)
  • - Math can get hairy
  • - Often computationally intractable
Graphical models: outline
  • What are graphical models? ✓
  • Inference ✓
  • Structure learning

Why Struggle for Accurate Structure?

(Figure: a “truth” network over Earthquake, Alarm Set, Burglary and Sound, alongside versions that miss an arc or add an arc.)

  • Missing an arc: wrong assumptions about domain structure; cannot be compensated for by fitting parameters
  • Adding an arc: wrong assumptions about domain structure; increases the number of parameters to be estimated

Score-based Learning

  • Define a scoring function that evaluates how well a structure matches the data, e.g. observations of (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
  • Search for a structure that maximizes the score

(Figure: several candidate structures over E, B and A being compared against the data.)

Learning Trees
  • Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (sketched below)
  • If some of the variables are hidden, the problem becomes hard again, but EM can be used to fit mixtures of trees
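Here is a sketch of the fully observed case (illustrative, not from the talk): weight each pair of variables by their empirical mutual information and take a maximum-weight spanning tree, in Chow-Liu style. It assumes binary 0/1 data and uses networkx for the spanning tree.

```python
# Hedged sketch of optimal tree learning (Chow-Liu style) for binary data.
import itertools
import numpy as np
import networkx as nx

def mutual_information(x, y):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a, b in itertools.product([0, 1], repeat=2):
        p_ab = np.mean((x == a) & (y == b))
        p_a, p_b = np.mean(x == a), np.mean(y == b)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """data: n_samples x n_vars array of 0/1. Returns the tree's edge list."""
    n_vars = data.shape[1]
    G = nx.Graph()
    for i, j in itertools.combinations(range(n_vars), 2):
        G.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))
    return list(nx.maximum_spanning_tree(G).edges())
```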
Heuristic Search
  • Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search (a skeleton is sketched after this list)
  • Define a search space:
    • search states are possible structures
    • operators make small changes to structure
  • Traverse space looking for high-scoring structures
  • Search techniques:
    • Greedy hill-climbing
    • Best first search
    • Simulated Annealing
    • ...
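Below is a minimal skeleton of greedy hill-climbing over DAG structures (my own sketch, not the implementation behind the talk’s results). It assumes binary 0/1 data and a BIC-style decomposable score, and for brevity only uses add and delete operators.

```python
# Hedged skeleton: greedy hill-climbing over DAGs with a BIC-style score.
import itertools
import numpy as np

def family_bic(data, child, parents):
    """Log-likelihood of one node's CPT minus a BIC penalty (binary data)."""
    n = data.shape[0]
    parents = list(parents)
    ll = 0.0
    for cfg in itertools.product([0, 1], repeat=len(parents)):
        mask = np.all(data[:, parents] == cfg, axis=1) if parents else np.ones(n, dtype=bool)
        m = int(mask.sum())
        if m == 0:
            continue
        k = int(data[mask, child].sum())
        for count in (k, m - k):
            if count > 0:
                ll += count * np.log(count / m)
    return ll - 0.5 * (2 ** len(parents)) * np.log(n)

def reachable(parents, src, dst):
    """Is there a directed path src -> ... -> dst, given the parent sets?"""
    children = {i: [j for j in parents if i in parents[j]] for i in parents}
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False

def hill_climb(data, max_steps=100):
    """Greedy search using only edge additions and deletions."""
    n_vars = data.shape[1]
    parents = {i: set() for i in range(n_vars)}
    local = {i: family_bic(data, i, parents[i]) for i in range(n_vars)}
    for _ in range(max_steps):
        best_gain, best_move = 0.0, None
        for u, v in itertools.permutations(range(n_vars), 2):
            if u in parents[v]:                   # candidate move: delete u -> v
                new_parents = parents[v] - {u}
            elif not reachable(parents, v, u):    # candidate move: add u -> v (stays acyclic)
                new_parents = parents[v] | {u}
            else:
                continue
            new_score = family_bic(data, v, sorted(new_parents))
            gain = new_score - local[v]           # only v's family changes (decomposable score)
            if gain > best_gain:
                best_gain, best_move = gain, (v, new_parents, new_score)
        if best_move is None:                     # no move improves the score
            break
        v, new_parents, new_score = best_move
        parents[v], local[v] = new_parents, new_score
    return parents
```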
Local Search Operations

(Figure: a network over S, C, E and D, before and after each operation.)

  • Typical operations: add C→D, delete C→E, reverse C→E
  • Because the score decomposes over families, the change from adding C→D can be computed locally: Δscore = S({C,E}→D) − S({E}→D)

Problems with local search

Easy to get stuck in local optima.

(Figure: the score landscape S(G|D), with the “truth” at the global optimum and the search (“you”) stuck at a local one.)


Problems with local search II

Picking a single best model can be misleading:
  • Small sample size ⇒ many high-scoring models
  • An answer based on one model is often useless
  • We want features common to many models

(Figure: several different structures over E, B, R, A and C, each receiving substantial posterior mass P(G|D).)
Bayesian Approach to Structure Learning
  • Posterior distribution over structures
  • Estimate the probability of features
    • Edge X→Y
    • Path X→ … →Y

  P(f | D) = Σ_G f(G) P(G | D)

where f(G) is an indicator function for the feature of G (e.g. the edge X→Y) and P(G | D) is the Bayesian score for G.

Bayesian approach: computational issues
  • Posterior distribution over structures: how do we compute the sum over a super-exponential number of graphs?
  • MCMC over networks
  • MCMC over node-orderings (Rao-Blackwellisation)
Structure learning: other issues
  • Discovering latent variables
  • Learning causal models
  • Learning from interventional data
  • Active learning
Discovering latent variables

(Figure: (a) a network with a latent variable, 17 parameters; (b) the same dependencies modeled without the latent variable, 59 parameters.)

There are some techniques for automatically detecting the possible presence of latent variables.

Learning causal models
  • So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
  • However, we often want to interpret directed arrows causally.
  • This is uncontroversial for the arrow of time.
  • But can we infer causality from static observational data?
Learning causal models
  • We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
  • See the books by Pearl and by Spirtes et al.
  • However, we can only learn structure up to Markov equivalence, no matter how much data we have.

(Figure: several DAGs over X, Y and Z that encode the same conditional independencies.)

Learning from interventional data
  • The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
  • We need to (slightly) modify our learning algorithms: cut the arcs coming into nodes which were set by intervention.

(Figure: Smoking → Yellow fingers; intervening on Yellow fingers cuts the incoming arc.)

P(smoker | do(paint yellow)) = prior, whereas P(smoker | observe(yellow)) >> prior.

Active learning
  • Which experiments (interventions) should we perform to learn structure as efficiently as possible?
  • This problem can be modeled using decision theory.
  • Exact solutions are wildly computationally intractable.
  • Can we come up with good approximate decision making techniques?
  • Can we implement hardware to automatically perform the experiments?
  • “AB: Automated Biologist”
Learning from relational data

Can we learn concepts from a set of relations between objects, instead of (or in addition to) just their attributes?

Learning from relational data: approaches
  • Probabilistic relational models (PRMs)
    • Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)
  • Inductive Logic Programming (ILP)
    • Top-down, e.g., FOIL (generalization of C4.5)
    • Bottom up, e.g., PROGOL (inverse deduction)
ILP for learning protein folding: input

(Figure: example protein structures labeled “yes” / “no”.)

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing the structure of each positive/negative example
ILP for learning protein folding: results
  • PROGOL learned a rule to predict whether a protein will form a “four-helical up-and-down bundle”.
  • In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix.”
ILP: Pros and Cons
  • + Can discover new predicates (concepts) automatically
  • + Can learn relational models from relational (or flat) data
  • - Computationally intractable
  • - Poor handling of noise
The future of machine learning for bioinformatics

(Figure: a closed loop linking the biological literature, prior knowledge, the learner, hypotheses, experiment design, the real world, and replicated experiments.)

  • “Computer assisted pathway refinement”
Decision trees

(Figure: a small decision tree that tests “blue?”, “oval?” and “big?”, with yes/no leaves.)

Decision trees

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power

Feedforward neural network

(Figure: input layer → hidden layer → output, with a sigmoid function at each node and a weight on each arc.)
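A minimal sketch (not from the talk) of the forward pass of such a network, with a weight on each arc and a sigmoid at each node:

```python
# Forward pass of a one-hidden-layer network with sigmoid units.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """x: input vector; W1, b1: hidden layer; W2, b2: output layer."""
    h = sigmoid(W1 @ x + b1)      # hidden layer: sigmoid of a weighted sum
    return sigmoid(W2 @ h + b2)   # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```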

Feedforward neural network

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Nearest Neighbor
  • Remember all your data
  • When someone asks a question,
    • find the nearest old data point
    • return the answer associated with it
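A minimal sketch of this procedure for 1-nearest-neighbour with Euclidean distance (illustrative):

```python
# 1-nearest-neighbour prediction: remember the data, answer with the closest point.
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Return the label of the training point closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array(["no", "yes"])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 0.8])))  # "yes"
```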
Nearest Neighbor

(Figure: a query point “?” among labeled examples.)

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Support Vector Machines (SVMs)
  • Two key ideas:
    • Large margins are good
    • Kernel trick
SVM: mathematical details

(Figure: two classes separated by a hyperplane, with the margin highlighted.)

  • Training data: l-dimensional vectors x_i with labels y_i ∈ {+1, −1}
  • Separating hyperplane: w·x + b = 0
  • Margin: 2 / ||w||
  • Inequalities: y_i (w·x_i + b) ≥ 1 for all i
  • Support vector expansion: w = Σ_i α_i y_i x_i
  • Support vectors: the training points with α_i > 0 (they lie on the margin)
  • Decision: f(x) = sign(w·x + b)
SVMs: summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:

  • The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
  • Large margin classifiers are good
Boosting: summary
  • Can boost any weak learner
  • Most commonly: boosted decision “stumps”

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

Supervised learning: summary
  • Learn mapping F from inputs to outputs using a training set of (x,t) pairs
  • F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
  • Algorithms offer a variety of tradeoffs
  • Many good books, e.g.,
    • “The elements of statistical learning”, Hastie, Tibshirani, Friedman, 2001
    • “Pattern classification”, Duda, Hart, Stork, 2001
Inference

(Figure: the Burglary / Earthquake / Radio / Alarm / Call network.)

  • Posterior probabilities
    • Probability of any event given any evidence
  • Most likely explanation
    • Scenario that explains the evidence
  • Rational decision making
    • Maximize expected utility
    • Value of information
  • Effect of intervention

Assumption needed to makelearning work
  • We need to assume “Future futures will resemble past futures” (B. Russell)
  • Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.
Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al 2000]

  • 600 genes
  • 300 experiments
Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: biological sequences, e.g.

  Human  CGTTGC…
  Chimp  CCTAGG…
  Orang  CGAACG…
  …

Output: a phylogeny (a tree whose leaves are the observed species).

Uses structural EM, with max-spanning-tree in the inner loop.

Instances of graphical models

  • Probabilistic models ⊃ graphical models
    • Directed (Bayes nets): Naïve Bayes classifier, mixtures of experts, DBNs (Hidden Markov Models, Kalman filter models)
    • Undirected (MRFs): Ising model

ML enabling technologies
  • Faster computers
  • More data
    • The web
    • Parallel corpora (machine translation)
    • Multiple sequenced genomes
    • Gene expression arrays
  • New ideas
    • Kernel trick
    • Large margins
    • Boosting
    • Graphical models