
An introduction to machine learning and probabilistic graphical models

Kevin Murphy

MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003

Overview
  • Supervised learning
  • Unsupervised learning
  • Graphical models
  • Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Supervised learning

Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs.

(Figure: example objects labeled “yes” / “no”.)

Supervised learning

(Figure: training data → learner → hypothesis; the hypothesis is then applied to testing data to make predictions.)
Key issue: generalization

(Figure: labeled “yes” / “no” examples plus new, unlabeled “?” cases.)

Can’t just memorize the training set (overfitting).

Hypothesis spaces
  • Decision trees
  • Neural networks
  • K-nearest neighbors
  • Naïve Bayes classifier
  • Support vector machines (SVMs)
  • Boosted decision stumps
Kernel trick

(Figure: 2D points (x1, x2) that are not linearly separable, mapped to 3D features (z1, z2, z3) where they are.)

The kernel implicitly maps from 2D to 3D, making the problem linearly separable.
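To make this concrete, here is a minimal sketch (my own illustration, not code from the talk) of the degree-2 polynomial kernel: an explicit feature map phi lifts the 2D points into 3D, and the kernel computes the same feature-space inner product without ever forming phi.

```python
# Minimal sketch of the kernel trick (illustrative, not from the talk):
# the quadratic feature map phi lifts 2D points into 3D, and the polynomial
# kernel k(x, y) = (x . y)^2 equals <phi(x), phi(y)> without computing phi.
import numpy as np

def phi(x):
    """Explicit 2D -> 3D quadratic feature map."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel on the original 2D points."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))   # inner product in 3D feature space
print(poly_kernel(x, y))        # same number, computed in the 2D input space
```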

Support Vector Machines (SVMs)
  • Two key ideas:
    • Large margins
    • Kernel trick
Boosting

Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations.

Boosting maximizes the margin
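As an illustration of such weighted combinations (a hedged sketch under my own assumptions, not the talk’s code), here is AdaBoost with decision “stumps”, using scikit-learn for the stumps:

```python
# Hedged sketch of boosting decision stumps (AdaBoost); illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must be in {-1, +1}. Returns the stumps and their weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def boosted_predict(stumps, alphas, X):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)
```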

Supervised learning success stories
  • Face detection
  • Steering an autonomous car across the US
  • Detecting credit card fraud
  • Medical diagnosis
Unsupervised learning
  • What if there are no output labels?
K-means clustering
  • Guess the number of clusters, K
  • Guess initial cluster centers, μ1, μ2, …
  • Assign data points xi to the nearest cluster center
  • Re-compute cluster centers based on the assignments
  • Iterate the assignment and re-computation steps
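A minimal sketch of these steps (illustrative, not code from the talk):

```python
# Minimal K-means sketch; mirrors the steps listed above.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]     # initial guesses
    for _ in range(n_iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers from the assignments
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```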
AutoClass (Cheeseman et al, 1986)
  • EM algorithm for mixtures of Gaussians
  • “Soft” version of K-means
  • Uses Bayesian criterion to select K
  • Discovered new types of stars from spectral data
  • Discovered new classes of proteins and introns from DNA/protein sequence databases
Principal Component Analysis (PCA)

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.

PCA seeks a projection that best represents the data in a least-squares sense.
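A minimal sketch of PCA via the singular value decomposition (illustrative, not from the talk):

```python
# Minimal PCA sketch via the SVD.
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal directions."""
    Xc = X - X.mean(axis=0)                 # center the cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # directions of greatest scatter
    return Xc @ components.T, components    # low-dimensional coordinates
```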

Discovering rules (data mining)

Find the most frequent patterns (association rules)

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
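As a toy illustration (records and field names made up for this sketch, not taken from the talk), the support and confidence of one candidate rule can be computed by simple counting:

```python
# Hedged sketch: support and confidence of one association rule on toy records.
records = [
    {"household": 1, "children": 0, "language": "English"},
    {"household": 1, "children": 0, "language": "English"},
    {"household": 2, "children": 1, "language": "Spanish"},
    {"household": 1, "children": 0, "language": "French"},
]

def rule_stats(records, antecedent, consequent):
    """antecedent / consequent are dicts of field -> required value."""
    match_a = [r for r in records if all(r[k] == v for k, v in antecedent.items())]
    match_both = [r for r in match_a if all(r[k] == v for k, v in consequent.items())]
    support = len(match_both) / len(records)
    confidence = len(match_both) / len(match_a) if match_a else 0.0
    return support, confidence

print(rule_stats(records,
                 {"household": 1, "children": 0},
                 {"language": "English"}))   # (0.5, 0.666...)
```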

Unsupervised learning: summary
  • Clustering
  • Hierarchical clustering
  • Linear dimensionality reduction (PCA)
  • Non-linear dimensionality reduction
  • Learning rules
Discovering networks

(Figure: a network whose structure is unknown, marked “?”.)

From data visualization to causal discovery.

Networks in biology
  • Most processes in the cell are controlled by networks of interacting molecules:
    • Metabolic Network
    • Signal Transduction Networks
    • Regulatory Networks
  • Networks can be modeled at multiple levels of detail/ realism
    • Molecular level
    • Concentration level
    • Qualitative level

(The three levels are listed in order of decreasing detail.)

Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

  • 5 genes, 67 parameters based on 50 years of research
  • Stochastic simulation required supercomputer
Concentration level: metabolic pathways

(Figure: a small gene network g1 … g5 with weighted interactions such as w12, w23, w55.)

  • Usually modeled with differential equations
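As a toy illustration of this kind of model (my own sketch, with a made-up weight matrix, not from the talk), a linear ODE system dg_i/dt = Σ_j w_ij g_j for the concentrations can be simulated numerically:

```python
# Toy sketch (made-up numbers): modeling concentrations with ODEs,
# d g_i / dt = sum_j w_ij * g_j, solved numerically.
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[-0.5,  0.2,  0.0],     # illustrative interaction weights w_ij
              [ 0.1, -0.3,  0.4],
              [ 0.0,  0.2, -0.6]])
g0 = np.array([1.0, 0.5, 0.2])        # initial concentrations

sol = solve_ivp(lambda t, g: W @ g, t_span=(0.0, 10.0), y0=g0)
print(sol.y[:, -1])                   # concentrations at t = 10
```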
Probabilistic graphical models
  • Supports graph-based modeling at various levels of detail
  • Models can be learned from noisy, partial data
  • Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
  • But can also model deterministic, causal processes.

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities."

-- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to

calculation." -- Pierre Simon Laplace

Graphical models: outline
  • What are graphical models?
  • Inference
  • Structure learning
Simple probabilistic model: linear regression

Deterministic (functional) relationship

Y = α + βX + noise

(Figure: scatter plot of (X, Y) pairs with a fitted line.)

Simple probabilistic model: linear regression

Deterministic (functional) relationship

Y = α + βX + noise

“Learning” = estimating the parameters α, β, σ from (x, y) pairs.

The fitted line gives the (empirical) mean of Y given X; α and β can be estimated by least squares; σ² is the residual variance.

(Figure: scatter plot of (X, Y) pairs with the fitted line.)
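A minimal sketch of the least-squares estimates and the residual variance (illustrative, not the talk’s code):

```python
# Least-squares fit of Y = alpha + beta * X + noise.
import numpy as np

def fit_linear(x, y):
    A = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    alpha, beta = coef
    residuals = y - (alpha + beta * x)
    sigma2 = residuals.var()                        # residual variance
    return alpha, beta, np.sqrt(sigma2)
```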

Piecewise linear regression

Latent “switch” variable – hidden process at work

Probabilistic graphical model for piecewise linear regression

(Figure: a graphical model with input X, hidden switch Q, and output Y.)

  • A hidden variable Q chooses which set of parameters to use for predicting Y.
  • The value of Q depends on the value of the input X.
  • This is an example of “mixtures of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (c.f. K-means).

Classes of graphical models

  • Probabilistic models ⊃ graphical models
    • Directed: Bayes nets, DBNs
    • Undirected: MRFs

Bayesian Networks

Compact representation of probability distributions via conditional independence.

Qualitative part: a directed acyclic graph (DAG)
  • Nodes - random variables
  • Edges - direct influence

Example (“family of Alarm”): Burglary and Earthquake are parents of Alarm; Earthquake is a parent of Radio; Alarm is a parent of Call.

Quantitative part: a set of conditional probability distributions, e.g. P(A | E, B):

  E    B     P(a)    P(¬a)
  e    b     0.9     0.1
  e    ¬b    0.2     0.8
  ¬e   b     0.9     0.1
  ¬e   ¬b    0.01    0.99

Together they define a unique distribution in a factored form:

  P(B, E, A, R, C) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
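A small sketch of using this factored form for inference by brute-force enumeration. The structure and the P(A | E, B) values follow the table above; the remaining CPT numbers are made up purely for illustration.

```python
# Factored joint for the burglary network and inference by enumeration.
# P(A|E,B) follows the table above; other CPT numbers are illustrative only.
import itertools

P_B = {1: 0.01, 0: 0.99}                                       # illustrative
P_E = {1: 0.02, 0: 0.98}                                       # illustrative
P_A = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01}    # P(A=1 | E, B)
P_R = {1: 0.9, 0: 0.01}                                        # P(R=1 | E), illustrative
P_C = {1: 0.8, 0: 0.05}                                        # P(C=1 | A), illustrative

def joint(b, e, a, r, c):
    """P(B,E,A,R,C) = P(B) P(E) P(A|E,B) P(R|E) P(C|A)."""
    pa = P_A[(e, b)] if a else 1 - P_A[(e, b)]
    pr = P_R[e] if r else 1 - P_R[e]
    pc = P_C[a] if c else 1 - P_C[a]
    return P_B[b] * P_E[e] * pa * pr * pc

def posterior_burglary_given_call():
    """P(B=1 | C=1), summing the joint over the hidden variables."""
    num = sum(joint(1, e, a, r, 1) for e, a, r in itertools.product([0, 1], repeat=3))
    den = sum(joint(b, e, a, r, 1) for b, e, a, r in itertools.product([0, 1], repeat=4))
    return num / den

print(posterior_burglary_given_call())
```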

Example: “ICU Alarm” network

Domain: monitoring intensive-care patients.

(Figure: the 37-node ALARM Bayesian network, with nodes such as MINVOLSET, INTUBATION, VENTLUNG, SAO2, CATECHOL, HR, BP, ….)

  • 37 variables
  • 509 parameters …instead of 2^54

Success stories for graphical models
  • Multiple sequence alignment
  • Forensic analysis
  • Medical and fault diagnosis
  • Speech recognition
  • Visual tracking
  • Channel coding at Shannon limit
  • Genetic pedigree analysis
Graphical models: outline
  • What are graphical models? ✓
  • Inference
  • Structure learning
Probabilistic Inference

(Figure: the Burglary / Earthquake / Radio / Alarm / Call network.)

  • Posterior probabilities
    • Probability of any event given any evidence: P(X | E)

Viterbi decoding

Compute the most probable explanation (MPE) of the observed data.

Hidden Markov Model (HMM)

(Figure: hidden states X1 → X2 → X3 with observations Y1, Y2, Y3, e.g. the acoustic signal for the word “Tomato”.)
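A minimal sketch of the Viterbi algorithm in log space (my own illustration, not the talk’s code):

```python
# Viterbi: most probable hidden state sequence of an HMM given observations.
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi[i]: initial log-prob; log_A[i, j]: transition i -> j;
    log_B[i, k]: emission of symbol k from state i; obs: symbol indices."""
    n_states, T = len(log_pi), len(obs)
    delta = np.empty((T, n_states))              # best log-prob ending in each state
    back = np.zeros((T, n_states), dtype=int)    # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # trace the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```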

Inference: computational issues

Easy → Hard: chains, trees, grids, dense loopy graphs.

(Figure: example graphs of each kind, with the densely connected ALARM network at the hard end.)

Many different inference algorithms exist, both exact and approximate.

Bayesian inference
  • Bayesian probability treats parameters as random variables
  • Learning / parameter estimation is replaced by probabilistic inference P(θ | D)
  • Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

Parameters are tied (shared) across repetitions of the data.

(Figure: the parameters θ connected to every (Xi, Yi) pair, i = 1, …, n.)

Bayesian inference
  • + Elegant – no distinction between parameters and other hidden variables
  • + Can use priors to learn from small data sets (c.f., one-shot learning by humans)
  • - Math can get hairy
  • - Often computationally intractable
Graphical models: outline
  • What are graphical models? ✓
  • Inference ✓
  • Structure learning

Why Struggle for Accurate Structure?

(Figure: a “truth” network over Earthquake, Alarm Set, Burglary and Sound, alongside versions that miss an arc or add an arc.)

  • Missing an arc: wrong assumptions about domain structure; cannot be compensated for by fitting parameters
  • Adding an arc: wrong assumptions about domain structure; increases the number of parameters to be estimated

Score-based Learning

  • Define a scoring function that evaluates how well a structure matches the data, e.g. observations of (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
  • Search for a structure that maximizes the score

(Figure: several candidate structures over E, B and A being compared against the data.)

Learning Trees
  • Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (sketched below)
  • If some of the variables are hidden, the problem becomes hard again, but EM can be used to fit mixtures of trees
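Here is a sketch of the fully observed case (illustrative, not from the talk): weight each pair of variables by their empirical mutual information and take a maximum-weight spanning tree, in Chow-Liu style. It assumes binary 0/1 data and uses networkx for the spanning tree.

```python
# Hedged sketch of optimal tree learning (Chow-Liu style) for binary data.
import itertools
import numpy as np
import networkx as nx

def mutual_information(x, y):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a, b in itertools.product([0, 1], repeat=2):
        p_ab = np.mean((x == a) & (y == b))
        p_a, p_b = np.mean(x == a), np.mean(y == b)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(data):
    """data: n_samples x n_vars array of 0/1. Returns the tree's edge list."""
    n_vars = data.shape[1]
    G = nx.Graph()
    for i, j in itertools.combinations(range(n_vars), 2):
        G.add_edge(i, j, weight=mutual_information(data[:, i], data[:, j]))
    return list(nx.maximum_spanning_tree(G).edges())
```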
Heuristic Search
  • Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search (a skeleton is sketched after this list)
  • Define a search space:
    • search states are possible structures
    • operators make small changes to structure
  • Traverse space looking for high-scoring structures
  • Search techniques:
    • Greedy hill-climbing
    • Best first search
    • Simulated Annealing
    • ...
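Below is a minimal skeleton of greedy hill-climbing over DAG structures (my own sketch, not the implementation behind the talk’s results). It assumes binary 0/1 data and a BIC-style decomposable score, and for brevity only uses add and delete operators.

```python
# Hedged skeleton: greedy hill-climbing over DAGs with a BIC-style score.
import itertools
import numpy as np

def family_bic(data, child, parents):
    """Log-likelihood of one node's CPT minus a BIC penalty (binary data)."""
    n = data.shape[0]
    parents = list(parents)
    ll = 0.0
    for cfg in itertools.product([0, 1], repeat=len(parents)):
        mask = np.all(data[:, parents] == cfg, axis=1) if parents else np.ones(n, dtype=bool)
        m = int(mask.sum())
        if m == 0:
            continue
        k = int(data[mask, child].sum())
        for count in (k, m - k):
            if count > 0:
                ll += count * np.log(count / m)
    return ll - 0.5 * (2 ** len(parents)) * np.log(n)

def reachable(parents, src, dst):
    """Is there a directed path src -> ... -> dst, given the parent sets?"""
    children = {i: [j for j in parents if i in parents[j]] for i in parents}
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children[node])
    return False

def hill_climb(data, max_steps=100):
    """Greedy search using only edge additions and deletions."""
    n_vars = data.shape[1]
    parents = {i: set() for i in range(n_vars)}
    local = {i: family_bic(data, i, parents[i]) for i in range(n_vars)}
    for _ in range(max_steps):
        best_gain, best_move = 0.0, None
        for u, v in itertools.permutations(range(n_vars), 2):
            if u in parents[v]:                   # candidate move: delete u -> v
                new_parents = parents[v] - {u}
            elif not reachable(parents, v, u):    # candidate move: add u -> v (stays acyclic)
                new_parents = parents[v] | {u}
            else:
                continue
            new_score = family_bic(data, v, sorted(new_parents))
            gain = new_score - local[v]           # only v's family changes (decomposable score)
            if gain > best_gain:
                best_gain, best_move = gain, (v, new_parents, new_score)
        if best_move is None:                     # no move improves the score
            break
        v, new_parents, new_score = best_move
        parents[v], local[v] = new_parents, new_score
    return parents
```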
Local Search Operations

(Figure: a network over S, C, E and D, before and after each operation.)

  • Typical operations: add C→D, delete C→E, reverse C→E
  • Because the score decomposes over families, the change from adding C→D can be computed locally: Δscore = S({C,E}→D) − S({E}→D)

Problems with local search

Easy to get stuck in local optima.

(Figure: the score landscape S(G|D), with the “truth” at the global optimum and the search (“you”) stuck at a local one.)


Problems with local search II

Picking a single best model can be misleading:
  • Small sample size ⇒ many high-scoring models
  • An answer based on one model is often useless
  • We want features common to many models

(Figure: several different structures over E, B, R, A and C, each receiving substantial posterior mass P(G|D).)
Bayesian Approach to Structure Learning
  • Posterior distribution over structures
  • Estimate the probability of features
    • Edge X→Y
    • Path X→ … →Y

  P(f | D) = Σ_G f(G) P(G | D)

where f(G) is an indicator function for the feature of G (e.g. the edge X→Y) and P(G | D) is the Bayesian score for G.

Bayesian approach: computational issues
  • Posterior distribution over structures: how do we compute the sum over a super-exponential number of graphs?
  • MCMC over networks
  • MCMC over node-orderings (Rao-Blackwellisation)
Structure learning: other issues
  • Discovering latent variables
  • Learning causal models
  • Learning from interventional data
  • Active learning
Discovering latent variables

(Figure: (a) a network with a latent variable, 17 parameters; (b) the same dependencies modeled without the latent variable, 59 parameters.)

There are some techniques for automatically detecting the possible presence of latent variables.

Learning causal models
  • So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
  • However, we often want to interpret directed arrows causally.
  • This is uncontroversial for the arrow of time.
  • But can we infer causality from static observational data?
Learning causal models
  • We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
  • See the books by Pearl and by Spirtes et al.
  • However, we can only learn structure up to Markov equivalence, no matter how much data we have.

(Figure: several DAGs over X, Y and Z that encode the same conditional independencies.)

Learning from interventional data
  • The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
  • We need to (slightly) modify our learning algorithms: cut the arcs coming into nodes which were set by intervention.

(Figure: Smoking → Yellow fingers; intervening on Yellow fingers cuts the incoming arc.)

P(smoker | do(paint yellow)) = prior, whereas P(smoker | observe(yellow)) >> prior.

Active learning
  • Which experiments (interventions) should we perform to learn structure as efficiently as possible?
  • This problem can be modeled using decision theory.
  • Exact solutions are wildly computationally intractable.
  • Can we come up with good approximate decision making techniques?
  • Can we implement hardware to automatically perform the experiments?
  • “AB: Automated Biologist”
Learning from relational data

Can we learn concepts from a set of relations between objects, instead of (or in addition to) just their attributes?

Learning from relational data: approaches
  • Probabilistic relational models (PRMs)
    • Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)
  • Inductive Logic Programming (ILP)
    • Top-down, e.g., FOIL (generalization of C4.5)
    • Bottom up, e.g., PROGOL (inverse deduction)
ILP for learning protein folding: input

(Figure: example protein structures labeled “yes” / “no”.)

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing the structure of each positive/negative example
ILP for learning protein folding: results
  • PROGOL learned a rule to predict whether a protein will form a “four-helical up-and-down bundle”.
  • In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix.”
ILP: Pros and Cons
  • + Can discover new predicates (concepts) automatically
  • + Can learn relational models from relational (or flat) data
  • - Computationally intractable
  • - Poor handling of noise
The future of machine learning for bioinformatics

(Figure: a closed loop linking the biological literature, prior knowledge, the learner, hypotheses, experiment design, the real world, and replicated experiments.)

  • “Computer assisted pathway refinement”
Decision trees

(Figure: a small decision tree that tests “blue?”, “oval?” and “big?”, with yes/no leaves.)

Decision trees

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power

Feedforward neural network

(Figure: input layer → hidden layer → output, with a sigmoid function at each node and a weight on each arc.)
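A minimal sketch (not from the talk) of the forward pass of such a network, with a weight on each arc and a sigmoid at each node:

```python
# Forward pass of a one-hidden-layer network with sigmoid units.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """x: input vector; W1, b1: hidden layer; W2, b2: output layer."""
    h = sigmoid(W1 @ x + b1)      # hidden layer: sigmoid of a weighted sum
    return sigmoid(W2 @ h + b2)   # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```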

Feedforward neural network

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Nearest Neighbor
  • Remember all your data
  • When someone asks a question,
    • find the nearest old data point
    • return the answer associated with it
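A minimal sketch of this procedure for 1-nearest-neighbour with Euclidean distance (illustrative):

```python
# 1-nearest-neighbour prediction: remember the data, answer with the closest point.
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Return the label of the training point closest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array(["no", "yes"])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 0.8])))  # "yes"
```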
Nearest Neighbor

(Figure: a query point “?” among labeled examples.)

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

Support Vector Machines (SVMs)
  • Two key ideas:
    • Large margins are good
    • Kernel trick
SVM: mathematical details

(Figure: two classes separated by a hyperplane, with the margin highlighted.)

  • Training data: l-dimensional vectors x_i with labels y_i ∈ {+1, −1}
  • Separating hyperplane: w·x + b = 0
  • Margin: 2 / ||w||
  • Inequalities: y_i (w·x_i + b) ≥ 1 for all i
  • Support vector expansion: w = Σ_i α_i y_i x_i
  • Support vectors: the training points with α_i > 0 (they lie on the margin)
  • Decision: f(x) = sign(w·x + b)
SVMs: summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:

  • The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
  • Large margin classifiers are good
Boosting: summary
  • Can boost any weak learner
  • Most commonly: boosted decision “stumps”

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

Supervised learning: summary
  • Learn mapping F from inputs to outputs using a training set of (x,t) pairs
  • F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
  • Algorithms offer a variety of tradeoffs
  • Many good books, e.g.,
    • “The elements of statistical learning”, Hastie, Tibshirani, Friedman, 2001
    • “Pattern classification”, Duda, Hart, Stork, 2001
Inference

(Figure: the Burglary / Earthquake / Radio / Alarm / Call network.)

  • Posterior probabilities
    • Probability of any event given any evidence
  • Most likely explanation
    • Scenario that explains the evidence
  • Rational decision making
    • Maximize expected utility
    • Value of information
  • Effect of intervention

Assumption needed to makelearning work
  • We need to assume “Future futures will resemble past futures” (B. Russell)
  • Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.
Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al 2000]

  • 600 genes
  • 300 experiments
Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: biological sequences, e.g.

  Human  CGTTGC…
  Chimp  CCTAGG…
  Orang  CGAACG…
  …

Output: a phylogeny (a tree whose leaves are the observed species).

Uses structural EM, with max-spanning-tree in the inner loop.

Instances of graphical models

  • Probabilistic models ⊃ graphical models
    • Directed (Bayes nets): Naïve Bayes classifier, mixtures of experts, DBNs (Hidden Markov Models, Kalman filter models)
    • Undirected (MRFs): Ising model

ML enabling technologies
  • Faster computers
  • More data
    • The web
    • Parallel corpora (machine translation)
    • Multiple sequenced genomes
    • Gene expression arrays
  • New ideas
    • Kernel trick
    • Large margins
    • Boosting
    • Graphical models