
### An introduction to machine learning and probabilistic graphical models


Kevin Murphy

MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003


Overview

- Supervised learning
- Unsupervised learning
- Graphical models
- Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

Hypothesis spaces

- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayes classifier
- Support vector machines (SVMs)
- Boosted decision stumps
- …

Perceptron (neural net with no hidden layers)

Linearly separable data
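The perceptron’s error-driven update can be sketched in a few lines (an illustrative sketch, not code from the talk; the toy dataset is invented, and convergence is only guaranteed when the data are linearly separable):

```python
# Minimal perceptron: learns a linear separator w.x + b > 0 by
# nudging the weights toward each misclassified example.

def train_perceptron(X, y, epochs=100):
    """X: list of feature tuples, y: labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:          # misclassified: update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                errors += 1
        if errors == 0:                       # separating hyperplane found
            break
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1

# Invented linearly separable data: class +1 above the line x1 + x2 = 1
X = [(0, 0), (0.2, 0.1), (1, 1), (0.9, 0.8)]
y = [-1, -1, 1, 1]
w, b = train_perceptron(X, y)
```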

Support Vector Machines (SVMs)

- Two key ideas:
- Large margins
- Kernel trick

Boosting

Simple classifiers (weak learners) can have their performanceboosted by taking weighted combinations

Boosting maximizes the margin
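The weighted-combination idea can be sketched with AdaBoost on decision stumps (a minimal sketch; the 1-D dataset and round count are invented, and the exhaustive stump search is only practical for tiny problems):

```python
import math

# AdaBoost with decision stumps: each round fits the best single-feature
# threshold classifier on the weighted data, then up-weights the points
# it got wrong. The final classifier is a weighted vote of the stumps.

def stump_predict(feature, threshold, sign, x):
    return sign if x[feature] > threshold else -sign

def best_stump(X, y, w):
    """Exhaustively pick the stump with lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(f, t, s, xi) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    return best

def adaboost(X, y, rounds=3):
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        err, f, t, s = best_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # stump weight
        ensemble.append((alpha, f, t, s))
        # Re-weight: boost the weight of misclassified points
        w = [wi * math.exp(-alpha * yi * stump_predict(f, t, s, xi))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def ensemble_predict(ensemble, x):
    vote = sum(a * stump_predict(f, t, s, x) for a, f, t, s in ensemble)
    return 1 if vote > 0 else -1

# Invented 1-D toy data
X, y = [(0.0,), (1.0,), (2.0,), (3.0,)], [-1, -1, 1, 1]
ensemble = adaboost(X, y)
```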

Supervised learning success stories

- Face detection
- Steering an autonomous car across the US
- Detecting credit card fraud
- Medical diagnosis
- …

Unsupervised learning

- What if there are no output labels?

K-means clustering

- Guess number of clusters, K
- Guess initial cluster centers, μ1, μ2
- Assign data points xi to nearest cluster center
- Re-compute cluster centers based on assignments
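The assign/re-compute loop above can be sketched directly (a minimal sketch with invented toy data; a real implementation would restart from several initialisations):

```python
import random

# K-means: alternate between assigning each point to its nearest
# centre and recomputing each centre as the mean of its points.

def kmeans(points, k, iters=100, seed=0):
    centers = random.Random(seed).sample(points, k)   # initial guess
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centre for each point
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum(
                (pi - ci) ** 2 for pi, ci in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step (an empty cluster keeps its old centre)
        new_centers = [tuple(sum(d) / len(cl) for d in zip(*cl))
                       if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:                    # converged
            break
        centers = new_centers
    return centers, clusters

# Invented toy data: two well-separated blobs
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2)
```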

AutoClass (Cheeseman et al, 1986)

- EM algorithm for mixtures of Gaussians
- “Soft” version of K-means
- Uses Bayesian criterion to select K
- Discovered new types of stars from spectral data
- Discovered new classes of proteins and introns from DNA/protein sequence databases
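The “soft K-means” view can be made concrete with EM for a two-component 1-D Gaussian mixture (an illustrative sketch, far simpler than AutoClass: fixed K, invented toy data, crude initialisation):

```python
import math

# EM for a two-component 1-D Gaussian mixture: the "soft" analogue of
# K-means. The E-step computes responsibilities; the M-step re-estimates
# the mixture weights, means and variances from them.

def em_gmm(xs, iters=50):
    mu = [min(xs), max(xs)]          # crude initialisation
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            z = p[0] + p[1]
            resp.append([p[0] / z, p[1] / z])
        # M-step: weighted re-estimation of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi

# Invented data: two clusters near 0 and near 5
mu, var, pi = em_gmm([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```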

Principal Component Analysis (PCA)

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.

PCA seeks a projection that best represents the data in a least-squares sense.
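This least-squares projection can be sketched via the SVD of the centred data (a numpy-based sketch with invented toy data):

```python
import numpy as np

# PCA: centre the data, take the SVD, and keep the directions of
# greatest scatter (the top right singular vectors).

def pca(X, n_components):
    Xc = X - X.mean(axis=0)               # centre the cloud
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]        # principal directions
    return components, Xc @ components.T  # low-dimensional coordinates

# Invented toy data lying exactly on the line y = 2x
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
components, Z = pca(X, 1)
```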


Discovering rules (data mining)

Find the most frequent patterns (association rules)

Num in household = 1 ^ Num children = 0 => Language = English

Language = English ^ Income < $40k ^ Married = false ^ Num children = 0 => Education ∈ {college, grad school}
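Support/confidence counting behind such rules can be sketched as follows (an illustrative sketch with invented census-style records; real miners such as Apriori prune the subset enumeration):

```python
from itertools import combinations
from collections import Counter

# Frequent-pattern mining sketch: support of an itemset = fraction of
# records containing it; a rule A => B is kept when support(A ∪ B) and
# confidence = support(A ∪ B) / support(A) clear their thresholds.

def frequent_itemsets(records, min_support):
    counts = Counter()
    for rec in records:
        for r in range(1, len(rec) + 1):
            for subset in combinations(sorted(rec), r):
                counts[subset] += 1
    n = len(records)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def association_rules(freq, min_confidence):
    out = []
    for itemset, support in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                conf = support / freq[lhs]   # any subset is frequent too
                if conf >= min_confidence:
                    rhs = tuple(i for i in itemset if i not in lhs)
                    out.append((lhs, rhs, conf))
    return out

# Invented toy records standing in for census-style data
records = [{"english", "single"}, {"english", "single"},
           {"english", "married"}, {"spanish", "single"}]
freq = frequent_itemsets(records, min_support=0.5)
rules = association_rules(freq, min_confidence=0.6)
```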

Unsupervised learning: summary

- Clustering
- Hierarchical clustering
- Linear dimensionality reduction (PCA)
- Non-linear dimensionality reduction
- Learning rules

Networks in biology

- Most processes in the cell are controlled by networks of interacting molecules:
- Metabolic Network
- Signal Transduction Networks
- Regulatory Networks
- Networks can be modeled at multiple levels of detail/realism
- Molecular level
- Concentration level
- Qualitative level

Decreasing detail

Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

- 5 genes, 67 parameters based on 50 years of research
- Stochastic simulation required supercomputer

Probabilistic graphical models

- Supports graph-based modeling at various levels of detail
- Models can be learned from noisy, partial data
- Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
- But can also model deterministic, causal processes.

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities."

-- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to

calculation." -- Pierre Simon Laplace

Graphical models: outline

- What are graphical models?
- Inference
- Structure learning

Simple probabilistic model: linear regression

Deterministic (functional) relationship: Y = α + βX + noise

“Learning” = estimating the parameters α, β, σ from (x,y) pairs:

- α is the empirical mean
- β can be estimated by least squares
- σ is the residual variance
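These three estimates can be sketched directly (a minimal sketch; the noiseless toy data are invented):

```python
# Least-squares estimates for Y = alpha + beta * X + noise:
# beta from the covariance/variance ratio, alpha from the means,
# sigma^2 from the residuals.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx               # line passes through the means
    resid = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    sigma2 = sum(r * r for r in resid) / n
    return alpha, beta, sigma2

# Invented noiseless data on the line y = 1 + 2x
alpha, beta, sigma2 = fit_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```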

Piecewise linear regression

Latent “switch” variable – hidden process at work

Probabilistic graphical model for piecewise linear regression:

- Hidden variable Q chooses which set of parameters to use for predicting the output Y.
- The value of Q depends on the value of the input X.
- This is an example of “mixtures of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; can be solved with EM (c.f., K-means)

Bayesian Networks: compact representation of probability distributions via conditional independence

Qualitative part: a directed acyclic graph (DAG)

- Nodes – random variables
- Edges – direct influence

Example (the family of Alarm): Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call

Quantitative part: a set of conditional probability distributions, e.g., P(A | E, B):

| E | B | P(a) | P(¬a) |
|---|---|------|-------|
| e | b | 0.9 | 0.1 |
| e | ¬b | 0.2 | 0.8 |
| ¬e | b | 0.9 | 0.1 |
| ¬e | ¬b | 0.01 | 0.99 |

Together: define a unique distribution in a factored form
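The factored form can be made concrete in a few lines (a sketch; the CPT numbers are illustrative placeholders, not the slide’s exact values):

```python
from itertools import product

# Factored joint for the classic alarm network:
#   P(B,E,A,R,C) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)
# All CPT numbers below are invented placeholders.

p_burglary = 0.01
p_earthquake = 0.02
p_alarm = {(True, True): 0.9, (True, False): 0.9,
           (False, True): 0.2, (False, False): 0.01}   # P(a | B, E)
p_radio = {True: 0.95, False: 0.01}                    # P(r | E)
p_call = {True: 0.8, False: 0.05}                      # P(c | A)

def bern(p, value):
    """Probability of a binary outcome given P(outcome=True)."""
    return p if value else 1.0 - p

def joint(b, e, a, r, c):
    return (bern(p_burglary, b) * bern(p_earthquake, e)
            * bern(p_alarm[(b, e)], a) * bern(p_radio[e], r)
            * bern(p_call[a], c))

# The factored form needs 1 + 1 + 4 + 2 + 2 = 10 numbers instead of
# 2**5 - 1 = 31 for an explicit joint table; it still sums to one.
total = sum(joint(*v) for v in product([True, False], repeat=5))
```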

Example: “ICU Alarm” network. Domain: monitoring intensive-care patients

[Network diagram over the variables KINKEDTUBE, PULMEMBOLUS, INTUBATION, VENTMACH, DISCONNECT, PAP, SHUNT, VENTLUNG, VENITUBE, PRESS, MINOVL, FIO2, VENTALV, PVSAT, ANAPHYLAXIS, ARTCO2, EXPCO2, SAO2, TPR, INSUFFANESTH, HYPOVOLEMIA, LVFAILURE, CATECHOL, LVEDVOLUME, STROEVOLUME, ERRCAUTER, HR, ERRBLOWOUTPUT, HISTORY, CO, CVP, PCWP, HREKG, HRSAT, HRBP, BP]

- 37 variables
- 509 parameters …instead of 2^54

Success stories for graphical models

- Multiple sequence alignment
- Forensic analysis
- Medical and fault diagnosis
- Speech recognition
- Visual tracking
- Channel coding at Shannon limit
- Genetic pedigree analysis
- …

Graphical models: outline

- What are graphical models?
- Inference
- Structure learning

Probabilistic Inference

- Posterior probabilities: probability of any event given any evidence, P(X|E)
- e.g., in the alarm network, query Earthquake given evidence on Radio and Call
Viterbi decoding

Compute the most probable explanation (MPE) of the observed data

Hidden Markov Model (HMM): hidden states X1 → X2 → X3, observed symbols Y1, Y2, Y3 (e.g., the acoustic observations of the word “Tomato”)


Inference: computational issues

- Easy: chains, trees
- Hard: grids, dense loopy graphs


Many different inference algorithms, both exact and approximate

Bayesian inference

- Bayesian probability treats parameters as random variables
- Learning/parameter estimation is replaced by probabilistic inference: P(θ|D)
- Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

The parameters θ are tied (shared) across repetitions of the data (X1,Y1), …, (Xn,Yn)

Bayesian inference

- + Elegant – no distinction between parameters and other hidden variables
- + Can use priors to learn from small data sets (c.f., one-shot learning by humans)
- - Math can get hairy
- - Often computationally intractable

Why Struggle for Accurate Structure?

Truth: Earthquake → Alarm Set ← Burglary, Alarm Set → Sound, with data such as <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …

- Missing an arc: wrong assumptions about domain structure; cannot be compensated for by fitting parameters
- Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure

Score-based Learning

- Define a scoring function that evaluates how well a structure matches the data (e.g., alternative DAGs over E, B, A)
- Search for a structure that maximizes the score

Learning Trees

- Can find optimal tree structure in O(n² log n) time: just find the max-weight spanning tree
- If some of the variables are hidden, problem becomes hard again, but can use EM to fit mixtures of trees
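The max-weight spanning tree step can be sketched with Kruskal’s algorithm (a minimal sketch; the edge weights below are invented stand-ins for the pairwise mutual informations a Chow-Liu-style learner would use):

```python
# Max-weight spanning tree via Kruskal's algorithm with union-find:
# the core subroutine of optimal tree structure learning.

def max_spanning_tree(n, edges):
    """edges: list of (weight, u, v) over nodes 0..n-1."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):   # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Invented weights standing in for mutual-information scores
tree = max_spanning_tree(4, [(1.0, 0, 1), (0.9, 1, 2), (0.1, 0, 2),
                             (2.0, 2, 3), (0.2, 1, 3)])
```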

Heuristic Search

- Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
- Define a search space:
- search states are possible structures
- operators make small changes to structure
- Traverse space looking for high-scoring structures
- Search techniques:
- Greedy hill-climbing
- Best first search
- Simulated Annealing
- ...
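Greedy hill-climbing over structures can be sketched as follows (an illustrative sketch: the score function is a stand-in for BIC or a Bayesian score, the toy target is invented, and the acyclicity check is omitted for brevity):

```python
# Greedy hill-climbing over graph structures: repeatedly apply the
# single edge addition or deletion that most improves the score.

def neighbours(edges, nodes):
    """Structures one edge-change away."""
    for u in nodes:
        for v in nodes:
            if u != v and (u, v) not in edges and (v, u) not in edges:
                yield edges | {(u, v)}        # add an edge
    for e in edges:
        yield edges - {e}                     # delete an edge

def hill_climb(nodes, score, edges=frozenset()):
    current, current_score = edges, score(edges)
    while True:
        best, best_score = current, current_score
        for cand in neighbours(current, nodes):
            if score(cand) > best_score:
                best, best_score = cand, score(cand)
        if best == current:                   # local maximum reached
            return current, current_score
        current, current_score = best, best_score

# Toy score: closeness to a known target structure (invented)
target = {("A", "B"), ("B", "C")}
learned, best = hill_climb(["A", "B", "C"], lambda es: -len(es ^ target))
```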

Local Search Operations

Typical operations on the current structure (e.g., over nodes C, D, E, S):

- Add C → D: Δscore = S({C,E} → D) − S({E} → D)
- Reverse C → E
- Delete C → E

Problems with local search II: picking a single best model can be misleading

- Small sample size ⇒ many high-scoring models (many structures G with similar posterior P(G|D))
- Answer based on one model often useless
- Want features common to many models

Bayesian Approach to Structure Learning

- Posterior distribution over structures P(G|D)
- Estimate probability of features:
- Edge X → Y
- Path X → … → Y
- …

P(f|D) = Σ_G f(G) P(G|D), where P(G|D) is the Bayesian score for G and f(G) is the indicator function for the feature f (e.g., the edge X → Y)
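Averaging a feature over the posterior can be sketched directly (a minimal sketch; the posterior over three small structures is invented, and real learners never enumerate graphs explicitly):

```python
# P(f|D) = sum_G f(G) P(G|D): average an indicator feature over the
# posterior on graph structures.

def feature_probability(posterior, feature):
    """posterior: dict graph -> P(G|D); feature: indicator on graphs."""
    return sum(p for graph, p in posterior.items() if feature(graph))

# Invented toy posterior over three small structures
posterior = {
    frozenset({("X", "Y"), ("Y", "Z")}): 0.5,
    frozenset({("X", "Y")}): 0.3,
    frozenset({("Y", "X")}): 0.2,
}

def has_edge_xy(graph):
    return ("X", "Y") in graph
```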

Bayesian approach: computational issues

- Posterior distribution over structures

How do we compute the sum over a super-exponential number of graphs?

- MCMC over networks
- MCMC over node-orderings (Rao-Blackwellisation)

Structure learning: other issues

- Discovering latent variables
- Learning causal models
- Learning from interventional data
- Active learning

Discovering latent variables

a) 17 parameters

b) 59 parameters

There are some techniques for automatically detecting thepossible presence of latent variables

Learning causal models

- So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
- However, we often want to interpret directed arrows causally.
- This is uncontroversial for the arrow of time.
- But can we infer causality from static observational data?

Learning causal models

- We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
- See books by Pearl and Spirtes et al.
- However, we can only learn up to Markov equivalence, no matter how much data we have.

The four candidate structures over X, Y, Z: X → Y → Z, X ← Y ← Z, and X ← Y → Z are Markov equivalent; the v-structure X → Y ← Z is not.

Learning from interventional data

- The only way to distinguish between Markov equivalent networks is to perform interventions, e.g., gene knockouts.
- We need to (slightly) modify our learning algorithms.

Example: smoking → yellow fingers. Cut arcs coming into nodes which were set by intervention:

P(smoker | do(paint yellow)) = prior

P(smoker | observe(yellow)) >> prior
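The required modification is just graph surgery, which can be sketched in two lines (the single-edge smoking example is from the slides; the function name is mine):

```python
# Graph surgery for interventions: do(X = x) cuts the arcs coming into X,
# since a variable set by intervention no longer listens to its parents.

def intervene(edges, node):
    """Return the mutilated graph with all arcs into `node` removed."""
    return {(u, v) for (u, v) in edges if v != node}

causal = {("smoking", "yellow_fingers")}
painted = intervene(causal, "yellow_fingers")   # do(paint fingers yellow)
```

After the cut, yellow fingers carry no information about smoking, matching P(smoker | do(paint yellow)) = prior.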

Active learning

- Which experiments (interventions) should we perform to learn structure as efficiently as possible?
- This problem can be modeled using decision theory.
- Exact solutions are wildly computationally intractable.
- Can we come up with good approximate decision making techniques?
- Can we implement hardware to automatically perform the experiments?
- “AB: Automated Biologist”

Learning from relational data

Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?

Learning from relational data: approaches

- Probabilistic relational models (PRMs)
- Reify a relationship (arc) between nodes (objects) by making it into a node (hypergraph)
- Inductive Logic Programming (ILP)
- Top-down, e.g., FOIL (generalization of C4.5)
- Bottom up, e.g., PROGOL (inverse deduction)

ILP for learning protein folding: input

Positive (“yes”) and negative (“no”) example proteins, each described by ~100 conjuncts about its structure, e.g.:

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

ILP for learning protein folding: results

- PROGOL learned the following rule to predict if a protein will form a “four-helical up-and-down bundle”:
- In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix”

ILP: Pros and Cons

- + Can discover new predicates (concepts) automatically
- + Can learn relational models from relational (or flat) data
- - Computationally intractable
- - Poor handling of noise

The future of machine learning for bioinformatics

[Diagram: the biological literature and prior knowledge feed a learner; the learner proposes hypotheses; hypotheses drive experiment design; experiments in the real world produce replicated data that return to the learner]

- “Computer assisted pathway refinement”

Decision trees

[Example tree: test “blue?”, then “oval?”, then “big?”, with yes/no answers at the leaves]

+ Handles mixed variables

+ Handles missing data

+ Efficient for large data sets

+ Handles irrelevant attributes

+ Easy to understand

- Predictive power

Feedforward neural network

[Diagram: input layer → hidden layer → output]

- Handles mixed variables

- Handles missing data

- Efficient for large data sets

- Handles irrelevant attributes

- Easy to understand

+ Predictive power

Nearest Neighbor

- Remember all your data
- When someone asks a question,
- find the nearest old data point
- return the answer associated with it
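The three steps above fit in one function (a minimal 1-nearest-neighbour sketch with invented toy data):

```python
# Nearest neighbour: remember all the data; answer a query with the
# label of the closest stored point (squared Euclidean distance).

def nearest_neighbor(train, query):
    """train: list of (point, label) pairs."""
    point, label = min(train,
                       key=lambda pl: sum((a - b) ** 2
                                          for a, b in zip(pl[0], query)))
    return label

# Invented toy data
train = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
```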

Nearest Neighbor


- Handles mixed variables

- Handles missing data

- Efficient for large data sets

- Handles irrelevant attributes

- Easy to understand

+ Predictive power

Support Vector Machines (SVMs)

- Two key ideas:
- Large margins are good
- Kernel trick

SVM: mathematical details

- Training data : l-dimensional vector with flag of true or false

- Separating hyperplane :

- Margin :

- Inequalities :

- Support vector expansion:

- Support vectors :

- Decision:

Kernel trick: replace all inner products with a kernel function
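The resulting decision rule, in its support-vector expansion, can be sketched as follows (an illustrative sketch: the support vectors, alphas and bias are invented, not the solution of an actual QP):

```python
import math

# SVM decision rule in the support-vector expansion:
#   f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
# where the sum runs only over the support vectors.

def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(support_vectors, alphas, labels, b, x, kernel=rbf_kernel):
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if score > 0 else -1

# Invented support set: one vector per class
svs, alphas, labels, bias = [(0.0, 0.0), (2.0, 2.0)], [1.0, 1.0], [-1, 1], 0.0
```

Swapping `rbf_kernel` for a plain dot product recovers the linear SVM; that substitution is the whole kernel trick.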

SVMs: summary

- Handles mixed variables

- Handles missing data

- Efficient for large data sets

- Handles irrelevant attributes

- Easy to understand

+ Predictive power

General lessons from SVM success:

- Kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information

- Large margin classifiers are good

Boosting: summary

- Can boost any weak learner
- Most commonly: boosted decision “stumps”

+ Handles mixed variables

+ Handles missing data

+ Efficient for large data sets

+ Handles irrelevant attributes

- Easy to understand

+ Predictive power

Supervised learning: summary

- Learn mapping F from inputs to outputs using a training set of (x,t) pairs
- F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
- Algorithms offer a variety of tradeoffs
- Many good books, e.g.,
- “The elements of statistical learning”, Hastie, Tibshirani, Friedman, 2001
- “Pattern classification”, Duda, Hart, Stork, 2001

Assumption needed to makelearning work

- We need to assume “Future futures will resemble past futures” (B. Russell)
- Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.

Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al 2000]

- 600 genes
- 300 experiments

Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: Biological sequences

Human CGTTGC…

Chimp CCTAGG…

Orang CGAACG…

….

Output: a phylogeny

Uses structural EM, with max-spanning-tree in the inner loop


Instances of graphical models

- Probabilistic models ⊃ graphical models
- Directed graphical models (Bayes nets): Naïve Bayes classifier, mixtures of experts, Hidden Markov Models (HMMs), Kalman filter models, DBNs
- Undirected graphical models (MRFs): Ising model

ML enabling technologies

- Faster computers
- More data
- The web
- Parallel corpora (machine translation)
- Multiple sequenced genomes
- Gene expression arrays
- New ideas
- Kernel trick
- Large margins
- Boosting
- Graphical models
- …
