- 98 Views
- Uploaded on
- Presentation posted in: General

Parallel Bayesian Phylogenetic Inference

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Parallel Bayesian Phylogenetic Inference

Xizhou Feng

Directed by Dr. Duncan Buell

Department of Computer Science and Engineering

University of South Carolina, Columbia

2002-10-11

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

Darwin: Species are related through a history of common descent, and

The history can be organized as a tree structure (phylogeny).

Modern species are put on the leaf nodes

Ancient species are put on the internal nodes

The time of the divergence is described by the length of the branches.

A clade is a group of organisms whose members share homologous features derived from a common ancestor.

Internal node

Ancestral species

Clade

Leaf node

Current species

Branch length

Branch

- Phylogenetic tree is
- fundamental to understand evolution and diversity
- Principle to organize biological data
- Central to organism comparison

- Practical examples
- Resolve quarrel over bacteria-to-human gene transfers (Nature 2001)
- Tracing route of infectious disease transmission
- Identify new pathogens
- Phylogenetic distribution of biochemical pathways

output

input

Major objectives include:

- Estimate the tree topology
- Estimate the branch Length, and
- Describe the credibility of the result

- Algorithmic methods
- Defining a sequence of specific steps that lead to the determination of a tree
- e.g. UPGMA (unweighted pair group method using arithmetic average) and Neighbor-Joining

- Optimality criterion-based methods
- 1) Define a criterion; 2)Search the tree with best values
- Maximum parsimony (minimize the total tree length)
- Maximum likelihood (maximize the likelihood)
- Maximum posterior probability (the tree with the highest probability to be the true tree)

Algorithm

Statistical

Supported

Search Strategy

Bayesian

Methods

Stochastic

search

Maximum

Likelihood

Optimization

method

Divide &

Conquer

Maximum

Parsimony

Greedy search

GA, SA

MCMC

Exact search

Fitch-Margolish

DCM, HGT, Quartet

Algorithmic

method

Stepwise addition

Global arrangement

Star decomposition

Exhaustive

Branch & Bound

Neighbor-join

UPGMA

Distance matrix

Data set

Character data

- Robustness
- If the model or assumption or the data is not exact correct, how about the result?

- Convergence rate
- How long a sequence is needed to recover the true tree?

- Statistical support
- With what probability is the computed tree the true tree?

- Accuracy
- Is the constructed tree a true tree?
- If not, the percentage of the wrong edges?

- Complexity
- Neighbor-Join O(n3)
- Maximum Parsimony (provably NP hard)
- Maximum Likelihood (conjectured NP hard)

- Scalability
- Good for small tree, how about large tree

>1.7 million known species

Number of trees increase exponentially as new species was added

The complex of evolution

Data Collection & Computational system

Compute the tree of life

Source: http://www.npaci.edu/envision/v16.3/hillis.html

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

- Both observed data and parameters of models are random variables
- Setting up the joint distribution

Topology

Branch length

Parameter of models

When data D is known, Bayes theory gives:

likelihood

Prior probability

Posterior probability

Unconditional Probability of data

- P(T|D) can be interpreted as the probability of the tree is correct
- We need to do at least two things:
- Approximate the posterior probability distribution
- Evaluate the integral for P(T|D)

- These can be done via Markov Chain Monte Carlo Method

Having the posterior probability distribution , we can compute the marginal probability of T as:

- The basic idea of MCMC is:
- To construct a Markov chain such that:
- Have the parameters as the state space, and
- the stationary distribution is the posterior probability distribution of the parameters

- Simulate the chain
- Treat the realization as a sample from the posterior probability distribution

- To construct a Markov chain such that:
- MCMC = sampling + continue search

- A Markov chain is a sequence of random variables {X0, X1, X2, …} whose transition kernel T(Xt, Xt+1) is determined only by the value of Xt (t>0).
- Stationary distribution: (x)=Σx’((x’)T(x,x’)) is invariant
- Ergodic property: pn(x) converges to (x) as n
- A homogeneous Markov chain is ergodic if min(T(x,x’)/ (x’)>0

- Cornerstone of all MCMC methods, Metropolis(1953)
- Hasting proposed a generalized version in (1970)
- The key point is to how to define the accepted probability:
- Metropolis:
- Hasting:

Proposal probability

Can be any form

Such that

- Initialize x0, set t=0
- Repeat :
- Sample x’ from T(xt,x’)
- Draw U~uniform[0,1]
- Update

- Problems:
- Mixing rate is slow when:
- Small step->low movement
- Larger step->low acceptance

- Stopped at local optima
- Dimension of state space may vary

- Mixing rate is slow when:
- Improvement:
- Metropolis-coupled MCMC
- Multipoint MCMC
- Population-based MCMC
- Time-reversible jump MCMC

- Run several MCMC chains with different distribution i(x) (i=1..m) in parallel
- 1(x) is used to sampling
- i(x) (i=2..m) are used to improve mixing
- For example: i(x) = (x)1/(1+(I-1))

- After each iteration, attempt to swap the states of two chains using a Metropolis-Hasting step with acceptance probability of

1(x)T1=0

2(x) T2=2

3(x) T3=4

4(x) T4=8

Metropolis-coupled MCMC is also called Parallel tempering

Sample from T(y, .)

y1

x1*

y2

x2*

xt

y

xt+1

y3

x3*

y4

x4*

Choose y= yi

Accept y or keep xt using a M-H step

Sample from T(xt, .)

- Metropolis-coupled MCMC uses a minimal interaction between multiple chains, why not more active interaction
- Evolutionary Monte Carlo (Liang et al 2000)
- Combine Genetic Algorithm with MCMC
- Used to Simulate protein folding

- Conjugate Gradient Monte Carlo (Liu et al 2000)
- Use local optimization for adaptation
- An improvement of ADS (Adaptive Direction Sampling)

- Evolutionary Monte Carlo (Liang et al 2000)

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Choose the evolutionary model
- Compute the likelihood
- Design proposal mechanisms

- Parallel implementation
- Numerical result
- Future research

A

G

C

T

Purine

transition

transversion

Pyrimidine

- Consider inference of un-rooted tree and the computational complications, some simplified models are used (see next slide)

GTR

- GTR: general time-reversible model, corresponding to a symmetric rate matrix.

Three substitution type

(1 transversion, 2 transition)

Equal base frequencies

TN93

SYM

Two substitution types

(transition v.s. tranversion)

Three substitution type

(1 transversion, 2 transition)

HKY85

F84

K3ST

Equal base frequencies

Two substitution types

(transition v.s. tranversion)

Single substitution type

K2P

F81

Equal base frequencies

Single substitution type

JC69

Substitute rates vary across sites

Invariable sites models + gamma distribution

Correlation in the rates at adjacent sites

Codon models 61X61 instantaneous rate matrix

Secondary structure models

a

t

b

- Given substitution rate matrix, how to compute p(b|a,t)-the probability of a is substituted by b after time t

Eigenvalue of Q

x5

t4

x4

t3

t2

t1

x2

x1

x3

When x4 x5 are known,

When x4 x5 are unknown,

Given a rooted tree with n leaf nodes (species) , and each leaf node is represented by a sequence xi with length N, the likelihood of a rooted tree is represented as:

- Felsenstein’s algorithms for likelihood calculation(1981)
- Initiation:
- Set k=2n-1

- Recursion: Compute for all a as follows
- If k is a leaf node:
- Set if ;
- Set if .

- If k is not a leaf node:
- compute for all a its children nodes i, j.
- And set

- If k is a leaf node:
- Termination: Likelihood at site u is

- Initiation:

Note: algorithm modified from Durbin et al (1998)

rate 1

rate 2

rate r

A

1.0

1.0

1.0

C

0.0

0.0

0.0

G

0.0

0.0

0.0

T

0.0

0.0

0.0

Site 1

Site 2

Site m-1

Site m-2

Taxa-1

Taxa-2

Taxa-3

Taxa-n

- The likelihood calculation requires filling an N X M X S X R table
- N: number of sequences M: number of sites
- S: number of state of charactersR: number of rate categories

- If we just change the topology and branch length of tree locally, we only need refresh the table at those affected nodes.
- In the following example, only the nodes with red color need to change their conditional likelihood.

Original tree

Proposed tree

(2) Change m and y randomly

(1)Choose a backbone

c

c

v

d

m*

m

y

u

c

u

c

x

x*

v

d

y*

a

a

- Stochastic Nearest-neighbor interchange (NNI)
- Larget et al (1999) Huelsenbeck (2000)

k+

k-

Min

k

k*

Max

Larget et al. (1999)

- Independent parameter
- e.g. transition/tranversion ratio k

- A set of parameters constrained to sum to a constant
- e.g. base frequency distribution
- Draw a sample from the Dirichlet distribution

Prior probability

DNA Data

Likelihood

Evolutionary model

Phylogenetic tree

Posterior prob.

Proposal

Starting tree

inference

A sequence of Samples

MCMC

Approximate the distribution

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Challenges of serial computation
- Difficulty:
- MCMC is a serial algorithm
- Multiple chains need to be synchronized

- Choose appropriate grid topology
- Synchronize using random number

- Numerical result
- Future research

- Computing global likelihood needs O(NMRS2) multiplications
- Local updating topology & branch length needs O(MRS2log(N))
- Updating model parameter needs O(NMRS2)
- local update needs all required data in memory
- Given N=1000 species, each sequence has M=5000 sites, rate category R=5, and DNA nucleotide model S=4
- Run 5 chains each with length of 100 million generations
- Needs ~400 days (assume 1% global updates, 99% local update)
- And O(NMRSLX2X2X8)~32Gigabyte memory

- =>So until more advanced algorithms are developed, parallel computation is the direct solution.
- Use 32 processor with 1 gigabyte memory, we can compute the problem in ~2 weeks

- Balancing workload
- Concurrency identify, manage, and granularity

- Reducing communication
- Communication-to-computation ratio
- Frequency, volume, balance

- Reducing extra work
- Computing assignment
- Redundant work

Generate initial state S0,

S(t)= S(0)=, t=0

Propose new state S’

Evaluate S’

Compute R and U

No

U <R?

Yes

S(t+1)=S’

S(t+1)=S(t)

t=t+1

No

Yes

t>max generation

End

Chain #1

Chain #2

Chain #3

Chain #4

Generate S1(0)

t=0

Generate S2(0)

t=0

Generate S3(0)

t=0

Generate S4(0)

t=0

Propose &

Update S1(t)

Propose &

Update S2(t)

Propose &

Update S3(t)

Propose &

Update S4(t)

choose two chains to swap

Compute R and U

U<R

Swap the two selected chains

t=t+1

t=t+1

t=t+1

t=t+1

chain

rate

site

- In one likelihood evaluation, the task can be divided into n_chain x n_site x n_rate concurrent tasks

y

z

x

Concurrent tasks

Processors virtual topology

We organize the processor pool into an nx X ny X nz 3D grid

Each grid runs n_chain/ny chains, each chain has n_site/nx sites, and each site has n_rate/nz category rate. (my current implementation dosen’t consider rate dimension, so the topology is a 2D grid).

Choose nx, ny, and nz such that each processor has equal load

- Chain-level parallel configuration (Mode I)
- Each chain runs on one row
- Number of chains = number of rows

- Site-level parallel configuration (Mode II)
- All chains run on the same row
- Different subsequences run on different columns
- Number of subsequences = number of columns

- Hybrid parallel configuration (Mode III)
- Combine the first two configurations
- Several chains may run on the same row and also
- Divide the whole into segments

- The data can be distributed at chain level and sub-sequence level, spatial locality is always preserved. At the same time, each process only needs 1/(nxXny) original memory
- Each step, each processor needs 1 collective communication along its row and at most 2 (or 1) direct communication if it is one of the selected chain for swapping.
- Processors at the same row can easily be synchronized because they use the same random number.

- Processors at different rows need some trick to synchronize without communication
- The trick is to use two random number generators
- the first one is used to generate the proposal, processors at different row use different seeds for random number generator.
- The second one is used to synchronize processors in different rows, using the same seed. So processors know what’s going on in the whole grid.

- The trick is to use two random number generators
- To reduce the size of the message, the selected processors swap message in two steps:
- Swap current likelihood and temperature
- Compute the acceptance ratio r and draw a random number U, if U<r, then swap the tree and parameters else continue to next step.

- Build grid topology
- Initialize:
- Set seed for random number Rand_1(for proposal) and Rand_2(for synchronize)
- For each chain
- t = 0
- Building random tree T0
- Choose initial parameter θ0
- Set initial state as x0

- While (t<max_generation) do
3.1In-chain update

- For each chain
- Choose proposal type using Rand_2
- Generate the proposal x’
- Evaluate log likelihood locally

- Collect the sum of the local log likelihood
- For each chain
- Compute r using hasting transition kernel
- Draw a sample u from uniform distribution using Rand_1
- If u<r set x=x’
(continue)

- For each chain

- 3.2 inter-chain swap
- Draw two chain indexes I and J from Rand_2
- If I==J go to 3.3
- Map I, J to the processor coordinate row_I and row_J
- If row_I == row_J
- Compute acceptance ratio r’
- Draw u’ from uniform distribution using Rand_2
- If u’<r’ Swap state x and the temperature of chain I and chain J
- Go to 3.3

- If row_I != row_J
- Swap log likelihood and temperature of chain I and chain J between processors
- on chain row_I and row_J
- Compute acceptance r’
- Draw u’ from from uniform distribution using Rand_2
- If u’<r’ Swap state x of chain I and chain J between between processors
- on chain row_I and row_J

- 3.3 t=t+1
- 3.4 if t%sample_cycle==0 output state x

The ith chain only needs to

re-evaluate one node

The jth chain needs to

re-evaluate two nodes

Assume two chains have

the same tree at time t

- For simplicity, from now we just consider 2D grid (x-y).
- Along x-axis, the processors are synchronized since they evaluate the same topology, same branch length, and same local update step
- But along y-axis, different chains will propose different topologies, so according to local likelihood evaluation, imbalance will occur.
- In the parameter update step, we need global likelihood evaluation, balancing is not a problem.

Row group 1

mediator

Row group 2

Row group 3

- The symmetric algorithm assumes each processor has the same performance and deals with the imbalance between different rows by waiting and synchronization.
- An improvement is to use an asymmetric algorithm. The basic idea is:
- Introduce a processor as the mediator
- Each row send its state to the mediator after it finished in-chain update
- The mediator keeps the current states of different rows, and decides which two chains should swap
- The selected chains get new states from the mediator

- Asymmetric algorithm can solve the unbalance problem, but assumes that different chains don’t need to synchronize.

- Background
- Bayesian phylogenetic inference and MCMC
- Serial implementation
- Parallel implementation
- Numerical result
- Future research

Dataset: 7X2439

# of chains=4

#generation=10000

Parallel Mode: I

Dataset: 7X2439

# of chains=4

#generation=10000

Parallel Mode: I

Note: Use the same dataset as the previous one. But only parallel at sequence level.

- Bayesian phylogenetic inference
- Using MCMC sample posterior probability
- Efficient parallel strategies
- Stochastic algorithms vs. NP-problems

- Improved parallel strategies
- Connection between Bayesian & Bootstrap
- Performance of different Bayesian methods
- Alternative MCMC methods
- MCMC for other computational biology problems
- The power of stochastic algorithms
- Super tree construction