
Gene Expression Analysis Using Bayesian Networks


Presentation Transcript


  1. Éric Paquet LBIT Université de Montréal Gene Expression Analysis Using Bayesian Networks

  2. Biological basis RNA Polymerase (copies DNA into RNA) DNA (storage of genetic information) Ribosome (translates genetic information into proteins) mRNA (storage & transport of genetic information) Proteins (expression of genetic information) *-PDB file 1L3A, Transcriptional Regulator Pbf-2

  3. Biological basis • How are proteins regulated? • The E. coli lactose operon example: • Normally, E. coli uses glucose for energy, but how does it react when there is no more glucose, only lactose?

  4. Biological basis [Figure: the E. coli environment with glucose and lactose; genes for the lactose getter (permease) and the lactose decomposer (β-galactosidase).] The RNA polymerase's action is blocked because the lacI gene's associated protein acts as a DNA lock.

  5. Biological basis [Figure: same setting, glucose now absent.] Lactose recruits the lacI gene's associated protein, unlocking the DNA, which then becomes accessible to the polymerase.

  6. Biological basis [Diagram: regulatory links between the lactose getter (permease) and the lactose decomposer (β-galactosidase); legend: ⊣ = inhibit]

  7. Biological basis [Figure: E. coli environment with lactose present and glucose absent; c-AMP accumulates and CAP binds the DNA.] In the absence of glucose, a polymerase magnet (c-AMP-bound CAP) binds the DNA to accelerate production of the proteins that help decompose lactose.

  8. Research goal: infer these links [Diagram: the lactose regulatory network over the lactose decomposer (β-galactosidase) and the lactose getter (permease); legend: ⊣ = inhibit, → = activate]

  9. Why? • Get insights into cellular processes • Help understand diseases • Find drug targets

  10. How? Using gene expression data and tools for learning Bayesian networks. [Figure: an experiments × [mRNA] data matrix* feeding into tools for learning Bayesian networks, which output the network over the lactose decomposer (β-galactosidase) and the lactose getter (permease).] *-Spellman et al. (1998) Mol Biol Cell 9:3273-97

  11. What is gene expression data? Data showing the concentration of a specific mRNA at a given time in the cell's life. [Figure: experiments × [mRNA] matrix.*] Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (−) than its normal value. Each column is the result of one image. *-Spellman et al. (1998) Mol Biol Cell 9:3273-97

  12. What is a Bayesian network? A graphical representation of a joint distribution over a set of random variables. For the example network over A, B, C, D, E: P(A,B,C,D,E) = P(A)*P(B)*P(C|A)*P(D|A,B)*P(E|D). Nodes represent gene expression levels while edges encode the interactions (e.g. inhibition, activation).
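The factorization above can be checked numerically. A minimal sketch with made-up binary conditional probability tables (the numbers are illustrative, not from the talk):

```python
# Sketch: evaluating the factorization
#   P(A,B,C,D,E) = P(A) * P(B) * P(C|A) * P(D|A,B) * P(E|D)
# with illustrative binary CPTs (all probabilities here are made up).

p_a = {0: 0.7, 1: 0.3}
p_b = {0: 0.6, 1: 0.4}
p_c_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}              # P(C|A)
p_d_given_ab = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}   # P(D|A,B)
p_e_given_d = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}              # P(E|D)

def joint(a, b, c, d, e):
    """P(A=a, B=b, C=c, D=d, E=e) under the network's factorization."""
    return (p_a[a] * p_b[b] * p_c_given_a[a][c]
            * p_d_given_ab[(a, b)][d] * p_e_given_d[d][e])

# The joint must sum to 1 over all 2^5 assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(total)  # ≈ 1.0
```

The point of the factorization is economy: five small tables replace one table of 2^5 joint entries, which is what makes learning from limited expression data feasible.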

  13. A small problem with Bayesian networks A Bayesian network must be a DAG (Directed Acyclic Graph), but there are many examples of regulatory networks with directed cycles* (e.g. the transcription factor dimer, the hysteretic oscillator, the switch). *-Husmeier D., Bioinformatics, Vol. 19 no. 17, 2003, pages 2271–2282

  14. How can we deal with that? Using DBNs (Dynamic Bayesian Networks*) and sequential gene expression data: we unfold the network in time, so each variable A becomes A1 at time t and A2 at time t+1 (likewise B1, B2). A DBN is a BN with constraints on parent and child nodes. *-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks

  15. What are we searching for? • A Bayesian network that is most probable given the data D (gene expression) • We find this BN as: • BN* = argmaxBN{P(BN|D)}, where P(BN|D) = P(D|BN) P(BN) / P(D): P(D|BN) is the marginal likelihood, P(BN) the prior on the network structure, and P(D) the data probability. Naïve approach to the problem: try all possible DAGs and keep the best one!

  16. It is impossible to try all possible DAGs because • the number of DAGs increases super-exponentially with the number of nodes • n = 3 → 25 DAGs • n = 4 → 543 DAGs • n = 5 → 29281 DAGs • n = 6 → 3781503 DAGs • n = 7 → 1138779265 DAGs • n = 8 → 783702329343 DAGs • … We are interested in problems with around 60 nodes.
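The counts above follow Robinson's recurrence for the number of labelled DAGs; a short sketch that reproduces the slide's numbers:

```python
# Sketch: number of labelled DAGs on n nodes, via Robinson's recurrence
#   a(n) = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k*(n-k)) * a(n-k),  a(0) = 1
# (k counts the nodes with no incoming edge).
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def num_dags(n):
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(3, 9):
    print(n, num_dags(n))  # 25, 543, 29281, 3781503, 1138779265, 783702329343
```

At n = 60 the count is astronomically large, which is why the talk turns to heuristic search instead of enumeration.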

  17. Learning Bayesian networks from data: choosing a search-space method and a conditional distribution representation. • Network space search methods: • Greedy hill-climbing • Beam search • Stochastic hill-climbing • Simulated annealing • MCMC simulation • Conditional distribution representations: • Linear Gaussian • Multinomial, binomial [Figure: a network over A, B, C with parameters P(a), P(b), P(c|a,b) to estimate.] The search basically adds, removes and reverses edges.
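The add/remove/reverse moves can be sketched as a tiny greedy hill-climber, the first search method listed above. Here `score` is a hypothetical stand-in for log P(D|BN)P(BN), not the score used in the talk:

```python
# Sketch of greedy hill-climbing over DAG structures. Edges are
# (parent, child) tuples; moves add, remove or reverse a single edge,
# always keeping the graph acyclic.
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def neighbours(nodes, edges):
    """All DAGs one add / remove / reverse move away."""
    for e in edges:
        yield edges - {e}                               # remove
        rev = (edges - {e}) | {(e[1], e[0])}            # reverse
        if is_acyclic(nodes, rev):
            yield rev
    for u, v in permutations(nodes, 2):                 # add
        if (u, v) not in edges and is_acyclic(nodes, edges | {(u, v)}):
            yield edges | {(u, v)}

def hill_climb(nodes, score, start=frozenset()):
    current = frozenset(start)
    while True:
        best = max(neighbours(nodes, current), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = frozenset(best)

# Toy score (illustrative only): +1 per edge of a hypothetical "true"
# set, -1 per spurious edge; the climber recovers exactly the true set.
true_edges = {("A", "C"), ("B", "C")}
score = lambda edges: sum(1 if e in true_edges else -1 for e in edges)
print(hill_climb(["A", "B", "C"], score))
```

Greedy hill-climbing stops at the first local maximum; the stochastic variants and MCMC in the list exist precisely to escape such maxima.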

  19. How do we discretize expression levels into three types? Sort the values: -1.06 -0.12 0.18 0.21 1.16 1.19. Split the sorted data into 3 equal buckets, labelled 0 (lowest third), 1 (middle third) and 2 (highest third). Discretized data (in the original order): 0 0 2 2 1 1
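The equal-frequency bucketing above can be sketched in a few lines; the input ordering below is one ordering consistent with the slide's discretized output:

```python
# Sketch of equal-frequency discretization into three levels: each value
# is labelled 0, 1 or 2 by which third of the sorted order it falls in
# (under-expressed, baseline, over-expressed).
def discretize_3(values):
    order = sorted(values)
    third = len(values) // 3
    def label(v):
        rank = order.index(v)       # position in the sorted order
        return min(rank // third, 2)
    return [label(v) for v in values]

print(discretize_3([-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]))
# → [0, 0, 2, 2, 1, 1], matching the slide's example
```

Equal-frequency (rather than equal-width) buckets guarantee each of the three levels is observed equally often, which keeps the multinomial counts balanced for the marginal-likelihood computation later.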

  20. Back to the score: P(BN|D) = P(D|BN) P(BN) / P(D), with P(D|BN) the marginal likelihood, P(BN) the prior on the network structure, and P(D) the data probability.

  21. Insight on each term • P(BN) → prior on the network • In our research, we always use a prior equal to 1 • We could incorporate knowledge through it • E.g.: we know an edge is present. If the edge is in the BN, P(BN) = 1, else P(BN) = 0 • Efforts are also made to reduce the search space by using knowledge, e.g. limiting the number of parents or children

  22. Insight on each term • P(D|BN) → marginal likelihood • Easy to calculate using a multinomial distribution with a Dirichlet prior * *-Heckerman, A Tutorial on Learning With Bayesian Networks and Neapolitan, Learning Bayesian Networks
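A sketch of why this is "easy": for one multinomial variable with a Dirichlet prior, the marginal likelihood has a closed form in Gamma functions (one such factor per node and parent configuration makes up the full score described in Heckerman's tutorial):

```python
# Sketch: closed-form marginal likelihood of counts N_1..N_r under a
# multinomial with a Dirichlet(alpha_1..alpha_r) prior:
#   P(D) = Gamma(A)/Gamma(A+N) * prod_k Gamma(alpha_k+N_k)/Gamma(alpha_k)
# with A = sum_k alpha_k and N = sum_k N_k. Computed in log space.
from math import lgamma, exp

def log_marginal_likelihood(counts, alphas):
    A, N = sum(alphas), sum(counts)
    out = lgamma(A) - lgamma(A + N)
    for n_k, a_k in zip(counts, alphas):
        out += lgamma(a_k + n_k) - lgamma(a_k)
    return out

# Sanity check: with two outcomes and a uniform Dirichlet(1, 1) prior,
# a single observation of either outcome has marginal probability 1/2.
print(exp(log_marginal_likelihood([1, 0], [1.0, 1.0])))  # ≈ 0.5
```

Working in log space with `lgamma` avoids overflow: the raw Gamma values explode even for modest datasets, but the log-score stays well-behaved.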

  23. MCMC (Markov Chain Monte Carlo) simulation • Markov chain part: • Zoom in on one node of the chain [Figure: from the current three-node network over A, B, C, each neighbouring structure (one edge added, removed or reversed) is proposed with probability P(BNnew) = 1/5; the invalid candidate gets probability 0.]

  24. MCMC (Markov Chain Monte Carlo) simulation • Monte Carlo part: • Choose the next BN with probability P(BNnew) • Accept the new BN with the Metropolis–Hastings acceptance criterion: min{1, [P(D|BNnew) P(BNnew) Q(BNold|BNnew)] / [P(D|BNold) P(BNold) Q(BNnew|BNold)]}, where Q is the proposal probability.

  25. Monte Carlo part example: Choose a random move, each having P(BNnew) = 1/5. Draw another random number; if it is smaller than the Metropolis–Hastings criterion, accept BNnew, else return to BNold. [Figure: the five candidate moves on the A, B, C network, each with probability 1/5.]
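The accept/reject step can be sketched generically. Here `log_score` stands for log P(D|BN)P(BN), and the proposal Q(new|old) is uniform over the neighbour structures returned by a hypothetical `moves` function (five moves in the slide's example):

```python
# Sketch of one Metropolis-Hastings step over network structures.
import math
import random

def mh_step(current, log_score, moves):
    proposal = random.choice(moves(current))
    # Hastings correction: neighbourhood sizes may differ between
    # the current and proposed structures, so Q is not symmetric.
    log_q_ratio = math.log(len(moves(current))) - math.log(len(moves(proposal)))
    log_accept = min(0.0, log_score(proposal) - log_score(current) + log_q_ratio)
    if random.random() < math.exp(log_accept):
        return proposal   # accept BN_new
    return current        # reject: stay at BN_old
```

A proposal that improves the score is always accepted; a worse one is accepted with probability equal to the (corrected) score ratio, which is what lets the chain escape local maxima and sample networks in proportion to P(BN|D).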

  26. MCMC (Markov Chain Monte Carlo) simulation recap: Choose a starting BN at random. Burn-in phase: generate 5*N BNs from the MCMC without storing them. Storing phase: sample 100*N BN structures from the MCMC. [Plot: log(P(D|BN) P(BN)) vs. iteration, showing the burn-in phase followed by the storing phase.]

  27. Why 100*N BNs and not only 1? Because we don't have enough data and there are many high-scoring networks. Instead, we associate a confidence with each edge, e.g.: how many times in the sample does the edge from A to B appear? We can then fix a confidence threshold and retrieve a global network built from the edges whose confidence is over the threshold.
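The edge-confidence idea can be sketched directly; the sampled networks below are a made-up miniature sample, with each network represented as a set of (parent, child) edges:

```python
# Sketch: edge confidence = fraction of sampled networks containing the
# edge; thresholding the confidences yields a consensus network.
from collections import Counter

def edge_confidence(sampled_networks):
    counts = Counter(e for net in sampled_networks for e in net)
    m = len(sampled_networks)
    return {edge: n / m for edge, n in counts.items()}

def consensus(sampled_networks, threshold):
    conf = edge_confidence(sampled_networks)
    return {e for e, c in conf.items() if c >= threshold}

# Illustrative sample of 4 networks (in practice, the 100*N MCMC draws).
sample = [{("A", "B"), ("B", "C")},
          {("A", "B")},
          {("A", "B"), ("C", "B")},
          {("A", "B"), ("B", "C")}]
print(edge_confidence(sample)[("A", "B")])  # 1.0: present in every draw
print(consensus(sample, threshold=0.5))     # keeps (A,B) and (B,C) only
```

Averaging over many sampled structures in this way is far more robust than trusting the single best-scoring network when the data are scarce.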

  28. What we are working on: • Mixing both sequential and non-sequential data to retrieve interesting relations between genes • How? • Using DBN and MCMC for sequential data + BN and MCMC for non-sequential data [Diagram: 100*N networks from the BN and 100*N networks from the DBN feed an information tuner, which outputs the learned network.]

  29. How to test the approach: Problem: there is no way to test it on real data because no regulatory network is completely known. Solution: work on realistic simulations where we know the network structure. Example:* [Figure: a known 14-node network (nodes 0–13) used to simulate data.] *-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

  30. How to test the approach: [Figure: the known 14-node network* is used to simulate sequential and non-sequential data; DBN MCMC runs on the sequential data and BN MCMC on the non-sequential data, the info tuner combines them, and the result is compared to the true network using ROC curves.] *-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

  31. Test description: • Generate 60 sequential data points • Generate 120 non-sequential data points (roughly the proportion found in real datasets) • Run DBN MCMC on the sequential data, keeping a sample of 100*N networks • Run BN MCMC on the non-sequential data, keeping a sample of 100*N networks • Test performance while weighting the two samples: • 0 BN / 1 DBN • 0.05 BN / 0.95 DBN • … • 0.95 BN / 0.05 DBN • 1 BN / 0 DBN • The metric used is the area under the ROC curve: a perfect learner gets 1.0, a random one 0.5, and the worst one 0.
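The evaluation metric can be sketched via the rank (Mann–Whitney) formulation of the AUC: the probability that a true edge receives a higher confidence than a false one. The confidences and truth set below are made-up illustrations:

```python
# Sketch: area under the ROC curve from per-edge confidences against
# the known simulated network, using the rank formulation
#   AUC = P(score of a true edge > score of a false edge),
# with ties counting 1/2.
def auc(confidences, truth):
    """confidences: {edge: score}; truth: set of real edges."""
    pos = [s for e, s in confidences.items() if e in truth]
    neg = [s for e, s in confidences.items() if e not in truth]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

conf = {("A", "B"): 0.9, ("B", "C"): 0.7, ("C", "A"): 0.2, ("A", "C"): 0.1}
truth = {("A", "B"), ("B", "C")}
print(auc(conf, truth))  # 1.0 -- every true edge outranks every false one
```

Because the AUC sweeps over all thresholds at once, it evaluates the whole ranking of edge confidences rather than one arbitrary cut-off.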

  32. Results: [Plot: area under the ROC curve as the sample weighting varies from 0 BN / 1 DBN to 1 BN / 0 DBN.]

  33. Perspectives: • Working on more sophisticated ways to mix sequential and non-sequential data • Working on real cases: • Yeast cell cycle • Arabidopsis thaliana circadian rhythm • Real data also means missing values • Evaluate missing-value solutions (EM, KNNimpute)

  34. Acknowledgements: François Major

  35. Experimental problems Low correlation Why are there missing data?

  36. ROC Curve Receiver Operating Characteristic curve * *-http://gim.unmc.edu/dxtests/roc2.htm

  37. MCMC simulation and number of sampled networks
