
Genome evolution: a computational approach


Presentation Transcript


  1. Genome evolution: a computational approach Lecture 1: Modern challenges in evolution. Markov processes. Amos Tanay, Ziskind 204, ext 3579 עמוס תנאי amos.tanay@weizmann.ac.il http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/

  2. The Genome. [Diagram: gene structure along the chromosome: intergenic region, alternating exons and introns, then intergenic region again; the triplet code over A, T, C, G.]

  3. Genome alignment: two strings of 3×10^9 characters over {A,C,G,T}. Humans and chimps: ~5-7 million years of divergence. • Where are the “important” differences? • How did they happen?

  4. [Primate phylogeny: Human, Chimp, Gorilla, Orangutan, Gibbon, Baboon, Macaque, Marmoset, with branch divergences ranging from 0.5% to 9%.] Where are the “important” differences? How were new features gained?

  5. Antibiotic resistance: Staphylococcus aureus. Timeline for the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007) • Skin-based • Killed 19,000 people in the US during 2005 (more than AIDS) • Resistance to penicillin: 50% in 1950, 80% in 1960, ~98% today • 2.9 Mb genome, 30 kb plasmid • How do bacteria become resistant to antibiotics? • Can we eliminate resistance by better treatment protocols, given an understanding of the evolutionary process?

  6. Ultimate experiment: sequence the entire genome of the evolving S. aureus. [Timeline: mutations 1-18 accumulating as resistance to antibiotics increases.] S. aureus needed just a few “right” mutations to survive multiple antibiotics.

  7. Yeast genome duplication • The budding yeast S. cerevisiae genome has extensive duplicates • We can trace a whole-genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gossypii) • Only a small fraction (5%) of the yeast genome remains duplicated

  8. How can an organism tolerate genome duplication and massive gene loss? • Is this critical in evolving new functionality?

  9. “Junk” and ultraconservation • Baker’s yeast: 12 Mb, ~6,000 genes, 1 cell • The worm C. elegans: 100 Mb, ~20,000 genes, ~1,000 cells • Humans: 3 Gb, ~27,000 genes, ~50 trillion cells

  10. [Figure from Lynch 2007.]

  11. ENCODE data. [Diagram: gene structure as in slide 2: intergenic, exons and introns, intergenic.]

  12. Grand unifying theory of everything: Biology (phenotype) ↔ Genomes (genotype), i.e., strings of A,C,G,T. (Total DNA on Earth: a lot, but only that much.)

  13. Evolution: a bird’s-eye view. [Diagram: species A and species B diverging under mutation, recombination, selection, and fitness.] Ecology (many species), Geography (communication barriers), Environment (changing fitness).

  14. Course outline (tools: probability, calculus/matrix theory, some graph theory, some statistics): probabilistic models, genome structure, inference, mutations, parameter estimation, population, inferring selection.

  15. Probabilistic models: Markov chains (discrete and continuous), Bayesian networks, factor graphs. Inference: dynamic programming, sampling, variational methods, generalized belief propagation. Parameter estimation: EM, function optimization. Genome structure: introduction to the human genome. Mutations: point mutations, insertions/deletions, repeats. Population: basic population genetics, drift/fitness/selection. Selection: protein-coding genes, transcription factor binding sites, RNA, networks.

  16. Things you need to know or catch up with: • Graph theory: basic definitions, trees, cycles • Matrix algebra: basic definitions, eigenvalues • Probability: basic discrete probability, standard distributions. What you’ll learn: • Modern methods for inference in complex probabilistic models in general • Intro to genome organization and key concepts in evolution • Inferring selection using comparative genomics. Books: Graur and Li, Molecular Evolution; Lynch, The Origins of Genome Architecture; Hartl and Clark, Population Genetics; Durbin et al., Biological Sequence Analysis; Karlin and Taylor, Markov Processes; Friedman and Koller draft textbook, Bayesian networks and beyond (handouts); papers as we go along.

  17. Course duties • 5 exercises, 40% of the grade • Mainly theoretical, math questions, usually ~120 points to collect • Trade 1 exercise for ppt annotations (extensive in-line notes) • 1 Genomic exercise (in pairs) for 10% of the grade • Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants • Exam: 60% (110% in total)

  18. [Diagram: model parameters generating ancestral genome sequences 1 and 2, which in turn generate genome sequences 1, 2, and 3.] (0) Modeling the genome sequences: probabilistic modeling of P(data | θ), using few parameters to explain/regenerate most of the data; hidden variables make the model explicit and mechanistic. (1) Inferring ancestral genomes: based on some model, compute the distribution of ancestral genomes. (2) Learning an evolutionary model: using extant genomes, learn a “reasonable” model.

  19. [Diagram: as above, with model parameters on the branches.] (1) Decoding the genome: genomic regions with different functions evolve differently; learn to read the genome through evolutionary modelling. (2) Understanding the evolutionary process: the model parameters describe evolution. (3) Inferring phylogenies: which tree structure explains the data best? Is it a tree?

  20. Probabilities • Our probability space: DNA/protein sequences over {A,C,G,T}; time/populations • Queries: • If a locus has an A at time t, what is the chance it will be C at time t+1? • If a locus has an A in an individual from population P, what is the chance it will be C in another individual from the same population? • What is the chance to find the motif ACGCGT anywhere in a random individual of population P? What is the chance it will remain the same after 2M years? • Conditional probability: P(A|B) = P(A,B)/P(B). Chain rule: P(X_1,...,X_n) = ∏_i P(X_i | X_1,...,X_{i-1}). Bayes rule: P(A|B) = P(B|A)P(A)/P(B). (A small numeric sketch follows below.)
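A minimal Python sketch of these identities on a toy joint distribution over two nucleotide sites; the numbers are illustrative, not real genomic statistics:

```python
import numpy as np

# Toy joint distribution P(X, Y) over two nucleotide sites
# (rows: X, columns: Y). Values are illustrative only.
states = ['A', 'C', 'G', 'T']
joint = np.array([
    [0.10, 0.02, 0.05, 0.03],
    [0.02, 0.12, 0.03, 0.04],
    [0.05, 0.03, 0.11, 0.02],
    [0.03, 0.04, 0.02, 0.29],
])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)               # marginal P(X)
p_y = joint.sum(axis=0)               # marginal P(Y)
p_y_given_x = joint / p_x[:, None]    # conditional P(Y|X) = P(X,Y)/P(X)

# Chain rule: P(X,Y) = P(X) P(Y|X)
assert np.allclose(joint, p_x[:, None] * p_y_given_x)

# Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]
print(p_x_given_y[:, states.index('C')])  # P(X | Y = 'C')
```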

  21. Random Variables & Notation • Val(X) – set of possible values of RV X • Upper case letters denote RVs (e.g., X, Y, Z) • Upper case bold letters denote set of RVs (e.g., X, Y) • Lower case letters denote RV values (e.g., x, y, z) • Lower case bold letters denote RV set values (e.g., x)

  22. Stochastic processes and stationary distributions. [Diagram: a process model evolving along time t, versus a stationary model.]

  23. [Diagram of example processes. Discrete time (T=1,...,5): a Markov chain over states A, B, C, D; a random walk over the integers ..., -1, 0, 1, 2, 3, .... Continuous time: a Poisson process counting 0, 1, 2, 3, 4 events along t; Brownian motion.]

  24. The Poisson process. Events occur independently in disjoint time intervals. Let N(t) be an r.v. that counts the number of events up to time t. Assume: P(N(h) = 1) = λh + o(h), and the probability of two or more events in time h is o(h). Now: P(N(h) = 0) = 1 - λh + o(h).

  25. The Poisson process. Probability of m events at time t: write P_m(t) = P(N(t) = m). Conditioning on the last interval of length h: P_m(t+h) = P_m(t)(1 - λh) + P_{m-1}(t)λh + o(h), which in the limit h → 0 gives the recurrence P_0'(t) = -λP_0(t) and P_m'(t) = -λP_m(t) + λP_{m-1}(t).

  26. The Poisson process. Solving the recurrence: P_m(t) = e^{-λt}(λt)^m / m!, the Poisson distribution with mean λt. (A numeric sanity check follows below.)
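A minimal sketch (Python, with an illustrative rate λ and horizon t) comparing the closed form with a Monte Carlo simulation that accumulates exponential inter-arrival times:

```python
import numpy as np
from math import exp, factorial

lam, t = 0.8, 3.0   # illustrative rate and time horizon

def poisson_pmf(m, lam, t):
    """Closed form P(N(t) = m) = e^{-lam t} (lam t)^m / m!"""
    return exp(-lam * t) * (lam * t) ** m / factorial(m)

# Monte Carlo check: events arrive with i.i.d. Exp(lam) gaps;
# count how many fall before time t.
rng = np.random.default_rng(0)
n_trials = 100_000
counts = np.empty(n_trials, dtype=int)
for i in range(n_trials):
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)
        if total > t:
            break
        n += 1
    counts[i] = n

for m in range(5):
    print(m, round(poisson_pmf(m, lam, t), 4), round(float(np.mean(counts == m)), 4))
```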

  27. Markov chains. A set of states: finite or countable (e.g., the integers, or {A,C,G,T}). Discrete time: T = 0,1,2,3,.... General stochastic process: P(X_{T+1} = x_{T+1} | X_T = x_T, ..., X_0 = x_0). The Markov property: P(X_{T+1} | X_T, ..., X_0) = P(X_{T+1} | X_T). Transition probability: p_{ab} = P(X_{T+1} = b | X_T = a). Stationary transition probabilities: the one-step transitions do not depend on T (a stationary process).

  28. Markov chains. [Examples: a 4-state chain over the nucleotides A, G, C, T; a 20-state chain over the amino acids A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V; the “loaded coin”: a two-state chain on {A,B} with transition probabilities p_ab, p_ba and self-loops 1 - p_ab, 1 - p_ba, shown unrolled over T = 1,...,4.]

  29. Markov chains. Transition matrix P: a discrete-time Markov chain is completely defined given an initial condition and a probability matrix. The Markov chain graph G is defined on the states: we connect (a,b) whenever p_ab > 0. The distribution after T time steps given x as an initial condition is a matrix power: P(X_T = y | X_0 = x) = (P^T)_{xy}. (See the sketch below.)
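A minimal sketch, with an illustrative 4-state nucleotide transition matrix, computing the distribution after T steps as a matrix power:

```python
import numpy as np

states = ['A', 'C', 'G', 'T']
# Illustrative transition matrix (each row sums to 1).
P = np.array([
    [0.94, 0.02, 0.03, 0.01],
    [0.02, 0.93, 0.01, 0.04],
    [0.03, 0.01, 0.94, 0.02],
    [0.01, 0.04, 0.02, 0.93],
])

# Distribution after T steps from the initial condition X_0 = 'A':
# row vector times the T-th matrix power.
T = 50
x0 = np.array([1.0, 0.0, 0.0, 0.0])
dist_T = x0 @ np.linalg.matrix_power(P, T)
print(dict(zip(states, dist_T.round(4))))
```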

  30. Spectral decomposition. When an eigenbasis exists, we can find right eigenvectors: P r_k = λ_k r_k, and left eigenvectors: l_k P = λ_k l_k, with the eigenvalue spectrum λ_1, λ_2, ..., λ_n. These are bi-orthogonal: l_j · r_k = δ_{jk}, and define the spectral decomposition: P = Σ_k λ_k r_k l_k, hence P^T = Σ_k λ_k^T r_k l_k. [Diagram: two-state transitions between A and B unrolled over T = 1, 2, 3.]

  31. Spectral decomposition. To compute transition probabilities directly: O(|E|)·T ≈ O(N^2)·T per initial condition, or T matrix multiplications to preprocess up to time T. Using spectral decomposition: O(spectral preprocess) + 2 matrix multiplications per condition. (See the sketch below.)
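A minimal sketch of the spectral shortcut, assuming the illustrative P above is diagonalizable (here it is symmetric, so a real eigenbasis is guaranteed): diagonalize once, then any power costs only a diagonal rescaling plus two matrix multiplications.

```python
import numpy as np

# Same illustrative transition matrix as above (symmetric, so a
# real eigenbasis exists).
P = np.array([
    [0.94, 0.02, 0.03, 0.01],
    [0.02, 0.93, 0.01, 0.04],
    [0.03, 0.01, 0.94, 0.02],
    [0.01, 0.04, 0.02, 0.93],
])

evals, R = np.linalg.eig(P)   # columns of R: right eigenvectors
L = np.linalg.inv(R)          # rows of L: matching left eigenvectors

# Spectral decomposition: P^T = R diag(lambda_k^T) L.
T = 50
P_T = R @ np.diag(evals ** T) @ L
assert np.allclose(P_T, np.linalg.matrix_power(P, T))
```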

  32. Convergence. Spec(P) = P’s eigenvalues, λ_1 > |λ_2| ≥ ...; λ_1 is the largest and always equals 1. A Markov chain is irreducible if its underlying graph is connected; in that case there is a single eigenvalue that equals 1. Fixed point: the stationary distribution. λ_2 = the second-largest eigenvalue, controlling the rate of process convergence. What does the left eigenvector corresponding to λ_1 represent? (A numeric sketch follows below.)
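A minimal numeric sketch (illustrative non-symmetric P): the left eigenvector of λ_1 = 1, normalized to sum to 1, is the stationary distribution, and |λ_2| gives the geometric convergence rate.

```python
import numpy as np

# Illustrative irreducible transition matrix (rows sum to 1).
P = np.array([
    [0.90, 0.05, 0.03, 0.02],
    [0.02, 0.92, 0.02, 0.04],
    [0.06, 0.02, 0.90, 0.02],
    [0.01, 0.06, 0.03, 0.90],
])

# Left eigenvectors of P are right eigenvectors of P^T.
evals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(evals - 1.0))
pi = np.real(vecs[:, k])
pi /= pi.sum()
print("stationary distribution:", pi.round(4))
print("check pi P = pi:", np.allclose(pi @ P, pi))

# |lambda_2| controls how fast any initial distribution approaches pi.
lam2 = sorted(np.abs(evals))[-2]
print("|lambda_2| =", round(float(lam2), 4))
```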

  33. Continuous time. Think of time steps that become smaller and smaller, with Markov conditions on the transitions p_ij(h). Kolmogorov’s theorem: lim_{h→0} (1 - p_ii(h))/h = q_i exists (may be infinite), and for i ≠ j, lim_{h→0} p_ij(h)/h = q_ij exists and is finite.

  34. Rates and transition probabilities. The process’s rate matrix Q: q_ij ≥ 0 for i ≠ j, and q_ii = -Σ_{j≠i} q_ij, so rows sum to 0. Transition differential equations (backward form): P'(t) = Q P(t), with P(0) = I. (See the sketch below.)
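A minimal sketch, integrating the backward equation with naive Euler steps and comparing against the matrix exponential of the next slide (Q is illustrative):

```python
import numpy as np
from scipy.linalg import expm

# Illustrative rate matrix: off-diagonals >= 0, rows sum to 0.
Q = np.array([
    [-1.0, 0.6, 0.4],
    [0.3, -0.8, 0.5],
    [0.2, 0.7, -0.9],
])

# Euler-integrate P'(t) = Q P(t) with P(0) = I.
t, n_steps = 1.5, 10_000
h = t / n_steps
P = np.eye(3)
for _ in range(n_steps):
    P = P + h * (Q @ P)

print(np.max(np.abs(P - expm(Q * t))))  # small discretization error
```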

  35. Matrix exponential. The differential equation P'(t) = Q P(t) has the series solution P(t) = e^{Qt} = Σ_{n=0}^∞ (Qt)^n / n!. Summing over different path lengths: the n-th term accounts for transitions along 1-paths, 2-paths, 3-paths, 4-paths, 5-paths, and so on.
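A minimal sketch of the truncated series (valid here because ||Qt|| is small; the next slides discuss when this breaks down):

```python
import numpy as np

def expm_series(A, k=30):
    """Truncated series: e^A ~ sum_{n=0}^{k} A^n / n!.
    Reasonable only when ||A|| is small."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, k + 1):
        term = term @ A / n   # A^n / n!, built incrementally
        result = result + term
    return result

# Illustrative 2-state rate matrix (rows sum to 0).
Q = np.array([[-0.3, 0.3],
              [0.5, -0.5]])
print(expm_series(Q * 2.0))   # P(t) = e^{Qt} at t = 2
```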

  36. Computing the matrix exponential

  37. Computing the matrix exponential. Series methods: just take the first k summands; reasonable when ||A|| ≤ 1; if the terms are converging, you are OK; can do scaling/squaring: e^A = (e^{A/2^j})^{2^j}. Eigenvalue decomposition: good when the matrix is symmetric; problems when eigenvalues are similar. Multiple methods based on other decompositions of the matrix (e.g., a triangular factor B). (A scaling/squaring sketch follows below.)
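A minimal sketch of scaling/squaring under the identity e^A = (e^{A/2^j})^{2^j}: halve A until its norm is small, run the truncated series, then square back up. It is checked against scipy.linalg.expm; this is an illustration, not a robust library routine.

```python
import numpy as np
from scipy.linalg import expm

def expm_scaling_squaring(A, k=10):
    # Choose j so that ||A / 2^j|| <= 1.
    j = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-16)))))
    B = A / (2 ** j)
    # Truncated series on the scaled matrix.
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, k + 1):
        term = term @ B / n
        E = E + term
    # Square back: e^A = (e^B)^(2^j).
    for _ in range(j):
        E = E @ E
    return E

Q = np.array([[-2.0, 2.0],
              [3.0, -3.0]])
print(np.allclose(expm_scaling_squaring(Q), expm(Q), atol=1e-8))
```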

  38. Modeling: a simple case (modeling → learning → inference). Alignment of Genome 1 and Genome 2: AGCAACAAGTAAGGGAAACTACCCAGAAAA... / AGCCACATGTAACGGTAATAACGCAGAAAA.... Statistics: a 4×4 table of substitution counts n_ab over {A,G,C,T}. Maximum likelihood model: p̂_ab = n_ab / Σ_{b'} n_ab'.

  39. Modeling: a simple case (continued). The same alignment and 4×4 count statistics, now read as the transition matrix at t = 1. (A sketch of the count-and-normalize estimate follows below.)
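A minimal sketch of the count-and-normalize estimate on the slide’s two aligned fragments; row normalization of the count table gives the maximum likelihood transition matrix for one step:

```python
import numpy as np

# The two aligned fragments from the slide.
g1 = "AGCAACAAGTAAGGGAAACTACCCAGAAAA"
g2 = "AGCCACATGTAACGGTAATAACGCAGAAAA"
states = "ACGT"
idx = {c: i for i, c in enumerate(states)}

# Count substitutions a -> b between aligned positions.
counts = np.zeros((4, 4))
for a, b in zip(g1, g2):
    counts[idx[a], idx[b]] += 1

# Maximum likelihood model: normalize each row of the count table.
# (Every row is non-empty for these fragments.)
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(P_hat.round(3))
```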

  40. Modeling: but is it kosher? [Diagram: branches evolving for times t and t' under the same rate matrix Q must compose to evolution for time t + t'; the model is consistent only if e^{Qt} e^{Qt'} = e^{Q(t+t')}.] (See the numeric check below.)
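A minimal numeric check of that additivity requirement, with an illustrative rate matrix and split times:

```python
import numpy as np
from scipy.linalg import expm

# Illustrative rate matrix (off-diagonals >= 0, rows sum to 0).
Q = np.array([
    [-0.9, 0.3, 0.4, 0.2],
    [0.2, -0.8, 0.3, 0.3],
    [0.4, 0.2, -0.7, 0.1],
    [0.1, 0.4, 0.2, -0.7],
])
t, t_prime = 0.5, 1.2

# Semigroup (Chapman-Kolmogorov) property: the same Q must explain
# any split of the time interval.
lhs = expm(Q * (t + t_prime))
rhs = expm(Q * t) @ expm(Q * t_prime)
print(np.allclose(lhs, rhs))  # True
```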

  41. Symmetric processes. Definition: we call a Markov process symmetric if its rate matrix is symmetric: q_ij = q_ji for all i, j. What would a symmetric process converge to? (whiteboard/exercise). Reversing time: what process do we get when we run the chain backwards?

  42. Reversibility. Reversing time: t → s. Definition: a reversible Markov process is one for which the flow i → j looks the same as the flow j → i under time reversal. Claim: a Markov process is reversible iff there exists a distribution π such that π_i q_ij = π_j q_ji for all i, j (whiteboard/exercise). If this holds, we say the process is in detailed balance.

  43. Reversibility. Claim: a Markov process is reversible iff we can write q_ij = s_ij π_j, where S = (s_ij) is a symmetric matrix (whiteboard/exercise). [Diagram: the trees from slide 40, with branches (Q,t) and (Q,t') composing to (Q, t+t').] (A numeric sketch follows below.)
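A minimal sketch building a reversible rate matrix from this claim (π and S illustrative) and verifying detailed balance:

```python
import numpy as np

# Pick a distribution pi and a symmetric S, then set
# q_ij = s_ij * pi_j for i != j (all values illustrative).
pi = np.array([0.1, 0.2, 0.3, 0.4])
S = np.array([
    [0.0, 1.0, 0.5, 0.2],
    [1.0, 0.0, 0.8, 0.3],
    [0.5, 0.8, 0.0, 0.6],
    [0.2, 0.3, 0.6, 0.0],
])
Q = S * pi[None, :]
np.fill_diagonal(Q, -Q.sum(axis=1))   # rows of a rate matrix sum to 0

# Detailed balance: pi_i q_ij == pi_j q_ji for all i, j.
flux = pi[:, None] * Q
print(np.allclose(flux, flux.T))      # True

# And pi is stationary: pi Q = 0.
print(np.allclose(pi @ Q, 0.0))       # True
```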
