1 / 61

Welcome to Introduction to Computational Genomics for Infectious Disease

Welcome to Introduction to Computational Genomics for Infectious Disease. Course Instructors. Instructor James Galagan Teaching Assistants Lab Instructors. Schedule and Logistics. Lectures Labs. Tues/Thurs 11-12:30 Harvard School of Public Health: FXB-301

kirk-reed
Download Presentation

Welcome to Introduction to Computational Genomics for Infectious Disease

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Welcome toIntroduction to Computational Genomics for Infectious Disease

  2. Course Instructors • Instructor James Galagan • Teaching Assistants • Lab Instructors

  3. Schedule and Logistics • Lectures • Labs Tues/Thurs 11-12:30 Harvard School of Public Health: FXB-301 The François-Xavier Bagnoud Center, Room 301 Wed/Fri 1-3 Broad Institute: Olympus Room First floor of Broad Main Lobby See front desk attendant near entrance Individual computers and software provided No programming experience required

  4. Website www.broad.mit.edu/annotation/winter_course_2006/ • Contact information • Directions to Broad • Lecture slides • Lab handouts • Resources

  5. Goals of Course • Introduction to concepts behind commonly used computational tools • Recognize connection between different concepts and applications • Hands on experience with computational analysis

  6. Concepts and Applications • Lectures will cover concepts • Computationally oriented • Labs will provide opportunity for hands on application of tools • Nuts and bolts of running tools • Application of tools not covered in lectures

  7. Computational Genomics Overview Slide Credit: Manolis Kellis

  8. Topics • Probabilistic Sequence Modeling • Clustering and Classification • Motifs • Steady State Metabolic Modeling

  9. Topics Not Covered • Sequence Alignment • Phylogeny (maybe in labs) • Molecular Evolution • Population Genetics • Advanced Machine Learning • Bayesian Networks • Conditional Random Fields

  10. Applications to Infectious Disease • Examples and labs will focus on the analysis of microbial genomics data • Pathogenicity islands • TB expression analysis • Antigen prediction • Mycolic acid metabolism • But approaches are applicable to any organism and to many different questions

  11. Probabilistic Modeling of Biological Sequences Concepts Statistical Modeling of Sequences Hidden Markov Models Applications Predicting pathogenicity islands Modeling protein families Lab Practical Basic sequence annotation

  12. Probabilistic Sequence Modeling • Treat objects of interest as random variables • nucleotides, amino acids, genes, etc. • Model probability distributions for these variables • Use probability calculus to make inferences

  13. Why Probabilistic Sequence Modeling? • Biological data is noisy • Probability provides a calculus for manipulating models • Not limited to yes/no answers – can provide “degrees of belief” • Many common computational tools based on probabilistic models

  14. Sequence Annotation GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

  15. Sequence Annotation GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG Gene

  16. Sequence Annotation GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGAAGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGCGTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG Promoter Motif Gene Kinase Domain

  17. Probabilistic Sequence Modeling • Hidden Markov Models (HMM) • A general framework for sequences of symbols (e.g. nucleotides, amino acids) • Widely used in computational genomics • Hmmer – HMMs for protein families • Pathogenicity Islands

  18. Neisseria meningitidis, 52% G+C GC Content (from Tettelin et al. 2000. Science) Pathogenicity Islands • Clusters of genes acquired by horizontal transfer • Present in pathogenic species but not others • Frequently encode virulence factors • Toxins, secondary metabolites, adhesins • (Flanked by repeats, gene content, phylogeny, regulation, codon usage) • Different GC content than rest of genome

  19. Application: Bacillus subtilis

  20. ... C C TA A G T T A G A G G A T T G A G A …. A: 0.15 T: 0.13 G: 0.30 C: 0.42 A: 0.15 T: 0.13 G: 0.30 C: 0.42 A: 0.15 T: 0.13 G: 0.30 C: 0.42 Modeling Sequence Composition • Calculate sequence distribution from known islands • Count occurrences of A,T,G,C • Model islands as nucleotides drawn independently from this distribution … … P(Si|MP)

  21. A: 0.15 T: 0.13 G: 0.30 C: 0.42 The Probability of a Sequence • Can calculate the probability of a particular sequence (S) according to the pathogenicity island model (MP) Example S = AAATGCGCATTTCGAA

  22. A: 0.15 T: 0.13 G: 0.30 C: 0.42 Score Matrix A: -0.73 A: T: -0.94 T: G: 0.26 G: C: 0.74 C: Sequence Classification PROBLEM: Given a sequence, is it an island? • We can calculate P(S|MP), but what is a sufficient P value? SOLUTION: compare to a null model and calculate log-likelihood ratio • e.g. background DNA distribution model, B Pathogenicity Islands Background DNA

  23. Finding Islands in Sequences • Could use the log-likelihood ratio on windows of fixed size • What if islands have variable length? • We prefer a model for entire sequence TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

  24. 0.15 0.85 0.75 0.25 A: 0.15 T: 0.13 G: 0.30 C: 0.42 A: 0.25 T: 0.25 G: 0.25 C: 0.25 A More Complex Model Background Island TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

  25. P P P B B B B B G C A A A T G C A: 0.25 T: 0.25 G: 0.25 C: 0.25 A: 0.42 T: 0.30 G: 0.13 C: 0.15 A Generative Model P P P P P P P P P P P P P B B B B B B B B B B B S: P(Li+1|Li) P(S|B) P(S|P)

  26. A Hidden Markov Model • Hidden States • L = { 1, ..., K } • Transition probabilities • aij = Transition probability from state i to state j • Emission probabilities • ei(b) = P( emitting b | state=i) • Initial state probability • p(b) = P(first state=b) Transition Probabilities Statei State j ei(b) ej(b) Emission Probabilities

  27. What can we do with this model? The model defines a joint probability over labels and sequences, P(L,S) Implicit in model is what labels “tend to go” with what sequences (and vice versa) Rules of probability allow us to use this model to analyze existing sequences

  28. Fundamental HMM Operations Computation Biology Decoding • Given an HMM and sequence S • Find a corresponding sequence of labels, L Evaluation • Given an HMM and sequence S • Find P(S|HMM) Training • Given an HMM w/o parameters and set of sequences S • Find transition and emission probabilities the maximize P(S | params, HMM) Annotate pathogenicity islands on a new sequence Score a particular sequence (not as useful for this model – will come back to this later) Learn a model for sequence composed of background DNA and pathogenicity islands

  29. The Hidden in HMM • DNA does not come conveniently labeled (i.e. Island, Gene, Promoter) • We observe nucleotide sequences • The hidden in HMM refers to the fact that state labels, L, are not observed • Only observe emissions (e.g. nucleotide sequence in our example) Statei State j …A A G T T A G A G…

  30. “Decoding” With HMM Given observables, we would like to predict a sequence of hidden states that is most likely to have generated that sequence Pathogenicity Island Example Given a nucleotide sequence, we want a labeling of each nucleotide as either “pathogenicity island” or “background DNA”

  31. The Most Likely Path • Given a sequence, one reasonable choice for a labeling is: The sequence of labels, L*, (or path) that makes the labels and sequence most likely given the model

  32. B B B B B B B B 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 Probability of a Path,Seq P P P P P P P P L: B B B B B B B B G C A A A T G C S:

  33. 0.75 0.75 P P P 0.15 0.25 B B B B B 0.85 0.85 0.85 0.25 0.25 0.25 0.42 0.42 0.30 0.25 0.25 Probability of a Path,Seq P P P P P P P P L: B B B B B B B B G C A A A T G C S: We could try to calculate the probability of every path, but….

  34. Decoding • Viterbi Algorithm • Finds most likely sequence of labels, L*, given sequence and model • Uses dynamic programming (same technique used in sequence alignment) • Much more efficient than searching every path

  35. Sum over all paths P P P P P P P P B B B B B B B B Probability of a Single Label • Calculate most probable label, L*i , at each position i • Do this for all N positions gives us {L*1, L*2, L*3…. L*N} P P P P P P P P L: B B B B B B B B G C A A A T G C S: Forward algorithm (dynamic programming) P(Label5=B|S)

  36. Two Decoding Options • Viterbi Algorithm • Finds most likely sequence of labels, L*, given sequence and model • Posterior Decoding • Finds most likely label at each position for all positions, given sequence and model {L*1, L*2, L*3…. L*N} • Forward and Backward equations

  37. Application: Bacillus subtilis

  38. Method Three State Model Second Order Emissions P(Si)=P(Si|State,Si-1,Si-2) (capturing trinucleotide Frequencies) Train using EM Predict w/Posterior Decoding Gene+ Gene- AT Rich Nicolas et al (2002) NAR

  39. Results Gene on positive strand Gene on negative strand • A/T Rich • Intergenic regions • Islands Each line is P(label|S,model) color coded by label Nicolas et al (2002) NAR

  40. Fundamental HMM Operations Computation Biology Decoding • Given an HMM and sequence S • Find a corresponding sequence of labels, L Evaluation • Given an HMM and sequence S • Find P(S|HMM) Training • Given an HMM w/o parameters and set of sequences S • Find transition and emission probabilities the maximize P(S | params, HMM) Annotate pathogenicity islands on a new sequence Score a particular sequence (not as useful for this model – will come back to this later) Learn a model for sequence composed of background DNA and pathogenicity islands

  41. Training an HMM • Transition probabilities • e.g. P(Pi+1|Bi) – the probability of entering a pathogenicity island from background DNA • Emission probabilities • i.e. the nucleotide frequencies for background DNA and pathogenicity islands P(Li+1|Li) B P P(S|B) P(S|P)

  42. start End B B B B B Learning From Labelled Data Maximum Likelihood Estimation If we have a sequence that has islands marked, we can simply count P P P P P P P P L: B B B B B B B B G C A A A T G C S: P(S|B) P(S|P) P(Li+1|Li) A: 1/5 T: 0 G: 2/5 C: 2/5 A: T: G: C: ! ETC..

  43. ? Unlabelled Data How do we know how to count? P P P P P P P P L: start B B B B B B B B End G C A A A T G C S: P(S|B) P(S|P) P(Li+1|Li) A: T: G: C: A: T: G: C:

  44. P P P B B B B B B B B B B B B B P(Li+1|Li)0 P(S|B)0 P(S|P)0 P(Li+1|Li)1 P(S|B)1 P(S|P)1 P(Li+1|Li)2 P(S|B)2 P(S|P)2 P(Li+1|Li)K P(S|B)K P(S|P)K Unlabeled Data P P P P P P P P An idea: • Imagine we start with some parameters • We could calculate the most likely path, P*, given those parameters and S • We could then use P* to update our parameters by maximum likelihood • And iterate (to convergence) L: start B B B B B B B B End G C A A A T G C S: …

  45. Expectation Maximization (EM) • Initialize parameters • E StepEstimateprobability of hidden labels , Q, given parameters and sequence • M Step Choose new parameters to maximizeexpected likelihood of parameters given Q • Iterate P(S|Model) guaranteed to increase each iteration

  46. Expectation Maximization (EM) • Remember the basic idea! • Use model to estimate (distribution of) missing data • Use estimate to update model • Repeat until convergence • EM is a general approach for learning models (ML estimation) when there is “missing data” • Widely used in computational biology EM frequently used in motif discovery Lecture 3

  47. A More Sophisticated Application Modeling Protein Families • Given amino acid sequences from a protein family, how can we find other members? • Can search databases with each known member – not sensitive • More information is contained in full set • The HMM Profile Approach • Learn the statistical features of protein family • Model these features with an HMM • Search for new members by scoring with HMM We will learn features from multiple alignments

  48. Human Ubiquitin Conjugating Enzymes UBE2D2FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK UBE2D3FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK BAA91697FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK UBE2D1FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK UBE2E1FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK UBCH9FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK UBE2NLPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT AAF67016IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT UBCH10FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT CDC34FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT BAA91156FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT UBE2G1FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET UBE2BFSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS UBE2IFKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ E2EPF5LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH UBE2L1FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ UBE2L6FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ UBE2HLPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN UBC12VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

  49. D1 Dj DN I I1 Ij IN Start M1 Mj MN End E2EPF5 LG K D F PA S PP K G YF L T K I F H P N VGA N - G E ICV N VL KR A------------ D W T A E LGI RH UBE2L1 F PA E Y P F K PP K I T F K T K I Y H P N I DE K - G Q VCLPVI S A A----------- E N W K PA T K T D Q UBE2L6 F PP E Y P F K PPMI K F TT K I Y H P N V DE N - G Q ICLPII SS A----------- E N W K PC T K T C Q UBE2H LP D K Y P F K S P S IG F M N K I F H P N I DE A S G T VCL D VI N -P----------- QT W T AL Y D L TN Profile HMM

  50. Using Profile HMMs Computation Biology Decoding Find sequence of labels, L, that maximizes P(L|S, HMM) Evaluation • Find P(S|HMM) Training • Find transition and emission probabilities the maximize P(S | params, HMM) Align a new sequence to a protein family Score a sequence for membership in family Discover and model family structure

More Related