
CECS 694-04 Bioinformatics Journal Club Eric Rouchka, D.Sc. September 10, 2003

SATCHMO: sequence alignment and tree construction using hidden Markov models. Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411.


Presentation Transcript


  1. SATCHMO: sequence alignment and tree construction using hidden Markov models. Edgar, R.C. and Sjolander, K. Bioinformatics. 19(11):1404-1411. CECS 694-04 Bioinformatics Journal Club Eric Rouchka, D.Sc. September 10, 2003 Eric C. Rouchka, University of Louisville

  2. What is Multiple Sequence Alignment (MSA)? • Aligning more than two sequences based on their similarity

  3. Globin Example
     >gamma_A
     MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVASALSSRYH
     >alfa
     VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
     >beta
     VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
     >delta
     VHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVANALAHKYH
     >epsilon
     VHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
     >gamma_G
     MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVASALSSRYH
     >myoglobin
     MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
     >teta1
     ALSAEDRALVRALWKKLGSNVGVYTTEALERTFLAFPATKTYFSHLDLSPGSSQVRAHGQKVADALSLAVERLDDLPHALSALSHLHACQLRVDPASFQLLGHCLLVTLARHYPGDFSPALQASLDKFLSHVISALVSEYR
     >zeta
     SLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHGSKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTAEAHAAWDKFLSVVSSVLTEKYR

  4. Globin Multiple Alignment

  5. Why do MSA? • Homology Searching • Important regions conserved across (or within) species • Genic Regions • Regulatory Elements • Phylogenetic Classification • Subfamily classification • Identification of critical residues

  6. MSA Approaches • All columns alignable across all sequences • MSA • ClustalW • Columns alignable throughout all sequences singled out (Profile HMM) • HMMER • SAM

  7. MSA • N-dimensional dynamic programming • Time consuming • High memory usage • Guaranteed to yield the optimal (maximum-scoring) alignment
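The dynamic-programming idea on this slide can be sketched for the two-sequence case. The exact N-sequence method fills an N-dimensional table with the same kind of recurrence, which is why time and memory grow so quickly with the number of sequences. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from the talk.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score for two sequences (2-D dynamic programming)."""
    rows, cols = len(x) + 1, len(y) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            # Best of: align x[i-1] with y[j-1], gap in y, gap in x.
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]
```

The table has (len(x)+1) × (len(y)+1) cells; with N sequences of length L the analogous table has on the order of L^N cells.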

  8. ClustalW • Progressive Alignment • Sequences aligned in pair-wise fashion • Alignment scores produce phylogenetic tree • Enhanced dynamic programming approach

  9. Hidden Markov Models • Match State, Insert State, Delete State

  10. HMMs • Models conserved regions • Successful at detecting and aligning critical motifs and conserved core structure • Difficulty in aligning sequence outside of these regions

  11. SATCHMO • Simultaneous Alignment and Tree Construction using Hidden Markov mOdels (image source: www.lib.jmu.edu/music/composers/armstrong.htm)

  12. SATCHMO • Progressive Alignment • Built iteratively in pairs • Profile HMMs used • The alignment of the same sequences can differ from node to node • Fewer columns are predicted alignable as structures diverge • Output is not represented by a single matrix

  13. Why HMMs? • Homologs ranked through scoring • Accurate profiles from small numbers of sequences • Accurately combines two alignments having low sequence similarity

  14. Bits saved relative to background • k = 1..M: HMM node number • a: amino acid type • Pk(a): emission probability of a in the kth match state • P0(a): approximation of the background probability of a
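The formula for this slide did not survive in the transcript. One reading consistent with the listed symbols is the relative entropy, in bits, between each match state's emission distribution Pk(a) and the background P0(a), averaged over the M nodes. The sketch below computes that quantity; both the averaging and the function name `bits_saved` are assumptions, not a confirmed transcription of the slide.

```python
import math

def bits_saved(match_emissions, background):
    """Average relative entropy (in bits) of match emissions versus background.

    match_emissions: list of M dicts mapping amino acid -> P_k(a)
    background:      dict mapping amino acid -> P_0(a)
    Assumed form: (1/M) * sum_k sum_a P_k(a) * log2(P_k(a) / P_0(a))
    """
    total = 0.0
    for p_k in match_emissions:
        for amino_acid, p in p_k.items():
            if p > 0:  # terms with zero probability contribute nothing
                total += p * math.log2(p / background[amino_acid])
    return total / len(match_emissions)
```

Under this reading, a match state identical to the background saves 0 bits, and a fully conserved column saves log2(1/P0(a)) bits for its conserved residue a.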

  15. Sequence weights • Sequences weighted such that b converges on a desired value • Weights compensate for correlation in sequences

  16. HMM Construction • Profile HMM constructed from multiple alignment • Some columns alignable; others not

  17. HMM Construction • Given an alignment a, a profile HMM is generated • Each column in a is assigned to an emitter state – transition probabilities are calculated based on observed amino acids
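As a rough illustration of turning alignment columns into emitter states, the sketch below estimates match-state emission probabilities from column counts; it does not cover transition probabilities. The gap-fraction rule for choosing match columns, the add-one pseudocount, and the function name are simplifying assumptions for illustration; real profile-HMM tools use more sophisticated priors and weighting.

```python
from collections import Counter

def match_emissions(alignment, alphabet, max_gap_fraction=0.5, pseudocount=1.0):
    """Return one emission distribution per column treated as a match state.

    alignment: list of equal-length strings, with '-' marking gaps
    alphabet:  iterable of residue symbols
    """
    n_seqs = len(alignment)
    emissions = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if n_seqs - len(residues) > max_gap_fraction * n_seqs:
            continue  # mostly gaps: treated as an insert column and skipped here
        counts = Counter(residues)
        total = len(residues) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions
```

For the toy alignment `["AC-", "AC-", "AG-"]` over the alphabet `"ACG"`, the first two columns become match states and the all-gap third column is skipped.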

  18. Transition Probabilities • If we have a total of five match states, the probabilities can be stored in the following table:

  19. HMM Terminology • π: path through an HMM that produces a sequence s • P(A|π) = ∏ P(s|π_s) • π+: maximum-probability path through the HMM

  20. Aligning Two Alignments • One alignment is converted to an HMM • Second alignment is aligned to the HMM • Some columns remain alignable • Affinities (relative match scores) calculated • New MSA results • HMM Constructed from new MSA

  21. Aligning Two Alignments

  22. SATCHMO Algorithm • Step 1: • Create a cluster for each input sequence and construct an HMM from the sequence • Step 2: • Calculate the similarity of all pairs of clusters and identify a pair with the highest similarity • Align the target and template to produce a new node

  23. SATCHMO Algorithm • Repeat step 2 until: • All sequences are assigned to a single cluster • Highest similarity between clusters is below a threshold • No alignable positions are predicted • Output: A set of binary trees • Leaf nodes are sequences • Each node contains an HMM aligning the sequences in the subtree
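The loop described in steps 1 and 2 can be sketched as a generic agglomerative procedure. This is only an outline under stated assumptions: `toy_similarity` and `toy_merge` are hypothetical stand-ins for HMM-based scoring and alignment-to-HMM merging, plain strings stand in for the HMMs, and none of the names come from the SATCHMO software.

```python
from itertools import combinations

def toy_similarity(model_a, model_b):
    """Placeholder score: fraction of identical positions (not HMM scoring)."""
    same = sum(1 for p, q in zip(model_a, model_b) if p == q)
    return same / max(len(model_a), len(model_b))

def toy_merge(model_a, model_b):
    """Placeholder merge: keep the first model. Returning None would mean
    'no alignable positions predicted'."""
    return model_a

def build_trees(sequences, similarity, merge, threshold):
    # Step 1: one cluster (leaf node) per input sequence.
    clusters = [{"members": [s], "model": s, "children": None} for s in sequences]
    # Step 2, repeated: join the most similar pair of clusters.
    while len(clusters) > 1:
        def score(pair):
            return similarity(pair[0]["model"], pair[1]["model"])
        a, b = max(combinations(clusters, 2), key=score)
        if score((a, b)) < threshold:
            break  # best remaining pair is too dissimilar
        merged_model = merge(a["model"], b["model"])
        if merged_model is None:
            break  # no alignable positions predicted
        node = {"members": a["members"] + b["members"],
                "model": merged_model, "children": (a, b)}
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(node)
    return clusters  # a set of binary trees, one per remaining cluster
```

Calling `build_trees(["AAAA", "AAAT", "CCCC"], toy_similarity, toy_merge, 0.5)` joins the two similar sequences and leaves the third as its own tree, which mirrors the "set of binary trees" output described on the slide.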

  24. Graphical Interface for SATCHMO

  25. Demonstration of SATCHMO

  26. Validation Set • BAliBASE benchmark alignment set used • Ref1: equidistant sequences • Ref2: distantly related sequences • Ref3: subgroups of sequences; < 25% similarity between groups • Ref4: alignments with long extensions on the ends • Ref5: alignments with long insertions

  27. Comparison of Results • SATCHMO compared to: • ClustalW (Progressive Pairwise Alignment) • SAM (HMM)


  29. Discussion • SATCHMO effective in identifying protein domains • Comparison to T-Coffee and PRRP would be useful • Time and sensitivity • Tree representation is unique, modeling structural similarity
