
Multiple Sequence Alignment Construction, Visualization, and Analysis Using Partial Order Graphs




Presentation Transcript


  1. Multiple Sequence Alignment Construction, Visualization, and Analysis Using Partial Order Graphs Catherine S. Grasso Christopher J. Lee

  2. Overview of Talk • Intro to Partial Order Multiple Sequence Alignment Representation • Biological Feature Extraction Using Partial Order Graphs • Multiple Sequence Alignment Construction Using Partial Order Graphs • Conclusions

  3. Q: Why Do Multiple Sequence Alignment? A: To model the process that constructed a set of sequences from a common source sequence.

  4. A multiple sequence alignment allows biologists to infer: • Protein Structure • Protein Function • Protein Domains • Protein Active Sites • Splice Sites • Regulatory Motifs • Single Nucleotide Polymorphisms • mRNA Isoforms For example, protein sequences that are >30% identical often have the same structure and function.

  5. Row Column Multiple Sequence Alignment RC-MSA

  6. RC-MSA Representation Does Not Reveal Large Scale Features While it is easy to interpret single residue changes in this format, large scale changes are not easy to interpret.

  7. The Scale of Features of Interest Should Inform MSA Representation • Features from single residue changes can be easily seen in RC-MSA Representation: • Regulatory Motifs • Single Nucleotide Polymorphisms • Promoter Binding Sites • Features from large scale changes cannot: • Protein Domain Differences • Alternative Splicing • Genome Duplications

  8. Degeneracy of RC-MSA Representation Alignment A is biologically equivalent to alignment A’. However, they look different solely due to representation degeneracy. We’d like a representation that is not degenerate.
      A:
      .....ACATGTCGAT.....AGGTG
      TGCAC.....TCGATACATAAGGTG
      A’:
      ACATG.....TCGAT.....AGGTG
      .....TGCACTCGATACATAAGGTG

  9. What do we really want to know about an MSA? • The order of letters within a sequence. 5’ to 3’ or N-terminal to C-terminal. • Which letters are aligned between sequences. One sequence can impose its order on another sequence only through alignment. What do we really want to do with an MSA? • We want to use it as an object in multiple sequence alignment method. • We want to analyze it for biologically interesting features.

  10. Partial Order Multiple Sequence Alignment (PO-MSA) Start from the conventional format (RC-MSA). Draw each sequence as a directed graph: a node for each letter, connected by directed edges. Fuse aligned, identical letters (PO-MSA).

  11. Returning to the previous example… In the PO-MSA format, both A and A’
      A:
      .....ACATGTCGAT.....AGGTG
      TGCAC.....TCGATACATAAGGTG
      A’:
      ACATG.....TCGAT.....AGGTG
      .....TGCACTCGATACATAAGGTG
      can be represented as
      [Figure: one PO-MSA graph. The branches ACATG and TGCAC both lead into TCGAT, which continues to AGGTG either directly or through the branch ACATA.]
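The construction on the two slides above (one node per letter, directed edges along each sequence, aligned identical letters fused) can be sketched in a few lines. This is a minimal illustration, not the POA implementation: the function name `po_msa` and the choice to name each node by the residues it contains are assumptions made here so that two graphs can be compared directly.

```python
def po_msa(rows, gap="."):
    """Build a PO-MSA (nodes, edges) from the rows of an RC-MSA.

    A node is (letter, frozenset of (sequence index, residue index)).
    Naming nodes by the residues they hold, rather than by column,
    makes the result independent of where gaps were placed.
    """
    residue_node = {}               # (seq, residue index) -> node
    seen = [0] * len(rows)          # residues consumed so far, per sequence
    for column in zip(*rows):
        groups = {}                 # letter -> residues aligned in this column
        for seq, letter in enumerate(column):
            if letter != gap:
                groups.setdefault(letter, []).append((seq, seen[seq]))
                seen[seq] += 1
        for letter, members in groups.items():
            node = (letter, frozenset(members))   # fuse aligned, identical letters
            for member in members:
                residue_node[member] = node
    edges = set()
    for seq, length in enumerate(seen):
        for i in range(length - 1):               # directed edge between neighbours
            edges.add((residue_node[(seq, i)], residue_node[(seq, i + 1)]))
    return set(residue_node.values()), edges


A = [".....ACATGTCGAT.....AGGTG", "TGCAC.....TCGATACATAAGGTG"]
A_PRIME = ["ACATG.....TCGAT.....AGGTG", ".....TGCACTCGATACATAAGGTG"]
same_graph = po_msa(A) == po_msa(A_PRIME)
```

With this node naming, `same_graph` is true for the two alignments from the slide: the row-column layouts differ, but the fused graph does not.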

  12. PO-MSA of EST sequences from UniGene cluster Hs.100194 [Figure: the same EST alignment drawn in PO-MSA format and in RC-MSA format.] Note: PO-MSA format stores the information in a multiple sequence alignment more efficiently.

  13. Real Example: Human SH2 Domain Containing Proteins [Figure panels: Hand Rendered PO-MSA Showing Domain Structure; POAVIZ Rendered PO-MSA Reflects Domain Structure; RC-MSA in Text Format.]

  14. Protein Multiple Sequence Alignment from BALIBASE 1amk_ref1 Equi-distant sequences with various levels of conservation

  15. Protein Multiple Sequence Alignment from BALIBASE 1aboA_ref2 Family with a highly divergent “orphan” sequence

  16. Protein Multiple Sequence Alignment from BALIBASE 1ubi_ref3 Subgroups with <25% residue identity between groups

  17. Protein Multiple Sequence Alignment from BALIBASE 1dynA_ref4 Sequences with N/C-terminal extensions

  18. Protein Multiple Sequence Alignment from BALIBASE kinase2_ref5 Sequences with Internal Insertions

  19. What do we really want to do with an MSA? We want to analyze it for biologically interesting features.

  20. Alternative Splicing Annotation Using PO-MSA Representation

  21. PO-MSA of ESTs Aligned to Genomic Sequence exon 1 exon 2 exon 3 exon 4 Genomic intron 1 intron 2 intron 3

  22. Annotate Splices and Alternative Splicing as Edges in PO-MSA

  23. Infer mRNA Isoforms from Paths in PO-MSA
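Reading isoforms off a splice graph amounts to enumerating paths from the first exon to the last. The sketch below assumes a simplified graph with one node per exon (a real PO-MSA has one node per letter), and the exon-skipping edge from exon 2 to exon 4 is a hypothetical example, not data from the talk.

```python
def isoforms(edges, start, end):
    """Return every path from start to end in a directed acyclic graph."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)

    def walk(node, path):
        if node == end:
            yield path
            return
        for nxt in successors.get(node, []):
            yield from walk(nxt, path + [nxt])

    return list(walk(start, [start]))


# Constitutive splices plus one hypothetical exon-skipping splice (exon2 -> exon4).
splice_edges = [
    ("exon1", "exon2"),
    ("exon2", "exon3"),
    ("exon3", "exon4"),
    ("exon2", "exon4"),
]
paths = isoforms(splice_edges, "exon1", "exon4")
```

Each returned path is one candidate isoform: here the full-length transcript and the one that skips exon 3.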

  24. What do we really want to do with an MSA? We want to use it as an object for building multiple sequence alignments.

  25. Background on Progressive Multiple Sequence Alignment

  26. Pair-wise Sequence Alignment Using Dynamic Programming Finding a PSA = Finding a path through a 2-Dim matrix. It’s O(L^2), where L is the sequence length.
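The quadratic cost comes from filling one matrix cell per pair of positions. A minimal sketch of the standard global alignment recurrence (Needleman-Wunsch with a linear gap penalty) follows; the scoring values are placeholder assumptions, not the parameters used in the talk.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b, in O(len(a) * len(b)) time."""
    previous = [j * gap for j in range(len(b) + 1)]    # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        current = [i * gap]                            # column for the empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,        # align a[i-1] with b[j-1]
                previous[j] + gap,                     # a[i-1] against a gap
                current[j - 1] + gap,                  # b[j-1] against a gap
            ))
        previous = current
    return previous[-1]
```

Only two rows are kept because this version returns the score alone; recovering the path itself requires storing the full matrix or traceback pointers.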

  27. Multiple Sequence Alignment Using Dynamic Programming Finding an MSA = Finding a path through an N-Dim matrix. It’s O(L^N), where N is the number of sequences and L is the sequence length. Note: More than 5 sequences takes a prohibitive amount of time. Heuristic methods, such as those used by CLUSTAL W, are used instead.
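A quick count shows how fast the matrix grows. The length of 300 residues is an arbitrary illustrative value, and the count ignores the extra boundary row in each dimension.

```python
def dp_cells(length, n_sequences):
    """Approximate number of cells in an N-dimensional DP matrix (L ** N)."""
    return length ** n_sequences


pairwise_cells = dp_cells(300, 2)   # two sequences
five_way_cells = dp_cells(300, 5)   # five sequences
```

For these values the pairwise matrix has 90,000 cells, while the five-sequence matrix has 2,430,000,000,000.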

  28. Progressive Alignment (CLUSTAL W) Approach 1. Compute pairwise distances of all N sequences. 2. Build Guide Tree. 3. Align N sequences using guide tree. • a. Use standard PSA to align leaf sequences. • b. Profile multiple sequence alignments at branch nodes. • c. Use standard PSA on profiles. • d. Recurse. [Figure: guide tree over seqA through seqF.]
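Steps 1 and 2 can be sketched as follows. Two substitutions are made here for brevity and should be read as assumptions: the distance is `1 - SequenceMatcher.ratio()` from Python's `difflib` rather than an alignment-based distance, and the tree is built by greedy average-linkage clustering, whereas CLUSTAL W uses neighbor joining.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    """Stand-in pairwise distance: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def guide_tree(seqs):
    """Build a guide tree as nested tuples from a dict of name -> sequence."""
    dist = {frozenset(pair): distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    clusters = {name: [name] for name in seqs}     # subtree -> leaf names under it

    def average_distance(pair):
        x, y = pair
        values = [dist[frozenset((a, b))] for a in clusters[x] for b in clusters[y]]
        return sum(values) / len(values)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=average_distance)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)   # join closest pair
    return next(iter(clusters))


tree = guide_tree({"seqA": "ACGTACGT", "seqB": "ACGTACGA", "seqC": "TTTTGGGG"})
```

The nested tuples give the merge order for step 3: the innermost pair is aligned first, and each enclosing tuple is a branch node where two intermediate alignments are combined.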

  29. Progressive Multiple Sequence Alignment 3. Align sequences using guide tree. [Figure: guide tree with nodes numbered 1 to 15; leaves 1 to 8, root 15.]

  30. Multiple Sequence Alignment Stored As A Linear Profile
      2pia        .....pqedgflrlkIASKEKIar.....DIWSFELtdpqgaplppFEAG
      vanb_pses9  ..........mldlmIRGLRLEap.....GILGLELvatdgsplptFEAG
      POBB_PSEPS  ..msaaatmapvslrIHAIAYGad.....DVLLFDLrapardglapFDAG
      YEAX_ECOLI  .....msdyqmfevqVSQVEPLte.....QVKRFTLvatdgkplpaFTGG
      1fnc        tvnkfk.pktpyvgrCLLNTKItgddapgETWHMVFs...hegeipYREG
      fenr_cyapa  plnlfr.panpyigkCIYNERIvgegapgETKHIIFt...hegkvpYLEG
      fenr_spisp  pvniyk.pknpyigkCLSNEELvreggtgTVRHLIFdi..sggdlrYLEG
      fens_orysa  .lntyk.pkepytatIVSVERIvgpkapgETCHIVId...hggnvpYWEG
      nia_emeni   prptfltpkawtkatLTKKTSVss.....DTHIFTLslehpsqalgLPTG
      nia_lepmc   lrptfldsrtwskalLSSKTKVsw.....DTRIFRFkldhasqtlgLPTG
      nia_aspng   pratflqskswtkatLVKRTDVsw.....DTRIFTFqlqhdkqtlgLPIG
      nia_fusox   ...............LDKKTSIsp.....DTKIFSFklnheaqkigLPTG
      nia_neucr   ...............LTFKESVsp.....DTKIFHFalshpaqsigLPVG
      nia_ustma   ....fldpkkwratrLGEQANHsp.....DARIFRFalgsedqelgLPWP
      nia_beaba   ....flqpkywskaiLETKTDVss.....DSKIFSFrldhaaqsigLPTG

  31. Multiple Sequence Alignment Stored As A Linear Profile (same alignment as on the previous slide) Profile of one alignment column, as counts per amino acid:
      A R N D C Q E G H I L K M F P S T W Y V
      0 2 1 2 0 0 1 0 0 0 1 3 0 0 1 3 0 0 1 0
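A column profile is a per-residue count over one column of the alignment. The sketch below recomputes it from the alignment as transcribed here; the function name `column_profile` is illustrative. The slide does not say which column its profile describes, but the counts it lists agree with column 21 of the transcribed alignment (index 20 when counting from zero), so that column is used as the example.

```python
from collections import Counter

ALIGNMENT = """\
2pia        .....pqedgflrlkIASKEKIar.....DIWSFELtdpqgaplppFEAG
vanb_pses9  ..........mldlmIRGLRLEap.....GILGLELvatdgsplptFEAG
POBB_PSEPS  ..msaaatmapvslrIHAIAYGad.....DVLLFDLrapardglapFDAG
YEAX_ECOLI  .....msdyqmfevqVSQVEPLte.....QVKRFTLvatdgkplpaFTGG
1fnc        tvnkfk.pktpyvgrCLLNTKItgddapgETWHMVFs...hegeipYREG
fenr_cyapa  plnlfr.panpyigkCIYNERIvgegapgETKHIIFt...hegkvpYLEG
fenr_spisp  pvniyk.pknpyigkCLSNEELvreggtgTVRHLIFdi..sggdlrYLEG
fens_orysa  .lntyk.pkepytatIVSVERIvgpkapgETCHIVId...hggnvpYWEG
nia_emeni   prptfltpkawtkatLTKKTSVss.....DTHIFTLslehpsqalgLPTG
nia_lepmc   lrptfldsrtwskalLSSKTKVsw.....DTRIFRFkldhasqtlgLPTG
nia_aspng   pratflqskswtkatLVKRTDVsw.....DTRIFTFqlqhdkqtlgLPIG
nia_fusox   ...............LDKKTSIsp.....DTKIFSFklnheaqkigLPTG
nia_neucr   ...............LTFKESVsp.....DTKIFHFalshpaqsigLPVG
nia_ustma   ....fldpkkwratrLGEQANHsp.....DARIFRFalgsedqelgLPWP
nia_beaba   ....flqpkywskaiLETKTDVss.....DSKIFSFrldhaaqsigLPTG
"""
ROWS = [line.split()[1] for line in ALIGNMENT.strip().splitlines()]
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # column order used on the slide


def column_profile(rows, col):
    """Count each amino acid in one column, ignoring gaps and letter case."""
    counts = Counter(row[col].upper() for row in rows if row[col] != ".")
    return [counts[aa] for aa in AMINO_ACIDS]


profile = column_profile(ROWS, 20)
```

Gaps are dropped in this sketch; a profile used for scoring at branch nodes would also need to track them, which is where the gap artifacts discussed later arise.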

  32. Pair-wise Sequence Alignment of Leaf Nodes vs. Branch Nodes PSA of sequences at leaf nodes: Requires a scoring function which can score a match between residues. PSA of profiles at branch nodes: Requires a scoring function which can score a match between profiles of columns of residues and gaps.

  33. Problem with Aligning Profiles: Gap Artifacts! Alignment A is biologically equivalent to alignment A’.
      A:
      .....ACATGTCGAT.....AGGTG
      TGCAC.....TCGATACATAAGGTG
      A’:
      ACATG.....TCGAT.....AGGTG
      .....TGCACTCGATACATAAGGTG
      If we try to align another sequence which is identical to the second sequence in the alignment…
      S: TGCACTCGATACATAAGGTG
      We find that Score(S,A) is not equal to Score(S,A’), but it should be.

  34. In doing pair-wise sequence alignment on RC-MSA profiles: • Each column is treated in isolation. • But interpreting what’s a true gap requires looking outside of the column. • We can try to solve this problem by adjusting the scoring process. • This results in a non-local scoring function, which violates the locality that dynamic programming depends on.
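The column-by-column view can be made concrete: alignments A and A’ contain the same two sequences, yet their per-column profiles differ, so any score computed one column at a time can differ too. This sketch only compares the profiles; it does not reproduce a particular profile scoring function, and the helper names are illustrative.

```python
from collections import Counter


def column_profiles(rows):
    """One Counter per column, counting letters and gap characters alike."""
    return [Counter(column) for column in zip(*rows)]


def ungapped(rows, gap="."):
    """The underlying sequences with all gap characters removed."""
    return [row.replace(gap, "") for row in rows]


A = [".....ACATGTCGAT.....AGGTG", "TGCAC.....TCGATACATAAGGTG"]
A_PRIME = ["ACATG.....TCGAT.....AGGTG", ".....TGCACTCGATACATAAGGTG"]

same_sequences = ungapped(A) == ungapped(A_PRIME)
differing_columns = sum(
    p != q for p, q in zip(column_profiles(A), column_profiles(A_PRIME))
)
```

The sequences are identical, but the first ten column profiles are not, which is the representation degeneracy the slides describe.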

  35. We can instead replace the profile RC-MSA representation with the PO-MSA representation. In the PO-MSA representation, both A and A’
      A:
      .....ACATGTCGAT.....AGGTG
      TGCAC.....TCGATACATAAGGTG
      A’:
      ACATG.....TCGAT.....AGGTG
      .....TGCACTCGATACATAAGGTG
      can be represented as
      [Figure: one PO-MSA graph. The branches ACATG and TGCAC both lead into TCGAT, which continues to AGGTG either directly or through the branch ACATA.]

  36. We can align S to A using the Sequence to PO-MSA alignment algorithm.
      A: [Figure: PO-MSA graph in which the branches ACATG and TGCAC both lead into TCGAT, which continues to AGGTG either directly or through the branch ACATA.]
      S: TGCACTCGATACATAAGGTG

  37. We Can Replace the Linear Profile Intermediates Used in Progressive Multiple Sequence Alignment [Figure: guide tree with nodes numbered 1 to 15; leaves 1 to 8, root 15.]

  38. With PO-MSA Intermediates [Figure: the same guide tree, nodes numbered 1 to 15.]

  39. Requires PO-MSA to PO-MSA Alignment Algorithm We need to be able to align PO-MSA G and G’, yielding PO-MSA G’’, which is a fusion of G and G’. [Figure: graphs G and G’ and the fused graph G’’, with labels P*, P**, P***.]

  40. Recall Sequence to PO-MSA Alignment Algorithm Partial Order Alignment of a Sequence to an Alignment. Conventional Alignment of Two Sequences

  41. Sequence to PO-MSA Alignment Algorithm Requires a Simple Extension of Sequence to Sequence Alignment Algorithm [Figure: dynamic programming matrix cell (n,m); labels p1, p2, p3, …, pN, q, n, m.] Simply extend dynamic programming move set to include partial order moves: at each position (n,m) in the matrix, choose best move by: Considering all predecessor nodes p that have a directed edge p → n. Note: MATCH and INSERT moves may have more than one incoming edge p.
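The extension described on this slide can be sketched as an ordinary global alignment in which the "previous row" is replaced by the rows of every predecessor node. This is an illustrative sketch, not the POA source code: the graph encoding (a topologically ordered list of `(letter, predecessor indices)`), the helper `add_chain`, the linear gap penalty, and the scoring values are all assumptions, and the graph is assumed to be non-empty.

```python
def add_chain(graph, letters, preds):
    """Append a linear run of letters; the first new node hangs off `preds`."""
    for letter in letters:
        graph.append((letter, list(preds)))
        preds = [len(graph) - 1]
    return preds                      # index of the last node, as a list


def align_to_graph(seq, graph, match=1, mismatch=-1, gap=-1):
    """Best global score of seq against any start-to-end path through graph."""
    m = len(seq)
    start = [j * gap for j in range(m + 1)]        # virtual row before the graph
    rows = []
    has_successor = [False] * len(graph)
    for letter, preds in graph:                    # nodes are in topological order
        for p in preds:
            has_successor[p] = True
        previous = [rows[p] for p in preds] or [start]
        row = [max(r[0] for r in previous) + gap]
        for j in range(1, m + 1):
            sub = match if letter == seq[j - 1] else mismatch
            # Partial order moves: try every predecessor p with an edge p -> n.
            best = max(max(r[j - 1] + sub, r[j] + gap) for r in previous)
            row.append(max(best, row[j - 1] + gap))  # move within the same node
        rows.append(row)
    return max(row[m] for row, succ in zip(rows, has_successor) if not succ)


# The PO-MSA of alignments A and A' from the earlier slides.
graph = []
upper = add_chain(graph, "ACATG", [])
lower = add_chain(graph, "TGCAC", [])
shared = add_chain(graph, "TCGAT", upper + lower)
insert = add_chain(graph, "ACATA", shared)
add_chain(graph, "AGGTG", shared + insert)

score_s = align_to_graph("TGCACTCGATACATAAGGTG", graph)
```

Because S follows an existing path through the graph, it scores a perfect 20 matches under these values, and the result no longer depends on whether the graph came from A or A’.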

  42. Next Construct Resulting Fusion PO-MSA from Path Returned by Sequence to PO-MSA Alignment Algorithm: POA algorithm returns optimal alignment as a set of aligned node pairs. 1. Store aligned letters in linked lists called align rings. 2. Fuse identical nodes in align rings removing redundant edges.
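The two fusion steps can be sketched as below. The data layout is an assumption made for illustration (each node is a dict with a letter, a set of predecessor indices, and an align ring held as a set of node indices rather than a linked list), and `aligned` is a hypothetical mapping from sequence position to the graph node it was aligned to.

```python
def new_node(graph, letter):
    """Append an unconnected node and return its index."""
    graph.append({"letter": letter, "preds": set(), "ring": set()})
    return len(graph) - 1


def fuse_sequence(graph, seq, aligned):
    """Thread seq through graph, given aligned = {sequence index: node index}."""
    prev = None
    for j, letter in enumerate(seq):
        node = None
        if j in aligned:
            ring = {aligned[j]} | graph[aligned[j]]["ring"]
            for candidate in sorted(ring):
                if graph[candidate]["letter"] == letter:
                    node = candidate              # identical letter: fuse
                    break
            if node is None:                      # aligned but different letter
                node = new_node(graph, letter)
                for member in ring:
                    graph[member]["ring"].add(node)
                graph[node]["ring"] = set(ring)
        else:
            node = new_node(graph, letter)        # unaligned letter: new branch
        if prev is not None:
            graph[node]["preds"].add(prev)        # a set, so repeated edges collapse
        prev = node
    return graph


graph = fuse_sequence([], "ACG", {})                          # first sequence
graph = fuse_sequence(graph, "ATG", {0: 0, 1: 1, 2: 2})       # adds T in C's ring
graph = fuse_sequence(graph, "ACG", {0: 0, 1: 1, 2: 2})       # fuses completely
```

After the three calls the graph has four nodes: A, C, G, and a T node that shares an align ring with C, with G reachable from both C and T.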

  43. Sequence to PO-MSA Fusion Given sets of aligned nodes between PO-MSA G and sequence S, we can construct PO-MSA G’, which is a fusion of G and S. [Figure: graph G, sequence S, and the fused graph G’, with labels P*, P**, P***.]

  44. PO-MSA to PO-MSA Alignment Algorithm Requires a Simple Extension of Sequence to PO-MSA Alignment Algorithm [Figure: dynamic programming matrix cell (n,m); labels p1, p2, p3, …, pN, q1, q2, q3, …, qM, n, m.] Simply extend dynamic programming move set to include partial order moves: at each position (n,m) in the matrix, choose best move by: Considering all predecessor nodes that have a directed edge p → n and q → m. Note: MATCH and INSERT moves may have more than one incoming edge p or q.

  45. PO-MSA to PO-MSA Fusion Given a set of aligned nodes between PO-MSA G and PO-MSA G’, we can construct PO-MSA G’’, which is a fusion of G and G’. [Figure: graphs G and G’ and the fused graph G’’, with labels P*, P**, P***.]

  46. Progressive Multiple Sequence Alignment Using PO-MSA to PO-MSA Alignment Algorithm [Figure: guide tree with nodes numbered 1 to 15; leaves 1 to 8, root 15.]

  47. Future Work • Implement Profile Nodes and Profile Scoring in Progressive POA • Develop benchmark test set for multiple sequence alignment of complete protein sequences.

  48. Acknowledgements • I’d like to thank: • Chris Lee for all of his guidance and support. • Michael Quist for all of his help with this project. • Barmak Modrek with whom I worked on annotating alternative splicing using PO-MSAs and POA. • Everyone in the Lee Lab for hours of helpful discussion. • DOE CSGF for supporting the work. To use or download POA or POAVIZ go to: http://www.bioinformatics.ucla.edu/poa
