Multiple Sequence Alignment by Iterative Tree-Neighbor Alignments
Multiple Sequence Alignment by Iterative Tree-Neighbor Alignments Susan Bibeault June 9, 2000
Outline • Problem Statement and Importance • Terminology • Current Approaches • Our Alignment Heuristic • Performance Results • Conclusions • Future Work
Multiple Sequence Alignment
• Problem: given a sequence set, insert gaps into the sequences so that evolutionarily conserved regions are aligned
• Important tool
  • Relate Homologous Proteins
  • Discover Conserved Regions
Given sequence set:
  VLSPADNVKAAWGKVGAHAGEYGAEALERMF
  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY
  GLSDGEWQLVLNVWGKVEADIPGHVLIRLFK
Two alignments of this set shown on the slide:
  V-LSPADN--VKAAWGKVGAHAGEYGAEALERM---F-
  VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
  G-LSDGEWQLVLNVWGKVEA---DIPGHVLIRL---FK

  -VLSPADN--VKAAWGKVGAHAGEYGAEALERMF----
  VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
  -GLSDGEWQLVLNVWGKVEA---DIPGHVLIRLFK---
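Scoring Multiple Alignments
• Sum of Pairs: cost(i,j) summed over all pairs of sequences i, j
• Tree based: cost(edge) summed over the edges of the tree
[Figure: example tree over gorilla, human, orangutan, chimpanzee, gibbon; figure labels: cost(i,j) = 6, cost(edge) = 1m]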
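The two scoring schemes named on this slide can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names, the unit mismatch cost, and the gap cost of 2 are all assumptions chosen to keep the example small. In the tree-based variant, `edges` is a hypothetical list of index pairs saying which rows are neighbors in the tree.

```python
from itertools import combinations

def pair_cost(row_a, row_b, mismatch=1, gap=2):
    """Column-by-column cost of two aligned rows (toy costs, assumed)."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue              # column is a gap in both rows: contributes nothing
        if x == "-" or y == "-":
            total += gap          # residue aligned against a gap
        elif x != y:
            total += mismatch     # two different residues
    return total

def sum_of_pairs(rows):
    """Sum pair_cost over every unordered pair of rows."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

def tree_cost(rows, edges):
    """Sum pair_cost only over rows joined by a tree edge."""
    return sum(pair_cost(rows[u], rows[v]) for u, v in edges)

rows = ["AC-T", "ACGT", "A-GT"]
print(sum_of_pairs(rows))                 # all three pairs
print(tree_cost(rows, [(0, 1), (1, 2)]))  # a path-shaped tree 0 - 1 - 2
```

Sum of pairs counts every pair equally, so a cluster of near-identical sequences dominates the total; the tree-based sum counts only neighboring pairs, which is why it depends on the tree chosen.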
Scoring Alignments
• Cost Matrix: C(aa1, aa2)
• Gap Penalties
  • Simple: g = C(aa, -)
  • Affine: C(-) + Len * C(aa, -)
• Example pair: VLSPADNVKA and GLSDGEWQLVL
• Recurrence (simple gap penalty g):
  Cost(s[1..i], t[1..j]) = min( Cost(s[1..i], t[1..j-1]) + g,
                                Cost(s[1..i-1], t[1..j-1]) + C(s[i], t[j]),
                                Cost(s[1..i-1], t[1..j]) + g )
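The pairwise recurrence on this slide is the standard minimum-cost global alignment dynamic program. The sketch below assumes the simple (per-symbol) gap penalty and treats costs as additive, so each option adds its penalty to the cost of a shorter prefix pair. The name `align_cost` and the `sub_cost` callback are illustrative, not taken from the talk.

```python
def align_cost(s, t, sub_cost, gap):
    """Minimum cost of globally aligning s and t with a per-symbol gap penalty."""
    n, m = len(s), len(t)
    # cost[i][j] holds the best cost of aligning s[:i] with t[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap          # s[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        cost[0][j] = j * gap          # t[:j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i][j - 1] + gap,                              # gap in s
                cost[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # pair s[i] with t[j]
                cost[i - 1][j] + gap,                              # gap in t
            )
    return cost[n][m]

def unit(a, b):
    """Toy cost matrix (assumed): 0 for identical residues, 1 otherwise."""
    return 0 if a == b else 1

print(align_cost("VLSPADNVKA", "GLSDGEWQLVL", unit, gap=1))
```

In practice `sub_cost` would look up a PAM-derived cost matrix rather than the unit cost used here, and an affine penalty needs extra state per cell to distinguish opening a gap from extending one.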
Current Approaches
Global Alignment:
  ABCDEFGHI
  :::: ::::
  ABCD-FGHI
Local Alignment:
  XXXABCDYYY
     ::::
  ZZZABCDEEEE
• Global Methods
  • Optimal Algorithms (MSA, MWT, MUSEQAL)
  • Progressive (MULTALIGN, PILEUP, CLUSTAL, MULTAL, AMULT, DFALIGN, MAP, PRRP, AMPS)
• Local methods
  • PIMA, DIALIGN, PRALIGN, MACAW, BlockMaker, Iteralign
• Combined (GENALIGN, ASSEMBLE, DCA)
• Statistical (HMMT, SAGA, SAM, Match Box)
• Parsimony (MALIGN, TreeAlign)
Our Heuristic • Distance Estimation • Tree Construction • Node Initialization • Tree Partitioning • Iteration
Estimation of Protein Distance
• Aligned Sequences
    PEAAALYGRFT---IKSDVW
    PESAALYGRFT---IKSDVW
    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY
• Estimated Pair Distances
    PEALNYGWY----SSESDVW
    PEVIRMQDDNPFSFSQSDVY
• Issue: Implied vs. Optimal Pair Alignments
  Rows as they appear in the multiple alignment:
    PESLALYNKF---SIKSDVW
    PEALNYGRY----SSESDVW
  Pair alignment implied by those rows (shared gap columns removed):
    PESLALYNKFSIKSDVW
    PEALNYGRY-SSESDVW
  A different pair alignment of the same two sequences:
    PESLALYNKFSIKSDVW
    PEAL-NYGRYSSESDVW
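The "implied vs. optimal" issue can be reproduced with the sequences on this slide. The sketch below projects two rows out of a multiple alignment by dropping columns where both rows hold a gap, then counts identical residues in the implied pair alignment and in the alternative pair alignment shown on the slide. The function names are illustrative, and counting identities is only one possible way to compare the two; the talk does not state which distance measure is used.

```python
def implied_pair(row_a, row_b):
    """Project two rows of a multiple alignment onto a pair alignment
    by dropping every column in which both rows have a gap."""
    kept = [(x, y) for x, y in zip(row_a, row_b) if (x, y) != ("-", "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)

def identities(row_a, row_b):
    """Number of columns where both rows hold the same residue."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")

# Two rows taken from the multiple alignment on the slide.
msa_a = "PESLALYNKF---SIKSDVW"
msa_b = "PEALNYGRY----SSESDVW"
imp_a, imp_b = implied_pair(msa_a, msa_b)

# The alternative pair alignment of the same two sequences, from the slide.
alt_a = "PESLALYNKFSIKSDVW"
alt_b = "PEAL-NYGRYSSESDVW"

print(imp_a, imp_b, identities(imp_a, imp_b))
print(alt_a, alt_b, identities(alt_a, alt_b))
```

Because the implied alignment inherits gap placements forced by the other sequences, a distance estimated from it can differ from one estimated from a pair alignment computed directly.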
Interior Node Classification
• Interior Nodes Classified by Percent Identity
  • PID = (# matched residues) / (# total residues)
• User Specified Tiers
• User Specified Cost Criterion
• Example:
  • PID > 60%: PAM 40, High Gap Penalties
  • PID > 40%: PAM 120, Medium Gap Penalties
  • PID < 40%: PAM 200, Low Gap Penalties
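The example tiers above can be expressed as a small lookup. This sketch makes two assumptions the slide leaves open: "total residues" is taken to be the number of alignment columns that are not gaps in both rows, and a PID of exactly 40% falls into the lowest tier. The names `percent_identity` and `classify` are hypothetical.

```python
def percent_identity(row_a, row_b):
    """PID in percent over columns that are not gaps in both rows (assumed denominator)."""
    cols = [(x, y) for x, y in zip(row_a, row_b) if (x, y) != ("-", "-")]
    if not cols:
        return 0.0
    matched = sum(1 for x, y in cols if x == y)
    return 100.0 * matched / len(cols)

def classify(pid):
    """Map a PID to the (cost matrix, gap penalty level) tiers from the slide's example."""
    if pid > 60.0:
        return ("PAM 40", "high")
    if pid > 40.0:
        return ("PAM 120", "medium")
    return ("PAM 200", "low")   # includes PID == 40, an assumption

pid = percent_identity("AC-T", "ACGT")
print(pid, classify(pid))
```

Since the tiers are user specified in the talk, a fuller version would take the thresholds and matrix names as parameters rather than hard-coding them.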
Ordering Alignments
• Isolate Sub Trees: Threshold PID
• Order Alignments
  • Sub Tree
  • Border Nodes
  • Integrate All
Interior Alignments
• Sum of Pairs
• Bounded Search
• Implementation
  • Modular
  • Reentrant
  • Flexible Cost Criterion
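Generating Consensus
• Alignment (A1, A2, A3)
• Consensus X: minimize the sum over i of Di(Ai, X)
• For each position i: Xi = the symbol a that minimizes cost(a, A1i) + cost(a, A2i) + cost(a, A3i)
[Figure: node X joined to its three neighbors A1, A2, A3 by distances D1, D2, D3]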
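The per-position rule on this slide (pick the symbol whose summed cost to the three neighbors is smallest) can be sketched as below. The unit cost function, the tiny nucleotide-style alphabet, and the tie-breaking by alphabet order are assumptions for illustration; the talk's own cost criterion is user specified.

```python
def consensus(rows, cost, alphabet):
    """For each column, choose the symbol minimizing the summed cost to all rows.
    Ties are broken by order of appearance in `alphabet` (an assumption)."""
    out = []
    for column in zip(*rows):
        best = min(alphabet, key=lambda a: sum(cost(a, x) for x in column))
        out.append(best)
    return "".join(out)

def unit(a, b):
    """Toy cost (assumed): 0 for identical symbols, 1 otherwise; the gap is just another symbol."""
    return 0 if a == b else 1

neighbors = ["ACGT", "ACGA", "TCGA"]   # stand-ins for A1, A2, A3
print(consensus(neighbors, unit, "ACGT-"))
```

With a unit cost this reduces to a column-wise majority vote; with a PAM-based cost the chosen symbol need not appear in any of the three neighbors.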
Testing the Method • BAliBASE benchmark • “Correct” Alignments • Core Blocks of Conserved Motifs • Typical “Hard Problem” Sets • Protein Parsimony • Measures “Evolutionary Steps” of Alignment
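Baseline: BAliBASE SP [chart; "better" direction marked]
Baseline: BAliBASE TC [chart; "better" direction marked]
Baseline: ProtPars [chart; "better" direction marked]
Orphans/Families: BAliBASE SP [chart; "better" direction marked]
Orphans/Families: ProtPars [chart; "better" direction marked]
Larger Families [chart; "better" direction marked]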
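The SP and TC measures named in these charts compare a test alignment with a reference alignment. The sketch below follows the commonly described definitions (SP: fraction of residue pairs aligned in the reference that are also aligned in the test; TC: fraction of reference columns reproduced exactly in the test). It is a simplified illustration under those assumptions and does not reproduce the official BAliBASE scoring program, which may for instance restrict scoring to core blocks. All function names are hypothetical.

```python
def columns(rows):
    """Turn aligned rows into columns of per-sequence residue indices (None for a gap)."""
    counters = [0] * len(rows)
    cols = []
    for col in zip(*rows):
        entry = []
        for k, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[k])   # index of this residue within sequence k
                counters[k] += 1
        cols.append(tuple(entry))
    return cols

def residue_pairs(cols):
    """Set of (seq_a, index_a, seq_b, index_b) for residues sharing a column."""
    pairs = set()
    for col in cols:
        for a in range(len(col)):
            for b in range(a + 1, len(col)):
                if col[a] is not None and col[b] is not None:
                    pairs.add((a, col[a], b, col[b]))
    return pairs

def sp_score(test, ref):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref_pairs = residue_pairs(columns(ref))
    return len(ref_pairs & residue_pairs(columns(test))) / len(ref_pairs)

def tc_score(test, ref):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = columns(ref)
    test_cols = set(columns(test))
    return sum(1 for c in ref_cols if c in test_cols) / len(ref_cols)

ref = ["AC-T", "ACGT"]
test = ["ACT-", "ACGT"]
print(sp_score(test, ref), tc_score(test, ref))
```

Both scores lie between 0 and 1, with higher values meaning closer agreement with the reference; ProtPars, by contrast, reports a parsimony step count and needs no reference alignment.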
Conclusions • Solution Quality • Captures Evolutionary Information • Iterations Converge Quickly • Useful Tool
Future Work • Improved Alignment Consensus • Multiple Partitioning Thresholds • Multiple Solutions • Integrated Phylogeny Modifications • Parallel Implementation