
Advanced Programming Project – 236512: Optimal Distance Functions for Reconstructing Evolutionary Trees, Spring Semester 2010


Presentation Transcript


  1. Advanced Programming Project – 236512: Optimal Distance Functions for Reconstructing Evolutionary Trees, Spring Semester 2010. http://webcourse.cs.technion.ac.il/236512/

  2. Project topic: The Effect of Distance Functions on the Reconstruction of Evolutionary Trees • After the announcements, today we will give a short crash course on: • Evolutionary trees: definitions and DNA-based models • Distance-based methods for building evolutionary trees • Distance functions for evolutionary models • Estimating distances between species from the differences between their DNA sequences • After the "crash course" you will still need some supplementary material later in the semester. • The projects will be presented during the "crash course".

  3. Administration. Prerequisites: Algorithms 1, Probability. Recommended (but not required): Algorithms in Computational Biology. As a rule, projects are done in pairs. Let us know the division into pairs within a week (by email). Choosing a project: there will be two main directions. The initial phase is similar in both directions. Focusing on a specific direction will happen later (within about a month). (From here on, the slides are in English.)

  4. Crash course on evolutionary distances

  5. The Phylogenetic Reconstruction Problem

  6. Evolution is modeled by DNA sequences which evolve along an Evolution Tree (Phylogeny). (All our sequences are DNA sequences, consisting of {A,G,C,T}.) [Figure: an evolution tree whose nodes are labeled with the sequences ACGGTCA, AAAGTCA, ACGGATA, ACGGGTA, AAAGGCG, AAACACA, AAAGCTG, GGGGATT, TCTGGTA, ACCCGTG, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG]

  7. Phylogenetic Reconstruction. [Figure: only the ten sequences GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG are shown]

  8. Phylogenetic Reconstruction. Input sequences: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG. [Figure: a rooted tree over the taxa A through J is reconstructed from the sequences] Goal: reconstruct the ‘true’ tree as accurately as possible.

  9. Three Methods of Tree Construction • Parsimony: a tree with the minimum number of mutations. • Maximum likelihood: finding the “most probable” tree. • Distance: a weighted tree that realizes the distances between the species.

  10. Distance Based Reconstruction: Exact vs. approximate distances. [Figure: an edge-weighted ‘true’ tree on taxa A through G defines exact distances, from which the tree is reconstructed in $O(n^2)$; with noise α added to the distances, the result is the reconstructed tree] Major problem: sensitivity to noise.

  11. The Algorithmic Aspect. [Figure: the edge-weighted ‘true’ tree on taxa A through G, its exact distances, and reconstruction in $O(n^2)$] Many algorithms can reconstruct a weighted tree from the exact distances. In this project we will use the “Saitou & Nei Neighbor Joining algorithm”, or simply the “NJ algorithm”.
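
As a rough illustration of the NJ algorithm named on this slide, the following is a minimal Python sketch rather than the implementation used in the course. The function name `neighbor_joining`, the dict-of-dicts input format, and the `internalN` node labels are assumptions made for this example only.

```python
def neighbor_joining(names, dist):
    """Sketch of Saitou & Nei Neighbor Joining.

    names: list of taxon labels (hypothetical input format).
    dist:  symmetric dict of dicts, dist[a][b] is the distance between a and b.
    Returns an unrooted tree as a list of edges (node, node, length).
    """
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each active node to all other active nodes.
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        pairs = [(x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]]
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - total[a] - total[b].
        a, b = min(pairs, key=lambda p: (n - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]])
        u = f"internal{next_id}"
        next_id += 1
        # Lengths of the two new edges joining a and b to the new node u.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from u to every remaining node.
        d[u] = {}
        for c in nodes:
            if c != a and c != b:
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges
```

On exact (additive) distances the sketch recovers the generating tree, for example a quartet with leaf edges 1, 2, 3, 4 and an internal edge of 0.5. Each iteration costs $O(n^2)$ here, so this naive version is slower than the bound quoted on the slide.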

  12. The Distance Estimation Aspect (noise α). Evolutionary Distances: • How are they defined? • How are they extracted from the DNA sequences? We’ll show this on a specific model, the Kimura 2 Parameters (K2P) model.

  13. The Kimura 2 Parameter (K2P) model [Kimura80]: each edge corresponds to a “Rate Matrix”. [Figure: the K2P generic rate matrix for an edge (u, v), with separate rates for transitions and transversions]
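
The rate matrix itself did not survive in this transcript. For orientation, the usual textbook form of a K2P rate matrix is sketched below; the state order A, G, C, T and the use of α for the transition rate and β for each of the two transversion rates are assumptions, chosen to agree with the α + 2β total rate that appears on the next slide.

```latex
R \;=\;
\begin{pmatrix}
-(\alpha+2\beta) & \alpha & \beta & \beta \\
\alpha & -(\alpha+2\beta) & \beta & \beta \\
\beta & \beta & -(\alpha+2\beta) & \alpha \\
\beta & \beta & \alpha & -(\alpha+2\beta)
\end{pmatrix}
\qquad \text{(rows and columns ordered A, G, C, T)}
```

In this form A↔G and C↔T are the transitions, and every purine-pyrimidine change is a transversion.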

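  14. K2P standard distance: Δtotal = total substitution rate. The total substitution rate of a K2P rate matrix R is $\Delta_{total}(R) = \alpha + 2\beta$. This is the expected number of mutations per site. It is an additive distance: on a path u, v, w whose edges have rates $\alpha + 2\beta$ and $\alpha' + 2\beta'$, the total rate is $(\alpha+\alpha') + 2(\beta+\beta')$.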

  15. The distance Δtotal(Ruv) = dK2P(u,v) is estimated from the aligned sequences. Since mutations may overwrite each other, this is a “noisy” process. [Figure: the K2P total rate “distance correction” procedure]
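
The correction formula is not preserved in this transcript. The sketch below uses the standard Kimura (1980) estimator, $d = -\tfrac12\ln(1-2P-Q) - \tfrac14\ln(1-2Q)$, where P and Q are the observed fractions of transition and transversion differences. The function name `k2p_distance` and the assumption that the inputs are gap-free aligned strings over A, C, G, T are made for this example.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Estimate the K2P total-rate distance between two aligned sequences.

    Assumes equal-length, gap-free strings over A, C, G, T.
    Returns math.inf when the sites are saturated (a log argument is <= 0).
    """
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    p = transitions / n
    q = transversions / n
    x = 1.0 - 2.0 * p - q
    y = 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        return math.inf
    return -0.5 * math.log(x) - 0.25 * math.log(y)
```

For short sequences the observed P and Q fluctuate around their expectations, which is the source of the “noise” in the estimated distance.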

  16. A basic question: How good is a reconstruction method which uses K2P distances? The performance of tree reconstruction methods is often tested on quartets, which are trees with 4 taxa. A quartet contains a single internal edge, which defines the quartet-split. [Figure: a quartet on taxa A, B, C, D whose internal edge has weight wsep]

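  17. A correct reconstruction of the quartet requires finding the true quartet-split. There are 3 possible splits: AB|CD, AC|BD and AD|BC. Distance methods reconstruct the true split by the 4-point condition. If the true split is AB|CD and the internal edge has weight wsep, then: $d(A,B) + d(C,D) + 2w_{sep} = d(A,C) + d(B,D) = d(A,D) + d(B,C)$. The 4-point condition for noisy distances is: $d(A,B) + d(C,D) < \min\{d(A,C) + d(B,D),\ d(A,D) + d(B,C)\}$.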
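
The 4-point condition can be sketched in a few lines: among the three ways of pairing the four taxa, pick the pairing whose two within-pair distances have the smallest sum. The function name `resolve_quartet` and the two-letter string keys are assumptions made for this example.

```python
def resolve_quartet(d):
    """Return the split chosen by the 4-point condition.

    d maps unordered pairs of the taxa A, B, C, D, written as two-letter
    strings such as "AB", to their (possibly noisy) distances.
    """
    sums = {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }
    # The split with the smallest sum is the one supported by the distances.
    return min(sums, key=sums.get)
```

With exact distances from a quartet with split AB|CD, the two losing sums are equal and exceed the winning sum by twice the weight of the internal edge, so a short internal edge leaves little margin for noise.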

  18. We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test. [Figure: the rooted model quartet with leaves A, B, C, D; t is the “evolutionary time”] The diameter of the quartet is 22t.

  19. Phase A: simulate evolution. [Figure: sequences evolve from the root along the quartet to the leaves A, B, C, D]

  20. Phase B: reconstruct the split by the 4p condition. Compute the distances between the sequences at A, B, C, D [distance matrix], then apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.
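
Phases A and B can be sketched together as a small simulation. This is an illustrative sketch under stated assumptions, not the course's test harness: the quartet shape (one rate for all four leaf edges and one for the internal edge), the parameter `kappa` = α/β, the function names, and the standard K2P closed forms for the substitution probabilities and the distance correction are all choices made for this example.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def evolve(seq, rate, kappa, rng):
    """Phase A: evolve seq along one edge with total rate alpha + 2*beta = rate
    and alpha / beta = kappa (both parameter names are hypothetical)."""
    beta = rate / (kappa + 2.0)
    alpha = kappa * beta
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * (alpha + beta))
    p_ts = 0.25 + 0.25 * lam - 0.5 * mu  # probability of a transition
    p_tv = 0.5 - 0.5 * lam               # probability of any transversion
    out = []
    for base in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[base])
        elif r < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return "".join(out)


def k2p_distance(s, t):
    """Standard K2P total-rate estimate; math.inf when saturated."""
    n = len(s)
    p = sum(a != b and TRANSITION[a] == b for a, b in zip(s, t)) / n
    q = sum(a != b and TRANSITION[a] != b for a, b in zip(s, t)) / n
    x, y = 1.0 - 2.0 * p - q, 1.0 - 2.0 * q
    if x <= 0.0 or y <= 0.0:
        return math.inf
    return -0.5 * math.log(x) - 0.25 * math.log(y)


def split_failure_rate(leaf_rate, inner_rate, kappa, length, trials, seed=0):
    """Phase B: fraction of trials in which the 4p condition misses AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        u = "".join(rng.choice("ACGT") for _ in range(length))
        v = evolve(u, inner_rate, kappa, rng)
        a = evolve(u, leaf_rate, kappa, rng)
        b = evolve(u, leaf_rate, kappa, rng)
        c = evolve(v, leaf_rate, kappa, rng)
        d = evolve(v, leaf_rate, kappa, rng)
        true_sum = k2p_distance(a, b) + k2p_distance(c, d)
        other = min(k2p_distance(a, c) + k2p_distance(b, d),
                    k2p_distance(a, d) + k2p_distance(b, c))
        if not true_sum < other:  # ties and saturated distances count as failures
            failures += 1
    return failures / trials
```

A call such as `split_failure_rate(0.05, 0.2, 2.0, 500, 20, seed=1)` runs 20 trials on 500-site sequences; the slide's 10,000 repetitions correspond to `trials=10000`. Starting the simulation at an internal node rather than at a root on the internal edge does not change the leaf distribution here, because K2P is reversible with a uniform stationary distribution.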

  21. The split resolution test was applied to the model quartet with various diameters. • For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).

  22. Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2. [Figure: results for the template quartet]

  23. Performance for larger diameters: “site saturation”. [Figure: results for larger diameters]

  24. When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions. [Figure: the K2P rate matrix with transitions at rate α and transversions at rate β] This is the CFN model [Cavender78, Farris73, Neyman71].

  25. Apply the same split resolution test to the transversions-only distance. [Figure: the transversions-only distance correction procedure]
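
The transversions-only correction is not preserved in this transcript. A minimal sketch follows, assuming the standard two-state (CFN-style) estimator $-\tfrac12\ln(1-2Q)$, where Q is the observed fraction of transversion differences; this normalization, which estimates 2β, and the name `transversion_distance` are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}


def transversion_distance(seq1, seq2):
    """Estimate a transversions-only distance between two aligned sequences.

    Assumes equal-length, gap-free strings over A, C, G, T. Transitions are
    ignored entirely; returns math.inf when 1 - 2Q <= 0 (saturation).
    """
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    # A site is a transversion when exactly one of the two bases is a purine.
    q = sum((a in PURINES) != (b in PURINES) for a, b in zip(seq1, seq2)) / len(seq1)
    y = 1.0 - 2.0 * q
    if y <= 0.0:
        return math.inf
    return -0.5 * math.log(y)
```

Because only the slower transversion process is measured, this estimate saturates later than the total-rate estimate when β < α, but it discards the transition signal that helps at short distances.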

  26. Transversions only performs better on large rates and worse on small rates. [Figure: comparison of the transversions-only distance with the total K2P rate]

  27. Conclusion: Distance based reconstruction methods should be adaptive: find a distance function d which is good for the input. [Figure: an input distance matrix] Project's goal: evaluate the performance of distance functions in reconstructing phylogenies.

  28. 1st step in finding good distance functions (for the K2P model): Characterize the available distance functions. Ideally, we would like to use the K2P distance associated with the rate matrix of each edge, but...

  29. Rate matrices are hard to observe, hence we use substitution matrices. [Figure: an edge (u, v); a finite sequence evolves from u to v under unknown model parameters α, β, which defines a stochastic substitution matrix Puv]

  30. Substitution matrices are extended to paths: for a path u, v, w, $P_{uw} = P_{uv} P_{vw}$.

  31. Substitution matrices are converted to distances by a Substitution Rate (SR) function. • An SR function needs to satisfy the following for all substitution matrices P, Q in K2P: • Δ(PQ) = Δ(P) + Δ(Q) (additivity) • Δ(P) > 0 (positivity)

  32. To define SR functions which are additive, Δ(PQ) = Δ(P) + Δ(Q), we use some linear algebra.

  33. Lemma: There is a matrix U which diagonalizes each K2P substitution matrix P: $U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$, where $0 < \lambda_P < 1$ and $0 < \mu_P < 1$.

  34. Let P, Q be two matrices in K2P. Then: $U^{-1} P U = \mathrm{diag}(1, \lambda_P, \mu_P, \mu_P)$, $U^{-1} Q U = \mathrm{diag}(1, \lambda_Q, \mu_Q, \mu_Q)$, and $U^{-1} PQ\, U = (U^{-1} P U)(U^{-1} Q U) = \mathrm{diag}(1, \lambda_P\lambda_Q, \mu_P\mu_Q, \mu_P\mu_Q)$.

  35. Hence, the functions $D_\lambda(P) = -\ln(\lambda_P)$ and $D_\mu(P) = -\ln(\mu_P)$ are additive distance functions for the K2P model. Proof: $D_\lambda(PQ) = -\ln(\lambda_P\lambda_Q) = -\ln(\lambda_P) - \ln(\lambda_Q) = D_\lambda(P) + D_\lambda(Q)$, and the same for $D_\mu(P) = -\ln(\mu_P)$.
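
The additivity argument can be checked numerically. The sketch below builds K2P substitution matrices directly from their eigenvalues, multiplies two of them, and reads the eigenvalues of the product back from its entries. The state order A, G, C, T, the closed forms used for the entries, and all function names are assumptions made for this example.

```python
import math


def k2p_matrix(lam, mu):
    """K2P substitution matrix with eigenvalues (1, lam, mu, mu), states A, G, C, T."""
    same = 0.25 + 0.25 * lam + 0.5 * mu   # no change
    ts = 0.25 + 0.25 * lam - 0.5 * mu     # transition
    tv = 0.25 - 0.25 * lam                # each of the two transversions
    return [
        [same, ts, tv, tv],
        [ts, same, tv, tv],
        [tv, tv, same, ts],
        [tv, tv, ts, same],
    ]


def matmul(x, y):
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def eigenvalues(m):
    """Recover (lam, mu) from the first row of a K2P-structured matrix."""
    same, ts, tv = m[0][0], m[0][1], m[0][2]
    return same + ts - 2.0 * tv, same - ts


def d_lambda(m):
    return -math.log(eigenvalues(m)[0])


def d_mu(m):
    return -math.log(eigenvalues(m)[1])
```

For example, with `P = k2p_matrix(0.8, 0.6)` and `Q = k2p_matrix(0.9, 0.5)`, `d_lambda(matmul(P, Q))` agrees with `d_lambda(P) + d_lambda(Q)` up to rounding, and likewise for `d_mu`.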

  36. Moreover, each positive linear combination of Dλ and Dμ is an additive distance function. Our goal: given a set of input sequences, select the D which guarantees the best reconstruction of the true tree.
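
As a worked example of such linear combinations, the two distances used earlier in the talk can be written in terms of Dλ and Dμ. The derivation assumes the standard K2P eigenvalues $\lambda_P = e^{-4\beta}$ and $\mu_P = e^{-2(\alpha+\beta)}$ for an edge with transition rate α and per-target transversion rate β, and it assumes that Δtv is normalized as the expected number of transversions per site; neither assumption is stated explicitly in the transcript.

```latex
\begin{aligned}
D_\lambda(P) &= -\ln \lambda_P = 4\beta, \qquad
D_\mu(P) = -\ln \mu_P = 2(\alpha+\beta), \\
\Delta_{total}(P) &= \alpha + 2\beta = \tfrac14 D_\lambda(P) + \tfrac12 D_\mu(P), \\
\Delta_{tv}(P) &= 2\beta = \tfrac12 D_\lambda(P).
\end{aligned}
```

Under these assumptions Δtotal puts positive weight on both functions, while Δtv sits on the boundary of the family with zero weight on Dμ.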

  37. The approximate distance function is defined by the observable noisy version of the substitution matrices. [Figure: a path u, v, w with the sequences ACGGTCA, ACGGATA, GGGGATT] We would like to use functions which minimize the influence of the “noise” on the reconstruction. Such a function can be defined and computed analytically for a single distance. Computing it for even small trees looks hard.

  38. Summary • We have infinitely many additive distance functions for the K2P model. • Which one should we use for the given input DNA sequences? • If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good. • But we have only finite sequences, whose alignments provide only estimates of the true substitution matrices.

  39. 3 phases of the project • Phase 1: Distance functions on simulated quartets: 1 month • Phase 2: Distance functions on larger simulated trees: 1+ month • Phase 3: Extensions to real data and/or different models: 1 month • Phases 2 and 3 are flexible

  40. Phase I: Quartets (~one month) • Study the relevant info in “Towards Optimal....” • http://webcourse.cs.technion.ac.il/236512/Spring2010/ho/WCFiles/optimal_distance_functions.pdf • Write a program (in MATLAB or C..) which computes optimal distance functions as in the above paper. • Repeat the “quartet resolution test” given in this presentation, and extend it to include optimal distance functions. • Feel free to modify the simulation as you see fit.

  41. Phase II: Reconstructing Larger Trees using the Neighbor Joining Algorithm. Study the Neighbor Joining algorithm, the Newick tree representation, and the Robinson-Foulds measure. Make similar tests, but this time on larger trees. Implementations of NJ and “Tree Templates” can be downloaded from the web. More information will be given later, either via the course site or in a meeting.

  42. Phase III: Trees from Real Data. Get homologous DNA sequences from existing databases. Align the sequences using public domain software. Select appropriate distance functions and use them to estimate distances between the aligned sequences. Use the various distance functions to reconstruct the trees, and compare their performance.
