Locating conserved genes in whole genome scale

Locating conserved genes in whole genome scale. Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS). Outline. Motivation Challenges of Whole Genome Alignment Four approaches and their performance

Presentation Transcript


  1. Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)

  2. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  4. Mouse & Human • Mouse and human are genetically very similar • Do they look the same? • What do we mean by similar? • Many genes found in human are also found in mouse: these are conserved genes [Figure labels: Mouse Chromosome 16, Human Chromosome 16, m16, h03]

  5. Whole Genome Alignment • Identify regions on the genomes that possibly contain their conserved genes • Differences in the ordering of conserved genes could be related to mutations • For related species, the number of mutations is usually small [Figure: Genome A with Gene X, Gene Y, Gene Z; Genome B with Gene X, Gene Z, Gene Y; the changed order is marked "possibly a mutation"]

  6. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  7. Data size • Usually very large (e.g., human chromosomes vs mouse chromosomes) • Cannot use global alignment tools because of the large size

  8. Observations • A conserved gene may not be identical in the two genomes; nevertheless, there are some common substrings unique to this conserved gene (each called a MUM, a maximal unique match) • Locate all MUMs over the two genomes; however, not every MUM corresponds to a conserved gene [Figure labels: Gene X, Gene Y, Gene Y, Gene X, Noise]

  9. Number of MUMs • The number of MUMs is small compared with the chromosome length

  10. MUMs for M16-H03 Conserved genes Mouse Chromosome 16 Human Chromosome 03

  11. How to choose the right MUMs? Generation of MUM using suffix tree
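A minimal sketch of what "generating MUMs" means, assuming the usual definition (a match that occurs exactly once in each sequence and cannot be extended on either side). This is a brute-force illustration, not the suffix-tree construction the slide refers to; the names `naive_mums`, `occurs_once`, and `min_len` are hypothetical.

```python
def occurs_once(text, sub):
    """True if `sub` occurs exactly once in `text` (overlapping hits counted)."""
    first = text.find(sub)
    return first != -1 and text.find(sub, first + 1) == -1


def naive_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, length) for every MUM of a and b.

    Brute force, illustration only: a suffix tree does this far faster.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the characters agree.
            n = 0
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            sub = a[i:i + n]
            # Keep the match only if it is unique in both sequences.
            if n >= min_len and occurs_once(a, sub) and occurs_once(b, sub):
                mums.append((i, j, n))
    return mums
```

For example, `naive_mums("TTACGTAA", "CCACGTGG", min_len=3)` reports the shared substring `ACGT` at position 2 in both sequences.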

  12. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  13. MUM Selection • MUMmer-1 [Delcher et al. Nucleic Acids Research 1999] • longest common subsequences (effectively assume no mutations) • MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004] • clustering heuristics • most popular tool to uncover conserved genes in WG scale • MaxMinCluster [Wong et al. Bioinformatics 2004*] • clustering, optimization • MSS (Mutation Sensitive Selection) [Chan et al. Bioinformatics 2005*] • capture mutations • Hybrid approach [Chan et al. Bioinformatics 2005*] • combine mutation sensitive and clustering approaches (* our results)

  14. Overview of Results • Average coverage (sensitivity), in % • coverage: % of published conserved genes reported • sensitivity: % of MUMs reported that reside in published conserved genes

  15. Overview of Results • Average coverage (sensitivity), in % • coverage: % of published conserved genes reported • sensitivity: % of MUMs reported that reside in published conserved genes • MSS outperforms MaxMinCluster and MUMmer-3 on closely related species

  16. Overview of Results • Average coverage (sensitivity), in % • coverage: % of published conserved genes reported • sensitivity: % of MUMs reported that reside in published conserved genes • BUT MSS performs worse on species relatively farther apart

  17. Overview of Results • Average coverage (sensitivity), in % • coverage: % of published conserved genes reported • sensitivity: % of MUMs reported that reside in published conserved genes • both hybrid approaches perform well for species farther apart
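The two measures defined on these slides can be computed roughly as follows. This is a sketch under stated assumptions: intervals are `(start, end)` pairs on a single genome, a published gene counts as "reported" when at least one selected MUM lies inside it, and both input lists are non-empty. The function names are hypothetical.

```python
def inside(mum, gene):
    """True if the MUM interval lies entirely within the gene interval."""
    return gene[0] <= mum[0] and mum[1] <= gene[1]


def coverage_and_sensitivity(reported_mums, published_genes):
    """Return (coverage %, sensitivity %) as defined on the slide.

    Both arguments are non-empty lists of (start, end) intervals.
    """
    # Coverage: share of published genes hit by at least one reported MUM.
    genes_hit = sum(
        any(inside(m, g) for m in reported_mums) for g in published_genes
    )
    # Sensitivity: share of reported MUMs that fall inside some published gene.
    mums_in_genes = sum(
        any(inside(m, g) for g in published_genes) for m in reported_mums
    )
    coverage = 100.0 * genes_hit / len(published_genes)
    sensitivity = 100.0 * mums_in_genes / len(reported_mums)
    return coverage, sensitivity
```

With four published genes of which two contain a reported MUM, and four reported MUMs of which three lie in a gene, the result is 50% coverage and 75% sensitivity.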

  18. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  19. Longest Common Subsequence LCS
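A sketch of an order-preserving selection in the spirit of the LCS approach: keep the heaviest set of MUMs whose positions increase in both genomes. It assumes each MUM is a `(pos_a, pos_b, length)` tuple and that the length serves as the weight; the quadratic dynamic program and the name `best_increasing_chain` are illustrative choices, not necessarily what MUMmer-1 does internally.

```python
def best_increasing_chain(mums):
    """Heaviest chain of MUMs appearing in the same order in both genomes.

    mums: list of (pos_a, pos_b, length); length is used as the weight.
    Simple O(n^2) dynamic program.
    """
    mums = sorted(mums)              # order by position in genome A
    n = len(mums)
    if n == 0:
        return []
    best = [m[2] for m in mums]      # best[i]: heaviest chain ending at MUM i
    prev = [-1] * n                  # back-pointers for reconstruction
    for i in range(n):
        for j in range(i):
            ordered = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if ordered and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    # Walk back from the end of the heaviest chain.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]
```

Because every selected MUM must follow the previous one in both genomes, a rearranged conserved gene is dropped, which is why this style of selection effectively assumes no mutations.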

  20. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks [Callout: the LCS approach (MUMmer-1) does not take mutations into account; MUMmer-2 & -3 cluster by heuristic; MaxMinCluster formalizes clustering as a combinatorial optimization problem]

  21. Clustering approach • Observations • Noise MUMs are usually short and isolated • A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters [Figure labels: Gene X, Gene Y, Gene Y, Gene X, Noise]
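The observation can be turned into a simple filter: chain MUMs that lie close together and discard groups that are too short. The sketch below is a plain clustering pass without any noise tolerance; it borrows the parameter names Gap and MinSize from the noisy-cluster slides, while the closeness test (measured in both genomes) and the name `simple_clusters` are assumptions for illustration, not the MUMmer or MaxMinCluster procedure.

```python
def simple_clusters(mums, gap=100, min_size=40):
    """Group nearby MUMs into clusters and keep the sufficiently long ones.

    mums: list of (pos_a, pos_b, length).
    A MUM joins the current cluster if it starts at most `gap` positions
    after the previous MUM ends, in both genomes.
    """
    clusters, current = [], []
    for m in sorted(mums):
        if current:
            last = current[-1]
            close_a = 0 <= m[0] - (last[0] + last[2]) <= gap
            close_b = 0 <= m[1] - (last[1] + last[2]) <= gap
            if not (close_a and close_b):
                clusters.append(current)   # close the current cluster
                current = []
        current.append(m)
    if current:
        clusters.append(current)
    # Short clusters (total MUM length below min_size) are treated as noise.
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]
```

An isolated MUM of length 15 is dropped under these defaults, while two nearby MUMs of lengths 20 and 25 survive as one cluster.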

  22. Challenge • Challenge: some conserved genes do not induce clusters of sufficient length • Solution: relax the definition of clusters to allow the presence of noise

  23. Noisy cluster • Suppose Gap=100, MinSize=40 [Figure: a 1-noisy cluster, annotated "> 100 apart" and "length = 20"]

  24. Noisy cluster • Suppose Gap=100, MinSize=40 [Figure: a 2-noisy cluster, annotated "> 100 apart" and "length = 20"]

  25. MaxMinCluster • Problem formulation • find a collection of k-noisy clusters such that the smallest cluster has the maximum weight • Dynamic programming: O(k^2 n^2) time, O(k^2 n) space

  26. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks [Callout: capture mutations more directly]

  27. Mutation Sensitive Selection • select subsets of MUMs transformed by a few mutations • three types of mutations: reversal, transposition, reversed-transposition [Figure label: subset of MUMs]

  28. k-mutated subsequences • Given two sequences A & B and an integer k, a subsequence X of A and a subsequence Y of B form a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations • MUMs are signed; a reversal flips the sign of the MUMs [Figure: a pair of 2-mutated subsequences, showing a reversal and a transposition]

  29. Mutation Sensitive Selection • Problem formulation: • To find a pair of k-mutated subsequences with maximum weight • We believe that the problem is NP-hard • The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem • We give an efficient approximation algorithm • the resulting weight is at least 1/(3k+1) times the maximum possible weight • O(n^2 log n + k n^2) time, O(n^2) space

  30. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  31. Hybrid Approach • first apply the clustering approach to identify clusters which are obviously conserved genes • can apply either MUMmer-3 or MaxMinCluster • these clusters are treated as MUMs with bigger weight • then apply MSS to process these MUMs together with the remaining MUMs
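The hand-off between the two stages might look like the sketch below: each cluster is collapsed into a single heavy item, and the result is merged with the MUMs that were not clustered, ready for a mutation-sensitive selection step. The slide does not say how the bigger weight is computed; summing the MUM lengths is an assumption here, and `collapse_cluster` and `hybrid_input` are hypothetical names.

```python
def collapse_cluster(cluster):
    """Turn a cluster of (pos_a, pos_b, length) MUMs into one weighted item."""
    first = min(cluster)                    # earliest MUM anchors the item
    weight = sum(m[2] for m in cluster)     # assumed weighting: total MUM length
    return (first[0], first[1], weight)


def hybrid_input(clusters, remaining_mums):
    """Build the input for the second stage: heavy items plus leftover MUMs."""
    heavy = [collapse_cluster(c) for c in clusters]
    return sorted(heavy + list(remaining_mums))
```

The selection step itself (MSS) is not shown; this only illustrates how clustered MUMs can carry more weight than isolated ones.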

  32. Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks

  33. Remarks • Experiments show that • MaxMinCluster > LCS • MSS > MaxMinCluster for closely related species • MSS does not perform well for species relatively farther apart • Hybrid approach is the best for both closely related and farther apart species

  34. Thank you! Q & A

  35. Approximation Algorithm • Super-Backbone • maximum weight common subsequences • Identify k mutation blocks • having high weight • do not overlap with Super-Backbone too much • this is formulated as a sub-problem and solved optimally by dynamic programming • Report Super-Backbone & k mutation blocks • O(n^2 log n + k n^2) time, O(n^2) space

  36. Mutations • three types of mutations: reversal, transposition, reversed-transposition
      original:               a b c d e f g h i j k l m n o p q r s t u v w x y z
      reversal:               a d c b e f g h i j k l m n o p q r s t u v w x y z
      transposition:          a d c b e k l m n o p q r s t u v w x y f g h i j z
      reversed-transposition: a d c b e k l t s r q p o m n u v w x y f g h i j z
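The three operations can be replayed on the alphabet example from this slide. This is a minimal sketch on plain strings; the function names and the index convention (half-open block `[i, j)` moved to sit just before original index `k`) are assumptions, and the sign flip that a reversal applies to signed MUMs is not modelled.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, k, reverse=False):
    """Move the block seq[i:j] so that it sits just before original index k.

    With reverse=True the block is also reversed (reversed-transposition).
    k must lie outside the block: k >= j or k <= i.
    """
    block = seq[i:j][::-1] if reverse else seq[i:j]
    if k >= j:                                   # move the block to the right
        return seq[:i] + seq[j:k] + block + seq[k:]
    if k <= i:                                   # move the block to the left
        return seq[:k] + block + seq[k:i] + seq[j:]
    raise ValueError("k must not fall inside the moved block")


s0 = "abcdefghijklmnopqrstuvwxyz"
s1 = reversal(s0, 1, 4)                          # reverse "bcd"
s2 = transposition(s1, 5, 10, 25)                # move "fghij" before "z"
s3 = transposition(s2, 9, 15, 7, reverse=True)   # reverse "opqrst", move before "m"
```

Each step is applied to the result of the previous one, so `s3` is three mutations away from `s0`.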
