
Identifying conserved spatial patterns in genomes

Identifying conserved spatial patterns in genomes. Rose Hoberman, Computer Science Department, Carnegie Mellon University. University of Chicago, Oct 20, 2006. My focus: Spatial Comparative Genomics.


Presentation Transcript


  1. Identifying conserved spatial patterns in genomes. Rose Hoberman, Computer Science Department, Carnegie Mellon University. University of Chicago, Oct 20, 2006

  2. My focus: Spatial Comparative Genomics. Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves.

  3. A simple model of a genome: an ordered list of genes (figure: a row of numbered genes)

  4. Genomic Change: Sequence Mutation + Chromosomal Rearrangements (figure: an ancestral genome diverging through speciation into species 1 and species 2)

  5. Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss (figure: numbered gene orders before and after each rearrangement)

  6. Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss • Fissions and fusions (figure: numbered gene orders before and after each rearrangement)

  7. An Essential Task for Spatial Comparative Genomics: Identify chromosomal regions that descended from the same region in the genome of the common ancestor (figure: numbered gene orders of Species 1 and Species 2)

  8. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

  9. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Pevzner, Tesler. Genome Research 2003

  10. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Guillaume Bourque et al. Genome Research 2004

  11. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications (figure labels: ancestral chromosome, whole genome duplication, chromosome 1, chromosome 2)

  12. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications McLysaght et al Nature Genetics, 2002.

  13. Identification of conserved chromosomal structure is a key task in comparative genomics • Understand gene function and regulation in bacteria • Infer functional associations • Predict operons • Identify horizontal transfers (figure labels: Insertion, Loss, Inversions)

  14. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

  15. Closely related genomes: Related regions are easy to identify (figure: numbered gene orders of Species 1 and Species 2)

  16. Five hundred million years...

  17. More Distantly Related Genomes: Homologous regions are harder to detect, but there is still spatial evidence of common ancestry • Similar gene content • Neither gene content nor order is perfectly preserved (figure: numbered gene orders of two diverged regions)

  18. The signature of diverged regions: Gene clusters • Similar gene content • Neither gene content nor order is perfectly preserved (figure: numbered gene orders of two diverged regions)

  19. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

  20. Why Validate Clusters Statistically? • After sufficient time has passed, gene order will become randomized • Uniform random data tends to be “clumpy” • Some genes will end up close together in both genomes simply by chance (figure: numbered gene orders of two genomes)

  21. Cluster Statistics Cluster Models

  22. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

  23. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

  24. Gene Homology • Identification of homologous gene pairs • generally based on sequence similarity • conserved genomic context is also informative • Assumptions • matches are binary (similarity scores are discarded) • each gene is homologous to at most one other gene in the other genome

  25. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

  26. Where are the gene clusters? Intuitive notion: pairs of regions that are dense with homologs How can we formalize this intuition?

  27. The r-window cluster definition (figure: r = 6) • Two windows of size r that share at least m homologous gene pairs • r is a user-specified parameter (Cavalcanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)
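
The r-window test can be sketched in Python as follows. This is a minimal brute-force illustration, assuming each genome is summarized by a dictionary from gene name to position in the ordered gene list. The function names, the toy positions, and the `min_shared` parameter (standing in for the slide's m) are hypothetical and not taken from the cited papers.

```python
def shared_genes(pos1, pos2, start1, start2, r):
    """Genes inside window [start1, start1 + r) of genome 1 and [start2, start2 + r) of genome 2."""
    return {
        gene
        for gene in pos1.keys() & pos2.keys()
        if start1 <= pos1[gene] < start1 + r and start2 <= pos2[gene] < start2 + r
    }


def r_window_clusters(pos1, pos2, n1, n2, r, min_shared):
    """Scan every pair of windows; return (start1, start2, genes) for each qualifying pair."""
    hits = []
    for start1 in range(n1 - r + 1):
        for start2 in range(n2 - r + 1):
            genes = shared_genes(pos1, pos2, start1, start2, r)
            if len(genes) >= min_shared:
                hits.append((start1, start2, genes))
    return hits


# Toy data (hypothetical): four homologous genes in two genomes of 20 genes each.
pos1 = {"a": 0, "b": 2, "c": 5, "d": 15}
pos2 = {"a": 10, "b": 12, "c": 11, "d": 0}
hits = r_window_clusters(pos1, pos2, n1=20, n2=20, r=6, min_shared=3)
```

Overlapping windows report the same gene set several times, so a practical implementation would merge or deduplicate the hits.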

  28. A max-gap chain (figure labels: g = 2, gap = 3) • The distance or “gap” between genes is equal to the number of intervening genes • A set of genes in a genome forms a max-gap chain if the gap between adjacent genes in the set is never greater than g (a user-specified parameter)
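
A small Python sketch of the chain test, assuming genes are identified by their integer positions in the ordered gene list. The function names are illustrative only.

```python
def gap(i, j):
    """Number of genes strictly between positions i and j."""
    return abs(i - j) - 1


def is_max_gap_chain(positions, g):
    """True if consecutive positions (after sorting) are never more than g genes apart."""
    ordered = sorted(positions)
    return all(gap(a, b) <= g for a, b in zip(ordered, ordered[1:]))
```

With g = 2, positions 0, 2, 4, 7 form a chain (gaps 1, 1, 2), whereas positions 0 and 4 do not, because the gap between them is 3.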

  29. The max-gap cluster definition (figure labels: g = 2, g = 3, gap = 3) A set of genes forms a max-gap cluster in two genomes if • the genes form a max-gap chain in each genome • the cluster is maximal (i.e. not contained within a larger cluster)
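
The definition can be checked directly on toy data by enumerating every subset of homologous genes. The sketch below is exponential in the number of genes and is meant only to make the two conditions concrete. The dictionaries `pos1` and `pos2` (gene name to position in each genome) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def is_chain(positions, g):
    """True if consecutive sorted positions are at most g genes apart."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def brute_force_clusters(pos1, pos2, g):
    """All maximal gene sets that are max-gap chains in both genomes."""
    genes = list(pos1)
    candidates = [
        set(subset)
        for size in range(1, len(genes) + 1)
        for subset in combinations(genes, size)
        if is_chain([pos1[x] for x in subset], g)
        and is_chain([pos2[x] for x in subset], g)
    ]
    # Maximality: drop any candidate strictly contained in another candidate.
    return [c for c in candidates if not any(c < other for other in candidates)]


# Toy data (hypothetical positions).
pos1 = {"a": 0, "b": 1, "c": 2, "d": 10}
pos2 = {"a": 5, "b": 4, "c": 8, "d": 0}
clusters = brute_force_clusters(pos1, pos2, g=1)
```

Here a and b are close in both genomes, while c is close to them only in the first genome, so the maximal clusters are {a, b}, {c}, and {d}.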

  30. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

  31. Max-gap search algorithms • Many genomic studies use a max-gap criterion • Each group designs its own search algorithm • These are often greedy algorithms, but greedy algorithms miss disordered max-gap clusters Hoberman et al, RECOMB Comp Genomics 2005

  32. Greedy, Agglomerative Algorithms (figure: g = 2) • initialize a cluster as a single homologous pair • search for a gene in proximity on both chromosomes • either extend the cluster and repeat, or terminate

  33. Greedy Algorithms Impose Order Constraints (figure: g = 2, a max-gap cluster of size four) A greedy, agglomerative algorithm will not find this cluster, since there is no max-gap cluster of size two from which to start extending
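
The failure mode can be reproduced with a toy example. The sketch below implements one plausible greedy, agglomerative search (the exact algorithms used in the literature differ) and applies it to four hypothetical genes whose positions were chosen so that the full set is a max-gap chain in both genomes with g = 2, yet no two genes are close in both genomes. All names and positions are assumptions for illustration.

```python
def is_chain(positions, g):
    """True if consecutive sorted positions are at most g genes apart."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def greedy_cluster(seed, pos1, pos2, g):
    """Grow a cluster from one seed gene by adding any gene near the cluster in both genomes."""
    cluster = {seed}
    extended = True
    while extended:
        extended = False
        for gene in pos1:
            if gene in cluster:
                continue
            near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
            near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                extended = True
    return cluster


# Hypothetical positions: genome 1 order is a, b, c, d; genome 2 order is b, d, a, c.
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9}
pos2 = {"a": 6, "b": 0, "c": 9, "d": 3}
g = 2

full_set_is_cluster = is_chain(pos1.values(), g) and is_chain(pos2.values(), g)
greedy_results = {seed: greedy_cluster(seed, pos1, pos2, g) for seed in pos1}
```

The full set passes the chain test in both genomes, but every greedy run stops at its seed, because each gene's neighbors in genome 1 are far away in genome 2.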

  34. Algorithms and Definition Mismatch • Greedy, bottom-up algorithms will not find all max-gap clusters • There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002) • Cluster statistics depend on the search space, which depends on which algorithm is used
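
A simplified recursive-splitting sketch of the divide-and-conquer idea: split the gene set wherever a gap larger than g appears in either genome, and recurse until a set is a chain in both. This is a naive illustration with hypothetical names and toy positions, not the efficient algorithm from the cited paper.

```python
def split_runs(genes, pos, g):
    """Split genes into runs whose consecutive members are at most g genes apart."""
    ordered = sorted(genes, key=lambda x: pos[x])
    runs, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] - 1 > g:
            runs.append(current)
            current = []
        current.append(gene)
    runs.append(current)
    return runs


def max_gap_clusters(genes, pos1, pos2, g):
    """Maximal gene sets that are max-gap chains in both genomes."""
    genes = list(genes)
    if not genes:
        return []
    for pos in (pos1, pos2):
        runs = split_runs(genes, pos, g)
        if len(runs) > 1:
            # A gap larger than g can never lie inside a cluster, so recurse on each run.
            return [c for run in runs for c in max_gap_clusters(run, pos1, pos2, g)]
    return [set(genes)]


# Hypothetical positions: a disordered cluster {a, b, c, d} plus an unrelated gene e.
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9, "e": 20}
pos2 = {"a": 6, "b": 0, "c": 9, "d": 3, "e": 1}
clusters = max_gap_clusters(pos1, pos1, pos2, g=2)
```

Unlike a greedy search, this procedure recovers the disordered cluster {a, b, c, d}, since it never requires a pair of genes to be adjacent in both genomes.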

  35. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters An example…

  36. Example: Whole genome self-comparison to detect duplicated blocks • Chose g=30 • Compared all human chromosomes to all other chromosomes to find max-gap gene clusters McLysaght et al. Nature Genetics, 2002.

  37. How can we use statistical analysis? • Could two regions display this degree of similarity simply by chance? • Is g=30 a reasonable choice of gap size? • Are larger clusters less likely to occur by chance? • How large does a cluster have to be before we are surprised to observe it? (figure labels: Chr 17, 10 genes duplicated out of ~100, 29 genes; Chr 3, 3 genes duplicated out of ~25, 13 genes) McLysaght et al. Nature Genetics, 2002.

  38. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

  39. Cluster Statistics Cluster Models

  40. The max-gap definition is the most widely used cluster definition in genomic analyses • Overbeek et al 1999: inferring functional coupling of genes in bacteria • Vision et al 2000: origins of genomic duplications in Arabidopsis • Friedman and Hughes 2001: gene duplication and structure of eukaryotic genomes • Tamames 2001: evolution of gene order conservation in prokaryotes • Vandepoele et al 2002: microcolinearity between Arabidopsis and rice • McLysaght et al 2002: genomic duplication during early chordate evolution • Simillion et al 2002: hidden duplications in Arabidopsis • Blanc et al 2003: recent polyploidy in Arabidopsis • Luc et al 2003: gene teams for comparative genomics • Chen et al 2004: operon prediction in newly sequenced bacteria • Bourque et al 2005: comparison of mammalian and chicken genome architectures • … Yet there is no formal statistical model for max-gap clusters

  41. The Question Suppose two whole genomes were compared, and this max-gap cluster was identified: • Is this cluster biologically meaningful? • Could it have occurred in a comparison of two random genomes?

  42. Statistical Testing • Hypothesis testing • Alternate hypothesis: shared ancestry • Null hypothesis: random gene order • Discard clusters that could have arisen under the null model • Determine the probability of observing a similar cluster under the null hypothesis
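
One way to make the null hypothesis concrete is simulation. The sketch below estimates, under random gene order, the probability that two genomes of n genes with m matching gene pairs contain at least one max-gap cluster of h or more genes. This is a simpler statistic than an exact count for clusters of exactly h genes, and the function names, default trial count, and seed are assumptions made for illustration.

```python
import random


def split_runs(genes, pos, g):
    """Split genes into runs whose consecutive members are at most g genes apart."""
    ordered = sorted(genes, key=lambda x: pos[x])
    runs, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] - 1 > g:
            runs.append(current)
            current = []
        current.append(gene)
    runs.append(current)
    return runs


def largest_cluster(genes, pos1, pos2, g):
    """Size of the largest gene set that is a max-gap chain in both genomes."""
    for pos in (pos1, pos2):
        runs = split_runs(genes, pos, g)
        if len(runs) > 1:
            return max(largest_cluster(run, pos1, pos2, g) for run in runs)
    return len(genes)


def null_probability(n, m, g, h, trials=1000, seed=0):
    """Monte Carlo estimate of P(some cluster has at least h genes) under random gene order."""
    rng = random.Random(seed)
    genes = list(range(m))
    hits = 0
    for _ in range(trials):
        # Gene i sits at pos1[i] in genome 1 and pos2[i] in genome 2, both chosen at random.
        pos1 = rng.sample(range(n), m)
        pos2 = rng.sample(range(n), m)
        if largest_cluster(genes, pos1, pos2, g) >= h:
            hits += 1
    return hits / trials


estimate = null_probability(n=22, m=6, g=2, h=4, trials=2000)
```

A cluster observed in real data would be considered surprising only if this estimate is small. As a sanity check, when every gene has a homolog (m = n) the whole genome is a single cluster, so the estimate for h = n is 1 for any g.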

  43. The Problem: Given an allowed gap size of g, what is the probability of observing a max-gap cluster containing exactly h matching gene pairs? How do we calculate this probability? (figure: h = 4)

  44. The Inputs (figure: n = 22, m = 6, h = 4, g = 2) • n: number of genes in each genome • m: number of matching gene pairs • g: the maximum gap allowed in a cluster • h: number of matching genes in the cluster

  45. The probability when m = n: If gene content is identical, the probability of a max-gap cluster is 1 (regardless of the allowed gap size)

  46. Probability of a cluster of size h when m < n (figure labels: m genes, h genes, m-h genes) Basic approach: enumerate all ways to • Place m-h remaining genes so they do not extend the cluster • Create chains of h genes in both genomes • Normalize to get a probability

  47. Probability of observing a cluster of size h = [ (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place m-h remaining genes so they do not extend the cluster) ] / (all configurations of m gene pairs in two genomes of size n)

  48. Probability of observing a cluster of size h = [ (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place m-h remaining genes so they do not extend the cluster) ] / (all configurations of m gene pairs in two genomes of size n)

  49. Number of ways to place h genes in two genomes so they form a cluster (figure labels: m genes, h genes, m-h genes) • Select h spots in each genome, so they form a max-gap chain • Choose h genes to compose the cluster • Assign each gene to a selected spot in each genome Hoberman, Sankoff, Durand. RECOMB Comparative Genomics, 2004

  50. Probability of observing a cluster of size h = [ (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place m-h remaining genes so they do not extend the cluster) ] / (all configurations of m gene pairs in two genomes of size n)
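
The first counting step, selecting h spots in one genome so that they form a max-gap chain, can be checked by brute force on small inputs. The sketch below enumerates every choice of h positions out of n. It is an illustrative check with a hypothetical function name, not the closed-form count used in the cited paper.

```python
from itertools import combinations


def count_chain_placements(n, h, g):
    """Number of ways to pick h of n positions so consecutive picks are at most g genes apart."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(spots, spots[1:]))
        for spots in combinations(range(n), h)  # each tuple is already sorted
    )


adjacent_pairs = count_chain_placements(n=5, h=2, g=0)
looser_pairs = count_chain_placements(n=5, h=2, g=1)
```

With n = 5 and h = 2, allowing no intervening genes (g = 0) gives the 4 adjacent pairs, and allowing one intervening gene (g = 1) adds the 3 pairs at distance two, for 7 in total. Enumeration grows combinatorially, so it is useful only for validating a formula on small cases.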
