identifying conserved spatial patterns in genomes n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Identifying conserved spatial patterns in genomes PowerPoint Presentation
Download Presentation
Identifying conserved spatial patterns in genomes

Loading in 2 Seconds...

play fullscreen
1 / 105

Identifying conserved spatial patterns in genomes - PowerPoint PPT Presentation


  • 122 Views
  • Updated on

Identifying conserved spatial patterns in genomes. Rose Hoberman Computer Science Department Carnegie Mellon University. University of Chicago 0ct 20, 2006. My focus: Spatial Comparative Genomics.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Identifying conserved spatial patterns in genomes


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
    Presentation Transcript
    1. Identifying conserved spatial patterns in genomes Rose Hoberman Computer Science Department Carnegie Mellon University University of Chicago 0ct 20, 2006

    2. My focus:Spatial Comparative Genomics Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves.

    3. 3 6 2 5 7 3 1 4 17 20 16 19 18 8 12 11 13 10 14 15 9 A simple model of a genome an ordered list of genes

    4. Genomic Change Ancestral genome speciation species 2 species 1 Sequence Mutation + Chromosomal Rearrangements

    5. 17 20 16 19 18 8 12 11 13 10 14 15 9 12 8 13 11 15 14 10 9 20 17 19 16 18 17 20 16 19 18 2 3 1 4 6 8 12 2 5 7 11 13 3 10 14 15 1 4 9 17 16 13 14 15 19 18 Inversions Types of Genomic Rearrangements Duplications/Insertions Loss 3 6 2 5 7 3 1 4 20

    6. 12 8 2 5 7 13 11 3 3 15 14 10 4 9 1 3 20 17 19 16 18 Inversions Types of Genomic Rearrangements Duplications/Insertions Loss Fissions and fusions 17 16 8 12 20 11 7 10 9 6 2 5 13 2 3 14 15 1 4 3 4 1

    7. 20 17 19 16 18 An Essential Task forSpatial Comparative Genomics Identify chromosomal regions that descended from the same region in the genome of the common ancestor Species 1 12 8 11 10 9 2 5 7 13 3 3 14 15 4 1 17 16 20 8 12 11 7 10 9 6 13 2 2 5 2 14 15 3 3 3 4 4 4 1 1 1 Species 2

    8. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

    9. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Pevzner, Tesler. Genome Research 2003

    10. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Guillaume Bourque et al. Genome Research 2004

    11. Identification of homologous chromosomal segments is a key task in comparative genomics Ancestral chromosome • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Whole genome duplication chromosome 2 chr 1 chromosome 1 chr 2

    12. Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications McLysaght et al Nature Genetics, 2002.

    13. Identification of conserved chromosomal structure is a key task in comparative genomics • Understand gene function and regulation in bacteria • Infer functional associations • Predict operons • Identify horizontal transfers Insertion Loss Inversions

    14. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

    15. 20 17 19 16 18 Closely related genomes Species 1 12 8 11 10 9 2 5 7 13 3 3 14 15 4 1 17 16 20 8 12 11 7 10 9 13 6 2 2 5 2 14 15 3 3 3 1 1 4 4 1 4 Species 2 Related regions are easy to identify

    16. Five hundred million years...

    17. More Distantly Related Genomes 12 11 8 5 18 9 11 20 19 7 18 17 2 3 16 13 10 4 3 14 15 1 Homologous regions are harder to detect, but there is still spatial evidence of common ancestry • Similar gene content • Neither gene content nor order is perfectly preserved 12 17 8 6 20 2 2 5 11 7 13 16 3 10 14 15 4 9 4 1 1 1

    18. The signature of diverged regions 12 11 8 5 18 9 11 20 19 7 18 17 2 3 16 13 10 4 3 14 15 1 Gene clusters • Similar gene content • Neither gene content nor order is perfectly 12 17 8 6 20 2 2 5 11 7 13 16 3 10 14 15 4 9 4 1 1 1

    19. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

    20. Why Validate Clusters Statistically? • After sufficient time has passed, gene order will become randomized • Uniform random data tends to be “clumpy” • Some genes will end up close together in both genomes simply by chance 11 20 19 7 18 17 2 3 16 13 10 4 3 14 15 1 12 17 8 6 20 2 2 5 11 7 13 16 3 10 14 15 4 9 4 1 1 1

    21. Cluster Statistics Cluster Models

    22. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

    23. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

    24. Gene Homology • Identification of homologous gene pairs • generally based on sequence similarity • conserved genomic context is also informative • Assumptions • matches are binary (similarity scores are discarded) • each gene is homologous to at most one other gene in the other genome

    25. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

    26. Where are the gene clusters? Intuitive notion: pairs of regions that are dense with homologs How can we formalize this intuition?

    27. The r-window cluster definition r = 6 • Two windows of sizer that share at least m homologous gene pairs • r is a user-specified parameter (Calvacanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)

    28. A max-gap chain g= 2 gap= 3 • The distance or “gap” between genes is equal to the number of intervening genes • A set of genes in a genome form a max-gap chain if • the gap between adjacent genes is never greater than g (a user-specified parameter)

    29. The max-gap cluster definition gap= 3 g= 2 g= 3 A set of genes form a max-gap cluster in two genomes if • the genes forms a max-gap chain in each genome • the cluster is maximal (i.e. not contained within a larger cluster)

    30. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

    31. Max-gap search algorithms • Many genomic studies use a max-gap criteria • Each group designs their own search algorithm • These are often greedy algorithms, but greedy algorithms miss disordered max-gap clusters Hoberman et al, RECOMB Comp Genomics 2005

    32. Greedy, Agglomerative Algorithms g = 2 • initialize a cluster as a single homologous pair • search for a gene in proximity on both chromosomes • either extend the cluster and repeat, or terminate

    33. Greedy Algorithms Impose Order Constraints g = 2 A max-gap cluster of size four A greedy, agglomerative algorithm will not find this cluster since there is no max-gap cluster of size two

    34. Algorithms and Definition Mismatch • Greedy, bottom-up algorithms will not find all max-gap clusters • There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002) • Cluster statistics depend on the search space, which depends on which algorithm is used

    35. A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters An example…

    36. Example: Whole genome self-comparison to detect duplicated blocks • Chose g=30 • Compared all human chromosomes to all other chromosomes to find max-gap gene clusters McLysaght et al. Nature Genetics, 2002.

    37. How can we use statistical analysis? Chr 17 10 genes duplicated out of ~100 29 genes • Could two regions display this degree of similarity simply by chance? • Is g=30 a reasonable choice of gap size? • Are larger clusters less likely to occur by chance? • How large does a cluster have to be before we are surprised to observe it? Chr 3 3 genes duplicated out of ~25 13 genes McLysaght et al. Nature Genetics, 2002.

    38. Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

    39. Cluster Statistics Cluster Models

    40. The max-gap definition is the most widely used cluster definition in genomic analyses Overbeek et al 1999: inferring functional coupling of genes in bacteria Vision et al 2000: origins of genomic duplications in Arabidopsis Friedman and Hughes 2001: gene duplication and structure of eukaryotic genomes Tamames 2001: evolution of gene order conservation in prokaryotes Vandepoele et al 2002: microcolinearity between Arabidopsis and rice McLysaght et al 2002: genomic duplication during early chordate evolution Simillion et al 2002: hidden duplications in Arabidopsis Blanc et al 2003: recent polyploidy in Arabidopsis Luc et al 2003: gene teams for comparative genomics Chen et al 2004: operon prediction in newly sequenced bacteria Bourque et al 2005: comparison of mammalian and chicken genome architectures … Yet there is noformal statistical model for max-gap clusters

    41. The Question Suppose two whole genomes were compared, and this max-gap cluster was identified: • Is this cluster biologically meaningful? • Could it have occurred in a comparison of two random genomes?

    42. Statistical Testing • Hypothesis testing • Alternate hypothesis: shared ancestry • Null hypothesis: random gene order • Discard clusters that could have arisen under the null model • Determine the probability of observing a similar cluster under the null hypothesis

    43. The Problem Given an allowed gap size of g, what is the probability of observing a max-gap cluster containing exactly hmatching gene pairs? How do we calculate this probability? h=4

    44. The Inputs n=22 m=6 h=4 g=2 • n:number of genes in each genome • m:number of matching genes pairs • g:the maximum gap allowed in a cluster • h:number of matching genes in the cluster

    45. The probability when m = n If gene content is identical… …the probability of a max-gap cluster is 1 (regardless of the allowed gap size)

    46. Probability of a cluster of size hwhen m < n m-h genes m genes h genes Basic approach Enumerate all ways to: • Place m-h remaining genes so they do not extend the cluster • Create chains of h genes in both genomes * • Normalize to get a probability

    47. Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n

    48. Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n

    49. h genes m-h genes m genes Number of ways to place h genes in two genomes so they form a cluster Select h spots in each genome, so they form a max-gap chain Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Hoberman, Sankoff, Durand RECOMB Comparative Genomics, 2004

    50. Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n