Evaluating the Significance of Max-gap Clusters

Rose Hoberman

David Sankoff

Dannie Durand

original genome

large-scale duplication or speciation event

rearrangement, mutation

Gene content and order are preserved

Similarity in gene content

Local Spatial Evidence of Duplication

[Figure: gene maps of two chromosomes, labeled 5 and 11, with position ticks from 40 to 75. Gene labels: Cryba4, Crybb1/2/3, Lhx5, Tcf2, Tcf1, Cryba1, Nos2, Lhx1, Tbx3/5, Tbx2/4, Nos1.]

Ruvinsky and Silver, Gene, 97
Gene Clustering for Functional Inference in Bacterial Genomes

The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

Clusters are Commonly Used in Genomic Analyses

Need to assess cluster significance

Given:

  • a genome: G = 1, …, n unique genes
  • a set of m special genes (drawn in red)

Can we find a significant cluster of (a subset of) the m homologs, that is, of the red genes?

How do we formally define a cluster?

Possible Cluster Parameters

  • size: number of red genes in the cluster
    • Example: cluster size ≥ 3 (figure: size = 3 genes)
  • length: number of genes between first and last red genes
    • Example: cluster length ≤ 6 (figure: length = 6)
  • density: proportion of red genes (size/length)
    • Example: density ≥ 0.5 (figure: density = 6/11)
  • compactness: maximum gap between adjacent red genes
    • Example: gap ≤ 4 genes
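
As a rough illustration of the four parameters, the sketch below computes them for a list of red-gene positions. The function name `cluster_parameters` and the 1-based position convention are assumptions made here, not part of the talk. Length is counted inclusively (first through last red gene), which is the reading consistent with a density such as 6/11.

```python
def cluster_parameters(red_positions):
    """Return size, length, density, and max gap for a set of red-gene positions.

    Positions are 1-based gene indices (an assumed convention).
    """
    pos = sorted(red_positions)
    size = len(pos)
    length = pos[-1] - pos[0] + 1          # genes spanned, both ends included
    density = size / length
    # gap = number of non-red genes between two adjacent red genes
    gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
    max_gap = max(gaps, default=0)
    return {"size": size, "length": length, "density": density, "max_gap": max_gap}


# Hypothetical example: 6 red genes spread over 11 positions.
params = cluster_parameters([1, 3, 4, 6, 9, 11])
```

Here `params` describes a cluster of size 6 and length 11, so its density is 6/11 and its largest gap is 2 genes.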
Max-Gap Cluster

gap g

  • Commonly used in genomic analyses
  • Expandable, ensures minimum local and global density
  • Efficient algorithm to find them (Bergeron, Corteel, Raffinot 2002)
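
To make the definition concrete, here is a minimal single-genome sketch that splits the red genes into maximal max-gap clusters with one left-to-right scan. It is not the Bergeron, Corteel, and Raffinot algorithm cited above; the function name `max_gap_clusters` and the 1-based positions are assumptions for illustration.

```python
def max_gap_clusters(red_positions, g):
    """Split red-gene positions into maximal max-gap clusters.

    Two adjacent red genes stay in the same cluster when at most g
    other genes lie between them.
    """
    clusters = []
    for p in sorted(red_positions):
        if clusters and p - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(p)      # gap small enough: extend the current cluster
        else:
            clusters.append([p])        # first gene, or gap too large: start a new cluster
    return clusters


# Hypothetical red-gene positions, with maximum gap g = 1.
clusters = max_gap_clusters([2, 3, 5, 9, 10, 16], g=1)
```

The "expandable" property shows up in the scan: a cluster keeps growing as long as the next red gene is within the allowed gap, regardless of how long the cluster already is.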
A gap between statistical tests and cluster definitions used in practice
  • no analytical statistical model for max-gap clusters
  • statistical significance assessed through Monte Carlo randomization
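
For reference, a Monte Carlo randomization test of the kind mentioned above might look like the following sketch. It estimates the probability that all m red genes form a single max-gap cluster when gene order is random. The function names, the trial count, and the fixed seed are assumptions made for illustration.

```python
import random


def is_complete_cluster(positions, g):
    """True if every pair of adjacent red genes has at most g genes between them."""
    pos = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def monte_carlo_p(n, m, g, trials=20000, seed=0):
    """Estimate P(all m red genes form one max-gap cluster) under random gene order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A random gene order is equivalent to m random distinct positions for the red genes.
        positions = rng.sample(range(1, n + 1), m)
        hits += is_complete_cluster(positions, g)
    return hits / trials


estimate = monte_carlo_p(n=20, m=3, g=2)
```

The estimate is only as precise as the number of trials allows, which is one reason an analytical test is attractive for very small probabilities.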
This Talk
  • Analytical statistical tests
    • reference region model with m genes of interest
    • complete and incomplete clusters
  • Relationship between cluster parameters and significance
  • Future extension to whole-genome comparison
Outline
  • Introduction
  • Complete Clusters
  • Incomplete Clusters
  • Genome Comparison
  • Open problems
Complete Clusters
  • Given
    • a genome: G = 1, …, n unique genes
    • a set of m special genes (the “red” genes)
    • a maximum-gap size g
  • Null hypothesis:
    • Random gene order
  • Alternate hypotheses:
    • Evolutionary history
    • Functional selection
  • Test statistic
    • the probability of observing all m genes in a max-gap cluster in G
Probability of a Complete Cluster

Count how many of the permutations of m red genes and n-m blue genes contain a max-gap cluster

  • At how many positions in the genome can we locate a cluster (e.g. place the leftmost red gene)?
  • Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that they form a max-gap cluster?
w = (m-1)g + m

Labels on the terms of the counting formula:

  • ways to choose m-1 gaps
  • ways to place the first gene and still have w-1 slots left
  • edge effects
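
Read together, these labels suggest a count of the following shape. This is a reconstruction from the labels and the two counting questions above, not a formula copied from the slide; the symbols $N$ (number of placements of the red genes that form a complete cluster) and $E$ (the edge-effect term) are names introduced here, and the expression assumes $n \ge w$.

```latex
\begin{align*}
w &= (m-1)\,g + m \\
N &= \underbrace{(n - w + 1)}_{\substack{\text{ways to place the first gene} \\ \text{with } w-1 \text{ slots left}}}
     \cdot
     \underbrace{(g+1)^{m-1}}_{\text{ways to choose } m-1 \text{ gaps}}
     \; + \underbrace{E}_{\text{edge effects}} \\
P &= \frac{N}{\binom{n}{m}}
\end{align*}
```

Each gap can take any of the $g+1$ values from 0 to $g$, and a cluster of maximum length $w$ fits entirely inside the genome whenever its first gene sits in one of the first $n-w+1$ positions.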

Counting clusters at the end of the genome

For clusters that start in the last w-1 positions of the genome:

  • Length of the cluster is constrained
  • Sum of the gaps is constrained: with l slots left, the m-1 gaps must sum to at most l-m
How many ways can we form a max-gap cluster of size m with length ≤ l?

[Figure: a cluster with gaps g1, g2, g3, …, gm-1 between adjacent red genes, and l < w]

Number of ways of choosing m-1 gaps between 0 and g so that their sum ≤ l-m

=

Number of ways of rolling m-1 dice with faces labeled 0 to g so that the faces total ≤ l-m
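
The dice formulation can be counted with a small dynamic program over running totals. The sketch below is one straightforward way to do it, not necessarily the method used in the talk; the function name `count_gap_choices` and its parameter names are assumptions.

```python
def count_gap_choices(num_gaps, g, max_total):
    """Count vectors of num_gaps integers in 0..g whose sum is <= max_total.

    Equivalently: ways to roll num_gaps dice with faces 0..g so the faces
    total at most max_total.
    """
    if max_total < 0:
        return 0
    # ways[s] = number of ways to reach total s with the dice rolled so far
    ways = [1] + [0] * max_total
    for _ in range(num_gaps):
        new = [0] * (max_total + 1)
        for s, count in enumerate(ways):
            if count:
                # only faces that keep the running total within max_total
                for face in range(min(g, max_total - s) + 1):
                    new[s + face] += count
        ways = new
    return sum(ways)


# m = 3 red genes (2 gaps), g = 2, l = 6 slots left: the gaps must sum to <= l - m = 3.
n_ways = count_gap_choices(num_gaps=2, g=2, max_total=3)
```

With two gaps in 0..2, only the pair (2, 2) exceeds a total of 3, so 8 of the 9 gap choices remain.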

Final Probability

Accounting for edge effects is the least efficient part of the calculation

  • Eliminate for faster approximation when w << n

Now we can calculate probabilities for various parameter values
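
Putting the pieces together, a hedged sketch of the complete-cluster probability could look as follows. It is assembled from the counting argument in the preceding slides rather than taken from the authors' code; all function names are assumptions, and `edge_effects=False` is one possible reading of the faster approximation. A brute-force enumeration is included only as a cross-check on small inputs.

```python
from itertools import combinations
from math import comb


def count_gap_choices(num_gaps, g, max_total):
    """Count vectors of num_gaps integers in 0..g whose sum is <= max_total."""
    if max_total < 0:
        return 0
    ways = [1] + [0] * max_total
    for _ in range(num_gaps):
        new = [0] * (max_total + 1)
        for s, count in enumerate(ways):
            for face in range(min(g, max_total - s) + 1):
                new[s + face] += count
        ways = new
    return sum(ways)


def complete_cluster_count(n, m, g, edge_effects=True):
    """Number of placements of m red genes among n slots forming one max-gap cluster."""
    w = (m - 1) * g + m                      # maximum length of a cluster of m genes
    interior = max(0, n - w + 1) * (g + 1) ** (m - 1)
    if not edge_effects:
        return interior                      # approximation, reasonable only when w << n
    # clusters starting with only l < w slots left: gaps must sum to <= l - m
    edge = sum(count_gap_choices(m - 1, g, l - m)
               for l in range(m, min(n, w - 1) + 1))
    return interior + edge


def complete_cluster_probability(n, m, g, edge_effects=True):
    return complete_cluster_count(n, m, g, edge_effects) / comb(n, m)


def brute_force_count(n, m, g):
    """Enumerate every placement; feasible only for small n."""
    return sum(all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))
               for pos in combinations(range(1, n + 1), m))


p_exact = complete_cluster_probability(n=1000, m=5, g=10)
p_fast = complete_cluster_probability(n=1000, m=5, g=10, edge_effects=False)
```

For these hypothetical parameters w = 45 is small relative to n = 1000, so the two values should be close; the gap between them widens as w approaches n.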

Outline
  • Motivation
  • Complete Clusters
  • Incomplete Clusters
  • Genome Comparison
  • Open problems
Incomplete clusters

Given: m genes of interest

  • Is it significant to find a max-gap cluster of size h < m?
  • Test statistic: probability of finding at least one cluster

Example: m = 5, g = 1, h = 3

Probability of incomplete gene clusters

Enumerating clusters by starting position will lead to overcounting of permutations that include more than one cluster

m = 6, h = 3, g = 1

Alternative Approach
  • Enumerate all permutations that do not contain any clusters of size h or larger
  • Dynamic programming
    • Iteratively place “red” or “blue” genes, making sure not to create any cluster of size h or larger by judicious placement of “blue” genes
n = 6, m = 4, h = 3, g = 1

[Figure: tree of dynamic-programming states, built up one level at a time. Each node carries the labels k, c, j, and r. The root is k = 4, c = 0, j = ∞, r = 6; the nodes on the following levels have r = 5, r = 4, and r = 3, and one branch ends in X.]
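
The dynamic-programming idea can be sketched as below for the parameters n = 6, m = 4, h = 3, g = 1. This is an illustrative reconstruction, not the authors' implementation. The state used here is an assumption: r slots left, k red genes left to place, c the size of the currently open cluster, and j the number of blue genes since the last red gene, with j = g + 1 standing in for "no open cluster". The brute-force helpers exist only to cross-check the recursion on small inputs.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


def count_without_cluster(n, m, h, g):
    """Count placements of m red genes in n slots with no max-gap cluster of size >= h."""

    @lru_cache(maxsize=None)
    def ways(r, k, c, j):
        if k > r:
            return 0                      # not enough slots left for the red genes
        if r == 0:
            return 1                      # everything placed without a forbidden cluster
        # Option 1: place a blue gene; the open cluster closes once the gap exceeds g.
        nj = min(j + 1, g + 1)
        total = ways(r - 1, k, c if nj <= g else 0, nj)
        # Option 2: place a red gene; it extends the open cluster or starts a new one.
        if k > 0:
            nc = c + 1 if j <= g else 1
            if nc < h:                    # prune branches that would reach size h
                total += ways(r - 1, k - 1, nc, 0)
        return total

    return ways(n, m, 0, g + 1)


def prob_some_cluster(n, m, h, g):
    """P(at least one max-gap cluster of size >= h) under random gene order."""
    return 1 - count_without_cluster(n, m, h, g) / comb(n, m)


def largest_cluster(positions, g):
    """Size of the largest maximal max-gap cluster (cross-check helper, needs m >= 1)."""
    best = run = 1
    for a, b in zip(positions, positions[1:]):
        run = run + 1 if b - a - 1 <= g else 1
        best = max(best, run)
    return best


def brute_force_without_cluster(n, m, h, g):
    return sum(largest_cluster(p, g) < h
               for p in combinations(range(1, n + 1), m))


p = prob_some_cluster(n=6, m=4, h=3, g=1)
```

Counting the complement sidesteps the overcounting problem: each placement is visited along exactly one path of red/blue decisions, so placements with several clusters are not counted more than once.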

Incomplete cluster significance
[Figure: plots over the parameters h and g, with labels n = 1000, m = 50, and n = 500.]

Significant region of parameter space shown in paper

Outline
  • Motivation
  • Complete Clusters
  • Incomplete Clusters
  • Extensions
  • Genome Comparison
Possible Extensions
  • Tandem duplications
  • Gene families
  • Extensions for prokaryotic genomes
    • Gene orientation
    • Circular genomes
    • Physical distance between genes
  • Whole genome comparison
Whole genome comparison

g 30

g 30

    • Assuming identical gene content …
  • What is the probability that at least k genes form a max-gap cluster
  • in both genomes?
An Odd Property

The probability of finding a max-gap cluster of size at least k is always one

  • There will always be a cluster of size n

Example: g = 1

[Figure: two gene orders with the gaps marked]

  • A cluster of size k does not necessarily contain a cluster of size k-1!
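
A small brute-force check can illustrate this property. The sketch below uses a hypothetical pair of gene orders with g = 0 (a smaller case than the g = 1 example on the slide) and treats any gene set whose adjacent members are within the allowed gap in both orders as a shared cluster; that definition and the function names are assumptions made for illustration.

```python
from itertools import combinations


def is_cluster(genes, order, g):
    """True if, in this gene order, adjacent members of `genes` have at most g genes between them."""
    index = {gene: i for i, gene in enumerate(order)}
    pos = sorted(index[x] for x in genes)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def shared_cluster_sizes(order_a, order_b, g):
    """Sizes k for which some set of k genes is a max-gap cluster in both orders."""
    genes = list(order_a)
    sizes = set()
    for k in range(1, len(genes) + 1):
        if any(is_cluster(s, order_a, g) and is_cluster(s, order_b, g)
               for s in combinations(genes, k)):
            sizes.add(k)
    return sizes


# Hypothetical genomes with identical gene content but different order.
sizes = shared_cluster_sizes([1, 2, 3, 4], [2, 4, 1, 3], g=0)
```

For this pair the only shared clusters are single genes and the full set of four: a cluster of size 4 exists, yet no cluster of size 3 or 2 does. The full gene set always qualifies when gene content is identical, which is why "at least k" is satisfied with probability one.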

Conclusions

Presented statistical tests for max-gap clusters

  • Evaluate the significance of clusters of a pre-specified set of genes
  • Choose parameters effectively
  • Understand trends