
Exploring the Role of Non-Coding DNA in the Function of the Human Genome through Variation



Exploring the Role of Non-Coding DNA in the Function of the Human Genome through Variation. Christine Bird [email protected] Hypothesis: Conserved non-coding DNA has a function in the human genome. Does human variation data suggest selection is acting on noncoding DNA?


Hypothesis: Conserved non-coding DNA has a function in the human genome

Does human variation data suggest selection is acting on noncoding DNA?

  • Are conserved non-coding sequences selectively constrained?

  • Detection of fast evolving conserved non-coding sequence.

  • Exploring the properties and genomic context of human fast evolving non-coding regions.


The Human Genome

~25,000 genes

1 to 1.5% of human DNA is coding

Is the remaining 98.5% “junk”?


Selective constraint in mammalian genomes

[Figure labels: Neutral; Constrained 5%]

Waterston et al. Nature 2002


Proportions of Lineage Specific Conserved non-coding (CNC) sequences

418 MCSs (Multiple vertebrate Conserved Sequences) in 571 kb, from ~27 species: 58 coding, 46 UTRs and 314 non-coding.

Margulies et al. PNAS 2005


CNCs are evenly distributed in the human genome

Dermitzakis et al. Nat Rev Genet 2005


The density of CNCs and exons is negatively correlated

Dermitzakis et al. Nat Rev Genet 2005


Why study conserved non-coding DNA?

  • Abundance beyond that expected under neutral evolution.

  • If their function is gene regulation, our understanding of it is still limited.

  • Gene regulation is considered a crucial contributor to evolutionary change (King and Wilson, 1975).

  • Conserved non-coding sequences (CNCs) may well harbour critical regulatory changes that have driven recent human evolution.


Conserved non-coding sequences

  • Top conserved 5% of the human genome as detected with a phylogenetic hidden Markov model (phylo-HMM) (Siepel, 2005).

    • Best-in-genome pairwise alignments by BLASTZ, followed by chaining.

    • A multiple alignment constructed by MULTIZ.

    • PhastCons constructs a two-state phylo-HMM for conserved and non-conserved regions.

  • Remove overlap with Ensembl gene annotation.

http://genome.ucsc.edu/


Are conserved non-coding sequences selectively constrained?

  • Conservation of non-coding sequence due to forces acting on the human genome.

  • CNC SNP density is only 82% of that in non-coding non-conserved sequence (3.9 x 10^-4 vs. 4.8 x 10^-4; χ² = 686, 1 df; p < 10^-99).

    Is this just due to low local mutation rates?

    Or

    Are new alleles deleterious, and therefore less likely to be fixed in the population?

  • Address this by looking at the derived allele frequency (DAF) spectrum, as it is unaffected by local mutation rates.

Drake et al. Nat Genet 2006


Derived Allele Frequency

  • Selective constraint shifts the distribution of constrained alleles toward rarer frequencies (Fay & Wu, 2000).

  • Allele frequencies in 4 populations from 210 unrelated individuals in the HapMap project:

    CEU - American of European ancestry (60)

    YRI - Yoruba from Nigeria (60)

    JPT - Japanese from Tokyo (45)

    CHB - Han Chinese from Beijing (45)

  • Derived Allele Frequency (DAF) was generated for 1 million Phase I HapMap SNPs & 4 million Phase II.

  • The ancestral allele was inferred by comparison to chimp and/or macaque.

  • SNPs were assigned to defined genomic features to allow comparison.

Drake et al. Nat Genet 2006
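The DAF calculation described on this slide can be illustrated with a minimal sketch. The rule used here for combining the chimp and macaque bases, the function names and the allele counts are all assumptions made for illustration; the slides do not spell out the exact procedure.

```python
from typing import Dict, Optional, Set


def infer_ancestral(human_alleles: Set[str],
                    chimp: Optional[str],
                    macaque: Optional[str]) -> Optional[str]:
    """Pick the ancestral allele from outgroup bases (assumed rule).

    An outgroup base is informative only if it matches one of the human
    alleles; if both outgroups are informative they must agree.
    Returns None when the ancestral state cannot be inferred.
    """
    informative = {base for base in (chimp, macaque) if base in human_alleles}
    return informative.pop() if len(informative) == 1 else None


def derived_allele_frequency(counts: Dict[str, int], ancestral: str) -> float:
    """Fraction of sampled chromosomes carrying the non-ancestral allele."""
    total = sum(counts.values())
    return (total - counts[ancestral]) / total


# Hypothetical SNP: 120 chromosomes (60 individuals), alleles C and T.
counts = {"C": 90, "T": 30}
ancestral = infer_ancestral(set(counts), chimp="C", macaque="C")
daf = derived_allele_frequency(counts, ancestral)
print(ancestral, daf)
```

Because DAF is a property of segregating sites only, it does not depend on how many sites mutate in a region, which is why the spectrum can separate selection from a low local mutation rate.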


CNCs are selectively constrained

[Figure labels: Selective constraint, High, Low]

Drake et al. Nat Genet 2006

Mann-Whitney U test; P << 10^-4


CNCs have an excess of low frequency derived alleles compared to introns

[Figure labels: High, Low]

Mann-Whitney U test; CNCs vs introns P << 10^-16


CNC sequences are selectively constrained and not mutation cold spots

  • Nucleotide variation revealed strong selective constraints upon CNCs in human populations.

    • SNP density in CNCs is only 82% of that in non-conserved non-coding sequence (18% lower).

    • CNCs have an excess of low frequency derived alleles.

  • CNCs are subject to purifying selection in humans and are likely to harbour functionally important variants.

Drake et al. Nat Genet 2006


Why are they conserved?

  • Regions of the genome are therefore selectively constrained despite being non-coding.

    But what is the reason for this conservation…?

  • What is novel about their biology?

  • How can we tackle this question for so many elements?

  • What are the most interesting regions?

  • A subset of CNCs undergoing rapid change with potential common properties or roles.


Why study fast-evolving non-coding sequences?

  • If CNCs are part of chimpanzee-human lineage differentiation through changes in gene regulation, then changes in their nucleotide sequence should be expected despite their overall conservation.

  • Following gene duplication, subfunctionalization can occur by the partitioning of gene regulation among descendant copies (Force, 1999).

  • Older models of gene duplication proposed an important role for positive selection after duplication (Bridges 1935, Ohno 1970, Ohta, 1987).


Subfunctionalization

[Figure labels: Brain, Heart, Heart]

  • Duplicated genes preserved through subfunctionalization by the Duplication-Degeneration-Complementation model.

  • If CNCs are regulatory elements involved in this process they would have changed rapidly since duplication.

Duplicated gene and separated tissue specific regulation

Lynch and Force, Genetics 2000


Detecting fast-evolving non-coding sequences

MULTIZ alignments (Webb Miller), with the number of lineage-specific substitutions for each sequence:

Human     GACTACGTTTGGTTTAGAGAT    5
Chimp     GACTGGCTTTACTTTTGAGAT    1
Macaque   GTCTGGGTTTACTTTTCAGAT    2

Tajima’s relative rate test, with S1 and S2 the lineage-specific substitutions in human and chimp, and macaque as the outgroup:

\chi^2 = \frac{(S_1 - S_2)^2}{S_1 + S_2}

Tajima, Genetics 1993
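As a worked example, the relative rate test can be applied to the three sequences shown on this slide. This is a minimal sketch using only the Python standard library; the function names are illustrative, and the p-value uses the standard identity that a 1-df chi-square tail probability equals erfc(sqrt(x / 2)).

```python
import math

human = "GACTACGTTTGGTTTAGAGAT"
chimp = "GACTGGCTTTACTTTTGAGAT"
macaque = "GTCTGGGTTTACTTTTCAGAT"


def lineage_specific_counts(seq1, seq2, outgroup):
    """Count sites where exactly one ingroup sequence differs from the other two."""
    s1 = s2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b and b == o:      # seq1 changed, seq2 matches the outgroup
            s1 += 1
        elif a != b and a == o:    # seq2 changed, seq1 matches the outgroup
            s2 += 1
    return s1, s2


def tajima_relative_rate(s1, s2):
    """Return (chi-square, p-value) for the 1-df relative rate test."""
    if s1 + s2 == 0:
        raise ValueError("no lineage-specific substitutions to test")
    chi2 = (s1 - s2) ** 2 / (s1 + s2)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


s1, s2 = lineage_specific_counts(human, chimp, macaque)
chi2, p = tajima_relative_rate(s1, s2)
print(f"human={s1} chimp={s2} chi2={chi2:.2f} p={p:.3f}")
```

With 5 human-specific and 1 chimp-specific substitution the statistic is 16/6, about 2.67, which does not reach the 5% critical value of 3.84 for 1 degree of freedom, so this short example alignment would not be called accelerated on its own.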


  • χ² test of base substitutions.

    Alignments = 304,291

    Power to detect acceleration = 26,477

    P < 0.05 Accelerated = 2,794 (11%)

    Accelerated in chimp = 1438

    Accelerated in human = 1356

ANC (Accelerated Non-Coding)
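One plausible reading of "power to detect acceleration" is that an alignment needs enough lineage-specific substitutions for the test to be able to reach significance at all. The sketch below checks that reading under the assumption that the 1-df chi-square statistic (S1 - S2)^2 / (S1 + S2) is compared with the usual 5% critical value of 3.841; it also re-derives the 11% figure from the counts on the slide.

```python
CRITICAL_CHI2_1DF_P05 = 3.841  # standard 5% critical value, 1 degree of freedom


def max_chi2(total_substitutions: int) -> float:
    """Largest possible statistic: every substitution on one lineage (S2 = 0)."""
    s1, s2 = total_substitutions, 0
    return (s1 - s2) ** 2 / (s1 + s2)


# Smallest number of lineage-specific substitutions that can ever be significant.
min_total = next(n for n in range(1, 50) if max_chi2(n) > CRITICAL_CHI2_1DF_P05)
print(min_total)

# Share of testable alignments that were called accelerated (counts from the slide).
accelerated_pct = round(100 * (1438 + 1356) / 26477)
print(accelerated_pct)
```

Under this assumption the minimum is 4 substitutions, which agrees with the ">= 4 substitutions" cut-off used for the power CNC control set.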


Are Accelerated Non-Coding (ANC) sequences functional?

  • Compare to 3 sets of control sequences:

    • Power CNCs (not lineage specific):

      CNCs with >= 4 substitutions = 23,683

    • Non-accelerated CNCs:

      CNCs < 4 substitutions = 277,814

    • DAF controls 1&2:

      1356 x 20Kb windows 500Kb from 5’ & 3’ of ANCs.

      Repeat analyses excluding potential confounders: Segmental Duplications (SD), Copy Number Variants (CNV), pseudogenes and retroposed genes.


Are ANC sequences functional?

  • Does nucleotide variation data indicate particular modes of selection implying function?

    (Is acceleration recent or ancient?)

    • Derived allele frequency spectrum comparisons

    • Population differentiation, FST

  • Are ANCs involved in subfunctionalization?

    • Is there enrichment in recently duplicated sequences?

  • What function do these rapidly evolving sequences have?

    • Association of ANC variation with expression levels of nearby genes


Excess of high frequency derived alleles in ANCs

[Figure labels: Selective constraint; Loss of constraint & Directional Selection?]

Mann-Whitney U test; Non-accelerated CNCs vs ANCs P = 1.63 x 10^-6


Power CNCs are neutral

Mann-Whitney U test; Power CNCs vs Control P = 0.15


Excess of rare alleles in ANCs excluding confounding elements

[Figure label: Loss of constraint & Directional Selection?]

Mann-Whitney U test; ANCs vs ANCs without confounders P = 0.48


Detecting recent evolution and population-specific selection

  • A measure of population structure, Wright’s FST.

  • Compares the mean amount of genetic diversity found within subpopulations to that of the meta-population.

  • Sampling from 2 diverged subpopulations as if they were a single panmictic population gives an excess of homozygotes & a deficiency of heterozygotes.

  • FST can be defined as:

    F_{ST} = \frac{H_T - H_S}{H_T}

  • Calculated for ANCs (Weir and Cockerham estimator), using:

    • MSG - mean square error within populations

    • MSP - mean square error between populations

    • nc - variance-corrected average sample size

Weir and Cockerham, Evolution 1984
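The heterozygosity definition FST = (HT - HS) / HT can be made concrete for a single biallelic SNP. This is a sketch of that simple definition only, not of the Weir and Cockerham estimator cited on the slide; here HS is taken as the mean expected heterozygosity within subpopulations and HT as the expected heterozygosity of the pooled population, with equal weighting of subpopulations as a simplifying assumption. The allele frequencies are hypothetical.

```python
from typing import Sequence


def fst_from_frequencies(freqs: Sequence[float]) -> float:
    """F_ST = (H_T - H_S) / H_T for one biallelic SNP.

    freqs holds the frequency of one allele in each subpopulation;
    subpopulations are weighted equally (an assumption of this sketch).
    """
    # H_S: mean expected heterozygosity within subpopulations.
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    # H_T: expected heterozygosity if the pooled sample were one population.
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        raise ValueError("SNP is monomorphic in the pooled sample")
    return (h_t - h_s) / h_t


differentiated = fst_from_frequencies([0.2, 0.8])  # hypothetical frequencies
undifferentiated = fst_from_frequencies([0.5, 0.5])
print(round(differentiated, 2), round(undifferentiated, 2))
```

The first case shows the deficiency of heterozygotes mentioned above: pooling two subpopulations at frequencies 0.2 and 0.8 predicts more heterozygotes (HT = 0.5) than either subpopulation actually contains (HS = 0.32).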


ANC FST values higher than non-accelerated CNCs

Mann-Whitney U test; Non-accelerated CNCs vs ANCs P = 0.0504; Non-accelerated CNCs vs ANCs without confounders P = 0.0363


Enrichment in Segmental Duplications

  • Approximately 5-6% of the human genome in SDs

    (Bailey et al, Science 2002)

    ANCs 8%

    power CNCs 10%

    non-accelerated CNCs 5%

  • Excess of ANCs and power CNCs in SDs (chi-square; P < 10^-4).

  • The general enrichment in SDs is not surprising, as it has been observed that sequence divergence is elevated in duplicated sequences.

    (Hurles et al. GenBio. 2004; She et al. GenRes. 2006).


Excess of recent segmental duplications associated with ANCs

[Figure label: Human Specific]

Mann-Whitney U test; P << 10^-4


Testing for evidence of involvement in Gene Regulation

[Figure labels: GENE, ANC, SNP, mRNA, Association]


ANC SNP-Expression Association

Additive association model: linear regression of expression on genotype coded as 0, 1 or 2, e.g. CC = 0, CT = 1, TT = 2.

  • What is the functional impact of ANC variation on gene expression phenotypes?

  • 47,294 transcripts probed in lymphoblastoid cell lines of 210 unrelated HapMap individuals.

  • SNP genotypes within ANCs were associated with transcript expression levels by linear regression.

  • Statistical significance adjusted following 10,000 permutations per gene.
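The additive model and permutation step can be sketched with the standard library alone. The data below are hypothetical, the function names are illustrative, and the permutation scheme (shuffling expression values against genotypes for one SNP) is an assumption; the slides state only that significance was adjusted using 10,000 permutations per gene.

```python
import random
from typing import List, Sequence


def encode_additive(genotypes: Sequence[str], counted_allele: str) -> List[int]:
    """Code each genotype as the number of copies of one allele (0, 1 or 2)."""
    return [g.count(counted_allele) for g in genotypes]


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y on x (x must not be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often shuffled expression gives a slope as extreme."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so float rounding does not hide exact ties.
        if abs(slope(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: six individuals, expression rising with each T allele.
genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
x = encode_additive(genotypes, "T")
beta = slope(x, expression)
p_perm = permutation_p(x, expression)
print(x, round(beta, 3), p_perm)
```

A permutation threshold avoids assuming that expression values are normally distributed, at the cost of repeating the regression many times per test.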


SNPs within ANCs are significantly associated with gene expression phenotypes

  • Significant SNPs at the 0.01 permutation threshold:

    68% of ANC SNPs tested (496 out of 729)

    9% of Power CNC SNPs tested (1047 out of 11468)

    A SNP within an ANC is 7 times more likely to be associated with gene expression levels than a SNP within a power CNC.

  • Significant at the 0.01 permutation threshold:

    16% of ANCs tested (59 out of 366)

    3% of Power CNCs tested (165 out of 5968)

    Nucleotide variation within ANCs is 5 times more likely to be associated with gene expression levels than variation in a power CNC.

  • Tendency for derived alleles within ANCs to be associated with lower expression levels.
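The percentages and fold differences quoted on this slide can be re-derived from the counts it gives. The short check below uses only those counts; the variable names are illustrative.

```python
def proportion(hits: int, tested: int) -> float:
    """Fraction of tested items that were significant."""
    return hits / tested


# Counts taken from the slide.
anc_snps, cnc_snps = proportion(496, 729), proportion(1047, 11468)
anc_elems, cnc_elems = proportion(59, 366), proportion(165, 5968)

snp_ratio = anc_snps / cnc_snps
elem_ratio = anc_elems / cnc_elems
print(f"SNPs:     {anc_snps:.1%} vs {cnc_snps:.1%}, ratio {snp_ratio:.1f}")
print(f"Elements: {anc_elems:.1%} vs {cnc_elems:.1%}, ratio {elem_ratio:.1f}")
```

The ratios come out at roughly 7.5 for SNPs and 5.8 for elements, so the slide's "7 times" and "5 times" appear to be these values rounded down.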


Summary

  • CNCs are not mutation cold spots but selectively constrained.

  • Fast evolving noncoding sequences in the human lineage have lost this constraint and some are potentially undergoing positive selection.

  • This may have contributed to some recent differentiation in human populations.

  • ANCs are enriched in the most recent segmental duplications.

  • SNPs in ANCs are associated with significant change in gene expression phenotypes.


Acknowledgements

Thanks to my joint supervisors Emmanouil Dermitzakis and Matthew Hurles and the members of their teams:

  • Barbara Stranger

  • Dan Jeffares

  • Catherine Ingle

  • Julian Huppert

  • Antigone Dimas

  • Sarah Lindsay

  • Dan Andrews

  • Dan Turner

  • Chris Barnes

    Particular thanks to my other co-authors,

  • Webb Miller - human-chimpanzee-macaque alignments

  • Daryl Thomas - DAF for both phase I and II SNPs

  • Maureen Liu - quantifying gene density

    The Rhesus Macaque Genome Sequencing Consortium (RMGSC) and the HapMap consortium for making data available, and the Wellcome Trust and MRC for funding.




Fig. 3. Phylogenetic tree of vertebrate species. By using the generated 27-species multisequence alignment, branch lengths were calculated based on analysis of synonymous coding positions. The branch lengths (as substitutions per synonymous site) between human and each species are listed (with additional pair-wise branch lengths provided in the supporting information).

The last common ancestor among the catarrhine primates (A) is estimated at 25 mya (36, 37), between the rodents and primates (B) at 75 mya (5, 6), between eutherians and metatherians (C) at 185 mya (14), between monotremes and other therians (D) at 200 mya (14), and between mammals and birds (E) at 310 mya (13).

Margulies et al. PNAS 2005


Proportions of Lineage Specific Conserved non-coding sequences

Fig. 4. Lineage specificity of MCSs. The proportion of nonexonic MCSs found in the sequences of species in each category is indicated. Note that virtually all MCSs overlapping known exonic sequences are present in all mammals (data not shown). All Mammals: cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus; Eutherian: cat, dog, cow, pig, rat, and mouse; Marsupials: N.A. opossum and wallaby; and Other: species combinations containing <2% of the analyzed MCSs (see the supporting information for the complete data set). Hashed areas of ‘‘All Mammals’’ reflect portions lacking one or both rodents, and hashed portions of ‘‘Eutherian Marsupials’’ reflect portions lacking both rodents.

Margulies et al. PNAS 2005


Distribution of large and small CNCs (Conserved Non-Coding sequences) and exons on Hsa21

[Figure: frequency (0 to 400) of exons, big CNCs and small CNCs along the long arm of Hsa21 (0 to 30 Mb)]

Big CNCs: 70% ID, 100 bps ungapped

Small CNCs: 85% ID, 35-99 bps ungapped

Dermitzakis et al. Nature 2002


Conservation of CNCs in multiple species

[Figure labels: human, mouse, Conserved block]

Dermitzakis et al. 2003 Science



Testing DAF spectrum distributions

  • Non-parametric distributions of unequal sample size

  • Mann-Whitney U-test:

    • Compares the median of two populations

    • Uses the rank order of values in the two samples.

  • Kolmogorov-Smirnov test:

    • Measures differences in the entire distributions of two samples in both shape and location of distributions, but at the cost that it is less sensitive to differences in location only.

  • KS is less powerful with respect to the alternative hypothesis of differences in location than the Mann-Whitney U-test
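Both tests can be sketched in a few lines to show what each one measures. This is a standard-library illustration, not the analysis code behind the slides: the Mann-Whitney p-value uses a normal approximation with tie correction (rough for samples this small), the Kolmogorov-Smirnov function returns only the statistic, and the DAF values are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p) using midranks and a normal
    approximation with tie correction (no continuity correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(list(x) + list(y))
    rank_of, start = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = start + (t + 1) / 2  # mean of ranks start+1 .. start+t
        start += t
    r1 = sum(rank_of[v] for v in x)
    u1 = r1 - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    var = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (u1 - n1 * n2 / 2) / math.sqrt(var)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def ks_statistic(x, y):
    """Largest gap between the two empirical cumulative distributions."""
    xs, ys = sorted(x), sorted(y)
    return max(abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
               for v in xs + ys)


# Hypothetical DAF values for two feature classes.
cnc_daf = [0.05, 0.10, 0.10, 0.20]
intron_daf = [0.30, 0.50, 0.60, 0.90]
u, p = mann_whitney_u(cnc_daf, intron_daf)
d = ks_statistic(cnc_daf, intron_daf)
print(u, round(p, 3), d)
```

The Mann-Whitney statistic depends only on ranks, so it targets a shift in location, whereas the KS statistic responds to any difference between the two cumulative distributions.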

