Variation and Functional Genomics


Overview
  • Genomic Diversity (SNPs)
  • Variations in the Ensembl Browser
  • Human genome
  • HapMap
  • Gen2Phen and EGA
  • A bit about Functional Genomics
Genomic Diversity

SNPs (Single Nucleotide Polymorphisms)

  • base pair substitutions

InDels

  • insertion/deletion (frameshifts)

Occur about once in every 300 bp (human)

~3 billion base pairs in mammalian genomes!
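
The two round figures above imply roughly ten million variant sites per genome. A minimal sketch of that arithmetic, assuming a uniform density (the constant and function names are illustrative only):

```python
# Rough estimate only: both inputs are the round figures quoted on the
# slide, not measured values, and a uniform density is assumed.
GENOME_SIZE_BP = 3_000_000_000   # ~3 billion base pairs
BP_PER_SNP = 300                 # about one variant every 300 bp (human)


def expected_snp_count(genome_size_bp: int, bp_per_snp: int) -> int:
    """Return the expected number of variant sites at a uniform density."""
    return genome_size_bp // bp_per_snp


print(f"{expected_snp_count(GENOME_SIZE_BP, BP_PER_SNP):,}")  # 10,000,000
```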

Single nucleotide polymorphisms (SNPs)
  • Polymorphism: a DNA variation in which each possible sequence is present in at least 1% of the population
  • Most polymorphisms (~90%) take the form of SNPs: variations that involve just one nucleotide
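
The 1% rule can be written as a small check. This is only a sketch: it treats a site as polymorphic when at least two alleles each reach the threshold in a sample, and the function name and input format (one observed allele per chromosome) are assumptions made for illustration.

```python
from collections import Counter


def is_polymorphism(observed_alleles, threshold=0.01):
    """Return True if at least two alleles each reach `threshold` frequency.

    `observed_alleles` is any iterable of allele calls, one per sampled
    chromosome, e.g. "AAAG" or ["A", "A", "A", "G"].
    """
    counts = Counter(observed_alleles)
    total = sum(counts.values())
    if total == 0:
        return False
    common = [allele for allele, n in counts.items() if n / total >= threshold]
    return len(common) >= 2


# 5% minor allele frequency: counts as a polymorphism
print(is_polymorphism("A" * 95 + "G" * 5))
# 0.1% minor allele frequency: a rare variant, not a polymorphism
print(is_polymorphism("A" * 999 + "G"))
```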
Origin of SNPs

Mutation in


Selection of alleles

Increase of the allele to a substantial population frequency


Fixation of the allele in a population


Adapted from Bioinformatics for Geneticists, Eds Barnes and Gray

Studying variation – why?
  • SNPs can cause disease

(SNP in clotting factor IX codes for a stop codon: haemophilia)

  • SNPs can increase disease risk

(SNP in LDL receptor reduces efficiency: high cholesterol)

  • SNPs can affect drug response

(SNP in CYP2D6, a gene in the drug breakdown pathway in the liver, disrupts breakdown of debrisoquine, a treatment for high blood pressure.)

Studying variation – why?
  • Determine disease risk
  • Individualised medicine (pharmacogenomics)
  • Forensic studies
  • Biological markers
  • Hybridisation studies, marker-assisted breeding
  • Understanding Evolution
SNPs in Ensembl

Most SNPs imported from dbSNP (rs……):

Imported data: alleles, flanking sequences, frequencies, ….

Calculated data: position, synonymous status, peptide shift, ….

For human also:


Affymetrix GeneChip 100K and 500K Mapping Arrays

Affymetrix Genome-Wide SNP Array 6.0

Ensembl-called SNPs (from Celera reads and from Jim Watson’s and Craig Venter’s genomes)

For mouse, rat, dog and chicken also:

Sanger- and Ensembl-called SNPs (other strains / breeds)

dbSNP

Central repository for simple genetic polymorphisms:

single-base nucleotide substitutions

small-scale multi-base deletions or insertions

retroposable element insertions and microsatellite repeat variations

For human (dbSNP build 129):

19,125,432 submissions (ss#’s)

2,920,818 new RefSNPs (rs#’s)


SNPs in Ensembl - Types

  • Non-synonymous: in coding sequence, resulting in an aa change
  • Synonymous: in coding sequence, not resulting in an aa change
  • Frameshift: in coding sequence, resulting in a frameshift
  • Stop lost: in coding sequence, resulting in the loss of a stop codon
  • Stop gained: in coding sequence, resulting in the gain of a stop codon
  • Essential splice site: in the first 2 or the last 2 base pairs of an intron
  • Splice site: 1-3 bp into an exon or 3-8 bp into an intron
  • Upstream: within 5 kb upstream of the 5'-end of a transcript
  • Regulatory region: in a regulatory region annotated by Ensembl
  • 5' UTR: in 5' UTR
  • Intronic: in intron
  • 3' UTR: in 3' UTR
  • Downstream: within 5 kb downstream of the 3'-end of a transcript
  • Intergenic: more than 5 kb away from a transcript
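
For a single-base substitution inside a codon, the four coding categories in this list can be told apart by comparing the translated reference and variant codons. The sketch below is not Ensembl's implementation; it assumes the standard genetic code, and `coding_consequence` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon: str, alt_codon: str) -> str:
    """Classify a substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "Synonymous"
    if alt_aa == "*":
        return "Stop gained"
    if ref_aa == "*":
        return "Stop lost"
    return "Non-synonymous"


print(coding_consequence("GAA", "GAG"))  # Glu -> Glu
print(coding_consequence("GAG", "GTG"))  # Glu -> Val
print(coding_consequence("CGA", "TGA"))  # Arg -> stop
print(coding_consequence("TAA", "CAA"))  # stop -> Gln
```

Frameshifts, splice-site and non-coding categories depend on position relative to the transcript rather than on the codon, so they are outside the scope of this sketch.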


SNPs in Ensembl - Species
  • Chimp
  • Mouse
  • Rat
  • Dog
  • Cow
  • Platypus
  • Chicken
  • Zebrafish
  • Tetraodon
  • Mosquito
Focus on Human
  • Venter and Watson genomes
  • 1000 genomes project
  • HapMap
First diploid genomes for human

Craig Venter:

  • Sequence & analysis ongoing since 2003

Jim Watson:

  • 454 technology (7.4x)
  • 100 million unpaired reads (25 billion bp)
  • $1,000,000

“The Diploid Genome Sequence of an Individual Human” PLoS Biology 5(10): 2113-2144 (2007)

“The Complete Genome of an Individual by Massively Parallel DNA Sequencing” Nature 452:872-876 (2008)

“Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry ” Nature 456:53-59 (2008)

“The Diploid Genome Sequence of an Asian Individual” Nature 456:60-65 (2008)


1000 Genomes

  • Delivering 20TB of sequence data…
    • First Pilot. 60 HapMap samples sequenced (low coverage)
    • Second Pilot. Two trios of European and African descent (high coverage)
    • Third Pilot. Sequence 1,000 genes in 1,000 individuals (high coverage)

1000 Genomes Browser

Main page

  • Built on Ensembl
  • Navigation on the left hand side
  • Options as drop down menus
  • Currently only includes human data
    • In the future comparative genomics information will be available
    • All pages link to Ensembl and UCSC
Reference Sequence
  • The Human Genome Project gave the “average” DNA sequence of a small number of people.
  • This helps us find out how a human develops and works
  • Does not show us the DNA differences between different humans
  • Does not reflect the major alleles
HapMap (www.hapmap.org)
  • A multi-country effort to identify and catalogue genetic similarities and differences in people.
  • Collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States.
  • All of the information generated by the project released into the public domain.
HapMap (phase I & II)
  • Samples from populations with African, Asian and European ancestry.
  • 270 DNA samples from 4 populations:
    • 30 trios (two parents and an adult child) from the Yoruba people of Ibadan, Nigeria
    • 45 unrelated Japanese from the Tokyo area
    • 45 unrelated Han Chinese from Beijing
    • 30 trios from Utah with Northern and Western European ancestry (CEPH)
HapMap (phase III)
  • Genotypes from 1115 individuals from 11 populations:
    • ASW African ancestry in Southwest USA (71)
    • CEU Utah residents with Northern and Western European ancestry from the CEPH collection (162)
    • CHB Han Chinese in Beijing, China (82)
    • CHD Chinese in Metropolitan Denver, Colorado (70)
    • GIH Gujarati Indians in Houston, Texas (83)
    • JPT Japanese in Tokyo, Japan (82)
    • LWK Luhya in Webuye, Kenya (83)
    • MEX Mexican ancestry in Los Angeles, California (71)
    • MKK Maasai in Kinyawa, Kenya (171)
    • TSI Toscani in Italia (77)
    • YRI Yoruba in Ibadan, Nigeria (163)
  • A haplotype is a set of SNPs (spanning ~25 kb on average) that are statistically associated on a single chromatid and therefore tend to be inherited together over time.
  • Haplotyping involves grouping subjects by haplotypes.
Linkage Disequilibrium

LD is the deviation from equilibrium, or random association.

(i.e. in a population, two alleles at different loci are inherited together more often than expected by chance, even though recombination should separate them some of the time.)

Measures of LD
  • D = P(AB) – P(A)P(B)
    • D ranges from –0.25 to +0.25
    • D = 0 indicates linkage equilibrium
    • dependent on allele frequencies, therefore of little use
  • D’ = D / D_max (the maximum possible value of D for the given allele frequencies)
    • D’ = 1 indicates perfect LD
    • estimates of D’ strongly inflated in small samples
  • r² = D² / (P(A)P(a)P(B)P(b))
    • r² = 1 indicates perfect LD
    • measure of choice
  • High or perfect LD indicates a strong association between SNPs.
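
The three measures can be computed directly from one haplotype frequency and two allele frequencies. This is a minimal sketch, not code from the course: the function and parameter names are illustrative, and the choice of maximum used for D’ (the usual normalisation, which depends on the sign of D) is an assumption, since the slide does not spell it out.

```python
def ld_measures(p_hap_AB: float, p_A: float, p_B: float):
    """Return (D, D', r^2) for two biallelic loci.

    p_hap_AB: frequency of the haplotype carrying alleles A and B
    p_A, p_B: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B          # frequencies of the other alleles
    d = p_hap_AB - p_A * p_B                 # D = P(AB) - P(A)P(B)

    # Assumed normalisation: the largest |D| that these allele
    # frequencies allow, which differs for positive and negative D.
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denominator = p_A * p_a * p_B * p_b      # r^2 = D^2 / (P(A)P(a)P(B)P(b))
    r_squared = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r_squared


# Allele A (20%) only ever occurs together with allele B (50%).
d, d_prime, r_squared = ld_measures(0.2, 0.2, 0.5)
print(round(d, 3), round(d_prime, 3), round(r_squared, 3))
```

In this example D’ reaches 1 while r² stays well below 1, because the two alleles have different frequencies; r² only reaches 1 when each allele at one locus predicts the allele at the other.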
Linkage Disequilibrium

LD values between two variants are displayed by means of inverted coloured triangles going from white (low LD) to red (high LD).


Tag SNPs define a haplotype

Adapted from Nature 426(6968): 789-796 (2003)

Tag SNPs
  • ‘Tag SNPs’ are the minimum set of SNPs needed to identify a haplotype.
  • r² = 1 between two SNPs means one of them is ‘redundant’ in the haplotype.
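
One simple way to pick tags from pairwise r² values is a greedy pass: repeatedly choose the SNP that is in high LD with the most SNPs not yet covered. The sketch below illustrates the idea only and is not the method used by HapMap; the function name, the 0.8 threshold, and the SNP labels are all hypothetical.

```python
def pick_tag_snps(snps, r2, threshold=0.8):
    """Greedily choose tag SNPs so every SNP is a tag or linked to one.

    snps: list of SNP identifiers
    r2:   dict mapping (snp_x, snp_y) pairs to their r^2 value;
          pairs that are absent are treated as r^2 = 0
    """
    def linked(x, y):
        if x == y:
            return True
        return r2.get((x, y), r2.get((y, x), 0.0)) >= threshold

    untagged = list(snps)
    tags = []
    while untagged:
        # Pick the SNP that covers the most still-untagged SNPs.
        best = max(untagged, key=lambda s: sum(linked(s, t) for t in untagged))
        tags.append(best)
        untagged = [t for t in untagged if not linked(best, t)]
    return tags


# Hypothetical SNP labels and r^2 values, for illustration only.
example_r2 = {
    ("snp1", "snp2"): 1.0,
    ("snp2", "snp3"): 0.9,
    ("snp3", "snp4"): 0.1,
}
print(pick_tag_snps(["snp1", "snp2", "snp3", "snp4"], example_r2))
```

Here snp2 tags snp1 and snp3, and snp4 has to be typed on its own because it is not in high LD with anything else.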
Locus specific databases (LSDB)
  • Databases that focus on one gene or one disease
    • e.g. p53, ABO, collagen
    • e.g. Albinism, cystic fibrosis, Alzheimer’s disease
  • User communities:
    • Research groups – disease and function driven
    • Clinicians – driven by genetic testing of patients
  • >700 on the Human Genome Variation Society website
Why is it difficult to merge these data?
  • Historical reasons. LSDBs sometimes:
    • Use sequences which do not start at the initiator methionine
    • Use transcript coordinates rather than genomic coordinates
    • Use a different transcript for reporting mutations
  • The reference sequence regularly changes with new assemblies/gene builds
  • It may contain minor or rare alleles
  • It may be inaccurate
    • Missing genes (e.g. no α-haemoglobin: thalassaemia)
    • Mixture of sequences from different individuals
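
To see why transcript coordinates and genomic coordinates do not merge trivially, the sketch below converts a position counted along a spliced transcript into a genomic position. It assumes 1-based inclusive exon coordinates and ignores UTR offsets and assembly differences; the function name and the exon coordinates in the example are made up for illustration.

```python
def transcript_to_genomic(pos, exons, strand=1):
    """Map a 1-based transcript position to a 1-based genomic position.

    exons:  list of (start, end) genomic intervals, inclusive
    strand: 1 for the forward strand, -1 for the reverse strand
    """
    if pos < 1:
        raise ValueError("transcript positions start at 1")
    # Walk the exons in the order they appear in the transcript.
    ordered = sorted(exons, reverse=(strand == -1))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == 1:
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical two-exon transcript with a 90 bp intron.
exons = [(100, 109), (200, 209)]
print(transcript_to_genomic(11, exons))             # first base of exon 2
print(transcript_to_genomic(11, exons, strand=-1))  # same exons, reverse strand
```

The same transcript position maps to different genomic positions depending on which exon structure (that is, which transcript) and which strand are assumed, which is one reason reports made against different transcripts are hard to combine.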
Ensembl and LRGs
  • Define an exchange format for LRGs (Locus Reference Genomic sequences) with the NCBI
  • Create an LRG website
  • Create a pipeline for receiving the data and creating an LRG
  • Extend the Ensembl (e!) databases to store LRGs
  • Develop an API to query LRGs and associated annotation
  • Consult with the LSDBs to develop useful visualisation tools
  • Build displays for LRG data and annotation
Why is this important for Ensembl?
  • Ensembl has traditionally focused on an infrastructure for molecular biologists
  • It needs to expand to support the more stable transcript sequences used for reporting mutations
  • This will give central databases access to patient variation, genotype, phenotype and disease data
  • This will improve our data resources
Advantages to LSDBs
  • LRGs in Ensembl give LSDBs access to:
    • Genome annotation (including comparative, functional genomics and variation data)
    • Data integration with other variation resources (dbSNP, EGA, 1000 Genomes, NHGRI GWA catalogue)
    • Sequence search and data mining tools
    • A Perl API to query the data
    • A genome browser website for visualisation in genomic context and local context
  • Promotes discoverability of LSDBs
  • Data is mapped from one assembly to the next
Variations Team

Fiona Cunningham

Yuan Chen

Will McLaren

Functional Genomics

(Wikipedia): Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions.

In Ensembl:

Regulatory build using ENCODE project information

Promoters and Enhancers from CisRED and VISTA

FlyReg features (for Drosophila)


ENCODE: Encyclopedia Of DNA Elements

Where are the promoter, enhancer, and other regulatory regions of the human genome?

The pilot project showed that chromatin accessibility and histone modification analysis can be used to predict transcription start sites (TSS)

14 June 2007, Nature

Regulatory Build

Uses CTCF and DNase I data from multiple cell types as “core features”. Overlapping methylation sites expand these regions.

There are other sets…

Sequence motifs determined by experimental and prediction tools.

VISTA Enhancer Set

Tissue-specific enhancers. Tested experimentally.

Nucleic Acids Res. 2007 January; 35(Database issue): D88–D92.

Total List of Regulation Info.
  • Homo sapiens
    • DNase I hypersensitivity sites for GM06990 and CD4+ T cells
    • CTCF binding sites
    • Histone modification data
    • MeDIP-chip methylation data for 17 human tissues and cell lines
    • VISTA Enhancer Assay
    • cisRED motifs
    • miRanda microRNA target prediction
    • Expression Quantitative Trait Loci (eQTL) from the Sanger Institute
  • Mus musculus
    • DNase I hypersensitivity sites (ES cells)
    • Histone modifications for ES, MEF, and NPC cells
    • cisRED motifs
  • Danio rerio
    • ZFMODELS-enhancers
  • Drosophila melanogaster
    • REDfly TFBSs
    • REDfly CRMs
Functional Genomics Team
  • eFG Ian Dunham
    • Nathan Johnson
    • Daniel Sobral (starts Dec 1)
    • Andy Yates (multi-species support)
    • Steven Wilder
    • Damian Keefe
End of course survey!