Variation and Functional Genomics
Overview
Genomic Diversity (SNPs)
Variations in the Ensembl Browser
Human genome
HapMap
Gen2Phen and EGA
A bit about Functional Genomics
Genomic Diversity
SNPs (Single Nucleotide Polymorphisms): base pair substitutions
InDels
SNPs (Single Nucleotide Polymorphisms)
base pair substitutions
1 in every 300 bp (human)
~3 billion base pairs in mammalian genomes!
(A SNP in clotting factor IX creates a stop codon: haemophilia)
(A SNP in the LDL receptor reduces its efficiency: high cholesterol)
(A SNP in CYP2D6, a gene in the drug breakdown pathway in the liver, disrupts breakdown of debrisoquine, a treatment for high blood pressure.)
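As a rough back-of-envelope check, the SNP density and genome size quoted above together imply on the order of ten million SNP sites in a human-sized genome. The variable names below are illustrative, and the even spacing is a simplifying assumption (real SNP density varies along the genome).

```python
# Back-of-envelope estimate, assuming SNPs are spread evenly (a simplification).
GENOME_SIZE_BP = 3_000_000_000   # ~3 billion base pairs, as quoted above
BP_PER_SNP = 300                 # ~1 SNP in every 300 bp (human)

expected_snps = GENOME_SIZE_BP // BP_PER_SNP
print(f"Expected SNP sites: ~{expected_snps:,}")
```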
Most SNPs imported from dbSNP (rs……):
Imported data: alleles, flanking sequences, frequencies, ….
Calculated data: position, synonymous status, peptide shift, ….
For human also:
Affy GeneChip 100K and 500K Mapping Array
Affy Genome-Wide SNP array 6.0
Ensembl-called SNPs (from Celera reads and Jim Watson’s and Craig Venter’s genomes)
For mouse, rat, dog and chicken also:
Sanger- and Ensembl-called SNPs (other strains / breeds)
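The "synonymous status" listed under calculated data can be illustrated with a small sketch: translate the reference codon and the codon carrying the SNP, then compare the two amino acids. This is only a minimal illustration using the standard genetic code, not Ensembl's actual pipeline; the function name `classify_coding_snp` and its arguments are hypothetical.

```python
# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_coding_snp(ref_codon: str, offset: int, alt_base: str) -> str:
    """Classify a single-base substitution inside a codon (hypothetical helper).

    ref_codon: reference codon on the coding strand, e.g. "CGA"
    offset:    0-based position of the SNP within the codon (0, 1 or 2)
    alt_base:  the alternative base at that position
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop gained"
    if ref_aa == "*":
        return "stop lost"
    return "non-synonymous"


if __name__ == "__main__":
    # Arg codon CGA mutated to the stop codon TGA
    print(classify_coding_snp("CGA", 0, "T"))
```

Frameshifts are not covered here because they arise from insertions or deletions rather than single-base substitutions.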
dbSNP is the central repository for simple genetic polymorphisms:
single-base nucleotide substitutions
small-scale multi-base deletions or insertions
retroposable element insertions and microsatellite repeat variations
For human (dbSNP build 129):
19,125,432 submissions (ss#’s)
2,920,818 new RefSNPs (rs#’s)
Non-synonymous: In coding sequence, resulting in an amino acid change
Synonymous: In coding sequence, not resulting in an amino acid change
Frameshift: In coding sequence, resulting in a frameshift
Stop lost: In coding sequence, resulting in the loss of a stop codon
Stop gained: In coding sequence, resulting in the gain of a stop codon
Essential splice site: In the first 2 or the last 2 base pairs of an intron
Splice site: 1-3 bp into an exon or 3-8 bp into an intron
Upstream: Within 5 kb upstream of the 5'-end of a transcript
Regulatory region: In a regulatory region annotated by Ensembl
5' UTR: In the 5' UTR
Intronic: In an intron
3' UTR: In the 3' UTR
Downstream: Within 5 kb downstream of the 3'-end of a transcript
Intergenic: More than 5 kb away from a transcript
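The position-based categories above (upstream, downstream, intergenic, intronic and the two splice-site classes) can be sketched as a simple lookup against a transcript's exon coordinates. This is a minimal sketch under stated assumptions, not Ensembl's implementation: the function `classify_by_location` is hypothetical, coordinates are 1-based and inclusive, the transcript lies on the forward strand, and exons are sorted by position. Coding, UTR and regulatory categories need extra annotation and are lumped together as "exonic" here.

```python
from typing import List, Tuple

FLANK_BP = 5000  # "within 5 kb" of the transcript, as in the list above


def classify_by_location(pos: int, exons: List[Tuple[int, int]]) -> str:
    """Classify a variant position relative to one forward-strand transcript.

    exons: sorted list of (start, end) pairs, 1-based inclusive (assumed layout).
    """
    tx_start, tx_end = exons[0][0], exons[-1][1]

    # Outside the transcript: upstream, downstream or intergenic.
    if pos < tx_start:
        return "upstream" if tx_start - pos <= FLANK_BP else "intergenic"
    if pos > tx_end:
        return "downstream" if pos - tx_end <= FLANK_BP else "intergenic"

    # Inside an intron: measure the depth from the nearer intron boundary.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron_start, intron_end = prev_end + 1, next_start - 1
        if intron_start <= pos <= intron_end:
            depth = min(pos - intron_start, intron_end - pos) + 1
            if depth <= 2:
                return "essential splice site"
            if depth <= 8:
                return "splice site"
            return "intronic"

    # Inside an exon: only boundaries that adjoin an intron count as splice sites.
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            depths = []
            if i > 0:
                depths.append(pos - start + 1)
            if i < len(exons) - 1:
                depths.append(end - pos + 1)
            if depths and min(depths) <= 3:
                return "splice site"
            return "exonic"

    raise ValueError("exon list is not sorted or not contiguous with introns")


if __name__ == "__main__":
    demo_exons = [(1000, 1100), (1201, 1300)]  # made-up coordinates
    for p in (500, 1050, 1099, 1101, 1103, 1150, 6300, 6301):
        print(p, classify_by_location(p, demo_exons))
```

For a reverse-strand transcript the same logic applies with "upstream" and "downstream" swapped.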
“The Diploid Genome Sequence of an Individual Human” PLoS Biology 5(10):2113-2144 (2007)
“The Complete Genome of an Individual by Massively Parallel DNA Sequencing” Nature 452:872-876 (2008)
“Accurate Whole Human Genome Sequencing Using Reversible Terminator Chemistry” Nature 456:53-59 (2008)
“The Diploid Genome Sequence of an Asian Individual” Nature 456:60-65 (2008)
LD (linkage disequilibrium) is the deviation from equilibrium, i.e. from random association, between alleles at two loci.
(e.g. in a population, two alleles are always inherited together, even though recombination should separate them some of the time.)
LD values between two variants are displayed by means of inverted coloured triangles going from white (low LD) to red (high LD).
Adapted from Nature 426, 6968: 789-796 (2003)
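To make "deviation from random association" concrete, the commonly used pairwise LD measures D, D' and r² can be computed from haplotype counts at two biallelic sites. The sketch below uses the textbook definitions; the function name `ld_stats` and the example counts are hypothetical, and it is not a description of how the browser computes the values it displays.

```python
def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int):
    """Return (D, D', r^2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    if n == 0:
        raise ValueError("no haplotypes supplied")

    p_AB = n_AB / n            # observed frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n    # frequency of allele A at the first site
    p_B = (n_AB + n_aB) / n    # frequency of allele B at the second site

    # D: observed haplotype frequency minus the frequency expected
    # if the two alleles were associated at random.
    d = p_AB - p_A * p_B

    # D' rescales |D| by its maximum possible value for these allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two sites.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    print(ld_stats(50, 0, 0, 50))    # alleles always inherited together
    print(ld_stats(25, 25, 25, 25))  # random association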
(Wikipedia): Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions.
Regulatory build using ENCODE project information
Promoters and Enhancers from CisRED and VISTA
FlyReg features (for Drosophila)
Encyclopedia Of DNA Elements
Where are the promoter, enhancer, and other regulatory regions of the human genome?
Pilot project showed: chromatin accessibility and histone modification analysis can be used to predict transcription start sites (TSS)
14 June 2007, Nature
Uses CTCF and DNase I data from multiple cell types as “core features”. Overlapping methylation sites expand these regions.
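The idea of core features being expanded by overlapping supporting features can be sketched with plain interval arithmetic. This is only an illustration of the general approach described above, not the Ensembl regulatory build itself: the function `expand_core_features`, the coordinates and the single-pass overlap rule are all assumptions.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates on one chromosome


def expand_core_features(core: List[Interval],
                         supporting: List[Interval]) -> List[Interval]:
    """Extend each core feature to cover any supporting feature that overlaps it.

    Overlap is tested against the original core boundaries, so a supporting
    feature that only touches an already-extended region is ignored.
    """
    expanded = []
    for core_start, core_end in core:
        new_start, new_end = core_start, core_end
        for sup_start, sup_end in supporting:
            overlaps = sup_start <= core_end and sup_end >= core_start
            if overlaps:
                new_start = min(new_start, sup_start)
                new_end = max(new_end, sup_end)
        expanded.append((new_start, new_end))
    return expanded


if __name__ == "__main__":
    core_features = [(100, 200)]                        # e.g. a CTCF or DNase I site
    supporting_features = [(150, 300), (50, 120), (400, 500)]
    print(expand_core_features(core_features, supporting_features))
```

Testing overlap against the original core boundaries keeps one supporting feature from chaining onto another; whether chaining is wanted is a design choice this sketch does not settle.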
Sequence motifs determined by experimental and prediction tools.
VISTA Enhancer Set
Tissue-specific enhancers. Tested experimentally.
Nucleic Acids Res. 2007 January; 35(Database issue): D88–D92.