Ensembl

Steve Searle

Joint project leader, Ensembl Genebuild team


Outline

  • Ensembl project overview

  • Core database schema and API

  • Pipeline

  • Genomic annotation

  • Comparative genomics

  • Variation data

  • Ensembl BioMart datamining db

  • Making the data available


What is Ensembl? Project aims

  • funded to provide vertebrate genomes to the world

  • aims to provide high-quality automated genome annotation

  • aims to be a leading group in genome analysis

  • all software, data and results freely available


What is Ensembl? Project background

  • Group split between EBI and Sanger

  • Mainly Wellcome Trust funded (recently received a new five year grant for 2006-2011)

  • Largest dedicated compute in biology in Europe

  • Developer community > 300 people, including companies


Ensembl - Technical overview

  • Data storage

    • MySQL databases (~160 GB in current release)

      • Core databases - annotation for each species

      • Variation databases - variation data for some species

      • Compara - single database containing all comparative genomic data for species in Ensembl

      • Mart - set of denormalised databases for datamining

  • Data production

    • Pipeline systems running automatic annotation on a compute farm of 800 CPUs

  • Interfaces

    • Website

    • Mart (datamining tool)

    • Apollo

    • SQL

    • APIs (both Perl and Java)



Open source

  • Object model

    • standard interface makes it easy for others to build custom applications on top of Ensembl data

  • Open discussion of design ([email protected])

  • Most major pharmaceutical companies and many academics on mailing list

  • Ensembl installs worldwide

    • Both public and commercial

      e.g. Gramene (CSHL)

      Ciona-sg (Temasek)

      Arabidopsis (NASC)

      Fugu (IMCB)


The Ensembl Core Database

  • Relational database (MySQL) containing the genomic sequence and the annotations on it (genes, alignments, ab initio predictions etc.)

  • Data are stored in it throughout the analysis process, and the website displays features from it

  • Current schema has 68 tables

  • The Ensembl core API team controls changes to the schema


Requirements for the schema

  • Store data for human genome

  • … and all the other genomes we have

  • … and all the genomes we might get

  • Flexible to add more data

  • Easy to adapt to a new genome

  • Responds fast enough for website display and the pipelined genebuild


System Context

[Diagram: the Ensembl DBs and the Mart DB, shown with the components around them: Perl API, Java API (EnsJ), Pipeline, www, Apollo, MartShell, MartView, and other scripts and applications]


Sequence Tables

[Schema diagram: the sequence tables and the cardinalities of their relationships]

Feature Tables

  • Feature tables describe annotations with positions in sequence.

  • Each feature is associated with a seq_region and has a start, end, and orientation on the seq_region.

  • There is no central feature table. There are tables specific to each feature type (DNA/DNA alignments, DNA/Protein alignments, Repeats, Simple features).

  • Different feature tables have different attributes, but always have a seq_region position.



Other features

[Schema diagram: further feature tables and the cardinalities of their relationships]


Tables for Genes

[Schema diagram: the gene tables and the cardinalities of their relationships]


Other tables

  • Sets of tables to handle:

    • Cross references from Ensembl features to external databases

    • Markers

    • QTLs

    • Regulatory regions and factors

    • Stable ID archive

    • Affymetrix probe data

    • Misc features

    • Density features

  • Tables containing meta information about the database

  • Karyotype bands

  • Protein annotation

  • Supporting evidence

  • Assembly exceptions (haplotypes and PARs)


Ensembl APIs

Programmatic access to Ensembl databases is via three main APIs:

  • ensembl core API: access to the genome databases

  • ensembl compara API: access to the compara database

  • ensembl variation API: access to the variation databases

All three have the same basic structure:

  • Data objects to represent biological entities, e.g. Gene, Homology, Variation

  • DataAdaptor objects to store and retrieve data objects from the database

Data production APIs:

  • ensembl-pipeline: genebuild pipeline

  • ensembl-analysis: analysis wrapper objects

  • ensembl-hive: compara pipeline


The Perl Core API

  • The Perl core API provides a layer of abstraction over the Ensembl core databases.

  • Written in Object-Oriented Perl.

  • Can be used to get information into or out of Ensembl databases.

  • Insulates programmers to some extent from changes to the database schema.

  • Insulates programmers from coordinate transformations.


Data Objects

  • Information is obtained from the API in the form of Data Objects.

  • Each object represents some data which is stored in the database.

  • A Gene object represents a gene, a Transcript object represents a transcript, a Marker object represents a marker, etc.


Data Objects – Code Example

# print out the start, end and strand of a transcript
print $transcript->start(), '-', $transcript->end(),
      '(', $transcript->strand(), ")\n";

# print out the stable identifier for an exon
print $exon->stable_id(), "\n";

# print out the name of a marker and its primer sequences
print $marker->display_marker_synonym()->name(), "\n";
print "left primer: ",  $marker->left_primer(),  "\n";
print "right primer: ", $marker->right_primer(), "\n";

# set the start and end of a simple feature
$simple_feature->start(10);
$simple_feature->end(100);


Object Adaptors

  • Object Adaptors are factories for Data Objects.

  • Data Objects are retrieved from and stored in databases using Object Adaptors.

  • Each Object Adaptor is responsible for creating objects of only one particular type.

  • The Object Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database.

  • All the SQL is in the Object Adaptors


Object Adaptors – Code Example

# fetch a gene by its internal identifier
$gene = $gene_adaptor->fetch_by_dbID(1234);

# fetch a gene by its stable identifier
$gene = $gene_adaptor->fetch_by_stable_id('ENSG0000005038');

# store a transcript in the database
$transcript_adaptor->store($transcript);

# remove an exon from the database
$exon_adaptor->remove($exon);

# get all transcripts having a specific interpro domain
@transcripts = @{$transcript_adaptor->fetch_all_by_domain('IPR000980')};


The DBAdaptor and the Registry

  • The Database Adaptor is a factory for Object Adaptors

  • It is used to connect to the database and to obtain Object Adaptors

  • Registry enables access to multiple databases using information from a config file (important for compara work)

[Diagram: DB, DBAdaptor, Object Adaptors (GeneAdaptor, MarkerAdaptor), Data Objects (Gene, Marker)]
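The layering of database, DBAdaptor, Object Adaptors and Data Objects can be sketched without the Ensembl libraries. The following is a toy illustration only: the packages Toy::DBAdaptor, Toy::GeneAdaptor and Toy::Gene, the identifier TOYG1, and the in-memory hash standing in for a MySQL connection are all assumptions made for this sketch, not Ensembl classes.

```perl
use strict;
use warnings;

# Toy data object: holds the data for one gene (hypothetical class).
package Toy::Gene;
sub new {
    my ($class, %args) = @_;
    my $self = {%args};
    return bless $self, $class;
}
sub stable_id { return $_[0]->{stable_id} }
sub start     { return $_[0]->{start} }
sub end       { return $_[0]->{end} }

# Toy object adaptor: the only place that knows how genes are stored.
package Toy::GeneAdaptor;
sub new {
    my ($class, $db) = @_;
    my $self = { db => $db };
    return bless $self, $class;
}
sub fetch_by_stable_id {
    my ($self, $id) = @_;
    my $row = $self->{db}{storage}{gene}{$id} or return undef;
    return Toy::Gene->new(stable_id => $id, %$row);
}
sub store {
    my ($self, $gene) = @_;
    $self->{db}{storage}{gene}{ $gene->stable_id } =
        { start => $gene->start, end => $gene->end };
    return $gene;
}

# Toy database adaptor: owns the "connection" and hands out object adaptors.
package Toy::DBAdaptor;
sub new {
    my ($class) = @_;
    # An in-memory hash stands in for a real database connection.
    my $self = { storage => { gene => {} } };
    return bless $self, $class;
}
sub get_GeneAdaptor {
    my ($self) = @_;
    return Toy::GeneAdaptor->new($self);
}

package main;

my $db           = Toy::DBAdaptor->new;
my $gene_adaptor = $db->get_GeneAdaptor;
$gene_adaptor->store(Toy::Gene->new(stable_id => 'TOYG1', start => 10, end => 100));

my $gene = $gene_adaptor->fetch_by_stable_id('TOYG1');
print $gene->stable_id, ' ', $gene->start, '-', $gene->end, "\n";
```

Keeping all storage access inside the adaptor is what lets the data objects stay unaware of the schema, which is the design point the slide makes with "All the SQL is in the Object Adaptors".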


Slices

  • A Slice Data Object represents an arbitrary region of a genome.

  • Slices are not directly stored in the database.

  • A Slice is used to request sequence or features from a specific region in a specific coordinate system.

[Diagram: example slices on chr20 and on clone AC022035]


Slices – Code Example

# get the slice adaptor
$slice_adaptor = $db->get_SliceAdaptor();

# fetch a slice on a region of chromosome 12
$slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6);

# print out the sequence from this region
print $slice->seq();

# get all clones in the database and print out their names
@slices = @{$slice_adaptor->fetch_all('clone')};
foreach $slice (@slices) {
    print $slice->seq_region_name(), "\n";
}


Features

  • Features are Data Objects with associated genomic locations.

  • All Features have start, end, strand and slice attributes.

  • Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices).

  • Gene

  • Transcript

  • Exon

  • PredictionTranscript

  • PredictionExon

  • DnaAlignFeature

  • ProteinAlignFeature

  • SimpleFeature

  • MarkerFeature

  • QtlFeature

  • MiscFeature

  • KaryotypeBand

  • RepeatFeature

  • AssemblyExceptionFeature

  • DensityFeature


A Complete Code Example

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# connect to the public database server
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_35_35h',
    -user   => 'anonymous');

# fetch a slice covering 1-10 Mb of chromosome X
my $slice_ad = $db->get_SliceAdaptor();
my $slice    = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

# print the position and score of every simple feature on the slice
foreach my $sf (@{$slice->get_all_SimpleFeatures()}) {
    my $start  = $sf->start();
    my $end    = $sf->end();
    my $strand = $sf->strand();
    my $score  = $sf->score();
    print "$start-$end($strand)$score\n";
}


A Gene Object Code Example

#!/usr/bin/perl -w
use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_35_35h',
    -user   => 'anonymous');

my $slice_ad = $db->get_SliceAdaptor();
my $slice    = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

foreach my $gene (@{$slice->get_all_Genes_by_type('ensembl')}) {
    print "Gene ", $gene->stable_id, " ", $gene->start, " ", $gene->end, "\n";

    foreach my $trans (@{$gene->get_all_Transcripts}) {
        print " Trans ", $trans->stable_id, "\n";

        # wrap the translated peptide sequence at 60 residues per line
        my $tlnseq = $trans->translate->seq;
        $tlnseq =~ s/(.{1,60})/$1\n/g;
        print " ", $tlnseq;
    }
}


Coordinate Transformations

  • The API provides the means to convert between any related coordinate systems in the database.

  • The Feature methods transfer, transform and project can be used to move features between coordinate systems.

  • The Slice method project can be used to project a slice onto another coordinate system.


Feature::transfer

  • The Feature method transfer moves a feature from one Slice to another.

  • The Slice may be in the same coordinate system or a different coordinate system.

[Diagram: example slices labelled Chr20, Chr17, ChrX and AC099811]


Feature::transfer – Code Example

# fetch an exon from the database
$exon = $exon_adaptor->fetch_by_stable_id('ENSE00001180238');
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

# transfer the exon to a small slice just covering it
$exon_slice = $slice_adaptor->fetch_by_Feature($exon);
$exon = $exon->transfer($exon_slice);
print "Exon is on slice: ", $exon->slice()->name(), "\n";
print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

Sample output:

Exon is on slice: chromosome:NCBI34:12:1:132078379:1
Exon coords: 56452706-56452951
Exon is on slice: chromosome:NCBI34:12:56452706:56452951:1
Exon coords: 1-246
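The arithmetic behind the sample output can be reproduced in a few lines of plain Perl. This is a minimal sketch, assuming both slices lie on the same seq_region and on the forward strand; the helper name transfer_coords is hypothetical and is not part of the Ensembl API, which also handles strand changes and mapping between coordinate systems.

```perl
use strict;
use warnings;

# Convert a feature's start/end from one slice to another slice on the
# same seq_region and strand. Slice coordinates are 1-based and inclusive,
# so a feature beginning at the slice start has local coordinate 1.
sub transfer_coords {
    my ($start, $end, $from_slice_start, $to_slice_start) = @_;

    # Recover the absolute seq_region positions first...
    my $abs_start = $start + $from_slice_start - 1;
    my $abs_end   = $end   + $from_slice_start - 1;

    # ...then express them relative to the target slice.
    return ($abs_start - $to_slice_start + 1,
            $abs_end   - $to_slice_start + 1);
}

# The exon from the sample output: the chromosome slice starts at 1 and
# the small slice covering the exon starts at 56452706.
my ($s, $e) = transfer_coords(56452706, 56452951, 1, 56452706);
print "Exon coords: $s-$e\n";
```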


Stability of API

  • Ensembl API changes to meet our needs

  • Request for greater stability from users

  • Some methods are now labelled as stable and we guarantee that they will not change for at least 2 years.


Runnables and RunnableDBs

  • Runnables are Perl objects which wrap analysis programs, e.g. the Blast runnable wraps blast. Methods:

    • run

    • parse_results

      • Generates Ensembl data objects

    • output

      • Returns generated data objects

  • RunnableDBs are Perl objects which wrap Runnables, allowing them to retrieve input data from and store output data into Ensembl databases. Methods:

    • fetch_input

    • write_output


Runnable example

use strict;
use warnings;
use Bio::SeqIO;
use Bio::EnsEMBL::Slice;
use Bio::EnsEMBL::CoordSystem;
use Bio::EnsEMBL::Analysis;
use Bio::EnsEMBL::Analysis::Runnable::Genscan;
use Bio::EnsEMBL::Analysis::Runnable::Blast;
use Bio::EnsEMBL::Analysis::Tools::BPliteWrapper;

# read the first sequence from a FASTA file and wrap it in a Slice
my $seq = Bio::SeqIO->new(-file => "<test.fa", -format => 'Fasta')->next_seq;

my $slice = Bio::EnsEMBL::Slice->new(
    -seq             => $seq->seq,
    -coord_system    => Bio::EnsEMBL::CoordSystem->new(-name => 'contig', -rank => 1),
    -seq_region_name => $seq->display_id,
    -start           => 1,
    -end             => $seq->length);

# run Genscan on the slice
my $genscan_runnable = Bio::EnsEMBL::Analysis::Runnable::Genscan->new(
    -query    => $slice,
    -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'genscan'));
$genscan_runnable->run;

# blast the translation of each Genscan prediction against vertebrate RNAs
my @output;
foreach my $prediction (@{$genscan_runnable->output}) {
    my $blast_run = Bio::EnsEMBL::Analysis::Runnable::Blast->new(
        -query    => $prediction->translate,
        -parser   => Bio::EnsEMBL::Analysis::Tools::BPliteWrapper->new(),
        -database => 'embl_vertrna',
        -program  => 'wutblastn',
        -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'vertrna'));
    $blast_run->run;
    push(@output, @{$blast_run->output});
}


RunnableDB example

#!/usr/local/ensembl/bin/perl -w
use strict;
use Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Pipeline::Analysis;

my $db = Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor->new(
    -host   => 'localhost',
    -user   => 'root',
    -dbname => 'test_db');

my $anal = $db->get_AnalysisAdaptor->fetch_by_logic_name('Uniprot');

# build the RunnableDB class name from the analysis and load that module
my $rdbstr = "Bio::EnsEMBL::Analysis::RunnableDB::" . $anal->module;
eval "require $rdbstr" or die "Cannot load $rdbstr: $@";

my $runobj = $rdbstr->new(
    -db       => $db,
    -input_id => 'contig::AL1347153.1.3517:1:3571:1',
    -analysis => $anal);

# fetch the input from the database, run the analysis, store the results
$runobj->fetch_input;
$runobj->run;
$runobj->write_output;


Writing Runnables and RunnableDBs

  • A lot of functionality is implemented in the base classes

  • At its simplest just requires implementing:

    parse_results in the Runnable

    get_adaptor in the RunnableDB

    fetch_input in the RunnableDB

  • Other methods which may need overriding

    write_output in the RunnableDB

    run_analysis in the Runnable
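The division of labour described above, where the base class supplies the control flow and a subclass supplies parse_results, can be sketched with toy classes. This is an illustration under stated assumptions, not the Bio::EnsEMBL::Analysis::Runnable code: the packages Toy::Runnable and Toy::Runnable::WordCount are hypothetical, and run_analysis simply passes the query through instead of running an external program.

```perl
use strict;
use warnings;

# Hypothetical base class: run() drives the fixed sequence of steps.
package Toy::Runnable;
sub new {
    my ($class, %args) = @_;
    my $self = { query => $args{query}, output => [] };
    return bless $self, $class;
}
sub run {
    my ($self) = @_;
    my $raw = $self->run_analysis;    # default provided, may be overridden
    $self->parse_results($raw);       # must be implemented by subclasses
    return $self;
}
sub run_analysis {
    my ($self) = @_;
    # A real runnable would execute an analysis program here; the toy
    # version just returns the query unchanged.
    return $self->{query};
}
sub parse_results { die "parse_results must be implemented by the subclass\n" }
sub output        { return $_[0]->{output} }

# Subclass: only parse_results is supplied.
package Toy::Runnable::WordCount;
our @ISA = ('Toy::Runnable');
sub parse_results {
    my ($self, $raw) = @_;
    my %count;
    $count{$_}++ for split ' ', $raw;
    push @{ $self->{output} }, map { "$_=$count{$_}" } sort keys %count;
    return;
}

package main;

my $runnable = Toy::Runnable::WordCount->new(query => 'ACGT TTGA ACGT');
$runnable->run;
print join(',', @{ $runnable->output }), "\n";
```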




Current hardware

  • 8x ES40 Alpha (667 MHz) with 2 TB fibre channel storage

  • 10x ES45 Alpha (1 GHz) with 5 TB fibre channel storage

  • 3x Itanium 4 CPU with 1.6 TB storage

  • 400 IBM HS20 blades (2x 2.8 or 3.2 GHz PIV, 4 GB memory), with a 2 TB clustered SAN filesystem or a 600 GB clustered IDE filesystem (both IBM GPFS)

  • Tru64 UNIX/Linux

  • 21 MySQL (v4.1) instances

  • Most binaries and all sequence databases stored locally (avoids using NFS)


Genome annotation overview

Raw compute - Alignments against protein and DNA dbs, and other basic analyses

Automatic gene annotation

Protein coding gene models

Pseudogenes (some)

RNA genes

Alignment of species ESTs and cDNAs

Affymetrix probe mapping

Protein domain annotation

Cross reference generation


The Raw Computes

  • Repeat Features

    • RepeatMasker

    • Dust

    • TRF

  • Ab Initio Genes

    • Genscan (sometimes other programs)

  • Blast alignments

    • Blastp against Uniprot

    • Blastn against EMBL vertebrate RNAs and UniGene Clusters

  • Other Features

    • CpG islands

    • tRNA genes

    • Transcription start sites using Eponine


Gene Annotation

[Flow diagram of the gene build. Inputs: species-specific proteins, other proteins, species-specific cDNAs, species-specific ESTs, and an optional blessed gene set. Programs and steps: Genewise, Exonerate, ClusterMerge, supported ab initio (optional), Genebuilder, Gene Combiner. Intermediate and final sets: Genewise genes, aligned cDNAs, aligned ESTs, Genewise genes with UTRs, cDNA genes, preliminary gene set, pseudogenes, final set + pseudogenes, core Ensembl genes, Ensembl EST genes]


ncRNAs

  • Functional RNAs

  • Families share conserved secondary structure

  • Low sequence identity

  • Ribosome

  • Spliceosome

  • tRNAs

  • miRNA


Difficulties in annotating ncRNAs

Ab initio gene predicting programs such as GENSCAN cannot predict non-coding genes.

BLAST performs poorly at detecting non coding genes where structure is conserved but sequence identity is low.

Cannot use repeat masked DNA as some ncRNAs look very much like repeats (ALU related to SRP RNA)

Cannot use ESTs as ncRNAs lack poly-A


RFAM

  • Hand made alignments

  • Use Infernal to make Covariance Models

  • Scan models over subset of EMBL to build family alignments


Problems 2

  • Infernal does not scale well:

  • “Covariance model searches are extremely compute intensive… The compute time scales roughly to the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources”

  • How long would it take to run the human genome?

  • Rough estimate > a week on the farm

  • Need to limit the amount of sequence we run Infernal on
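To make the quoted scaling concrete, the sketch below computes relative search cost under the rough fourth-power rule. It is a back-of-envelope model only: the function name relative_cost and the example model lengths are assumptions chosen for illustration, not measured Infernal timings.

```perl
use strict;
use warnings;

# Relative cost of searching with a model of a given length, assuming the
# rough fourth-power scaling quoted above.
sub relative_cost {
    my ($length, $reference_length) = @_;
    return ($length / $reference_length) ** 4;
}

# Use a 70 nt model as the reference; the other lengths are arbitrary.
for my $len (70, 140, 280) {
    printf "%4d nt: %6.0fx\n", $len, relative_cost($len, 70);
}
```

Under this model, doubling the length of the RNA multiplies the compute time by sixteen, which is why restricting the amount of sequence searched matters so much.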


Rfam Scan

  • Rfam procedure to speed up Infernal on large eukaryotes

    • Uses Blast to narrow search:

      • BLAST is poor at finding ncRNAs with low sequence ID

      • RFAM families contain sequences from all organisms

      • More sequence variation = more chance of Blast making alignment

  • In Ensembl:

    • Separated the Blast and Infernal steps (using Runnables)

    • Determined filtering for Blast results to limit time without significant reduction in sensitivity

    • Now runs in less than 24 hours.


miRNA

  • Highly conserved across species

  • Precursor stem loop sequence ~ 70nt

  • Mature miRNA ~ 21nt

  • Identified using BLAST genomic vs miRBase precursors

  • RNAfold used to test for stem loop

  • Mature sequence identified (only 2 nt changes tolerated)

  • Start with ~ 290,000 blast hits

  • End with 222 miRNA

  • 96% of SE miRNAs + additional 60

  • Novel c.f. miRBase:

  • 1 – chicken, 36 – mouse, 5 – rat
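The "only 2 nt changes tolerated" rule for the mature sequence amounts to a mismatch count with a threshold. A minimal sketch, assuming the candidate and the known mature sequence are already aligned without gaps and have equal length; the function names and the 21 nt sequences are made up for illustration and are not miRBase entries.

```perl
use strict;
use warnings;

# Count positions at which two equal-length sequences differ.
sub count_mismatches {
    my ($x, $y) = @_;
    die "sequences must have equal length\n" unless length($x) == length($y);
    my $mismatches = 0;
    for my $i (0 .. length($x) - 1) {
        $mismatches++ if uc(substr($x, $i, 1)) ne uc(substr($y, $i, 1));
    }
    return $mismatches;
}

# Accept a candidate mature sequence if it differs from the known one at
# no more than $max_changes positions (2 by default, as on the slide).
sub is_mature_match {
    my ($candidate, $known, $max_changes) = @_;
    $max_changes = 2 unless defined $max_changes;
    return count_mismatches($candidate, $known) <= $max_changes ? 1 : 0;
}

my $known     = 'ACGUACGUACGUACGUACGUA';   # made-up 21 nt sequence
my $two_off   = 'ACGUACGUACGAACGUACGUC';   # differs at 2 positions
my $three_off = 'GCGUACGUACGAACGUACGUC';   # differs at 3 positions

print "two changes:   ", (is_mature_match($two_off, $known)   ? "kept" : "rejected"), "\n";
print "three changes: ", (is_mature_match($three_off, $known) ? "kept" : "rejected"), "\n";
```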



Structures

Structures identified by Infernal / RNAfold are stored as transcript attributes

::::::::::::::::<<-<<<<<-<<<________________>>>>>>>>-->>,,,,

1 AuCUUUGCGCAGGGGCaaUaucguAgccAGUGAGGcUuuaCCGAggcgcgauUAuuGCUA 60

A+CUUUGCGCAG GGCA:UAU :UAGCCA+UGAGG+UU++CCGAGGCG: AUUA:UGCUA

181 AGCUUUGCGCAGUGGCAGUAUCAUAGCCAAUGAGGUUUAUCCGAGGCGCAAUUAUUGCUA 240

<<<<_.________.__>>>>,,,,,<<<.<<<<<<<<<<____......__>>>>>>>>

61 gUugA.AAACUAUU.CCcaAccgCCCgcc.aagacgacauguua......uauugucggc 111

:UU A AAA UA AA:+G G:C ::: ::A:::+UUA U :::U::+:

241 AUUAAuAAAUUAAAuAAUAAAAGGG-GACuCUU-UUAGUGCUUAuaaaggUUUACUAACC 298

>>->>>,,,,,,,,,,,,<<<<____>>>>

112 uuuggcAAUUUUUGGAAGcccuccAaaggg 141

:: G:CAA UU +AAG ::C+AA::

299 ACAGACAACUU---AAAGGUAACAAACCUA 325

Displayable on website as markup on transcript sequence


Low Coverage Genomes

16 mammalian genomes sequenced to 2x coverage are expected over the next 2-3 years.

Ensembl is aiming to provide gene sets for these based on alignments to human, building predictions on scaffolds which align to the genomic locations of human genes.

Test case

Cow preliminary 3x assembly: 449,727 scaffolds, 795,212 contigs

A good test case because a 6x assembly was recently made available, so we can assess the accuracy of the method.

Method overview

  • Details

    • Raw alignments grouped using UCSC chain and net method

    • Use human as source for cow - human has best annotation

    • ‘Gene scaffolds’ (new coord system) are stored in the database

    • Allow scaffolds to be broken at contig gaps

    • Retain ‘gap’ exons


Dealing with duplication: Iterative Human Net

[Diagram: Human aligned against Cow, with three cow nets shown in red, blue and green]


Cow gene with ‘gap’ exons



QC (and human)

  • Internal QC

    • Comparison against Uniprot or Refseq

    • Comparison to previous build

    • Comparisons to homologs

  • External comparisons - the CCDS set


Increase in quality




Improving the human build

  • CCDS

    • Collaboration with Havana, NCBI and UCSC to produce a stable, reliable set of complete (ATG->Stop) CDS structures for human

    • NCBI and Ensembl guarantee to retain the set in builds

    • Generation:

      • Comparison of merged Ensembl/Vega set with the NCBI Refseq set to find the set of complete CDSs both groups predict identically.

      • UCSC (and the other groups) analyse the complete sets and the CDS intersection set for possible errors.

      • Assign stable ids and release

    • This process has been very valuable to both NCBI and us in highlighting problems in our build processes.

    • Two rounds of comparisons have taken place


CCDS display


Affymetrix probe mapping

  • Exonerate to map probe sequences

  • Assign xrefs to transcripts for matched probe sets

  • API currently being modified to be less Affymetrix array format (probeset) specific (by Zebrafish annotation team)

  • Dog, chicken, fruitfly, zebrafish, rat, mouse, human (worm next month)


Other developments

  • Cross reference data

    • New system for generating this data, leading to:

      • More reliable generation of xrefs

      • New types of xref eg. Unigene

  • Monthly cDNA alignment set updates for human

    • Genomic alignments of cDNAs using an up to date cDNA set.

    • Displayed on the website in the ‘cDNAs’ track


What is Ensembl Compara?

A single database which links all the Ensembl species databases together through precalculated comparative genomics analyses.

A Perl object API to access and create that database.

A production system for generating that database.


Comparing different species

[Phylogenetic tree figure with a time axis in million years (1000, 500, 400, 300, 200, 100). Species shown, with assembly name and genome size where given:]

H. sapiens (human), NCBI35, 3 Gb *
P. troglodytes (common chimpanzee), 3 Gb *
M. mulatta (rhesus macaque) *
M. musculus (house mouse), NCBIm33, 2.6 Gb *
R. norvegicus (Norway rat), RGSC3.1, 2.6 Gb *
C. familiaris (domestic dog), BROAD1, 2.5 Gb *
F. catus (domestic cat)
E. caballus (horse)
S. scrofa (domestic pig)
B. taurus (domestic cattle), Btau 1.0 +
O. aries (domestic sheep)
M. domestica (opossum) +
G. gallus (domestic fowl), WASHUC1, 1.2 Gb *
X. laevis (African clawed frog), JGI3, 3.1 Gb
X. tropicalis (tropical clawed frog), 1.7 Gb *
D. rerio (zebrafish), WTSI Zv4, 1.7 Gb *
O. latipes (Japanese medaka), 800 Mb
T. nigroviridis (freshwater pufferfish), 400 Mb *
T. rubripes (tiger pufferfish), Fugu v2.0, 400 Mb *
C. savignyi (sea squirt), 180 Mb
C. intestinalis (sea squirt), 180 Mb +
A. aegypti (yellow fever mosquito)
A. gambiae (African malaria mosquito), 230 Mb *
D. melanogaster (fruitfly), BDGP3.1, 125 Mb *
A. mellifera (honey bee), Amel1.1, 200 Mb *
I. scapularis (tick)
C. elegans (nematode), WS116, 100 Mb *
S. cerevisiae (yeast, SGD), S228C, 12 Mb *

* 17 species currently in Ensembl

+ 3 to be added soon

Red: whole genome assembly available

Green: whole genome assembly due in the next 2 years


Compara database

Protein/Protein

  • Gene orthology / paralogy predictions

  • Protein Family clusters

  • Raw protein alignments (wublastp)

Dna/Dna

  • Synteny regions

  • Whole genome alignments (BLAT, BlastZ, chain/net)

  • Whole genome multiple alignments (Mercator, MLagan)



Gene orthology prediction

[Pipeline diagram, steps in order:]

  • wublastp+sw comparisons in both directions (each species as query, qy, and as database, db) for every species pair: species1:species2, species1:species3, species2:species3

  • Best Reciprocal Hits

  • Protein->DNA alignments

  • dN, dS calculation

  • Extra orthologous pairs found, based on gene order conservation
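The Best Reciprocal Hits step can be illustrated on a small in-memory set of hits. This is a minimal sketch of the general technique, not the Compara pipeline code: the function names, the gene identifiers and the scores are made up, and ties and multiple best hits (the MBRH cases) are not handled.

```perl
use strict;
use warnings;

# For each query, keep the subject with the highest score.
# $hits is an array ref of [query, subject, score] triples.
sub best_hits {
    my ($hits) = @_;
    my (%best, %best_score);
    for my $hit (@$hits) {
        my ($query, $subject, $score) = @$hit;
        if (!exists $best_score{$query} || $score > $best_score{$query}) {
            $best{$query}       = $subject;
            $best_score{$query} = $score;
        }
    }
    return \%best;
}

# A pair (a, b) is a best reciprocal hit when b is the best hit of a in
# species 2 and a is the best hit of b in species 1.
sub best_reciprocal_hits {
    my ($hits_1_vs_2, $hits_2_vs_1) = @_;
    my $best_12 = best_hits($hits_1_vs_2);
    my $best_21 = best_hits($hits_2_vs_1);
    my @pairs;
    for my $gene1 (sort keys %$best_12) {
        my $gene2 = $best_12->{$gene1};
        push @pairs, [$gene1, $gene2]
            if defined $best_21->{$gene2} && $best_21->{$gene2} eq $gene1;
    }
    return \@pairs;
}

# Made-up identifiers and scores, for illustration only.
my @human_vs_mouse = (['hA', 'mA', 900], ['hA', 'mB', 300],
                      ['hB', 'mB', 500], ['hC', 'mB', 450]);
my @mouse_vs_human = (['mA', 'hA', 880], ['mB', 'hB', 510], ['mB', 'hC', 440]);

my $pairs = best_reciprocal_hits(\@human_vs_mouse, \@mouse_vs_human);
print join(' ', map { "$_->[0]:$_->[1]" } @$pairs), "\n";
```

In this toy data hC has mB as its best hit, but mB prefers hB, so hC is left without a reciprocal partner; genes like this are the ones the later synteny-based step tries to rescue.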


RHS, Orphans and Others

Using UBRH and MBRH-DUPs as anchors and comparing genomic coordinates in both species, we identify additional orthologues labelled RHS, for Reciprocal Hit supported by Synteny.

[Diagram labels: Human, Mouse, UBRH, MBRH, SYN, DUP1.2, COMPLEX, RHS, Orphan, Others, matches to some other chromosome]


MultiContigView


Pairwise whole genome alignments pipeline

[Pipeline diagram:]

  • Species1 and Species2 (query, qy, and database, db): DNA chunking

    • DNA chunks defined by size, overlap, masking options, chunk grouping size and dump location

  • PairAligner superclass: Blastz, BLAT, Exonerate

  • Filtering (UCSC chain and net code)
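The chunking step, defined by a chunk size and an overlap, reduces to simple coordinate arithmetic. A minimal sketch, assuming 1-based inclusive coordinates; the function name chunk_region and the example numbers are hypothetical, and masking, chunk grouping and dumping are left out.

```perl
use strict;
use warnings;

# Split a region of $length bases into chunks of at most $size bases, with
# consecutive chunks sharing $overlap bases. Returns an array ref of
# [start, end] pairs in 1-based inclusive coordinates.
sub chunk_region {
    my ($length, $size, $overlap) = @_;
    die "overlap must be smaller than chunk size\n" unless $overlap < $size;
    return [] if $length < 1;

    my @chunks;
    my $start = 1;
    while (1) {
        my $end = $start + $size - 1;
        $end = $length if $end > $length;    # last chunk may be shorter
        push @chunks, [$start, $end];
        last if $end == $length;
        $start = $end - $overlap + 1;        # step back by the overlap
    }
    return \@chunks;
}

my $chunks = chunk_region(2500, 1000, 100);
print join(' ', map { "$_->[0]-$_->[1]" } @$chunks), "\n";
```

The overlap ensures that an alignment spanning a chunk boundary is still found whole in at least one chunk, provided it is shorter than the overlap.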



SyntenyView


Multiple whole genome alignments pipeline

[Pipeline diagram:]

  • Coding exons from Species1, Species2 and Species3

  • wublastp all vs all (orthology anchors)

  • Orthology Map Builder: Mercator

  • MultipleAligner superclass: MLAGAN, MAVID, PECAN


AlignSlice API

Uses whole genome pairwise/multiple alignment data to generate a reference coordinate system common to the aligned species in the genomic region of interest.

Able to project features (including transcripts) from one species to another through the alignment.

Gives annotation context information across species.


AlignSliceView


Variation database

  • Refactored ensembl-variation database replaced the ensembl SNP (and lite) database

  • New API to access the DB from Perl and Java

  • Variation databases for 7 species:

    • Homo_Sapiens

    • Mus_Musculus

    • Anopheles_Gambiae

    • Rattus_Norvegicus

    • Gallus_Gallus

    • Danio_Rerio

    • Canis_Familiaris



Ensembl-variation API

  • Similar in design to ensembl core and compara APIs:

    • Adaptor objects onto the database

    • Objects to represent biological entities such as:

      • Variation and VariationFeature

      • TranscriptVariation

      • Allele

      • Genotype

      • Population

      • Individual

      • AlleleGroup


Generating SNP gene consequence

  • SNPs occurring within transcripts are identified and their consequence for that transcript determined

  • Classified into

    • Coding

      • Synonymous

      • Non synonymous

      • Frameshift

      • Stop gain / loss

    • Non coding UTR exonic

    • Intronic

    • Upstream or downstream
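For the coding classes, the distinction between synonymous, non-synonymous and stop changes comes down to translating the reference and variant codons. The sketch below uses the standard genetic code; the function name coding_consequence and its arguments are assumptions for illustration, not the ensembl-variation API, and frameshifts (which need insertions or deletions) and the non-coding classes are not covered.

```perl
use strict;
use warnings;

# Standard genetic code, built from the conventional TCAG ordering.
my @bases = qw(T C A G);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $index++, 1);
        }
    }
}

# Classify a single-base substitution inside a codon.
# $offset is the 0-based position of the changed base within the codon.
sub coding_consequence {
    my ($codon, $offset, $alt_base) = @_;
    my $ref_codon = uc $codon;
    my $alt_codon = $ref_codon;
    substr($alt_codon, $offset, 1) = uc $alt_base;

    my $ref_aa = $codon_table{$ref_codon};
    my $alt_aa = $codon_table{$alt_codon};
    return 'synonymous'  if $ref_aa eq $alt_aa;
    return 'stop gained' if $alt_aa eq '*';
    return 'stop lost'   if $ref_aa eq '*';
    return 'non-synonymous';
}

print coding_consequence('CTT', 2, 'C'), "\n";   # CTT (Leu) -> CTC (Leu)
print coding_consequence('TGG', 2, 'A'), "\n";   # TGG (Trp) -> TGA (stop)
print coding_consequence('GAA', 1, 'T'), "\n";   # GAA (Glu) -> GTA (Val)
```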


LD calculation

  • Calculate pairwise LD in different populations

  • In each population, count how many individuals have genotype AA, Aa and aa

    • for each pair of variations within a defined window size (100,000)

    • 7 populations (from HapMap and Perlegen) and 309M rows in the individual_genotype table

  • 86M rows in the pairwise_ld table (with r2 > 0.05 and population sample_size >= 40)
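The r2 statistic used as the storage threshold can be written down directly when haplotype frequencies are known. This is a simplified sketch under that assumption: the pipeline described above starts from genotype counts (AA, Aa, aa), which requires estimating haplotype frequencies first, and that step is not shown. The function name r_squared is hypothetical.

```perl
use strict;
use warnings;

# r^2 between two biallelic sites from haplotype frequencies:
#   D   = p_AB - p_A * p_B
#   r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
# $p_ab is the frequency of the haplotype carrying allele A at the first
# site and allele B at the second; $p_a and $p_b are the allele frequencies.
sub r_squared {
    my ($p_ab, $p_a, $p_b) = @_;
    my $denominator = $p_a * (1 - $p_a) * $p_b * (1 - $p_b);
    return undef if $denominator == 0;    # one of the sites is monomorphic
    my $d = $p_ab - $p_a * $p_b;
    return $d * $d / $denominator;
}

printf "%.3f\n", r_squared(0.50, 0.5, 0.5);   # alleles always occur together
printf "%.3f\n", r_squared(0.25, 0.5, 0.5);   # alleles occur independently

# Only pairs above the threshold quoted above would be stored.
my $r2 = r_squared(0.40, 0.5, 0.5);
print "store pair\n" if defined $r2 && $r2 > 0.05;
```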



BioMart and EnsMart

  • Large-scale data retrieval tool

  • Query builder interface

  • Databases: Ensembl, SNP, Vega, (MSD, UniProt)

  • Associated features or sequences

  • Flexible output formats

  • http://www.biomart.org

  • http://www.ensembl.org/Multi/martview


Pre!


Web code

[Diagram: client browser, view script (input, output), data, renderer, and the Ensembl API over the core, est, snp and var databases]

  • Encapsulates

    • Input

    • Output

    • Ensembl API

    • Rendering

  • Improves

    • Maintainability

    • Flexibility

    • Code re-use


Ensembl web site

  • Web site is the main access method

  • Hardware recently upgraded

    • Website now runs on blades (2-CPU Intel boxes) like the compute farm.

      • Scale by adding more

    • Site speed is important to users

  • Code and interface updated during the summer

    • Plugins

      • Customisation of the site

    • Side bar

      • Quick access / discovery of pages relating to current page

  • DAS can be used to put up user features as a track on the site


Entry points


ContigView

Overview

Detailed View

Basepair View


GeneView


Data retrieval

BioMart

Export View

Data sets on ftp site

MySQL queries of databases

Perl API access to databases


BioMart


BioMart - Features


Database access via MySQL

mysql -h ensembldb.ensembl.org -u anonymous

mysql> show databases like 'h%';
+------------------------------+
| Database (h%)                |
+------------------------------+
| homo_sapiens_core_14_31      |
| homo_sapiens_core_15_33      |
| homo_sapiens_core_16_33      |
| homo_sapiens_disease_14_31   |

mysql> use homo_sapiens_core_29_35b;
Database changed

mysql> show tables;
+------------------------------------+
| Tables_in_homo_sapiens_core_29_35b |
+------------------------------------+
| analysis                           |
| assembly                           |
| chromosome                         |
| clone                              |


Archive site


Archive site details

  • Each new release is archived onto a web blade

  • Plan is to keep each archive site up for 2 years

  • Stable links (for 2 years):

    http://nov2004.archive.ensembl.org/Homo_sapiens/geneview?gene=ENSG00000139618

  • Will allow for better handling of retired stable ids


Ensembl Team

October 2005