High throughput biology projects

The new biology • Traditional biology: • Small team working on a specialized topic • Well defined experiment to answer precise questions • New « high-throughput » biology • Large international teams using cutting edge technology defining the project • Results are given raw to the scientific community without any underlying hypothesis

Examples of « high-throughput » • Complete genome sequencing • Large-scale sampling of the transcriptome • Simultaneous gene expression analysis of thousands of genes (DNA chips) • Large-scale sampling of the proteome • Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm) • Large-scale 3D structure production (yeast) • Metabolism modelling • Biodiversity

Role of bioinformatics • Control and management of the data • Analysis of primary data, e.g. • Base calling from chromatograms • Mass spectra analysis • DNA chip image analysis • Statistics • Analysis of results in a biological context

Genomes in numbers • Sizes: • virus: 10^3 to 10^5 nt • bacteria: 10^5 to 10^7 nt • yeast: 1.35 x 10^7 nt • mammals: 10^8 to 10^10 nt • plants: 10^10 to 10^11 nt • Gene number: • virus: 3 to 100 • bacteria: ~ 1000 • yeast: ~ 7000 • mammals: ~ 30’000

Sequencing projects • « small » genomes (<10^7 nt): bacteria, viruses • Many already sequenced (industry excluded) • More than 60 bacterial genomes already in the public domain • More to come! (one every two weeks…) • « large » genomes (10^7 to 10^10 nt): eukaryotes • 5 finished (S.cerevisiae, C.elegans, D.melanogaster, A.thaliana, Homo sapiens) • Many more to come: mouse, rat, rice (and other plants), fishes, many pathogenic parasites • EST sequencing • Partial mRNA sequences • ~8.5 x 10^6 sequences in the public domain

Human genome • Size: 3 x 10^9 nt for a haploid genome • Highly repetitive sequences 25%, moderately repetitive sequences 25-30% • Size of a gene: from 900 to >2’000’000 bases (introns included) • Proportion of the genome coding for proteins: 5-7% • Number of chromosomes per haploid set: 22 autosomes and 1 sex chromosome (X or Y) • Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases

How to sequence the human genome? • Consortium « international » approach: • Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for indicator sequences • Generate a physical map based on large clones (BAC or PAC) • Sequence enough large clones to cover the genome • « commercial » approach (Celera): • Generate random libraries of fixed length genomic clones (2kb and 10kb) • Sequence both ends of enough clones to obtain a 10x coverage • Use computer techniques to reconstitute the chromosomal sequences, check with the public project physical map

Mapping resources • Genetic and physical maps: Genethon, GDB, NCBI • Radiation hybrid map: Sanger • BAC production & mapping: Oakland, Caltech, others • Clone information and retrieval: RZPD (Germany) • Physical maps in ACEDB format from chromosome coordinators

Sequencing • Create shotgun library from BAC/PAC • Sequence individual clones to get a ten-fold coverage • Phases: • 0 = single sequence (like STS) • 1 = unordered contigs • 2 = ordered, oriented contigs • 3 = finished, annotated sequence
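The ten-fold coverage target can be turned into a rough read count. The sketch below is only an illustration: the BAC insert size (150 kb) and the read length (500 nt) are assumed values that do not come from the slides, and the uncovered-fraction estimate relies on the idealized assumption that reads land uniformly at random on the clone.

```python
import math


def reads_for_coverage(target_length, read_length, coverage):
    """Smallest number of reads whose total length reaches coverage x target."""
    return math.ceil(coverage * target_length / read_length)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases hit by no read: e^(-coverage)."""
    return math.exp(-coverage)


# Assumed, illustrative values (not taken from the slides).
bac_length = 150_000  # nt, hypothetical BAC insert
read_length = 500     # nt, hypothetical average read

n_reads = reads_for_coverage(bac_length, read_length, coverage=10)
print(n_reads)
print(expected_uncovered_fraction(10))
```

Under these assumptions a single BAC needs a few thousand reads, and the random-placement model predicts that only a tiny fraction of bases stays unsequenced. Real shotgun libraries are not perfectly random, so gaps are more frequent in practice.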

Chromosome size sequences • Problem: full chromosomes or entire bacterial genomes are too long to fit the database entry specifications • Solution: split the sequence in overlapping “chunks” • New problem: have to reassemble chunks if you want to analyze the whole sequence • GenBank provides “meta-entries” (CON division) with assembly instructions
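The split-and-reassemble idea can be sketched as follows. This is not the GenBank CON format itself, only a minimal illustration of overlapping chunks identified by their start offset; the function names, the chunk size and the overlap are hypothetical.

```python
def split_into_chunks(sequence, chunk_size, overlap):
    """Cut a sequence into (start, piece) chunks that overlap by `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, sequence[start:start + chunk_size]))
        if start + chunk_size >= len(sequence):
            break  # this chunk reaches the end of the sequence
        start += step
    return chunks


def reassemble(chunks):
    """Rebuild the full sequence from (start, piece) chunks, dropping overlaps."""
    parts = []
    end = 0  # number of bases already reconstructed
    for start, piece in sorted(chunks):
        if start > end:
            raise ValueError("gap between chunks at position %d" % end)
        parts.append(piece[end - start:])  # skip bases already covered
        end = max(end, start + len(piece))
    return "".join(parts)


genome = "ACGT" * 25 + "TTGACA"
pieces = split_into_chunks(genome, chunk_size=30, overlap=5)
assert reassemble(pieces) == genome
```

Storing the start offset with each chunk plays the role of the assembly instructions: without it, the overlaps would have to be rediscovered by sequence comparison.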

Interpretation of the human draft • Many gaps and unordered small pieces • A genomic sequence does not tell you where the genes are encoded. The genome is far from being « decoded » • One must combine genome and transcriptome to have a better idea

The transcriptome • The set of all functional RNAs (tRNA, rRNA, mRNA etc…) that can potentially be transcribed from the genome • The documentation of the localization (cell type) and conditions under which these RNAs are expressed • The documentation of the biological function(s) of each RNA species

Public draft transcriptome • Information about the expression specificity and the function of mRNAs • « full » cDNA sequences of known function • « full » cDNA sequences, but « anonymous » (e.g. KIAA or DKFZ collections) • EST sequences • cDNA libraries derived from many different tissues • Rapid random sequencing of the ends of all clones • ORESTES sequences • Limited set of expression data

How to organise EST collections? • Clustering: associate individual EST sequences with unique transcripts or genes • Assembling: derive consensus sequences from overlapping ESTs belonging to the same cluster • Mapping: associate ESTs (or EST contigs) with exons in genomic sequences • Interpreting: find and correct coding regions
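The clustering step can be illustrated with a deliberately naive sketch: two ESTs are put in the same cluster as soon as they share one exact k-mer, and clusters are merged with a union-find structure. This is an assumption made for illustration, not the method of any particular EST clustering project; it ignores sequencing errors, reverse-complement reads and repeats, and the value of k and the toy sequences are hypothetical.

```python
from collections import defaultdict


def cluster_by_shared_kmers(ests, k=8):
    """Group EST names into clusters linked by at least one shared exact k-mer."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> name of the first EST that contained it
    for name, seq in ests.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], name)
            else:
                first_seen[kmer] = name

    clusters = defaultdict(set)
    for name in ests:
        clusters[find(name)].add(name)
    return sorted(sorted(members) for members in clusters.values())


toy_ests = {
    "est1": "ATGGCGTACGTTAGC",
    "est2": "TACGTTAGCCGATAA",    # overlaps the 3' end of est1
    "est3": "GGGCCCAAATTTGGGCC",  # unrelated
}
print(cluster_by_shared_kmers(toy_ests))
```

In this toy input, est1 and est2 end up together because they share the 8-mer TACGTTAG, while est3 remains alone. Assembling a consensus and mapping it to genomic exons are separate steps not covered here.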

Example mapping of ESTs and mRNAs (figure with three tracks: mRNAs, ESTs, computer prediction)

How to cope with the amount of data? • Enormous increase of sequences • Always moving data (phases…) • Automatic annotation projects • RefSeq (NCBI) • ENSEMBL (EBI) • HAMAP (SIB)

RefSeq: NCBI Reference sequences
• mRNAs and Proteins
  • NM_123456: Reference mRNA
  • NP_123456: Reference Protein
  • XM_123456: Predicted Transcript
  • XP_123456: Predicted Protein
  • XR_123456: Predicted non-coding Transcript
• Gene Records
  • NG_123456: Reference Genomic Sequence
• Assemblies
  • NT_123456: Reference Contig (Mouse and Human Genomes)
  • NC_123455: Reference Chromosome, Microbial Genomes, Plasmid
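Because the record type is encoded in the two-letter prefix, an accession can be classified with a simple lookup. The sketch below covers only the prefixes listed on this slide; the function name is hypothetical, and the optional ".N" version suffix accepted by the pattern is an addition for convenience.

```python
import re

# Prefix table restricted to the accession types listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Reference mRNA",
    "NP": "Reference Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding Transcript",
    "NG": "Reference Genomic Sequence",
    "NT": "Reference Contig",
    "NC": "Reference Chromosome, Microbial Genome or Plasmid",
}

# Two capital letters, underscore, digits, optional version suffix such as ".2".
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None if unknown."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ("NM_123456", "XP_123456", "NT_123456", "AB_000001"):
    print(acc, "->", classify_refseq(acc))
```

Accessions with a prefix outside the table return None rather than a guess.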

Status codes • RefSeq records are provided with a status code which indicates the level of review a RefSeq record has undergone. • REVIEWED • The RefSeq record has been reviewed by NCBI staff. The review process includes reviewing available sequence data and frequently also includes a review of the literature. • PROVISIONAL • The RefSeq record has not yet been subject to individual review. • PREDICTED • Some aspect of the RefSeq record is predicted and there is supporting evidence that the locus is valid. • GENOME ANNOTATION • This identifies the contig (NT_ accessions), mRNA (XM_), non-coding transcript (XR_), and protein (XP_) RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing.

Map view of RefSeq (figure showing NT_, XM_ and NM_ records)

ENSEMBL • Goals of Ensembl • Accurate, automatic analysis of genome data • Analysis maintained on the current data • Presentation of the analysis to biologists via the Web • Distribution of the analysis to other bioinformatics laboratories. • The Ensembl project will be a foundation for a next generation sequence database that provides a curated, distributed, non redundant view of the genomes of model organisms. • Commitments of the Ensembl project • Public release of data • All the data and analysis will be put into the public domain immediately. • Open, collaborative software development • The software which forms the automated pipeline will be available to everyone under an open license, modelled after the Apache license. • Collaboration on agreed standards for distribution • We hope to provide the data in as many useful forms as is practical, including the EMBL flat file formats and new data distribution channels such as XML and CORBA.

ENSEMBL

ENSEMBL views

High quality Automated Microbial Annotation of Proteomes • Aim: automatically annotate with the highest level of quality a significant percentage of proteins originating from microbial genome sequencing projects. • The programs being developed are specifically designed to track down "eccentric" proteins. Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated. • This project should allow annotators in the SWISS-PROT groups at SIB and EBI to concentrate on the proteins that really need careful manual annotation.
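One of the peculiarities listed above, size discrepancy, lends itself to a small illustration. The sketch below is not the HAMAP implementation: the function name, the use of the family median as reference and the 20% tolerance are all assumptions chosen for the example, and the lengths are invented toy values.

```python
from statistics import median


def flag_size_discrepancies(family_lengths, candidates, tolerance=0.2):
    """Return names of candidates whose length deviates too much from the family.

    family_lengths: lengths (in residues) of already annotated family members.
    candidates: mapping of protein name -> length in residues.
    tolerance: allowed relative deviation from the family median (assumed value).
    """
    reference = median(family_lengths)
    flagged = []
    for name, length in candidates.items():
        if abs(length - reference) > tolerance * reference:
            flagged.append(name)  # leave this one to a human annotator
    return flagged


# Toy values, invented for the example.
family = [300, 310, 295, 305]
new_proteins = {"protA": 301, "protB": 180, "protC": 420}
print(flag_size_discrepancies(family, new_proteins))
```

Proteins that pass such a check would still have to pass the other tests (conserved sites, paralogs, biological context) before being annotated automatically.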

HAMAP origin • About 60 microbial genomes are available today • >1000 in a few years; >1 million microbial proteins! • Functional analysis and detailed biochemical characterization will only be available: • For « all » proteins in a handful of model organisms (e.g. E.coli, B.subtilis) • For proteins involved in pathogenicity (medical and pharmaceutical interests) • For proteins involved in specific biosynthetic or catabolic pathways (biotechnological and food industry interests)

HAMAP overview

HAMAP flow chart

HAMAP study case • The case of the Escherichia coli proteome • According to the original analysis in 1997: 4286 protein coding genes • 60 were missed (almost all <100 residues) • 120 are most probably « bogus » • 50 pairs or triplets of ORFs had to be fused • 719 have proven or probable wrong start sites • ~1800 are still not biochemically characterized; only one new « functionalisation » per week…

Unix reminder • General: man, pwd, cd, ls, mkdir, rmdir, passwd, exit • File manipulation: cat, more, cp, mv, rm, grep, find, diff, head, tail, chmod • Editing: vi, pico, emacs • Compression: tar, (un)compress, gzip • Various: redirection (<, >, >>) and piping (|)
