Bioinformatics
Genome analysis

[email protected]

Université Libre de Bruxelles, Belgique

Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

http://www.bigre.ulb.ac.be/


Contents

  • Genome annotation

  • Comparative genomics

    • Phylogenetic profiles

    • Gene fusion analysis

    • Phylogenetic footprinting



From sequences to genomes

  • Before the 1990s, DNA sequencing required a considerable investment of manual work. A PhD student could spend a significant fraction of a thesis sequencing a single gene.

  • Genome projects stimulated the development of automated sequencing methods and led to important technological improvements.

  • As of 2008, several hundred fully sequenced genomes are publicly available.

    • The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains

      • >650 prokaryotes (Bacteria and Archaea)

      • Insects (Drosophila melanogaster, Apis mellifera)

      • Plants (Arabidopsis thaliana, rice, maize)

      • A worm (Caenorhabditis elegans)

      • Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, … )

      • Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)

    • Other genome centres give access to other genomes.

      • ENSEMBL (http://www.ensembl.org/) maintains many vertebrate genomes

      • UCSC (http://genome.ucsc.edu/) maintains metazoan genomes, including insects

      • Sanger Institute (http://www.sanger.ac.uk/genbiol/)

      • Integr8: ~800 genomes in 2008

  • Many other genomes were sequenced by commercial companies, and are not available to the public.


Gene organization

Source: Mount (2000)


Gene function

>PHO4,SPBC428.03C : THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR: Q01682;Q9UU70;

Length = 463 Score = 161 bits (408), Expect = 1e-40

Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query: 9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68

+LAAS+V+AG S + + LG Y+ P G + PESC +KQ

Sbjct: 10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query: 69 VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121

V ++ RHG R PT VS A+ I KL N G S+ + F T

Sbjct: 63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181

++ E S + G + R +Y Y + + + + T+ R D+A+

Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233

+F G+ GD + + E +SAGAN+L+ ++SCP ++D+ D+ + +

Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292

+L IA RLNK + G NLT SD + + C YEI R SD C++FT E + F Y D

Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350

L+ Y GP + ++G N L++ + D+KV+L+FTHD+ I+ +G

Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404

D +T EH +P +N F S +VP + TE F CS N YVR+++N V P

Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453

+ C GP + CE++ + + + + + ++ + N ++ST +T ++

Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463

  • After genes have been located on the sequence, their function has to be predicted.

  • Some genes had already been characterized before the genome project, but these are generally a minority of the genes found in the genome.

  • For the majority of genes, function is predicted from similarities between the newly sequenced gene and previously characterized genes (function assignment by sequence similarity).

  • Example: in the yeast genome, there are still 2,500 genes (39%) whose function is completely unknown, even though

    • yeast is among the best-known model organisms (genetics, molecular biology);

    • the full genome has been available since 1996.

  • When the first draft of the human genome was published, 60% of the predicted genes were of unknown function.


Some milestones


Genes and genome size

  • In prokaryotes, the number of genes increases linearly with genome size.

  • In eukaryotes, this is not the case: genome size increases faster than the number of genes.
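The contrast between the two trends can be made concrete by computing gene density (genes per megabase). The sketch below is only an illustration: the genome sizes and gene counts are rounded, approximate figures that are not taken from the slides, and the names `GENOMES` and `gene_density` are hypothetical.

```python
# Approximate genome sizes (Mb) and protein-coding gene counts.
# Rounded illustrative values, NOT taken from the slides.
GENOMES = {
    "Escherichia coli K-12": (4.6, 4300),
    "Saccharomyces cerevisiae": (12.1, 6000),
    "Homo sapiens": (3100.0, 20000),
}


def gene_density(size_mb, n_genes):
    """Return the number of genes per megabase of genome."""
    return n_genes / size_mb


for name, (size_mb, n_genes) in GENOMES.items():
    print(f"{name}: {gene_density(size_mb, n_genes):.0f} genes/Mb")
```

With these rounded figures, density drops by roughly two orders of magnitude between the bacterium and human, which is what "genome size increases faster than the number of genes" means in practice.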


Genes and genome size

  • Beware: both axes are logarithmic.

  • This plot shows the same data as the previous one, but on a logarithmic scale, so that mammals are visible as well.


Gene spacing

  • Gene spacing increases considerably with the complexity of the organisms.

  • Note: the X axis is logarithmic, but not the Y axis, so the increase appears roughly exponential.


Proportion of intergenic regions

  • Beware: the X axis is logarithmic.

  • The proportion of intergenic regions increases with the complexity of an organism.

  • In addition (not shown here), introns represent an increasing fraction of the genome.

  • For example, the exonic fraction represents <5% of the human genome.


Protein size versus genome size

  • Protein sequences are shorter in prokaryotes than in eukaryotes.

  • Among eukaryotes, the increase in genome size is not correlated with an increase in protein size.

    • Higher eukaryotes have a much larger genome than fungi, without any increase in protein size.


Genome annotation


Gene prediction

  • Starting from a completely sequenced genome, predict the positions of genes

  • Elements of prediction

    • Open Reading Frames

      • A start codon and a stop codon, separated by a continuous run of non-stop codons in the same reading frame.

    • Region content

      • Hexanucleotide composition

      • Codon adaptation index (CAI).

    • Signals

      • In prokaryotes: Shine-Dalgarno boxes.

      • In eukaryotes: intron/exon boundary elements (splicing signals).

    • Similarity with known genes.
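The open reading frame criterion listed above can be sketched in a few lines. This is a minimal illustration, not the method of any particular gene finder: it scans only the three forward frames (the reverse strand would require the reverse complement), uses ATG as the only start codon, and the function name `find_orfs` and the `min_codons` parameter are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon. min_codons counts the start codon but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A length threshold such as `min_codons` matters in practice: short ORFs occur by chance, which is one reason ORF-based prediction alone tends to over-predict genes.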


Gene prediction - limitations

  • Typical problems:

    • Gene prediction programs are trained for a specific organism and can give very poor results with other organisms (e.g., the first rounds of annotation of A. thaliana were done with programs trained on mammals).

    • Any gene prediction program will unavoidably predict false genes, and miss some true genes.

    • The prediction of intron/exon boundaries is particularly difficult.

    • For prokaryotes, the predicted start codons are sometimes imprecise.

  • Example: genome of the yeast Saccharomyces cerevisiae

    • For the yeast genome, the gene detection protocol used in 1996 was over-predictive.

    • The program essentially relied on ORF detection, and predicted 6,400 genes.

    • Some researchers estimated that ~1,000 ORFs might be false predictions.

    • Since 1996, the reality of the predicted genes has been tested by combining several functional genomics methods (expression studies, mutant phenotypes, comparative genomics between closely related species, …).

    • A few hundred of the initially predicted genes have been removed from the annotations.


Non-coding genes

  • There are many types of non-coding genes

    • tRNA: transfer RNA

    • rRNA: ribosomal RNA

    • snRNA: small nuclear RNA (elements of the spliceosome)

    • snoRNA: small nucleolar RNA (methylation guides)

    • ...

  • Detection of non-coding RNA

    • Non-coding RNA genes are generally transcribed by RNA polymerases I and III, and have promoters different from those of protein-coding genes.


Annotation of gene function

  • Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.

  • The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high similarity matches.

  • Problems

    • Sequence similarity does not always imply the same function.

    • Where should the similarity threshold be set?

    • Some proteins might have similar functions with different sequences (convergent evolution).

    • Once a gene has been assigned a putative function, this annotation is used in turn to assign the same function to other genes, so errors propagate.

  • Gene annotations must therefore be treated with caution.
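The threshold and error-propagation problems above can be illustrated with a small sketch of annotation transfer. Everything here is hypothetical: the `Hit` record, the `assign_function` helper, and the E-value cutoff of 1e-20 are assumptions chosen for illustration, not values recommended by the slides. The sketch keeps track of where each annotation comes from and, by default, only transfers functions from experimentally characterized proteins, which is one possible way to limit the spread of errors.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    evalue: float
    function: str
    experimental: bool  # True if the subject's function was shown experimentally


def assign_function(hits, max_evalue=1e-20, experimental_only=True):
    """Return (function, evidence) from the best acceptable hit, or None."""
    candidates = [
        h for h in hits
        if h.evalue <= max_evalue and (h.experimental or not experimental_only)
    ]
    if not candidates:
        return None  # leave the gene annotated as "hypothetical protein"
    best = min(candidates, key=lambda h: h.evalue)
    return best.function, f"inferred from similarity to {best.subject_id}"


# Hypothetical similarity search results for one predicted protein.
hits = [
    Hit("protA", 1e-40, "acid phosphatase", experimental=False),
    Hit("protB", 1e-25, "phosphatase", experimental=True),
    Hit("protC", 1e-3, "kinase", experimental=True),
]
print(assign_function(hits))
print(assign_function(hits, experimental_only=False))
```

The two calls give different answers for the same gene, which shows how much the result depends on the chosen threshold and on which existing annotations are trusted.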


Genes with unknown function

  • When the genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function.

  • These genes are annotated as "hypothetical proteins".

  • Note

    • In the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.


Comparative genomics


Phylogenetic footprinting

[Figure: Genome 1 compared with Genome 2; labelled elements: conserved exon, conserved non-coding region]

  • One of the main reasons for sequencing the mouse genome was to detect regions conserved between mouse and human, which reveal exons and regulatory regions.

    • Finding an unknown gene in several genomes gives more confidence that the gene really exists.

  • Another important goal was to detect conservation in non-coding regions.

    • A few known cases have shown that conserved non-coding regions contain a high density of regulatory elements.

    • The detection of conserved non-coding sequences thus points to regions potentially involved in regulation.

    • Such conserved regions are called phylogenetic footprints.
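One simple way to locate candidate footprints, assuming the two orthologous regions have already been aligned, is to slide a window along the pairwise alignment and keep the windows with high identity. The sketch below is only an illustration of that idea: the function name `conserved_windows`, the window size, and the identity cutoff are assumptions, and dedicated footprinting tools use more elaborate scoring.

```python
def conserved_windows(aln1, aln2, window=10, min_identity=0.8):
    """Return (start, end) alignment intervals made of conserved windows.

    aln1 and aln2 are the two rows of a pairwise alignment (same length,
    '-' for gaps). Overlapping or adjacent conserved windows are merged.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    regions = []
    for start in range(len(aln1) - window + 1):
        # Count identical, non-gap columns in the current window.
        matches = sum(
            a == b and a != "-"
            for a, b in zip(aln1[start:start + window],
                            aln2[start:start + window])
        )
        if matches / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment: the first 10 columns are identical, the rest differ.
row1 = "ACGTACGTAC" + "AAAAAAAAAA"
row2 = "ACGTACGTAC" + "CCCCCCCCCC"
print(conserved_windows(row1, row2, window=5, min_identity=1.0))
```

Regions reported this way still have to be compared with the gene annotation: conserved windows that fall outside exons are the candidate regulatory footprints.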


Phylogenetic profiles

  • For each gene of the query genome (e.g. E. coli), orthologs are searched in all the sequenced genomes.

  • Each gene is characterized by a profile of presence/absence in all the sequenced genomes.

  • Groups of genes having similar phylogenetic profiles are likely to be functionally related.

Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
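The profile comparison can be sketched as follows. The gene names and profiles are invented for illustration, and the use of the Hamming distance (number of genomes in which two profiles disagree) is one simple choice assumed here, not necessarily the measure used in the cited study.

```python
from itertools import combinations


def profile_distance(p1, p2):
    """Hamming distance between two presence/absence profiles."""
    if len(p1) != len(p2):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p1, p2))


def similar_profile_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes."""
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if profile_distance(profiles[g1], profiles[g2]) <= max_distance:
            pairs.append((g1, g2))
    return pairs


# Hypothetical profiles: 1 = ortholog found in that reference genome, 0 = absent.
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 1, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}
print(similar_profile_pairs(profiles))                  # identical profiles only
print(similar_profile_pairs(profiles, max_distance=1))  # allow one mismatch
```

Allowing a small number of mismatches makes the grouping tolerant to a missed ortholog, at the cost of more spurious pairs.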


Gene fusion analysis

[Figure: gene fusion examples]

  • Query genome (E. coli): 2 components, A and B. Reference genomes: 1 composite (A^B) each in B. subtilis and H. pylori.

  • Query genome (E. coli): 5 components, A, B, C, D and E. Reference genome (yeast): 1 composite, C^D^A^B^E.

  • It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.

  • Fusions between more than 2 genes are occasionally observed.

  • Fused genes are likely to be functionally related.

References

Marcotte et al. (1999). Science 285(5428), 751-3.

Marcotte et al. (1999). Nature 402(6757), 83-6.

Enright et al. (1999). Nature 402(6757), 86-90.
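The core test behind gene fusion analysis can be sketched as follows: two different query genes that match distinct, non-overlapping segments of the same reference protein suggest that the reference protein is a composite. The input format, the function name `find_fusion_candidates`, and the `max_overlap` parameter are assumptions made for this illustration; published analyses apply additional filters that are omitted here.

```python
from collections import defaultdict
from itertools import combinations


def find_fusion_candidates(hits, max_overlap=0):
    """Return (composite, query1, query2) triples suggesting a gene fusion.

    hits is an iterable of (query_gene, reference_protein, ref_start, ref_end),
    with 1-based inclusive coordinates on the reference protein.
    """
    by_reference = defaultdict(list)
    for query, ref, start, end in hits:
        by_reference[ref].append((query, start, end))

    candidates = []
    for ref, matches in sorted(by_reference.items()):
        for (q1, s1, e1), (q2, s2, e2) in combinations(sorted(matches), 2):
            if q1 == q2:
                continue  # two segments of the same query gene
            # Number of reference residues covered by both matches.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                candidates.append((ref, q1, q2))
    return candidates


# Hypothetical similarity hits of query genes against reference proteins.
hits = [
    ("A", "fusedAB", 1, 200),
    ("B", "fusedAB", 210, 450),
    ("C", "single", 1, 300),
    ("D", "single", 20, 290),
]
print(find_fusion_candidates(hits))
```

Genes C and D match the same region of the reference protein, so they look like members of one family rather than components of a fusion, and only the A/B pair is reported.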


Conclusion


The genome challenge

  • Despite the availability of several hundred genomes, we are far from understanding the organization and function of even a single genome.

  • In particular, a lot of work remains to be done to decipher genomes of higher organisms.

  • Genome sequence by itself is far from sufficient for this.

  • Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).


Some milestones

