
EnsEMBL



Presentation Transcript


  1. EnsEMBL: Opening up the whole Genome • Philip Lijnzaad, lijnzaad@ebi.ac.uk

  2. Overview • what • how (science, hardware, software) • results • families and descriptions • tour • people

  3. What is EnsEMBL • Automatic Annotation of complete Human Genome • genes • other: markers, SNPs, homologies, etc. • completely open • data, software, discussions • portable, downloadable • ‘the Linux of the Human Genome’

  4. From ... TCTTCTCCTTCAAGGCATCCAGGTTACCCCGGACAATAAGAGGGGAACAAGCTCTTTGTT TTGCCAAGCGGTGGAAGCTTCAGGAAAGGTGCCCGGCCCCTTAGGAGGAAAACCGGGGAA CAAGACCCGCAGTTTTTGCCTTCCCAACTTCCAGTGGGCCCAAAAAAACTTGGGGCGCCC AGGGTCCCCAAAAGAGAGAGCCACGCTGGGGCCGGGTTCCTGCTTTTAATATCCAGGAAA AGGGGGGGAGGGGTATTCCCCCTTCCTCATTAAGATAAAAGACTCCCCCTCGTACTTATG GGTCCTTTACGGTTGGGCATGGGGCGAAAAAAGGGAGCGCCCCGGTGGACTTAATCGTAT TTTAACACACCCCCCGGGATATTTAAAGTCGGGGTAGGGCTGTTTGAAAATATTCAATGT GGGGGGCTTTTTGACACGCCCGTTTATATTGTTCTGGGACGCGCGTGAGGGGGGTAGACA AGAGGTGTGTAAGCCGTGCTTTATTATCCTCGCGTAGACACGCGTTAGCATGTAGTGGTG TTACCTGGTCGCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCTT CTCTACTAAAAACCCAAAAATTTGCCAGACACGTGGAGAGCGAGACTTCATCTCAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTAAGAGTTGTTATTTCTGAGGTAGAAT AACTAATGATCTTATCTTCTCTTTTTTCTTTTCTTCAAGATGGGGTCTTGCTTTGTCACC CAGGCCAGAGGGCAGTGGCACAATCATAGTTCACTGCAGCTTCAAACACCTGAACTCAAG CAATCTTCCCCGCTCATACTGCTCCCCAGCACCAGGAGCTGGGACTACAGGCACACGTCA CCACATCCGGCTAATTTTTTTTTTCTTTTGGTGGGTAGAGACGGGGGCCTCACTATGTTG

  5. … to: MHSSGSSGKGAGPLRGKTSGTEPADFALPSSRGGPGKLRCYQTNLSSFS SPRKGVSQTGTPVCEEDGDAGLGIRQGGKAPVTPRGRGRRGRPPSRTTG TRETAVPGPLGIEDISPNLSPDDKSFSRVVPRVPDSTRRTDVGAGALRR SDSPEIPFQAAAGPSDGLDASSPGNSFVGLRVVAKWSSNGYFYSGKITR DVGAGKYKLLFDDGYECDVLGKDILLCDPIPLDTEVTALSEDEYFSAGV VKGHRKESGELYYSIEKEGQRKWYKRMAVILSLEQGNRLREQYGLGPYE AVTPLTKAADISLDNLVEGKRKRRSNVSSPATPTASSSSSTTPTRKITE SPRASMGVLSGKRKLITSEEERSPAKRGRKSATVKPGAVGAGEFVSPCE SGDNTGEPSALEEQRGPLPLNKTLFLGYAFLLTMATTSDKLASRSKLPD GPTGSSEEEEEFLEIPPFNKQYTESQLRAGAGYILEDFNEAQCNTAYQC
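
The last step of the "from DNA to protein" picture above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it translates an already spliced coding sequence with the standard genetic code, whereas the real pipeline first has to find genes and exons in the raw genomic sequence. The function name `translate` and the example sequence are made up for this sketch and do not come from the slides.

```python
# Standard genetic code, packed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon.

    Codons containing anything other than A, C, G or T (for example N)
    are reported as 'X'. A trailing incomplete codon is ignored.
    """
    cds = cds.upper()
    peptide = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":  # stop codon ends the peptide
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical coding sequence, invented for this example.
print(translate("ATGCATTCTTCTGGTTAA"))  # MHSSG
```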

  6. Take: • Draft human genome • clones and contigs from public databases • not finished • errors • gaps • Golden Path • assembly of all contigs into (nearly) complete chromosomes

  7. Then: • Get rid of repeats • Targeted searches • pmatch to ‘find back’ known proteins from SWISSPROT, SP-TrEMBL and RefSeq • GeneWise and EST2Genome to build the genes • fill in coding sequences and UTRs
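
The slides do not say how repeats are removed. A common convention is to mask them, that is, to overwrite each repeat with N so that later searches skip it. The sketch below assumes the repeat intervals are already known from some repeat-finding step. The function `mask_repeats` and its half-open interval convention are assumptions for illustration, not EnsEMBL code.

```python
from typing import List, Tuple


def mask_repeats(sequence: str, repeats: List[Tuple[int, int]]) -> str:
    """Return the sequence with every repeat interval replaced by N.

    Each interval is (start, end), 0-based and half-open: start is masked,
    end is not. Intervals may overlap; parts outside the sequence are ignored.
    """
    masked = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Hypothetical sequence and repeat interval.
print(mask_repeats("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```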

  8. And then: • Similarity searches • GenScan on raw contigs • its peptides are searched against protein, mRNA and EST databases • genes are built using GeneWise on promising regions • additionally, exons can be used • All predictions supported by evidence!

  9. Add cross references: • HUGO (HGNC) • SwissProt/Trembl, RefSeq • EMBL • OMIM • LocusLink • InterPro

  10. Add yet more • GeneTribe families • Gene descriptions • Markers • SNPs • external annotations (EMBL) • mouse traces • ...

  11. Hardware • 360 Alphas: DS10, dual EV6 processors, 1 GB memory • 200 other nodes • 10 days to do a complete blast + gene build • ~ 30 million jobs • ~ 30 GB

  12. Software • Digital Unix • Apache • relational database (MySQL) • mostly perl, some C and Java • BioPerl, BioJava, BioCORBA • LSF • AltaVista

  13. Software (2) • Wiki Web • CVS (~100 Mb) • Code review, data review • Testing conventions • Interfaces • VirtualContigs • CORBA/Java

  14. IDs • for genes, transcripts, exons, peptides, families • ENSXnnnnnnnnnnn (e.g. ENSG00000067369) • X denotes which type: • G = gene • T = transcript • E = exon • P = peptide (translation) • F = family
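
The identifier scheme above is regular enough to check mechanically. The following Python sketch parses an identifier of the form ENS, one type letter, eleven digits, as in the example ENSG00000067369. The names `parse_id` and `EnsemblId` are invented for this illustration and are not part of the EnsEMBL software.

```python
import re
from typing import NamedTuple

# Type letters as listed on the slide.
ID_TYPES = {
    "G": "gene",
    "T": "transcript",
    "E": "exon",
    "P": "peptide",
    "F": "family",
}

# ENS, one type letter, then exactly eleven digits.
ID_PATTERN = re.compile(r"ENS([GTEPF])(\d{11})")


class EnsemblId(NamedTuple):
    kind: str    # e.g. "gene"
    number: int  # numeric part without leading zeros


def parse_id(identifier: str) -> EnsemblId:
    """Split an identifier into its type and number, or raise ValueError."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return EnsemblId(ID_TYPES[match.group(1)], int(match.group(2)))


print(parse_id("ENSG00000067369"))  # EnsemblId(kind='gene', number=67369)
```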

  15. IDs (2) • IDs should be stable • difficult, because underlying data keeps being refined! • ID mapping • version numbers

  16. Results • Latest release: 1.1 (17 July) • Web code version: 1.1.1 (1 Aug.) • April 2001 dataset • 4,318,661,441 basepairs • 143,479 exons • 23,931 transcripts • 21,921 genes (‘confirmed’)

  17. Errors • Missing data • Misassembly • Misidentification (pseudo-gene, paralog) • Sequencing errors • in Human Genome Data • in supporting databases • Bugs • GenScan tuning • GeneWise tuning

  18. Gene Families • Cluster EnsEMBL peptides together with SwissProt and SPTrembl • vertebrate • GeneTRIBE - Automatic Protein Family detection using Markov Clustering. Enright, van Dongen & Ouzounis (in preparation)

  19. Family descriptions • distill consensus descriptions • using SwissProt DE-lines • may not work => unknown • Transfer peptide’s family assignment to gene • resolve conflicts: choose family that has best description • unknown < hypothetical < fragment < cDNA
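
The conflict rule above (choose the family with the best description, using the ordering unknown < hypothetical < fragment < cDNA) might look roughly like this in Python. The keyword matching is an assumption: the slides give only the ordering, not how a description is classified. The family IDs and descriptions in the example are invented.

```python
from typing import Dict

# Worst first, as on the slide. Any description containing none of these
# keywords is assumed to rank above all of them.
LOW_QUALITY_KEYWORDS = ["unknown", "hypothetical", "fragment", "cdna"]


def description_rank(description: str) -> int:
    """Higher is better. A description is ranked by its worst keyword."""
    text = description.lower()
    for rank, keyword in enumerate(LOW_QUALITY_KEYWORDS):
        if keyword in text:
            return rank
    return len(LOW_QUALITY_KEYWORDS)


def best_family(families: Dict[str, str]) -> str:
    """Given family ID -> consensus description, return the best-described ID.

    On a tie, the family listed first wins.
    """
    return max(families, key=lambda fid: description_rank(families[fid]))


# Hypothetical candidate families for the peptides of one gene.
candidates = {
    "ENSF00000000001": "UNKNOWN",
    "ENSF00000000002": "HYPOTHETICAL PROTEIN",
    "ENSF00000000003": "ZINC FINGER PROTEIN",
}
print(best_family(candidates))  # ENSF00000000003
```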

  20. Family statistics: • 13,811 families • 7284 ‘unknown’ description • 128,828 members • 21,894 ENS genes • 23,867 ENS peptides

  21. Family statistics (2) • 6759 families with 1 member • 3457 with 2-10 members • 215 with 10-100 members • 4 with > 100 members • max is 483 (zinc finger)

  22. Gene descriptions • Use SwissProt DE-line if known • use Family if not • Statistics: • 18053 descriptions • 13202 from SwissProt • 4851 from family description • 3868 still UNKNOWNs
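
The fallback rule and the statistics above fit together, as this short Python sketch shows. The function `describe_gene` is a hypothetical name, and treating an "UNKNOWN" family description as no description is an assumption. The numbers are the ones quoted on the Results and Gene descriptions slides.

```python
from typing import Optional


def describe_gene(swissprot_de: Optional[str],
                  family_description: Optional[str]) -> str:
    """Prefer the SwissProt DE-line, then the family description, else UNKNOWN."""
    if swissprot_de:
        return swissprot_de
    if family_description and family_description.upper() != "UNKNOWN":
        return family_description
    return "UNKNOWN"


# Consistency check on the figures from the slides.
from_swissprot = 13202
from_family = 4851
total_genes = 21921  # 'confirmed' genes, from the Results slide

described = from_swissprot + from_family
print(described)                # 18053
print(total_genes - described)  # 3868
```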

  23. Entry points • http://www.ensembl.org • ID search • text search • OMIM disease • Browse chromosomes • BLAST

  24. TextSearch

  25. DiseaseView

  26. BLAST/SSAHA

  27. MapView

  28. MarkerView

  29. ContigView

  30. ContigView (2)

  31. ContigView (3)

  32. DAS annotations

  33. Apollo

  34. ExportView

  35. GeneView

  36. GeneView (2)

  37. ProteinView

  38. ProteinView (2)

  39. ExpressionView

  40. DomainView

  41. FamilyView

  42. Recent developments • HelpDesk • DAS • Adding annotations from anywhere • Apollo • Genome viewer • Expression data • SAGE

  43. Future • Better genes! • Alignments • Other genomes • Comparative Genomics • CORBA/Java • More protein-structural links • Scop profiles • IGI/IPI • Entity infra-structure

  44. Links • http://www.ensembl.org • dev.ensembl.org • http://www.ensembl.org/genome/central • http://genome.ucsc.edu • http://compbio.ornl.gov/channel • http://ncbi.nlm.nih.gov/genome/guide/human • http://www.biodas.org • http://www.bio{perl,xml,corba,python,java}.org

  45. Acknowledgements • Ewan Birney, Michele Clamp, Tim Hubbard, Tony Cox, Elia Stupka, Arek Kasprzyk, Arne Stabenau, James Stalker, James Cuff, James Smith, Simon Potter, Manu Mongin, Val Curwen, Guy Slater, Richard Durbin, Craig Melsopp, Alistair Rust, Chris Mungall, Jim Kent and many, many more

  46. Join! • http://www.ensembl.org • mailing lists • ensembl-dev@ebi.ac.uk • ensembl-announce@ebi.ac.uk • (see http://www.ensembl.org/Dev/Lists)
