The UniProt knowledgebase (www.uniprot.org): a hub of integrated protein data

  1. The UniProt knowledgebase (www.uniprot.org): a hub of integrated protein data. Swiss-Prot group, Geneva. SIB Swiss Institute of Bioinformatics

  2. Science cover, February 2011

  3. Data, knowledge: protein sequence, functional information

  4. UniProt consortium • EBI: European Bioinformatics Institute (UK) • SIB: Swiss Institute of Bioinformatics (CH) • PIR: Protein Information Resource (US)


  6. UniProt databases

  7. UniProtKB: protein sequence knowledgebase, 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~15 million entries) • UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 million entries) • UniRef: 3 sets of clusters of protein sequences with 100, 90 and 50 % identity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries) • UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)

  8. UniProt databases: the central piece

  9. UniProtKB: an encyclopedia on proteins, composed of 2 sections: UniProtKB/TrEMBL (unreviewed, automatically annotated) and UniProtKB/Swiss-Prot (reviewed, manually annotated); released every 4 weeks

  10. UniProtKB • Origin of protein sequences • UniProtKB protein sequences are mainly derived from: • INSDC (translated submitted coding sequences, CDS) • Ensembl (gene prediction) and RefSeq sequences • Sequences of PDB structures • Direct submission or sequences scanned from literature • Notes: - UniProt does not do any gene prediction - Most non-germline immunoglobulins and T-cell receptors, most patent sequences, highly over-represented data (e.g. viral antigens) and pseudogene sequences are excluded from UniProtKB, but stored in UniParc - Data from the PIR database have been integrated in UniProtKB since 2003. 85 % 15 %
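Since most UniProtKB sequences come from translated CDS, a minimal sketch of that translation step may help. It assumes the standard genetic code and a CDS that starts in frame; the function name `translate_cds` and the example CDS are hypothetical and are not taken from any UniProt or INSDC pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing unexpected characters are translated as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon: end of the protein
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical CDS made up for illustration.
example_cds = "".join(
    ["ATG", "TCT", "AAA", "GAA", "AAA", "TTT", "GAA", "CGT", "TAA"]
)
print(translate_cds(example_cds))
```

Real submissions also carry a translation table and may have partial or frameshifted CDS, which this sketch ignores.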

  11. EMBL to TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references; automated annotation • TrEMBL to Swiss-Prot: manual annotation of the sequence and associated biological information

  12. UniProtKB/TrEMBL: unreviewed, automatic annotation, released every 4 weeks

  13. UniProtKB/TrEMBL entry: one protein sequence, one species • Protein and gene names • Taxonomic information • References • Automated annotation: function, subcellular location, catalytic activity, sequence similarities… • Automated annotation: transmembrane domains, signal peptide… • Automated annotation: keywords and Gene Ontology • Cross-references to over 125 databases

  14. UniProtKB/TrEMBL automatic annotation • Protein sequence - The quality of the protein sequences depends on the information provided by the submitter of the original nucleotide entry (CDS) or by the gene prediction pipeline (e.g. Ensembl). - 100% identical sequences (same length, same organism) are merged automatically. • Biological information • Sources of annotation • Provided by the submitter (EMBL, PDB, TAIR…) • From automated annotation (automatically generated annotation rules (e.g. SAAS) and/or manually generated annotation rules (e.g. UniRule))
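The merging of 100% identical sequences from the same organism can be sketched as a grouping by (organism, sequence). This is only an illustration under simple assumptions: the record layout, the function name `merge_identical` and the case-insensitive comparison are hypothetical, not the actual TrEMBL procedure.

```python
from collections import defaultdict


def merge_identical(records):
    """Group source identifiers that share the same organism and sequence.

    `records` is an iterable of (source_id, organism, sequence) tuples.
    Returns a dict mapping (organism, sequence) to the list of source ids.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        # Identical sequences have the same length by definition,
        # so the sequence string itself is a sufficient key.
        merged[(organism, sequence.upper())].append(source_id)
    return dict(merged)


# Hypothetical records for illustration.
records = [
    ("CDS_1", "Homo sapiens", "MSKEKF"),
    ("CDS_2", "Homo sapiens", "MSKEKF"),
    ("CDS_3", "Mus musculus", "MSKEKF"),  # same sequence, other species
]
entries = merge_identical(records)
print(len(entries))
```

Here the two human CDS end up in one entry, while the mouse sequence stays separate because the organism differs.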

  15. UniProtKB/TrEMBL: example of fully automatic annotation, SAAS • Rules are derived from the UniProtKB/Swiss-Prot manual annotation. • Fully automated rule generation based on the C4.5 decision tree algorithm. • One annotation, one rule. • High stringency: rules require 99% or greater estimated precision to generate annotation (tested on UniProtKB/Swiss-Prot). • Rules are produced, updated and validated at each release.
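The 99% precision requirement can be illustrated with a small filter over candidate rules. This is a sketch only: it does not implement C4.5 or the real SAAS pipeline, and the `RuleStats` class, the rule names and the counts are hypothetical. Precision is taken here as true positives divided by all entries the rule would annotate in the reviewed test set.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Counts obtained by applying one candidate rule to reviewed entries."""
    name: str
    true_positives: int   # rule fires and the annotation is present
    false_positives: int  # rule fires but the annotation is absent


def precision(stats: RuleStats) -> float:
    fired = stats.true_positives + stats.false_positives
    return stats.true_positives / fired if fired else 0.0


def keep_high_precision(rules, threshold=0.99):
    """Keep only the rules whose estimated precision reaches the threshold."""
    return [rule for rule in rules if precision(rule) >= threshold]


# Hypothetical candidate rules, one annotation each.
candidates = [
    RuleStats("rule_cytoplasm", true_positives=995, false_positives=5),
    RuleStats("rule_kinase", true_positives=980, false_positives=20),
    RuleStats("rule_never_fires", true_positives=0, false_positives=0),
]
kept = keep_high_precision(candidates)
print([rule.name for rule in kept])
```

Re-running such a filter at every release matches the slide's point that rules are produced, updated and validated each time.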

  16. UniProtKB/Swiss-Prot: reviewed, manually annotated, released every 4 weeks

  17. UniProtKB/Swiss-Prot entry: one protein sequence, one gene, one species • Example sequence: MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG • Protein and gene names • Taxonomic information • References • Manual annotation: function, subcellular location, catalytic activity, disease, tissue specificity, pathway… • Manual annotation: post-translational modifications, variants, transmembrane domains, signal peptide… • Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation… • Manual annotation: keywords and Gene Ontology • Cross-references to over 125 databases

  18. UniProtKB/Swiss-Prot manual annotation: 1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…) 2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

  19. UniProtKB/Swiss-Prot: 1. Protein sequence curation

  20. The displayed protein sequence: …canonical, representative, consensus… + alternative sequences (described within the entry) • UniProtKB/Swiss-Prot: a gene-centric view of the protein space • 1 entry <-> 1 gene (1 species)

  21. What is the current status? • At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence. • Typical problems • unsolved conflicts • uncorrected initiation sites • frameshifts • wrong gene prediction • other ‘problems’

  22. UCSC genome browser: examples of CDS annotation submitted to INSDC…

  23. UniProtKB/Swiss-Prot: 2. Biological data curation

  24. Extract literature information and protein sequence analysis: maximum usage of controlled vocabulary • UniProtKB/Swiss-Prot gathers data from multiple sources: - publications (literature/PubMed) - prediction programs (Prosite, TMHMM, …) - contacts with experts - other databases - nomenclature committees • An evidence attribution system makes it easy to trace the source of each annotation

  25. Protein and gene names

  26. General annotation (Comments) …enable researchers to obtain a summary of what is known about a protein…

  27. Human protein manual annotation: some statistics (June 2011)

  28. Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…

  29. Non-experimental qualifiers: UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two
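The distinction between experimental and predicted data can be sketched as a small classifier over annotation text. The slide does not list the qualifiers, so this example assumes the three non-experimental qualifiers used in the flat-file format of that period ("By similarity", "Potential", "Probable") and assumes that an annotation without any of them counts as experimental. The function name `evidence_status` and the sample strings are hypothetical.

```python
import re

# Assumed non-experimental qualifiers (not listed on the slide itself).
NON_EXPERIMENTAL = ("By similarity", "Potential", "Probable")
_QUALIFIER_RE = re.compile(r"\((%s)\)" % "|".join(NON_EXPERIMENTAL))


def evidence_status(annotation: str) -> str:
    """Return the non-experimental qualifier found, else 'Experimental'."""
    match = _QUALIFIER_RE.search(annotation)
    return match.group(1) if match else "Experimental"


# Hypothetical annotation strings for illustration.
samples = [
    "SUBCELLULAR LOCATION: Cytoplasm.",
    "SUBCELLULAR LOCATION: Nucleus (By similarity).",
    "MOD_RES 42 Phosphoserine (Potential).",
]
for text in samples:
    print(evidence_status(text), "|", text)
```

A query for "experimentally proven" annotations then amounts to keeping only those classified as `Experimental` under these assumptions.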

  30. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

  31. ‘Protein existence’ tag • The ‘Protein existence’ tag indicates the evidence for the existence of a given protein • Different qualifiers: 1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …) 2. Evidence at transcript level (~19%) 3. Inferred from homology (~58%) 4. Predicted (~5%) 5. Uncertain (mainly in TrEMBL)
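A breakdown like the one above could be computed from flat-file entries by counting their `PE` lines. The sketch below assumes the text-format line layout `PE   1: Evidence at protein level;`; the function name `count_protein_existence` and the sample text are hypothetical, and the sample is far too small to reproduce the percentages on the slide.

```python
import re
from collections import Counter

# Assumed layout of the protein existence line in the flat-file format.
PE_LINE = re.compile(r"^PE\s+(\d): (.+?);\s*$")


def count_protein_existence(flat_file_text: str) -> Counter:
    """Count entries per (level, label) found on PE lines."""
    counts = Counter()
    for line in flat_file_text.splitlines():
        match = PE_LINE.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


# Hypothetical excerpt with the PE lines of three entries.
sample = """\
ID   ENTRY1_HUMAN
PE   1: Evidence at protein level;
//
ID   ENTRY2_HUMAN
PE   3: Inferred from homology;
//
ID   ENTRY3_HUMAN
PE   3: Inferred from homology;
//
"""
counts = count_protein_existence(sample)
total = sum(counts.values())
for (level, label), n in sorted(counts.items()):
    print(f"{level}. {label}: {100 * n / total:.0f}%")
```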

  32. UniProtKB: additional information can be found in the cross-references (to more than 140 databases)

  33. UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links! • Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs • Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN • Sequence: EMBL, IPI, PIR, RefSeq, UniGene • Proteomic: PeptideAtlas, PRIDE, ProMEX • Polymorphism: dbSNP • Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase • Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline • Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB • Ontologies: GO • 2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE • Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB • 3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR • Other: BindingDB, DrugBank, NextBio, PMAP-CutDB • PPI: DIP, IntAct, MINT, STRING • Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome • PTM: GlycoSuiteDB, PhosphoSite, PhosSite

  34. The UniProt web site • Powerful search engine: Google-like and easy to use, but also supports very directed field searches • Scoring mechanism presenting relevant matches first • Entry views, search result views and downloads are customizable • The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access • Search, Blast, Align, Retrieve, ID mapping

  35. Search: a very powerful text search tool with autocompletion and refinement options, allowing you to look for UniProt entries and documentation by biological information

  36. Find all human proteins located in the nucleus

  37. The search interface guides users with helpful suggestions and hints

  38. Advanced Search: a very powerful search tool, to be used when you know in which entry section the information is stored

  39. Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)

  40. Result pages: highly customizable

  41. Result pages: downloadable

  42. The URL can be bookmarked and manually modified.
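Because the URL reflects the query, a search can be composed and modified in code without touching the web interface. The sketch below only builds and edits URLs; it performs no network request. The base URL, the parameter names (`query`, `format`, `fields`, `size`) and the query field names (`organism_id`, `cc_scl_term`) are assumptions based on the current UniProt REST interface, not on the 2011 site shown in the slides, and should be checked against the UniProt documentation before use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Assumed endpoint; the site described in the slides used different URLs.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, fmt="tsv",
                     fields=("accession", "id", "protein_name"), size=25):
    """Encode a query and display options into a bookmarkable URL."""
    params = {
        "query": query,
        "format": fmt,
        "fields": ",".join(fields),
        "size": size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def change_params(url, **updates):
    """Return a copy of `url` with some query parameters replaced."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params.update(updates)
    return urlunsplit(parts._replace(query=urlencode(params)))


# "Find all human proteins located in the nucleus" (field names assumed).
url = build_search_url("organism_id:9606 AND cc_scl_term:nucleus")
fasta_url = change_params(url, format="fasta")
print(url)
print(fasta_url)
```

Changing one parameter, as `change_params` does, is the programmatic equivalent of editing the bookmarked URL by hand.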

  43. Blast: a tool with the standard options to search sequences in different UniProt databases and data sets

  44. Blast: customize the result display

  45. Blast: local alignment, sequence annotation highlighting option