Understanding proteins: resources for identification and annotation
The Gene Ontology: Annotating protein function, role and localization Contact: Jane Lomax Coordinator, GO Editorial Office EBI-EMBL jane@ebi.ac.uk
What is an ontology? • Collectibles & art • Stamps • UK (Great Britain) Victoria • 1884 GREAT BRITAIN 10S SCOTT (11,999.99$) A definition: “A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”
Why do we need ontologies? • Help with data retrieval: allow grouping of annotations. Example annotation counts: brain 20, hindbrain 15, rhombomere 10. Query ‘brain’ without ontology: 20. Query ‘brain’ with ontology: 45 (20 + 15 + 10). • Make data (re-)usable through standards • Common structure and terminology (controlled vocabulary) • Avoid redundancies (single data source) • Allow common tools, techniques, training, validation... Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html
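The brain example above can be sketched in a few lines of Python. This is a minimal illustration, not how any GO tool is implemented: the counts come from the slide, while the child-to-parent links (rhombomere under hindbrain, hindbrain under brain) and the function names are assumptions made for the sketch.

```python
from collections import defaultdict

# Direct annotation counts, taken from the slide.
direct_counts = {"brain": 20, "hindbrain": 15, "rhombomere": 10}

# Assumed child -> parent links for this toy ontology.
parent_of = {"hindbrain": "brain", "rhombomere": "hindbrain"}


def descendants(term, parent_of):
    """Return the term itself plus every term below it in the hierarchy."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children[current])
    return found


def query(term, use_ontology):
    """Count annotations for a term, optionally including its descendants."""
    terms = descendants(term, parent_of) if use_ontology else {term}
    return sum(direct_counts.get(t, 0) for t in terms)


print(query("brain", use_ontology=False))  # 20
print(query("brain", use_ontology=True))   # 45
```

Without the hierarchy, only annotations made directly to 'brain' are found; with it, annotations to more specific terms are grouped under the query term.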
Gene ontology • http://geneontology.org/ What is the gene ontology? Organized, controlled vocabulary of terms that describe gene product characteristics. • Represents gene product properties, not gene products themselves • Three branches (domains): • Cellular component • Molecular function • Biological process • Species-independent (with taxonomic restrictions) • Represents physiological processes • Goes up to the level of the cell
How does GO work? The Gene Ontology is like a dictionary term: transcription initiation id: GO:0006352 definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
GO tree and annotations: terms are linked by is_a and part_of relationships (Clark et al., 2005)
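As a rough sketch of how is_a and part_of links form a graph that can be walked upwards, the Python below uses a toy set of terms. The identifiers (T:1 to T:4), the links between them, and the names Term and ancestors are all hypothetical; they are not real GO identifiers or the real GO structure.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    name: str
    # Each entry is (relation_type, parent_id), e.g. ("is_a", "T:3").
    parents: list = field(default_factory=list)


# Toy graph with made-up identifiers (not real GO terms or links).
terms = {
    "T:1": Term("T:1", "cell"),
    "T:2": Term("T:2", "nucleus", [("part_of", "T:1")]),
    "T:3": Term("T:3", "chromosome", [("part_of", "T:1")]),
    "T:4": Term("T:4", "nuclear chromosome",
                [("is_a", "T:3"), ("part_of", "T:2")]),
}


def ancestors(term_id, terms, relations=("is_a", "part_of")):
    """Collect all ancestors reachable through the given relation types."""
    seen, stack = set(), [term_id]
    while stack:
        current = stack.pop()
        for relation, parent_id in terms[current].parents:
            if relation in relations and parent_id not in seen:
                seen.add(parent_id)
                stack.append(parent_id)
    return seen


print(sorted(ancestors("T:4", terms)))                       # all ancestors
print(sorted(ancestors("T:4", terms, relations=("is_a",))))  # is_a only
```

Restricting the traversal to one relation type shows why the relationship label matters: following only is_a gives a narrower set of ancestors than following both.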
An annotation example… • GO terms for Caspase 9
Which processes are up- or down-regulated? [Figure: GO terms over time in control vs attacked samples. Terms shown: Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle, hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism.] Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
QuickGO: browsing GO Term definition • http://www.ebi.ac.uk/QuickGO/
QuickGO: browsing GO Term relationships (ancestors)
QuickGO: browsing GO Term relationships (children)
QuickGO: browsing GO Proteins annotated to term
Annotation and ontology files www.geneontology.org/GO.downloads.shtml • Ontology files: • Hold ontology terms and structure • Species-independent • You can get GO-slims • Annotation files: • Hold list of terms and the proteins annotated with them • You can get species-specific files or the whole annotation.
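To make the "list of terms and the proteins annotated with them" concrete, here is a minimal sketch that indexes proteins by GO term from annotation text laid out like the tab-separated GAF 2.x format (comment lines start with "!", column 2 is the object ID, column 5 the GO ID). The sample rows, accessions (X00001 etc.), reference, and helper names are invented for illustration; only GO:0006352 is taken from the slides.

```python
from collections import defaultdict


def make_row(protein, symbol, go_id, evidence, aspect):
    """Build one made-up 17-column, tab-separated annotation row."""
    fields = ["UniProtKB", protein, symbol, "", go_id, "PMID:0000000",
              evidence, "", aspect, "", "", "protein", "taxon:9606",
              "20240101", "EXAMPLE", "", ""]
    return "\t".join(fields)


# Hypothetical sample data; the accessions are placeholders, not real entries.
SAMPLE = "\n".join([
    "!gaf-version: 2.2",
    make_row("X00001", "GENE1", "GO:0006352", "IDA", "P"),
    make_row("X00002", "GENE2", "GO:0006352", "IEA", "P"),
    make_row("X00002", "GENE2", "GO:0000000", "IEA", "F"),
])


def proteins_by_term(gaf_text):
    """Map each GO ID (column 5) to the set of object IDs (column 2)."""
    index = defaultdict(set)
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        columns = line.split("\t")
        index[columns[4]].add(columns[1])
    return index


index = proteins_by_term(SAMPLE)
print(sorted(index["GO:0006352"]))
```

A real annotation file would be read from disk (often gzip-compressed) rather than from a string, and a production parser would also validate the column count.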
More about GO: EBI train online www.ebi.ac.uk/training/online/course/go-quick-tour www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour
Acknowledgements & questions Jane Lomax Coordinator, GO Editorial Office EBI-EMBL jane@ebi.ac.uk
UniProt: A repository of annotated protein sequences Contact: Duncan Legge UniProt Content Team EBI-EMBL help@uniprot.org dlegge@ebi.ac.uk
Background of UniProt Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database
We Aim To Provide… • A high quality protein sequence database • A non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving essential. • Easy protein identification • Stable identifiers and consistent nomenclature / controlled vocabularies • Thorough protein annotation • Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources
The Two Sides of UniProtKB • UniProtKB/TrEMBL: 1 entry per nucleotide submission; redundant, automatically annotated (unreviewed) • UniProtKB/Swiss-Prot: 1 entry per protein; non-redundant, high-quality manual annotation (reviewed)
UniProtKB/TrEMBL Computationally annotated UniProtKB/Swiss-Prot Manually annotated
Data sources of UniProtKB UniProt/TrEMBL: Ensembl, ENA (EMBL) DNA database, PDB, Sub/Peptide Data, FlyBase, WormBase, VEGA (Sanger), Patent Data, mRNA Data
Curation of a UniProt/SwissProt entry Nomenclature UniProt/TrEMBL Sequence Literature Annotations Sequence variants Ontologies References UniProt/SwissProt Sequence features
UniProt Website www.uniprot.org
Annotation comments • FUNCTION • SUBCELLULAR LOCATION • ALTERNATIVE PRODUCTS • TISSUE SPECIFICITY • DEVELOPMENTAL STAGE • INDUCTION • SIMILARITY • CATALYTIC ACTIVITY • COFACTOR • ENZYME REGULATION • BIOPHYSICOCHEMICAL PROPERTIES • PATHWAY • SUBUNIT • INTERACTION • PTM • RNA EDITING • MASS SPECTROMETRY • DOMAIN • POLYMORPHISM • DISRUPTION PHENOTYPE • ALLERGEN • DISEASE • TOXIC DOSE • BIOTECHNOLOGY • PHARMACEUTICAL • MISCELLANEOUS • CAUTION • SEQUENCE CAUTION • WEB RESOURCE
• Evidence tags to show source • Controlled vocabularies used whenever possible
Proteomes in UniProt • Complete proteomes: complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced. • Reference proteomes: some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.
Help / Feedback • Stuck? Just ask – active help and support team • Feedback – if you find something incorrect, outdated, missing, etc., please tell us. • help@uniprot.org
Find out more: EBI online courses www.ebi.ac.uk/training/online/course/uniprot-quick-tour/
Acknowledgements & questions Duncan Legge UniProt Content Team EBI-EMBL dlegge@ebi.ac.uk
InterPro: An integrated protein sequence analysis resource Contact: Amaia Sangrador InterPro curation Team EBI-EMBL interhelp@ebi.ac.uk amaia@ebi.ac.uk
What is InterPro? • InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites • It combines predictive models (known as signatures) from different databases to provide functional analysis of protein sequences
The aim of InterPro InterPro
Protein annotation: a predictive approach • Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment • We can use these models to infer relationships with the characterised sequences from which the alignment was constructed • This is the approach taken by protein signature databases
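The idea of modelling conserved positions in an alignment can be illustrated with a very simple per-column frequency profile. This is a sketch only: the four aligned sequences are invented, the pseudo-frequency for unseen residues is an arbitrary choice, and real signature methods (profiles, HMMs) use more elaborate scoring with background frequencies and gap handling.

```python
import math
from collections import Counter

# Invented, gap-free toy alignment; every sequence has the same length.
alignment = ["ACDEG", "ACDQG", "SCDEG", "ACNEG"]


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile


def score(sequence, profile, pseudo=0.01):
    """Sum of log frequencies; unseen residues get a small pseudo-frequency."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(math.log(column.get(aa, pseudo))
               for aa, column in zip(sequence, profile))


profile = column_frequencies(alignment)
print(profile[1])                # the fully conserved second column
print(score("ACDEG", profile))   # resembles the alignment: higher score
print(score("WWWWW", profile))   # unrelated sequence: much lower score
```

A sequence that matches the conserved columns scores close to zero, while an unrelated one is heavily penalised, which is the basis for inferring a relationship to the characterised sequences.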
Three (4) different protein signature approaches • Single motif methods: patterns • Multiple motif methods: fingerprints • Full alignment methods: profiles & hidden Markov models (HMMs)
InterPro Consortium HAMAP Profiles Protein features (sites) Functional annotation of families/domains Structural domains Patterns Fingerprints Hidden Markov Models
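Of these approaches, single-motif patterns are the easiest to show in code. The sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The converter covers only the basic syntax (x, [..], {..}, repetition counts, and < / > anchors at the ends of the pattern), the function name is hypothetical, and the test sequence is invented; the pattern N-{P}-[ST]-{P} is the widely cited N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchor at sequence start
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchor at sequence end
            suffix, element = "$", element[:-1]
        # Repetition such as x(3) or x(2,4).
        match = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if match:
            element, low, high = match.groups()
            repeat = "{%s,%s}" % (low, high) if high else "{%s}" % low
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # excluded residues
        else:
            core = element               # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)  # N[^P][ST][^P]

sequence = "AANGSAANPSA"  # invented test sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(hits)   # [(2, 'NGSA')]
```

The second N in the test sequence is followed by P, so it is rejected by the excluded-residue position, which shows how a short pattern encodes both required and forbidden residues.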
InterPro signature integration process • Signatures are provided by member databases • They are scanned against the UniProt database to see which sequences they match • Curators manually inspect the matches before integrating the signatures into InterPro • Signatures representing the same entity are integrated together • Relationships between entries are traced, where possible • Curators add literature-referenced abstracts, cross-refs to other databases, and GO terms
Using InterPro Let’s find some information about T-cell surface antigen CD4 in InterPro. Search using the keyword: CD4
Family-centered view Type Name Identifier Contributing signatures Description References GO terms
Using InterPro Search using human CD4 protein sequence
Protein-centered view Identifier Type Name Domains Family
Domain-centered view Type Name Identifier Contributing signatures Description References
Using InterPro with unknown sequences: InterProScan Search with an unknown protein sequence. InterProScan is the software package that allows sequences to be scanned against InterPro's signatures.