
Sandra Orchard

UniProtKB


Importance of reference protein sequence databases

  • Completeness and minimal redundancy

    A non-redundant protein sequence database with maximal coverage, including splice isoforms, disease variants and PTMs.

    A low degree of redundancy facilitates peptide assignments

  • Stability and consistency

    Stable identifiers and consistent nomenclature

    Databases change constantly because of the substantial work done to improve their completeness and the quality of sequence annotation

  • High-quality protein annotation

    Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources


Summary of protein sequence databases

Updated from Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440.


UniProtKB

  • UniProt Knowledgebase: 2 sections

    • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)

    • UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

www.uniprot.org


Manual annotation of UniProtKB/Swiss-Prot

[Diagram: UniProtKB at the centre, surrounded by Sequence, Splice variants, Sequence features, Ontologies, Annotations, References and Nomenclature]


  • Sequence curation, stable identifiers, versioning and archiving

    • For example: erroneous gene model predictions, frameshifts, premature stop codons, read-throughs, erroneous initiator methionines, …


Splice variants


Identification of amino acid variants

… and of PTMs

… and also


Domain annotation

Binding sites



Protein nomenclature


Annotation: more than 30 defined fields

Controlled vocabularies used whenever possible…


… and also imported from external resources

Binary interactions taken from the IntAct database

Interactors of human p53


Controlled vocabulary usage is increasing, for example terms from the Gene Ontology

Annotation for human Rhodopsin


Sequence evidence

Type of evidence that supports the existence of a protein

1. Evidence at protein level

    There is experimental evidence of the existence of a protein (e.g. Edman sequencing, MS, X-ray/NMR structure, good-quality protein-protein interaction, detection by antibodies)

2. Evidence at transcript level

    The existence of a protein has not been proven but there is expression data (e.g. existence of cDNAs, RT-PCR or Northern blots) that indicates the existence of a transcript.

3. Inferred from homology

    The existence of a protein is likely because orthologs exist in closely related species

4. Predicted

5. Uncertain
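The five evidence levels above can be held in a small lookup. This is a minimal sketch, assuming the level is carried as a `PE=<n>` token in a FASTA-style header line. The header shown is an illustrative example, and the names `ProteinExistence` and `existence_from_header` are hypothetical.

```python
import re
from enum import IntEnum
from typing import Optional


class ProteinExistence(IntEnum):
    """The five levels of sequence evidence listed on the slide."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def existence_from_header(header: str) -> Optional[ProteinExistence]:
    """Return the evidence level found in a header, or None if absent."""
    match = re.search(r"\bPE=([1-5])\b", header)
    return ProteinExistence(int(match.group(1))) if match else None


# Illustrative header line; the field values are assumptions for the example.
header = ">sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens OX=9606 GN=RHO PE=1 SV=1"
level = existence_from_header(header)
print(level.name if level else "no evidence level found")
```

Because the levels are ordered integers, a filter such as `level <= ProteinExistence.TRANSCRIPT_LEVEL` keeps only entries with experimental support.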


Manual annotation of the human proteome (UniProtKB/Swiss-Prot)

A draft of the complete human proteome has been available in UniProtKB/Swiss-Prot since 2008

Manually annotated representation of 20,242 protein-coding genes with ~36,000 protein sequences. Together with an additional 38,484 UniProtKB/TrEMBL entries, these form the complete proteome set

Approximately 63,000 single amino acid polymorphisms (SAPs), mostly disease-linked

80,000 post-translational modifications (PTMs)

Close collaboration with NCBI, Ensembl, Sanger Institute and UCSC to provide the authoritative set to the user community


Searching UniProt – Simple Search

  • Text-based searching

  • Logical operators ‘&’ (and), ‘|’ (or)
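A minimal sketch of how search terms combined with logical operators could be turned into a query URL. Several things here are assumptions not taken from the slides: the base address `https://rest.uniprot.org/uniprotkb/search`, the `query` and `format` parameters, and the mapping of ‘&’ and ‘|’ to the keywords AND and OR. The helper names are hypothetical. The code only builds the URL and does not contact any server.

```python
from typing import List
from urllib.parse import urlencode

# Assumed base address of a programmatic search service (not from the slides).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_query(terms: List[str], operator: str = "&") -> str:
    """Join search terms with a logical operator ('&' for and, '|' for or)."""
    keyword = {"&": "AND", "|": "OR"}[operator]
    return f" {keyword} ".join(f"({term})" for term in terms)


def build_search_url(terms: List[str], operator: str = "&", fmt: str = "tsv") -> str:
    """Return a URL-encoded search address for the given terms."""
    params = {"query": build_query(terms, operator), "format": fmt}
    return BASE_URL + "?" + urlencode(params)


url = build_search_url(["p53", "human"])
print(url)
```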


Searching UniProt – Advanced Search



Searching UniProt – Search Results

Each linked to the UniProt entry


Searching UniProt – Blast Search



Searching UniProt – Blast Results

Alignment with query sequence


UniProtKB/TrEMBL

  • Multiple entries for the same protein (redundancy) can arise in UniProtKB/TrEMBL due to:

    • Erroneous gene model predictions

    • Sequence errors (frameshifts)

    • Polymorphisms

    • Alternative start sites

    • Isoforms

  • Apart from 100% identical sequences, all merged sequences are analysed by a curator so they can be annotated accordingly.
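The distinction drawn above, between 100% identical sequences and everything else that needs a curator, can be illustrated with a short sketch. It is not the UniProt pipeline. The function name `group_identical` and the entry identifiers and sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_identical(entries: Dict[str, str]) -> List[List[str]]:
    """Group entry identifiers whose sequences are exactly identical."""
    by_sequence = defaultdict(list)
    for entry_id, sequence in entries.items():
        by_sequence[sequence.upper()].append(entry_id)
    return [sorted(ids) for ids in by_sequence.values()]


# Hypothetical entries: A and B are identical, C differs by one residue.
entries = {
    "ENTRY_A": "MKTAYIAKQR",
    "ENTRY_B": "MKTAYIAKQR",
    "ENTRY_C": "MKTAYIAKQK",
}

groups = group_identical(entries)
mergeable = [g for g in groups if len(g) > 1]      # identical, can be merged directly
for_curator = [g for g in groups if len(g) == 1]   # near-duplicates need a curator's review
print(mergeable, for_curator)
```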


Why do we need predictive annotation tools?


Given a set of uncharacterised sequences, we usually want to know:

  • what are these proteins; to what family do they belong?

  • what is their function; how can we explain this in structural terms?


1. Pairwise alignment approaches (e.g. BLAST)

  • Good at recognising similarity between closely related sequences

  • Perform less well at detecting divergent homologues

2. The protein signature approach

  • Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins

  • We can then use these models to infer relationships with previously characterised sequences

  • This is the approach taken by protein signature databases


What are protein signatures?

[Diagram labels: Protein family/domain, Multiple sequence alignment, Build model, Search UniProt, Significant match, Mature model, Protein analysis. Example sequences shown: ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK]


Diagnostic approaches (sequence-based)

  • Single motif methods: regex patterns (PROSITE)

  • Multiple motif methods: identity matrices (PRINTS)

  • Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)


Patterns

[Diagram labels: Sequence alignment, Motif, Define pattern, Extract pattern sequences, Build regular expression, Pattern signature (PS00000)]

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C


Patterns

Advantages

  • Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies

  • Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines:

    C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C

Drawbacks

  • High false positive/false negative rate

Patterns are mostly directed against functional residues: active sites, PTMs, disulfide bridges, binding sites
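Pattern signatures of this kind map directly onto ordinary regular expressions. The sketch below translates the element types that appear on these slides (single residues, `[..]` allowed sets, `{..}` forbidden sets, `x` wildcards and `(n)` or `(n,m)` repeats) into Python `re` syntax. The function name `prosite_to_regex` is hypothetical, other pattern features such as anchors are not handled, and the test sequences are invented for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.replace(" ", "").split("-"):
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            piece = core                       # single residue or [allowed set]
        if low and high:
            piece += "{%s,%s}" % (low, high)
        elif low:
            piece += "{%s}" % low
        pieces.append(piece)
    return "".join(pieces)


regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
print(regex)

# Invented sequences: the second has the forbidden P at the third position.
print(bool(re.search(regex, "CCAGGCSAAALAAAC")))
print(bool(re.search(regex, "CCPGGCSAAALAAAC")))
```

The forbidden-position element `{P}` is what rejects the second sequence, which is the subfamily-discriminating behaviour listed under the advantages of patterns.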


Fingerprints

[Diagram labels: Sequence alignment, Motif 1, Motif 2, Motif 3, Define motifs, Extract motif sequences, Weight matrices, Fingerprint signature (PR00000) with motifs 1, 2 and 3 in the correct order and with the correct spacing]


The significance of motif context

  • Identify small conserved regions in proteins

  • Several motifs characterise a family

  • Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (order and interval)
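The idea that several motifs must appear in the right order and at the right interval can be sketched as follows. This is a simplification: each motif is written as a regular expression rather than the matrices a fingerprint database would use, and the motifs, gap limits and sequences are invented for illustration. The function name `match_fingerprint` is hypothetical.

```python
import re
from typing import List, Optional, Tuple


def match_fingerprint(
    seq: str, motifs: List[str], gaps: List[Tuple[int, int]]
) -> Optional[List[int]]:
    """Return motif start positions if all motifs occur in order with allowed gaps.

    gaps[i] is the (min, max) number of residues allowed between the end of
    motif i and the start of motif i + 1.
    """
    def search(index: int, earliest: int, latest: Optional[int], starts: List[int]):
        if index == len(motifs):
            return starts
        # The lookahead reports every candidate position, including overlapping ones.
        for m in re.finditer("(?=(%s))" % motifs[index], seq):
            start = m.start()
            if start < earliest or (latest is not None and start > latest):
                continue
            end = start + len(m.group(1))
            if index < len(gaps):
                low, high = gaps[index]
                found = search(index + 1, end + low, end + high, starts + [start])
            else:
                found = search(index + 1, 0, None, starts + [start])
            if found is not None:
                return found
        return None

    return search(0, 0, None, [])


motifs = ["CGC", "WWY", "HKH"]   # invented motifs
gaps = [(2, 6), (1, 4)]          # invented spacing limits

in_order = match_fingerprint("AAACGCAAAAWWYAAHKH", motifs, gaps)
shuffled = match_fingerprint("AAAWWYAAAACGCAAHKH", motifs, gaps)
print(in_order, shuffled)
```

The second sequence contains all three motifs but in the wrong order, so it is rejected even though each single motif would match on its own.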


Profiles & HMMs

[Diagram labels: Sequence alignment, Define coverage (whole protein or entire domain), Use entire alignment for domain or protein, Build model, Profile or HMM signature]

Models insertions and deletions
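A minimal sketch of the "use the entire alignment, then build a model" step, written as a simple position-specific scoring matrix. It deliberately omits the insertion and deletion states that real profiles and HMMs model, so it only scores ungapped segments of the same length. The two sequences are the ones shown on the protein signature slide and are assumed here to be already aligned. The uniform background frequency and the pseudocount are assumptions, and the function names are hypothetical.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build per-column log-odds scores from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log(freq / background)
        profile.append(scores)
    return profile


def score(profile: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)

member_score = score(profile, alignment[0])
unrelated_score = score(profile, "A" * len(alignment[0]))
print(member_score > unrelated_score)
```

Unlike a pattern, every column contributes to the score, so a sequence can still score well with a few mismatches.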


HMM databases

  • Sequence-based

    • PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship

    • PANTHER: families/subfamilies model the divergence of specific functions

    • TIGRFAM: microbial functional family classification

    • Pfam: families & domains based on conserved sequence

    • SMART: functional domain annotation

  • Structure-based

    • SUPERFAMILY: models correspond to SCOP domains

    • GENE3D: models correspond to CATH domains


Why we created InterPro

  • By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database

    • to simplify & rationalise protein analysis

    • to facilitate automatic functional annotation of uncharacterised proteins

    • to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases


InterPro Entry

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers

  • Hierarchical classification


InterPro hierarchies: Families

FAMILIES can have parent/child relationships with other families

  • Parent/Child relationships are based on:

    • Comparison of protein hits

      • child should be a subset of parent

      • siblings should not have matches in common

    • Existing hierarchies in member databases

    • Biological knowledge of curators
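The two rules for comparing protein hits (a child should be a subset of its parent, and siblings should not share matches) are plain set operations. A minimal sketch with hypothetical entry names and protein identifiers. The function name `check_hierarchy` is also hypothetical.

```python
from typing import Dict, List, Set


def check_hierarchy(parent_hits: Set[str], children: Dict[str, Set[str]]) -> List[str]:
    """Return a list of rule violations for a parent entry and its children."""
    problems = []
    # Rule 1: every child must match only proteins that the parent also matches.
    for name, hits in children.items():
        outside = hits - parent_hits
        if outside:
            problems.append(f"{name} matches proteins outside the parent: {sorted(outside)}")
    # Rule 2: siblings must not have matches in common.
    names = sorted(children)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = children[first] & children[second]
            if shared:
                problems.append(f"{first} and {second} share matches: {sorted(shared)}")
    return problems


# Hypothetical protein hit sets.
parent = {"P1", "P2", "P3", "P4"}
good_children = {"child_a": {"P1", "P2"}, "child_b": {"P3"}}
bad_children = {"child_a": {"P1", "P2"}, "child_b": {"P2", "P9"}}

print(check_hierarchy(parent, good_children))
print(check_hierarchy(parent, bad_children))
```

Violations found this way would still need the other two inputs from the list, existing member-database hierarchies and curator knowledge, before a relationship is accepted or rejected.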


InterPro hierarchies: Domains

DOMAINS can have parent/child relationships with other domains


Domains and Families may be linked through the Domain Organisation Hierarchy



InterPro Entry

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers

The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics


InterPro Entry

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers

UniProt

KEGG ... Reactome ... IntAct ...

UniProt taxonomy

PANDIT ... MEROPS ... Pfam clans ...

Pubmed


InterPro Entry

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers

PDB 3-D Structures

SCOP Structural domains

CATH Structural domain classification


Searching InterPro

Protein family membership

Domain organisation

Domains, repeats & sites

GO terms


InterProScan access

Interactive:

http://www.ebi.ac.uk/Tools/pfa/iprscan/

Webservice (SOAP and REST):

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap

Downloadable:

ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/


Automatic Annotation

  • Automated clean-up of annotation from original nucleotide sequence entry

  • Additional value added by using automatic annotation

    • Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot

    • Identifies all members of this family using patterns/motifs/HMMs in InterPro

    • Transfers common annotation to related family members in TrEMBL
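The three steps above (find the annotation shared by a reviewed family, find unreviewed members through a signature match, transfer the shared annotation) can be sketched as below. This is an illustration of the idea, not the production annotation system. All accessions, annotation strings and the signature identifier are invented, and the function names are hypothetical.

```python
from typing import Dict, Set


def common_annotation(reviewed: Dict[str, Set[str]]) -> Set[str]:
    """Annotation shared by every reviewed member of the family."""
    annotation_sets = list(reviewed.values())
    return set.intersection(*annotation_sets) if annotation_sets else set()


def transfer_annotation(
    reviewed: Dict[str, Set[str]],
    unreviewed_signatures: Dict[str, Set[str]],
    family_signature: str,
) -> Dict[str, Set[str]]:
    """Give the shared annotation to unreviewed entries matching the family signature."""
    shared = common_annotation(reviewed)
    return {
        accession: set(shared)
        for accession, signatures in unreviewed_signatures.items()
        if family_signature in signatures
    }


# Invented reviewed family members and their annotation.
reviewed = {
    "REV_1": {"kinase activity", "ATP binding", "nucleus"},
    "REV_2": {"kinase activity", "ATP binding", "cytoplasm"},
}
# Invented unreviewed entries and the signatures they match.
unreviewed_signatures = {
    "UNREV_1": {"SIG_FAMILY_X"},
    "UNREV_2": {"SIG_OTHER"},
}

transferred = transfer_annotation(reviewed, unreviewed_signatures, "SIG_FAMILY_X")
print(transferred)
```

Only annotation common to all reviewed members is transferred, so location terms that differ between members ("nucleus", "cytoplasm") are left out.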



← Taxonomy

← Publication

← Name (non-standard)

← Sequence


InterPro


Finding a complete proteome in UniProtKB


Complete Proteomes


MS Proteomics

  • Require each sequence (including isoforms) to be present in the dataset as a separate entity for search engines to access

  • For higher organisms with isoforms, an expanded set is made available on the FTP site

  • FASTA files by FTP

    • One file per species containing canonical + isoform sequences
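A minimal sketch of reading such a file so that canonical and isoform sequences stay separate entities. It assumes header lines of the form `>db|ACCESSION|NAME ...`, with isoforms marked by a `-<number>` suffix on the accession. The sequences in the example are short placeholders rather than real protein sequences, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed header layout: >sp|P04637|P53_HUMAN ... or >sp|P04637-2|P53_HUMAN ...
HEADER = re.compile(r"^>(sp|tr)\|([A-Z0-9]+)(?:-(\d+))?\|(\S+)")


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_by_accession(records: List[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group sequences by base accession, labelling each as canonical or isoform number."""
    groups = defaultdict(list)
    for header, sequence in records:
        match = HEADER.match(header)
        if match:
            accession, isoform = match.group(2), match.group(3)
            groups[accession].append((isoform or "canonical", sequence))
    return dict(groups)


# Placeholder sequences for illustration only.
fasta_text = """>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens
MAAAAKLLLL
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53 OS=Homo sapiens
MAAAAKLL
"""

groups = group_by_accession(read_fasta(fasta_text))
print(groups)
```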

