Sandra orchard

UniProtKB PowerPoint PPT Presentation



Sandra Orchard. UniProtKB. Importance of reference protein sequence databases. Completeness and minimal redundancy A non redundant protein sequence database, with maximal coverage including splice isoforms, disease variant and PTMs.


Presentation Transcript



Sandra Orchard

UniProtKB



Importance of reference protein sequence databases

  • Completeness and minimal redundancy

    A non-redundant protein sequence database with maximal coverage, including splice isoforms, disease variants and PTMs.

    A low degree of redundancy facilitates peptide assignments.

  • Stability and consistency

    Stable identifiers and consistent nomenclature.

    Databases change constantly because of the substantial work done to improve their completeness and the quality of their sequence annotation.

  • High quality protein annotation

    Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources.



Summary of protein sequence databases

Updated from Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440.



UniProtKB

  • UniProt Knowledgebase:

    • Two sections:

    • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)

    • UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

www.uniprot.org


Manual annotation of UniProtKB/Swiss-Prot

[Diagram: UniProtKB, with the labels Sequence, Splice variants, Sequence features, Ontologies, Annotations, References and Nomenclature]



  • Sequence curation, stable identifiers, versioning and archiving

    • For example: erroneous gene model predictions, frameshifts, premature stop codons, read-throughs, erroneous initiator methionines


Splice variants


Identification of amino acid variants

... and of PTMs

... and also


Domain annotation

Binding sites


Protein nomenclature


Annotation: more than 30 defined fields

Controlled vocabularies used whenever possible…


... and also imported from external resources

Binary interactions taken from the IntAct database

Interactors of human p53


Controlled vocabulary usage is increasing, for example with terms from the Gene Ontology

Annotation for human Rhodopsin


Sequence evidence

Type of evidence that supports the existence of a protein

1. Evidence at protein level

There is experimental evidence of the existence of a protein (e.g. Edman sequencing, MS, X-ray/NMR structure, good quality protein-protein interaction, detection by antibodies)

2. Evidence at transcript level

The existence of a protein has not been proven, but there is expression data (e.g. existence of cDNAs, RT-PCR or Northern blots) that indicates the existence of a transcript.

3. Inferred from homology

The existence of a protein is likely because orthologs exist in closely related species

4. Predicted

5. Uncertain
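As an illustration of how these five levels can be used downstream, the sketch below reads the protein existence level from a FASTA header. It assumes the header follows the UniProtKB FASTA convention of carrying a `PE=<level>` field, which the slides do not state; the header string is illustrative and the function name `protein_existence` is hypothetical.

```python
import re
from typing import Optional

# Labels for the five protein existence (PE) levels listed in the slide.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumption: the level is encoded as a "PE=<digit>" field in the header.
_PE_FIELD = re.compile(r"\bPE=([1-5])\b")


def protein_existence(fasta_header: str) -> Optional[str]:
    """Return the PE label found in a FASTA header, or None if absent."""
    match = _PE_FIELD.search(fasta_header)
    if match is None:
        return None
    return PE_LEVELS[int(match.group(1))]


# Illustrative header for human p53.
header = (
    ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
    "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
)
print(protein_existence(header))
```

A filter of this kind could, for example, restrict a search database to entries with evidence at protein or transcript level.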



Manual annotation of the human proteome (UniProtKB/Swiss-Prot)

A draft of the complete human proteome has been available in UniProtKB/Swiss-Prot since 2008

Manually annotated representation of 20,242 protein-coding genes with ~36,000 protein sequences; an additional 38,484 UniProtKB/TrEMBL entries complete the proteome set

Approximately 63,000 single amino acid polymorphisms (SAPs), mostly disease-linked

80,000 post-translational modifications (PTMs)

Close collaboration with NCBI, Ensembl, Sanger Institute and UCSC to provide the authoritative set to the user community


Searching UniProt – Simple Search

  • Text-based searching

  • Logical operators ‘&’ (and), ‘|’ (or)


Searching UniProt – Advanced Search


Searching UniProt – Search Results

Each linked to the UniProt entry


Searching UniProt – Blast Search


Searching UniProt – Blast Results

Alignment with query sequence


UniProtKB/TrEMBL

  • Multiple entries for the same protein (redundancy) can arise in UniProtKB/TrEMBL due to:

    • Erroneous gene model predictions

    • Sequence errors (frameshifts)

    • Polymorphisms

    • Alternative start sites

    • Isoforms

  • Apart from 100% identical sequences, all merged sequences are analysed by a curator so that they can be annotated accordingly.
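The distinction between 100% identical sequences and everything else can be sketched as a simple grouping step. This is a minimal illustration, not UniProt's actual merging procedure; the entry identifiers and sequences are made up, and `group_identical` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_identical(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group entry identifiers by exact (100% identical) sequence."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry_id, sequence in entries:
        # Normalise case and whitespace so only the residues are compared.
        key = "".join(sequence.split()).upper()
        groups[key].append(entry_id)
    return dict(groups)


# Made-up entries for illustration only.
entries = [
    ("entry_A", "MKTAYIAKQR"),
    ("entry_B", "mktayiakqr"),  # same residues, different case
    ("entry_C", "MKTAYIAKQK"),  # one residue differs
]
merged = group_identical(entries)
print(merged)
```

Here `entry_A` and `entry_B` fall into one group, while `entry_C` differs by a single residue and is the kind of case that would go to a curator.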



Why do we need predictive annotation tools?



Given a set of uncharacterised sequences, we usually want to know:

  • what are these proteins; to what family do they belong?

  • what is their function; how can we explain this in structural terms?


1. Pairwise alignment approaches (e.g. BLAST)

  • Good at recognising similarity between closely related sequences

  • Perform less well at detecting divergent homologues

2. The protein signature approach

  • Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins

  • We can then use these models to infer relationships with previously characterised sequences

  • This is the approach taken by protein signature databases



What are protein signatures?

[Workflow diagram labels: Protein family/domain; Multiple sequence alignment; Build model; Search UniProt; Significant match; Mature model; Protein analysis. Example sequences shown: ITWKGPVCGLDGKTYRNECALL and AVPRSPVCGSDDVTYANECELK]



Diagnostic approaches (sequence-based)

  • Single motif methods: regex patterns (PROSITE)

  • Multiple motif methods: identity matrices (PRINTS)

  • Full domain alignment methods: profiles (Profile Library) and HMMs (Pfam)



Patterns

Sequence alignment → Define pattern (motif) → Extract pattern sequences → Build regular expression → Pattern signature (PS00000)

Example regular expression:

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
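To make the pattern notation concrete, here is a small sketch that translates a PROSITE-style pattern such as the one on this slide into a Python regular expression. It covers only the elements used in these slides (`x`, single residues, `[...]` allowed sets, `{...}` forbidden sets and `(n)` or `(n,m)` repeats); terminal anchors and other syntax are not handled. The function name `prosite_to_regex` is hypothetical and this is not the PROSITE scanning tool itself.

```python
import re

_ELEMENT = re.compile(
    r"^(?P<core>x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"  # x, residue, [allowed], {forbidden}
    r"(?:\((?P<rep>\d+(?:,\d+)?)\))?$"           # optional (n) or (n,m) repeat
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"Unsupported pattern element: {element!r}")
        core = match.group("core")
        if core == "x":
            core = "."                      # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        rep = match.group("rep")
        parts.append(core + ("{" + rep + "}" if rep else ""))
    return "".join(parts)


regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
print(regex)
# Made-up test sequence containing one occurrence of the pattern.
print(bool(re.search(regex, "AACCAGGCSAAALAAACAA")))
```

Because `{P}` becomes a negated character class, a proline at the third position of the motif prevents a match, which is how forbidden positions are expressed in the regex form.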



Patterns

Advantages

  • Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies

  • Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, which are very short toxins with a few conserved cysteines: C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C

Drawbacks

  • High false positive and false negative rates

Patterns mostly target functional residues: active sites, PTMs, disulfide bridges, binding sites



Fingerprints

Sequence alignment → Define motifs (Motif 1, Motif 2, Motif 3) → Extract motif sequences → Weight matrices → Fingerprint signature (PR00000)

A fingerprint match requires motifs 1, 2 and 3 in the correct order and with the correct spacing.


The significance of motif context

  • Identify small conserved regions in proteins

  • Several motifs characterise a family

  • Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (their order and the interval between them)
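The idea that order and interval add diagnostic power can be sketched as follows. This is a deliberate simplification: each motif is represented by a regular expression rather than the weight matrices a fingerprint database would use, and the function name `match_fingerprint`, the motifs and the gap limits are all hypothetical.

```python
import re
from typing import List, Optional, Tuple


def match_fingerprint(
    sequence: str,
    motifs: List[str],
    max_gaps: List[int],
) -> Optional[List[Tuple[int, int]]]:
    """Find all motifs in the given order, with bounded gaps between neighbours.

    motifs   -- regular expressions, one per motif, in N- to C-terminal order
    max_gaps -- max_gaps[i] is the largest allowed number of residues
                between motif i and motif i + 1
    Returns the (start, end) span of each motif, or None if no arrangement fits.
    """
    if len(max_gaps) != len(motifs) - 1:
        raise ValueError("need exactly one gap limit per pair of neighbouring motifs")

    def search(index: int, earliest: int, latest: int) -> Optional[List[Tuple[int, int]]]:
        if index == len(motifs):
            return []
        compiled = re.compile(motifs[index])
        for start in range(earliest, min(latest, len(sequence)) + 1):
            hit = compiled.match(sequence, start)  # anchored at `start`
            if hit is None or hit.end() == hit.start():
                continue
            gap = max_gaps[index] if index < len(max_gaps) else 0
            rest = search(index + 1, hit.end(), hit.end() + gap)
            if rest is not None:
                return [(hit.start(), hit.end())] + rest
        return None

    return search(0, 0, len(sequence))


# Made-up sequence and motifs for illustration.
toy_sequence = "AAGDSLLKKCPYCWWHEAAHQQ"
toy_motifs = ["GDS", "CP.C", "HE..H"]
print(match_fingerprint(toy_sequence, toy_motifs, [5, 5]))
```

With a tighter first gap limit, or with the motifs listed in the wrong order, the same sequence no longer matches, even though every individual motif is still present.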



Profiles & HMMs

Sequence alignment → Define coverage (whole protein or entire domain) → Use entire alignment for domain or protein → Build model (models insertions and deletions) → Profile or HMM signature
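A rough feel for what "build model" means can be had from a position-specific scoring profile. The sketch below is a strong simplification: it assumes an ungapped alignment and a uniform background, and unlike the profiles and HMMs described in the slide it does not model insertions or deletions. The two sequences are the example sequences shown earlier in the deck, treated here as a toy two-row alignment; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build per-column log-odds scores from an ungapped alignment.

    Background frequencies are assumed uniform (1/20) for simplicity.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment of the same length as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))


toy_alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(toy_alignment)
print(score(profile, toy_alignment[0]), score(profile, "A" * 22))
```

Sequences that resemble the alignment column by column score higher than unrelated ones, which is the basis for calling a "significant match".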



HMM databases

  • Sequence-based

    • PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship

    • PANTHER: families/subfamilies model the divergence of specific functions

    • TIGRFAM: microbial functional family classification

    • Pfam: families & domains based on conserved sequence

    • SMART: functional domain annotation

  • Structure-based

    • SUPERFAMILY: models correspond to SCOP domains

    • Gene3D: models correspond to CATH domains



Why we created InterPro

  • By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database

    • to simplify & rationalise protein analysis

    • to facilitate automatic functional annotation of uncharacterised proteins

    • to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases


InterPro Entry

  • Groups similar signatures together

  • Adds extensive annotation

  • Links to other databases

  • Structural information and viewers

  • Hierarchical classification


InterPro hierarchies: Families

FAMILIES can have parent/child relationships with other Families

  • Parent/Child relationships are based on:

    • Comparison of protein hits

      • child should be a subset of parent

      • siblings should not have matches in common

    • Existing hierarchies in member databases

    • Biological knowledge of curators
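The two rules for comparing protein hits map directly onto set operations. The sketch below is only an illustration of those rules, not InterPro's curation tooling; the helper names and the accession-like strings are hypothetical.

```python
from typing import Set


def is_valid_child(parent_hits: Set[str], child_hits: Set[str]) -> bool:
    """A child entry should match a subset of the proteins its parent matches."""
    return child_hits <= parent_hits


def are_valid_siblings(first_hits: Set[str], second_hits: Set[str]) -> bool:
    """Sibling entries should not have any protein matches in common."""
    return first_hits.isdisjoint(second_hits)


# Made-up protein identifiers for illustration.
parent = {"prot1", "prot2", "prot3", "prot4"}
child_a = {"prot1", "prot2"}
child_b = {"prot3"}
overlapping = {"prot2", "prot3"}

print(is_valid_child(parent, child_a), are_valid_siblings(child_a, child_b))
print(are_valid_siblings(child_a, overlapping))
```

In practice such checks would only flag candidates; the slide lists existing hierarchies and curator knowledge as further inputs.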


InterPro hierarchies: Domains

DOMAINS can have parent/child relationships with other domains


Domains and Families may be linked through Domain Organisation

Hierarchy



The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics


Links to other databases

UniProt

KEGG ... Reactome ... IntAct ...

UniProt taxonomy

PANDIT ... MEROPS ... Pfam clans ...

PubMed


Structural information and viewers

PDB 3-D Structures

SCOP Structural domains

CATH Structural domain classification


Searching InterPro

Protein family membership

Domain organisation

Domains, repeats & sites

GO terms



InterProScan access

Interactive:

http://www.ebi.ac.uk/Tools/pfa/iprscan/

Webservice (SOAP and REST):

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap

Downloadable:

ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/



Automatic Annotation

  • Automated clean-up of annotation from original nucleotide sequence entry

  • Additional value added by using automatic annotation:

    • Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot

    • Identifies all members of this family using patterns, motifs and HMMs in InterPro

    • Transfers common annotation to related family members in TrEMBL
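The three steps above can be sketched with plain dictionaries and sets. This is a toy illustration under simple assumptions (annotation as a set of strings, signature matching already computed as a boolean per entry), not UniProt's automatic annotation pipeline; all names and data are hypothetical.

```python
from typing import Dict, Set


def common_annotation(reviewed: Dict[str, Set[str]]) -> Set[str]:
    """Annotations shared by every reviewed member of a family."""
    annotation_sets = list(reviewed.values())
    if not annotation_sets:
        return set()
    return set.intersection(*annotation_sets)


def transfer_annotation(
    reviewed: Dict[str, Set[str]],
    unreviewed_matches: Dict[str, bool],
) -> Dict[str, Set[str]]:
    """Propose the family's common annotation for unreviewed entries
    that match the family signature."""
    shared = common_annotation(reviewed)
    return {
        accession: set(shared)
        for accession, matches_signature in unreviewed_matches.items()
        if matches_signature
    }


# Made-up reviewed family members and their annotations.
reviewed_family = {
    "reviewed_1": {"kinase activity", "ATP binding", "nucleus"},
    "reviewed_2": {"kinase activity", "ATP binding", "cytoplasm"},
}
# Made-up unreviewed entries: does each match the family signature?
candidates = {"unreviewed_1": True, "unreviewed_2": False}

proposed = transfer_annotation(reviewed_family, candidates)
print(proposed)
```

Only annotation shared by all reviewed members is propagated, and only to entries that match the signature; entry-specific annotation such as subcellular location in this toy data is left out.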


[Annotated entry; arrows mark: Taxonomy, Publication, Name (non-standard), Sequence]



InterPro


Finding a complete proteome in UniProtKB



Complete Proteomes



MS Proteomics

  • Search engines require each sequence (including isoforms) to be present in the dataset as a separate entity

  • For higher organisms with isoforms, an expanded set is made available on the FTP site

  • FASTA files by FTP

    • One file per species containing canonical + isoform sequences
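A file that mixes canonical and isoform sequences can be split again with a few lines of parsing. The sketch below assumes headers of the form `db|ACCESSION|NAME ...` and that isoform accessions carry a numeric suffix such as `-2`; both are assumptions about the file layout rather than something stated in the slides, and the accessions and sequences shown are made up.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: isoform accessions end in "-<number>", e.g. "X00001-2".
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header: str) -> str:
    """Extract the accession from a 'db|ACCESSION|NAME ...' style header."""
    fields = header.split("|")
    if len(fields) >= 3:
        return fields[1]
    return (header.split() or [""])[0]


def split_isoforms(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate canonical sequences from isoform sequences by accession."""
    canonical: Dict[str, str] = {}
    isoforms: Dict[str, str] = {}
    for header, sequence in read_fasta(lines):
        acc = accession(header)
        target = isoforms if _ISOFORM_SUFFIX.search(acc) else canonical
        target[acc] = sequence
    return canonical, isoforms


# Made-up two-record file for illustration.
fasta_text = """\
>sp|X00001|TOY_HUMAN Toy protein OS=Homo sapiens OX=9606 GN=TOY PE=1 SV=1
MKTAYIAKQR
QISFVKSHFS
>sp|X00001-2|TOY_HUMAN Isoform 2 of Toy protein OS=Homo sapiens OX=9606 GN=TOY
MKTAYIAKQRSHFS
"""
canonical, isoforms = split_isoforms(fasta_text.splitlines())
print(canonical, isoforms)
```

Keeping every isoform as its own record is what lets an MS search engine assign peptides that are unique to a single isoform.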

