
UniProtKB - PowerPoint PPT Presentation



Sandra Orchard. UniProtKB. Importance of reference protein sequence databases. Completeness and minimal redundancy: a non-redundant protein sequence database, with maximal coverage including splice isoforms, disease variants and PTMs.

Presentation Transcript

Sandra Orchard

UniProtKB


Importance of reference protein sequence databases

  • Completeness and minimal redundancy

    A non-redundant protein sequence database, with maximal coverage including splice isoforms, disease variants and PTMs.

    Low degree of redundancy for facilitating peptide assignments

  • Stability and consistency

    Stable identifiers and consistent nomenclature

    Databases are in constant change due to a substantial amount of work to improve their completeness and the quality of sequence annotation

  • High quality protein annotation

    Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources


Summary of protein sequence databases

Updated from Nesvizhskii, A. I., and Aebersold, R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 4, 1419–1440.


UniProtKB

  • UniProt Knowledgebase: 2 sections

    • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)

    • UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

www.uniprot.org



Manual annotation of UniProtKB/Swiss-Prot

Splice variants

Sequence

Sequence features

UniProtKB

Ontologies

Annotations

References

Nomenclature


  • Sequence curation, stable identifiers, versioning and archiving

    • For example: erroneous gene model predictions, frameshifts, premature stop codons, read-throughs, erroneous initiator methionines


Splice variants



Identification of amino acid variants

...and of PTMs

...and also:


Domain annotation

Binding sites



Protein nomenclature

Annotation - >30 defined fields

Controlled vocabularies used whenever possible…



..and also imported from external resources

Binary interactions taken from the IntAct database

Interactors of human p53



Controlled vocabulary usage is increasing, for example with terms from the Gene Ontology

Annotation for human Rhodopsin


Sequence evidence: type of evidence that supports the existence of a protein

1 Evidence at protein level

There is experimental evidence of the existence of a protein (e.g. Edman sequencing, MS, X-ray/NMR structure, good quality protein-protein interaction, detection by antibodies)

2 Evidence at transcript level

The existence of a protein has not been proven but there is expression data (e.g. existence of cDNAs, RT-PCR or Northern blots) that indicates the existence of a transcript.

3 Inferred from homology

The existence of a protein is likely because orthologs exist in closely related species

4 Predicted

5 Uncertain
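The evidence levels listed on this slide form an ordered scale, so they can be handled as a small enumeration. The following is a minimal sketch, assuming the UniProtKB FASTA header convention in which the level is carried in a `PE=` field; the enum, the helper name and the example header are illustrative, not part of the presentation.

```python
import re
from enum import IntEnum


class ProteinExistence(IntEnum):
    """Evidence levels from the slide (a lower number means stronger evidence)."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


def protein_existence(fasta_header: str) -> ProteinExistence:
    """Read the PE= field from a UniProtKB-style FASTA header line."""
    match = re.search(r"\bPE=([1-5])\b", fasta_header)
    if match is None:
        raise ValueError("no PE= field found in header")
    return ProteinExistence(int(match.group(1)))


# Example header written in the UniProtKB FASTA style, for illustration.
header = (">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
          "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4")
level = protein_existence(header)
print(level.name)
```

Because the enumeration is ordered, a filter such as `level <= ProteinExistence.TRANSCRIPT_LEVEL` keeps only entries with experimental support.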


Manual annotation of the human proteome (UniProtKB/Swiss-Prot)

A draft of the complete human proteome has been available in UniProtKB/Swiss-Prot since 2008

Manually annotated representation of 20,242 protein-coding genes with ~36,000 protein sequences. Together with an additional 38,484 UniProtKB/TrEMBL entries, these form the complete proteome set

Approximately 63,000 single amino acid polymorphisms (SAPs), mostly disease-linked

80,000 post-translational modifications (PTMs)

Close collaboration with NCBI, Ensembl, Sanger Institute and UCSC to provide the authoritative set to the user community


Searching UniProt – Simple Search



Searching UniProt – Advanced Search



Each linked to the UniProt entry

Searching UniProt – Search Results







Alignment with query sequence

Searching UniProt – Blast Results




UniProtKB/TrEMBL

  • Multiple entries for the same protein (redundancy) can arise in UniProtKB/TrEMBL due to:

    • Erroneous gene model predictions

    • Sequence errors (frameshifts)

    • Polymorphisms

    • Alternative start sites

    • Isoforms

  • Apart from 100% identical sequences, all merged sequences are analysed by a curator so they can be annotated accordingly.
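The distinction drawn here, between 100% identical sequences and near-duplicates that need a curator, can be illustrated by grouping entries on their exact sequence. This is only a sketch of the idea: the accessions and sequences are made up, and the function name is hypothetical.

```python
from collections import defaultdict


def group_identical(entries: dict[str, str]) -> list[list[str]]:
    """Group accessions whose sequences are 100% identical."""
    by_sequence: dict[str, list[str]] = defaultdict(list)
    for accession, sequence in entries.items():
        by_sequence[sequence.upper()].append(accession)
    return [sorted(group) for group in by_sequence.values()]


# Hypothetical accessions and toy sequences, for illustration only.
entries = {
    "ACC_A": "MKTAYIAK",
    "ACC_B": "MKTAYIAK",  # identical to ACC_A
    "ACC_C": "MKTAYIAR",  # differs by one residue: left for a curator to assess
}
groups = group_identical(entries)
print(groups)
```

Entries that differ by even one residue end up in separate groups, which is where the causes listed on the slide (polymorphisms, alternative start sites, isoforms and so on) have to be told apart by inspection.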



Given a set of uncharacterised sequences, we usually want to know:

  • what are these proteins; to what family do they belong?

  • what is their function; how can we explain this in structural terms?


1. Pairwise alignment approaches (e.g. BLAST)

  • Good at recognising similarity between closely related sequences

  • Perform less well at detecting divergent homologues

2. The protein signature approach

  • Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins

  • We can then use these models to infer relationships with previously characterised sequences

  • This is the approach taken by protein signature databases


Multiple sequence alignment

What are protein signatures?

Protein family/domain

Build model

Search

UniProt

Protein analysis

Significant match

ITWKGPVCGLDGKTYRNECALL

Mature model

AVPRSPVCGSDDVTYANECELK


Diagnostic approaches (sequence-based)

  • Single motif methods

    • Regex patterns (PROSITE)

  • Multiple motif methods

    • Identity matrices (PRINTS)

  • Full domain alignment methods

    • Profiles (Profile Library)

    • HMMs (Pfam)


Patterns

Sequence alignment containing a motif

Define pattern

Extract pattern sequences

Build regular expression

C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Pattern signature: PS00000


Patterns

Advantages

  • Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies

  • Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins, very short toxins with few conserved cysteines

    C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C

Drawbacks

  • High False Positive/False Negative rate

Patterns are mostly directed against functional residues:

active sites, PTM, disulfide bridges, binding sites
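The pattern notation used on these slides maps closely onto ordinary regular expressions. The sketch below translates only the syntax elements that appear in the two patterns shown (single residues, `[...]` allowed sets, `{...}` forbidden sets, `x` wildcards and `(n)` or `(n,m)` repeat counts). It is an illustrative helper, not PROSITE's own tooling; other PROSITE syntax, such as sequence anchors, is not handled, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # Split an element such as x(2) or {C}(6) into core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        # Single residues and [ABC] sets are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C")
print(regex)

# Made-up sequence containing one instance of the motif.
hit = re.search(regex, "MKCCAAACSAAALAAACGG")
print(hit.start() if hit else None)
```

The forbidden position `{P}` becomes the negated class `[^P]`, which is what lets a pattern exclude a residue that would otherwise match a closely related subfamily.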


Fingerprints

Sequence alignment containing Motif 1, Motif 2 and Motif 3

Define motifs

Extract motif sequences

Weight matrices

Fingerprint signature: PR00000

Motifs 1, 2 and 3 must occur in the correct order and with the correct spacing

The significance of motif context

  • Identify small conserved regions in proteins

  • Several motifs characterise a family

  • Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours

order

interval
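The role of order and interval can be sketched as a check that every motif of a fingerprint is found, in sequence, with a bounded gap to its predecessor. This is a simplified illustration: it uses exact regular expressions in place of the matrix-based motif scoring named on the fingerprint slide, and the motifs, gap limits and sequences are all hypothetical.

```python
import re


def matches_fingerprint(sequence: str, motifs: list[str], max_gaps: list[int]) -> bool:
    """Return True if all motifs occur in order with acceptable spacing.

    motifs   -- regular expressions, in the order expected for the family
    max_gaps -- max_gaps[i] is the largest number of residues allowed
                between motif i and motif i + 1
    """
    def place(index: int, previous_end: int) -> bool:
        if index == len(motifs):
            return True  # every motif has been placed
        pattern = re.compile(motifs[index])
        if index == 0:
            last_start = len(sequence)  # the first motif may start anywhere
        else:
            last_start = min(previous_end + max_gaps[index - 1], len(sequence))
        for start in range(previous_end, last_start + 1):
            hit = pattern.match(sequence, start)
            if hit and place(index + 1, hit.end()):
                return True
        return False

    return place(0, 0)


# Hypothetical three-motif fingerprint with at most 5 residues between motifs.
motifs = ["GDS", "HW", "CC"]
max_gaps = [5, 5]
print(matches_fingerprint("AAGDSAAAHWAACCAA", motifs, max_gaps))
```

A sequence that contains all three motifs but in a different order, or with too many residues between them, is rejected. This is the extra context that single-motif methods do not use.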


Profiles & HMMs

Sequence alignment

Define coverage: whole protein or entire domain

Use entire alignment for domain or protein

Models insertions and deletions

Build model

Profile or HMM signature
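The idea of building a model from a whole alignment can be sketched with a simple position-specific scoring matrix. This is a deliberately reduced illustration: it assumes an ungapped alignment, a uniform background and a fixed pseudocount, and unlike real profiles and HMMs it does not model insertions and deletions. The two aligned sequences are the example sequences shown earlier in the deck; the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build per-column log-odds scores from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for residue in AMINO_ACIDS:
            frequency = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(frequency / background)
        profile.append(scores)
    return profile


def score(profile: list[dict[str, float]], sequence: str) -> float:
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(profile, sequence))


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)
print(round(score(profile, alignment[0]), 2))
```

Every column contributes to the score, so conserved positions such as the cysteines reward a match while variable positions contribute little either way.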


HMM databases

  • Sequence-based

    • PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship

    • PANTHER: families/subfamilies model the divergence of specific functions

    • TIGRFAM: microbial functional family classification

    • Pfam: families & domains based on conserved sequence

    • SMART: functional domain annotation

  • Structure-based

    • SUPERFAMILY: models correspond to SCOP domains

    • Gene3D: models correspond to CATH domains


Why we created InterPro

  • By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database

    • to simplify & rationalise protein analysis

    • to facilitate automatic functional annotation of uncharacterised proteins

    • to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases


InterPro Entry

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers

  • Hierarchical classification


InterPro hierarchies: Families

FAMILIES can have parent/child relationships with other families

  • Parent/child relationships are based on:

    • Comparison of protein hits: a child should be a subset of its parent, and siblings should not have matches in common

    • Existing hierarchies in member databases

    • Biological knowledge of curators
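The two rules for comparing protein hits are set relations, so they can be sketched directly with Python sets. The protein identifiers and function names below are hypothetical, and the sketch says nothing about how the other two criteria (member database hierarchies and curator knowledge) are applied.

```python
def is_valid_child(parent_hits: set[str], child_hits: set[str]) -> bool:
    """A child entry should match a subset of the proteins its parent matches."""
    return child_hits <= parent_hits


def siblings_disjoint(sibling_hits: list[set[str]]) -> bool:
    """Sibling entries should not have any matched proteins in common."""
    seen: set[str] = set()
    for hits in sibling_hits:
        if seen & hits:
            return False
        seen |= hits
    return True


# Hypothetical protein identifiers, for illustration only.
parent = {"P1", "P2", "P3", "P4", "P5"}
child_a = {"P1", "P2"}
child_b = {"P3", "P4"}

print(is_valid_child(parent, child_a), siblings_disjoint([child_a, child_b]))
```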


InterPro hierarchies: Domains

DOMAINS can have parent/child relationships with other domains



InterPro Entry Organisation

Groups similar signatures together

Adds extensive annotation

Links to other databases

Structural information and viewers


Adds extensive annotation

The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics


Links to other databases

UniProt

KEGG ... Reactome ... IntAct ...

UniProt taxonomy

PANDIT ... MEROPS ... Pfam clans ...

PubMed


Structural information and viewers

PDB: 3-D structures

SCOP: structural domains

CATH: structural domain classification


Searching InterPro

Protein family membership

Domain organisation

Domains, repeats & sites

GO terms


InterProScan access

Interactive:

http://www.ebi.ac.uk/Tools/pfa/iprscan/

Webservice (SOAP and REST):

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest

http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap

Downloadable:

ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/




Automatic Annotation

  • Automated clean-up of annotation from original nucleotide sequence entry

  • Additional value added by using automatic annotation

    • Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot

    • Identifies all members of this family using patterns/motifs/HMMs in InterPro

    • Transfers common annotation to related family members in TrEMBL
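The three steps of the automatic annotation approach can be sketched as a set intersection followed by a copy. This is a toy illustration under strong assumptions: family membership is taken as already decided by signature matches, annotation is reduced to plain strings, and all accessions, terms and function names are hypothetical. It is not a description of the actual UniProt pipeline.

```python
def common_annotation(reviewed: dict[str, set[str]]) -> set[str]:
    """Annotation shared by every reviewed (Swiss-Prot) member of a family."""
    if not reviewed:
        return set()
    return set.intersection(*reviewed.values())


def transfer_annotation(reviewed: dict[str, set[str]],
                        unreviewed_members: list[str]) -> dict[str, set[str]]:
    """Copy the shared annotation to unreviewed (TrEMBL) family members."""
    shared = common_annotation(reviewed)
    return {accession: set(shared) for accession in unreviewed_members}


# Hypothetical family: two reviewed entries and one unreviewed entry that
# is assumed to match the same signature.
reviewed = {
    "SP_1": {"kinase activity", "ATP binding"},
    "SP_2": {"kinase activity", "ATP binding", "nucleus"},
}
predicted = transfer_annotation(reviewed, ["TR_1"])
print(predicted)
```

Only annotation common to all reviewed members is transferred, so a term that applies to a single member ("nucleus" in this toy case) is not propagated.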



← Taxonomy

← Publication

← Name (non-standard)

← Sequence


InterPro



Complete Proteomes


MS Proteomics

  • Search engines require each sequence (including isoforms) to be present in the dataset as a separate entity

  • For higher organisms, with isoforms, an expanded set is made available on the FTP site

  • FASTA files by FTP

    • One file per species containing canonical + isoform sequences
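A file that mixes canonical and isoform sequences can be read so that each sequence stays a separate entry while still being traceable to its canonical accession. The sketch below assumes UniProtKB-style headers of the form `>db|ACCESSION|NAME ...` and isoform accessions written as the canonical accession plus a `-N` suffix; the sequences are truncated toy strings and the function names are hypothetical.

```python
from collections import defaultdict


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {accession: sequence}, one entry per sequence."""
    records: dict[str, str] = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumes headers of the form >db|ACCESSION|NAME description
            accession = line[1:].split("|")[1]
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def group_by_canonical(records: dict[str, str]) -> dict[str, list[str]]:
    """Group isoform accessions (e.g. P04637-2) under their canonical accession."""
    groups: dict[str, list[str]] = defaultdict(list)
    for accession in records:
        groups[accession.split("-")[0]].append(accession)
    return dict(groups)


# Toy file content: headers follow the assumed style, sequences are made up.
fasta_text = """>sp|P04637|P53_HUMAN Cellular tumor antigen p53
MEEPQSDPSV
EPPLSQ
>sp|P04637-2|P53_HUMAN Isoform 2 of Cellular tumor antigen p53
MEEPQSDPSV
"""
records = read_fasta(fasta_text)
print(group_by_canonical(records))
```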

