Proteomics Resources at the EBI

Sandra Orchard

EMBL-EBI

What do protein scientists require?

1. Protein Identification

A high-quality, non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs, to act as a reference set. Stable identifiers and sequence archiving are essential.

2. Protein annotation

Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

3. Reference data sets

Comparative datasets to compare tissue specificity patterns, normal/disease protein sets

Where do we go from here?

Sequence similarity programs run against UniProt

What is UniProt?

Based on the original work on PIR, Swiss-Prot and TrEMBL

Funded mainly by NIH

Collaboration between EBI, SIB and PIR

UniProt data sources and data flow

[Figure: UniProt data sources and data flow. Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, FlyBase, WormBase, Sub/Peptide Data, Patent Data. Resources shown: UniParc, UniProtKB, UniMes, UniSave, UniRef 100, UniRef 90, UniRef 50, IPI, Proteome Sets.]

UniProtKB
  • UniProt Knowledgebase:
    • Aims to describe in a single record all protein products derived from a certain gene from a certain species
    • 2 sections:
      • UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
      • UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

www.uniprot.org
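
The two UniProtKB sections can be told apart in downloaded FASTA files by the header prefix (`sp` for Swiss-Prot, `tr` for TrEMBL), and the `SV=` field carries the sequence version. The following is a minimal sketch of reading such a header; the `FastaHeader` class and `parse_header` function are hypothetical names, and the example header in the usage note is illustrative, not copied from a release.

```python
import re
from dataclasses import dataclass
from typing import Optional

# KEY=value fields used in UniProtKB FASTA headers (OS, OX, GN, PE, SV).
FIELD = re.compile(r"\b(OS|OX|GN|PE|SV)=(.*?)(?=\s+(?:OS|OX|GN|PE|SV)=|$)")


@dataclass
class FastaHeader:
    reviewed: bool                    # True for Swiss-Prot (sp), False for TrEMBL (tr)
    accession: str                    # stable identifier
    entry_name: str
    organism: Optional[str]           # value of OS=, if present
    sequence_version: Optional[int]   # value of SV=, if present


def parse_header(line: str) -> FastaHeader:
    """Parse one '>db|accession|entry_name description KEY=value ...' line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    ident, _, rest = line[1:].strip().partition(" ")
    db, accession, entry_name = ident.split("|")
    if db not in ("sp", "tr"):
        raise ValueError("unexpected database prefix: " + db)
    fields = dict(FIELD.findall(rest))
    version = fields.get("SV")
    return FastaHeader(
        reviewed=(db == "sp"),
        accession=accession,
        entry_name=entry_name,
        organism=fields.get("OS"),
        sequence_version=int(version) if version else None,
    )
```

For instance, `parse_header(">sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial OS=Oryctolagus cuniculus OX=9986 GN=GOT2 PE=1 SV=2")` yields a record with `reviewed` set to `True` and `accession` set to `"P12345"`.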

What does UniProtKB give you?

1. Curated protein sequences: correction of frameshifts, premature stop sites, incorrect initiator methionines and similar errors, with stable identifiers, archiving and versioning
2. Consistent nomenclature, plus synonyms
3. Identification of splice variants and/or alternative promoter usage, with stable identifiers, archiving and versioning
4. Identification of variants (at amino acid level) and of PTMs; where known, the consequence is given, with stable identifiers, archiving and versioning
5. Annotation of literature experimental data in 27 defined fields, with increasing use of controlled vocabularies, without loss of detail
6. Extensive cross-referencing, a central portal to a wealth of external resources: 81 external databases cross-referenced to UniProtKB

1. Sequence curation, stable identifiers, versioning and archiving

www.ebi.ac.uk/uniprot/unisave

  • Curation corrects, for example:
    • erroneous gene model predictions
    • frameshifts
    • premature stop codons, readthroughs and erroneous initiator methionines

  • Domain annotation
  • Binding sites
  • Splice variants
  • Experimental mutations
  • Sequence conflicts

Controlled vocabularies are used whenever possible, but the ability to further describe each specific situation is retained.

Disease-specific annotation is added to human entries, with supporting cross-references.

6. Extensive cross-referencing, a central portal to a wealth of external resources, plus additional annotation (Gene Ontology)


UniProtKB/TrEMBL

  • Redundant – only 100% identical sequences merged
  • Automated clean-up of annotation from original nucleotide sequence entry
  • Additional value added by using automatic annotation

Automatic Annotation

  • Recognises common annotation belonging to a closely related family within UniProtKB/Swiss-Prot
  • Identifies all members of this family using pattern/motif/HMMs in InterPro
  • Transfers common annotation to related family members in TrEMBL

Protein Sequence Characterisation

[Figure: Protein Sequence Characterisation. Labels: BLAST; basic information; more sequences; build up consensus sequences of families, domains, motifs or sites; conserved signatures.]


Finding Conserved Signatures

The slide places these methods on a scale running from "simplest (limited)" to "more information":

  • Pattern
  • Fingerprint
  • Sequence clustering
  • Profile
  • HMM
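
As an illustration of the simplest signature type listed here, the pattern, the sketch below converts a pattern written in PROSITE-style syntax into a regular expression and scans a sequence with it. The slides do not show any pattern syntax, so the choice of PROSITE notation is an assumption; `prosite_to_regex` and `find_sites` are hypothetical helper names, and the test sequence in the usage note is invented.

```python
import re

# One pattern element: optional '<' anchor, a residue / [allowed] / {excluded} / x,
# an optional repeat count such as (2) or (2,4), and an optional '>' anchor.
ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}.' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, low, high, end = match.groups()
        if core == "x":                  # any residue
            regex = "."
        elif core.startswith("{"):       # any residue except those listed
            regex = "[^" + core[1:-1] + "]"
        else:                            # single residue or [alternatives]
            regex = core
        if low is not None:              # repeat count
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + regex + ("$" if end else ""))
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (1-based start, matched residues) for all, possibly overlapping, hits."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]
```

Calling `find_sites("N-{P}-[ST]-{P}", "MNGTANPSK")` reports one hit starting at residue 2; the second N is followed by P and is excluded. A pattern this short matches many unrelated proteins, which is the sense in which the slide calls the approach limited.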

Foundations of InterPro

  • Integration of signatures
  • Manual curation


Integration Process

[Figure: Integration Process. PROSITE and PFAM signatures are compared by the positions they match and the proteins they hit, and are assigned to InterPro entries (IPR000001, IPR000002). Cases shown: 1) same positions, same protein hits; 2) same positions, different protein hits; 3) different positions, same protein hits; 4) different positions. Bracketed values (100) and (50) appear beside the signatures.]


Signature Relationships

1) Parent - Child (subgroup of more closely related proteins)

[Figure: a PFAM "Protein kinase" signature as parent, with SMART and PROSITE "Serine kinase" and "Tyrosine kinase" signatures as children. The children have no proteins in common. Bracketed values (100), (75) and (25) appear beside the signatures.]

Applies to domains and families
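
One way to picture the parent-child relationship is in terms of the sets of proteins each signature hits: a child hits a subgroup of the parent's proteins, and sibling children share none. The sketch below is an illustration under that assumption only, not InterPro's curation procedure; the function names and accessions are hypothetical.

```python
from itertools import combinations


def is_child(child_hits: set, parent_hits: set) -> bool:
    """A child signature hits a proper subset of the parent's proteins."""
    return child_hits < parent_hits


def coverage(child_hits: set, parent_hits: set) -> float:
    """Percentage of the parent's proteins that the child also hits."""
    return 100.0 * len(child_hits & parent_hits) / len(parent_hits)


def children_disjoint(children: list) -> bool:
    """True when no two child signatures have a protein in common."""
    return all(a.isdisjoint(b) for a, b in combinations(children, 2))


# Hypothetical protein accessions, for illustration only.
protein_kinase = {"P1", "P2", "P3", "P4"}   # parent signature hits
serine_kinase = {"P1", "P2", "P3"}          # child signature hits
tyrosine_kinase = {"P4"}                    # child signature hits
```

With these invented sets, both kinase subfamilies qualify as children of the protein kinase signature and `children_disjoint([serine_kinase, tyrosine_kinase])` is true.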


Signature Relationships

2) Contains – Found in (describes domain composition)

[Figure: a PFAM "Receptor family" signature contains the N-terminal domain and C-terminal domain signatures from SMART and PROSITE; those domain signatures are found in the PFAM receptor family.]

Both families and domains can contain domains


Structural Representation in InterPro

[Figure: sequence-structure comparison. Labels: PDB sequence; MSD; residue-by-residue mapping; UniProt amino acid position; InterPro.]

Structural Representation

  • PDB structures displayed as striped patterns
  • Structural classification in CATH and SCOP
  • Homology models from Swiss-model and ModBase

Sequence-Structure Display

  • Signatures predictive of protein annotation
  • Structural data for specific proteins


InterProScan search

  • Member database search engines
  • Paste in sequence
  • Upload sequence file

InterProScan search results

  • Single InterPro entry


Automatic Annotation: automated annotation in TrEMBL

1) Extract conditions from InterPro

2) Group Swiss-Prot entries by conditions

3) Extract common annotation

4) Group TrEMBL entries by conditions and add common annotation to TrEMBL entries
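
The four automatic annotation steps can be sketched in a few lines. This is a simplified illustration, not the production pipeline: it assumes a "condition" is just the presence of one InterPro accession, that "common annotation" means annotation shared by every reviewed entry in a group, and that entries are plain dictionaries. All function names, accessions and annotation strings are hypothetical.

```python
from collections import defaultdict


def extract_rules(reviewed_entries, min_members=2):
    """Steps 1-3: group reviewed entries by InterPro hit, keep shared annotation."""
    groups = defaultdict(list)
    for entry in reviewed_entries:
        for ipr in entry["interpro"]:
            groups[ipr].append(set(entry["annotation"]))
    rules = {}
    for ipr, annotation_sets in groups.items():
        if len(annotation_sets) < min_members:
            continue  # too few reviewed members to call anything "common"
        common = set.intersection(*annotation_sets)
        if common:
            rules[ipr] = common
    return rules


def annotate(unreviewed_entries, rules):
    """Step 4: add the common annotation to unreviewed entries meeting a condition."""
    result = {}
    for entry in unreviewed_entries:
        inferred = set()
        for ipr in entry["interpro"]:
            inferred |= rules.get(ipr, set())
        result[entry["accession"]] = inferred
    return result


# Hypothetical entries, for illustration only.
reviewed = [
    {"accession": "S1", "interpro": {"IPR000001"}, "annotation": {"keyword A", "keyword B"}},
    {"accession": "S2", "interpro": {"IPR000001"}, "annotation": {"keyword A", "keyword C"}},
]
unreviewed = [
    {"accession": "T1", "interpro": {"IPR000001"}},
    {"accession": "T2", "interpro": {"IPR000002"}},
]
```

Here `extract_rules(reviewed)` keeps only "keyword A" for IPR000001, so `annotate(unreviewed, ...)` adds it to T1 and leaves T2 untouched.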

Non-redundant proteome sets
  • 1 entry/gene/species required
  • Stable protein identifier required for long-term maintenance of data
  • Need to have splice isoforms clearly identified as such
Non-redundant proteome sets
  • Many species with fully sequenced genomes and known coding regions, identified by the keyword “Complete proteome”
  • Currently available for >220 bacterial, >25 archaeal and 10 fungal species, plus Plasmodium yoelii yoelii, Plasmodium falciparum, Encephalitozoon cuniculi, Caenorhabditis elegans, Caenorhabditis briggsae, Anopheles gambiae and Drosophila melanogaster


Non-redundant proteome sets
  • Complete experimentally determined protein sets not yet available for higher organisms
  • Require inclusion of predicted proteins to give full proteome
  • International Protein Index (IPI) merges data from UniProt, Ensembl and RefSeq to produce a non-redundant dataset
International Protein Index
  • Non-redundant protein sets produced for human, mouse, rat, Arabidopsis, zebrafish, cow and chicken
  • Effectively maintains a database of cross-references between the primary data sources
  • Provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
  • Maintains stable identifiers (with incremental versioning) to allow the tracking of sequences between IPI releases
IPI
  • Merges entries with >95% redundancy
  • Treats products of alternative splicing, alternative promoter usage, alternative initiation as separate transcripts
UniParc

Protein sequence archive

Holds sequences from all sources, including patents and predicted proteins which are normally excluded from UniProtKB

Each unique sequence is stored once, as a string, and cross-referenced to the source databases in which it appears

The status of each sequence in its source database is recorded; a sequence remains in UniParc even after it is removed from the source database
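
The archive behaviour described for UniParc (one record per unique sequence, cross-references to every source, and a status flag instead of deletion) can be sketched as a small data structure. This is an illustration only: the class and method names are hypothetical, and the SHA-256 digest is a stand-in for whatever checksum the real archive uses.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArchiveRecord:
    sequence: str
    # (database, identifier) -> True while active in the source, False once removed
    xrefs: dict = field(default_factory=dict)


class SequenceArchive:
    """Stores each unique sequence once and never deletes it."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _key(sequence: str) -> str:
        # Assumption: a digest of the upper-cased sequence string identifies it.
        return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

    def add(self, sequence: str, database: str, identifier: str) -> str:
        """Register a source cross-reference; create the record if the sequence is new."""
        key = self._key(sequence)
        record = self._records.setdefault(key, ArchiveRecord(sequence.upper()))
        record.xrefs[(database, identifier)] = True
        return key

    def retire(self, database: str, identifier: str) -> None:
        """Mark a cross-reference as removed from its source; the sequence stays."""
        for record in self._records.values():
            if (database, identifier) in record.xrefs:
                record.xrefs[(database, identifier)] = False

    def get(self, key: str) -> ArchiveRecord:
        return self._records[key]

    def __len__(self) -> int:
        return len(self._records)
```

Adding the same sequence from two source databases produces one record with two cross-references, and retiring one of them changes only its status flag.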

IntAct Molecular Interaction Database

IntAct provides

  • data repository for molecular interactions
  • data analysis for molecular interactions
  • graph visualisation of molecular interactions with GO and InterPro
  • open source resource – both software and data available

www.ebi.ac.uk/intact

IntAct Molecular Interaction Database

Where possible, controlled vocabularies are used throughout. For example, “Yeast 2-hybrid”, “classical 2 hybrid” and “Y2H” all correspond to a single term in the IntAct/PSI CVs: “two hybrid” (MI:0018).

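
A controlled vocabulary of this kind amounts to a lookup from free-text synonyms to one term identifier. The sketch below uses only the synonyms and the MI:0018 term quoted on the slide; the table and function names are hypothetical, and a real implementation would load the full PSI-MI vocabulary rather than a hand-written dictionary.

```python
# Synonyms from the slide, all resolving to the PSI-MI term "two hybrid".
SYNONYM_TO_TERM = {
    "yeast 2-hybrid": "MI:0018",
    "classical 2 hybrid": "MI:0018",
    "y2h": "MI:0018",
    "two hybrid": "MI:0018",
}

TERM_NAMES = {"MI:0018": "two hybrid"}


def normalise_method(label: str):
    """Return the CV identifier for a free-text method label, or None if unknown."""
    key = " ".join(label.lower().split())  # case-fold and collapse whitespace
    return SYNONYM_TO_TERM.get(key)
```

Once every record carries the identifier rather than the author's wording, a search for MI:0018 finds all two-hybrid experiments however they were originally described.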

Use IntAct to…
  • Look at your favourite protein in the context of the molecules it interacts with
  • Annotate interacting protein clusters using GO/InterPro
  • Download species/disease sets and install locally
  • Download sets for visualisation locally or in Cytoscape
IntAct Curation
  • Curation of large datasets: Stelzl, Rual, Giot, Ho, Gavin…
  • Curation of low throughput interactions
    • Model organisms: human, mouse, Drosophila, C. elegans, Arabidopsis, rice, Saccharomyces, S. pombe, E. coli
    • Disease focus: cancer, Alzheimer’s
    • Collaborations: Cyanobacteria, with the Commissariat à l'énergie atomique, France
IntAct and PSI-MI
  • IntAct, DIP, MINT, (BIND), MIPS, Hybrigenics all make data available in PSI-MI format
  • Can download datasets from all databases and combine into a single database
  • Has led to the formation of the IMEx consortium (IntAct, DIP, MINT, BIND, MIPS) to share data and curation effort
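
Combining downloads from several databases might look like the sketch below. It assumes the files are in the tab-delimited PSI-MI representation (MITAB 2.5, 15 columns) rather than PSI-MI XML, and that two records describe the same interaction when they share the same unordered pair of interactor identifiers and the same detection method term. That matching rule and the function names are assumptions made for illustration.

```python
import re

MI_TERM = re.compile(r"MI:\d{4}")


def parse_mitab_line(line: str) -> dict:
    """Pull interactor pair, detection method and source from one MITAB 2.5 row."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 15:
        raise ValueError("expected at least 15 tab-separated columns")
    method = MI_TERM.search(cols[6])          # column 7: detection method
    return {
        "pair": frozenset((cols[0], cols[1])),  # columns 1-2: interactor identifiers
        "method": method.group(0) if method else None,
        "source": cols[12],                     # column 13: source database
    }


def merge_mitab(*datasets) -> dict:
    """Map (interactor pair, method) to the set of source databases reporting it."""
    merged = {}
    for lines in datasets:
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header comments
            record = parse_mitab_line(line)
            key = (record["pair"], record["method"])
            merged.setdefault(key, set()).add(record["source"])
    return merged
```

Each argument to `merge_mitab` is an iterable of lines, for example an open file per database; an interaction reported by two databases then appears once, with both sources listed.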
Search tools
  • All applications have simple text search tools available, some also have advanced/power search
  • Most applications indexed in SRS – searching across multiple databases then possible
User Input
  • Feedback – if you find something wrong, outdated, missing…
  • Be thorough when writing your papers – make protein identification clear, use accession numbers etc.
  • Submit, Submit, Submit
With thanks to…

The Sequence Database group – EBI

UniProt collaborators – SIB, PIR

InterPro consortium

IntAct consortium

GO consortium

PRIDE

HUPO-PSI

Rolf Apweiler