Synapse-Associated Protein 97

Classifying the protein universe

Ashwin Sivakumar

Wu et al, 2000. EMBO J 19:5740-5751

Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Protein Families
  • Protein families are defined by homology:
    • In a family, everyone is related to everyone
    • Everybody in a family shares a common ancestor

[Figure labels: Protein family 1, Protein family 2]
Homology versus Similarity
  • Homologous proteins share common ancestry and (usually) have similar 3D structures
  • 1chg and 1sgt: 31% identity, 43% similarity
  • Here, we can infer homology from similarity!

Superfamily: Trypsin-like Serine Proteases

[Figure: 1chg and 1sgt]
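Identity and similarity percentages like those quoted for 1chg and 1sgt come from a pairwise alignment. The sketch below shows one way such figures could be computed. It assumes the two sequences are already aligned (gaps written as `-`), counts only gap-free columns, and uses a hypothetical set of "similar" residue groups; real tools usually define similarity through a substitution matrix, so the exact numbers depend on those choices.

```python
# Hypothetical groups of residues treated as "similar"; real tools typically
# derive similarity from a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def identity_and_similarity(aligned_a, aligned_b, groups=SIMILAR_GROUPS):
    """Return (% identity, % similarity) over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    group_of = {aa: index for index, group in enumerate(groups) for aa in group}
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(1 for x, y in columns if x == y)
    similar = sum(
        1 for x, y in columns
        if x == y or (x in group_of and group_of[x] == group_of.get(y))
    )
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# Toy alignment (not the real 1chg/1sgt sequences).
identity, similarity = identity_and_similarity("ACDE-FG", "ACEEKFW")
print(f"{identity:.1f}% identity, {similarity:.1f}% similarity")
```

Identical residues are counted as similar too, which is why similarity is never lower than identity.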

Homology versus Similarity
  • But homologous proteins may not share sequence similarity:
  • 1chg and 1sgc: 15% identity, 25% similarity
  • We cannot infer similarity from homology

Superfamily: Trypsin-like Serine Proteases

[Figure: 1chg and 1sgc]

Homology versus Similarity
  • Similar sequences may not have structural similarity:
  • 1chg and 2baa: 30% similarity, 140/245 aa
  • We cannot assume homology from similarity!

[Figure: 1chg and 2baa]

Homology versus Similarity
  • Summary
    • Sequences can be similar without being homologous
    • Sequences can be homologous without being similar

[Diagram labels: Families ??, Evolution / Homology, BLAST, Similarity]

Description of a Protein Family
  • Let’s assume we know some members of a protein family
  • What is common to them all?
  • Multiple alignment!
Describing Sequences in a Protein Family
  • As a motif or rule
    • describes essential features of the protein family
    • catalytic residues, important structural residues
  • As a profile
    • describes variability in the family alignment

Techniques for searching sequence databases to

Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family

• Pattern: a deterministic syntax that describes multiple combinations of possible residues within a protein string

• Profile: a probabilistic generalization that assigns, to every position in a segment, the probability that each of the 20 amino acids will occur

Consensus: the mathematical probability that a particular amino acid will be located at a given position.

• A probabilistic pattern constructed from an MSA, with the opportunity to assign penalties for insertions and deletions

• PSSM (Position Specific Scoring Matrix)

– Represents the sequence profile in tabular form

– Columns of weights for every amino acid, corresponding to each column of an MSA
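To make the tabular form concrete, here is a minimal sketch of building a PSSM from a gap-free toy alignment. The uniform background frequency (1/20), the pseudocount of 1 and the log2-odds weighting are assumptions chosen for illustration; production tools use measured background frequencies and more careful sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log2-odds weight} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply ignored.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column weights of an ungapped segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment of three family members.
toy_alignment = ["ACD", "ACE", "ACD"]
toy_pssm = build_pssm(toy_alignment)
for candidate in ("ACD", "ACE", "WWW"):
    print(candidate, round(score_segment(toy_pssm, candidate), 2))
```

Residues seen often in a column get positive weights and unseen residues get negative ones, so family-like segments score higher than unrelated ones.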

HMMs
  • Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
  • Sequence ordering and alignments are not necessary at the outset (but in many cases alignments are recommended)
  • The more sequences, the better the model
  • One can generate a model (profile/PSSM), then search a database with it (e.g. Pfam)
Motif Description of a Protein Family
  • Regular expressions:

........C.............S...L..I..DRY..I.......................W...

Alternative residues at the L, D, Y and I positions: I, E, W, V

/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /

x = [AC-IK-NP-TVWY]
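The motif above maps almost directly onto a standard regular expression. The sketch below translates only the part from `C` up to `[IV]`; the trailing `x{10} – x{12} W` is left out because its notation is ambiguous in this transcript. The test sequences are made up for illustration.

```python
import re

# "x" in the slide: any of the 20 standard amino acids.
X = "[AC-IK-NP-TVWY]"

# Translation of: C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV]
# (the trailing spacer and W from the slide are intentionally omitted).
CORE_MOTIF = re.compile(
    f"C{X}{{13}}S{X}{{3}}[LI]{X}{{2}}I{X}{{2}}[DE]R[YW]{X}{{2}}[IV]"
)

# Hypothetical sequences: one carrying the motif, one with Y/W replaced by F.
hit_sequence = "MK" + "C" + "A" * 13 + "S" + "GGG" + "I" + "TT" + "I" + "PP" + "ERW" + "QQ" + "V" + "GG"
miss_sequence = hit_sequence.replace("ERW", "ERF")

hit = CORE_MOTIF.search(hit_sequence)
miss = CORE_MOTIF.search(miss_sequence)
print(hit.start() if hit else None, miss)
```

A pattern like this is deterministic: a single disallowed residue at a constrained position is enough to lose the match, which is the main contrast with a profile.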

Motif Description of a Protein Family
  • Database: PROSITE

“PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.”

http://au.expasy.org/prosite/prosite_details.html

Automated Motif Discovery
  • Given a set of sequences:
    • GIBBS Sampler
      • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
    • MEME
      • http://meme.sdsc.edu/meme/
    • PRATT
      • http://www.ebi.ac.uk/pratt
    • TEIRESIAS
      • http://cbcsrv.watson.ibm.com/Tspd.html
Automated Profile Generation
  • Any multiple alignment is a profile!
  • PSIBLAST
    • Algorithm:
      1. Start from a single query sequence
      2. Perform BLAST search
      3. Build profile of neighbours
      4. Repeat from step 2, searching with the profile
    • Very sensitive method for database search
PSI-Blast
  • Start with a sequence and BLAST it
  • Align selected results to the query sequence, estimate a profile (PSSM) from the MSA, and search the database with the profile
  • Iterate until the process stabilizes
  • The focus here is on domains, not entire sequences
  • Greatly improves sensitivity
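The iterate-until-stable loop can be illustrated with a toy stand-in. This is not the real PSI-BLAST: it assumes ungapped sequences of equal length, scores a sequence by the mean column frequency of its residues, and uses a hypothetical inclusion threshold. All names and sequences are invented for the example.

```python
from collections import Counter


def column_frequencies(seqs):
    """Per-column residue frequencies of equal-length, ungapped sequences."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({aa: n / len(seqs) for aa, n in counts.items()})
    return profile


def profile_score(profile, seq):
    """Mean frequency of the sequence's residues in the profile columns."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq)) / len(profile)


def iterative_profile_search(query, database, threshold, max_iter=10):
    """Search, rebuild the profile from the hits, and repeat until stable."""
    members = [query]
    history = []  # hits found in each iteration
    for _ in range(max_iter):
        profile = column_frequencies(members)
        hits = [s for s in database if profile_score(profile, s) >= threshold]
        history.append(hits)
        new_members = [query] + [s for s in hits if s != query]
        if new_members == members:  # no change: the process has stabilized
            break
        members = new_members
    return members, history


query = "ACDEFG"
database = ["ACDEFW", "ACDKLW", "WWWWWW"]
members, history = iterative_profile_search(query, database, threshold=0.55)
print(members)
print(history)
```

In this toy run the more distant sequence `ACDKLW` falls below the threshold when only the query is used, but is picked up once the profile includes the first hit. That is the sensitivity gain the slides refer to.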
PSIBLAST
  • Position Specific Iterative Blast

[Figure: Query, Profile1, Profile2, ..., after n iterations; threshold for inclusion in profile]
Benchmarking a motif/profile
  • You have a description of a protein family, and you do a database search…
  • Are all hits truly members of your protein family?
  • Benchmarking:
    • TP: true positive
    • TN: true negative
    • FP: false positive
    • FN: false negative

[Figure labels: Result; Dataset: family member, not a family member, unknown]

Benchmarking a motif/profile
  • Precision / Selectivity
    • Precision = TP / (TP + FP)
  • Sensitivity / Recall
    • Sensitivity = TP / (TP + FN)
  • Balancing both:
    • Precision ~ 1, Recall ~ 0: easy but useless
    • Precision ~ 0, Recall ~ 1: easy but useless
    • Precision ~ 1, Recall ~ 1: perfect but very difficult
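As a small worked example of the two formulas, the sketch below computes precision and recall from a set of search hits and a set of known family members. The protein identifiers are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall(hits, family_members):
    """Precision = TP / (TP + FP); recall (sensitivity) = TP / (TP + FN)."""
    hits, family_members = set(hits), set(family_members)
    tp = len(hits & family_members)   # hits that are true family members
    fp = len(hits - family_members)   # hits that are not family members
    fn = len(family_members - hits)   # family members that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical benchmark: four known members, three hits, one of them wrong.
known_family = {"P1", "P2", "P3", "P4"}
search_hits = {"P1", "P2", "X9"}
precision, recall = precision_recall(search_hits, known_family)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting a single confident hit gives high precision with low recall, while reporting the whole database gives full recall with low precision, which matches the two "easy but useless" cases above.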
The Modular Architecture of Proteins
  • BLAST search of a multi-domain protein

[Figure labels: Triosephosphate isomerase, Phosphoglycerate kinase]
What are domains?
  • Functional - from experiments:

example: Decay Accelerating Factor (DAF) or CD55

  • Has six domains (units):
    • 4x Sushi domain (complement regulation)
    • 1x ST-rich ‘stalk’
    • 1x GPI anchor (membrane attachment)
    • PDB entry 1ojy (sushi domains only)

P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude…
  • Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)]
  • Classifying complete sequences (predicting molecular function of proteins, large scale annotation)
  • Majority of proteins are multi-domain proteins.
What are domains?
  • Structural - from structures:

MKTQVAIIGAGPSGLLLGQLLHKAGIDNVILERQTPDYVLGRIRAGVLEQGMVDLLREAGVDRRMARDGLVHEGVEIAFAGQRRRIDLKRLSGGKTVTVYGQTEVTRDLMEAREACGATTVYQAAEVRLHDLQGERPYVTFERDGERLRLDCDYIAGCDGFHGISRQSIPAERLKVFERVYPFGWLGLLADTPPVSHELIYANHPRGFALCSQRSATRSRYYVQVPLTEKVEDWSDERFWTELKARLPAEVAEKLVTGPSLEKSIAPLRSFVVEPMQHGRLFLAGDAAHIVPPTGAKGLNLAASDVSTLYRLLLKAYREGRGELLERYSAICLRRIWKAERFSWWMTSVLHRFPDTDAFSQRIQQTELEYYLGSEAGLATIAENYVGLPYEEIE

Are these domains?

Yes - structural domains!

1phh

M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.

What are domains?
  • Mobile – Sequence Domains:

[Figure labels: Protein 1, Protein 2, Protein 3, Protein 4; Mobile module]

Domains are...
  • ...evolutionary building blocks:
    • Families of evolutionarily-related sequence segments
    • Domain assignment often coupled with classification
  • With one or more of the following properties:
    • Globular
    • Independently foldable
    • Recurrence in different contexts
  • To be precise,
    • we say: “protein family”
    • we mean: “protein domain family”
Example: global alignment
  • Phthalate dioxygenase reductase (PDR_BURCE)
  • Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)

Global alignment fails! It only aligns the largest domain.
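One way to see why a global alignment struggles with proteins that share only one domain is to compare global and local alignment scores on a toy pair. The sketch below uses textbook dynamic programming (Needleman-Wunsch style for global, Smith-Waterman style for local) with linear gap penalties. The scoring values and sequences are assumptions, and this is not the method used to produce the slide's figure.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=False):
    """Best global score, or best local score when local=True."""
    # First row: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Two toy proteins sharing one domain ("ACDEFGHI") in different contexts.
shared_domain = "ACDEFGHI"
protein_a = "WWWWWW" + shared_domain
protein_b = shared_domain + "KKKKKK"

global_score = alignment_score(protein_a, protein_b)
local_score = alignment_score(protein_a, protein_b, local=True)
print("global:", global_score, "local:", local_score)
```

The local score rewards the shared domain alone, while the global score also pays for the unrelated flanking regions. That is why domain-level comparisons rely on local methods such as BLAST.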

Sometimes even more complex!

PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”

[Figure: domain diagram with sequence positions 980, 1960, 2940, 3920, 4391]

45 domains of 9 different types, according to Pfam

http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160

http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

Categories of Domain Definitions
  • Curated
    • Sequence (continuous domains): PFAM, SMART, PROSITE, PRINTS
    • Structure (discontinuous domains): SCOP, CATH
  • Automatic
    • Sequence (continuous domains): ADDA, DOMO, TRIBE-MCL, GENERAGE, SYSTERS, PROTOMAP
    • Structure (discontinuous domains): DALI, PUU, DETEKTIVE, DOMAINPARSER 1 & 2, DIAL, STRUDL, DOMAK

Pfam: Protein family database

  • Families of HMM profiles built from hand-curated multiple alignments (Pfam A)
  • Pfam A covers 7973 protein families.
  • You can search your sequence against these profiles to determine family membership for your sequence.

Sequence Space Graph
  • Why we need to consider domains:
  • Topology:
    • 80% of all sequences in one giant component
    • 10% in smaller groups
    • 10% in singletons

[Figure legend: Sequence, Alignment]
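The giant component arises because a multi-domain protein aligns to members of several otherwise unrelated families and so links them together. The sketch below treats sequences as nodes and alignments as edges and finds connected components with a plain graph traversal. The protein names and the edge list are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes that are linked, directly or indirectly, by an edge."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            component.add(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(component)
    return components


# Hypothetical sequences: the two-domain protein aligns to both families.
sequences = ["kinase_only", "kinase_sh2", "sh2_only", "orphan"]
alignments = [("kinase_only", "kinase_sh2"), ("kinase_sh2", "sh2_only")]
components = connected_components(sequences, alignments)
print(components)
```

Here `kinase_only` and `sh2_only` share no domain, yet they end up in the same component through the two-domain protein. Clustering whole sequences therefore merges families unless domains are considered.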
Automatic domain definitions
  • Rely on alignment information
  • Alignment information is unreliable
    • Incomplete sequences (fragments)
    • Spurious alignments
    • Conserved motifs in mostly disordered regions
  • How to remove the noise?

[Figure labels: UREA_CANEN: three-domain protein; Distant relatives]


Sequence Space Graph:

  • Where to cut connections?
  • What is real, what is noise?
  • Precision vs Sensitivity…
ADDA
  • Holm Group in-house database!
    • http://ekhidna.biocenter.helsinki.fi:9801/sqgraph/pairsdb
  • Classification of non-redundant sequences
    • 100% level: 1562243 sequences, 2697368 domains
    • 40% level: 479740 sequences, 827925 domains
  • PFAM-A benchmark
    • Sensitivity: 87% (average unification in single cluster)
    • Selectivity: 98% (average purity of cluster)
    • Coverage: 100% (all known proteins) [ Pfam ~50% ]
Example: ABC transporter

UniProt id: CFTR_BOVIN

[Figure: domain assignments by PFAM, PRODOM, DOMO and ADDA]

Properties of domains
  • Most domains: size approx. 75–200 residues
So, you have a sequence...
  • ...look it up in an existing database
    • SRS: http://srs.ebi.ac.uk
    • INTERPRO: http://www.ebi.ac.uk/interpro
  • ...search against existing family descriptions
    • PFAM: http://www.sanger.ac.uk/Software/Pfam
    • SMART: http://smart.embl-heidelberg.de
    • PRINTS: http://bioinf.man.ac.uk/dbbrowser/PRINTS
    • PROSITE: http://us.expasy.org/prosite
  • ...look it up in ADDA
Manually Curated Protein Family Databases
  • PFAM (Hidden Markov Models)
    • http://www.sanger.ac.uk/Software/Pfam
  • SMART (Hidden Markov Models)
    • http://smart.embl-heidelberg.de
  • PROSITE (Regular Expressions, Profiles)
    • http://au.expasy.org/prosite
  • PRINTS (combination of Profiles)
    • http://bioinf.man.ac.uk/dbbrowser/PRINTS
Why a multiple alignment?
  • With a multiple alignment, we can
    • guess which residues are “important”
      • secondary structure prediction
      • transmembrane segments prediction
      • homology modelling
      • guide to wet-lab EXPERIMENTATION!
    • build a motif/profile and find more family members
    • build phylogenetic trees

Multiple Alignments are THE central object in protein sequence analysis!

From sequence to function…

3-motif resource

The server seems to be down today!

Methylmalonyl CoA Decarboxylase: pattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped onto the structure of 1DUB. The ball representation in pink shows the potential ligands and their binding pockets. The balls in blue represent the residues making up the motif on the known structure.
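A PROSITE-style pattern such as the one above can be scanned against a sequence after converting it to a regular expression. The converter below is a minimal sketch covering the common elements (`x`, residue sets in `[...]`, exclusions in `{...}`, repeat counts in `(n)` or `(n,m)`, and the `<` / `>` anchors). It is not an official PROSITE tool, and the test sequence is invented.

```python
import re

# One pattern element: optional "<", a core, optional repeat count, optional ">".
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


pattern = "[ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P"
regex = prosite_to_regex(pattern)

# Hypothetical sequence containing one occurrence of the motif.
sequence = "MS" + "I" + "AAA" + "E" + "A" * 7 + "V" + "G" + "A" + "L" + "A" + "LNRP" + "K"
match = re.search(regex, sequence)
print(regex, match.start() if match else None)
```

The match position can then be mapped onto residue numbers of a known structure, which is the kind of view the slide describes for 1DUB.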