
Synapse-Associated Protein 97

Classifying the protein universe

Ashwin Sivakumar

Wu et al., 2000. EMBO J 19:5740-5751

Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Protein Families
  • Protein families are defined by homology:
    • In a family, everyone is related to everyone
    • Everybody in a family shares a common ancestor

[Figure labels: Protein family 1; Protein family 2]
Homology versus Similarity
  • Homologous proteins share common ancestry and (usually) have similar 3D structures:
  • 1chg and 1sgt: 31% identity, 43% similarity
  • We can infer homology from similarity!

Superfamily: Trypsin-like Serine Proteases
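The identity and similarity percentages quoted on these slides are computed from a pairwise alignment. The following is a minimal sketch of that arithmetic, not the method used for the slide figures. It assumes the two sequences are already aligned, that gapped columns are excluded from the denominator (tools differ on this), and that "similar" means membership in the same hypothetical residue group `SIMILAR_GROUPS`. Real tools derive similarity from a substitution matrix such as BLOSUM62.

```python
# Hypothetical residue groups treated as "similar" for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def _group(residue):
    """Return the similarity group of a residue, or the residue itself."""
    for group in SIMILAR_GROUPS:
        if residue in group:
            return group
    return residue


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif _group(a) == _group(b):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared
```

For example, `identity_and_similarity("ACDEWK-L", "ACEEPRGL")` compares seven ungapped columns, of which four are identical and six are similar under the grouping above.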

Homology versus Similarity
  • But homologous proteins may not share sequence similarity:

Superfamily: Trypsin-like Serine Proteases

1chg and 1sgc: 15% identity, 25% similarity

We cannot infer similarity from homology

Homology versus Similarity
  • Similar sequences may not have structural similarity:

1chg and 2baa: 30% similarity, 140/245 aa

We cannot assume homology from similarity!

Homology versus Similarity
  • Summary
    • Sequences can be similar without being homologous
    • Sequences can be homologous without being similar

Families ??

Evolution /




Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Description of a Protein Family
  • Let’s assume we know some members of a protein family
  • What is common to them all?
  • Multiple alignment!
Describing Sequences in a Protein Family
  • As a motif or rule
    • describes essential features of the protein family
    • catalytic residues, important structural residues
  • As a profile
    • describes variability in the family alignment

Techniques for searching sequence databases to

Some common strategies to uncover common domains/motifs of biological significance that categorize a protein into a family

• Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string

• Profile - a probabilistic generalization that assigns to every position of a segment a probability that each of the 20 amino acids will occur


Consensus - mathematical probability that a particular amino acid will be located at a given position.

• Probabilistic pattern constructed from an MSA, with the opportunity to assign penalties for insertions and deletions

• PSSM - (Position Specific Scoring Matrix)

– Represents the sequence profile in tabular form

– Columns of weights for every aa corresponding to each column of a MSA.
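A PSSM as described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the scoring scheme of any particular tool: the background distribution is uniform over the 20 amino acids, a flat pseudocount is added to every cell, gap characters are ignored when counting, and the function names `build_pssm`, `consensus` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds weight."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def consensus(pssm):
    """Pick the highest-weighted residue in every column."""
    return "".join(max(column, key=column.get) for column in pssm)


def score(pssm, segment):
    """Sum the column weights for an ungapped segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

With the toy alignment `["ACDE", "ACDF", "ACEE", "GCDE"]`, `consensus(build_pssm(...))` picks the majority residue of each column, and a segment that follows the family scores higher than an unrelated one.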

  • Hidden Markov Models are statistical methods that consider all the possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
  • Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended)
  • The more sequences, the better the model.
  • One can generate a model (profile/PSSM), then search a database with it (e.g. Pfam)
Motif Description of a Protein Family
  • Regular expressions:



/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10} – x{12} W /
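The motif above translates almost directly into a standard regular expression. The sketch below uses Python's `re` module and assumes that `x` stands for any residue and that "x{10} – x{12}" means between 10 and 12 arbitrary residues. The helper name `find_motif` is hypothetical.

```python
import re

# The slide's motif in Python regex syntax.
# Assumption: "x{10} – x{12}" is read as a spacer of 10 to 12 residues.
MOTIF = re.compile(
    r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W"
)


def find_motif(sequence):
    """Return (start, end) of the first match (0-based, end exclusive) or None."""
    match = MOTIF.search(sequence)
    return (match.start(), match.end()) if match else None
```

A sequence either matches such a rule or it does not. That is why a motif counts as a deterministic description, in contrast to the graded scores of a profile.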


Motif Description of a Protein Family
  • Database: PROSITE

“PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins.”

Automated Motif Discovery
  • Given a set of sequences:
    • GIBBS Sampler
    • MEME


Automated Profile Generation
  • Any multiple alignment is a profile!
    • Algorithm:
      1. Start from a single query sequence
      2. Perform BLAST search
      3. Build profile of neighbours
      4. Repeat from 2 …
    • Very sensitive method for database search
PSI-BLAST (Position Specific Iterative BLAST)
  • Start with a sequence and BLAST it
  • Align selected results to the query sequence, estimate a profile (PSSM) from the MSA, and search the database with the profile
  • Iterate until the process stabilizes
  • Focus here is on domains, not entire sequences
  • Greatly improves sensitivity

[Figure: hits after n iterations, with the threshold for inclusion in the profile]
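The iterate-until-stable idea can be illustrated with a deliberately simplified toy. This is not PSI-BLAST itself. It assumes ungapped sequences of equal length, a uniform background, a flat pseudocount and a raw score cutoff in place of an E-value. All names (`build_pssm`, `score`, `iterative_search`, `threshold`) are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=1.0):
    """Log-odds weights per column from equal-length, ungapped sequences."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(sequences[0])):
        counts = Counter(seq[col] for seq in sequences)
        total = len(sequences) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, sequence):
    """Sum of column weights for one sequence."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


def iterative_search(query, database, threshold, max_iterations=10):
    """Grow the profile from the query until the set of hits stops changing."""
    members = {query}
    for _ in range(max_iterations):
        pssm = build_pssm(sorted(members))
        hits = {seq for seq in database if score(pssm, seq) >= threshold}
        hits.add(query)  # the query always stays in its own profile
        if hits == members:
            break  # process has stabilized
        members = hits
    return members
```

In this toy, a distant relative that scores below the cutoff against the query alone can be picked up in a later round, once closer relatives have broadened the profile. That is the sensitivity gain the slide refers to. The same mechanism can also pull in false positives, which is why the inclusion threshold matters.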
Benchmarking a motif/profile
  • You have a description of a protein family, and you do a database search…
  • Are all hits truly members of your protein family?
  • Benchmarking:

TP: true positive

TN: true negative

FP: false positive

FN: false negative


family member


not a family member


Benchmarking a motif/profile
  • Precision / Selectivity
    • Precision = TP / (TP + FP)
  • Sensitivity / Recall
    • Sensitivity = TP / (TP + FN)
  • Balancing both:
    • Precision ~ 1, Recall ~ 0: easy but useless
    • Precision ~ 0, Recall ~ 1: easy but useless
    • Precision ~ 1, Recall ~ 1: perfect but very difficult
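The two formulas can be checked on a small example. The sketch below assumes the search results and the true family are available as sets of sequence identifiers. The function names and the convention of returning 0.0 for an empty denominator are illustrative choices.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 if nothing was reported (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN); 0.0 if the family is empty (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def benchmark(hits, family):
    """Compare database hits against the known family members."""
    hits, family = set(hits), set(family)
    tp = len(hits & family)   # reported and truly in the family
    fp = len(hits - family)   # reported but not in the family
    fn = len(family - hits)   # in the family but missed
    return precision(tp, fp), sensitivity(tp, fn)
```

For instance, if a search reports four hits of which three belong to a five-member family, precision is 3/4 and sensitivity is 3/5. Reporting every sequence in the database drives sensitivity to 1 while precision collapses. Reporting only one certain hit gives precision 1 with very low sensitivity.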
Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
The Modular Architecture of Proteins
  • BLAST search of a multi-domain protein

[Figure labels: Triosephosphate isomerase; Phosphoglycerate kinase]
What are domains?
  • Functional - from experiments:

example: Decay Accelerating Factor (DAF) or CD55

  • Has six domains (units):
    • 4x Sushi domain (complement regulation)
    • 1x ST-rich ‘stalk’
    • 1x GPI anchor (membrane attachment)
    • PDB entry 1ojy (sushi domains only)

P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude…
  • Classifying domains [To aid structure prediction (predict structural domains, molecular function of the domain)]
  • Classifying complete sequences (predicting molecular function of proteins, large scale annotation)
  • Majority of proteins are multi-domain proteins.
What are domains?
  • Structural - from structures:


Are these domains?

Yes - structural domains!


M A Marti-Renom (2003) Identification of Structural Domains in Proteins. DIMACS, Rutgers University, Piscataway, NJ, Feb 27 2003.

What are domains?
  • Mobile – Sequence Domains:

Protein 1

Protein 2

Protein 3

Protein 4

Mobile module

Domains are...
  • ...evolutionary building blocks:
    • Families of evolutionarily-related sequence segments
    • Domain assignment often coupled with classification
  • With one or more of the following properties:
    • Globular
    • Independently foldable
    • Recurrence in different contexts
  • To be precise,
    • we say: “protein family”
    • we mean: “protein domain family”
Example: global alignment
  • Phthalate dioxygenase reductase (PDR_BURCE)
  • Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)

Global alignment fails!

Only aligns largest domain.

Sometimes even more complex!

PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”






45 domains of 9 different types, according to Pfam

Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Categories of Domain Definitions

Structure (discontinuous domains)

Sequence (continuous domains)























Pfam: Protein family database

  • Families of HMM profiles built from hand-curated multiple alignments (Pfam-A).
  • Pfam-A covers 7973 protein families.
  • You can search your sequence against these profiles to determine its family membership.


Sequence Space Graph
  • Why we need to consider domains:




  • 80% of all sequences in one giant component
  • 10% smaller groups
  • 10% in singletons
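The figures above describe connected components of a graph whose nodes are sequences and whose edges are significant alignments. The following sketch, with hypothetical names and toy data, shows how such components can be computed with a union-find structure. It also shows why whole-sequence clustering merges unrelated families: a multi-domain protein links everything that shares any one of its domains.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components, largest first (union-find)."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=len, reverse=True)


def summarize(components):
    """Fractions of sequences in the largest component, smaller groups, singletons."""
    total = sum(len(c) for c in components)
    if total == 0:
        return 0.0, 0.0, 0.0
    giant = len(components[0])
    singles = sum(1 for c in components if len(c) == 1)
    if giant == 1:
        giant = 0  # no component larger than a singleton
    return giant / total, (total - giant - singles) / total, singles / total
```

In a toy graph where protein A has domain X, protein C has domain Y and protein B has both, the edges A-B and B-C place A and C in one component although they share no domain. Cutting sequences into domains before clustering avoids that chaining.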

Distant relatives

Automatic domain definitions
  • Rely on alignment information
  • Alignment information is unreliable
    • Incomplete sequences (fragments)
    • Spurious alignments
    • Conserved motifs in mostly disordered region
  • How to remove the noise?

UREA_CANEN: three domain protein


Sequence Space Graph:

  • Where to cut connections?
  • What is real, what is noise?
  • Precision vs Sensitivity…
  • HolmGroup in-house database!
  • Classification of non-redundant sequences
    • 100% level: 1562243 sequences, 2697368 domains
    • 40% level: 479740 sequences, 827925 domains
  • PFAM-A benchmark
    • Sensitivity: 87% (average unification in single cluster)
    • Selectivity: 98% (average purity of cluster)
    • Coverage: 100% (all known proteins) [ Pfam ~50% ]
Example: ABC transporter





UniProt id: CFTR_BOVIN

Properties of domains
  • Most domains: size approx 75 – 200 residues
So, you have a sequence...
  • ...look it up in existing database
    • SRS:
  • against existing family descriptions
    • PFAM:
    • SMART:
    • PRINTS:
    • PROSITE:
  • ...look it up in ADDA
Manually Curated Protein Family Databases
  • PFAM (Hidden Markov Models)
  • SMART (Hidden Markov Models)
  • PROSITE (Regular Expressions, Profiles)
  • PRINTS (combination of Profiles)
Why a multiple alignment?
  • With a multiple alignment, we can
    • guess which residues are “important”
      • secondary structure prediction
      • transmembrane segments prediction
      • homology modelling
      • guide to wet-lab EXPERIMENTATION!
    • build a motif/profile and find more family members
    • build phylogenetic trees

Multiple Alignments are THE central object in protein sequence analysis!

From sequence to function…

3-motif resource

The server seems to be down today!

Methylmalonyl CoA Decarboxylase pattern [ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P mapped on the structure of 1DUB. The ball representation in pink shows the potential ligands and their binding pockets. The balls in blue represent the residues making up the motif on the known structure.
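A pattern written in this PROSITE-style syntax can be converted to a regular expression and scanned against a sequence. The converter below is a sketch covering only a subset of the syntax: single residues, `x`, `[...]` alternatives, `{...}` exclusions and `(n)` or `(n,m)` repeats. The function name `prosite_to_regex` is hypothetical, and anchors such as `<` and `>` are not handled.

```python
import re

# One pattern element: a core (x, a residue, [..] or {..}) plus an optional repeat.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a subset of PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


PATTERN = "[ILV]-x(3)-E-x(7)-V-[GA]-x-[IVL]-x-L-N-R-P"
REGEX = re.compile(prosite_to_regex(PATTERN))
```

Calling `REGEX.search(sequence)` on a protein sequence then reports where the motif residues fall, which is the information needed to map them onto a known structure.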