Classifying the protein universe

Synapse-Associated Protein 97 (Wu et al, 2002. EMBO J 19:5740-5751)

Presentation Transcript
Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Protein Families
  • Protein families are defined by homology:
    • In a family, everyone is related to everyone
    • Everybody in a family shares a common ancestor
[Figure: Protein family 1 and Protein family 2]
Homology versus Similarity
  • Homologous proteins share common ancestry and (usually) have similar 3D structures:
    • Superfamily: Trypsin-like Serine Proteases
    • 1chg and 1sgt: 31% identity, 43% similarity
  • We can infer homology from similarity!
[Figure: 1chg and 1sgt]
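The identity and similarity percentages quoted on these slides come from pairwise alignments. The sketch below shows one way such numbers could be computed from two already-aligned sequences. The function name and the `SIMILAR_GROUPS` table are hypothetical. Real tools usually derive "similarity" from a substitution matrix such as BLOSUM62 rather than from fixed residue groups, so the values would not reproduce the slide figures exactly.

```python
# Hypothetical grouping of chemically similar residues (an assumption made
# for illustration; real tools use a substitution matrix instead).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over aligned columns.

    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        columns += 1
        if a == b:
            # Identical residues also count as similar
            identical += 1
            similar += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if columns == 0:
        return 0.0, 0.0
    return 100.0 * identical / columns, 100.0 * similar / columns
```

For example, `identity_and_similarity("ACDE-K", "ACEEQR")` compares five ungapped columns, three of which are identical and two of which fall in the same residue group.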

Homology versus Similarity
  • But homologous proteins may not share sequence similarity:
    • Superfamily: Trypsin-like Serine Proteases
    • 1chg and 1sgc: 15% identity, 25% similarity
  • We cannot infer similarity from homology
[Figure: 1chg and 1sgc]

Homology versus Similarity
  • Similar sequences may not have structural similarity:
    • 1chg and 2baa: 30% similarity, 140/245 aa
  • We cannot assume homology from similarity!
[Figure: 1chg and 2baa]

Homology versus Similarity
  • Summary
    • Sequences can be similar without being homologous
    • Sequences can be homologous without being similar
[Figure labels: Families ??, Evolution / Homology, BLAST, Similarity]

Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
Description of a Protein Family
  • Let’s assume we know some members of a protein family
  • What is common to them all?
  • Multiple alignment!
Some common strategies
Techniques for searching sequence databases to uncover common domains/motifs of biological significance that categorize a protein into a family:
  • Pattern - a deterministic syntax that describes multiple combinations of possible residues within a protein string
  • Profile - probabilistic generalizations that assign, to every segment position, a probability that each of the 20 aa will occur
  • Intermediate sequence search - link many profile searches

Motif Description of a Protein Family
  • Regular expressions:

........C.............S...L..I..DRY..I.......................W...

Alternative residues at four positions: I (for L), E (for D), W (for Y), V (for I)

/ C x{13} S x{3} [LI] x{2} I x{2} [DE] R [YW] x{2} [IV] x{10,12} W /
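As a rough illustration, the pattern above can be written as a standard regular expression and run over a sequence. This is a minimal sketch that assumes the slide's `x{n}` means "any n residues" and that the final spacer means 10 to 12 residues. The name `find_motif` and the two test sequences are invented for the example and are not real proteins.

```python
import re

# x{n} in the slide notation is written as .{n}; the last spacer is assumed
# to mean "between 10 and 12 residues".
MOTIF = re.compile(
    r"C.{13}S.{3}[LI].{2}I.{2}[DE]R[YW].{2}[IV].{10,12}W"
)


def find_motif(sequence):
    """Return (start, end) of the first motif match, or None if absent."""
    match = MOTIF.search(sequence)
    return (match.start(), match.end()) if match else None


# Synthetic sequences built only to exercise the pattern
hit = ("MK" + "C" + "A" * 13 + "S" + "A" * 3 + "L" + "A" * 2 + "I"
       + "A" * 2 + "DRY" + "A" * 2 + "I" + "A" * 11 + "W" + "GG")
miss = hit.replace("DRY", "DKY")  # breaks the conserved R position
```

A pattern like this is deterministic: a sequence either matches or it does not, which is why a single substitution at the conserved R position is enough to lose the hit.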

Automated Motif Discovery
  • Given a set of sequences:
    • GIBBS Sampler
      • http://bayesweb.wadsworth.org/cgi-bin/gibbs.8.pl?data_type=protein
    • MEME
      • http://meme.sdsc.edu/meme/
    • PRATT
      • http://www.ebi.ac.uk/pratt
    • TEIRESIAS
      • http://cbcsrv.watson.ibm.com/Tspd.html
      • Combinatorial output!
Automated Profile Generation
  • Any multiple alignment is a profile!
  • PSIBLAST
    • Algorithm:
      1. Start from a single query sequence
      2. Perform BLAST search
      3. Build profile of neighbours
      4. Repeat from step 2 with the profile as query …
    • Very sensitive method for database search
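To make "any multiple alignment is a profile" concrete, here is a minimal sketch that turns an ungapped toy alignment into per-column log-odds scores and scores a candidate segment against it. The uniform background frequency, the pseudocount of 1 and the function names are assumptions made for illustration. PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background (assumed)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have equal length")
    background = 1.0 / len(ALPHABET)
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols
        counts = Counter(row[col] for row in alignment if row[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, segment):
    """Sum the column scores for an ungapped segment of profile length."""
    if len(segment) != len(profile):
        raise ValueError("segment length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, segment))
```

With the toy alignment `["ACDK", "ACEK", "ACDR"]`, a segment such as `"ACDK"` scores higher than an unrelated one such as `"WWWW"`, because each column rewards the residues seen in the family.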
PSI-BLAST
  • Position-Specific Iterated BLAST
  • A PSI-BLAST profile models only the positions present in the query sequence
[Figure: Query, Profile1, Profile2, ..., after n iterations; threshold for inclusion in profile]
HMMs
  • Hidden Markov Models are statistical methods that consider all possible combinations of matches, mismatches, and gaps to generate a consensus (Higgins, 2000)
  • Sequence ordering and alignments are not necessary at the onset (but in many cases alignments are recommended)
  • The more sequences, the better the model.
  • One can generate a model (profile/PSSM), then search a database with it (e.g. PFAM)
HMM libraries
  • PFAM
    • http://www.sanger.ac.uk/Pfam
    • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
    • Pfam-A entries are high quality, manually curated families.
    • Pfam-B entries are generated automatically.
GTG steps
  • Generate alignment trace graph
    • Nodes = residues
    • Edges = aligned in PSI-Blast library
    • Unweighted
  • Edge weighting
    • Using consistency
  • Clustering
    • Driven by consistency
    • Single site occupancy rule
  • Post-processing
    • Generate non-redundant set of inter-cluster edges
    • Identify sub-trees with conserved residues
Alignment trace graph
[Figure: residues of Protein 1 to Protein 5]

  • Graph representation of input pairwise alignment data
  • Vertices = residues
  • Edges = aligned in a pairwise alignment from input library
Consistency = neighbour overlap
  • For two residues i and j: Weight = intersection / union of their neighbour sets
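The intersection-over-union weight can be sketched as a Jaccard index on neighbour sets. This is an illustrative reading of the slide, not the GTG implementation. The dictionary layout and the vertex labels are invented, and counting each vertex as its own neighbour is an assumption the slide does not confirm.

```python
def consistency_weight(graph, i, j):
    """Jaccard overlap of the neighbour sets of vertices i and j.

    `graph` maps each vertex to the set of vertices it is aligned to.
    Each vertex is included in its own neighbour set (an assumption).
    """
    neighbours_i = graph[i] | {i}
    neighbours_j = graph[j] | {j}
    union = neighbours_i | neighbours_j
    return len(neighbours_i & neighbours_j) / len(union)


# Toy graph: vertices are hypothetical (protein, position) labels
graph = {
    ("P1", 10): {("P2", 12), ("P3", 9), ("P4", 5)},
    ("P2", 12): {("P1", 10), ("P3", 9)},
    ("P3", 9): {("P1", 10), ("P2", 12)},
    ("P4", 5): {("P1", 10)},
}
```

In this toy graph the edge between P1 and P2 is supported by a shared neighbour (P3) and gets a higher weight than the edge between P1 and P4, which no third residue confirms.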

GTG – global trace graph
  • Input: PSI-Blast all versus all alignments in NRDB40
  • Output: superalignment of all proteins
  • Applications
    • Pairwise alignment of query and target sequences
    • Transitive sequence database searching (fast)
    • Tracking conserved residues (feature space)
Alignment trace graph
[Figure: residues of Protein 1 to Protein 5, grouped into Cluster 1 and Cluster 2]
  • Edge weight = consistency (fraction of common neighbours)
  • Cluster ≈ hypothetical column of multiple alignment (single site occupancy)

‘Motif tracking’
[Figure: tree of residue vertices (A, H, G, K) connected by consistency edges]
  • Each vertex is labelled with source protein and position in sequence.
  • Motifs are subtrees enriched in one particular amino acid type.

Remote homolog detection based on GTG alignment score
  • GTG clustering is informative: it detects as many remote homologs as threading methods

Summary
  • Super-families form elongated clusters in “protein space”
    • Profile models fluctuations around an equilibrium point
  • Consistency ~ path model
    • Exploits multiple profile models
    • Discriminative in database searching
  • Global trace graph data structure
    • Feature space for pattern discovery

http://ekhidna.biocenter.helsinki.fi/gtg/start

Relationships between families
  • Pfam clans
    • A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.
  • Superfamily
    • http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/hmm.html
    • The sequence search method uses a library (covering all proteins of known structure) consisting of 1539 SCOP superfamilies from classes a to g. Each superfamily is represented by a group of hidden Markov models.
  • Pfam-squared
    • Based on GTG comparisons of representative sequences from each PFAM-A family against all PFAM-A families.
    • Rules of thumb: motif score > 1000 means probably related, motif score > 500 means possibly related, score < 500 means dubious
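The rules of thumb for Pfam-squared motif scores can be written as a small lookup. This is only a sketch of the thresholds quoted above. The function name is hypothetical, and a score of exactly 500 is not covered by the slide, so treating it as dubious is an assumption.

```python
def interpret_motif_score(score):
    """Map a motif score to the slide's rule-of-thumb label."""
    if score > 1000:
        return "probably related"
    if score > 500:
        return "possibly related"
    # Scores below 500; exactly 500 is grouped here by assumption
    return "dubious"
```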
Benchmarking a motif/profile
  • You have a description of a protein family, and you do a database search…
  • Are all hits truly members of your protein family?
  • Benchmarking:

    • TP: true positive
    • TN: true negative
    • FP: false positive
    • FN: false negative
[Figure: search result compared with the dataset annotation (family member, not a family member, unknown)]

Benchmarking a motif/profile
  • Precision / Selectivity
    • Precision = TP / (TP + FP)
  • Sensitivity / Recall
    • Sensitivity = TP / (TP + FN)
  • Balancing both:
    • Precision ~ 1, Recall ~ 0: easy but useless
    • Precision ~ 0, Recall ~ 1: easy but useless
    • Precision ~ 1, Recall ~ 1: perfect but very difficult
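The two measures can be computed directly from a list of search hits and a list of known family members. The sketch below assumes every sequence outside the known-member set is a true negative. Sequences of unknown status, which the benchmarking slide lists separately, would need to be removed from the hits first. The function and variable names are illustrative.

```python
def precision_recall(hits, family_members):
    """Return (precision, recall) for a database search.

    hits:           identifiers returned by the motif/profile search
    family_members: identifiers known to belong to the family
    """
    hits = set(hits)
    members = set(family_members)
    tp = len(hits & members)   # found and truly in the family
    fp = len(hits - members)   # found but not in the family
    fn = len(members - hits)   # in the family but missed
    precision = tp / (tp + fp) if hits else 0.0
    recall = tp / (tp + fn) if members else 0.0
    return precision, recall
```

For instance, four hits of which two are true members, against a family of three, give a precision of 0.5 and a recall of 2/3. Returning every sequence in the database pushes recall to 1 while precision collapses, which is the "easy but useless" case above.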
Domain Analysis and Protein Families
  • Introduction
    • What are protein families?
  • Protein families
    • Description & Definition
    • Motifs and Profiles
  • The modular architecture of proteins
  • Domain Properties and Classification
The Modular Architecture of Proteins
  • BLAST search of a multi-domain protein
[Figure labels: Triosephosphate isomerase, Phosphoglycerate kinase]
What are domains?
  • Functional - from experiments:
    • Example: Decay Accelerating Factor (DAF) or CD55
    • Has six domains (units):
      • 4x Sushi domain (complement regulation)
      • 1x ST-rich ‘stalk’
      • 1x GPI anchor (membrane attachment)
    • PDB entry 1ojy (sushi domains only)

P Williams et al (2003) Mapping CD55 Function. J Biol Chem 278(12): 10691-10696

There is only so much we can conclude…
  • Classifying domains: to aid structure prediction (predicting structural domains and the molecular function of each domain)
  • Classifying complete sequences: predicting the molecular function of proteins, large-scale annotation
  • The majority of proteins are multi-domain proteins.
What are domains?
  • Mobile – Sequence Domains:

[Figure: a mobile module recurring in Protein 1 to Protein 4]

Domains are...
  • ...evolutionary building blocks:
    • Families of evolutionarily-related sequence segments
    • Domain assignment often coupled with classification
  • With one or more of the following properties:
    • Globular
    • Independently foldable
    • Recurrence in different contexts
  • To be precise,
    • we say: “protein family”
    • we mean: “protein domain family”
Example: global alignment
  • Phthalate dioxygenase reductase (PDR_BURCE)
  • Toluene-4-monooxygenase electron transfer component (TMOF_PSEME)

Global alignment fails!

Only aligns largest domain.
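The contrast between global and local alignment on multi-domain proteins can be sketched with a small dynamic-programming scorer. This is a toy illustration, not the method used for the PDR_BURCE / TMOF_PSEME example. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions, and the two sequences are synthetic: they share one segment placed in different flanking contexts.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming, linear gap penalty.

    local=False: Needleman-Wunsch (global, end to end).
    local=True:  Smith-Waterman (best-scoring pair of sub-segments).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,
                       table[i - 1][j] + gap,
                       table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Synthetic two-domain sequences sharing one segment in different contexts
shared = "HEAGAWGHEE"
protein_1 = "MKTLLV" * 3 + shared
protein_2 = shared + "PQRSNDC" * 3

local_score = alignment_score(protein_1, protein_2, local=True)
global_score = alignment_score(protein_1, protein_2, local=False)
```

The local score recovers the shared segment in full, while the global score is dragged down by the unrelated flanking regions. This is the reason domain-level searches normally rely on local methods.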

Sometimes even more complex!

PGBM_HUMAN: “Basement membrane-specific heparan sulphate proteoglycan core protein precursor”

[Figure: domain map along the sequence, axis ticks at 980, 1960, 2940, 3920 and 4391]

45 domains of 9 different types, according to Pfam

http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=P98160

http://www.glycoforum.gr.jp/science/word/proteoglycan/PGA09E.html

Properties of domains
  • Most domains: size approx 75 – 200 residues
So, you have a sequence...
  • ...look it up in existing database
    • INTERPRO: http://www.ebi.ac.uk/interpro
  • ...search against existing family descriptions
    • PFAM: http://www.sanger.ac.uk/Software/Pfam
    • INTERPROSCAN: http://www.ebi.ac.uk/Tools/InterProScan/