Representative sets and Clustering.

Tom Oldfield

Representative sets

A subset of data that provides a statistically valid sample set for the complete data.

A set of structure fragments that best represents the “protein databank” or “protein space” during data analysis.

What is the PDB ?
  • The protein databank is a collection of experimental data.
    • Approx. 80 % from X-ray crystallography*
    • Approx. 20 % from NMR
    • Rest (!) are models, and other techniques
    • *Asymmetric units
Which really means…
  • The structures deposited are almost exclusively the solution of “hypothesis driven data analysis”
    • What will make pharmaceutical companies money as target structures.
    • What research can be justified to obtain grant money from the research councils.
    • A “great” idea for a PhD project (we have crystallised/solubilised it)
Hypothetical proteins…
  • Structural genomics : the structure solution of all the ORFs within a genome.
    • OK; the ones that we can : clone, express, purify, crystallise/solubilise….
  • So far a very small number.
Why - representative sets
  • There are (will be) too many structures
  • Proteins just get solved many times
    • Comparative research

Lysozyme was used in a systematic survey to study the structural effect of mutating each residue.

    • Competitive research
    • Get solved better as techniques improve
    • Degeneracy within protein fold space
Problem 1 : experiment
  • The whole PDB is not a representative set

It is a list of solutions to experiment

  • The NMR and X-ray data have their own statistical basis :

difficult to use both data in some analysis.

  • If a protein does not crystallise and is larger than about 25 kDa, there is no structure

Fold space is biased by experiment – no membrane proteins.

Problem 2a : error
  • All experiment results in error
  • Not all proteins are equal
    • The amount of data collected affects the accuracy (nearness to the truth) of a structure.
    • Crystallography and NMR do not allow direct deduction of a protein structure from the data

80 % of the information for an X-ray structure is unknown (the phase problem)

    • Not all parts of a structure solution are equal.
  • We need to select the best ones !
Problem 2b : Least error ?
  • X-ray : best resolution / Free-R
  • NMR : minimal violation list.
  • Best geometry :
    • You must not define structure quality based on a target of experimental procedure. (!)
  • Date : ML (maximum likelihood) refinement is less biased than LSQ (least squares)
  • Mutations : are not “natural products”

Another story !
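
As a minimal sketch of the "best resolution / Free-R" criterion above, the following picks one X-ray entry from a group of equivalent structures. The `XrayEntry` class, the entry identifiers, and the tie-break order (resolution first, then Free-R) are assumptions made for illustration; they are not taken from any MSD service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XrayEntry:
    """Hypothetical record for one X-ray structure."""
    entry_id: str
    resolution: float  # in angstroms; lower is better
    r_free: float      # Free-R; lower is better


def best_xray(entries):
    """Return the entry with the best resolution, using Free-R to break ties."""
    return min(entries, key=lambda e: (e.resolution, e.r_free))


# Three made-up entries for the same protein.
group = [
    XrayEntry("A", 2.1, 0.24),
    XrayEntry("B", 1.6, 0.22),
    XrayEntry("C", 1.6, 0.19),
]
best = best_xray(group)
print(best.entry_id)
```

NMR entries would need a different key (for example the size of the violation list), which is one reason the two kinds of data are hard to rank together.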

Problem 3 : evolution
  • “Evolution” resolves a problem usually only once : the problem is a particular structure/function
    • Each protein is a collection of bits of structure that work.
    • These structural bits are “domains” (one definition of a domain anyway).
    • Some proteins share domains, some proteins are many copies of the same/different domains.
  • A useful bit of structure will be found everywhere
Problem 4 : Statistics and Lies
  • You should not classify objects using a parameter you wish to study.
    • Current representative sets are classified by fold.
    • You should not use them to study fold !
  • The domain problem : there is no mathematical definition of (or agreement on) a domain, so fold classification is non-deterministic (i.e. there is more than one valid classification).
  • Proteins share fold fragments : protein fold space is “non-transitive”, i.e. xRy & yRz does not imply xRz.
  • Discrete/bounded or continuous/unbounded – discuss

Current Fold space is “not-closed”
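
The non-transitivity point can be illustrated with a toy example. In this sketch each protein is reduced to a set of fold-fragment labels, and the relation R is taken to mean "shares at least one fragment"; the protein names, fragment labels, and this definition of R are all assumptions chosen for illustration.

```python
def related(p, q):
    """Toy relation R: two proteins are related if they share a fragment."""
    return bool(p & q)


# Hypothetical proteins described by made-up fragment labels.
proteins = {
    "P1": {"alpha-hairpin"},
    "P2": {"alpha-hairpin", "beta-meander"},
    "P3": {"beta-meander"},
}

p1, p2, p3 = proteins["P1"], proteins["P2"], proteins["P3"]

# P2 is related to both P1 and P3, yet P1 and P3 share nothing.
print(related(p1, p2), related(p2, p3), related(p1, p3))
```

Any method that merges related items into a single group would put P1 and P3 together here even though they have nothing in common, which is why fold-based grouping can give more than one answer.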

Problem 5 : Species
  • The PDB is a collection of experiments on a convenient organism
  • Different species may have different biochemical pathways
    • Two similar structures (from different species) may have different function.
    • The best example structure may not have biochemical relevance.
Problem 6: To do what ?
  • Active sites, chemistry

Should use all the structures with that site.

  • Overall Fold analysis

Representatives selected by non-fold analysis

  • Local structure – depends
    • Fold-based representative sets
    • Non-fold based representatives
    • All known examples
  • Sequence
Representative sets
  • MSD provides the SCOP and CATH representative sets. These are published accepted standards.
    • You can use these as the basis set for queries in MSDLite and MSDpro
    • They do have limited use
  • Make your own ?

MSDmine has the facility to define your own list


A group of similar things


  • Grouping by similarity
  • Sequence

Moderately easy (direct solution) and well defined and fast. 1D

  • Structure

Difficult (iterative & non-exhaustive, non-transitive data, multiple solutions, non-closed data)

  • Function

Needs biological/chemical knowledge first
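
Grouping by sequence is the "moderately easy" case, and a simple greedy pass is enough to show how a non-redundant list can be built from it. This is a sketch under stated assumptions: the sequences are taken to be pre-aligned and of equal length, the 0.9 identity threshold is arbitrary, and the function names are hypothetical rather than part of any MSD tool.

```python
def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_representatives(seqs, threshold=0.9):
    """Keep a sequence only if it is below the threshold against every kept one.

    `seqs` is a list of (name, sequence) pairs, ideally ordered best-first,
    because earlier entries win over later, similar ones.
    """
    reps = []
    for name, seq in seqs:
        if all(identity(seq, kept) < threshold for _, kept in reps):
            reps.append((name, seq))
    return [name for name, _ in reps]


# Made-up sequences: s2 differs from s1 at one position in ten.
seqs = [
    ("s1", "ACDEFGHIKL"),
    ("s2", "ACDEFGHIKV"),
    ("s3", "MNPQRSTVWY"),
]
reps = greedy_representatives(seqs, threshold=0.9)
print(reps)
```

The result depends on the input order, so sorting the list by a quality measure first decides which member of each group is kept.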

Why ?
  • We wish to show difference and similarity
    • Shows evolutionary changes
    • Areas that do not change : critical to function
    • Shows variance
  • To visualise information rather than present data.
    • Show difference and similarity
    • Comparative analysis
  • The method of superposition depends on what we wish to observe
    • Structure : align by fold (difficult)
    • Sequence : align by sequence similarity (fast)
    • Function :
      • By environment residues (around ligand)
      • By active site residues (residues that do chemistry)
      • Atoms that do chemistry
      • By ligand (actually must be inhibitor !)
MSD clustering
  • Structure & Sequence
    • MSDfold is a service that provides structure superposition by fold.
    • Hit-list results from MSDpro are automatically superposed by structure and sequence for visualisation.
  • Function
    • MSDsite provides alignment by site environment and ligand.
MSDfold clustering
  • Pair-wise
    • To PDB / representative set
  • Multiple structure alignment


Clustering – structure/sequence

[Flow diagram. Recoverable labels:]
  • Known alignments
  • On-the-fly sequence alignment
  • Hit list to align
  • List of groups
  • View list
  • Matrices to align structures
  • List of files to view


Clustering – by function


  • MSDsite multi-view
    • Search by ligand/environment
    • View superposed
      • By ligand
      • By sequence pattern (PROsite)
      • By environment
Clustering by occurrence
  • Data mining (ie discovery driven data analysis)
    • Unbiased by pre-conceived ideas
    • Finds unknown new clusters
    • True mean & SD
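
The "true mean & SD" point can be made concrete with a small sketch: once observations have been grouped by occurrence, each cluster is summarised by its centre and spread. The measurements and cluster labels below are made up for illustration, and a single scalar per observation is assumed.

```python
import statistics
from collections import defaultdict

# Hypothetical (cluster label, measured value) pairs.
observations = [
    ("feature_1", 1.0),
    ("feature_1", 2.0),
    ("feature_1", 3.0),
    ("feature_2", 10.0),
    ("feature_2", 14.0),
]

# Collect the values belonging to each cluster.
clusters = defaultdict(list)
for label, value in observations:
    clusters[label].append(value)

# Summarise each cluster by its mean (centre) and sample standard deviation.
summary = {
    label: (statistics.mean(values), statistics.stdev(values))
    for label, values in clusters.items()
}
centre, spread = summary["feature_1"]
print(centre, spread)
```

Angular quantities such as torsion angles would need circular statistics instead of a plain mean.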
Clustering : by occurrence
  • Data mined results using statistical analysis of protein local structure.
    • Returns common local features (2000)
    • Many associated with ligands
    • Loaded within DB – query system under development
    • True statistical distribution (centre + variance)
    • Found many new “features”
  • Local fold structure annotation
    • (James Milner White)
  • Representative sets
    • SCOP and CATH sets provided
    • Depends what you want to do
  • Clustering
    • All of our services have provision for similarity searching and clustering
    • Forms the basis of comparative analysis