Representative sets and Clustering.

Tom Oldfield


Representative sets

A subset of data that provides a statistically valid sample set for the complete data.

A set of structure fragments that best represents the “protein databank” or “protein space” during data analysis.

What is the PDB?

  • The protein databank is a collection of experimental data.

    • Approx. 80 % from X-ray crystallography*

    • Approx. 20 % from NMR

    • Rest (!) are models, and other techniques

    • *Asymmetric units

Which really means…

  • The structures deposited are almost exclusively the solution of “hypothesis driven data analysis”

    • What will make pharmaceutical companies money as target structures.

    • What research can be justified to obtain grant money from the research councils.

    • A “great” idea for a PhD project (we have crystallised/solubilised it)

Hypothetical proteins…

  • Structural genomics: the structure solution of all the ORFs within a genome.

    • OK, the ones that we can clone, express, purify, crystallise/solubilise…

  • So far a very small number.

Why representative sets?

  • There are (will be) too many structures

  • Proteins just get solved many times

    • Comparative research

      Lysozyme was used in a systematic survey to study the structural effect of mutating each residue.

    • Competitive research

    • Get solved better as techniques improve

    • Degeneracy within protein fold space

Problem 1: experiment

  • The whole PDB is not a representative set

    It is a list of solutions to experiments

  • The NMR and X-ray data have their own statistical basis:

    it is difficult to use both kinds of data in some analyses.

  • If a protein does not crystallise and is larger than 25 kDa, then there is no structure

    Fold space is biased by experiment – no membrane proteins.

Problem 2a: error

  • All experiments result in error

  • Not all proteins are equal

    • The amount of data collected affects the accuracy (nearness to the truth) of a structure.

    • Crystallography and NMR do not allow direct deduction of a protein structure from the data

      80% of the information for an X-ray structure is unknown (phase problem)

    • Not all parts of a structure solution are equal.

  • We need to select the best ones !

Problem 2b: least error?

  • X-ray : best resolution / Free-R

  • NMR : minimal violation list.

  • Best geometry :

    • You must not define structure quality based on a target of experimental procedure. (!)

  • Date: maximum likelihood (ML) refinement is less biased than least squares (LSQ)

  • Mutations : are not “natural products”

Another story !
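
As a rough illustration of the X-ray criteria above, the sketch below picks one entry per redundant group by best resolution, using Free-R only to break ties. The `Entry` record, the group labels, the identifiers and the numbers are all hypothetical and invented for this example; geometry is deliberately not used as a criterion, following the warning on this slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    entry_id: str      # hypothetical identifier, not a real PDB code
    group: str         # hypothetical label of the redundant group
    resolution: float  # in angstroms; lower is better
    r_free: float      # Free-R value; lower is better


def pick_representatives(entries: List[Entry]) -> Dict[str, Entry]:
    """For each group keep the entry with the lowest (resolution, r_free)."""
    best: Dict[str, Entry] = {}
    for e in entries:
        current = best.get(e.group)
        # Tuple comparison: resolution first, Free-R only as a tie-break.
        if current is None or (e.resolution, e.r_free) < (
            current.resolution,
            current.r_free,
        ):
            best[e.group] = e
    return best


# Invented example data.
entries = [
    Entry("xxx1", "lysozyme", 2.1, 0.26),
    Entry("xxx2", "lysozyme", 1.5, 0.21),
    Entry("xxx3", "lysozyme", 1.5, 0.19),
    Entry("yyy1", "kinase", 2.8, 0.29),
]
reps = pick_representatives(entries)
for group, entry in reps.items():
    print(group, entry.entry_id)
```

Ordering by resolution before Free-R is only one possible choice; an NMR entry would need a different key, such as the size of its violation list.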

Problem 3: evolution

  • “Evolution” resolves a problem usually only once : the problem is a particular structure/function

    • Each protein is a collection of bits of structure that work.

    • These structural bits are “domains” (one definition of a domain anyway).

    • Some proteins share domains, some proteins are many copies of the same/different domains.

  • A useful bit of structure will be found everywhere

Problem 4: statistics and lies

  • You should not classify objects using a parameter you wish to study.

    • Current representative sets are classified by fold.

    • You should not use them to study fold !

  • The domain problem: there is no mathematical definition (or agreement) for a domain, so fold classification is non-deterministic (i.e. there is more than one classification!)

  • Proteins share fold fragments: protein fold space is “non-transitive”, i.e. xRy & yRz does not imply xRz.

  • Discrete/bounded or continuous/unbounded – discuss

    Current Fold space is “not-closed”
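
The non-transitivity point can be made concrete with a toy example. In the sketch below each protein is reduced to a set of fragment labels, and the relation R is taken to be "shares at least one fragment". This is an assumption made for illustration, not a real fold comparison, and the protein and fragment names are invented.

```python
from itertools import permutations
from typing import Dict, FrozenSet, List, Tuple


def related(x: FrozenSet[str], y: FrozenSet[str]) -> bool:
    """Toy relation R: two proteins are related if they share a fragment."""
    return bool(x & y)


def transitivity_violations(
    proteins: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, str, str]]:
    """Return every ordered triple (x, y, z) with xRy and yRz but not xRz."""
    found = []
    for x, y, z in permutations(proteins, 3):
        if (
            related(proteins[x], proteins[y])
            and related(proteins[y], proteins[z])
            and not related(proteins[x], proteins[z])
        ):
            found.append((x, y, z))
    return found


# Invented proteins: P2 shares one fragment with P1 and another with P3,
# while P1 and P3 have nothing in common.
proteins = {
    "P1": frozenset({"frag_a", "frag_b"}),
    "P2": frozenset({"frag_b", "frag_c"}),
    "P3": frozenset({"frag_c", "frag_d"}),
}
violations = transitivity_violations(proteins)
print(violations)
```

Any clustering that assumes "similar to a similar protein means similar" would place P1 and P3 in the same group here, even though they share nothing directly.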

Problem 5: species

  • The PDB is a collection of experiments on a convenient organism

  • Different species may have different biochemical pathways

    • Two similar structures (from different species) may have different function.

    • The best example structure may not have biochemical relevance.

Problem 6: to do what?

  • Active sites, chemistry

    Should use all the structures with that site.

  • Overall Fold analysis

    Representatives selected by non-fold analysis

  • Local structure – depends

    • Fold-based representative sets

    • Non-fold-based representative sets

    • All known examples

  • Sequence

Representative sets

  • MSD provides the SCOP and CATH representative sets. These are published accepted standards.

    • You can use these as the basis set for queries in MSDLite and MSDpro

    • They are of limited use

  • Make your own?

    MSDmine has the facility to define your own list


A group of similar things



  • Grouping by similarity

  • Sequence

    Moderately easy (direct solution) and well defined and fast. 1D

  • Structure

    Difficult (iterative & non-exhaustive, non-transitive data, multiple solutions, non-closed data)

  • Function

    Needs biological/chemical knowledge first
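
To show why grouping by sequence is the comparatively easy case, here is a minimal sketch of greedy clustering on fractional identity. It assumes the sequences are already aligned to the same length, which sidesteps the alignment step entirely. The function names, the threshold and the sequences are invented for illustration and are not part of any MSD service.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    return sum(1 for p, q in zip(a, b) if p == q) / len(a)


def greedy_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Put each sequence in the first cluster whose first member it matches
    at or above the threshold; otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters


# Invented ten-residue sequences.
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHIKV", "MNPQRSTVWY"]
clusters = greedy_cluster(seqs, threshold=0.9)
print(clusters)
```

The third sequence is 90% identical to the second but only 80% identical to the first, so it starts its own cluster. Even in one dimension the grouping depends on which member is chosen to represent a cluster.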

Why ?

  • We wish to show difference and similarity

    • Shows evolutionary changes

    • Areas that do not change : critical to function

    • Shows variance

  • To visualise information rather than present data.

    • Show difference and similarity

    • Comparative analysis


  • The method of superposition depends on what we wish to observe

    • Structure : align by fold (difficult)

    • Sequence : align by sequence similarity (fast)

    • Function :

      • By environment residues (around ligand)

      • By active site residues (residues that do chemistry)

      • Atoms that do chemistry

      • By ligand (actually must be inhibitor !)

MSD clustering

  • Structure & Sequence

    • MSDfold is a service that will provide structure superposition by fold.

    • Hit list results from MSDpro are automatically superposed by structure and sequence for visualisation.

    • Function : MSDsite provides alignment by site environment and ligand.

MSDfold clustering

  • Pair-wise

    • To PDB / representative set

  • Multiple structure alignment


Clustering – structure/sequence

[Diagram labels: known alignments; on-the-fly sequence alignment; hit list to align; list of groups; view list; matrices to align structures; list of files to view]


Clustering – by function


  • MSDsite multi-view

    • Search by ligand/environment

    • View superposed

      • By ligand

      • By sequence pattern (PROsite)

      • By environment

Clustering by occurrence

  • Data mining (i.e. discovery-driven data analysis)

    • Unbiased by pre-conceived ideas

    • Finds unknown new clusters

    • True mean & SD
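
As a small illustration of summarising a mined cluster by its centre and spread, the sketch below computes the mean and sample standard deviation of one measured quantity per cluster. The cluster names and values are invented, and the quantity is assumed to be linear (for example a distance); angles would need circular statistics instead.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarise(clusters: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return (mean, sample standard deviation) for each cluster."""
    summary: Dict[str, Tuple[float, float]] = {}
    for name, values in clusters.items():
        # stdev needs at least two observations; report 0.0 for singletons.
        sd = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), sd)
    return summary


# Invented observations of one linear quantity per cluster member.
observed = {
    "feature_a": [3.8, 4.0, 4.2],
    "feature_b": [6.1, 5.9],
    "feature_c": [5.0],
}
summary = summarise(observed)
for name, (centre, spread) in summary.items():
    print(f"{name}: mean={centre:.2f} sd={spread:.2f}")
```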

Clustering: by occurrence

  • Data mined results using statistical analysis of protein local structure.

    • Returns common local features (2000)

    • Many associated with ligands

    • Loaded within DB – query system under development

    • True statistical distribution (centre + variance)

    • Found many new “features”

  • Local fold structure annotation

    • (James Milner White)


  • Representative sets

    • SCOP and CATH sets provided

    • Depends what you want to do

  • Clustering

    • All of our services have provision for similarity searching and clustering

    • Forms the basis of comparative analysis