Representative sets and Clustering.


Tom Oldfield

Representative sets

A subset of data that provides a statistically valid sample set for the complete data.

A set of structure fragments that best represents the “protein databank” or “protein space” during data analysis.

What is the PDB ?
  • The Protein Data Bank (PDB) is a collection of experimental data.
    • Approx. 80 % from X-ray crystallography*
    • Approx. 20 % from NMR
    • Rest (!) are models, and other techniques
    • *Asymmetric units
Which really means…
  • The structures deposited are almost exclusively the product of “hypothesis-driven data analysis”:
    • What will make pharmaceutical companies money as target structures.
    • What research can be justified to obtain grant money from the research councils.
    • A “great” idea for a PhD project (we have crystallised/solubilised it)
Hypothetical proteins…
  • Structural genomics : solving the structures of all the ORFs within a genome.
    • OK, only the ones that we can clone, express, purify and crystallise/solubilise…
  • So far a very small number.
Why representative sets ?
  • There are (will be) too many structures
  • Proteins just get solved many times
    • Comparative research

Lysozyme was used in a systematic survey to study the structural effect of mutating each residue.

    • Competitive research
    • Get solved better as techniques improve
    • Degeneracy within protein fold space
Problem 1 : experiment
  • The whole PDB is not a representative set

It is a list of solutions to experiments.

  • The NMR and X-ray data have their own statistical basis :

It is difficult to use both kinds of data in some analyses.

  • If a protein does not crystallise and is larger than about 25 kDa (no crystal for X-ray, too large for NMR), there is no structure

Fold space is biased by experiment – no membrane proteins.

Problem 2a : error
  • All experiments result in error
  • Not all proteins are equal
    • The amount of data collected affects the accuracy (nearness to the truth) of a structure.
    • Crystallography and NMR do not allow direct deduction of a protein structure from the data

80 % of the information for an X-ray structure is unknown (the phase problem)

    • Not all parts of a structure solution are equal.
  • We need to select the best ones !
Problem 2b : Least error ?
  • X-ray : best resolution / Free-R
  • NMR : minimal violation list.
  • Best geometry :
    • You must not define structure quality by a quantity that the experimental procedure itself targets (!)
  • Date : maximum likelihood (ML) refinement is less biased than least squares (LSQ)
  • Mutations : are not “natural products”

Another story !
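As a minimal sketch of the selection rules on this slide, assume each candidate X-ray entry carries a resolution and a free R-factor. The record layout, field names, identifiers and values below are hypothetical and are not taken from the PDB or MSD. One could then keep, for each protein, the wild-type entry with the best resolution, using Free-R to break ties:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical summary of one X-ray structure (illustrative fields only)."""
    code: str          # made-up identifier, not a real PDB code
    protein: str       # name of the protein the entry describes
    resolution: float  # in angstroms; lower is better
    r_free: float      # free R-factor; lower is better
    mutant: bool       # True if the entry is an engineered mutant


def pick_representatives(entries):
    """Return one entry per protein: wild type only, best resolution, then best R-free."""
    best = {}
    for e in entries:
        if e.mutant:
            continue  # mutants are not "natural products", so skip them
        current = best.get(e.protein)
        if current is None or (e.resolution, e.r_free) < (current.resolution, current.r_free):
            best[e.protein] = e
    return best


entries = [
    Entry("xxx1", "lysozyme", 1.9, 0.24, False),
    Entry("xxx2", "lysozyme", 1.5, 0.21, False),
    Entry("xxx3", "lysozyme", 1.2, 0.19, True),   # best numbers, but a mutant
    Entry("xxx4", "myoglobin", 2.0, 0.25, False),
]
reps = pick_representatives(entries)
for protein, e in sorted(reps.items()):
    print(protein, e.code, e.resolution, e.r_free)
```

Geometry is deliberately left out of the ranking key, in line with the warning above about judging quality by something the procedure itself targets.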

Problem 3 : evolution
  • “Evolution” usually solves a problem only once : the problem is a particular structure/function
    • Each protein is a collection of bits of structure that work.
    • These structural bits are “domains” (one definition of a domain anyway).
    • Some proteins share domains, some proteins are many copies of the same/different domains.
  • A useful bit of structure will be found everywhere
Problem 4 : Statistics and Lies
  • You should not classify objects using a parameter you wish to study.
    • Current representative sets are classified by fold.
    • You should not use them to study fold !
  • The Domain problem - there is no mathematical definition of (or agreement on) a domain : fold classification is non-deterministic (i.e. there is more than one !)
  • Proteins share fold fragments, so protein fold space is “non-transitive” : xRy & xRz does not imply yRz.
  • Discrete/bounded or continuous/unbounded – discuss

Current Fold space is “not-closed”
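The non-transitivity point can be illustrated with a toy example. This is only a sketch: the three proteins and their domain labels are invented, and the relation R is assumed to mean "shares at least one domain".

```python
from itertools import permutations

# Hypothetical domain content of three proteins (labels are made up).
domains = {
    "x": {"d1", "d2"},
    "y": {"d1"},
    "z": {"d2"},
}


def related(a, b):
    """R: two proteins are related if they share at least one domain."""
    return bool(domains[a] & domains[b])


def transitivity_violations(names):
    """List triples (x, y, z) with xRy and xRz but not yRz."""
    return [
        (x, y, z)
        for x, y, z in permutations(names, 3)
        if related(x, y) and related(x, z) and not related(y, z)
    ]


violations = transitivity_violations(domains)
print(violations)
```

Here x is related to both y and z, yet y and z share nothing, so a grouping built by chaining such relations would put y and z together even though they have no fold fragment in common.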

Problem 5 : Species
  • The PDB is a collection of experiments on convenient organisms
  • Different species may have different biochemical pathways
    • Two similar structures (from different species) may have different functions.
    • The best example structure may not have biochemical relevance.
Problem 6: To do what ?
  • Active sites, chemistry

Should use all the structures with that site.

  • Overall Fold analysis

Representatives selected by non-fold analysis

  • Local structure – depends
    • Fold-based representative sets
    • Non-fold-based representative sets
    • All known examples
  • Sequence
Representative sets
  • MSD provides the SCOP and CATH representative sets. These are published, accepted standards.
    • You can use these as the basis set for queries in MSDLite and MSDpro
    • They do have limited use
  • Make your own ?

MSDmine has the facility to define your own list

Clustering

A group of similar things

structure/sequence/function

Clustering
  • Grouping by similarity
  • Sequence

Moderately easy (direct solution), well defined and fast (1D).

  • Structure

Difficult (iterative & non-exhaustive, non-transitive data, multiple solutions, non-closed data)

  • Function

Needs biological/chemical knowledge first
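To show why the sequence case is the easy one, here is a minimal sketch of greedy clustering by sequence identity. It assumes the sequences are already aligned to the same length, and the names, sequences and 0.75 threshold are invented for illustration. It is not the method used by any MSD service.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(p == q for p, q in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose representative it matches.

    seqs maps a name to a sequence. The first member of each cluster acts
    as its representative, so the result depends on the input order.
    """
    clusters = []  # list of lists of names; clusters[i][0] is the representative
    for name, seq in seqs.items():
        for members in clusters:
            if identity(seq, seqs[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])  # no cluster matched: start a new one
    return clusters


seqs = {
    "a": "MKTAYIAK",
    "b": "MKTAYIAR",
    "c": "GSHMLEDP",
    "d": "MKTGYIAK",
}
clusters = greedy_cluster(seqs, threshold=0.75)
print(clusters)
```

Each comparison is a single pass over a 1D string, which is what makes this direct and fast. The structural case has no equivalent shortcut because the superposition itself must be searched for.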

Why ?
  • We wish to show difference and similarity
    • Shows evolutionary changes
    • Areas that do not change : critical to function
    • Shows variance
  • To visualise information rather than present data.
    • Show difference and similarity
    • Comparative analysis
How
  • The method of superposition depends on what we wish to observe
    • Structure : align by fold (difficult)
    • Sequence : align by sequence similarity (fast)
    • Function :
      • By environment residues (around ligand)
      • By active site residues (residues that do chemistry)
      • Atoms that do chemistry
      • By ligand (actually must be inhibitor !)
MSD clustering
  • Structure & Sequence
    • MSDfold is a service that will provide structure superposition by fold.
    • Hit-list results from MSDpro are automatically superposed by structure and sequence for visualisation.
  • Function
    • MSDsite provides alignment by site environment and ligand.
MSDfold clustering
  • Pair-wise
    • To PDB / representative set
  • Multiple structure alignment

Clustering – structure/sequence

[Diagram: client/server workflow. Labels: DB, known alignments, grouping, on-the-fly sequence alignment (FastA), hit list to align, list of groups, view list, matrices to align structures, list of files to view; split into server and client sides.]

Clustering – by function

  • MSDsite multi-view
    • Search by ligand/environment
    • View superposed
      • By ligand
      • By sequence pattern (PROSITE)
      • By environment
Clustering by occurrence
  • Data mining (i.e. discovery-driven data analysis)
    • Unbiased by pre-conceived ideas
    • Finds unknown new clusters
    • True mean & SD
Clustering : by occurrence
  • Results data-mined by statistical analysis of protein local structure.
    • Returns common local features (2000)
    • Many associated with ligands
    • Loaded within DB – query system under development
    • True statistical distribution (centre + variance)
    • Found many new “features”
  • Local fold structure annotation
    • (James Milner White)
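The idea of clustering by occurrence, then reporting a centre and a variance for each cluster, can be sketched as follows. This is an assumption-laden toy rather than the method behind the results above: it bins hypothetical (phi, psi) backbone angle pairs on a fixed grid, keeps cells that occur often enough, and ignores angle wrap-around at ±180 degrees. The function name, bin size and data are all invented.

```python
from collections import defaultdict
from statistics import mean, stdev


def occurrence_clusters(points, bin_size=30.0, min_count=3):
    """Group (phi, psi) pairs by grid cell and summarise well-populated cells.

    Returns {cell: (count, (mean_phi, mean_psi), (sd_phi, sd_psi))}.
    Angle wrap-around at +/-180 degrees is ignored in this sketch.
    """
    if min_count < 2:
        raise ValueError("min_count must be at least 2 to compute a standard deviation")

    cells = defaultdict(list)
    for phi, psi in points:
        cell = (int(phi // bin_size), int(psi // bin_size))
        cells[cell].append((phi, psi))

    summary = {}
    for cell, members in cells.items():
        if len(members) < min_count:
            continue  # too rare to count as a recurring feature
        phis = [p for p, _ in members]
        psis = [s for _, s in members]
        summary[cell] = (
            len(members),
            (mean(phis), mean(psis)),
            (stdev(phis), stdev(psis)),
        )
    return summary


points = [
    (-62.0, -41.0), (-63.0, -45.0), (-65.0, -40.0), (-68.0, -47.0),  # helix-like
    (-110.0, 130.0), (-115.0, 135.0), (-105.0, 128.0),               # strand-like
    (60.0, 45.0),                                                    # lone point
]
summary = occurrence_clusters(points)
for cell, (count, centre, spread) in sorted(summary.items()):
    print(cell, count, centre, spread)
```

No class labels go in, so the clusters come from the data alone. A fixed grid does still impose its own bin boundaries, which a real discovery-driven analysis would want to avoid.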
Summary
  • Representative sets
    • SCOP and CATH sets provided
    • Depends what you want to do
  • Clustering
    • All of our services have provision for similarity searching and clustering
    • Forms the basis of comparative analysis