Computational Approaches for Identifying Functional Signatures in Protein Structures

Identifying Functional Signatures in Proteins - a computational design approach
David Bernick, Rohl group, 16-Mar-2005

The big picture
• what is function?
  • hinges
  • substrate/DNA/protein binding/alignment/recognition
  • catalytic sites
• what isn't function? (structure)
  • secondary structures
  • fold architecture
  • thermodynamically required elements
• nature selects for function (structure is implicit)
• computational methods select for structure
• can we predict…quickly?

Some terms
• pssm - position-specific scoring matrix: a [20 x length] model of residue frequencies for every position of a sequence family
• homolog - natural sequence evolved from a common parent
• morpholog - computationally derived sequence generated from a parent structure
• ortholog - common ancestor, derived by speciation (constrained functional divergence)
• paralog - common ancestor, derived by duplication within the same species (unconstrained functional divergence)

pssm from an alignment
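
As a rough illustration of the idea on this slide, the sketch below builds a [20 x length] matrix from a gapped alignment. It is not the scoring scheme used in the talk: the pseudocount, the uniform background, the log-odds form, and the names `column_frequencies`, `log_odds`, and `toy_alignment` are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_frequencies(alignment, pseudocount=1.0):
    """Per-column residue frequencies; gaps and unknown symbols are ignored.

    The additive pseudocount is an assumption of this sketch, used so that
    unobserved residues do not get a frequency of zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def log_odds(freqs, background=None):
    """Convert frequencies to log2 odds against a background distribution.

    A uniform background (1/20 per residue) is assumed when none is given.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [
        {aa: math.log2(col[aa] / background[aa]) for aa in AMINO_ACIDS}
        for col in freqs
    ]


# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["ACDE", "ACDF", "AGDE", "A-DE"]
toy_freqs = column_frequencies(toy_alignment)
pssm = log_odds(toy_freqs)
```

The same routine could in principle be applied to a homolog alignment (natural sequences) or to a set of morpholog sequences produced by design, giving two matrices of the same shape that can be compared position by position.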

structure ensembles • Larson (2003) - Improved homology searches • Pei (2003) - Homology detection and active site searches • Kuhlman (2000) - Structural optimality of natural sequences

Results - SH3 domain • 11 structures • 62 additional sequences

Results - S100 domain • Ca++ loop1 not detected: backbone-coordinated residues • Ca++ loop2 not detected: insufficient homolog depth • 11 structures • 30 additional sequences

the protocol [flowchart; recoverable labels] • sequence • blast • representative structure • CE + SCOP, Taylor Doms • paralog structures • flexible design, fixed design • cogs, pfam, reverse blast • homolog alignment • pssmH, pssmM • score: statistical, geometric

genome scale • high cost step - producing pssmM • precalculate pssmM for every domain

morpholog pssms - genome scale
• Data sources
  • Taylor parsed domain database
  • CE all-to-all + SCOP
• Precompute pssms for every domain
  • ~8000 domains
  • 100 sequences: ~90% diversity
  • 1000 sequences: ~99% diversity
  • ~4-8 wks, 70p cluster for initial set

scoring
• compare pssmH to pssmM
  • pssmM contains only structure signal
  • pssmH contains both function and structure
• each position represents a count-normalized position in 20-space (H or M)
• R - average aa position
• R_H and R_M define 20-space vectors
  • 'function vector'
  • 'structure vector'
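
One possible reading of the geometric comparison is sketched below: each column of the homolog and morpholog matrices is count-normalized into a point in 20-space, and the two points are compared by Euclidean distance. The slide does not specify the actual metric, so the distance measure, the function names (`normalize`, `difference_vector`, `score_positions`), and the toy counts are assumptions for illustration only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def normalize(counts):
    """Turn a residue -> count mapping into a count-normalized 20-vector."""
    total = sum(counts.get(aa, 0.0) for aa in AMINO_ACIDS)
    if total <= 0:
        raise ValueError("column has no residue counts")
    return [counts.get(aa, 0.0) / total for aa in AMINO_ACIDS]


def difference_vector(h, m):
    """Component-wise H minus M for two 20-vectors.

    Under the slide's premise that M carries only structure signal, this
    difference is one candidate for the non-structural part of H.
    """
    return [a - b for a, b in zip(h, m)]


def score_positions(counts_h, counts_m):
    """Euclidean distance between the H and M vectors at each position."""
    if len(counts_h) != len(counts_m):
        raise ValueError("the two matrices must cover the same positions")
    scores = []
    for col_h, col_m in zip(counts_h, counts_m):
        diff = difference_vector(normalize(col_h), normalize(col_m))
        scores.append(math.sqrt(sum(d * d for d in diff)))
    return scores


# Hypothetical counts for two positions, for illustration only.
toy_h = [{"G": 9, "A": 1}, {"L": 5, "I": 5}]
toy_m = [{"A": 4, "S": 3, "G": 3}, {"L": 5, "I": 5}]
scores = score_positions(toy_h, toy_m)
```

In this toy case the first position, where the homolog column is dominated by a residue the morpholog column does not favor, scores higher than the second position, where the two columns agree. Positions that score high under such a measure would be candidates for functional rather than structural constraint.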

next steps • complete this set of domains - verification • full domain pssmM generation

acknowledgements • Carol Rohl • Kevin Karplus • Craig Lowe • Rohl group • HP
