Addressing Protein Crystallization Bottlenecks by Screening Multiple Homologs

Lukasz Jaroszewski, Lukasz Slabinski, John Wooley, Ian A. Wilson, Ashley M. Deacon, Scott A. Lesley, and Adam Godzik
The Protein Structure Initiative "Bottlenecks" Workshop, April 14-16, 2008, Bethesda

The TargetDB database provides the first large and diverse learning sets for studying protein “production” and crystallization
• Protein “production” learning set (from cloning to purified protein)
  • Positive subset (successes): 12,850 targets listed as purified in TargetDB by PSI centers
  • Negative subset (failures): 13,587 targets:
    • all stopped targets that were listed as cloned, but not purified
    • all targets that were cloned, but not purified, and did not show any further progress after 18 months
• Protein crystallization learning set
  • Positive subset (successes): 3,140 protein structures solved by X-ray crystallography by PSI centers
  • Negative subset (failures): 5,819 targets:
    • all stopped targets listed as purified, but not crystallized, and not assigned to NMR
    • all targets that were purified and did not show any progress for more than 18 months
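The labeling rules for the crystallization learning set can be sketched in code. This is only an illustration under stated assumptions: the `Target` record and its field names are hypothetical and do not reflect the actual TargetDB schema, and "18 months" is approximated as 540 days.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumption: "18 months" approximated as 18 * 30 days.
STALL_DAYS = 18 * 30


@dataclass
class Target:
    """Hypothetical, simplified target record (not the real TargetDB schema)."""
    target_id: str
    purified: bool
    crystallized: bool
    solved_by_xray: bool
    assigned_to_nmr: bool
    stopped: bool
    last_progress: date


def crystallization_label(t: Target, today: date) -> Optional[bool]:
    """Return True (success), False (failure), or None (not in the learning set)."""
    if t.solved_by_xray:
        return True
    if not t.purified or t.crystallized:
        return None
    # Failure rule 1: stopped, purified, not crystallized, not assigned to NMR.
    stopped_not_nmr = t.stopped and not t.assigned_to_nmr
    # Failure rule 2: purified with no progress for more than 18 months.
    stalled = (today - t.last_progress).days > STALL_DAYS
    if stopped_not_nmr or stalled:
        return False
    return None
```

Targets that are still active and recently updated get `None`, so they stay out of both subsets rather than being counted as failures.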

The ability to crystallize is correlated only for very close homologs…

…while difficulties with crystallization are correlated for more distantly related proteins

Probability distributions of protein “production” (stages from cloning to purified protein)

Probability distributions of protein crystallization

It is possible to combine individual probabilities into one estimate of crystallization probability (“crystallization score”)
We used a method called a logarithmic opinion pool, known from financial risk analysis. The probability of protein crystallization is estimated by the product of individual probabilities, each raised to its weight:

$$P = k \prod_{i=1}^{n} p_i^{w_i}$$

• k: normalizing constant
• p_i: individual probability distributions, such as P_length, P_pI, P_GRAVY, etc.
• n: number of individual probability distributions
• w_i: weights of the individual probability distributions (we used all weights equal to 1/n, since the size of the learning sets did not allow optimization of individual weights)
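A minimal sketch of how a logarithmic opinion pool could combine per-feature probabilities into a single score. The function name `crystallization_score` is hypothetical, and the normalization is an assumption: the slide only says that k is a normalizing constant, so here k is chosen so that the pooled "crystallizes" and "does not crystallize" values sum to 1.

```python
import math
from typing import Optional, Sequence


def crystallization_score(probs: Sequence[float],
                          weights: Optional[Sequence[float]] = None,
                          eps: float = 1e-9) -> float:
    """Pool per-feature success probabilities with a logarithmic opinion pool."""
    n = len(probs)
    if n == 0:
        raise ValueError("at least one probability is required")
    if weights is None:
        weights = [1.0 / n] * n  # equal weights 1/n, as stated on the slide
    if len(weights) != n:
        raise ValueError("weights and probs must have the same length")

    # Clamp to avoid log(0); work in log space for numerical stability.
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    log_yes = sum(w * math.log(p) for w, p in zip(weights, clamped))
    log_no = sum(w * math.log(1.0 - p) for w, p in zip(weights, clamped))

    # Assumed normalization over the two outcomes (this plays the role of k).
    m = max(log_yes, log_no)
    yes = math.exp(log_yes - m)
    no = math.exp(log_no - m)
    return yes / (yes + no)
```

With equal weights the pooled value is a normalized geometric mean, so a single very unfavorable feature pulls the score down more strongly than it would in an arithmetic average.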

Jack-knife tests confirm that the crystallization score has predictive power.
• Sc: learning set rank-ordered by the crystallization score derived from the same set
• Scs: crystallization score derived from the data from four large PSI centers (JCSG, MCSG, NESG, and NYSGXRC) and used to rank-order targets from all other centers (BSGC, BCGI, CESG, ISFI, OPPF, S2F, SECSG, SGPP, SPINE-EU, YSG, TB, and RSGI)
• Scl: the reverse of Scs (score derived from the other centers and used to rank-order targets from the four large centers)
• Scb: crystallization score used to rank-order targets deposited in TargetDB after the crystallization score was derived
Since the sets of targets used in the tests have different average success rates (from 33% to 41%), the normalized plot is shown in the inset.

Crystallization score can be used to split targets from TargetDB into classes with different success rates
Figure: protein crystallization
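Splitting targets into classes by score and comparing success rates could look like the sketch below. The class boundaries used in the talk are not given, so equal-size rank bins are an assumption here, and `success_rate_by_class` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple


def success_rate_by_class(scored: Sequence[Tuple[float, bool]],
                          n_classes: int = 5) -> List[float]:
    """Rank targets by score (best first), cut into equal-size classes,
    and return the fraction of successes in each class.

    `scored` holds (crystallization_score, crystallized) pairs.
    Fewer than n_classes values are returned if there are too few targets.
    """
    if not scored or n_classes < 1:
        raise ValueError("need at least one target and one class")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    size = math.ceil(len(ranked) / n_classes)
    rates = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        rates.append(sum(1 for _, ok in chunk if ok) / len(chunk))
    return rates


# Toy data, invented purely for illustration (not TargetDB values).
toy = [(0.9, True), (0.8, True), (0.7, True), (0.6, False),
       (0.3, True), (0.2, False), (0.1, False), (0.05, False)]
rates = success_rate_by_class(toy, n_classes=2)
```

A score with predictive power should give success rates that fall from the first class to the last, which is the pattern the jack-knife tests look for.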

Each completely sequenced genome brings more suitable targets for about 7 protein families
Plot series: Pfam families without structures; all Pfam families

Broad distribution of crystallization classes in protein families allows promising targets to be found in many “difficult” families. The number of structures solved from a family is correlated with the number of “crystallizable” targets from that family.

Assessing the bias introduced in representative structures from protein families by crystallizability
• Distributions of protein features calculated for:
  • Sequences of microbial members of Pfam families without any solved structures (red)
  • Sequences of microbial members of Pfam families with at least one solved structure (green)
  • Sequences of solved members of Pfam families (blue)
  • Actual sequence constructs of solved members of Pfam families (black)

The distribution of crystallizability classes in microbial genomes is more even than in protein families.

JCSG crystallization score distribution also confirms high complementarity of X-ray and NMR

Requirements for optimal NMR targets are different from requirements for optimal X-ray targets
Specific sequence features that increase the cost of solving a protein structure by NMR
Figure: all solved structures vs. NMR-solved proteins (from PDB)

As expected, solving structures from “very difficult” families more often requires nontrivial construct design or use of NMR
Panels: structures from the very difficult group of families; structures from the optimal group of families

Target selection and optimization server XtalPred

Summary
• Crystallization screening of multiple homologous sequences is justified by the observation that the probability of crystallization is correlated only for very close homologs.
• Based on the statistics derived from the TargetDB database, it is now possible to estimate crystallization probability and separate proteins into “crystallizability classes” with different crystallization success rates.
• Most protein families contain proteins from different crystallizability classes.
• Continuous growth of available sequence data helps in crystallization efforts by providing promising targets from “difficult” protein families.

PSI success rate per protein family is several times higher than success rate per protein
• 26 months after the first PSI target draft:
• 1369 Pfam families initially identified: 352 are now solved (203 by 4 large PSI centers)
• JCSG prioritized 742 families in 3 categories, 268 solved (36%)
  • 1-250 optimal: 125 solved (50%)
  • 251-500 suboptimal: 75 solved (30%)
  • 500-742 difficult: 68 solved (27%)
• 742-1369 very difficult: 84 solved (13%)

Impact of the PSI on structural coverage of protein families annotated in the Pfam database
• Since start of PSI1:
  • solved structures: 2894
  • solved Pfam families: 611 (25% of the worldwide total)
  • solved large Pfam families: 216 (20% of the worldwide total)
• Since start of PSI2:
  • solved structures: 1521
  • solved Pfam families: 312 (43% of the worldwide total)
  • solved large Pfam families: 76 (35% of the worldwide total)
(The rapid growth of the PSI contribution in 2007 is partly an effect of the slow release of non-PSI structures.)

Scientific Advisory Board
• Sir Tom Blundell, Univ. Cambridge
• Homme Hellinga, Duke University Medical Center
• James Naismith, The Scottish Structural Proteomics facility, Univ. St. Andrews
• James Paulson, Consortium for Functional Glycomics, The Scripps Research Institute
• Robert Stroud, Center for Structure of Membrane Proteins, Membrane Protein Expression Center, UCSF
• Soichi Wakatsuki, Photon Factory, KEK, Japan
• James Wells, UC San Francisco
• Todd Yeates, UCLA-DOE, Inst. for Genomics and Proteomics

UCSD & Burnham Bioinformatics Core
John Wooley, Adam Godzik, Lukasz Jaroszewski, Slawomir Grzechnik, Sri Krishna Subramanian, Andrew Morse, Tamara Astakhova, Lian Duan, Piotr Kozbial, Dana Weekes, Natasha Sefcovic, Prasad Burra, Josie Alaoen, Cindy Cook

GNF & TSRI Crystallomics Core
Scott Lesley, Mark Knuth, Heath Klock, Dennis Carlton, Thomas Clayton, Kevin D. Murphy, Christina Trout, Marc Deller, Daniel McMullan, Polat Abdubek, Claire Acosta, Linda M. Columbus, Julie Feuerhelm, Joanna C. Hale, Thamara Janaratne, Hope Johnson, Edward Nigoghossian, Linda Okach, Sebastian Sudek, Aprilfawn White, Ylva Elias, Glen Spraggon, Bernhard Geierstanger, Sanjay Agarwalla, Charlene Cho, Bi-Ying Yeh, Anna Grzechnik, Jessica Canseco, Mimmi Brown

Stanford/SSRL Structure Determination Core
Keith Hodgson, Ashley Deacon, Mitchell Miller, Herbert Axelrod, Hsiu-Ju (Jessica) Chiu, Kevin Jin, Christopher Rife, Qingping Xu, Silvya Oommachen, Henry van den Bedem, Scott Talafuse, Ronald Reyes, Abhinav Kumar, Christine Trame, Debanu Das

TSRI NMR Core
Kurt Wüthrich, Reto Horst, Maggie Johnson, Amaranth Chatterjee, Michael Geralt, Wojtek Augustyniak, Pedro Serrano, Bill Pedrini, William Placzek

Ex officio founding members
• Raymond Stevens, TSRI
• Susan Taylor, UCSD
• Peter Kuhn, SSRL/TSRI
• Duncan McRee, TSRI/Syrrx

TSRI Administrative Core
Ian Wilson, Marc Elsliger, Gye Won Han, David Marciano, Henry Tien, Lisa van Veen

The JCSG is supported by the NIH Protein Structure Initiative (PSI) Grant U54 GM074898 from NIGMS (www.nigms.nih.gov).

JCSG Annual Meeting 2007
