How Much Computation is Enough?
Curt M. Breneman*, N. Sukumar, Mike Krein, Matt Sundling, Jed Zaretzki, and Margaret McLellan
August 18th, 2008

First, we need to define "acceptable" or "useful" modeling results…
• Those that come from the best theory and most computing possible?
• Those that yield "experimental accuracy"?
• How about just those that permit good decisions to be made?
Is more computing always better?
• What is the level of diminishing return?
• Are there any other downsides?
Human effort vs. machine effort?
• Human effort (software tricks, cleverness, expert domain knowledge)
• Machine effort (brute-force computation)
Temptations of Big Hardware: Over-computation!
• Do we sometimes use more computer resources simply because they are there?
• What are the real costs of over-computation?
• Is it possible to get a "wrong" answer at higher levels of theory, more sophisticated machine learning or longer simulation times?
How Much is Enough?

What is the level of diminishing return?

Is more computing really better? Not always…
Source: A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning, 40, 203-229 (2000)

Other Considerations…
• Garbage In, Garbage Out
• Accuracy vs. precision
[Figure panels: "High accuracy, low precision"; "High precision, low accuracy"]
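The accuracy/precision distinction can be put in numbers. The following is a minimal sketch, assuming accuracy is measured as the bias of the mean against a known reference value and precision as the sample standard deviation; the function name and both data sets are hypothetical and chosen only to mirror the two cases named on the slide.

```python
from statistics import mean, stdev

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread).

    bias   = mean - true value  (close to 0 means accurate)
    spread = sample std. dev.   (small means precise)
    """
    bias = mean(measurements) - true_value
    spread = stdev(measurements)
    return bias, spread

TRUE_VALUE = 10.0  # hypothetical reference value

# Hypothetical results: centered on the truth but widely scattered.
accurate_not_precise = [8.0, 12.0, 9.0, 11.0, 10.0]
# Hypothetical results: tightly clustered but offset from the truth.
precise_not_accurate = [12.1, 12.0, 11.9, 12.0, 12.0]

bias_a, spread_a = accuracy_and_precision(accurate_not_precise, TRUE_VALUE)
bias_p, spread_p = accuracy_and_precision(precise_not_accurate, TRUE_VALUE)

print(f"high accuracy, low precision: bias={bias_a:+.2f}, spread={spread_a:.2f}")
print(f"high precision, low accuracy: bias={bias_p:+.2f}, spread={spread_p:.2f}")
```

More computation typically tightens the spread (better-converged numbers); it does nothing for the bias if the underlying model or input data are off.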

Brute-force computation “A primitive computing style in which the scientist relies on the computer's processing power instead of using his/her own intelligence to simplify the problem, often ignoring problems of scale and applying naive methods suited to small problems directly to large ones.” -Anon Sometimes, unfortunately, there is no better general solution than brute force. "When in doubt, use brute force" ― Ken Thompson, co-inventor of Unix.

Computational Center for Nanotechnology Innovations (CCNI) @ RPI
One of the world’s most powerful university-based supercomputing centers … $100 million IBM Blue Gene supercomputer that will operate at more than 90 peak teraflops…

TOP500 List - June 2007 (http://www.top500.org/list/2007/06/100)

Rank  Site                       Computer             Rmax (TFlops)  Rpeak (TFlops)
1     DOE/NNSA/LLNL              IBM BlueGene         280.60         367.00
2     Oak Ridge National Lab     Jaguar Cray          101.70         119.35
3     Sandia National Lab        Cray Red Storm       101.40         127.41
4     IBM Thomas J. Watson       IBM Blue Gene         91.29         114.69
5     Stony Brook/BNL            IBM Blue Gene         82.16         103.22
6     DOE/NNSA/LLNL              IBM ASC Purple        75.76          92.78
7     RPI CCNI                   IBM Blue Gene         73.03          91.75
8     NCSA                       Dell PowerEdge        62.68          89.59
9     Barcelona                  IBM Cluster           62.63          94.21
10    Leibniz Rechenzentrum      SGI HLRB-II Altix     56.52          62.26

Over-computation: Do we sometimes use more computer resources simply because they are there?

Sent to CCL:
“Dear Friends, I did some FCI/cc-pvQZ calculations for H2+ system using Molpro program. I calculated the energy for various distances of H atoms with a step size of 0.1 atomic units. The Potential energy profile showed a minimum at around 2.074 atomic units. However, I couldn't optimize the geometry at this level to get the optimized structure and energy….”

********************************************************
FCI             UHF-SCF
-.60223115      -.60223115
********************************************************

RESPONSE:
“Dear …, H2+ only has 1 electron so doing a FCI is meaningless since there is no electron correlation in this system. A UHF or ROHF calculation is the exact solution with this basis set…”
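The response can be made concrete by counting determinants. This is a minimal sketch, assuming the usual count of Slater determinants for a full CI expansion, C(n, N_alpha) x C(n, N_beta) for n spatial orbitals (ignoring symmetry reduction); the function name and the orbital count of 60 are illustrative assumptions, not values taken from the Molpro run.

```python
from math import comb

def fci_determinants(n_orbitals, n_alpha, n_beta):
    """Number of Slater determinants in a full CI expansion (no symmetry reduction)."""
    return comb(n_orbitals, n_alpha) * comb(n_orbitals, n_beta)

n_orbitals = 60  # hypothetical basis-set size, for illustration only

# H2+: one alpha electron, no beta electrons.
h2_plus = fci_determinants(n_orbitals, n_alpha=1, n_beta=0)
# H2: one alpha and one beta electron.
h2 = fci_determinants(n_orbitals, n_alpha=1, n_beta=1)

print("H2+ determinants:", h2_plus)
print("H2  determinants:", h2)
```

For one electron the determinant count equals the number of orbitals: every "determinant" is just a single occupied orbital, so full CI diagonalizes the same one-electron problem that the SCF calculation already solves in that basis, which is why the two energies in the post are identical.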

Is it possible to get a "wrong" answer at higher levels of theory?

Ground-state configurations:
Sc        Ti        V         Cr        Mn        Fe        Co        Ni        Cu
4s2 3d1   4s2 3d2   4s2 3d3   4s1 3d5   4s2 3d5   4s2 3d6   4s2 3d7   4s2 3d8   4s1 3d10

Hartree–Fock    Non-Relativistic    Relativistic
Sc 4s2 3d1      −759.73571776       −763.17110138
Sc 4s1 3d2      −759.66328045       −763.09426510
For Sc, both non-relativistic and relativistic ab initio calculations correctly compute that the 4s2 configuration has the lowest energy, in accordance with experimental data.

Hartree–Fock    Non-Relativistic    Relativistic
Cr 4s1 3d5      −1043.14175537      −1049.24406264
Cr 4s2 3d4      −1043.17611655      −1049.28622286
Both non-relativistic and relativistic Hartree–Fock calculations fail to predict the experimentally observed 4s1 3d5 ground-state configuration.

Hartree–Fock    Non-Relativistic    Relativistic
Cu 4s1 3d10     −1638.96374169      −1652.66923668
Cu 4s2 3d9      −1638.95008061      −1652.67104670
For Cu, a non-relativistic calculation gives the correct result (4s1 3d10), but including relativistic effects gives the wrong prediction. Relativistically, one predicts the opposite order of stabilities from that observed experimentally!

Eric Scerri, “Just how ab initio is ab initio quantum chemistry?” Foundations of Chemistry 6(1) Jan. 2004
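The comparison on this slide reduces to picking the lower of two total energies for each atom and level of theory. A minimal sketch using the Hartree–Fock energies quoted above (assumed here to be in hartree, which the slide does not state); the dictionary layout and function names are illustrative.

```python
# Energies copied from the slide; units assumed to be hartree.
HF_ENERGIES = {
    "Sc": {
        "observed": "4s2 3d1",
        "non-relativistic": {"4s2 3d1": -759.73571776, "4s1 3d2": -759.66328045},
        "relativistic": {"4s2 3d1": -763.17110138, "4s1 3d2": -763.09426510},
    },
    "Cr": {
        "observed": "4s1 3d5",
        "non-relativistic": {"4s1 3d5": -1043.14175537, "4s2 3d4": -1043.17611655},
        "relativistic": {"4s1 3d5": -1049.24406264, "4s2 3d4": -1049.28622286},
    },
    "Cu": {
        "observed": "4s1 3d10",
        "non-relativistic": {"4s1 3d10": -1638.96374169, "4s2 3d9": -1638.95008061},
        "relativistic": {"4s1 3d10": -1652.66923668, "4s2 3d9": -1652.67104670},
    },
}

def predicted_ground_state(energies):
    """Return the configuration with the lowest (most negative) total energy."""
    return min(energies, key=energies.get)

def check_predictions(table):
    """Map (atom, level of theory) -> True if the prediction matches experiment."""
    results = {}
    for atom, data in table.items():
        for level in ("non-relativistic", "relativistic"):
            predicted = predicted_ground_state(data[level])
            results[(atom, level)] = predicted == data["observed"]
    return results

results = check_predictions(HF_ENERGIES)
for (atom, level), correct in results.items():
    verdict = "matches experiment" if correct else "WRONG ground state"
    print(f"{atom:2s} {level:17s} {verdict}")
```

Note how small the deciding gaps are: for relativistic Cu the two configurations differ by less than 0.002 in a total energy of about 1652.67, so the ordering is easily flipped by whatever the method leaves out.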

Another (Over?) Computing Example: Predicting Sites of CYP-3A4 Metabolism
• When metabolized by CYP-3A4, specific regions of a molecule are oxidized selectively.
• The rate-limiting step involves hydrogen atom abstraction.
• The goal: develop a model to predict the group (composed of topologically equivalent hydrogens) that is most likely to be oxidized by the 3A4 isozyme, and to predict relative rates of oxidation between molecules.
[Figure: Lidocaine (groups of topologically equivalent hydrogens are designated by color)]

Creating a successful model of 3A4 oxidation selectivity: How much computation is enough?
• Size of dataset: 233 (50 train, 183 test)
• Question: Will the model improve if a Boltzmann distribution of conformations is considered?
• Not really…
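For readers unfamiliar with what "considering a Boltzmann distribution of conformations" costs and buys, here is a minimal sketch of Boltzmann-weighting a descriptor over conformers. It assumes relative conformer energies in kcal/mol and a temperature of 298.15 K; the energies and descriptor values are hypothetical, and this is not presented as the procedure used in the study.

```python
from math import exp

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(rel_energies_kcal, temperature=298.15):
    """Normalized Boltzmann populations for a list of conformer energies."""
    e_min = min(rel_energies_kcal)  # shift so the lowest conformer is at zero
    factors = [exp(-(e - e_min) / (R_KCAL * temperature)) for e in rel_energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]

def boltzmann_average(values, weights):
    """Population-weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))

# Hypothetical conformer energies (kcal/mol) and descriptor values.
conformer_energies = [0.0, 0.5, 2.0]
descriptor_values = [1.20, 1.35, 0.90]

weights = boltzmann_weights(conformer_energies)
averaged = boltzmann_average(descriptor_values, weights)

print("populations:", [round(w, 3) for w in weights])
print("single-conformer descriptor:", descriptor_values[0])
print("Boltzmann-averaged descriptor:", round(averaged, 3))
```

Because the lowest-energy conformer dominates the populations whenever the others lie more than about 1 kcal/mol above it, the averaged descriptor often ends up close to the single-conformer value, at the price of a full conformer search per molecule.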

Progress Report: New model under development (constrained dynamics/descriptor based)
• Method, for a given molecule:
  • For each potential site of metabolism, a 1.5 picosecond dynamics simulation with a forced constraint between the modification site and the active site of the 3A4 isozyme is performed.
  • Computed descriptors are based upon the final protein/ligand complex.
  • Metabolism sites are ranked according to descriptors and a user-defined heuristic.
• Size of dataset: 322 (results are heuristic, no training/testing)

Descriptors: The More is Not Always the Merrier
• Classic QSAR Example: ACE inhibitors
• Question: Do 3D descriptors contribute to model quality?
• What are the benefits of expensive, ab initio descriptors?
• Case Study:
  • Uses PLS models with 2D and 2D + PEST 3D descriptors
  • Models with 1 latent variable were chosen based on y-scrambled training data.
  • 56-molecule training set, 55-molecule test set; training/test split stratified by activity
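The y-scrambling check mentioned in the case study can be sketched as follows. This is a pure-Python illustration, assuming a one-latent-variable PLS1 model scored by training R^2; the synthetic data (56 samples, 5 descriptors), the function names, and the number of scrambling trials are all hypothetical and stand in for the ACE-inhibitor set, which is not reproduced here.

```python
import random

def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def pls1_one_component_r2(X, y):
    """Training R^2 of a one-latent-variable PLS1 model."""
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]  # centered columns
    yc = center(y)
    # Weight vector: proportional to X^T y (covariance of each descriptor with y).
    w = [sum(c * yi for c, yi in zip(col, yc)) for col in cols]
    norm = sum(wj * wj for wj in w) ** 0.5
    if norm == 0.0:
        return 0.0
    w = [wj / norm for wj in w]
    # Scores t = X w, then regress y on the single score vector.
    t = [sum(cols[j][i] * w[j] for j in range(p)) for i in range(n)]
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    ss_res = sum((yi - b * ti) ** 2 for ti, yi in zip(t, yc))
    ss_tot = sum(yi * yi for yi in yc)
    return 1.0 - ss_res / ss_tot

def y_scramble(X, y, n_trials=20, seed=0):
    """Refit the model on randomly permuted responses; return the R^2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(pls1_one_component_r2(X, shuffled))
    return scores

# Synthetic stand-in data: 56 "molecules", 5 descriptors, response = sum + noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(56)]
y = [sum(row) + data_rng.gauss(0, 0.1) for row in X]

real_r2 = pls1_one_component_r2(X, y)
scrambled_r2 = y_scramble(X, y)
print(f"real R^2 = {real_r2:.3f}, best scrambled R^2 = {max(scrambled_r2):.3f}")
```

If a model fitted to scrambled responses scores nearly as well as the real one, the model has more flexibility than the data can support; a large gap between the two is the evidence that the chosen complexity is reasonable.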

Descriptors: The More is Not Always the Merrier

Descriptors: The More is Not Always the Merrier MOE 2D Descriptors

Descriptors: The More is Not Always the Merrier
MOE 2D Descriptors + PEST 3D Descriptors
Force-field geometries, HF/STO-3G wavefunctions: 1,000x more expensive

Descriptors: The More is Not Always the Merrier
MOE 2D Descriptors + PEST 3D Descriptors
AM1 geometries, HF/3-21G wavefunctions: 10,000x more expensive

Descriptors: The More is Not Always the Merrier
MOE 2D Descriptors + PEST 3D Descriptors
HF/3-21G geometries, HF/6-31+G* wavefunctions: 100,000x more expensive

Descriptors: The More is Not Always the Merrier
What Happened?
• Models were all of roughly the same quality…
• Despite a 1,000x to 100,000x increase in computation
• Reasons?
  • Optimization to a (local) minimum-energy conformation
    • Not necessarily the biologically active conformation
  • Increases in the basis set often change the absolute values of properties, but relative differences often remain similar
  • Sometimes descriptor families just don’t correlate with the response.
• Domain knowledge should guide descriptor choices; more expensive doesn’t always equal better.

Counter Example: Olfactory Prediction
Where 3D descriptors are king…
[Figure labels: non-musk, musk, musk, non-musk]

GA/PCA Musk Classification Results with TAE descriptors (Fast)
(C. Davidson and B. Lavine)
7 selected features
[Figure legend: 1 = Nonmusk, 2 = Musk]

Aromatic Musk classification with TAE and 3D PEST Descriptors (slower, but much better)
(C. Davidson and B. Lavine)
[Figure: 3D PC plot, Dim(9); axes PC1 and PC2; legend: 1 = Nonmusk, 2 = Musk]

Musk Discrimination using both TAE and 3D PEST Descriptors
(C. Davidson and B. Lavine)
[Figure legend: 1 Macro Non-Musk; 2 Macro Musk; 1 Nitro Non-Musk; 2 Nitro Musk]

So, sometimes more computing is better, and sometimes not. What do we do?
• Define best-practice methods for determining the value of particular methods for specific modeling problems.
• Case Study: QSPR
  • Do I have informative descriptors for this problem?
  • Do I need to use more expensive descriptors or not?

Example - Descriptor Assessment: Dataset Truncation Procedure
[Diagram: Dataset → Training Set (70%) and Testing Set (30%); Training Set → 90% Subset → Training Set, repeated; Testing Set (30%) shown at each stage]
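One way to read the truncation procedure is sketched below. It assumes (from the diagram labels only) that a 30% testing set is held out once and that the training set is then repeatedly cut to a random 90% subset of itself, with a model refit at each stage and scored against the same testing set; the function name, the number of rounds, and the use of random subsetting are assumptions, not details confirmed by the slides.

```python
import random

def truncation_splits(n_samples, test_fraction=0.30, keep_fraction=0.90,
                      n_rounds=5, seed=0):
    """Return (test_indices, list of successively truncated training index sets)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_test = round(n_samples * test_fraction)
    test = indices[:n_test]    # fixed hold-out set, reused at every stage
    train = indices[n_test:]   # full training set (first stage)

    training_sets = [train]
    for _ in range(n_rounds):
        n_keep = max(1, int(len(train) * keep_fraction))
        train = rng.sample(train, n_keep)  # random 90% subset of the previous set
        training_sets.append(train)
    return test, training_sets

test_idx, training_sets = truncation_splits(100)
sizes = [len(t) for t in training_sets]
print("test size:", len(test_idx))
print("training sizes:", sizes)
```

A descriptor set whose test-set performance degrades slowly as the training set shrinks is carrying real information; one whose performance collapses quickly was probably fitting noise, however expensive it was to compute.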

Dataset Truncation testing: Do 3D Descriptors improve model robustness – are they worthwhile?

What are the real costs of over-computation?
• In Douglas Adams' The Hitchhiker's Guide to the Galaxy, a "simple answer" to The Ultimate Question is requested from the computer Deep Thought, specially built for this purpose.
• After 7½ million years of computing, Deep Thought's answer is 42.
• "Forty two?!" yelled Loonquawl. "Is that all you've got to show for seven and a half million years' work?"
• "I checked it very thoroughly," said the computer, "and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is."

The RECCR Community http://reccr.chem.rpi.edu

Thank you! RECCR is funded under the Molecular Libraries Roadmap Initiative of NIH (# 1P20HG003899-01 of 09-23-2005) http://reccr.chem.rpi.edu

ACKNOWLEDGMENTS
• Current and Former members of the DDASSL group
  • Breneman Research Group (RPI Chemistry)
    • N. Sukumar
    • M. Sundling
    • Min Li
    • Long Han
    • Jed Zaretzki
    • Theresa Hepburn
    • Mike Krein
    • Steve Mulick
    • Shiina Akasaka
    • Hongmei Zhang
    • C. Whitehead (Pfizer Global Research)
    • L. Shen (BNPI)
    • L. Lockwood (Syracuse Research Corporation)
    • M. Song (Synta Pharmaceuticals)
    • D. Zhuang (Simulations Plus)
    • W. Katt (Yale University chemistry graduate program)
    • Q. Luo (J & J)
  • Embrechts Research Group (RPI DSES)
  • Tropsha Research Group (UNC Chapel Hill)
  • Bennett Research Group (RPI Mathematics)
• Collaborators:
  • Tropsha Group (UNC Chapel Hill - CECCR)
  • Cramer Research Group (RPI Chemical Engineering)
• Funding
  • NIH (GM047372-07)
  • NIH (1P20HG003899-01)
  • NSF (BES-0214183, BES-0079436, IIS-9979860)
  • GE Corporate R&D Center
  • Millennium Pharmaceuticals
  • Concurrent Pharmaceuticals
  • Pfizer Pharmaceuticals
  • ICAGEN Pharmaceuticals
  • Eastman Kodak Company
  • Chemical Computing Group (CCG)

Reserve Slides

What are the real costs of over-computation?

Do 3D Descriptors improve model robustness – are they worthwhile?

Example - Descriptor Assessment: Dataset Truncation Procedure
[Diagram: Dataset → Training Set (70%) and Testing Set (30%); Training Set → 90% Subset → Training Set; Testing Set (30%) shown at each stage]
PLS models of five components were used throughout the study.
