Management of Computational Chemistry Electronic Structure Data in the U.S. David A. Dixon, Mingyang Chen, Amanda Stott, Shenggang Li Department of Chemistry, The University of Alabama, Tuscaloosa AL 35487-0336 Robert Ramsay Chair Fund
SEURAT Computational Chemistry (Commercial) • Pros: • GUI (visualization) • Support for multiple file formats • User-defined meta-data • Cons: • Inefficient with large numbers of files (each file must be imported manually) • Commercial http://www.synapticscience.com/seurat/feature/
ChemDataBase (Not in U.S.) • Pros: • GUI • Cons: • Parsers rely on CDK (Chemistry Development Kit) • No emphasis on building database tables. Huarong Sun, Ruisheng Zhang et al., 2008 International Multi-symposiums on Computer and Computational Sciences
NIST Computational Chemistry Comparison and Benchmark DataBase (CCCBDB) • Provides a benchmark set of molecules for the evaluation of ab initio computational methods. • Over 250,000 calculations at different levels of theory. • Thermochemical values: • 1. Enthalpies of formation. • 2. Entropies, heat corrections. • 3. Supporting data, such as geometries, vibrational frequencies, etc. • 4. Additional computed properties such as atomic charges, electric dipole moments, HOMO-LUMO gaps, etc. http://cccbdb.nist.gov/
Carnegie-Mellon Quantum Chemistry Archive (CMQCA) • CMQCA is a collection of compressed results from the GAUSSIAN 80 program; later versions of GAUSSIAN generate archive output following similar standards. • Accessed from a 300-baud terminal connected to the CMU Vax 11/780 over a telephone line. • The first version of CMQCA lists 2000 Hartree-Fock structures computed with STO-3G, 3-21G, and 6-31G*. • Self-reproducibility is required of archived data. • CMQCA format: input data followed by the numerical results (energies, dipoles, frequencies, etc.). • Final geometries are stored for local-minimum or saddle-point searches. • Prototype for the meta-data format in RCCC
Ampac / Agui from Semichem • Ampac for fast semi-empirical calculations • Fast and reliable • Many methods: AM1, MNDO, MINDO3, PM3, MNDO/d, RM1, PM6, SAM1, MNDOC • Geometry optimization, frequencies, transition state, IRC, solvation, etc. • Agui for molecular visualization • Supports most features of Gaussian 09, including periodic systems, ONIOM, etc. • Supports many file formats, including Mol, Mol2, SDF, PDB, CIF • Supports many platforms: Windows, Linux, Mac OS X, etc. [Figure panels: Manage Molecular Orbitals; 3D Reaction Surface Plot; Surface Adsorption]
Extensible Computational Chemistry Environment (Ecce) • A domain-encompassing problem-solving environment for computational chemistry, including a sophisticated graphical user interface, scientific visualization tools, and an underlying data management framework. • The resulting environment lets research scientists use complex computational modeling software and high-performance computers transparently from their desktop workstations. MS3 EMSL Molecular Science Software
Science Drivers: Science across Scales in Space & Time • Catalysis: Computational catalysis – transition metal oxides, homogeneous catalysts, metal clusters, site isolated catalysts • Nanoscience: TiO2 clusters for sensors and photocatalysts; Shape memory alloys (Nitinol) (NASA) • Energy: H2 storage in chemical systems – organic & inorganic • Energy: Advanced Fuel Cycle Initiative – Metal oxide clusters in solution for new fuels and environmental cleanup • Energy: New sources of energy (solar) • Geochemistry: Geological CO2 sequestration • The Environment: Atmosphere, Clean Water, Subsurface & Cleanup • Biochemistry: Peptide and amino acid negative ion chemistry • Computational main group chemistry – fluorine chemistry, acids and bases, other elements • Computational thermodynamics and kinetics – high accuracy, solvation effects. • Chemical End Station: RC3 & software development
Overview of UA Computing Resources
• Clients: office clients (Windows / Linux); home clients
• Network traffic gateway
• Storage server (Samba, NFS, WWW, SQL, etc.): Dell PowerEdge 2950 & PowerVault MD1000; Intel Xeon, 8 cores; memory: 16 GB; storage: 13 TB
• Storage server roles: data access (console/web); data backup, parsing, & database building
• Supercomputers (source of data backup): Chinook (EMSL, PNNL), 18,480 CPUs; Colonel/Hope/Pople (UA), 348 CPUs; UAHPC (UA), 260 CPUs; DMC/Altix (ASC), 1,484 CPUs
Computing Software Resources • Other computational chemistry programs • For quantum chemistry: ACES3, CFour, Columbus, Dalton, GAMESS, Molcas, MPQC, PSI3, etc. • For molecular dynamics: CPMD, Espresso, NAMD, Tinker, ZORI, etc. • Software for program development • Intel C/C++/Fortran compilers, MKL/IPP/TBB libraries; • PGI C/C++/Fortran compilers, ACML libraries
Build a Robust Computational Database • Machine-built database • Database grows as new computational data are generated • Expandable support for new computational output formats • Go open source
Computational Chemistry Data Management System: RCCC = RC3 (Regional Computational Chemistry Collaboratory) • Manages and mines the vast amounts of chemistry-specific data that petascale computation will generate. • Offers the perspective of a user, a group, or a project. • Resides on the remote supercomputer and on a local server computer at each registered site. • On the supercomputer, a program automatically parses the data to extract essential information, either during or after job execution. • Data are packaged and stored, and the relevant metadata are registered in a database. • The local server automatically mirrors relevant data. • The database exposes a standard directory hierarchy and file system so that standard tools can be used to manipulate data. • The main objective is to perform the day-to-day data backup, collect calculation meta-data, and organize them for research use, so that users can more easily access, present, and reuse their computational results.
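The mirroring step described above can be sketched as a comparison between a remote file listing and the local copy. This is only an illustration under stated assumptions: the slides do not show how RC3 decides which files to transfer, so the function name `files_to_mirror`, the `Listing` type, and the size/modification-time criterion are all hypothetical, and the sample paths and numbers are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical listing type: relative path -> (size in bytes, modification time).
Listing = Dict[str, Tuple[int, float]]


def files_to_mirror(remote: Listing, local: Listing) -> List[str]:
    """Return remote paths that are missing locally, differ in size, or are newer."""
    pending = []
    for path, (size, mtime) in remote.items():
        if path not in local:
            pending.append(path)  # never copied before
            continue
        local_size, local_mtime = local[path]
        if size != local_size or mtime > local_mtime:
            pending.append(path)  # changed since the last mirror pass
    return sorted(pending)


# Made-up listings for illustration only.
remote_listing = {
    "mchen10/job1/run.out": (2048, 1000.0),
    "mchen10/job2/run.out": (4096, 2000.0),
    "mchen10/job3/run.out": (512, 3000.0),
}
local_listing = {
    "mchen10/job1/run.out": (2048, 1000.0),  # already mirrored
    "mchen10/job2/run.out": (1024, 1500.0),  # stale local copy
}

pending = files_to_mirror(remote_listing, local_listing)
print(pending)
```

In a real deployment the remote listing would come from the supercomputer over SSH and the returned paths would be handed to the transfer step; both are left out here to keep the sketch self-contained.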
Implementation Details • File mirroring: synchronize new files from computing servers to the storage server • account management • scheduled mirroring • SSH via Paramiko (Python module) • Data extraction: collect meta-data from transferred files • user, group, filename, location, software version, title, keywords, coordinates, frequencies, energies, dipole, etc. • users can easily define and expand extraction rules • parsing while transferring • on-demand scan • Database: • MySQL • Python interface • Encryption: • Crypto (Python module) • Geometry visualization: • Jmol • User interface: • Command line (Bash shell) • Webpage (Jmol integrated) • Implementation: • Python 2.x
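User-defined extraction rules of the kind listed above could be expressed as a table of regular expressions, one per meta-data field. The sketch below is an assumption, not the RC3 parser: the rule table `RULES`, the function `extract_metadata`, and the sample output text are hypothetical and do not reproduce the output format of NWChem, Molpro, Gaussian, or any other program. It is written for Python 3, whereas the slides state that RC3 itself uses Python 2.x.

```python
import re
from typing import Callable, Dict, Tuple

# Hypothetical rule table: field name -> (regex with one capture group, converter).
Rule = Tuple[str, Callable[[str], object]]
RULES: Dict[str, Rule] = {
    "title": (r"Title:\s*(.+)", str.strip),
    "energy": (r"Total energy\s*=\s*(-?\d+\.\d+)", float),
    "dipole": (r"Dipole moment\s*=\s*(-?\d+\.\d+)", float),
}


def extract_metadata(text: str, rules: Dict[str, Rule] = RULES) -> Dict[str, object]:
    """Apply every rule to the text and keep the last match of each field."""
    found: Dict[str, object] = {}
    for field, (pattern, convert) in rules.items():
        matches = re.findall(pattern, text)
        if matches:
            # The last match is kept so that, for example, the final energy
            # of a multi-step run wins over intermediate values.
            found[field] = convert(matches[-1])
    return found


# Made-up output text in an invented format, for illustration only.
sample_output = """\
Title: water test
Total energy = -76.010000
Total energy = -76.026765
Dipole moment = 1.8500
"""

metadata = extract_metadata(sample_output)
print(metadata)
```

Supporting a new property or a new program then amounts to adding an entry to the rule table rather than changing the extraction loop.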
Progress and Status • RC3 demo tested in our group for more than 1 year • 1 group, 36 users, 94 accounts • Most of our goals achieved • So far, 1.5 terabytes (1.6 million files) backed up and organized in tree-structured directories • 144,000 data entries generated in the database • Currently supports NWChem, Molpro, Gaussian, etc. • Parses various properties
Account Manager Snapshot
Enter menu[ [USER]:mchen10 ] now..
==================================================================
[ MENU TITLE ]: [USER]:mchen10
------------------------------------------------------------------
[ 1 ] --- LIST ALL ACCOUNTS
[ 3 ] --- ENTER ACCOUNT MENU
[ 5 ] --- ADD AN ACCOUNT
[ 6 ] --- DEL AN ACCOUNT
[ 8 ] --- SWITCH TO ANOTHER USER
[ 9 ] --- CHANGE THE PASSWORD FOR THE CURRENT USER
------------------------------------------------------------------
[ HELP ]: Choose an entry or: (q)quit; (m)print menu.
==================================================================
[ [USER]:mchen10 ]Choose an entry(q to quit, m to print the menu)
[ [USER]:mchen10 ]Your Input:
How to search for all the calculated files whose molecules contain C, H, N, O and Ru, print the calculation settings and specific properties (e.g. energy, Gibbs free energy correction, etc.), and sort the results by their energies?
• mysql query
mysql> select Formula, Basis, CalcBy, Energy, Gibbs, Thermal, Enthalpy,
       FinTime from RC3_MAIN where Formula like "C%H%N%O%Ru%"
       and JobStatus like "CalcDone%" order by Energy;
• rc3query (a bash wrapper)
$ rc3query 'C*H*N*O*Ru*'
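The two forms of the query above suggest that the wrapper turns a shell-style pattern into an SQL LIKE pattern. The sketch below shows one way that translation could work; it is an assumption, since the slides do not show how rc3query is implemented. The functions `glob_to_like` and `query_by_formula` are hypothetical, an in-memory SQLite database stands in for MySQL so the example runs on its own, the table holds only a subset of the columns named in the query, and every row (including the `CalcFailed` status) is made up.

```python
import sqlite3


def glob_to_like(pattern: str) -> str:
    """Translate shell-style wildcards (* and ?) into SQL LIKE wildcards (% and _)."""
    return pattern.replace("*", "%").replace("?", "_")


def query_by_formula(conn: sqlite3.Connection, pattern: str) -> list:
    """Return finished calculations whose formula matches the pattern, lowest energy first."""
    sql = (
        "SELECT Formula, Basis, Energy FROM RC3_MAIN "
        "WHERE Formula LIKE ? AND JobStatus LIKE 'CalcDone%' "
        "ORDER BY Energy"
    )
    # The pattern is passed as a bound parameter rather than pasted into the SQL.
    return conn.execute(sql, (glob_to_like(pattern),)).fetchall()


# SQLite stand-in for the MySQL table; all rows are placeholder values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RC3_MAIN (Formula TEXT, Basis TEXT, Energy REAL, JobStatus TEXT)"
)
conn.executemany(
    "INSERT INTO RC3_MAIN VALUES (?, ?, ?, ?)",
    [
        ("C10H8N2O2Ru", "basisA", -100.5, "CalcDone"),
        ("C10H8N2O2Ru", "basisB", -101.2, "CalcDone"),
        ("C2H6O", "basisA", -50.0, "CalcDone"),
        ("C5H5NORu", "basisA", -99.0, "CalcFailed"),
    ],
)

rows = query_by_formula(conn, "C*H*N*O*Ru*")
print(rows)
```

The sketch does not escape literal `%` or `_` characters in the input, which a production wrapper would need to consider.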
Future Work • Publish as an open-source package • Improve the parsing engines and write configurations for more computational codes • Web GUI • Better documentation
Benefits of Problem Solving Environments (PSEs) to Scientists • Integrates the key activities of scientific research, from problem definition and research design to experiment execution and analysis • Allows scientists to efficiently execute their computational models over a distributed network • Integrates the scientist's processes, data, and resources into a common working environment • Guides scientists in the research and experimentation process • Allows scientists to share their knowledge and expertise in their specific domains • Reduces barriers to collaboration among scientists who are geographically dispersed
Problem-solving Tools
Record Management • Experimental Data Management • Collaborative Record (e.g., lab notebook) • Data & Processing History • Experiment Reproduction
Research Support • Literature Searches • Data Repositories • Professional Forums • Research Design (e.g., problem definition, hypotheses, research approach)
Workflow Management • Interactive steering • Process automation • Decision support • Knowledge/expertise transfer • Experimental design • Collaboration models
Computation Management • Partitioning & assignments • Status and monitoring • Software history • Data validation
Architecture of a Collaborative PSE
[Diagram labels] Scientific Domains: Combustion, Climate, Manufacturing, Chemistry, Engineering • Problem Solving • Computation Management • Records Management • Decision Support • Work Flow • Distributed OS Support • Distributed Data Management • Distributed Messaging • Collaborative Technologies • Resource Management • Registry Service • Security Model • Execution • Remote Access • Computational Grid
[Diagram labels] Tools: Basis Set Tool • Calculation Launcher • Calculation Editor • Calculation Manager • Calculation Viewer • 3-D Builder. Data Model and Library Interface. Remaining labels, in extraction order: Molecular Dynamics Data Model Ecce Data Model Electronic Single, Experiment Structure Remote Chemistry Data Model Job and Job Data Management Model Multiple Computational Tasks Data Model EMSL Data Model