
Toward a Common Data and Command Representation for Quantum Chemistry


carys

Presentation Transcript


  1. Toward a Common Data and Command Representation for Quantum Chemistry

  2. Outline of e-CCP1 Project • Investigate the technological requirements for enabling effective use of Grid resources by the quantum chemistry community • Middleware (Globus, Unicore, EGEE) • Compute resources • Client tools (CoG kits) • Track and develop, where necessary, the emerging standards in computational chemistry data and command representation (XML-based CML, CMLComp, FSAtom). • Realise these requirements by developing some core tools that can be deployed and customised by CCP1 code developers. • Develop GUI interfaces that will operate with a range of CCP1 codes and implement Grid functionality.

  3. Motivation • The emergence of Grid technologies has provided a generalised framework for the interoperability of computational codes. • A common data and command representation: • Promotes appropriate data re-use • Makes data available to a wider community • There are many existing ways to represent data – why not just convert between them (e.g. with Open Babel)? • Pairwise conversion is error prone • If there are n formats, n(n-1) converters are required. A solution is to convert through a common ‘middle ground’ representation, which needs only 2n converters (one reader and one writer per format)
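The converter counts on this slide can be checked with a few lines of Python. This is only an arithmetic sketch; the function names are illustrative and do not come from the presentation.

```python
def pairwise_converters(n: int) -> int:
    """Directed converters needed to translate between every ordered pair of n formats."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Converters needed when each format only reads from and writes to one common format."""
    return 2 * n


# Compare the two strategies for a few format counts.
for n in (3, 5, 10, 50):
    print(f"n={n}: pairwise={pairwise_converters(n)}, common format={hub_converters(n)}")
```

The two counts are equal at n = 3; for any larger number of formats the common representation needs fewer converters, and the gap grows quadratically.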

  4. Data Types • What data could we represent? • Data/parameters • Structures • Scalar properties • Molecular orbitals • Normal modes of vibration • Dynamics • Basis sets • Force fields • Pseudo-potentials • Control • Energy convergence criteria • SCF steps • Mixing parameters • Mesh properties… • Some data can be shared amongst codes, others will be code specific – semantics is important • Some of the data will be meta-data (e.g. code used, version, method…) • Some of the data will define relationships between other data.

  5. Data Representation • What are the existing ways of representing data? • Formats like CIF • Relational databases • XML (e.g. CML) • Objects, methods, data members (intermediate step) • But, how do we implement the data models (how do we define our vocabulary)? • SQL • XML schema • Class interfaces (e.g. W3C IDL based DOM recommendations) • UML

  6. Semantics and Ontology • Semantics: • Providing the meaning of vocabulary is important. • We want to ensure appropriate re-use of data. • Semantics can be controlled by: • Annotating the data model (e.g. in XML schema <xsd:annotation>) • Links to external sources (e.g. XML dictionaries in CML) • Ontology: • An ontology can be thought of as ‘an explicit specification of concepts and the relationships between them.’ • Relationships between concepts can be expressed using the Resource Description Framework (RDF). RDF is the basis of ontology languages such as OWL and DAML+OIL. • An RDF schema specifies the relationships used by the RDF and the relationships between relationships… • An ontology helps to reduce implicit assumptions about data and their relationships.

  7. XML Representation • XML is a widely adopted and mature method of representing structured information • A vast and increasing range of tools makes XML easily readable and interpretable by applications authored by different groups • At the expense of conciseness: • XML is self-describing – it carries meta-data • XML can be explicit about data • Some methods of representing data in XML already exist (e.g. CML), for which there are many tools

  8. An Example • A geometry representation for the CH molecule • A basis set representation for the CH molecule

  9. Relationships • How do we link the basis sets and geometries? • Could rely on implicit linking (<atom elementType="C"…> with <basisSet id="C1"…>) • But what happens if we want to change the rules? • Could use attributes (<atom id="a1" basis="C1"…>) • But documents could come from different sources and not know about each other's attributes • This leads to continual revision of the data model • Could describe the relationship using RDF, or in an RDF-like manner • RDF/n3: • @prefix r1: <file://chGeom.xml#xpointer> . • @prefix r2: <file://chBasis.xml#xpointer> . • @prefix r3: <file://eccpRelations.html#> . • <r2:(//basisSet[@id="C1"])> <r3:isBasisFor> <r1:(//atom[@elementType="C"])> . • <r2:(//basisSet[@id="H1"])> <r3:isBasisFor> <r1:(//atom[@elementType="H"])> .
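The idea of keeping the basis-set-to-atom relationship outside both documents can be sketched in Python with the standard library. This is not an RDF implementation: the triples are held in a plain list, the two XML fragments are hypothetical stand-ins for chGeom.xml and chBasis.xml, and namespaces are omitted for brevity. Only the `isBasisFor` predicate and the XPath-style selectors come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for chGeom.xml and chBasis.xml (assumed content).
GEOM_XML = """<molecule id="CH">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="H"/>
</molecule>"""

BASIS_XML = """<basisSets>
  <basisSet id="C1"/>
  <basisSet id="H1"/>
</basisSets>"""

# (subject selector in the basis document, predicate, object selector in the geometry document)
RELATIONS = [
    (".//basisSet[@id='C1']", "isBasisFor", ".//atom[@elementType='C']"),
    (".//basisSet[@id='H1']", "isBasisFor", ".//atom[@elementType='H']"),
]


def resolve_basis(geom_root, basis_root, relations):
    """Return {atom id: basis set id} by evaluating each isBasisFor triple."""
    assignment = {}
    for subject, predicate, obj in relations:
        if predicate != "isBasisFor":
            continue  # ignore relationships this sketch does not understand
        basis = basis_root.find(subject)
        if basis is None:
            continue  # the triple points at a basis set that is not present
        for atom in geom_root.findall(obj):
            assignment[atom.get("id")] = basis.get("id")
    return assignment


assignment = resolve_basis(ET.fromstring(GEOM_XML), ET.fromstring(BASIS_XML), RELATIONS)
print(assignment)
```

Because the rules live in `RELATIONS`, changing how basis sets are assigned means editing the triples only; neither document needs a new attribute.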

  10. Relationships • But… • Passing text is not straightforward – can serialise RDF/n3 to RDF/XML • RDF/XML (converted using CWM) • <rdf:RDF xmlns:r1="file://chGeom.xml#xpointer" • xmlns:r2="file://chBasis.xml#xpointer" • xmlns:r3="file://chRelations.html#" • xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> • <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;C1&#34;])"> • <isBasisFor xmlns="r3:" • rdf:resource="r1:(//atom[@elementType=&#34;C&#34;])"/> • </rdf:Description> • <rdf:Description rdf:about="r2:(//basisSet[@id=&#34;H1&#34;])"> • <isBasisFor xmlns="r3:" • rdf:resource="r1:(//atom[@elementType=&#34;H&#34;])"/> • </rdf:Description> • </rdf:RDF>

  11. Other Design Considerations • How implicit/explicit should we be? • When should we use ‘general’ or ‘grouping’ tags for data? • E.g. to take a CML-like example: Exchange energy = -1.025771783 or <eexchange>-1.025771783</eexchange> or <scalar dictRef="eccp:eexchange">-1.025771783</scalar> • To what extent should we tag data? • E.g. <basisExponents>0.0 0.0 0.0 0.0</basisExponents> or <basisExponents> <n>0.0</n> <n>0.0</n> <n>0.0</n> <n>0.0</n> </basisExponents>
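The tagging trade-off above affects how much work the reader has to do. The following Python sketch, using only the standard library, accepts either form of `<basisExponents>` from the slide; the function name `read_exponents` is an illustrative assumption.

```python
import xml.etree.ElementTree as ET


def read_exponents(element):
    """Return the exponents as floats, whichever of the two markup styles is used."""
    children = element.findall("n")
    if children:
        # Fully tagged form: one <n> child per number.
        return [float(child.text) for child in children]
    # Compact form: whitespace-separated numbers in the element text.
    return [float(token) for token in (element.text or "").split()]


compact = ET.fromstring("<basisExponents>0.0 0.0 0.0 0.0</basisExponents>")
tagged = ET.fromstring(
    "<basisExponents><n>0.0</n><n>0.0</n><n>0.0</n><n>0.0</n></basisExponents>"
)

print(read_exponents(compact))
print(read_exponents(tagged))
```

The compact form pushes tokenising into application code, while the tagged form lets the XML parser do it at the cost of a larger document.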

  12. Using XML • Reading: • Convert to standard application input (e.g. use XSLT) • Read in the XML directly, using existing native/foreign code or writing your own • Writing: • Parse the standard application output and convert to XML • Write the XML directly, using existing native/foreign code or writing your own • [Diagram labels: doc1.xml, doc2.xml, doc3.xml; XML SAX/DOM; parser; in.txt; native/foreign libraries; application; out.txt; meta-data; parser; out.xml]
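The "convert" routes for reading and writing can be sketched in Python. The slide suggests XSLT for the input conversion; this sketch substitutes plain Python with the standard library instead. The geometry fragment, its placeholder coordinates, the one-atom-per-line input layout, and the function names are all assumptions made for illustration; the `Exchange energy` line and the `<scalar dictRef="eccp:eexchange">` element follow the example on the design considerations slide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical CML-like geometry with placeholder coordinates.
GEOM_XML = """<molecule id="CH">
  <atomArray>
    <atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0"/>
    <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="1.12"/>
  </atomArray>
</molecule>"""


def xml_to_input(xml_text):
    """Reading route: turn the XML geometry into an assumed plain-text input format."""
    root = ET.fromstring(xml_text)
    lines = []
    for atom in root.iter("atom"):
        coords = " ".join(f"{float(atom.get(axis)):12.6f}" for axis in ("x3", "y3", "z3"))
        lines.append(f"{atom.get('elementType'):<2} {coords}")
    return "\n".join(lines)


def output_to_xml(output_text):
    """Writing route: pick one scalar out of text output and mark it up."""
    match = re.search(r"Exchange energy\s*=\s*(-?\d+\.\d+)", output_text)
    if match is None:
        raise ValueError("no exchange energy found in output")
    scalar = ET.Element("scalar", {"dictRef": "eccp:eexchange"})
    scalar.text = match.group(1)
    return ET.tostring(scalar, encoding="unicode")


print(xml_to_input(GEOM_XML))
print(output_to_xml("Exchange energy = -1.025771783"))
```

The regular expression in `output_to_xml` is tied to one output layout, which illustrates why parsing standard output is sensitive to changes between code versions.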

  13. Using XML - Comments • Careful choice of DOM or SAX parser implementation is needed • DOM – potentially large overheads when used with large model instances • SAX – difficult to code when data is heavily cross referenced • Until recently, XML support for FORTRAN has been poor. No existing native parsers and it’s difficult to write your own • Solutions • native FORTRAN XML modules (Alberto Garcia) • FORTRAN DOM (Jon Wakelin) • XML libraries such as libXML and Xerces could be used with appropriate wrappers • There are mixed SAX and DOM API implementations, e.g. libXML xmlTextReader • Parsing standard output is a good option for proprietary code, but suffers from versioning • Writing formatted data directly is error prone • FORTRAN WXML module (Alberto Garcia) • FORTRAN CML writer (Jon Wakelin)
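The DOM versus SAX trade-off can be illustrated with Python's standard library parsers, used here as stand-ins for the FORTRAN and C libraries named on the slide. The synthetic document and the function names are assumptions for illustration.

```python
import xml.dom.minidom
import xml.sax

# Synthetic document with many atoms (illustrative only).
XML_TEXT = (
    "<molecule>"
    + "".join(f'<atom id="a{i}" elementType="H"/>' for i in range(1000))
    + "</molecule>"
)


def count_atoms_dom(text):
    """DOM: build the whole tree in memory, then query it."""
    document = xml.dom.minidom.parseString(text)
    return len(document.getElementsByTagName("atom"))


class AtomCounter(xml.sax.ContentHandler):
    """SAX: react to events as they stream past; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "atom":
            self.count += 1


def count_atoms_sax(text):
    handler = AtomCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


print(count_atoms_dom(XML_TEXT), count_atoms_sax(XML_TEXT))
```

The DOM version is shorter to write but holds every node in memory; the SAX version keeps only a counter, and any cross-referencing between elements would have to be tracked by hand inside the handler.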

  14. Automation • Data models evolve with time. It is hard work to maintain code by hand. Ideally… • CML - Java and C++ API generators • CCPN – Python API generators • Still have to worry about mapping the wrapper data structures to the internal data structures of the application. • [Diagram labels: schema; API generator; ?; validation; XML; API wrapper objects; application]

  15. Data Modelling • The focus is back to the data model. • XSD is not easy to interpret, impeding a collaborative approach to data model design • Designing is complicated by implementation decisions – it is a good idea to separate the conceptualisation and implementation • Represent the data model in the Unified Modelling Language (UML)? • This is a graphical notation (mainly) for expressing designs. • Can UML express XSD implementation decisions? • Yes, through UML stereotypes (subtypes of Meta-model types) • A UML profile (collection of stereotypes) for schema design has been developed by David Carlson

  16. UML data model • UML equivalent to the XSD geometry and basis set data model • UML can be represented as XMI to facilitate the communication of data models between applications. • Hypermodel will convert XMI to XSD.

  17. Binary Data • Some scientific data would be best stored in binary (e.g. molecular orbitals) • Binary data could simply be pointed to by XML • But… • Sharing binary data requires a machine independent way of storing it. • Could use: • HDF • NetCDF • BinX/DFDL
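A minimal Python sketch of the "XML points to binary data" idea follows. It does not use HDF, NetCDF, or BinX/DFDL; it substitutes the standard `struct` module with an explicitly declared byte order, which is the simplest way to make a raw file machine independent. The `<binaryData>` element and its attributes (`href`, `byteOrder`, `dataType`, `count`) are hypothetical names invented for this example.

```python
import os
import struct
import tempfile
import xml.etree.ElementTree as ET


def write_coefficients(path, values):
    """Write float64 values little-endian and return an XML pointer describing the file."""
    with open(path, "wb") as handle:
        handle.write(struct.pack(f"<{len(values)}d", *values))
    pointer = ET.Element(
        "binaryData",  # hypothetical element name
        {
            "href": os.path.basename(path),
            "byteOrder": "littleEndian",
            "dataType": "float64",
            "count": str(len(values)),
        },
    )
    return ET.tostring(pointer, encoding="unicode")


def read_coefficients(directory, pointer_xml):
    """Follow the XML pointer and decode the file using the declared byte order."""
    pointer = ET.fromstring(pointer_xml)
    prefix = "<" if pointer.get("byteOrder") == "littleEndian" else ">"
    count = int(pointer.get("count"))
    with open(os.path.join(directory, pointer.get("href")), "rb") as handle:
        return list(struct.unpack(f"{prefix}{count}d", handle.read(8 * count)))


with tempfile.TemporaryDirectory() as directory:
    coefficients = [0.5, -0.25, 0.125]
    pointer_xml = write_coefficients(os.path.join(directory, "orbitals.bin"), coefficients)
    restored = read_coefficients(directory, pointer_xml)

print(pointer_xml)
print(restored)
```

Formats such as HDF and NetCDF record this kind of layout information inside the file itself, which is what makes them attractive for sharing.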

  18. Current Status • Drafting CML-like markup and schema for some computational chemistry data • Basis sets • Molecular orbitals • Cartesian and internal coordinates • Molecular vibrations • Job parameters • Scalar quantities • Set up an eCCP1 Wiki for discussions (grids.ac.uk/eccp) • Set up a NeSCForge project page for code/data model development

  19. Current Status • Developing a C API for parsing CML geometries • Linked to libXML2 • Designed to scale well with XML file size (uses xmlTextReader) • Designed to be easily FORTRAN callable • Transparently reads gzipped XML files • Python module written to read in CML1/2 molecular atom information for the CCP1 GUI • GROWL (Grid Resources on Workstation Language), a C API for utilising current CLRC Grid portal services, is being developed.
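The project's parser is a C API built on libXML2; the following is only a Python analogue of two of its listed design goals, streaming reads and transparent handling of gzipped files. It uses the standard library's `gzip` module and `ElementTree.iterparse`, and the file names, sample geometry, and function names are assumptions.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical geometry used to exercise the reader.
GEOM_XML = b"""<molecule id="CH">
  <atom id="a1" elementType="C"/>
  <atom id="a2" elementType="H"/>
</molecule>"""


def open_maybe_gzipped(path):
    """Open a file for reading, decompressing on the fly if it starts with the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_atoms(path):
    """Yield (id, elementType) for each atom without holding the whole tree in memory."""
    with open_maybe_gzipped(path) as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            if element.tag == "atom":
                yield element.get("id"), element.get("elementType")
                element.clear()  # release the element once it has been processed


with tempfile.TemporaryDirectory() as directory:
    plain_path = os.path.join(directory, "ch.xml")
    gzipped_path = os.path.join(directory, "ch.xml.gz")
    with open(plain_path, "wb") as handle:
        handle.write(GEOM_XML)
    with gzip.open(gzipped_path, "wb") as handle:
        handle.write(GEOM_XML)
    plain_atoms = list(iter_atoms(plain_path))
    gzipped_atoms = list(iter_atoms(gzipped_path))

print(plain_atoms)
print(gzipped_atoms)
```

Checking the magic bytes rather than the file extension means callers do not need to know whether a file was compressed.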

  20. Meeting Logistics • Agenda (time, format, location) • Monday 5th April • 10.00 – 11.10: Presentations (Lecture theatre) • 11.10 – 11.25: Refreshments • 11.25 – 12.35: Presentations (Lecture theatre) • 12.35 – 13.35: Lunch • 13.35 – 14.10: Presentations (Lecture theatre) • 14.10 – 15.40: Practical session and announcements (Lecture theatre, training lab, Cramond) • 15.40 – 16.00: Refreshments • 16.00 – 17.30: Practical session (Lecture theatre, training lab, Cramond) • 19.00: Conference dinner • Tuesday 6th April • 09.00 – 10.10: Presentations (Lecture theatre) • 10.10 – 10.25: Refreshments • 10.25 – 12.10: Presentations (Lecture theatre) • 12.10 – 13.10: Lunch • 13.10 – 14.25: Open discussions (Cramond) • 14.25 – 14.40: Refreshments • 14.40 – 16.00: Open discussions (Cramond) • 16.00: Meeting close

  21. Publication of meeting material • 1. Contributed material to be published independently (e.g. NeSC technical report on CML) • Meeting proceedings (summary) + presentations on web • Create a meeting CD containing presentations and proceedings • 2. Contributed material to be published independently • Meeting proceedings (technical report written by authors of contributed material), focus on existing material and decisions on way forward • Create a meeting CD, presentations, proceedings, and code?

  22. Discussion Topics • How to construct a working group • Who could be involved? • Process of data model refinement • Reference Implementation • Platforms? • What are the requirements? • Manpower? • How focused should we be (data types to include, platforms, etc.)? • How do we reach a consensus?
