Data Management and Representations in Ecce and CMCS

Theresa L. Windus

Pacific Northwest National Laboratory

Environmental Molecular Sciences Laboratory

Molecular Science Software Group

Outline
  • Some “definitions”
  • Data and task representations
    • Ecce
    • CMCS
  • Summary
  • Acknowledgement

Data and metadata (one scientist’s data is another scientist’s metadata)

$H^{\circ}_{\mathrm{atomiz},\,0}(\mathrm{CH_3OOH}) = 522.09 \pm 2.02\ \mathrm{kcal/mol}$

[calculated, G3//B3LYP, T. Windus, more at http://...]

data: value and uncertainty

units: kcal/mol

quantity: enthalpy of atomization

species: methylhydroperoxide, CAS# 3031-73-0

temperature: 0 K

calculated: G3//B3LYP

creator: T. Windus using Ecce

more info: http://avatar.emsl.pnl.gov:8080/Ecce/.../CH3OOH/.../GxEnergy
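
The split between the bare number and the metadata that makes it interpretable can be sketched as a small record. This is only an illustration: the class and field names (`AnnotatedValue`, `metadata`) are assumptions and not part of Ecce or CMCS; the values are copied from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedValue:
    """A number plus the metadata needed to interpret it (hypothetical type)."""
    value: float
    uncertainty: float
    metadata: dict = field(default_factory=dict)

    def describe(self) -> str:
        # The bare number is meaningless without units, quantity, and species.
        m = self.metadata
        return (f"{m['quantity']} of {m['species']} at {m['temperature']}: "
                f"{self.value} ± {self.uncertainty} {m['units']}")


atomization = AnnotatedValue(
    value=522.09,
    uncertainty=2.02,
    metadata={
        "units": "kcal/mol",
        "quantity": "enthalpy of atomization",
        "species": "methylhydroperoxide",
        "cas": "3031-73-0",
        "temperature": "0 K",
        "calculated": "G3//B3LYP",
        "creator": "T. Windus using Ecce",
    },
)
print(atomization.describe())
```

Keeping the metadata in an open dictionary rather than fixed fields mirrors the point of the slide: which entries count as "data" and which as "metadata" depends on who is reading the record.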

Metadata Converts Scientific Data into Knowledge
  • Metadata provides identification and documentation to scientific data.
    • Example: Attaching an owner, creation date, abstract, type to data.
    • Example: Tracking data to program versions, and possibly bugs for that version.
  • Metadata documents the context and value of the data.
    • Example: The theoretical atomization energy of methylhydroperoxide (and its uncertainty) from Ecce (used as input to ATcT) contains information identifying the species and the quantity, units, the theoretical method used, vibrational frequencies and geometry, reference to source file, creator, etc.
  • Metadata facilitates cross-scale transfer of data.
    • Example: Can show a chain of inputs, including input parameters and configuration files, across scales.
    • Example: Can retrieve literature references which describe this data.
  • Metadata allows users to comment on the data and its quality.
    • Example: Can be used for scientific peer review of data.
  • Metadata is necessary for effective collaboration.
    • Example: Scientific data becomes more usable to others when it is documented.

Annotation is another term for metadata. Annotations can be added by either the data owner or a third party.

Data Pedigree: A Special Kind of Metadata
  • Data pedigree or data provenance is a relationship which provides a “line of ancestors”.
  • Pedigree allows for the categorization and tracing of the scientific data, and for the identification of the data’s ultimate origin, possibly across scales.
  • Pedigree includes the series of steps necessary to reproduce the data.
  • Data is linked, for example, to projects, references, inputs, and outputs.
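
Tracing a "line of ancestors" amounts to walking input links backwards from a result. The sketch below assumes a simple in-memory mapping; the resource names and the `HAS_INPUT` table are made up for illustration and do not come from Ecce or CMCS.

```python
from collections import deque

# Hypothetical pedigree links: each resource maps to the resources it was derived from.
HAS_INPUT = {
    "enthalpy-of-formation": ["g3-energy", "frequencies"],
    "g3-energy": ["optimized-geometry"],
    "frequencies": ["optimized-geometry"],
    "optimized-geometry": ["input-parameters"],
    "input-parameters": [],
}


def ancestors(resource, has_input):
    """Return every resource reachable through input links, nearest first."""
    seen, order = set(), []
    queue = deque(has_input.get(resource, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # shared ancestors are reported once
        seen.add(current)
        order.append(current)
        queue.extend(has_input.get(current, []))
    return order


print(ancestors("enthalpy-of-formation", HAS_INPUT))
```

A breadth-first walk is used here so that the immediate inputs are listed before the ultimate origin; any graph traversal would serve the same purpose.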

Knowledge Grid
  • A set of scalable tools, middleware, and services
  • For the creation, analysis, dissemination, evaluation, and use
  • Of data, information, and knowledge
  • By individuals, groups, and communities

…A digital place for performing ‘all’ aspects of science

Ecce & NWChem
  • Ecce – Extensible Computational Chemistry Environment
    • Comprehensive problem solving environment
    • Common graphical user interfaces
    • Scientific modeling management
    • Seamless transfer of information between applications
    • Persistent data storage through DAV
    • Integrated scientific data management
    • Tools for ensuring efficient use of computing resources across a distributed network
    • Visualization of multi-dimensional data structures
    • http://ecce.emsl.pnl.gov
  • NWChem – massively parallel computational chemistry program
    • Energetics, geometries, frequencies, etc. at various levels of theory
    • http://www.emsl.pnl.gov/docs/nwchem

Distributed Authoring and Versioning (DAV)
  • An early web service (XML commands over HTTP)
  • A widely adopted standard for metadata/data transport
  • Put/Get data with arbitrary properties (dynamic)
  • Properties can be discovered and accessed independently
  • DASL, Versioning, Transactions, …
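
To make "XML commands over HTTP" concrete, the sketch below builds the body of a WebDAV PROPPATCH request that sets two properties. It only constructs the XML and sends nothing; the helper name `build_proppatch` is an assumption, and the property namespace is taken from the Ecce property names that appear in this talk's metadata example.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
ECCE_NS = "http://www.emsl.pnl.gov/ecce"  # namespace as it appears in the Ecce property names


def build_proppatch(properties):
    """Build a WebDAV PROPPATCH request body that sets the given properties."""
    ET.register_namespace("D", DAV_NS)
    ET.register_namespace("e", ECCE_NS)
    update = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(update, f"{{{DAV_NS}}}set")
    prop = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property is an element in its own namespace, so new ones can be
        # added at any time without changing the protocol.
        ET.SubElement(prop, f"{{{ECCE_NS}}}{name}").text = value
    return ET.tostring(update, encoding="unicode")


body = build_proppatch({"theory": "SCF/RHF", "runtype": "ESP"})
print(body)
```

A client would send this body with the `PROPPATCH` method to the resource's URI; reading properties back uses `PROPFIND` in the same style.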

Ecce Physical Model

Calculations are referred to as a “virtual document” because we distribute the structure across many physical objects.

Physical collections and resources are URI addressable.

Collections are unordered and allow mixed content.

Calculation Setup

[Figure: calculation setup data flow. Labels: Builder; Basis Set Tool; Calculation Editor; Template File; .edml File; Parameters; Geometry; Basis Set; Theory Details; Runtype Details; ESP; Basis Set Reformatting Script; Perl; Python; Input Deck (ai.input).]

Output Parsing

[Figure: output parsing data flow. Labels: Output; Parse Descriptor; Text Block 1 ... Text Block N; Parse Script 1 ... Parse Script N; Perl; Job Monitor; Ecce DataBase; Calculation Viewer.]
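
The idea of a parse descriptor that routes text blocks to small parse scripts can be sketched as follows. The slide indicates Perl scripts; Python is used here only for illustration, and the descriptor layout, the marker patterns, and the sample output text are all assumptions rather than Ecce's actual format.

```python
import re


def parse_energy(block):
    # Hypothetical parse script: the energy is the last token on the line.
    return float(block.split()[-1])


def parse_dipole(block):
    # Hypothetical parse script: the last three tokens are the x, y, z components.
    return [float(x) for x in block.split()[-3:]]


# Hypothetical parse descriptor: property name, marker pattern, parse script.
PARSE_DESCRIPTOR = [
    ("total_energy", re.compile(r"^Total energy"), parse_energy),
    ("dipole", re.compile(r"^Dipole"), parse_dipole),
]


def parse_output(text):
    """Scan output line by line and hand matching lines to their parse script."""
    properties = {}
    for line in text.splitlines():
        for name, pattern, script in PARSE_DESCRIPTOR:
            if pattern.search(line):
                properties[name] = script(line)
    return properties


sample = """Some banner text
Total energy = -40.195172
Dipole moment 0.0 0.0 0.1
"""
print(parse_output(sample))
```

Keeping the patterns in a table separate from the parsing functions means support for a new code or a new property is added by extending the descriptor, not by editing the scanner.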

Example metadata

On the calculation:
http://www.emsl.pnl.gov/ecce:contenttype=ecceCalculation
http://www.emsl.pnl.gov/ecce:resourcetype=VIRTUAL_DOCUMENT
http://www.emsl.pnl.gov/ecce:createdWith=v3.2
http://www.emsl.pnl.gov/ecce:owner=d39974
http://www.emsl.pnl.gov/ecce:application=NWChem
http://www.emsl.pnl.gov/ecce:theory=SCF/RHF
http://www.emsl.pnl.gov/ecce:spinmultiplicity=Singlet
http://www.emsl.pnl.gov/ecce:currentVersion=v3.2
http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT
http://www.emsl.pnl.gov/ecce:reviewed=false
http://www.emsl.pnl.gov/ecce:runtype=ESP
http://www.emsl.pnl.gov/ecce:launch_machine=arunta
http://www.emsl.pnl.gov/ecce:launch_nodes=1
http://www.emsl.pnl.gov/ecce:launch_rundir=/home/d39974/ecceruns
http://www.emsl.pnl.gov/ecce:launch_totalprocs=1
http://www.emsl.pnl.gov/ecce:launch_user=d39974
http://www.emsl.pnl.gov/ecce:launch_maxmemory=0
http://www.emsl.pnl.gov/ecce:launch_remoteShell=ssh
http://www.emsl.pnl.gov/ecce:job_jobid=13858
http://www.emsl.pnl.gov/ecce:job_path=/home/d39974/ecceruns/tracebug/esp
http://www.emsl.pnl.gov/ecce:job_clienthost=arunta
http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT
http://www.emsl.pnl.gov/ecce:version=Thu May  8 13:16:51 PDT 2003 Version 4.5
http://www.emsl.pnl.gov/ecce:state=Complete
http://www.emsl.pnl.gov/ecce:completiondate=Mon, 22 Mar 2004 17:25:14 GMT
DAV:resourcetype=
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b2805d-1000-926a8180"
DAV:supportedlock=
DAV:getcontenttype=httpd/unix-directory

On the molecule:
http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C
http://www.emsl.pnl.gov/ecce:charge=0.000000
http://www.emsl.pnl.gov/ecce:useSymmetry=false
http://www.emsl.pnl.gov/ecce:symmetrygroup=C1
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getcontentlength=386
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b28064-182-926a8180"
DAV:executable=F
DAV:supportedlock=
DAV:getcontenttype=chemical/x-ecce-mvm
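
Each entry in the property listing above has the shape `namespace:name=value`. A minimal sketch of splitting such a line is shown below; the helper name `parse_property` is hypothetical, and the splitting rule (first `=`, then last `:` of the key) is inferred from the listing rather than taken from Ecce.

```python
def parse_property(line):
    """Split 'namespace:name=value' into (namespace, name, value)."""
    key, _, value = line.partition("=")        # values may contain ':' or '='
    namespace, _, name = key.rpartition(":")   # namespace URIs contain ':' too
    return namespace, name, value


lines = [
    "http://www.emsl.pnl.gov/ecce:theory=SCF/RHF",
    "http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT",
    "DAV:getcontenttype=chemical/x-ecce-mvm",
    "DAV:supportedlock=",
]

by_namespace = {}
for line in lines:
    namespace, name, value = parse_property(line)
    by_namespace.setdefault(namespace, {})[name] = value
print(by_namespace)
```

Grouping by namespace separates the application-defined Ecce properties from the properties the DAV server maintains itself.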

Example MVM file

title: demo
type: molecule
num_atoms: 1065
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
...
pdb_list:
H O5* RC 1 157D A
H H5T RC 1 157D A
attr_list:
-0.622300 1 1 0 0
0.429500 1 1 0 0
atom_type_list:
OH
HO
num_bonds: 1028
bond_list:
2 1 1.00000
1 3 1.00000
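
The MVM excerpt mixes `key: value` scalars with `key:` headers that open a list of rows. A minimal reader for that shape might look like the sketch below. It is based only on the excerpt shown in this talk, not on a format specification, so the function name `parse_mvm` and its rules are assumptions.

```python
def parse_mvm(text):
    """Parse 'key: value' scalars and 'key:' list sections from MVM-like text."""
    scalars, sections, current = {}, {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue
        head, sep, rest = line.partition(":")
        if sep and not rest.strip() and " " not in head:
            current = head                    # e.g. 'atom_list:' opens a section
            sections[current] = []
        elif sep and " " not in head:
            scalars[head] = rest.strip()      # e.g. 'num_atoms: 1065'
            current = None
        elif current is not None:
            sections[current].append(line.split())
    return scalars, sections


sample = """title: demo
type: molecule
num_atoms: 1065
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
num_bonds: 1028
bond_list:
2 1 1.00000
1 3 1.00000
"""
scalars, sections = parse_mvm(sample)
atoms = [(sym, float(x), float(y), float(z)) for sym, x, y, z in sections["atom_list"]]
print(scalars["num_atoms"], atoms[0])
```

Rows are kept as lists of strings so that each section (atoms, PDB records, bonds) can be converted by whatever code understands its columns.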

XML format for Properties

9.60000000000000e-01
1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00 9.34340637068915e-01
9.34340635555820e-01 9.34340634042729e-01 9.34340632529639e-01
0.000000000000000e+00 0.000000000000000e+00 0.000000000000000e+00 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01
6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01
6.767628142309400e-15 -6.950100046595310e-09 1.390021315920880e-08 -6.239857395114590e-01
-6.239857464615680e-01 6.239857534116811e-01 6.239857568867110e-01 6.239857499366001e-01
6.239857707869190e-01 6.239857742619920e-01 -6.239857812120860e-01 -6.239857603617700e-01
-6.239857916372510e-01 6.239857846871540e-01 -6.239857777370440e-01
6.549446678833860e-15 1.124467050187860e-09 -2.248938851918010e-09 -6.252750669032320e-01
-6.252750631744280e-01 6.252750594456050e-01 6.252750588833910e-01 6.252750626121890e-01
6.252750514257610e-01 6.252750508635410e-01 -6.252750471347340e-01 -6.252750583211300e-01
-6.252750428437061e-01 6.252750465725070e-01 -6.252750503012980e-01

Crossing the Molecular to Thermodynamic Scales Data Model

Pedigree is imperative to moving data across scales.

[Figure: pedigree graph for vinoxy. Node labels: Vinoxy; Input Parameters; NWChem Input File; NWChem Output File; Gaussian Input; Gaussian Output; Properties; Optimization and Frequencies; B3LYP; 6-31G*; QCISD; QCISD(T,FC); MP2; MP2(FC); G3MP2large; G3(MP2)B3LYP; Energy; Vibrational Mode Animated GIF; Hf Vinoxy; NASA File. Legend: NWChem; Gaussian; Ecce; CMCS; Active Tables; Pedigree - hasInput; Pedigree - hasOutput.]

The Multi-scale Challenge for Chemical Science
  • Impact of chemical science relies upon flow of information across physical scales
    • Data from smaller scales supports models at larger scales
    • Critical science lies at scale interfaces
      • Molecular properties, transport
      • Mechanism validation, reduction
      • Chemistry – fluid interactions
  • The pedigree of information matters
    • The propagation of data pedigree across scales is difficult
    • Validation and data reliability is often a post-publication process
  • Multi-scale science faces barriers
    • Normal publication route is slow
    • Numerous sub-disciplines employ different applications, formats, models
    • Centers of excellence are geographically distributed

Multi-scale Chemical Science Data
  • Unique terascale reacting flow simulation databases – collection of files @ N × Δt, and experimental data
  • Chemical Mechanisms – k, MB files in various formats containing collections of reaction rates and transport coefficients. Modeled using theory, validated against experiments
  • Kinetic rates – by measurement and computation. Tables collected, reviewed and annotated. NIST WebBook, publications
  • Thermo-Chemistry – Tables of ‘constant’ properties of all molecules (of interest w/data) derived from many experiments, computations, extrapolations
  • Quantum chemistry computations of molecular properties – data from one number to large potential energy surfaces - input to thermo-chemistry and reaction rate computations

CMCS Spans Scales & Geography

Biggest barrier is “language” and informatics

Adaptive Informatics Infrastructure
  • Infrastructure – a well designed, scalable, reusable, flexible set of tools, middleware, and services
  • Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
  • Adaptive – able to dynamically change to incorporate new knowledge and support new activities
    • Low Barriers
      • Many access points
      • Storage of data in original formats with dynamic metadata extraction and translation
    • Powerful
      • Arbitrary formats (binary, ASCII, XML)
      • Integrated data, metadata, pedigree across internal and external tools
    • Evolvable
      • Schema can be changed/extended as needed
      • Metadata, translations, viewers, portal, etc. can be dynamically configured

CMCS Technical Choices Enable Adaptive, Long-lived Infrastructure
  • CMCS Data/Metadata services
    • SAM Translation, Annotation
    • WebDAV implementation
    • Notification (JMS, NED)
    • Search
    • Pedigree browsing
    • Core XML schema
    • Security (JAAS)
  • Chemical Science Portal
    • Jetspeed (CHEF)
    • CMCS Explorer
    • Application portlets
    • Community services
  • Application Integration
    • Webservices
    • WebDAV API
    • Multi-scale data including NIST access

A diagram representing the major conceptual elements of the CMCS Informatics Infrastructure.

How Metadata is Populated in CMCS
  • SAM Metadata Services Layer
    • When data is put into WebDAV, SAM causes XSLTs to be executed to extract metadata from XML files, based on MIME type.
    • Similarly, Binary File Descriptor (BFD) provides an interface to extract metadata from binary files.
    • Other translators can be used as well.
  • CMCS data management/pedigree API to facilitate insertion and modification of metadata, in the proper XML format.
    • Java code which allows software developers and scientists to easily write programs to add/edit metadata.
    • Scientists can use these APIs to integrate with existing or new chemical science applications.
    • Uses open source DAV and XML libraries.
  • Any WebDAV client application
    • DAVExplorer: Java application
    • CMCSExplorer: Integrated in the CMCS portal
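
The "extract metadata on put, based on MIME type" step can be sketched as a registry of extractors keyed by MIME type. This is not SAM's implementation: the names `EXTRACTORS`, `register`, and `put` are hypothetical, the store is a plain dictionary instead of a WebDAV server, and a few lines of `xml.etree` stand in for a registered XSLT.

```python
import xml.etree.ElementTree as ET

EXTRACTORS = {}  # MIME type -> metadata extractor (hypothetical registry)


def register(mime_type):
    """Decorator that registers an extractor for one MIME type."""
    def decorator(func):
        EXTRACTORS[mime_type] = func
        return func
    return decorator


@register("text/xml")
def extract_from_xml(content):
    # Stand-in for an XSLT: expose each top-level child element as a property.
    root = ET.fromstring(content)
    return {child.tag: (child.text or "").strip() for child in root}


def put(store, path, content, mime_type):
    """Store data and attach whatever metadata a registered extractor finds."""
    extractor = EXTRACTORS.get(mime_type)
    metadata = extractor(content) if extractor else {}
    store[path] = {"content": content, "mime_type": mime_type, "metadata": metadata}
    return metadata


store = {}
doc = "<calculation><species>CH3OOH</species><theory>G3//B3LYP</theory></calculation>"
print(put(store, "/demo/calc.xml", doc, "text/xml"))
print(put(store, "/demo/raw.bin", b"\x00\x01", "application/octet-stream"))
```

Data with no registered extractor is still stored in its original form, just without extracted metadata, which matches the "low barriers" goal of accepting arbitrary formats.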

CMCS Metadata, Annotations, and Pedigree
  • Using Dublin Core for some basic pedigree properties of electronic publication: creator, dates, publisher, is-referenced-by, references, etc.
    • Digital library standard for metadata
    • http://www.dublincore.org
  • CMCS properties for Chemical Science to enable searching: species name, CAS, chemical properties, and chemical formula.
  • CMCS properties for defining scientific data: inputs, outputs, and is-part-of-project.
  • CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
  • Currently defined more than 35 elements in the core CMCS pedigree.
  • Flexible infrastructure for addition of new metadata. As new metadata is added to infrastructure, current apps will not break!

CMCS metadata is strongly encouraged, though not required, for all CMCS data, and CMCS metadata is highly extensible.

Pedigree Browsing

Data is linked to projects, references, inputs, and outputs

The Browser enables metadata editing.

Automatic Translation and Metadata Extraction

Data translations are provided automatically by SAM, using XSLTs previously registered for this file type.

Adaptive Infrastructure Enables Application Integration

[Figure: application integration architecture. Labels: Browser, e-mail; MCS Portal; Portlet API; Active Table; REACTIONLAB; ELN 5.0; Ecce; launch; NWChem/GRID RESOURCES; Fitdat; Shared Data Repository; DAV; DAV+SAM NS; SAM (Mime-type Assignment, Metadata Extraction, Translation, Pedigree Relationships); SAM Web service; CMCS/DAV API; Notification Web service; Notification API; Grid Fabric; Federation ML; NIST Kinetics DB.]

Summary
  • Users just want to have ease of use and flexibility in viewing output – adaptive informatics infrastructure
  • “Standards” are useful, but it is necessary to be able to translate between diverse “schema” and “ontologies”
  • Metadata converts scientific data into knowledge

Multi-disciplinary Ecce Development Team

Gary Black -- Project lead

Karen Schuchardt -- Software architect lead

Bruce Palmer -- Chemist architect

Todd Elsethagen -- Data management lead

Erich Vorpagel -- Chemist consultant

Michael Peterson -- Operations support

Mahin Hackler -- Operations support

Sue Havre -- Application development

Brett Didier -- Application development

Carina Lansing -- Application development

Steve Matsumoto -- Online help lead

Colleen Winters -- Online help

Doug Rice -- Online help

Multi-disciplinary CMCS Team

Chemical Science Computer/Information Science

Christine Yang, SNL

Larry Rahn*, SNL

Carmen Pancerella, SNL

Renata McCoy, SNL

Michael Lee, SNL

Wendy Koegler, SNL

Ed Walsh, SNL

John Hewson, SNL

David Montoya*, LANL

Lili Xu, LANL

Yen-Ling Ho, LANL

William H. Green, Jr. *, MIT

Michael Frenklach*, UCB

William Pitz*, LLNL

Michael Minkoff, ANL

Thomas C. Allison*, NIST

Sandra Bittner, ANL

Gregor von Laszewski, ANL

David Leahy, SNL

Sandeep Nijsure, ANL

Al Wagner*, ANL

Kaizar Amin, ANL

James D. Myers, PNL

Branko Ruscic, ANL

Brett Didier, PNL

Reinhardt Pinzon, ANL

Karen Schuchardt, PNL

Baoshan Wang, ANL

Eric Stephan, PNL

Carina Lansing, PNL

Theresa Windus*, PNL

Elena Mendoza, PNL

SAM

National Collaboratory Program
Acknowledgements

This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy (DOE). PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Funding is also provided by the Mathematics, Information and Computer Science and Basic Energy Sciences Division of DOE.

End
