Smart Storage for Physical Properties

Or: How on Earth Do We Store This Stuff?
Kieron Taylor, with Jeremy Frey and Jonathan Essex

What makes up chemical data?
• Numbers - big, small, precise and vague
• Circumstances - How hot? What pressure?
• Assumptions
  • This is pretty pure, let's say it's pure
  • Standard conditions? More or less
  • That peak on the spectrum isn't important

Using the Data: QSPR (Quantitative Structure-Property Relationships)
• Take lots of data
• Magical statistics occur
• Validate results
• Predictive model
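The QSPR steps above can be sketched in a few lines. This is only an illustration, assuming a single descriptor and an ordinary least-squares line in place of the "magical statistics"; the numbers are synthetic placeholders, not measurements, and the names `fit_line` and `rmse` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(model, xs, ys):
    """Root-mean-square error of the fitted line on (xs, ys)."""
    slope, intercept = model
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic placeholder data: one descriptor value and one property value
# per (imaginary) molecule. Not taken from any database.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
prop = [1.0, 3.1, 4.9, 7.0, 9.1, 10.9]

# Take lots of data: hold out every second point for validation.
train_x, test_x = descriptor[::2], descriptor[1::2]
train_y, test_y = prop[::2], prop[1::2]

# Statistics occur: fit on the training half only.
model = fit_line(train_x, train_y)

# Validate results: measure the error on points the fit never saw.
validation_error = rmse(model, test_x, test_y)
```

A real QSPR study would use many descriptors and a more careful validation scheme; the point here is only that the model is judged on data it was not fitted to, which is why the quality of the stored data matters so much.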

So What is Real Data Like?
• Bad - take the commercial Physprop Database
• Can we handle these melting points?

Let's Make a Database
• One data source is not enough
• Good(?) data isn't free
• Different sources have varied styles of content
• Most database software is not suited to data mining
• We cannot simply plumb these varied sources for data; we must reconcile them to make sensible statistics

Relational Design

For one molecule: Cyclohexanone

Property       Value  Error   Units  Source       Method        Author  Note
Solubility     2500   +/-50   mg/L   Physprop     Laboratory    ...
               2650   +/-60   mg/L   Southampton  Simulation    Me      Superseded
               2599   +/-25   mg/L   Southampton  Simulation B  Me
Melting point  -31    +/-0.1  C      Detherm      Laboratory    ...
Boiling point  155.4  +/-0.5  C      Merck Index  Laboratory    ...

Decomposing

Property       Value  Units
Solubility     2500   mg/L
Melting point  -31    C
Boiling point  155.4  C

Property       Value  Error   Units  Source
Solubility     2500   +/-50   mg/L   Physprop
               2650   +/-60   mg/L   Our lab
Melting point  -31    +/-0.1  C      Detherm
Boiling point  155.4  +/-0.5  C      Merck Index

Property       Value  Error   Units  Source       Method      Author
Solubility     2500   +/-50   mg/L   Physprop     Laboratory  ...
               2650   +/-60   mg/L   Southampton  Simulation  Me
Melting point  -31    +/-0.1  C      Detherm      Laboratory  ...
Boiling point  155.4  +/-0.5  C      Merck Index  Laboratory  ...

Arbitrary numbers of points are hard to store in relational databases.
We're not done yet: we still have to account for multiple experimental conditions, statements of validity and molecules.
Provenance = Senary relational model?
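To make the relational difficulty concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are assumptions made for illustration, not the schema used in the talk, and the single condition row is a hypothetical placeholder rather than a value from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per measured value. Every new kind of provenance (author, note,
# validity statement, ...) needs another column or another table.
conn.executescript("""
CREATE TABLE measurement (
    id       INTEGER PRIMARY KEY,
    molecule TEXT NOT NULL,
    property TEXT NOT NULL,
    value    REAL NOT NULL,
    error    REAL,
    units    TEXT,
    source   TEXT,
    method   TEXT
);
-- An arbitrary number of experimental conditions per measurement
-- forces a second table and a join.
CREATE TABLE condition (
    measurement_id INTEGER REFERENCES measurement(id),
    name           TEXT NOT NULL,
    value          REAL,
    units          TEXT
);
""")

rows = [
    ("Cyclohexanone", "Solubility", 2500, 50, "mg/L", "Physprop", "Laboratory"),
    ("Cyclohexanone", "Solubility", 2650, 60, "mg/L", "Southampton", "Simulation"),
    ("Cyclohexanone", "Melting point", -31, 0.1, "C", "Detherm", "Laboratory"),
    ("Cyclohexanone", "Boiling point", 155.4, 0.5, "C", "Merck Index", "Laboratory"),
]
conn.executemany(
    "INSERT INTO measurement "
    "(molecule, property, value, error, units, source, method) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Hypothetical placeholder condition, NOT from the slides: it only shows
# where "how hot?" would have to live in this design.
conn.execute("INSERT INTO condition VALUES (1, 'temperature', 25, 'C')")

solubilities = conn.execute(
    "SELECT value, source FROM measurement "
    "WHERE property = 'Solubility' ORDER BY value"
).fetchall()

with_conditions = conn.execute(
    "SELECT m.value, c.name FROM measurement AS m "
    "JOIN condition AS c ON c.measurement_id = m.id"
).fetchall()
```

Two tables already cover only values and conditions. Adding authors, notes and statements of validity in the same style means more tables and more joins for every query, which is the growth the "senary" question points at.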

RDF Triplestore is the Solution
• RDF describes trees and networks of entities
• Data of this complexity lends itself well to a tree representation
• RDF trees enable additional clever things
• Triplestores provide persistent RDF models
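As a rough illustration of why triples suit this data, the sketch below models one measurement as plain (subject, predicate, object) tuples and walks them back into a tree. It uses no RDF library; the predicate names, the `_:` node labels and the `describe` helper are all hypothetical, and a real triplestore would use proper URIs and a query language.

```python
# Each triple is (subject, predicate, object). Names are illustrative only.
graph = [
    ("mol:cyclohexanone", "hasMeasurement", "_:m1"),
    ("_:m1", "property", "Melting point"),
    ("_:m1", "value", -31),
    ("_:m1", "error", 0.1),
    ("_:m1", "units", "C"),
    ("_:m1", "provenance", "_:p1"),
    ("_:p1", "source", "Detherm"),
    ("_:p1", "method", "Laboratory"),
]


def describe(graph, node):
    """Follow triples outward from `node` and return them as a nested dict.

    Assumes the graph is acyclic below `node`; a cycle would recurse forever.
    """
    subjects = {s for s, _, _ in graph}
    tree = {}
    for s, p, o in graph:
        if s != node:
            continue
        # An object that is itself a subject becomes a subtree.
        child = describe(graph, o) if o in subjects else o
        tree.setdefault(p, []).append(child)
    return tree


tree = describe(graph, "mol:cyclohexanone")
```

Adding a new kind of fact, such as an author or a validity note, is just one more triple on the relevant node; nothing about the existing data has to change.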

What can we do with this?
• Store almost any chemical data as normal
• Track the where, when and how of each and every data point
• Filter values by whether they are real or simulated, old or new, from a particular source, or produced by a particular person
• Bolt on RDF schemas such as FOAF and our units system
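Filtering by provenance can be sketched with the same plain-tuple representation of triples. This is not the 3store query interface; `matching_values` and the predicate names are hypothetical, and the values are the cyclohexanone solubilities from the relational design slide.

```python
# (subject, predicate, object) triples; node labels and predicates are
# illustrative only.
graph = [
    ("_:m1", "value", 2500), ("_:m1", "source", "Physprop"),
    ("_:m1", "method", "Laboratory"),
    ("_:m2", "value", 2650), ("_:m2", "source", "Southampton"),
    ("_:m2", "method", "Simulation"), ("_:m2", "author", "Me"),
    ("_:m3", "value", 2599), ("_:m3", "source", "Southampton"),
    ("_:m3", "method", "Simulation B"), ("_:m3", "author", "Me"),
]


def matching_values(graph, **criteria):
    """Return the values of every node whose facts match all criteria."""
    hits = []
    for node in sorted({s for s, _, _ in graph}):
        # Assumes at most one object per predicate on each node.
        facts = {p: o for s, p, o in graph if s == node}
        if "value" in facts and all(
            facts.get(key) == wanted for key, wanted in criteria.items()
        ):
            hits.append(facts["value"])
    return hits


lab_only = matching_values(graph, method="Laboratory")
from_southampton = matching_values(graph, source="Southampton")
my_simulations = matching_values(graph, author="Me", method="Simulation")
```

Because every data point carries its own source, method and author, the same function serves all of these filters without any change to how the data is stored.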

What have we done with this?
http://green.chem.soton.ac.uk/triangle/query.html

Thanks to:
• AKT and Steve Harris for 3store
• Rob Gledhill for web tech and discussion
• Perl for s/ / /g
