Learn about smart storage methods for chemical data, from numbers to circumstances, assumptions to standard conditions. Explore the complexities of data sources, mining, and relational design to ensure sensible statistics. Discover the importance of provenance and RDF triplestores for precise data management. Find out how to track, validate, and predict chemical properties effectively. Visit the provided link for more insights.
Or How on Earth do we Store this Stuff?
Smart Storage for Physical Properties
Kieron Taylor with Jeremy Frey and Jonathan Essex
What makes up chemical data?
• Numbers - big, small, precise and vague
• Circumstances - How hot? What pressure?
• Assumptions
  • This is pretty pure, let's say it's pure
  • Standard conditions? More or less
  • That peak on the spectrum isn't important
Using the Data: QSPR
• Take lots of data
• Magical statistics occur
• Validate results
• Predictive model
So What is Real Data like?
Bad - take the commercial Physprop Database
Can we handle these melting points?
Let's Make a Database
• One data source is not enough
• Good(?) data isn't free
• Different sources have varied styles of content
• Most database software is not suited to data mining
• We cannot just plumb these varied sources for data; we must reconcile them to make sensible statistics
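One small part of reconciling sources is bringing values onto common units before any statistics are run. The sketch below is only an illustration: the record layout, the choice of mg/L and K as canonical units, and the second record (a made-up source reporting in g/L) are all assumptions, not part of the talk.

```python
# Minimal sketch of unit reconciliation across sources.
# Assumptions: record layout, canonical units (mg/L, K) and the
# "hypothetical source" record are illustrative only.
TO_CANONICAL = {
    "mg/L": ("mg/L", lambda v: v),
    "g/L": ("mg/L", lambda v: v * 1000.0),
    "C": ("K", lambda v: v + 273.15),
    "K": ("K", lambda v: v),
}


def reconcile(record):
    """Return a copy of the record with value and units made canonical."""
    unit, convert = TO_CANONICAL[record["units"]]
    return {**record, "value": convert(record["value"]), "units": unit}


records = [
    {"property": "solubility", "value": 2500.0, "units": "mg/L", "source": "Physprop"},
    {"property": "solubility", "value": 2.65, "units": "g/L", "source": "hypothetical source"},
    {"property": "melting point", "value": -31.0, "units": "C", "source": "Detherm"},
]

reconciled = [reconcile(r) for r in records]
```

A record with an unrecognised unit raises a `KeyError` here, which is arguably the safer behaviour: a value that cannot be reconciled should not slip silently into the statistics.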
Relational Design
For one molecule: Cyclohexanone

Property       Value   Error    Units  Source       Method        Author  Note
Solubility     2500    +/-50    mg/L   Physprop     Laboratory    ...
               2650    +/-60    mg/L   Southampton  Simulation    Me      Superseded
               2599    +/-25    mg/L   Southampton  Simulation B  Me
Melting point  -31     +/-0.1   C      Detherm      Laboratory    ...
Boiling point  155.4   +/-0.5   C      Merck Index  Laboratory    ...

Decomposing

Property       Value   Units
Solubility     2500    mg/L
Melting point  -31     C
Boiling point  155.4   C

Property       Value   Error    Units  Source
Solubility     2500    +/-50    mg/L   Physprop
               2650    +/-60    mg/L   Our lab
Melting point  -31     +/-0.1   C      Detherm
Boiling point  155.4   +/-0.5   C      Merck Index

Property       Value   Error    Units  Source       Method      Author
Solubility     2500    +/-50    mg/L   Physprop     Laboratory  ...
               2650    +/-60    mg/L   Southampton  Simulation  Me
Melting point  -31     +/-0.1   C      Detherm      Laboratory  ...
Boiling point  155.4   +/-0.5   C      Merck Index  Laboratory  ...

Arbitrary numbers of points are hard to store in relational databases.
We're not done yet: we still have to account for multiple experimental conditions, statements of validity and molecules.
Provenance = Senary relational model?
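To make the relational difficulty concrete, here is a minimal sketch of the cyclohexanone table as one row per measurement, using Python's built-in `sqlite3`. The table and column names are assumptions for illustration, not the schema used in the talk; the elided authors are left out rather than guessed.

```python
import sqlite3

# Minimal sketch: one row per measurement, provenance as extra columns.
# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE measurement (
    id          INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES molecule(id),
    property    TEXT NOT NULL,
    value       REAL NOT NULL,
    error       REAL,
    units       TEXT NOT NULL,
    source      TEXT,
    method      TEXT,
    note        TEXT
);
""")
conn.execute("INSERT INTO molecule (id, name) VALUES (1, 'Cyclohexanone')")
conn.executemany(
    "INSERT INTO measurement "
    "(molecule_id, property, value, error, units, source, method, note) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Solubility", 2500, 50, "mg/L", "Physprop", "Laboratory", None),
        (1, "Solubility", 2650, 60, "mg/L", "Southampton", "Simulation", "Superseded"),
        (1, "Solubility", 2599, 25, "mg/L", "Southampton", "Simulation B", None),
        (1, "Melting point", -31, 0.1, "C", "Detherm", "Laboratory", None),
        (1, "Boiling point", 155.4, 0.5, "C", "Merck Index", "Laboratory", None),
    ],
)

# Solubility values that carry no note (so not marked as superseded).
current_solubility = conn.execute(
    "SELECT value, source FROM measurement "
    "WHERE property = 'Solubility' AND note IS NULL ORDER BY value"
).fetchall()
```

This copes with several values per property, but every further facet (a variable number of experimental conditions per measurement, statements of validity) needs more columns or more join tables, which is the growth the slide is pointing at.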
RDF Triplestore is the Solution
• RDF describes trees and networks of entities
• Data of this complexity lends itself well to a tree representation
• RDF trees enable additional clever things
• Triplestores provide persistent RDF models
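The tree idea can be sketched without any RDF library by holding (subject, predicate, object) triples in a plain Python list. This is a toy in-memory stand-in, not 3store, and the `ex:` predicate names and `_:` node labels are invented for illustration rather than taken from the authors' schema.

```python
# Toy in-memory triples: (subject, predicate, object).
# The "ex:" vocabulary and "_:" node labels are illustrative assumptions.
triples = [
    ("ex:cyclohexanone", "ex:hasMeasurement", "_:m1"),
    ("_:m1", "ex:property", "solubility"),
    ("_:m1", "ex:value", 2500.0),
    ("_:m1", "ex:error", 50.0),
    ("_:m1", "ex:units", "mg/L"),
    ("_:m1", "ex:source", "Physprop"),
    ("_:m1", "ex:method", "Laboratory"),
    ("ex:cyclohexanone", "ex:hasMeasurement", "_:m2"),
    ("_:m2", "ex:property", "solubility"),
    ("_:m2", "ex:value", 2650.0),
    ("_:m2", "ex:error", 60.0),
    ("_:m2", "ex:units", "mg/L"),
    ("_:m2", "ex:source", "Southampton"),
    ("_:m2", "ex:method", "Simulation"),
    # An extra kind of statement needs no schema change, just another triple.
    ("_:m2", "ex:note", "Superseded"),
]


def describe(node, triples):
    """Walk the triples from a node and return them as a nested dict (a tree)."""
    tree = {}
    for s, p, o in triples:
        if s == node:
            is_node = isinstance(o, str) and o.startswith("_:")
            tree.setdefault(p, []).append(describe(o, triples) if is_node else o)
    return tree


tree = describe("ex:cyclohexanone", triples)
```

Each measurement becomes its own node hanging off the molecule, so a data point can carry as many or as few statements as its source provides.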
What can we do with this?
• Store almost any chemical data as normal
• Track the where, when and how of each and every data point
• Filter values by whether they are real or simulated, old or new, from a particular source, or done by a particular person
• Bolt on RDF schemas such as FOAF and our units system
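Filtering by provenance then amounts to matching on the statements attached to each data point. A minimal sketch, again using plain tuples rather than a real triplestore; the node names (`m1`, `m2`, `m3`), the predicate names and the `select` helper are hypothetical.

```python
# Toy triples for three solubility values of cyclohexanone.
# Node and predicate names are illustrative assumptions.
triples = [
    ("m1", "value", 2500.0), ("m1", "source", "Physprop"),
    ("m1", "method", "Laboratory"),
    ("m2", "value", 2650.0), ("m2", "source", "Southampton"),
    ("m2", "method", "Simulation"), ("m2", "note", "Superseded"),
    ("m3", "value", 2599.0), ("m3", "source", "Southampton"),
    ("m3", "method", "Simulation B"),
]


def select(triples, **wanted):
    """Return the nodes whose statements match every wanted predicate=value."""
    by_node = {}
    for s, p, o in triples:
        by_node.setdefault(s, {})[p] = o
    return [
        node
        for node, props in by_node.items()
        if all(props.get(pred) == val for pred, val in wanted.items())
    ]


from_southampton = select(triples, source="Southampton")
lab_only = select(triples, method="Laboratory")
superseded = select(triples, note="Superseded")
```

In a real triplestore the same question would be put as a graph query, but the principle is the same: provenance is just more statements about the data point, so it can be filtered on like anything else.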
What have we done with this?
http://green.chem.soton.ac.uk/triangle/query.html
Thanks to:
• AKT and Steve Harris for 3store
• Rob Gledhill for web tech and discussion
• Perl for s/ / /g