NIST Scientific Data for Data Science
United Nations Open Data / Open Government Conference, April 26-28, Abu Dhabi
http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science
Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
April 26, 2014
Open Data / Open Government Conference • Request: • Interesting case studies about open government / open data. • Information on relevant federal apps designed. • A short bio. • Response: • AOL Government published about 80 of my 200-some stories at Semantic Community about open government data and activities. • Over 250 Spotfire dashboard apps in my cloud library, including most of the major open government dashboards and new data sets. • Helped Data.gov get started in the US, and helped open government data get started at SEMIC.EU and in Japan.
Speaker Bio • Brand Niemann, former Senior Enterprise Architect & Data Scientist with the US EPA, works as a data scientist, produces data science products, and publishes data stories for Semantic Community, AOL Government, & Data Science & Data Visualization DC. • With Kate Goodier, he co-organized the Federal Big Data Working Group Meetup, whose Data Science Teams produce big data applications for government and business and which provides a free online graduate course entitled Practical Data Science for Data Scientists.
Broader Context • NIST and other agencies need to support the following Federal Government Initiatives: • Big Data • Digital Government Strategy • Public access mandated for "scientific results" supported by the U.S. government • Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP • Digital Object Architecture: One result will be to make the scientific record into a first class scientific object • The author has suggested that all of these can be addressed with agency digital content by following the Data Mining Standard. • See “Data Science Makes Data More Important Than Code and Ontology”
Data Mining Standard • Business Understanding: • NIST Mission • Standardize measurement • Data Understanding: • NIST Digital Archives • Promised to publish raw data sets • Data Preparation: • Knowledge Base of the Above • Need raw data for figures • Modeling: • Semantic Knowledge Base, Data Papers, and NanoPublications • See White Paper on “Making Big Data Small” using Data Science and Semantics • Evaluation: • Searchability, Discovery, and Reasoning • Relational Queries and Graph Traversal • Deployment: • Story and Knowledge Base in MindTouch, Excel, NodeXL, Spotfire, and Be Informed • Data ecosystem
NIST • NIST Supports its employees and others with the following Information Services: • Research Library • Publishing Services • NIST Museum and Archives • The NIST Digital Archives (NDA) present images of NIST Museum artifacts and full-text NIST publications: • NBS Bulletins • Journal of Research of NIST • NBS-NIST Directors • NBS-NIST Histories • NBS Circulars and Reports
NIST Home Page http://www.nist.gov/
NIST Virtual Library http://www.nist.gov/nvl
NIST Digital Archive Interface http://nistdigitalarchives.contentdm.oclc.org/
NIST Digital Archive Contents My Note: 9602 Items! http://nistdigitalarchives.contentdm.oclc.org/cdm/search/display/200/order/title/ad/asc
NIST Digital Archive Example My Note: Can Read PDF On-line, but Where Is the Data? http://cdm16009.contentdm.oclc.org/cdm/compoundobject/collection/p13011coll6/id/153009/rec/1
PDF-to-MindTouch My Note: Need Data for Figure 8 and for Table 1 to be Real Data (it is!) • Figure 8: The solid circles show the measured absorbance • Table 1: Properties of 2.0 μm microspheres at 266 nm obtained from the fit of the L-M apparent cross section to the absorbance measurements
Modeling: Approaches by the Federal Big Data Working Group Meetup • Semantic Medline: • Semantic MEDLINE Query: mesothelioma and Data Science for VIVO • Data Papers: • Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context • http://sepublica.mywikipaper.org/ • Nanopublications: • The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author. • http://nanopub.org/wordpress/?page_id=65
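The nanopublication idea above (one assertion, uniquely identified and attributed) can be sketched as a small record type. This is a minimal illustration only, not the Nanopub.org data model: the class name, field names, and the placeholder author and source strings are assumptions made for this example, and the sample values simply echo the Table 1 caption quoted in this deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopublication:
    """One assertion plus its attribution (illustrative structure only)."""
    subject: str    # identifier of the thing the assertion is about
    predicate: str  # relationship or property being asserted
    value: str      # the asserted value or object
    author: str     # who the assertion is attributed to
    source: str     # where the assertion was extracted from

    def as_row(self):
        """Flatten to a list, e.g. for one spreadsheet row."""
        return [self.subject, self.predicate, self.value, self.author, self.source]


# Sample values echo the Table 1 caption; author and source are placeholders.
example = Nanopublication(
    subject="2.0 μm microspheres",
    predicate="properties reported at wavelength",
    value="266 nm",
    author="(author of the data paper)",
    source="(URL of the data paper)",
)
```

Freezing the dataclass keeps each assertion immutable once created, which fits the idea of a smallest publishable unit; `as_row()` is one hypothetical way to carry such extracts into a spreadsheet.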
Modeling: Examples • Dr. Barend Mons: BRAIN • Dr. Tom Rindflesch: Semantic Medline • Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions
Evaluation and Deployment • Evaluation and Deployment examples for each approach are as follows: • Semantic Knowledge Base: Web & PDF • Selected Data Papers: PDF-to-MindTouch • Measurement of Scattering and Absorption Cross Sections of Microspheres for Wavelengths between 240 nm and 800 nm • OMNIDATA and the Computerization of Scientific Data • Nanopublication: Extracts from the Data Papers-to-Excel • My Note: Still need the NIST raw data sources to re-create the figures in the publications. • I have been promised that NIST is going to publish their data sets as part of the Open Government Data Initiative.
How was the data collected? My Note: Unstructured Information to Structured Data, Including the Two PDF Papers, with Well-defined URLs According to the SEMIC.EU Standards. http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science
Where is the unstructured and structured data stored? Web and PDF Footnote and References Metadata and Data Sources Well-defined URLs for Linked Data Relational and Graph Ready for NodeXL & Spotfire http://semanticommunity.info/@api/deki/files/28860/NISTDataScience.xlsx
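To illustrate what "relational and graph ready" might mean in practice, the sketch below turns relational rows into a de-duplicated edge list, the two-column layout that graph tools commonly import. The column names `section` and `url` and the sample rows are assumptions for this example; they are not taken from NISTDataScience.xlsx.

```python
def rows_to_edges(rows, source_key, target_key):
    """Return unique (source, target) pairs from a list of row dicts.

    Rows with a missing or blank endpoint are skipped.
    """
    edges = []
    seen = set()
    for row in rows:
        source = (row.get(source_key) or "").strip()
        target = (row.get(target_key) or "").strip()
        if not source or not target:
            continue  # an edge needs both endpoints
        if (source, target) not in seen:
            seen.add((source, target))
            edges.append((source, target))
    return edges


# Hypothetical rows: slide titles paired with the URLs shown on them.
rows = [
    {"section": "NIST Home Page", "url": "http://www.nist.gov/"},
    {"section": "NIST Virtual Library", "url": "http://www.nist.gov/nvl"},
    {"section": "NIST Home Page", "url": "http://www.nist.gov/"},  # duplicate
    {"section": "NIST Digital Archive Interface", "url": ""},      # no link
]
edges = rows_to_edges(rows, "section", "url")
```

The same rows remain usable for relational queries, while the edge list is the form a graph traversal would start from.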
What are the results? NIST Scientific Data Knowledge Base Visualization My Note: Sections with Many Reference Links Can be Very Important!
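One rough way to act on the note above is to count reference links per section and rank the sections. The sketch below assumes the knowledge base has already been reduced to (section, link) pairs; the section names and example.org URLs are placeholders, and link count is only a proxy for importance, not a measure used in the original work.

```python
from collections import Counter


def rank_sections_by_links(edges):
    """Return (section, link_count) pairs, most-linked section first."""
    counts = Counter(section for section, _link in edges)
    return counts.most_common()


# Placeholder data: three links in one section, one link in another.
edges = [
    ("Section A", "http://example.org/ref1"),
    ("Section A", "http://example.org/ref2"),
    ("Section B", "http://example.org/ref3"),
    ("Section A", "http://example.org/ref4"),
]
ranked = rank_sections_by_links(edges)
```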
What are the results? NIST Digital Archives Century of Excellence My Note: The Featured Seminal Data Paper is the 60th out of 106 Which I Found from Doing the Index Below!
What are the results? NIST Digital Archives My Note: The NIST Digital Archive Can be an Interface to Data Papers with Data Tables and Interactive Visualizations. This Work Can be Used to Prioritize the Additional Work and Reduce Duplication.
What are the results? NIST Library Catalog Search for Data My Note: This Was a Test for Searching the Catalog for “data” and Converting the Results to a Spreadsheet (20 of 259). There is Also the Need to Search for Data Tables Within the Individual Publications.
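The catalog test described above (search for “data”, then convert the hits to a spreadsheet) could look roughly like the following. This is a sketch under assumptions: it works on an in-memory list of records rather than the real library catalog, the `title` field name is hypothetical, and the two sample titles are the data papers named earlier in this deck.

```python
import csv
import io


def search_records(records, keyword):
    """Return records whose title contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [r for r in records if needle in r["title"].lower()]


def to_csv(records, fieldnames):
    """Serialize records to CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Sample records reuse titles that appear earlier in this deck.
records = [
    {"title": "Measurement of Scattering and Absorption Cross Sections "
              "of Microspheres for Wavelengths between 240 nm and 800 nm"},
    {"title": "OMNIDATA and the Computerization of Scientific Data"},
]
hits = search_records(records, "data")
sheet = to_csv(hits, ["title"])
```

A title search like this would miss the first paper even though it contains data tables, which is the gap the note points to: finding data inside individual publications needs a search of their contents, not only of catalog titles.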
What is our data story and product? • Need a scientific data publishing environment that: • Conforms to editorial policies • Facilitates peer review • Standardizes dissemination • Manages references and URLs • Promotes data publication, validation, and mining • Semantic Community is doing that for NIST: • More work in progress to be reported at the conference and elsewhere