
Linked Data and the Provenance Explosion

Deborah L. McGuinness, Tetherless World Constellation Chair, Professor of Computer Science and Cognitive Science, Director RPI Web Science Trust Network Lab, Rensselaer Polytechnic Institute. DPDM, March 18, 2011, Melbourne, Australia.


Presentation Transcript


  1. Linked Data and the Provenance Explosion • Deborah L. McGuinness, Tetherless World Constellation Chair, Professor of Computer Science and Cognitive Science, Director RPI Web Science Trust Network Lab, Rensselaer Polytechnic Institute • DPDM, March 18, 2011, Melbourne, Australia

  2. Outline • Motivating scenarios • Light Weight data connectivity in health setting • PopSciGrid • Provenance-related questions • Mid-weight semantic modeling in interdisciplinary virtual observatories • Virtual Solar Terrestrial Observatory -> Semantic Provenance Capture in Data Ingest Systems • Mobile platform, social networking enhanced advisor • Semantic Sommelier • Provenance Infrastructure and Directions • Discussion and Directions

  3. Selected Background • Bell Labs: designing Description Logics (DLs) & environments aimed at supporting applications such as configuration. • led to research on making DL-based systems useful – with focus on explanation • Stanford University: focus on ontology-enabled xx, large hybrid systems, later X-informatics often for eScience • led to ontology evolution & diagnostic environments, expanded explanation settings including “messy” hybrid systems with new provenance emphasis

  4. Background cont. • Rensselaer Polytechnic Institute/ Tetherless World Constellation: next generation web, web science research center, open data, next generation semantic eScience • Led to more connections with social platforms, empowering collections (of users, data, etc.)

  5. TWC

  6. Population Sciences Grid Goals (with NIH/NCI, Northwestern) • Convey complex health-related information to consumer and public health decision makers for community health impact • Leverage the growing evidence base for communicating health information on the Internet • Inform the development of future research opportunities effectively utilizing cyberinfrastructure for cancer prevention and control.

  7. Computer Science Slant • How can semantic technologies be used to integrate, present, and analyze data for a wide range of users? • Can tools allow lay people to build their own demos and support public usage and accurate interpretation? • How do we facilitate collaboration and make applications “viral”? • Within PopSciGrid: • Which policies (taxation, smoking bans, etc.) impact health and health care costs? • What data should we display to help scientists and lay people evaluate related questions? • What data might be presented so that people choose to make (positive) behavior changes? • What does the data show? Why should someone believe that? • What are appropriate follow-ups?

  8. PopSciGrid

  9. PopSciGrid

  10. PopSciGrid II http://logd.tw.rpi.edu/demo/tax-cost-policy-prevalence

  11. PopSciGrid III

  12. Questions • Overall data: • What data is used? • How recent is it? • What are the conditions under which it was obtained? • Is it reliable for this purpose? • Pick one item like prevalence: is this the best parameter to focus on? • What is prevalence (definition)? • How is it measured (overall / in this data set)? • Is there another or better proxy (e.g., packs sold)? • Do we need more data, more inference, more xxx…

  13. Virtual Observatory (VSTO) • General: Find data subject to certain constraints and plot appropriately • Specific: Plot the observed/measured Neutral Temperature as recorded by the Millstone Hill Fabry-Perot interferometer while looking in the vertical direction at any time of high geomagnetic activity in a way that makes sense for the data.
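The "find data subject to certain constraints" step on slide 13 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Observation` record, the `find_observations` helper, the use of a Kp-style index as the measure of geomagnetic activity, and the sample values are hypothetical and are not taken from VSTO. The slides describe VSTO as ontology-driven, whereas this sketch only does plain attribute filtering.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical record type; not the VSTO data model."""
    instrument: str   # e.g. "Millstone Hill Fabry-Perot interferometer"
    parameter: str    # e.g. "Neutral Temperature"
    direction: str    # e.g. "vertical"
    kp_index: float   # assumed stand-in for geomagnetic activity
    value: float      # measured value (made-up numbers below)


def find_observations(
    records: Iterable[Observation],
    *,
    instrument: Optional[str] = None,
    parameter: Optional[str] = None,
    direction: Optional[str] = None,
    min_kp: Optional[float] = None,
) -> List[Observation]:
    """Return the records that satisfy every constraint that was supplied."""
    matches = []
    for rec in records:
        if instrument is not None and rec.instrument != instrument:
            continue
        if parameter is not None and rec.parameter != parameter:
            continue
        if direction is not None and rec.direction != direction:
            continue
        if min_kp is not None and rec.kp_index < min_kp:
            continue
        matches.append(rec)
    return matches


# Illustrative, made-up records (not real measurements).
FPI = "Millstone Hill Fabry-Perot interferometer"
sample = [
    Observation(FPI, "Neutral Temperature", "vertical", 6.0, 900.0),
    Observation(FPI, "Neutral Temperature", "vertical", 2.0, 800.0),
    Observation(FPI, "Neutral Temperature", "north", 7.0, 950.0),
]

# The "specific" query from the slide, with an assumed threshold of Kp >= 5.
high_activity = find_observations(
    sample,
    instrument=FPI,
    parameter="Neutral Temperature",
    direction="vertical",
    min_kp=5.0,
)
```

Leaving a constraint as `None` means "no restriction", so the same helper covers both the general and the specific form of the query. Choosing a plot that "makes sense for the data" is outside the scope of this sketch.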

  14. Partial exposure of Instrument class hierarchy - users seem to LIKE THIS

  15. VSTO Results Many Benefits: • Reduced query formation from 8 to 3 steps and reduced choices at each stage • Allowed scientists to get data from instruments they never knew of before (e.g., photometers in example) • Supported augmentation and validation of data • Useful and related data provided without having to be an expert to ask for it • Integration and use (e.g. plotting) based on inference • Ask and answer questions not possible before BUT Needed Provenance • Deborah McGuinness, Peter Fox, Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton. The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. In the Proceedings of the Nineteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-07). Vancouver, British Columbia, Canada, July 22-26, 2007. • Peter Fox, Deborah L. McGuinness, Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton. Ontology-supported Scientific Data Frameworks: The Virtual Solar-Terrestrial Observatory Experience. In Computers and Geosciences - Elsevier. Volume 35, Issue 4 (2009).

  16. Inference Web (IW) [Diagram labels: End Users; End-User Interaction Services; Data Access & Data Analysis Services; Validate PML data; Explanation via Graph; Distributed PML data; Explanation via Customized Summary; Explanation via Annotation; Access published PML data] • Inference Web is a semantic web-based knowledge provenance management infrastructure: • Uses a provenance interlingua (PML) for encoding and interchange of provenance metadata in distributed environments • Provides interactive explanation services for end-users • Provides data access and analysis services for enriching the value of knowledge provenance • It has been used in a wide range of applications

  17. Making Systems Actionable using Knowledge Provenance • Mobile Wine Agent • CALO • Combining Proofs in TPTP • Intelligence Analyst Tools • Knowledge Provenance in Virtual Observatories • GILA • NOW including Data-gov

  18. Proof/Provenance Markup Language (PML) [Diagram labels: World Wide Web; Enterprise Web; documents (D) carrying PML data] • A kind of linked data on the Web • Modularized & extensible • Provenance: annotate provenance properties • Justification: encodes provenance relations (including support for multiple justifications) • Trust: add trust annotation • Semantic Web based

  19. Users Require Provenance! Users demand it! If users (humans and agents) are to use, reuse, and integrate system answers, they must trust them. Intelligence analysts: (from DTO/IARPA’s NIMD) Andrew Cowell, Deborah McGuinness, Carrie Varley, and David A. Thurman. Knowledge-Worker Requirements for Next Generation Query Answering and Explanation Systems. Proc. of Intelligent User Interfaces for Intelligence Analysis Workshop, Intl Conf. on Intelligent User Interfaces (IUI 2006), Sydney, Australia. Intelligent Assistant Users: (from DARPA’s PAL/CALO) Alyssa Glass, Deborah L. McGuinness, Paulo Pinheiro da Silva, and Michael Wolverton. Trustable Task Processing Systems. In Roth-Berghofer, T., and Richter, M.M., editors, KI Journal, Special Issue on Explanation, Künstliche Intelligenz, 2008. Virtual Observatory Users: (from NSF’s VSTO) Deborah McGuinness, Peter Fox, Luca Cinquini, Patrick West, Jose Garcia, James L. Benedict, and Don Middleton. The Virtual Solar-Terrestrial Observatory: A Deployed Semantic Web Application Case Study for Scientific Research. Proc. of the Nineteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-07). Vancouver, British Columbia, Canada. And… as systems become more diverse, distributed, and embedded, and depend on more varied data and communities, more provenance and more types of provenance are needed.

  20. CHIP Pipeline (Chromospheric Helium Image Photometer) [Diagram labels: Intensity Images (GIF); Velocity Images (GIF); Raw Image Data; Captured by CHIP (Chromospheric Helium-I Image Photometer); Publishes; Mauna Loa Solar Observatory (MLSO), Hawaii; National Center for Atmospheric Research (NCAR) Data Center, Boulder, CO] • Raw Image Data • Raw Data Capture • Follow-up Processing on Raw Data (e.g., Flat Field Calibration) • Quality Checking (Images Graded: GOOD, BAD, UGLY)

  21. Semantic Provenance Capture for Data Ingest Systems (SPCDIS) Fact: Scientific data services are increasing in usage and scope, and with these increases comes a growing need for access to provenance information. Provenance Project Goal: to design a reusable, interoperable provenance infrastructure. Science Project Goal: design and implement an extensible provenance solution that is deployed at the science data ingest/product generation time. Outcome: implemented provenance solution in one science setting AND operational specification for other scientific data applications. Extends vsto.org

  22. ACOS Data Ingest • Typical science data processing pipelines • Distributed • Some metadata in silos • Much metadata lost • Many human-in-loop decisions, events • No metadata infrastructure for any user • Community is broadening [Diagram labels: Chromospheric Helium Imaging Photometer (CHIP) Data Ingest; ACOS: Advanced Coronal Observing System]

  23. The Advanced Coronal Observing System case for Provenance • Provenance metadata currently not propagated with or linked to the data products • Processing metadata • Origin (observation) metadata • Data products are the result of “black box” systems • Most users do not know what calibrations, transformations, and QA processing have been applied to the data product [Diagram labels: ???; Source; Processing; Product]

  24. Advanced Coronal Observing System (ACOS) Provenance Use Cases • What were the cloud cover and seeing conditions during the observation period of this image? • What calibrations have been applied to this image? • Why does this image look bad?

  25. PML Usage in SPCDIS • Justification • Explanation • Causality graph • Provenance • Conclusion • Source • Engine • Rule • Trust • Trust/Belief metrics [Diagram labels: NodeSet, Justification, Conclusion (repeated for each node set); hasSourceUsage; SourceUsage; Source; DateTime; hasInferenceEngine; Engine; hasInferenceRule; Rule; hasAntecedentList]

  26. PML in Action in SPCDIS • This is the PML provenance encoding for a “quick look” gif file, which is generated from two image datasets • Node set for the quick look gif file • hasConclusion: a reference to the gif file itself • InferenceStep: how the gif file was derived (hasAntecedents, hasInferenceRule, hasInferenceEngine) • The “antecedents” of the quick look gif file are other node sets
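The node-set structure described on slide 26 can be mirrored in a small Python sketch. This is not PML itself (the slides describe PML as Semantic Web based, and no actual PML syntax is shown here). The class names follow the slide's vocabulary, while the field names, the `upstream_conclusions` helper, and all file, rule, and engine names are hypothetical stand-ins chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InferenceStep:
    """How a conclusion was derived (loosely after the slide's InferenceStep)."""
    rule: str                       # stands in for hasInferenceRule
    engine: str                     # stands in for hasInferenceEngine
    antecedents: List["NodeSet"] = field(default_factory=list)  # hasAntecedents


@dataclass
class NodeSet:
    """A conclusion plus, optionally, the step that justifies it."""
    conclusion: str                        # stands in for hasConclusion
    step: Optional[InferenceStep] = None   # None for data with no recorded derivation


def upstream_conclusions(node: NodeSet) -> List[str]:
    """List every conclusion the given node set was derived from, nearest first."""
    found: List[str] = []
    queue = deque(node.step.antecedents if node.step else [])
    while queue:
        current = queue.popleft()
        if current.conclusion in found:
            continue  # already visited via another derivation path
        found.append(current.conclusion)
        if current.step:
            queue.extend(current.step.antecedents)
    return found


# Hypothetical example: a quick look gif derived from two image datasets.
intensity = NodeSet("intensity_dataset")   # made-up identifier
velocity = NodeSet("velocity_dataset")     # made-up identifier
quicklook = NodeSet(
    "quicklook.gif",
    InferenceStep(
        rule="GenerateQuickLook",          # made-up rule name
        engine="QuickLookGenerator",       # made-up engine name
        antecedents=[intensity, velocity],
    ),
)

derived_from = upstream_conclusions(quicklook)
```

Walking the antecedents in this way is one simple route to questions such as "what was this image generated from?". A real deployment would query the published PML data rather than in-memory objects.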

  27. A PML-Enhanced Image [Figure labels: CHIP PML-Enhanced Quick-Look; CHIP Quick-Look provenance]

  28. Integrated View • Observer log’s information added into quicklook image’s provenance

  29. Provenance aware faceted search Tetherless World Constellation

  30. Current Issues • Successful interdisciplinary VO; needed provenance • Successful provenance integration for experts; needs to support a more diverse audience • As the user base diversifies, what updates are needed? • Will a domain ontology for MLSO/NCAR-affiliated staff be understandable by citizen scientists? No • How can our representational infrastructure be extended with contextual information relevant to user needs? E.g., linking data products from one part of the CHIP pipeline to specific solar events or events at MLSO (such as reports of bad weather) • Should provenance ontologies provide extensional capabilities to include domain-informed extensions? Yes • [1] Stephan Zednik, Peter Fox and Deborah L. McGuinness, “System Transparency, or How I Learned to Worry about Meaning and Love Provenance!” Proceedings of IPAW 2010 • [2] James R. Michaelis, Li Ding, Zhenning Shangguan, Stephan Zednik, Rui Huang, Paulo Pinheiro da Silva, Nicholas Del Rio and Deborah L. McGuinness, “Towards Usable and Interoperable Workflow Provenance: Empirical Case Studies Using PML” Proceedings of SWPM 2009 • [3] AGU 2010 with papers with Fox et al., McGuinness et al., Zednik et al., West et al., Michaelis et al., …

  31. Wine Agent – Semantic Sommelier

  32. Wine Agent for iPhone • Client application which talks to a SW service • Make requests for dishes and wines using auto-generated interfaces • Make recommendations to the system for others

  33. Getting the Recommendation • Recommendations are made up of two classes: 1 dish, 1 wine • When the instance is realized, the agent looks up matching recommendations and returns the results • Tapping a particular recommendation causes the wine agent to look for pairings which match the recommendation
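The lookup described on slide 33 can be sketched in Python under strong simplifying assumptions. The slides describe a client talking to a Semantic Web service. In this sketch a recommendation is just a hypothetical `Recommendation` pair of strings and matching is exact string equality, with no ontology reasoning. The helper name `find_recommendations` and the sample pairings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Recommendation:
    """One dish paired with one wine, as on the slide (plain strings here)."""
    dish: str
    wine: str


def find_recommendations(
    recommendations: Iterable[Recommendation],
    dish: Optional[str] = None,
    wine: Optional[str] = None,
) -> List[Recommendation]:
    """Return recommendations matching the given dish and/or wine.

    A value of None means "do not filter on this field".
    """
    return [
        rec
        for rec in recommendations
        if (dish is None or rec.dish == dish)
        and (wine is None or rec.wine == wine)
    ]


# Made-up pairings, for illustration only.
known = [
    Recommendation("grilled salmon", "Pinot Noir"),
    Recommendation("grilled salmon", "Chardonnay"),
    Recommendation("steak", "Cabernet Sauvignon"),
]

for_salmon = find_recommendations(known, dish="grilled salmon")
```

Tapping a particular recommendation would then correspond to a second call that filters on both fields, for example `find_recommendations(known, dish="steak", wine="Cabernet Sauvignon")`.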

  34. Our Position System Transparency supports user understanding and trust Our Research Goal: Provide interoperable infrastructure that supports explanations of sources, assumptions, and answers as an enabler for trust

  35. Provenance Events [Diagram labels: CSV2RDF; visualize; derive; derive; create; revision; Archive; SemDiff; Enhance; derive]

  36. Challenges for Data Aggregators (with Tim Lebo, Greg Williams)

  37. Discussion • Provenance is growing in acceptance, need, and type • Provenance data could easily dwarf other data in volume • Some interlinguas have emerged that have significant usage and have shown significant value and are ready to be used (plus standard likely from W3C) • Interdisciplinary eScience and open data are increasing the need and pace

  38. Discussion II A few trends we have observed: • Techniques for supporting interaction with large diverse communities are needed (we believe user annotation is one such critical technique) • Data aggregators face additional challenges if provenance is not available… and may accelerate the demand for provenance and provenance standards • Getting back to the portion of the source used is critical for some • Tracking manipulations is critical for some • Providing and creating provenance as part of a larger eco-system is key • Domain-specific extensions can be of value Open (govt, science, etc) data (along with semantic web applications with embedded information about knowledge provenance and term meaning) is providing many new opportunities and will continue to change our lives. • Questions? dlm <at> cs <dot> rpi <dot> edu
