1 / 44

A New Enterprise Data Management Strategy for the US EPA

A New Enterprise Data Management Strategy for the US EPA. Brand Niemann, Senior Enterprise Architect, US EPA Presentation for the 2007 Metatopia Conference, November 5-7, 2007 Hosted by Data Management Association of the National Capital Region. Context.

Download Presentation

A New Enterprise Data Management Strategy for the US EPA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A New Enterprise Data Management Strategy for the US EPA Brand Niemann, Senior Enterprise Architect, US EPA Presentation for the 2007 Metatopia Conference, November 5-7, 2007 Hosted by Data Management Association of the National Capital Region

  2. Context • In SICoP's work on a semantic interoperability data management strategy for the community, the focus has shifted from applying it to EPA, which is done, to the broader community: • 1. SICoP delivered a semantic interoperability data management strategy to the Best Practices Committee of the Federal CIOC in June. • 2. SICoP applied the semantic interoperability data management strategy to the US EPA in July - September as part of its ongoing work with federal agencies and presented this work at several conferences and had it reviewed and accepted by the Metatopia 2007 Committee for publication and use at future conferences. • 3. SICoP is now applying the semantic interoperability data management strategy for the Net-Centric Operations Industry Consortium (NCOIC) with its members to selected agencies (e.g. FAA NextGen, Logistics, USCG, DHS, etc.

  3. Context

  4. Context • EPA CIO Mollie O’Neill: Discovery of information gets harder and harder as more and more information is posted on the web. I think this new use of folders is a great feature to help with ease of discovery. I think there are some good opportunities ahead for knowledge transfer and research projects around indicators. (July 13 and August 30, 2007).

  5. Context • EPA CIO Mollie O’Neill: Many of us, and the people we support, are working on the front lines of environmental issues. Most people don’t care “what’s under the hood”, they care about getting the specific, accurate information they need quickly and easily. EPA has a variety of systems and mechanisms for storing, finding, and accessing information that are challenging for our staff and stakeholders to navigate. It is a challenge for us to make this information easily discoverable. (October 4, 2007).

  6. Abstract • At present, EPA has no overarching strategy or comprehensive set of business and technical rules that can support the level of integration and data sharing that the EPA target applications architecture recommends. This will be addressed, in part, by an enterprise data management strategy being developed by a Data Integration Workgroup Community of Interest hosted by the Enterprise Architecture Team this fall. A key premise articulated by participants in the recent Enterprise Data Integration and Data Warehousing Meeting was to “Design to Share” for new data systems and to “Enable to Share” existing data systems from the Federal Enterprise Architecture’s Data Reference Model 2.0 (2).

  7. Abstract • An EPA Data Architecture for DRM 2.0 was developed and applied last year (3) based on the premise of reusing the data and information rather than changing the data systems themselves. The concepts and standards of the Semantic Web (also called the Data Web or Web 3.0) are being applied to the reuse of data and information in an EPA Data Architecture for DRM 3.0 and Web 3.0 (4) using EPA and interagency data and information sources. This is a fundamentally different approach than conventional data integration and data warehousing that requires changes in data systems (5) and is being applied to net-centric operations (6) by the Federal Semantic Interoperability Community of Practice (SICoP) (7).

  8. Abstract • This enterprise data management strategy illustrates • (a) “how the DRM’s abstract nature enables agencies to use multiple implementation approaches, methodologies, and technologies while remaining consistent with the foundational principles of the DRM”, and • (b) “that associating elements of concrete architectures with the DRM abstract model promotes interoperability between cross-agency architectures / implementations.” (8)

  9. Abstract • The example used in this enterprise data management strategy illustrates how: • (a) bringing the data and the metadata back together, • (b) bringing the structured data and unstructured information back together, and • (c) bringing the data description and context back together promotes sharing and reuse and makes further sharing and reuse easier.

  10. Bio • Dr. Brand Niemann is a Senior Enterprise Architect in the Office of the Chief Information Officer of the Environmental Protection Agency. His work on EPA and Interagency data architecture to facilitate electronic information sharing was recently recognized by Federal Computer Week in its 2006 Power Player Series Special Report. • See http://colab.cim3.net/cgi-bin/wiki.pl?BrandNiemann and http://www.himotion.us/2/2006/139.html • Brand is a recognized leader in the use of communities of practice (CoP) supported by Wiki technology and serves as Co-Chair of both the Federal Service-Oriented Architecture CoP and the Federal Semantic Interoperability CoP. He was a keynote speaker at the recent Gartner Spring EA Summit Conference on “Data and Information Architecture: Not Just for Enterprise Architects!” • See http://colab.cim3.net/file/work/SICoP/2007-06-14/SICoPGartner06142007A.ppt

  11. Overview • 1. Introduction • 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • 3. RDF from Data Tables and Relational Databases • 4. Examples • 5. Recommendations • 6. References • Appendix A. EPA’s 2007 Report on the Environment: Science Report Metadata Template

  12. 1. Introduction • It is always essential to start with definitions and specific examples that illustrate those definitions (9): • Data Architecture: Describes how data is processed, stored, and utilized in a given system. It provides criteria for data processing operations that make it possible to design data flows and also control the flow of data in the system. • A term commonly used in one of two senses: physical (e.g., servers) and logical or enterprise-wide (e.g. Zachman framework) • Data Modeling: A model that describes in an abstract way how data are represented in a business organization, an information system, or a database management system. • An Ontology is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them. See more detailed explanation of ontologies below. • Data Networks: Communication of data between computers. • For example, a semantic service oriented architecture.

  13. 1. Introduction • For even more fundamental definitions, it is recommended that the most trusted reference knowledge source be used (10) because common words have multiple meanings. • Incidentally, the use of WordNet to more precisely define data elements and improve data quality is part of the DRM 3.0 and Web 3.0 (5). • Obviously, the first critical step for the new Data Integration Workgroup Community of Interest is to decide what it means by Enterprise Data Integration, Enterprise Data Management Strategy, etc. and to define what is meant by “data” in each component of the schematic (see next slide) entitled “EPA’s 1st Generation Target Architecture Defined a Suite of Enterprise Tools”.

  14. 1. Introduction

  15. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • Last year, EPA staff and contractors were interviewed to document and critique what they had done on EPA Data Architecture Version 1 and the results were posted to a Wiki page (4) which in turn was linked to Wiki pages that document support for a number of collaborative interagency metadata-data management-data sharing activities (state of the environment, indicators, GIS-LandView, semantic interoperability, etc.) projects.

  16. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • The same is being done this year (see next slide) (5). • More specifically see: • “EPA Data Architecture for the Data Reference Model: Build to Exchange, Share, and Reuse” (14) and • “Building DRM 3.0 and Web 3.0 Knowledgebases: Where Do the Semantics Come From?” (15).

  17. 2. EPA Data Architecture for DRM 3.0 and Web 3.0

  18. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 DRM 1.0 SICoP All Three Unify DRM 3.0 Ontologies Source: Expanding E-Government, Improved Service Delivery for the American People Using Information Technology, December 2005, pp. 2-3. http://www.whitehouse.gov/omb/budintegration/expanding_egov_2005.pdf with annotations by the author.

  19. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • Specifically the purpose is to build on the EPA Data Architecture for DRM 2.0 work and evolve to Building DRM 3.0 and Web 3.0 Knowledgebases to create an “An Information Sharing Environment for the US EPA: The Semantics and Line of Sight of the Organization.” • An Information Sharing Environment includes: • Locate (e.g., Google), • Collaborate (e.g., Wiki), and • Integrate (e.g., Semantic Wiki).

  20. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • The pilot of an integrated EPA Information Sharing Environment (see next slide) using a Semantic Wiki includes: • EPA Strategic Plan: Charting Our Course (2006-2011) Knowledgebase. • U.S. EPA Enterprise Architecture 2007 Knowledgebase. • EPA's FY2006 Performance and Accountability Report Knowledgebase. • EPA's 2007 Report on the Environment (ROE) Knowledgebase.

  21. 2. EPA Data Architecture for DRM 3.0 and Web 3.0

  22. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 • Knowledgebases for Service Systems are defined as: • A semantic model = ontology(s) + the database of instances built as a social contract between those the know how to build them and those that need them (business partners). An ontology is a formal description of the meaning of the information used by software systems. Just like relational databases use SQL as a query language, ontologies developed using Semantic Web standards are queried with a query language called SPARQL. SPARQL is a simple yet powerful language. A single SPARQL query can combine the selection criteria based on the data values as well as their meaning. Unlike relational databases and SQL which are tightly bound to a specific data model, ontologies are highly flexible making it possible to: • (1) easily accommodate changes in the data model, and • (2) create generic queries that work in multiple situations and don't need changing when the data model must change.

  23. 2. EPA Data Architecture for DRM 3.0 and Web 3.0 Paradigm Shift for Enterprise Data Integration and Data Warehousing

  24. 3. RDF from Data Tables and Relational Databases • At the recent W3C/WSRI Workshop entitled “Toward More Transparent Government on eGovernment and the Web” (5) SICoP suggested that a clear message about the role of RDF in data exchange and a series of pilots using government data sources would help educate and demonstrate the value of the Semantic Web (aka the Data Web) to the Federal Government. • The W3C has a new Semantic Web Layer Cake (see next slide) in which RDF has moved into the XML space and has been expanded with query and rules! The W3C also has an upcoming Workshop on RDF Access to Relational Databases (see slides 26-27 for description).

  25. 3. RDF from Data Tables and Relational Databases

  26. 3. RDF from Data Tables and Relational Databases • RDF has generate renewed discussion among the Data Management Community and the new Semantic Web Layer Cake has engendered considerable discussion among the Ontology Community. • The upcoming Workshop on RDF Access to Relational Databases will draw members of the Semantic Web and relational database communities together to examine commonalities, distinctions and next steps for expressing relational data in RDF. • Consumers and potential consumers of RDF data will provide use cases and goals (e.g. SICoP). http://www.w3.org/2007/03/RdfRDB/cfp

  27. 3. RDF from Data Tables and Relational Databases • The ubiquity of relational data makes it an attractive next target for the Semantic Web. Much of the data that is used in automation is stored in relational databases. • RDF's grounding in universal terms makes RDF attractive to the relational database community. Expressing relational data in RDF allows them to join relational data with data in other databases or in other forms. • Despite the ubiquity and utility of relational data, connecting data between databases remains problematic and resource-intensive. Joining data between independently-developed relational databases requires tedious scripting, data warehousing, or tailored integration systems. • RDF queries accessing multiple relational databases have shown that RDF can be used to unify independent relational databases and link in external sources, e.g. documents and data from the Web. • The provenance for relational data in the RDF can also be expressed in RDF. All of this data will be available for access by query and rules languages. http://www.w3.org/2007/03/RdfRDB/cfp

  28. 4. Examples • The initial discussions of the Data Integration Workgroup Community of Interest suggests the “need for an integrated, agency-wide metadata management tool” and many other things. • However, the Semantic Web/Semantic Technology approach to data integration is not the data warehousing approach but the data reuse approach – putting the business and technical rules, logic, etc. into the data itself using markup languages (5) to benefit EPA and its data sharing partners. • Most, if not all, of EPA’s databases lack this added value that provides data independence – mobility from their storage systems. • A simple example is the fact that probably EPA’s most important databases and metadata are in four Adobe PDF files (or Word files), namely the four documents used in the pilot of an integrated EPA Information Sharing Environment in Section 2!

  29. 4. Examples • Specifically, EPA's 2007 Report on the Environment (ROE) (24) contains thorough documentation and standard metadata templates (see Appendix A) for the 86 indicators selected using six criteria based on EPA’s Information Quality Guidelines (25) and a Peer Review Process described in Appendix B of the report (24). • This was “reused” to create a knowledgebase (see next slide) that contains a metadata database that could be considered to be “an integrated, agency-wide metadata management tool” and is similar to interagency data sharing and reuse done for years in EPA's Guide to Selected National Environmental Statistics in the U.S. Government (26) and the Integration of Environmental Information and Indicators (27).

  30. 4. Examples

  31. 4. Examples • Supports DRM 3.0/Web 3.0 Data and Information Functionalities: • Data and Metadata: • Full text of documents, standards, meeting notes, etc. • Harmonization: • Different ways in which the same words are used. • Enhanced Search: • Across all content and showing context (e.g. words around the term or concepts) • Mashups: • A website or application that combines content from more than one source into an integrated experience (repurposing, reuse).

  32. 4. Examples The Summary Statistics of the Data Asset Database * One question without an indicator.

  33. 4. Examples

  34. 4. Examples • The individual data tables with their elements and attributes (recall slide 32) were compiled into 5 multi-sheet spreadsheets (Microsoft Excel), one for each of the 5 topics in the 2007 EPA Report on the Environment. • The multi-sheet spreadsheet for “water” is shown for the index (table of contents) and the Exhibit 5-2 indicator data tables (see slides 35 and 36, respectively)

  35. 4. Examples

  36. 4. Examples

  37. 5. Recommendations • Obviously, the first critical step for the new Data Integration Workgroup Community of Interest is to decide what it means by Enterprise Data Integration, Enterprise Data Management Strategy, etc. and to define what is meant by “data” in each component of the schematic entitled “EPA’s 1st Generation Target Architecture Defined a Suite of Enterprise Tools”. • Data architects need to be able to use a markup language (XML, RDF, and OWL) tools to capture the relationships between data, metadata, models, and metamodels to implement DRM 2.0 and DRM 3.0 and Web 3.0.

  38. 5. Recommendations • It is significant to note that more than half the indicators are from non-EPA agencies so data sharing and reuse is critical to EPA’s mission to reporting on the State of the Environment. • The knowledgebase contains an inventory of the data assets that shows the critical importance of standardized metadata and cross-agency data sharing to the mission of the US EPA and how the implementation of the data asset inventory supports the four functionalities of DRM 3.0 and Web 3.0, namely, Integration of Data and Metadata, Harmonization, Enhanced Search, and Mashups. • We are now looking for Agency best practice examples to build additional knowledgebases like Health USA, CIA Fact Book, etc., some of which have already been worked on in previous interagency projects.

  39. 5. Recommendations • A New Enterprise Data Management Strategy Based on: • The premise of reusing the data and information rather than changing the data systems themselves: • Putting the business and technical rules, logic, etc. into the data itself using markup languages. • The concepts and standards of the Semantic Web: • Also called the Data Web or Web 3.0. • The most important tenets of the reuse are: • Bring the data and the metadata back together. • Bring the structured and unstructured data and information back together. • Bring the data and information description and context back together.

  40. 5. Recommendations • A SICoP position paper for the W3C Workshop on RDF Access to Relational Databases makes Government data tables and relational databases (e.g. LandView 6 and 7 on DVD) readily accessible for Semantic Web pilots: • http://www.w3.org/2007/03/RdfRDB/cfp • http://landview.census.gov • Others are encouraged to do the same!

  41. 6. References • (2) Federal Enterprise Architecture’s Data Reference Model 2.0 • http://www.whitehouse.gov/omb/egov/a-5-drm.html • (3) EPA Data Architecture for DRM 2.0 • Wiki Page: http://colab.cim3.net/cgi-bin/wiki.pl?EPADataArchitectureforDRM2 • Print Version: http://colab.cim3.net/file/work/SICoP/EPADRM2.0/EPADADRM08072006.doc • (4) EPA Data Architecture for DRM 3.0 and Web 3.0 • Wiki Page: http://colab.cim3.net/cgi-bin/wiki.pl?EPADataArchitectureforDRM3Web3 • Print Version: http://colab.cim3.net/file/work/SICoP/EPADRM3.0/EPADADRM08032007.doc • (5) Toward More Transparent Government Workshop on eGovernment and the Web, United States National Academy of Sciences, Washington DC, USA , June 18-19, 2007, W3C and Web Science Research Initiative. • http://www.w3.org/2007/06/eGov-dc/agenda.html • (6) SICoP Special Briefing, August 9, 2007: • http://colab.cim3.net/cgi-bin/wiki.pl?SICoPSpecialBriefing_2007_08_09 • (7) Federal Semantic Interoperability Community of Practice (SICoP): • Wiki Page: http://colab.cim3.net/cgi-bin/wiki.pl?SICoP • Print Version: http://colab.cim3.net/file/work/SICoP/SICoPFactSheet.doc

  42. 6. References • (8) Consolidated Reference Model (CRM) Version 2.2, July 24, 2007. See pages 8-9. • http://www.whitehouse.gov/omb/egov/documents/FEA_CRM_v22_Final_July_2007.pdf • (9) Data Architecture, Modeling, and Networks, January 5, 2007, • http://colab.cim3.net/file/work/SICoP/EPADRM2.0/EPADataArchitecture01052007.ppt • (10) Princeton Wordnet: • http://wordnet.princeton.edu/ • (14) EPA Data Architecture for the Data Reference Model: Build to Exchange, Share, and Reuse, July 30, 2007: • http://colab.cim3.net/file/work/SICoP/EPADRM3.0/SICoPEPADRM307242007.ppt • (15) Building DRM 3.0 and Web 3.0 Knowledgebases: Where Do the Semantics Come From?, July 11, 2007 (Tutorial - especially slides 31-44 specific to EPA): • http://colab.cim3.net/file/work/SICoP/2007-07-11/SICoP07112007.ppt

  43. 6. References • (21) Spatial Ontology Community of Practice: • http://www.visualknowledge.com/wiki/socop • (22) LandView 6 and 7 (in process): • http://landview.census.gov • (23) June 14, 2007, Keynote at Gartner EA Summit, Data and Information Architecture: Not Just for Enterprise Architects!: • http://colab.cim3.net/file/work/SICoP/2007-06-14/SICoPGartner06142007A.ppt • (24) EPA’s 2007 Report on the Environment: Science Report: • http://cfpub.epa.gov/eroe/index.cfm • (25) EPA’s Information Quality Guidelines: • http://www.epa.gov/quality/informationguidelines/ • (26) EPA's Guide to Selected National Environmental Statistics in the U.S. Government: • http://www.sdi.gov/ • (27) Integration of Environmental Information and Indicators: • http://colab.cim3.net/file/work/SICoP/2007-01-25/SICoPSWRR01252007.ppt

  44. Appendix A. EPA’s 2007 Report on the Environment: Science Report Metadata Template • 1. Data Set Source • 2. Data Collection Date • 3. Data Collection Frequency • 4. Data Set Description • 5. Technical Questions: • (1) Are the physical, chemical, or biological measurements upon which this indicator is based widely accepted as scientifically and technically valid? • and 13 More…

More Related