A Primer for Data Methodology in the Cloud: Making Data Governance Work in Hybrid Environments

Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community. July 28, 2011

Webinar Description
• Establishing a foundation for data governance has never been more critical as federal agencies face more data center consolidation pressures. Many agencies are following the IT trend of breaking their problems into smaller pieces to make a complex problem more solvable. Your agency may be planning to send “some data” and “some applications” to the cloud, but do you have a methodology for optimizing your data once it’s spread across a hybrid environment?
• Join us to learn what you need to do to lay the groundwork for a good data governance program to support your agency’s consolidation goals:
  • Create views and models of your architecture
  • Maintain clear definitions of data, involved applications/systems and process flows
  • Leverage metadata for data governance processes
  • And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools

Speakers
• Moderator: Michael Smoyer, President, Digital Government Institute
  • The moderator will introduce speakers, coordinate logistics and Q&A with the "virtual" attendees.
• David Lyle, VP Product Strategy, Office CTO, Informatica
  • Co-author of “Lean Integration: An Integration Factory Approach to Business Agility”
• Brand Niemann, Director and Senior Data Scientist, Semantic Community
  • Author of over 50 Data Science Products in the Cloud for the US EPA and Data.gov

David Lyle
• He has co-authored two books, the latest published just last year: “Lean Integration: An Integration Factory Approach to Business Agility” (Addison-Wesley). The book shows how “Lean” and “Agile” thinking can be applied to information management projects: because these projects follow a relatively small number of repeating patterns, an assembly-line approach to those patterns delivers information to the business far faster, with less risk and at lower cost than traditional approaches.
• He spoke at DGI’s EA Conference about how the acceleration in volumes of data, together with the acceleration in technological “options” (cloud, appliances, SOA, etc.), makes this problem (called the “integration hairball” in the book) even worse.
• With Lean principles (focus on the customer, eliminate waste in processes from the customer’s perspective, and use technology to manage this complexity more efficiently), we have a fighting chance, not to make the simple tasks mundane, but to make the seemingly impossible tasks manageable.
• The goal is to create a better IT world where the “customer/citizen” can serve themselves (when appropriate), while IT keeps visibility, oversight and governance of what the “customer/citizen” is up to.
http://www.linkedin.com/in/davelyle

Brand Niemann
• Dr. Brand Niemann is the Director and Senior Data Scientist of the Semantic Community. He was formerly Senior Enterprise Architect and Data Scientist at the U.S. Environmental Protection Agency and co-led the Federal CIO Council’s Semantic Interoperability Community of Practice (SICOP) with Mills Davis from 2003 to 2008. He is currently authoring a series of editorials for Federal Computer Week on his work and recently made Spotfire's Twitter list for his cool visualizations of government data, which aim to produce more transparent, open and collaborative business analytics applications.
  • http://semanticommunity.info/A_Gov_2.0_spin_on_archiving_2.0_data
  • http://spotfireblog.tibco.com/?p=5328
• He is working as a data journalist for AOL Government due to launch July 11th.
  • http://semanticommunity.info/AOL_Government
• He is also helping organize the 12th SOA for eGov Conference, October 11th.
  • http://semanticommunity.info/Federal_SOA

Preface
• Thank you for the opportunity to present.
• Primer (basic), Methodology (real-world example), and Cloud (tools I used).
• Real-world example: EPA Apps for the Environment Challenge, a good place to start and learn, since agency data governance is already in place and can be built on!
• Some metrics: about 50 Data Products, over 100 Spotfire Visualizations, nine Data Stories for Federal Computer Week this year and 15 for AOL Government. Google “AOL Government Brand Niemann” to see the three that have been published since the July 13th launch.

Overview
• Data Center Consolidation Initiative: Send agency data to Data.gov and to the Cloud and close data centers.
• My solution was and is: Put My EPA Desktop in the Cloud in Support of the Open Government Directive and a Data.gov/Semantic
• Published a Paper April 19, 2010
• Data Governance Program to Support Your Agency’s Consolidation Goals:
• My solution was and is:
• Create views and models of your architecture
• Maintain clear definitions of data, involved applications/systems and process flows
• Leverage metadata for data governance processes
• And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
• Using the EPA Apps for the Environment Challenge

EPA Apps for the Environment Challenge • Applications for the challenge must use EPA data and be accessible via the Web or a mobile device. EPA experts will select a winner and runner up in each of two categories: Best Overall App and Best Student App. In addition, the public will vote for a “People’s Choice” winner. Apps will be judged based on their usefulness, innovation, and ability to address one or more of EPA Administrator Lisa P. Jackson’s seven priorities for EPA’s future. Winners will receive recognition from EPA on the agency’s website and at an event in Washington, DC in the fall, where they can present their apps to senior EPA officials and other interested parties. Source: http://www.epa.gov/appsfortheenvironment/

EPA Apps for the Environment Challenge • EPA challenges you to find new ways to combine and deliver environmental data in a new app. In the Apps for the Environment challenge, you have free rein to make an app that uses EPA data, addresses one of Administrator Lisa Jackson’s Seven Priorities, and is useful to communities or individuals. EPA encourages you to use other environmental and health data too. The winners will be honored at a recognition event in Washington, D.C. this fall and the winning apps will be publicized on EPA’s website. Source: http://www.epa.gov/appsfortheenvironment/

Create views and models of your architecture
• Unstructured to structured information view and model
• Supports the Sitemaps.org and Schema.org protocols
http://semanticommunity.info/AOL_Government/EPA_Announces_Apps_for_the_Environment_Challenge#Apps_for_the_Environment

Maintain clear definitions of data, involved applications/systems and process flows
• Data set inventory and data element dictionary
• Workflow for Phases I (Preparation) and II (Applications)
http://semanticommunity.info/@api/deki/files/13015/=EPAApps.xlsx

Leverage metadata for data governance processes
• The EPA TRI 2009 has 99 data elements defined in a 30-page PDF file, which are exposed here with well-defined URLs (Getting to the Five Stars of Linked Open Data).
http://semanticommunity.info/EPA/EPA_Toxic_Release_Inventory_2009#Record_Layout
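
One way to picture giving each data element its own well-defined URL is sketched below. This is only an illustration under stated assumptions: the page address, the element names, and the `element_anchor` naming rule are hypothetical and are not taken from the TRI record layout.

```python
import re


def element_anchor(name):
    """Turn a data element name into a URL fragment,
    e.g. 'Facility Name' -> 'Facility_Name'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_")


def element_urls(page_url, names):
    """Map each data element name to a stable URL on the dictionary page."""
    return {name: f"{page_url}#{element_anchor(name)}" for name in names}


# Hypothetical dictionary page and element names, for illustration only.
PAGE = "http://example.org/dictionary"
urls = element_urls(PAGE, ["Facility Name", "Reporting Year"])
for name, url in urls.items():
    print(f"{name}: {url}")
```

A stable fragment per element lets a data dictionary entry be cited from a spreadsheet, a wiki page, or a visualization without copying the definition.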

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
• The data sets, data dictionaries, and links to data sources and metadata are integrated here.
PC Desktop Spotfire

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
• Phase I identifies Data Quality Issues: The Guam Brownfields site is obviously mislocated (see outlier to extreme right in the Scatter Plot below). It should be a negative Longitude and have a larger value.
Spotfire Web Player
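
The kind of Phase I check described on this slide can also be sketched outside Spotfire. The following is a minimal illustration in plain Python, not the workflow used in the presentation: the site names, coordinates, and expected longitude range are hypothetical values chosen for the example.

```python
from typing import NamedTuple


class Site(NamedTuple):
    name: str
    latitude: float
    longitude: float


def flag_longitude_outliers(sites, lon_min, lon_max):
    """Return the sites whose longitude falls outside the expected range."""
    return [s for s in sites if not (lon_min <= s.longitude <= lon_max)]


# Hypothetical records; not taken from the EPA Brownfields data set.
sites = [
    Site("Site A", 38.9, -77.0),
    Site("Site B", 34.1, -118.2),
    Site("Site C", 13.4, 44.8),  # far from the other points
]

# Assumed range for this illustration only.
suspects = flag_longitude_outliers(sites, lon_min=-180.0, lon_max=-60.0)
for s in suspects:
    print(f"Check coordinates for {s.name}: longitude {s.longitude}")
```

A range check like this only flags candidates for review; deciding what the correct coordinate should be still requires going back to the source record.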

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools Socrata at Data.gov

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
• Smart Mapping: Automatic Creation of Information Models:
• Spotfire 3.3 Information Services users can automatically generate 1-to-1 mappings of the existing tables and columns in their Data Sources. Just generate a Data Source in Spotfire, then right-click it and select “Create Default Information Model…” This helps a lot when the work has already been done to nicely model and expose tables for business applications such as Spotfire, so the mapping step is more about transparency than transformation. For example, if you use Spotfire Application Data Services, you do the work in ADS to expose Spotfire-ready tables and columns, so a simple transparent mapping of those elements through Spotfire Information Services can now be accomplished in one click. Note that the automated creation will work through nested levels of data objects in the data source you supply.
• The result is a folder structure that matches the catalogs, schemas, etc. that were selected, with a column element for each column and an information link for each table containing those column elements. Procedures will get a procedure element and an information link of their own if they return data.
• See next slide.
http://semanticommunity.info/@api/deki/files/10975/=Whats_New_in_Spotfire_3.3.pdf

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
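
The shape of a 1-to-1 default mapping, as described in the Smart Mapping slide, can be pictured with a small generic sketch. It does not call any Spotfire API; the function name, the dictionary layout, and the sample schema are assumptions made for this illustration only.

```python
def default_information_model(source):
    """Build a nested folder structure that mirrors the data source.

    `source` maps schema name -> table name -> list of column names.
    Each table yields one column element per column and one
    information link that references those column elements.
    """
    model = {}
    for schema, tables in source.items():
        folder = {}
        for table, columns in tables.items():
            elements = [f"{schema}/{table}/{col}" for col in columns]
            folder[table] = {
                "column_elements": elements,
                "information_link": {"name": table, "columns": elements},
            }
        model[schema] = folder
    return model


# Hypothetical data source layout.
source = {
    "tri": {
        "facility": ["id", "name"],
        "release": ["facility_id", "pounds"],
    }
}
model = default_information_model(source)
print(model["tri"]["release"]["information_link"]["columns"])
```

Because nothing is renamed or transformed, the mapping stays transparent: every element in the model can be traced back to exactly one column in the source.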

And clearly define the integration and interfaces among the various platform tools and between platform tools with other repositories and vendor tools
• Semantic Community Workflow:
  • Information Architecture of Public Web Pages in Spreadsheets as Linked Open Data.
  • Public Reports (Web and PDF) in Wiki as Linked Open Data.
  • Desktop and Network Databases in Wiki and Spreadsheets in Linked Open Data Format.
  • Spreadsheets in Spotfire as Linked Open Data.
  • Spreadsheets in Semantic Insights Research Assistant for Semantic Search, Report Writing, and Ontology Development.

Questions and Answers • Now and Later: • Brand Niemann • Director and Senior Data Scientist • Semantic Community • http://semanticommunity.info • bniemann@cox.net

Supplemental Slides
• 7.1 Semantic Technology Training: Building Knowledge-Centric Systems
  • KM 2011
  • SemTech 2011
• 7.2 W3C Government Linked Data Working Group
  • Clinical Quality Linked Data on Health.data.gov
  • Build Clinical Quality Linked Data on Health.data.gov in the Cloud
  • Hospital Compare Downloadable Database Example of "5 Star Government Data"
• 7.3 Library of Congress Project Recollection and Digital Preservation Initiative
• 7.4 Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011)
  • Build TWC in the Cloud
  • Build NCI CLASS in the Cloud
  • Build the NYC Data Mine Health in the Cloud
  • Build SciVerse Apps in the Cloud (IN PROCESS)
• 7.5 Be Informed (IN PROCESS)

7.1 Semantic Technology Training: Building Knowledge-Centric Systems http://semanticommunity.info/FOSE_Institute/Knowledge_Management

7.1 Semantic Technology Training: Building Knowledge-Centric Systems http://semanticommunity.info/Semantic_Technology_Conferences

7.2 W3C Government Linked Data Working Group
• The mission of the Government Linked Data (GLD) Working Group is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies.
• This group will develop standards-track documents and maintain a community website in order to help governments at all levels (from small towns to nations) share their data as high quality ("five-star") linked data.
• The Working Group will construct and maintain an online directory of the government linked data community.
• "Cookbook" Advice Site
• The group will produce Best Practices for Publishing Linked Data.
• The group will develop Standard Vocabularies.
• First Face-to-Face Meeting, June 29-30th, NSF, Arlington, VA.
http://www.w3.org/2011/gld/charter

7.2 Open Public Dataset Catalogs Faceted Browser http://semanticommunity.info/Data.gov/An_Open_Data_Public_Dataset_Catalogs_Faceted_Browser

7.2 Linked Data Cookbook
• Linked Data is an evolving set of techniques for publishing and consuming data on the Web. Learn how Linked Data can turn the Web into a distributed database and how you can participate. In this session, Bernadette Hyland takes the mystery out of Linked Data by summarizing seven steps to prepare your data sets as Linked Data and announce them so others will use them.
• Model without context: There is a Process: Identify, Model, Name, Describe, Convert, Publish, and Maintain. I Disagree!
• Participants will understand the actual steps to produce high quality, useful data sets that can be modeled, transformed, documented and made available on the Linked Data cloud. We'll discuss a recent government agency that did just this in less than 12 weeks. Best practices for data publishing as well as the "social contract" one makes as a publisher will be discussed.
• Better to make progress with something rather than do nothing because we cannot be comprehensive and complete. I Disagree!
• Bernadette oversees strategy for Talis' North American clients. She brings a strong background in commercial and government data management strategies, coupled with expertise in leading high-growth software organizations. Prior to joining Talis, Bernadette was CEO of several profitable Internet companies delivering scalable Web-based solutions for the enterprise, including Zepheira LLC and Tucana Technologies Inc., a pioneer in the emerging semantic technology community.
http://semtech2011.semanticweb.com/sessionPop.cfm?confid=62&proposalid=3822

7.2 Linked Data Cookbook
• 1. Leverage what exists.
  • Obtain data extracts (i.e., databases and/or spreadsheets) or create data in a way that can be replicated.
• 2. Model data without context to allow for reuse and easier merging of data sets.
  • With LD, application logic does not drive the data schema, concepts, etc.
• 3. Look for real world objects of interest (e.g., people, places, things, locations, etc.) and model them.
  • Use common sense to decide whether or not to make a link. I Disagree!
• 4. Connect data from different sources and authoritative vocabularies (see list of popular vocabularies below).
  • Put aside immediate needs of any application. I Disagree!
  • Don’t think about how an application will use your data. I Disagree!
• 5. Write a script or process to convert the data set repeatedly.
• 6. Publish to the Web and announce it! (more details shortly).
• 7. Maintenance strategy (more details in the social contract at the end).
http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook
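
Step 5 of the cookbook (a repeatable conversion script) might look like the following minimal sketch, which turns CSV rows into N-Triples lines using only the Python standard library. The `http://example.org/` namespace, the `facility` path, the column names, and the sample rows are all hypothetical; the sketch also assumes that identifiers and column names are already safe to embed in a URI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(text, id_column):
    """Convert CSV text to N-Triples lines, one per non-empty cell."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}facility/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the key column and empty cells
            predicate = f"<{BASE}def/{column}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Hypothetical extract; the second row has no state value.
sample = "id,name,state\n101,Acme Plant,VA\n102,Bolt Works,\n"
triples = csv_to_ntriples(sample, id_column="id")
print("\n".join(triples))
```

Because the script is deterministic, it can be rerun whenever the source extract is refreshed, which is the point of making the conversion repeatable.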

7.2 Linked Data Cookbook
• Guidelines for merging:
  • URIs name the resources we are describing.
  • Two people using the same URI are describing the same thing.
  • The same URI in two datasets means the same thing.
  • Graphs from several different sources can be merged.
  • Resources with the same URI are considered identical.
  • No limitations on which graphs can be merged.
• For a government agency ... a data policy is “a must”:
  • specify data quality and retention, treatment of data thru secondary sources, restrictions for use, frequency of updates, public participation, and applicability of this data policy. I Agree!
http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook
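
The merging guidelines can be illustrated with a small sketch that treats each graph as a set of (subject, predicate, object) triples. The URIs and values are hypothetical, and the sketch ignores blank nodes, which need extra care in a real merge.

```python
# Two graphs from different sources; all URIs are hypothetical.
graph_a = {
    ("http://example.org/hospital/1", "http://example.org/def/name", "General Hospital"),
}
graph_b = {
    ("http://example.org/hospital/1", "http://example.org/def/state", "VA"),
    ("http://example.org/hospital/2", "http://example.org/def/state", "MD"),
}

# Merging is a set union: the same URI in both graphs names the same resource.
merged = graph_a | graph_b


def describe(graph, uri):
    """Return the sorted (predicate, object) pairs known about one resource."""
    return sorted((p, o) for s, p, o in graph if s == uri)


for predicate, obj in describe(merged, "http://example.org/hospital/1"):
    print(predicate, obj)
```

After the union, the facts about `hospital/1` from both sources appear together without any join logic, because the shared URI does the work of the key.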

7.2 Linked Data Cookbook http://www.slideshare.net/bhylandwood/bernadette-hyland-semtech-2011-west-linked-data-cookbook

7.2 Clinical Quality Linked Data on Health.data.gov http://www.data.gov/communities/node/81/blogs/4920 See Next Slide

7.2 Clinical Quality Linked Data on Health.data.gov http://health.data.gov/def/hospital/Hospital

7.2 Clinical Quality Linked Data on Health.data.gov http://health.data.gov/doc/hospital/393303.csv

7.2 Clinical Quality Linked Data on Health.data.gov http://www.slideshare.net/george.thomas.name/clinical-quality-linked-data-on-healthdatagov

7.2 Health data innovation 'at a crawl'
• The health care data community should step up its efforts to innovate to help improve the nation’s health outcomes and reduce costs, Health and Human Services Secretary Kathleen Sebelius said at the department’s second Health Data Initiative Forum on June 9.
• “Use tools and use data,” Sebelius said at the forum held at the National Institute of Medicine campus in Bethesda, Md. “Do it more, do it better and do it faster.”
• Sebelius said Americans experience a “triple loss” due to having the highest public health care costs, highest private health care costs, and only mediocre health outcomes.
• The goal of the conference was to present 45 winning health care IT applications developed with HHS’ newly-available data sets within the last several months. HHS CTO Todd Park called the event a “Health Data Palooza” that would showcase innovation in health IT.
• PearlDiver Technologies Inc. and Semantic Community were among the finalists!
http://fcw.com/articles/2011/06/09/nation-needs-more-health-data-innovation-sebelius-says-at-forum.aspx

PearlDiver Data Engine & Semantic Community Data Visualization
Health Data Initiative Forum Submission
Medicare Zombie Hunter
Benjamin Young, PearlDiver Technologies Inc.
Brand Niemann, Semantic Community

7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud http://semanticommunity.info/Semantic_Technology_Conferences/Clinical_Quality_Linked_Data_on_Health.data.gov

7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud http://semanticommunity.info/Semantic_Technology_Conferences/Clinical_Quality_Linked_Data_on_Health.data.gov/Hospital_Compare_Downloadable_Database_Metadata

7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud PC Desktop Spotfire

7.2 Build Clinical Quality Linked Data on Health.data.gov in the Cloud Spotfire Web Player

7.3 Library of Congress Project Recollection and Digital Preservation Initiative The Library of Congress and MIT are developing a Semantic Web Browser (Exhibit and now Exhibit 3) to do essentially what Spotfire already does!

7.3 Library of Congress Project Recollection and Digital Preservation Initiative PC Desktop Spotfire

7.3 Library of Congress Project Recollection and Digital Preservation Initiative http://semanticommunity.info/Semantic_Technology_Conferences/Library_of_Congress

7.3 Library of Congress Project Recollection and Digital Preservation Initiative Interoperability Interface! Spotfire Web Player

7.4 Elsevier/Tetherless World Health and Life Sciences Hackathon (27-28 June 2011) http://semanticommunity.info/Build_TWC_in_the_Cloud

7.4 NYC Data Web http://knoodl.com/ui/groups/NYC_Homepage

7.4 NYC Data Web Quote: Ontology architecture is a new aspect of system architecture and development, to our knowledge it has not been employed anywhere else in DOD. http://semanticommunity.info/Semantic_Technology_Conferences/NY_Data_Mine/Revelytix

7.4 NYC Data Web http://semanticommunity.info/Semantic_Technology_Conferences/NY_Data_Mine/Revelytix#Dashboard

7.4 NYC Data Web PC Desktop Spotfire

7.5 Be Informed
• A recent paper describes the formalism and rationale that Be Informed applies to business process modeling. It explains how and why goal-oriented modeling differs from more conventional business process modeling which is procedural. In the near-term, there is applicability for many government agencies, especially for those exploring semantic approaches.
• For example, Dennis Wisnosky advocates semantic web (RDF & OWL) standards for modeling data integration, and a dialect of BPMN for modeling processes. The metaphor for processes is an electronic circuit specification that uses standard building blocks. "We all know what those primitives mean." Previous, costly attempts at business process modeling were failures in part because there was no standard at the primitive level.
• However, as this paper makes clear, just having unambiguous primitives is only part of what is needed to specify and manage complex and dynamic business processes. Modeling flow in swim lanes is less agile than modeling goals, activities, and pre and post conditions.
Source: Mills Davis, Project10x, July 5, 2011.

7.5 Be Informed Fig. 1. Summary of the Meta Model for Capturing Business Processes Source: Specifying Flexible Business Processes using Pre and Post Conditions, Jeroen van Grondelle and Menno Gulpers, Be Informed BV, Apeldoorn, The Netherlands, 13 pp.
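
The contrast between procedural flow and modeling with goals, activities, and pre and post conditions can be illustrated with a toy sketch. This is not the Be Informed formalism or meta model; the `Activity` class, the helper functions, and the permit example are hypothetical and exist only to show the idea that execution order falls out of the conditions rather than being drawn as a fixed flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    pre: frozenset   # facts that must hold before the activity can run
    post: frozenset  # facts that hold after the activity has run


def enabled(activities, state):
    """Activities whose preconditions hold and whose effects are still missing."""
    return [a for a in activities if a.pre <= state and not a.post <= state]


def run_until_goal(activities, state, goal):
    """Run enabled activities until the goal holds; return the order used."""
    state = set(state)
    order = []
    while not goal <= state:
        candidates = enabled(activities, state)
        if not candidates:
            raise RuntimeError("goal cannot be reached from this state")
        chosen = candidates[0]
        state |= chosen.post
        order.append(chosen.name)
    return order


# Hypothetical permit process, deliberately listed out of order.
activities = [
    Activity("issue permit",
             frozenset({"application received", "fee paid"}),
             frozenset({"permit issued"})),
    Activity("collect fee",
             frozenset({"application received"}),
             frozenset({"fee paid"})),
    Activity("receive application",
             frozenset(),
             frozenset({"application received"})),
]
order = run_until_goal(activities, set(), {"permit issued"})
print(order)
```

Adding a new activity only requires stating its conditions; no existing flow diagram has to be redrawn, which is one reading of why condition-based models are described as more agile than swim lanes.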
