Making Data a First Class Citizen 38 Degrees: An AOL Gov Conference Series

Making Data a First Class Citizen 38 Degrees: An AOL Gov Conference Series Dr. Brand Niemann Director and Senior Enterprise Architect – Data Scientist Semantic Community http://semanticommunity.info/ AOL Government Blogger http://gov.aol.com/bloggers/brand-niemann/ September 18-19, 2012

Overview • September 18th Tutorial (90 minutes): Making Data a First Class Citizen • Digital Agenda for Europe and Building a Digital Government US Examples: • See: http://cms.aol.com/809/content/posts/edit/20264973/ • See: http://gov.aol.com/2012/06/06/health-datapalooza-a-model-of-innovation/ • Recommended APIs and Data Sets (in process): • http://semanticommunity.info/AOL_Government/Data_Services_for_Developers • Results of Competition with Recommended APIs and Data Sets • TBA • September 19th Presentation (15 minutes): Making Data a First Class Citizen: • Summary of Three Topics Above

Outline • Data • Data Scientist • Data Science Products • Data Science Teams • Tutorials

Data • Table: Rows and Columns • Relational Database: Key Field for Multiple Tables • Unstructured: Linked Data, NoSQL, & RDF Graphs • Big: Volume, Velocity, Variety, and Value/Veracity • Architecture: Business and Science, Frameworks, & Infrastructure • Major Developments: Google Bigtable and Amazon Dynamo
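To make the "Key Field for Multiple Tables" bullet concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and chosen only for illustration; they are not from the tutorial.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tables that share a key field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agency (
            agency_id INTEGER PRIMARY KEY,  -- the key field
            name      TEXT
        );
        CREATE TABLE dataset (
            dataset_id INTEGER PRIMARY KEY,
            agency_id  INTEGER REFERENCES agency(agency_id),  -- same key, second table
            title      TEXT
        );
        """
    )
    # Hypothetical rows, for illustration only.
    conn.executemany("INSERT INTO agency VALUES (?, ?)",
                     [(1, "Agency A"), (2, "Agency B")])
    conn.executemany("INSERT INTO dataset VALUES (?, ?, ?)",
                     [(10, 1, "Dataset X"), (11, 1, "Dataset Y"), (12, 2, "Dataset Z")])
    return conn


def datasets_by_agency(conn):
    """Join the two tables on the shared key field."""
    cursor = conn.execute(
        """
        SELECT a.name, d.title
        FROM agency AS a
        JOIN dataset AS d ON d.agency_id = a.agency_id
        ORDER BY a.name, d.title
        """
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for name, title in datasets_by_agency(build_demo_db()):
        print(name, "->", title)
```

Each row in the second table carries only the key of its parent row, so the agency name is stored once and recovered by the join.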

Data Scientist • A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. • The title data scientist is sometimes disparaged because it lacks specificity and can be perceived as an aggrandized synonym for data analyst. Regardless, the position is gaining acceptance with large enterprises who are interested in deriving meaning from big data, the voluminous amount of structured, unstructured and semi-structured data that a large enterprise produces. • A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others. Source: http://searchbusinessanalytics.techtarget.com/definition/Data-scientist

Tim O’Reilly: The World’s 7 Most Powerful Data Scientists • Tim O'Reilly is the founder of O'Reilly Media • "The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data. "Data Scientist" is now the hottest job title in Silicon Valley." • Source: http://www.forbes.com/pictures/lmm45emkh/tim-oreilly-is-the-founder-of-oreily-media/#gallerycontent

#1 Larry Page, CEO, Google • Google, more than any other company, has pushed the boundaries of what is possible with big data. Along with Sergey Brin, he built the search engine that tamed the web and solved the problem posed by John Wanamaker a century ago ("Half the money I spend on advertising is wasted; the trouble is I don't know which half."). And in his quest to provide access to all the world’s information, he has accumulated the largest database on the planet.

#2 Jeff Hammerbacher, Chief Scientist, Cloudera and DJ Patil, Entrepreneur-in-Residence, Greylock Ventures • Hammerbacher and Patil coined the term "data scientist." Now it’s Silicon Valley's hottest job title. These two built the first formal data science teams at Facebook and LinkedIn, respectively. Now at Cloudera, Hammerbacher has been key to driving the success of Hadoop as a standard tool for processing large, unstructured data sets with a network of commodity computers. As Data Scientist in Residence at Greylock, Patil is seeking out the next generation of hot data-driven startups.

#3 Sebastian Thrun, Professor, Stanford University and Peter Norvig, Data Scientist, Google • When Thrun and Norvig decided to teach their Stanford course, Introduction to Artificial Intelligence, over the internet, they managed to sign up over 140,000 students and proved that AI is no longer just an academic subject. Norvig is Google's chief scientist. Thrun is leading Google’s efforts to build a self-driving car that relies on AI algorithms and the memory of hundreds of thousands of miles driven by Google’s street view vehicles, recording and measuring everything they saw.

#4 Elizabeth Warren, Candidate, U.S. Senate (Massachusetts) • The banking system excesses that led to the economic crash of 2008 are an example of big data gone wrong. As the provisional head of the Consumer Financial Protection Bureau, Elizabeth Warren began the job of building the algorithmic checks and balances needed to counter the sorcerer's apprentices of Wall Street. In her campaign for the US Senate, she promises to continue that fight.

#5 Todd Park, CTO, Department of Health and Human Services • Park is leading the charge to transform American healthcare into a data driven business. From medical diagnostics to insurance reimbursement to community health statistics, he is finding ways to use data to make healthcare more effective and affordable.

#6 Alex "Sandy" Pentland, Professor, MIT • Sandy is not only a wide-ranging polymath, he's providing the intellectual leadership on how sensors, the internet of things, geolocation and promiscuous connectivity can be used to uncover insights regarding human behavior. Sandy is also looking at privacy - an important adjunct to the data space - and helping develop the conversation regarding the trade-offs between privacy and the value of personal data.

#7 Hod Lipson and Michael Schmidt, Computer Scientists, Cornell University • Cornell computer scientists Hod Lipson and Michael Schmidt created an AI program that could distill the laws of motion merely by observing data from the swings of a pendulum. In the process, they kicked off the field of robotic science in which AIs try to derive meaning from datasets too large or complex for humans to study.

Data Science Products • Introduction to Data Science (Spring 2012): • Course Information: • Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term “Data Science”. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code. • Instructors: • Jeff Hammerbacher and Mike Franklin and Guest Speakers • Components: • Data preparation • Data presentation • Data products • Observation • Experimentation • Final Project • Resources (Fabulous!): • http://datascienc.es/resources/ Source: http://datascienc.es/

Data Science Products
• My Process Model (Jeff Hammerbacher):
  1. Identify problem
  2. Instrument data sources
  3. Collect data
  4. Prepare data (integrate, transform, clean, impute, filter, aggregate)
  5. Build model
  6. Evaluate model
  7. Communicate results
• Jim Gray (“River of Data”):
  1. Capture
  2. Curate
  3. Communicate
• Data Preparation:
  • HTML tables
  • File downloads
  • REST APIs
• Exercises:
  • 2012 Presidential Campaign Finance website
  • http://elections.nytimes.com/2012/campaign-finance
  • MINE: dataset was used for a competition hosted by Kaggle
  • UNZIP TAR: votes made by users of a social news website TAR
  • http://datascienc.es/2011final-project/
Source: http://datascienc.es/
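Step 4 of the process model ("Prepare data") can be illustrated with a small sketch in plain Python. It covers only the transform, clean, filter, impute, and aggregate sub-steps (integration across sources is omitted). The records, field names, and rules below (drop negative amounts, fill missing amounts with the mean) are assumptions made for illustration and are not taken from the course material or any real campaign finance data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records, as they might arrive from a file download.
RAW = [
    {"candidate": "A", "state": "VA", "amount": "100.0"},
    {"candidate": "A", "state": "MD", "amount": ""},      # missing value
    {"candidate": "B", "state": "VA", "amount": "50.0"},
    {"candidate": "B", "state": "va ", "amount": "-5"},   # invalid amount
    {"candidate": "A", "state": "VA", "amount": "300.0"},
]


def prepare(records):
    """Transform, clean, filter, impute, and aggregate a list of records."""
    # Transform and clean: parse amounts, normalise state codes.
    parsed = []
    for rec in records:
        text = rec["amount"].strip()
        parsed.append({
            "candidate": rec["candidate"],
            "state": rec["state"].strip().upper(),
            "amount": float(text) if text else None,
        })

    # Filter: drop records with a negative amount (assumed rule).
    kept = [r for r in parsed if r["amount"] is None or r["amount"] >= 0]

    # Impute: replace missing amounts with the mean of the known ones.
    known = [r["amount"] for r in kept if r["amount"] is not None]
    fill = mean(known) if known else 0.0
    for r in kept:
        if r["amount"] is None:
            r["amount"] = fill

    # Aggregate: total amount per candidate.
    totals = defaultdict(float)
    for r in kept:
        totals[r["candidate"]] += r["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(prepare(RAW))
```

Mean imputation is only one of several possible choices; the right rule depends on the problem identified in step 1.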

Data Science Products • Chief Data Officer for a Day: • Your team has been tasked with enabling your organization to “compete on analytics” • 1. Define the top three priorities of the organization • 2. Determine the data sources you’d like to collect • 3. Highlight the largest data integration challenges you’ll face • 4. Determine the most important data to present to your organization • 5. What data products could you build? • 6. What studies could you run to answer the most pressing questions for the organization? • 7. Suggest some experiments to run to help guide the organization towards their goals

Data Science Products • AOL Government (Wyatt Kash, Editorial Director): • Clear Compelling Headline • Original Graphic • Contextual Introduction Sentence(s) • Descriptive Paragraph • Chart Itself • Individual Static Graphics • Spotfire Interactive Visualizations • Caption • Source • Rate this chart • BBC (Andrew Leimdorfer, BBC News Interactive and Graphics and Olivier Thereaux, BBC R&D) • Six Tabs: The story, The figures, Explore the data (including download the full data), Analysis: 1, Analysis: 2. Methodology, and Your comments. NOTE: Examples of these are provided in the actual tutorial.

Data Science Products 1. Identify who keeps the data and how it is kept 2. Download and prepare the data 3. Create a database 4. Double-checking and analysis Source: http://datajournalismhandbook.org/1.0/en/
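Steps 2 through 4 above can be sketched with the Python standard library alone. This is one possible illustration, not the handbook's own method: the CSV text stands in for a downloaded file, and the column names and values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a downloaded CSV file.
CSV_TEXT = (
    "region,year,value\n"
    "North,2011,10\n"
    "North,2012,12\n"
    "South,2011,7\n"
    "South,2012,9\n"
)


def load_csv(text):
    """Step 2: read the downloaded text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def create_database(rows):
    """Step 3: load the prepared rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations (region TEXT, year INTEGER, value REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (:region, :year, :value)", rows
    )
    return conn


def double_check(conn, rows):
    """Step 4a: confirm that every source row reached the database."""
    (count,) = conn.execute("SELECT COUNT(*) FROM observations").fetchone()
    return count == len(rows)


def totals_by_region(conn):
    """Step 4b: a simple analysis, total value per region."""
    cursor = conn.execute(
        "SELECT region, SUM(value) FROM observations "
        "GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    rows = load_csv(CSV_TEXT)
    conn = create_database(rows)
    print("row count matches source:", double_check(conn, rows))
    print(totals_by_region(conn))
```

Comparing the row count against the source is a very basic check; real projects would also compare totals against published figures.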

Data Science Products • Data Journalism Handbook Excerpts: • The data journalism project brought a lot of people into the room who do not normally meet at the ABC. In lay terms — the hacks and the hackers. Many of us did not speak the same language or even appreciate what the other does. Data journalism is disruptive! • The practical things: • Co-location of the team is vital. Our developer and designer were off-site and came in for meetings. This is definitely not optimal! Place them in the same room as the journalists. • Our consultant EP was also on another level of the building. We needed to be much closer, just for the drop-by factor. • Choose a story that is solely data driven.

Data Science Teams • Building Data Science Teams • Figure 1. The rise in demand for data science talents • Being Data Driven • The Roles of a Data Scientist • Decision sciences and business intelligence • Product and marketing analytics • Fraud, abuse, risk and security • Data services and operations • Data engineering and infrastructure • Organizational and reporting alignment • What Makes a Data Scientist? • Hiring and talent • Would we be willing to do a startup with you? • Can you “knock the socks off” of the company in 90 days? • In four to six years, will you be doing something amazing? • Building the LinkedIn Data Science Team • Reinvention • About the Author http://semanticommunity.info/AOL_Government/Data_Science_for_the_Government_Community/Building_Data_Science_Teams

Tutorial: Introduction to Open Government Data
• Understanding the Foundations of Open Data
  • What makes data open
  • Why countries share data
  • Why people want open data
• Making Data Open, Accessible, and Discoverable
  • Policies
  • Processes
  • Change Management
• Selecting and Managing Open Data Technologies
  • Commercial solutions
  • Open source platforms
  • Semantic web and linked data
• Creating an Open Data Ecosystem
  • Sustaining data publishing
  • Engaging developers, citizens, and politicians: from communities to hackdays to challenges
  • Ensuring use and economic benefits
• Measuring the Benefits
• Creating Your Own Open Data Roadmap
• Sustaining and Communicating Your Success
Monday, July 9, 2012
Time: 12 noon - 4:30 p.m.
Where: World Bank Headquarters, 1818 H Street, NW, Washington, DC 20433
Participants: Data stewards, open data managers, chief information officers, open data advocates, and developers
Workshop Leaders: Jim Hendler, Tetherless World Constellation Professor, Rensselaer Polytechnic Institute and Jeanne Holm, Evangelist, Data.gov
Source: http://semanticommunity.info/AOL_Government/Invitation_to_International_Open_Government_Data_Conference#Agenda

Tutorial: Introduction to Open Government Data • Highlights based on July 9th presentation: • Understanding the Foundations of Open Data - Having some mandate or directive to do so • Making Data Open, Accessible, and Discoverable - Getting people to release their data • Creating an Open Data Architecture - Having a platform to access and discover data and build apps • Creating an Open Data Ecosystem - Dealing with change management (policies, culture, compliance) • Measuring the Benefits - Very difficult to do • Summary and Next Steps - Go out and build your own Data.gov

Tutorial: Making Data a First Class Citizen • Digital Agenda for Europe and Building a Digital Government US Examples • See: • http://semanticommunity.info/AOL_Government/Digital_Agenda_for_Europe • http://cms.aol.com/809/content/posts/edit/20264973/ • See: • http://semanticommunity.info/HealthData.gov • http://gov.aol.com/2012/06/06/health-datapalooza-a-model-of-innovation/ • Recommended APIs and Data Sets (in process) • http://semanticommunity.info/AOL_Government/Data_Services_for_Developers NOTE: More slides to be added for actual tutorial.

Postscript • Presentation to the Federal Big Data Senior Steering Group, September 27, 2012: • A team composed of NLM (Tom Rindflesch), Noblis (Victor Pollara), Cray (Steve Reinhardt), and Semantic Community (Brand Niemann) is working to make Semantic Medline, which Dr. George Strawn refers to as “the killer semantic web application for government”, better known and more functional for medical research by putting the Semantic Medline RDF database into the new Cray Graph Computer and demonstrating its usefulness. • The background for this project is at: • http://semanticommunity.info/A_NITRD_Dashboard/Semantic_Medline
