Federal Big Data Working Group Meetup

Federal Big Data Working Group Meetup. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup May 20, 2014.

Presentation Transcript


  1. Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup May 20, 2014

  2. Mission Statement • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize like MOOCs (Massive Open Online Courses) being considered by the White House. Co-organizers: Brand Niemann and Kate Goodier

  3. May 6th Meetup: EPA/NASA Climate-Environmental Data Analytics & A Redesigned, Open Data.gov • How was the Meetup? • Thanks for continually providing a forum facilitating discussion and bringing in speakers with diverse experience. On my drive home NPR was fittingly enough talking about big data. • Just lots of good info on big data; I also am a big fan of data.gov, so it's exciting that so much is happening with government open data. Perhaps we'll see even more APIs? • Jeanne Holm: You can find more of the APIs at https://www.data.gov/developers/apis and http://catalog.data.gov/dataset?res_format=api There are about 450 between the two. • Amazing growth in membership: Our 200th member! • Welcome: Inge, Consultant working in the federal/health space. http://www.meetup.com/Federal-Big-Data-Working-Group/events/174975182/

  4. EPA & NASA Climate/Environmental Data Analytics, Dr. Joan Aron, Global Environmental/Climate Change Scientist • Data Analytics Needs Scenario Water Quality: • End User of Big Data: • Perspective of Risk Analysis: • CODATA Integrated Research on Disaster Risk • Continuity of Data: • US EPA Air Data • Linkages of Data: • Conservation International • Linkages of Climate and Water Quality: • US Interagency Chesapeake Bay Program • Answer Three Questions (with sample analytics by Brand Niemann): • How was the data collected? • Where is the data stored? • What are the data results? http://semanticommunity.info/@api/deki/files/29022/JoanAron05062014.pptx

  5. Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov, Dr. Jeanne Holm, Data.gov Evangelist • Background: • Usability Tests Put Brakes on Data.gov Redesign • LinkedIn Discussion • Main Points: • Releasing and using open data is about empowering people to make better decisions • Open data is an ecosystem • Building a federated catalog of national data • Keeping the conversation fresh: Multiple rounds of usability testing found that redesign was needed and now doing monthly builds • A Global Movement has begun to provide transparency and democratization of data • My Note: • See my Tutorial Slides 12-19 http://semanticommunity.info/@api/deki/files/29263/JeanneHolm05062014.pptx

  6. Activities • White Paper for DARPA, NASA, NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics: • See Framework and Questions and Answers • Dan Kaufman, DARPA Director of Innovation, and Paul Cohen, DARPA Big Mechanism Project Director • Drs. Farnam Jahanian (NSF Big Data Publications), Phil Bourne (Data Culture at NIH), and John Holdren (Climate Change Impacts) • Health Datapalooza V, June 1-3: • See next slides • CODATA International Society for Digital Earth (ISDE) Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, June 8-9: • See next slides • Big Data for Government, June 16-17: • Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team • Earth Cube All-Hands Meeting, June 24-26: • Report at July Meetup

  7. Framework for White Paper • Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data" • Example: Federal Big Data Working Group Meetup • Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment • Example: Semantic Community Data Science Knowledge Base (Big Data Science for CODATA) • Mine prominent scientific journals for data policy, data bases, and data results that can be reused. • Example: CODATA Data Science Journal (509 publications by 9 attributes) • Provide data stories and presentation materials for public education and conferences • Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing • Obtain NSF funding for sustained data science for data publications work over a period of years • Example: Critical Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) • Provide a Data Fairport with "Data Publication in Data Browsers" • Example: Semantic Community Spotfire Cloud Library

  8. Framework Questions & Answers • Is this the Barend Mons Nanopub approach to the data publication of cardinal assertions? No, please see the examples in these slides. • What are the goals of the White Paper and NSF Grant Proposal? • White Paper documents the Framework for general public relations and marketing purposes. The NSF Grant Proposal is to obtain long-term funding to sustain this Framework and Mission Statement activity. • In essence we know that NSF wants a community that follows standards to produce data science publications that reside in a knowledge base repository and workforce training that supports STEM and data scientists. • What type of Meetup presentations do we want? • Content that supports the Framework, Mission Statement, and White Paper. But not every presentation does because we leave that to each presenter. All we ask is that they at least answer three fundamental questions in their presentation: • How was the data collected? • Where is the data stored? and • What are the data results? • So the presentations should not be marketing, vendor, or organization promotion.

  9. Kaufman and Cohen: A Data Science Big Mechanism for DARPA My Note: Invited to June 2nd Meetup on Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA

  10. Farnam Jahanian: NSF Big Data Publications Answer: This is how the data was collected. http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story

  11. STM Innovations Seminar U.S. 2014 • International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing • STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come. • Annual US Event: Bright Research, Smart Articles and the new Author Ego-System • Opening Keynotes: Analytics and Metrics • David Smith (Baseball) and Kevin Boyack (Mapping & Analytics of Science Publishing) • Plenary: The Smart Article • Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become? Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference? Will human analysis be eliminated and, if so, up to what point? Where are the new opportunities for publishers? Come and listen to two experts in data mining and actionable articles, both well known from FORCE11. (Larry Hunter and Anita de Waard) http://www.stm-assoc.org/events/stm-innovations-seminar-u-s-2014/

  12. Mined STM 2014 Tweets • Tech trend 1: the machine is the new reader. Highlights from the Future Lab team • Tech trend 2: the return to the author • Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY • Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences • L Hunter: "With enough data you don't need semantic search. You can just use statistics." • L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing • A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/professional-baseball-pitchers-performance-and-its-effect-on-salary/ • Anita de Waard: "Looking for Data: Finding New Science": http://t.co/eok3ma37vO http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story

  13. Analytics and Metrics: Baseball Salaries Answer: This is where the data is stored My Note: All data sets integrated into one spreadsheet. http://semanticommunity.info/@api/deki/files/29262/BaseballSalaries.xlsx?origin=mt-web

  14. Data Science for Baseball Salaries: Spotfire Data Publication Answer: This is where the data is stored and the results. Web Player

  15. Philip Bourne: Changing the Data Culture at NIH Answer: This is how the data was collected. http://semanticommunity.info/Data_Science/Data_Culture_at_the_NIH#Story

  16. Earlier Interactive Visualization of H1N1 Data in Spotfire Answer: This is where the data is stored and the results. Web Player

  17. NIH Data Publication 1: Spotfire Answer: This is where the data is stored and the results. Web Player

  18. Data Science for Health Datapalooza V • Started by Todd Park, US CTO, in 2010. • I have participated in all of them as a Government Data Scientist (2010) and Private Data Scientist (Contest in 2011: Medicare Zombie Hunter) and Data Journalist (2012-2014). • Like Sessions (4-One is Semantic Medline), Activities (Demos and Code-a-Palooza), and Data Lab (Damon Davis) • Used Centers for Medicare & Medicaid Services (CMS) Claims Data without Coding! • The 1.7 GB uncompressed file, with 27 columns and more than 9 million records, was easily downloaded, uncompressed, and imported into Spotfire, resulting in a 381 MB file! • The only problem was that the Web Player display timed out for the two scatterplots of the data relationships.

  19. Data Science for Health Datapalooza V: MindTouch Knowledge Base Answer: Data was collected by Methodology. http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza

  20. Data Science for Health Datapalooza V: Spotfire Data Storage and Results Answer: Data Stored All In-Memory Web Player

  21. Data Science for Health Datapalooza V: Data Storage and Results Answer: Data Dictionary is in Spreadsheet. Answer: Data Results are in the Story. http://semanticommunity.info/Data_Science/Data_Science_for_Health_Datapalooza#Story

  22. CODATA International Workshop on Big Data for International Scientific Programmes • Summary of Data Publications in Data Browsers Products: • Presentation and Tutorial: Big Earth Sciences Data - From Descriptive to Prescriptive Analytics • Meteorite Data Set • Data Science Journal • 509 publications by 9 attributes Data Set • International Journal of Digital Earth • 350 publications by 10 attributes Data Set • Workshops on Extremely Large Databases • Collaboration invited by Michael Stonebraker • Some Highlights in Tutorial for June 2nd Meetup http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA#Story

  23. Data Science for Climate Change: MindTouch Data Publication Answer: How was the data collected. http://semanticommunity.info/Data_Science/Data_Science_for_Climate_Change#Story

  24. Data Science for Climate Change: Excel Data Publication Answer: Where the data is stored. http://semanticommunity.info/@api/deki/files/29340/ClimateChangeImpacts.xlsx

  25. Data Science for Climate Change: Spotfire Data Publication Web Player (in progress)

  26. Agenda • 6:30 p.m. Brand Niemann, Introduction and Continue Data Science Tutorials (Refreshments) • 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group) • 7:10 p.m. Big Data: Forward - Backward, Charles Randall Howard, Adjunct Professor in the Applied IT Department and Sr. Data Scientist at Novetta Solutions • 7:45 p.m. Stories that Persuade, Anita de Waard, VP Research Data Collaborations at Elsevier Research Data Services/University of Utrecht. Also see Looking for Data: Finding New Science and Ten Habits of Highly Effective Data • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

  27. Next Meetups • June 2nd: In Planning: Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism • 6:30 p.m. Welcome and Introduction Slides • 6:35 p.m. Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers and Sarah Soliman, Rand, and IV MOOC Student Project (invited) • 7:00 p.m. Brief Member Introductions • 7:10 p.m. Ontology Summit 2014 Postmortem: Big Data with Semantic Web and Applied Ontology, Brand Niemann See Ontology for Big Data • 7:30 p.m. Two SIRA-based products: Research Assistant™ and Research Librarian™, Chuck Rehberg, Semantic Insights and Kate Goodier, Xcelerate Solutions (limited beta test in process). See A Data Science Big Mechanism for DARPA • 8:30 p.m. Open Discussion • 8:45 p.m. Networking • 9:00 p.m. Depart • June 30th: MIT Big Data Initiative: bigdata@CAIL and the new Intel Science and Technology Center for Big Data, Sam Madden and Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker • July and August: Once a month to be announced • Silver Line Spring Hill Metro Station Opens in July?

  28. May 20th Meetup: Continue Data Science Tutorial • Practical Data Science for Data Scientists: • Reading Assignments: • Chapter 11: Causality • This chapter will explore the topic of causality, and we have two experts in this area as guest contributors, Ori Stitelman and David Madigan. In these cases your mentality or goal is not to optimize for predictive accuracy, but rather to be able to isolate causes. • Chapter 12: Epidemiology • The contributor for this chapter is David Madigan, professor and chair of statistics at Columbia. Madigan has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance, and probabilistic graphical models. • Resources: See 2/25 Specific Data Science Tools and Applications 3 • Team Homework Exercise: • See my work with the KDD Cup data sets where I have updated this to include 2011-2013. • See my Research Notes for Project TYCHO Data for Health. • Form Teams (Same or New), Ask Me Questions, and Prepare to Present One of These Next Week.

  29. Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 6 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

  30. KDD Cups Data Set Inventory and Metadata http://semanticommunity.info/@api/deki/files/27392/DoingDataScience.xlsx
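
Several slides describe a data set by its shape, for example "509 publications by 9 attributes" or "27 columns and more than 9 million records". The presentation obtained those figures in Spotfire without coding; as a rough illustration only, the same inventory check could be sketched in Python for a file exported to CSV. The function name `table_shape` and the sample rows are hypothetical and not taken from the slides, and the sketch assumes the file has a single header row.

```python
import csv
import io
from typing import TextIO, Tuple


def table_shape(handle: TextIO) -> Tuple[int, int]:
    """Return (record_count, column_count) for a CSV stream with a header row.

    Rows are streamed one at a time, so a large file is never held in memory.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        # Empty input: no header, no records.
        return 0, 0
    # Count data rows, skipping completely blank lines.
    records = sum(1 for row in reader if row)
    return records, len(header)


# Hypothetical sample standing in for an exported publication inventory.
sample = io.StringIO("title,year,authors\nPaper A,2012,2\nPaper B,2013,3\n")
records, columns = table_shape(sample)
print(f"{records} records by {columns} attributes")
```

For a file on disk, the same function could be called with `open(path, newline="", encoding="utf-8")` as the handle; the encoding is an assumption about the export.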
