Federal Big Data Working Group Meetup

Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community • http://semanticommunity.info/ • http://www.meetup.com/Federal-Big-Data-Working-Group/ • http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup • February 4, 2014

Mission Statement • Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

Co-organizers • Brand Niemann and Kate Goodier • Kate Goodier, Host: Xcelerate Solutions offices in Tysons Corner • Capacity about 50, with Skype and WiFi available. The Silver Line Spring Hill Metro Stop (planned to open in March 2014) is across the street (Route 7 and Spring Hill Road). • Directions to the building are easy and there is open underground parking • See photo on Web Site from the Xcelerate Solutions office looking south to the Spring Hill Road Silver Line Metro Station • Logistics: • Refreshments, restrooms, etc.

Suggested Format • 6:30 p.m. Tutorials (I will start with the proposed GMU course, and hope that others will offer to do tutorials as well) and Refreshments • 7:00 p.m. Introductions and Announcements (10 seconds per individual, depending on the size of the group) • Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group • 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?) • What Went Wrong with the Obamacare Web Site, and How Can It Be Fixed? and Why the First Rollout of HealthCare.gov Crashed, an Architectural Assessment, Eric Kavanagh, Inside Analysis, and Geoffrey Malafsky, PSIKORS Institute; Healthcare.gov Data Science, Brand Niemann, Semantic Community; and Healthcare.gov Prototype Video, Kees van Mansom, Be Informed • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (we need to clear out of the space)

Next Meetups • Third Meetup: February 18, 6:30 p.m. • Continue Data Science Tutorial: Modus Operandi Semantic Knowledge Base • Wave All-Source Semantic Fusion Engine: Eric Little, Modus Operandi; and Department of Defense Metadata Engineers • Fourth Meetup: March 4, 6:30 p.m. • Hosted at NIH or NSF • Welcome by new NIH Director for Data Science, Dr. Phil Bourne (invited) and NIH Program Director, Dr. Peter Lyster (invited) • Brief demo of NIH Semantic Medline/YarcData by Tom Rindflesch and Aaron Bossett (accepted) • Brief demo of Watson/IBM by Frank Stein (accepted) and Chris Welty (invited) • Presentation by Drs. George Strawn and Barend Mons on A Data Fairport and Semantic Scientific Publishing (accepted) • Fifth Meetup: March 18, 6:30 p.m. • Continue Data Science Tutorial: Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases • Bigdata SYSTAP, Bryan Thompson, SYSTAP • Sixth Meetup: April 1, 6:30 p.m., Seventh Meetup: April 15, 6:30 p.m., Eighth Meetup: May 4, 6:30 p.m. and Ninth Meetup: May 18, 6:30 p.m. • 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)

Data Science History at GMU • Past (1992-2003): • First University to offer Ph.D. in Computational Sciences and Informatics (aka Data Science) • I took classes in this program to work on a second Ph.D.  • Graduated about 20 Ph.D.s • Present (2003-2013): • Dr. Kirk Borne, NASA Program Manager, Joined GMU to Teach Astrophysics and Computational Science and is Undergraduate Program Advisor for Data Science. Co-creator of Data Science BS Degree Program • He Became a Top Big Data Influencer on Twitter and Professor of Astrophysics and Computational Science at GMU. Dean's Impact Award for Faculty Excellence 2013 • The Department of Computational and Data Sciences was formed in 2006. The School of Physics, Astronomy, and Computational Sciences (SPACS) was formed in 2011 by combining the Department of Physics and Astronomy with the Department of Computational and Data Sciences • Future (2013-2025): • Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12 • Large Synoptic Survey Telescope (LSST) is Built and Delivers Massive Data! • Need New Algorithms for Massive Data

George Mason University Updates Master’s Program for Data Science • With rising demand for expertise in business-oriented analytics skills, George Mason’s School of Physics, Astronomy and Computational Sciences is preparing to join a raft of other universities by updating its master’s degree programs to include three new areas of emphasis: data science, modeling and simulations, and transportation safety. • “The world is changing and these masters of big data analytics programs are sprouting up everywhere,” Borne said. “It’s clear the workforce need for data science is not for Ph.D.’s. There will be jobs for the Ph.D.’s, but the business world is not demanding Ph.D.’s.” Master’s students can do the work required to earn a degree, without the extra hurdles presented by a doctoral program and with the promise of quality job opportunities, he said. • The school is now seeking a professor of data science with Hadoop programming experience. “We know that is essential,” Borne said. http://data-informed.com/george-mason-university-updates-masters-program-data-science/

Practical Data Science for Data Scientists Class 2 Providing On-Line Class With Private Tutoring http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Resources • Required Textbook • Doing Data Science: • http://shop.oreilly.com/product/0636920028529.do • Free Sampler: • http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF) • Optional Supplemental Reading: • Data Science Starter Kit: • http://shop.oreilly.com/category/get/data-science-kit.do • DC Data Community: • http://datacommunitydc.org/blog/about/ • DC Data Community Calendar: • http://datacommunitydc.org/blog/calendar/ • Technology Requirements • Internet and Free Tools like Spotfire Cloud: • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest • NodeXL: • http://nodexl.codeplex.com/

Class 2 • 1/28 Finding, Cleaning, Analyzing, and Visualizing Data • Discuss Reading: Chapters 3 and 4, Present and Discuss Team Homework Exercise, Hands-on Class Exercise, and Team Homework Exercise. • My Resources: • Spotfire Cloud Library • Hands-on Class Exercise: • Exercise: Basic Machine Learning Algorithms: Continue with the NYC (Manhattan) Housing dataset you worked with in the preceding chapter: How to Find Recent Sales Data for New York City Real Estate and Rolling Sales Data

Discuss Reading • Chapter 3: • An algorithm is a procedure or set of steps or rules to accomplish a task. Algorithms are one of the fundamental concepts in, or building blocks of, computer science: the basis of the design of elegant and efficient code, data preparation and processing, and software engineering. • Machine learning algorithms are largely used to predict, classify, or cluster. Three Basic Algorithms: linear regression, k-nearest neighbors (k-NN), and k-means. • Chapter 4: • Naive Bayes is another classification method at our disposal that scales well and has nice intuitive appeal. • The example in this chapter where the raw data is text is just the tip of the iceberg of a whole field of research in computer science called natural language processing (NLP) which dates back to the 1950s.
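As a rough illustration of one of the three basic algorithms named above, here is a minimal k-nearest neighbors (k-NN) sketch in plain Python. It is not the textbook's code. The toy housing rows, the feature choice, and the names `euclidean` and `knn_predict` are assumptions made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k closest training rows.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy rows: (gross square feet in thousands, sale price in $M).
train = [
    ((0.8, 0.5), "condo"),
    ((1.0, 0.7), "condo"),
    ((0.9, 0.6), "condo"),
    ((3.0, 2.5), "townhouse"),
    ((3.5, 3.0), "townhouse"),
    ((4.0, 3.2), "townhouse"),
]

print(knn_predict(train, (1.1, 0.8), k=3))
print(knn_predict(train, (3.2, 2.7), k=3))
```

In practice the features would usually be rescaled first, since k-NN is sensitive to the units of each column.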

Present and Discuss Team Homework Exercise • How Does RealDirect Make Money? • Doug Perlson, the CEO of RealDirect, has a goal to use all the data he can access about real estate to improve the way people sell and buy houses. • RealDirect is working on real-time feeds on things like when people start searching for a home, what the initial offer is, the time between offer and close, and how people search for a home online. • Realdirect.com is best thought of as a platform for buyers and sellers to manage their sale or purchase process. • RealDirect makes money by selling subscriptions to sellers to access the selling tools and by offering RealDirect’s agents at a reduced commission, which it hopes will increase volume.

RealDirect Data Strategy • You have been hired as chief data scientist at realdirect.com, and report directly to the CEO. • The company is looking to you to come up with a data strategy. • 1. Explore its existing website and think about how analysis of RealDirect user-behavior data could be used to inform decision-making and product development. • 2. Get some auxiliary data to help gain intuition about this market. • 3. Summarize your findings in a brief report aimed at the CEO. • 4. Have a set of communication strategies for getting to the information you need about the data. • 5. Data scientists are not “domain experts” in real estate or online businesses, so learn the domain's vocabulary. • 6. Think about whether there is a set of best practices you would recommend with respect to developing a data strategy for an online business, or in your own domain.

Hands-on Class Exercise • Exercise: Basic Machine Learning Algorithms: • Continue with the NYC (Manhattan) Housing dataset you worked with in the preceding chapter: • How to Find Recent Sales Data for New York City Real Estate and Rolling Sales Data • My Note: The data now covers November 2012-November 2013: • Excel: rollingsales_bronx, rollingsales_brooklyn, rollingsales_manhattan, rollingsales_queens, and rollingsales_statenisland • See Spotfire User's Guide for Data Science, Insert Rows, to merge these five data sets into a single table of 90,328 rows. • This can be done for the NYT 31 CSV files as well! • “The Best Way to Get BIG DATA is by Starting Small”
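For readers without Spotfire, the row-append step can be approximated in plain Python. The sketch below is only an analogy to Insert Rows, not Spotfire's implementation. The two tiny inline tables, their column names and values, and the helper names `read_table` and `insert_rows` are hypothetical stand-ins for the five borough files.

```python
import csv
import io


def read_table(text):
    """Parse CSV text into a list of row dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))


def insert_rows(tables):
    """Stack the rows of several tables into one.

    Columns are the union of all headers in first-seen order; a row that
    lacks a column gets an empty string for it.
    """
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    merged = [
        {name: row.get(name, "") for name in columns}
        for table in tables
        for row in table
    ]
    return columns, merged


# Hypothetical miniature stand-ins for two of the borough files.
bronx = "BOROUGH,ADDRESS,SALE PRICE\n2,100 EXAMPLE AVE,350000\n"
manhattan = (
    "BOROUGH,ADDRESS,SALE PRICE,GROSS SQUARE FEET\n"
    "1,1 SAMPLE ST,1200000,900\n"
    "1,2 SAMPLE ST,800000,\n"
)

columns, merged = insert_rows([read_table(bronx), read_table(manhattan)])
print(len(merged), columns)
```

A quick sanity check after any such merge is that the merged row count equals the sum of the input row counts, which is how the 90,328 figure above can be verified.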

TIBCO Spotfire 6 for Data Science Insert Rows http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science

Answer the Questions About RealDirect Case Study • Where did we find the data? • Online, most recent release • Where did we store the data? • Excel spreadsheets • What did we find when we analyzed the data? • See Spotfire dashboard • What is our data story and product? • See Spotfire dashboard and TIBCO Spotfire 6 for Data Science

Where did we find the data? Rolling Sales Update

Where did we store the data? The Data Ecosystem! http://semanticommunity.info/@api/deki/files/27392/DoingDataScience.xlsx

What did we find when we analyzed the data?

What is our data story and product? • Data Ecosystem: • All the Chapters 2-5 Data Sets (7 MB) • Separate Spotfire Files for: • DoingDataScienceChapters2NYTAll-Spotfire (94 MB) • KDDCup19972013-Spotfire (8) (7 GB) • YahooHistoricStockPrices01012014 (0.4 MB) • OSIM2-Spotfire (298 MB) • Individual Tabs: • Chapter 2 EDA NYT Clickstream • One (simulated) day’s worth of ads shown and clicks recorded on the New York Times home page in May 2012 • Chapter 2 EDA NYC (Manhattan) Housing • The most expensive sale ($1.3 B) had no Gross Square Feet! • Chapter 2 EDA NYC Housing Merged • Merged these five data sets into a single table of 90,328 rows. See TIBCO Spotfire 6 for Data Science (next slide) • Chapter 5 Logistic Regression Media 6 Degrees • There are 16 numeric columns and 6 binary (0,1) columns. • Data Relationships • The Data Relationships tool is used for investigating the relationships between different column pairs.
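The idea of scoring every column pair can be sketched outside Spotfire as well. The example below uses Pearson correlation as a stand-in measure. This is an assumption for illustration and makes no claim about how the Data Relationships tool computes its results. The table values and the names `pearson` and `rank_column_pairs` are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_column_pairs(table):
    """Score every pair of columns and sort by strength of correlation."""
    scores = [
        (a, b, pearson(table[a], table[b]))
        for a, b in combinations(table, 2)
    ]
    return sorted(scores, key=lambda s: abs(s[2]), reverse=True)


# Hypothetical numeric columns; sale_price is exactly half of gross_sqft.
table = {
    "gross_sqft": [800, 1000, 1500, 2000, 3000],
    "sale_price": [400, 500, 750, 1000, 1500],
    "year_built": [1990, 1950, 2005, 1930, 1975],
}

for a, b, r in rank_column_pairs(table):
    print(f"{a} vs {b}: r = {r:.3f}")
```

Ranking pairs this way is a quick screen for which columns deserve a closer look in a scatter plot; it only captures linear relationships between numeric columns.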

TIBCO Spotfire 6 for Data Science http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Story

RealDirect: Rethinking Real Estate http://www.realdirect.com/

Team Homework Exercise • Select One, But Please Present Both: • Jake’s Exercise: Naive Bayes for Article Classification: NYT Data Set (31 CSV files, 151 MB) Already Used • A Spam Filter for Individual Words: To do this yourself, go online and download the Enron emails • The cleansed EDRM Enron data set is 18 GB in total. The data set has been divided into 131 smaller ZIP files for ease of download. • Also see: Mashing Up Structured Data and Unstructured Content: Use free text search to explore and analyze the unstructured content in Enron emails. I downloaded the Spotfire file but could not get it to open. • Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week
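As a starting point for the single-word spam filter, Bayes' rule gives P(spam | word) = P(word | spam) P(spam) / P(word). The sketch below applies it to a made-up list of ten short messages. The messages, labels, and the function name `p_spam_given_word` are hypothetical; the real exercise would use the Enron emails.

```python
def p_spam_given_word(word, emails):
    """Estimate P(spam | word appears) with Bayes' rule.

    `emails` is a list of (text, is_spam) pairs and must contain at least
    one spam and one non-spam message. Returns None if the word never
    appears in any message.
    """
    spam = [set(text.lower().split()) for text, is_spam in emails if is_spam]
    ham = [set(text.lower().split()) for text, is_spam in emails if not is_spam]
    if not spam or not ham:
        raise ValueError("need at least one spam and one non-spam email")

    p_spam = len(spam) / len(emails)
    p_word_given_spam = sum(word in words for words in spam) / len(spam)
    p_word_given_ham = sum(word in words for words in ham) / len(ham)
    # Law of total probability for the denominator P(word).
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    if p_word == 0:
        return None
    return p_word_given_spam * p_spam / p_word


# Hypothetical labeled messages: True means spam.
emails = [
    ("win money now", True),
    ("free money offer", True),
    ("send money fast", True),
    ("free prize offer", True),
    ("meeting at noon", False),
    ("project meeting notes", False),
    ("lunch money owed", False),
    ("quarterly report attached", False),
    ("see you tomorrow", False),
    ("call me later", False),
]

print(round(p_spam_given_word("money", emails), 2))
print(p_spam_given_word("meeting", emails))
```

Here "money" appears in three of the four spam messages and one of the six non-spam messages, so the estimate is 3 / (3 + 1). A full Naive Bayes filter would combine many words and smooth the counts so that unseen words do not force a probability of zero.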

Preview of What You Are Going To Hear • Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group • What Went Wrong with the Obamacare Web Site, and How Can It Be Fixed? and Why the First Rollout of HealthCare.gov Crashed, an Architectural Assessment, Eric Kavanagh, Inside Analysis, and Geoffrey Malafsky, PSIKORS Institute; • Healthcare.gov Data Science, Brand Niemann, Semantic Community; and Healthcare.gov Prototype Video, Kees van Mansom, Be Informed • Update: Accenture to take over for CGI Federal as healthcare.gov lead
