Lifecycle Seminar Series
justus
Presentation Transcript

  1. Lifecycle Seminar Series Welcome to the Community! Live Tweet to #DSSS2

  2. The Lifecycle Series • #1: July 10, The Scientist, The Team and The Purpose • #2: July 31, Organizing and Feeling Out Your Data • Dates and topics not finalized, but roughly: • #3: Data / Analytics Preparation • #4: Modeling, Classification, and Decision-Making • #5: The Data Science Team • #6: Telling The Story: Visualizing Results

  3. We Want Contributors! • Looking for people willing to lead one of the Topics in given seminars • Looking for people who have an interesting anecdote or challenge to offer • Want to try integrating with main speaker or kick off networking session • Particularly interested in experiences/anecdotes for Session II (July 31) : Organizing and Feeling Out Your Data

  4. Data Lifecycle = Where we are. But!...

  5. Data Science Lifecycle • Tonight, focus is on Feeling Out Data • Primarily an early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools (Diagram label: Organizing and Feeling Out Your Data)

  6. Tonight’s Agenda • The Data Scientist Seminar Series • Followup from Seminar 1 • Participation opportunities • Jason Sroka: “Organizing and Feeling Out your Data” • Wrap-up & Announcements • Networking Session – Buy Jason Tequila!

  7. MarketMeSuite – Our Venue Sponsor MarketMeSuite’s Inbox For Social is how small businesses convert leads and market on social media

  8. Approach & Goals • Walk through steps of organizing and feeling out data • Focus on Data Scientist Survey • Use Survey data and anecdotes to touch on Data Science topics • Not going deep, but trying to give a real feel • Tool Discussion • Tableau and Google Refine

  9. Data Setup • We are all getting our data from somewhere • Personal data • Private data • Public data • Need tool(s) to look at it with • Will see Tableau here, many others available • Focus is on feeling out the data, not managing it • Will only mention some data management challenges • Not dealing with Big Data tonight (when we go international…) • These are topics that will be more central to future Meetup Seminars

  10. What I did • Quick scan of source • Excel File • Nulls in Beige • True flags in Green • 84 Data Rows • Import the data • Tableau reads straight from Excel (Diagram: Source(s) → Import → Analytics Tool)

  11. What a Quick Scan Shows • Organization of Raw Data • Nulls in Beige • True flags in Green

  12. Start with the Basics • The first question • How many data? • 85 records imported • Move to things you know/understand • Simple categories (gender, age, ...) • Check assumptions (e.g. more males than females)

  13. Gender • Simple category • Binary • Meaningful to everyone • Data not quite so simple • What is a Null, compared to a Blank?

  14. Message #1: Data is Messy! • Data Scientists have gender issues! • We have a Null and 3 blanks • Back to the source… • Null is a bad record (header?) • Blanks were user option • Clean it up • Don’t re-discover and re-implement • Someone needs to track these! • Null filtered in Tableau • Count now at 84 • Blank relabeled to “N/A” in Excel • Tools Discussion and Seminar 3 will go into Data Cleansing in more detail (Figures: Before Cleaning, After Cleaning)
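The cleanup on this slide was done in Tableau and Excel. As a rough Python analogue only, the sketch below drops Null records and relabels blanks as "N/A". The sample values and the function name `clean_gender` are hypothetical, and `None` stands in for a Tableau Null.

```python
# Hypothetical raw values for the Gender column (not the real survey data).
# None stands in for a Null; "" stands in for a blank answer.
raw_gender = ["Male", "Female", "", None, "Male", "", "Female", "Male", ""]


def clean_gender(values):
    """Drop Null records and relabel blank answers as "N/A"."""
    cleaned = []
    for value in values:
        if value is None:
            # Null: treated as a bad record (for example a stray header row).
            continue
        text = value.strip()
        # Blank: a valid "no answer" choice, so keep the record but label it.
        cleaned.append(text if text else "N/A")
    return cleaned


cleaned_gender = clean_gender(raw_gender)
print(len(raw_gender), "records before,", len(cleaned_gender), "after")
print(cleaned_gender)
```

Keeping the rule in one function is one way to avoid re-discovering and re-implementing the same fix each time the data is reloaded.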

  15. Handedness • Didn’t we just fix the NULL thing? • Yes – this is a new Null • Excel had a cut-and-paste error! • Formula wasn’t used in column – values were hard-coded • Fixed formula, copied throughout (Figures: Before Cleaning, After Cleaning)

  16. Data Scientist Ethic • Don’t ignore the warts! • Most warts are meaningless • Of those that aren’t, most are easy to figure out • Of those that aren’t, most are at least easy to fix once you figure it out • Of those that aren’t, most times you can get someone else to help you fix it • Of those that aren’t, you’ll usually improve your implementation skills when you resolve it • Sometimes this line of work sucks • The ones that aren’t help you understand the data • In this case, a problem with the data process • In other cases, interesting quirks and potential insights!

  17. Age • Survey question: Birth Year • Seeing old and new issues • Blanks • Number ranges • Survey did not constrain to YYYY

  18. Age • Survey question: Birth Year • Seeing old and new issues • Blanks • Number ranges • Survey did not constrain to YYYY • Fixed these three entries

  19. Age • Survey question: Birth Year • Seeing old and new issues • Nulls • Turn out to be blanks – valid option in Survey • Number ranges • Survey did not constrain to YYYY • Fixed these three entries
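Since the survey did not constrain Birth Year to YYYY, a small check can separate blanks, clean years, and entries that need a manual fix. This is a minimal sketch with made-up entries. The category names and the function `classify_birth_year` are assumptions, not part of the talk.

```python
import re

# Exactly four digits counts as a well-formed year in this sketch.
YEAR_PATTERN = re.compile(r"\d{4}")


def classify_birth_year(entry):
    """Return "blank", "valid", or "needs_review" for one raw survey entry."""
    text = entry.strip()
    if text == "":
        return "blank"          # valid option in the survey, not an error
    if YEAR_PATTERN.fullmatch(text):
        return "valid"
    return "needs_review"       # ranges, two-digit years, free text, ...


# Hypothetical entries illustrating the issue types named on the slide.
entries = ["1975", "", "1980-1985", "82", "1990 "]
for entry in entries:
    print(repr(entry), "->", classify_birth_year(entry))
```

The "needs_review" entries are the ones to fix by hand, as was done for the three entries mentioned on the slide.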

  20. Age, as Age • Birth Year isn’t our interest, Age is • Transform your data to suit your needs • Be as direct between the data and the context as you can (Plots: Age, Birth Year, Decade)
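The transformation from Birth Year to Age can be sketched as below. The reference year 2012 is a placeholder assumption (use the year the survey was taken), the age is approximate because birthdays are ignored, and reading "Decade" as an age bucket such as "30s" is also an assumption.

```python
# Placeholder assumption: replace with the year the survey was actually taken.
REFERENCE_YEAR = 2012


def age_from_birth_year(birth_year, reference_year=REFERENCE_YEAR):
    """Approximate age in years (ignores whether the birthday has passed)."""
    return reference_year - birth_year


def age_decade(age):
    """Bucket an age into a decade label, e.g. 32 -> "30s"."""
    return f"{(age // 10) * 10}s"


# Hypothetical birth years, not the real survey responses.
for birth_year in (1958, 1975, 1980, 1989):
    age = age_from_birth_year(birth_year)
    print(birth_year, "-> age", age, "->", age_decade(age))
```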

  21. The Art of Data Science • Message #2: Connect the Data to the Context • Transform the data to suit your needs • Easy investigation/understanding • Analytics goals • Operational goals • This is where Telling the Story feeds back • Effective plots help the data tell their story to you • Try things out!

  22. Favorite Color • Here, I’ve assigned colors near the named color • Sorting by most prevalent to least • Blank isn’t adding anything • Removing

  23. Favorite Color • Now, let’s add Gender • Okay – I see differences! • Something to form an impression from • Something to come back to • Blue is now the Official Data Scientist color!

  24. Check Assumptions • Assumption 1: More Males than Females • Assumption 2: 10-15% Lefties • Underestimate! • Assumption 3: Different color preferences by Gender
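The three assumptions can each be checked with a simple count. The sketch below uses a handful of invented records, so the numbers it prints say nothing about the real survey. The field names are assumptions.

```python
from collections import Counter

# Invented records for illustration only (not the survey results).
records = [
    {"gender": "Male", "handedness": "Right", "color": "Blue"},
    {"gender": "Male", "handedness": "Left", "color": "Green"},
    {"gender": "Female", "handedness": "Right", "color": "Blue"},
    {"gender": "Male", "handedness": "Right", "color": "Blue"},
    {"gender": "Female", "handedness": "Left", "color": "Purple"},
    {"gender": "Male", "handedness": "Right", "color": "Red"},
]

# Assumption 1: more males than females.
gender_counts = Counter(r["gender"] for r in records)

# Assumption 2: share of left-handed respondents.
left_share = sum(r["handedness"] == "Left" for r in records) / len(records)

# Assumption 3: color preference differs by gender (a simple cross-tab).
color_by_gender = Counter((r["gender"], r["color"]) for r in records)

print(gender_counts)
print(f"left-handed share: {left_share:.0%}")
for (gender, color), count in sorted(color_by_gender.items()):
    print(gender, color, count)
```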

  25. Checking Assumptions… • Familiarizes You with the Data • Identifies data issues • Tests your assumptions • Gives you Confidence in the Data… • Confidence in the initial source • Confidence in Extraction, Transformation, Load • …and Your Assumptions • Confidence in your Intuition where it was right • Updates to your Intuition where it was off

  26. Building a Data Model • Data comes in different types • Categorical • Gender, Handedness, Favorite Color, any true/false • Scalar • Age, height, weight • Label/identifier • … • These data types often associate with the purpose to which it will be applied • Categories are dimensions along which we might divide the records • Measurements (Scalars) are facts about specific instances of what we’re modeling • A good data model allows for rapid analytics • Modular construction of sets of dimensions and measurements • Automated investigation of cross-relationships
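One way to read the dimension/measurement split is as a generic "group by a category, summarize a scalar" operation that can be run over every pairing. The following is a minimal sketch under that reading. The field names, the sample rows, and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Categories we might divide the records by, and scalars we might summarize.
DIMENSIONS = ("gender", "handedness")
MEASURES = ("age", "duration_s")

# Hypothetical rows, not the real survey data.
rows = [
    {"gender": "Male", "handedness": "Right", "age": 30, "duration_s": 100},
    {"gender": "Male", "handedness": "Left", "age": 40, "duration_s": 200},
    {"gender": "Female", "handedness": "Right", "age": 50, "duration_s": 300},
]


def summarize(data, dimension, measure):
    """Mean of one measure for each value of one dimension."""
    groups = defaultdict(list)
    for row in data:
        groups[row[dimension]].append(row[measure])
    return {key: mean(values) for key, values in groups.items()}


# Automated pass over every dimension/measure pairing.
for dimension in DIMENSIONS:
    for measure in MEASURES:
        print(dimension, measure, summarize(rows, dimension, measure))
```

Adding a new dimension or measure then only means extending one of the two tuples.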

  27. Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges… (Figure caption: Same Data, Different Lenses)
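The derived field and the outlier filter can be sketched as follows. The timestamp format, the sample timestamps, and the 600-second cutoff are assumptions chosen for illustration. The talk does not state which cutoff was used.

```python
from datetime import datetime

# Assumed timestamp format; the real export may differ.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical cutoff for "extremely high" durations, in seconds.
MAX_DURATION_S = 600


def duration_seconds(start, end):
    """Survey duration = End Time - Start Time, in seconds."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


# Made-up (start, end) pairs: two quick responses and one left open for hours.
timestamps = [
    ("2012-07-01 10:00:00", "2012-07-01 10:01:30"),
    ("2012-07-01 10:05:00", "2012-07-01 10:07:00"),
    ("2012-07-01 11:00:00", "2012-07-01 14:00:00"),
]

durations = [duration_seconds(start, end) for start, end in timestamps]
kept = [d for d in durations if d <= MAX_DURATION_S]

print("all durations:", durations)
print("after filtering outliers:", kept)
```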

  28. Playing with Plots 1: Beware Bad Binners! • How you choose bins and plot a histogram can impact your interpretation (Figure caption: Same Data, Different Axes. Panel notes: very flat, one entry per bin; still flat, but the voids in the X-axis have meaning)
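To see the effect of bin width without a plotting tool, the same values can be counted into bins of different sizes. This is a minimal sketch with invented durations. The `histogram` helper is hypothetical, and the widths are a subset of those shown on the following slides.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Invented survey durations in seconds (not the real data).
durations = [62, 75, 81, 88, 93, 97, 104, 110, 118, 131, 145]

for width in (1, 10, 30, 60):
    print(f"{width:>2}-second bins:", histogram(durations, width))
```

With 1-second bins every bar has height one, while 60-second bins collapse almost everything into a single bar. Neither extreme shows the shape that the middle widths reveal.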

  29. Survey Duration: 1 Second Bins

  30. Survey Duration: 3 Second Bins

  31. Survey Duration: 5 Second Bins

  32. Survey Duration: 10 Second Bins

  33. Survey Duration: 15 Second Bins

  34. Survey Duration: 20 Second Bins

  35. Survey Duration: 30 Second Bins

  36. Survey Duration: 45 Second Bins

  37. Survey Duration: 60 Second Bins

  38. Survey Duration: 1,000 Second Bins

  39. The Practice of Data Science • I just tricked you into looking at a bunch of data! • That is Data Science in action • It is a skill like many others • We all have some ability • We get better with practice • It’s pattern recognition (Figure: grid of histograms by bin size in seconds: 1, 3, 5, 10, 15, 20, 30, 45, 60, 1,000)

  40. The Science of Data • Distributions have meaning • Flat: random, fixed • Normal distributions: repeated processes • Exponential: cumulative processes • Over time, we interpret data in terms of known distributions • Survey Duration: Gaussian + Exponential (Images: Wikipedia.org)
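The "Gaussian + Exponential" reading of Survey Duration can be illustrated with a small simulation: most respondents finish in one sitting (roughly normal), while a few are interrupted and pick up an extra, exponentially distributed delay. Every parameter below (mean, spread, interrupted share, tail scale) is an invented assumption, not a value fitted to the survey.

```python
import random
from statistics import mean, median


def simulate_durations(n, seed=0, interrupted_share=0.1):
    """Draw n durations: a normal core plus an exponential tail for a few."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Core: people who start, complete, end (assumed mean 120 s, sd 30 s).
        seconds = max(1.0, rng.gauss(120, 30))
        # Tail: people who start, stop, return, end (assumed mean extra 1800 s).
        if rng.random() < interrupted_share:
            seconds += rng.expovariate(1 / 1800)
        samples.append(seconds)
    return samples


samples = simulate_durations(1000)
print("median:", round(median(samples)), "s")
print("mean:  ", round(mean(samples)), "s")
```

The sparse tail pulls the mean well above the median, which is the same effect that made outliers hide the bulk of the data in the unfiltered plot.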

  41. Survey Duration • Another processed ‘field’ • End Time – Start Time • Plotting it all: sparse info • A lot of short times • A few long times • Outliers are hiding the data! • After filtering out extremely high values, a different picture emerges • Normal Distribution plus sparse tail • People who start, complete, end • People who start, stop, return, <repeat>, end (Figure caption: Same Data, Different Lenses)

  42. Tools • I used Tableau here • A lot can be done directly in Excel • Google Refine looks impressive (video: http://www.youtube.com/watch?v=B70J_H_zAWM&feature=player_embedded) • Highlights cleansing issues, supports resolution (Diagram: Source(s) → Import → Analytics Tool)

  43. Data Science Lifecycle • Tonight, focus is on Feeling Out Data • Primarily an early-stage skill, but a part of all stages • Something everyone can do, increasingly so with modern tools (Diagram label: Organizing and Feeling Out Your Data)

  44. Closing Thoughts • Message #1: Data is Messy • Don’t ignore the warts • Message #2: Connect the Data to the Context • Translate data so it is expressed in your terms • Message #3: Check Your Assumptions • Explore the data for insights • Message #4: Develop Your Intuition • Look at a lot of data in a lot of ways

  45. Who Rocks? • A HUGE thanks to Peggy Sue for executing the survey and organizing the results! • Super thanks to Tammy for live tweeting and sponsoring us at CIC!

  46. The Lifecycle Series Quick Note: • #6: Telling The Story: Visualizing Results • Speaker: • Hjalmar Gislason • CEO of DataMarket.com • Conference Speaker • Currently writing a book for O’Reilly called Effective Data Visualization

  47. Connect with us!