
Introduction to Data Science – INFO 480 – Drexel University’s iSchool

Sean P. Goggins, PhD. April 2, 2013. Week Three.


Presentation Transcript


  1. Introduction to Data Science – INFO 480 – Drexel University’s iSchool Sean P. Goggins, PhD April 2, 2013 Week Three

  2. What is Data Science? • Storytelling • Database Theory – How you organize your data has a big influence on what you can do with it. • Agile Manifesto – Key thing is iterative development; it’s a technology value system. • Spiral Dynamics – What we view as fact and what we desire emerges from the data presented to us. Credit: http://www.datascientists.net/what-is-data-science

  3. Database Theory • Relational Algebra & Set Theory • Thinking in relations helps you to connect disparate data: • What is the connecting field? • What is the cardinality? • Set theory helps you think about summarizing data: • What time period? Weeks? Months? • By person? By group? By geography?
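A minimal sketch of the two ideas on this slide, assuming two small made-up tables: `user_id` plays the connecting field (one user to many posts), and the summary groups by week and by region. All names and values are hypothetical and are not taken from the course data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample tables; every name and value here is illustrative.
users = {1: "north", 2: "south"}             # user_id -> region (one row per user)
posts = [                                     # many posts per user (one-to-many)
    {"user_id": 1, "day": date(2013, 4, 1)},
    {"user_id": 1, "day": date(2013, 4, 9)},
    {"user_id": 2, "day": date(2013, 4, 2)},
    {"user_id": 2, "day": date(2013, 4, 3)},
]

# Join: user_id is the connecting field between the two relations.
joined = [{**p, "region": users[p["user_id"]]} for p in posts]

# Summarize: count posts per (ISO week number, region).
counts = defaultdict(int)
for row in joined:
    week = row["day"].isocalendar()[1]
    counts[(week, row["region"])] += 1

for key in sorted(counts):
    print(key, counts[key])
```

Changing the grouping key (month instead of week, person instead of region) changes the story the same data can tell.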

  4. Agile Manifesto • Individuals and interactions over processes and tools • Working software over comprehensive documentation • Customer collaboration over contract negotiation • Responding to change over following a plan http://www.agilemanifesto.org

  5. Spiral Dynamics New research unveiled at this year’s AERA conference documents a disturbing trend among the nation’s secondary schools: Between 2001 and 2012, high school graduation rates regularly spiked in late May and early June, ballooning from near zero to a staggering average of 78 percent.

  6. What you’ll need for this course • Interest in learning data analysis tools • R • Python • Curiosity • A laptop to bring to class (see me if this is a problem) • Persistence • A GitHub account • Willingness to do weekly homework and participate in online iteration of data products you and your classmates develop • A Dropbox account will be helpful

  7. Discuss Homework • Analysis Questions. Write up a short essay, with tables or graphs if needed, to describe how you would: • Build a network using the scripts from week 1 against the mention connections, or the reply-to connections, in this sample data. What transformations are required? How would you filter the data? Use the actual data to ground your thinking. Feel free to write or modify the R code samples from the first two weeks to experiment. Some of you will be more comfortable doing this; some will be more comfortable addressing the question conceptually. This is OK.
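One possible way to think about the transformation and filtering steps, sketched in Python rather than the R scripts the course uses. The tweets, the `mention_edges` helper, and the choice to drop self-mentions are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tweets as (author, text) pairs. Not the course's sample data.
tweets = [
    ("ann", "@bob did you see the draft lottery plot?"),
    ("bob", "@ann yes, and @cat should look too"),
    ("cat", "talking to myself @cat"),
    ("ann", "@bob thanks"),
]

MENTION = re.compile(r"@(\w+)")

def mention_edges(rows):
    """Count directed author -> mentioned-user edges, dropping self-mentions."""
    edges = Counter()
    for author, text in rows:
        for target in MENTION.findall(text):
            target = target.lower()
            if target != author:           # filter step: ignore self-loops
                edges[(author, target)] += 1
    return edges

edges = mention_edges(tweets)
for (src, dst), weight in sorted(edges.items()):
    print(src, "->", dst, weight)
```

The resulting weighted edge list is the kind of table a network library can read directly; a reply-to network would use the same shape with a different source column.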

  8. Using GitHub for Software Sharing • Creating a GitHub Account • Creating a GitHub Project • Using the GitHub Desktop client • Committing & Syncing • The Pull Request

  9. Anarchist in the Library • Chapter Four – The Music Industry • Take five minutes to prepare a one-minute summary of how you use file sharing and distribution to get music and videos • I will mute the class capture when you share your stories

  10. A personal story of Music

  11. Who is Zooey Deschanel?

  12. To Spotify!

  13. Multiple Variable Distributions

  14. Examples in Python

  15. Two Variables • Where are the data points located, and how far do they spread? What are typical, as well as minimal and maximal, values? • How are the points distributed? Are they spread out evenly or do they cluster in certain areas? • How many points are there? Is this a large data set or a relatively small one? [Figure: Health Expenditures]
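The questions above can be answered numerically before any plot is drawn. A rough sketch, assuming a handful of made-up (x, y) pairs; the `describe` helper is hypothetical and the numbers are not real health data.

```python
import statistics

# Made-up (x, y) pairs for illustration only; not real health data.
points = [(1.0, 2.0), (2.0, 2.5), (3.0, 3.5), (4.0, 3.0), (10.0, 9.0)]

def describe(values):
    """Return count, min, median, max, and sample standard deviation."""
    return {
        "n": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "stdev": statistics.stdev(values),
    }

xs = [x for x, _ in points]
ys = [y for _, y in points]
summary = {"x": describe(xs), "y": describe(ys)}

for name, stats in summary.items():
    print(name, stats)
```

A maximum far from the median, as in the toy x values here, is a hint that the points cluster in one area with a few stragglers.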

  16. Plotting 2 variables • Health Expenditures vs life expectancy • Health Expenditures vs Doctor Visits

  17. Interpolation

  18. Spline

  19. Polynomial Interpolation
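To make the difference between interpolation styles concrete, here is a small sketch comparing piecewise-linear interpolation with a single polynomial through all points (Lagrange form). The knots and both helper functions are assumptions for illustration; spline interpolation is not shown, and this is not the code from the slides.

```python
from bisect import bisect_right

# Hypothetical knots; xs must be strictly increasing.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]   # samples of y = x**2

def linear_interp(x, xs, ys):
    """Piecewise-linear interpolation between neighbouring knots."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def lagrange_interp(x, xs, ys):
    """Single polynomial through all knots (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

print(linear_interp(1.5, xs, ys))    # straight line between (1, 1) and (2, 4)
print(lagrange_interp(1.5, xs, ys))  # polynomial through all four knots
```

Both methods pass through every knot; they differ only in what they claim about the gaps between knots, which is where the story told by interpolated data can change.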

  20. Draft Lottery LOESS Curve

  21. LOESS in R

  22. LOESS & Well Water Testing • This is a modern statistical method that is useful when the relationship between x & y is unknown and complicated • “Locally weighted polynomial regression” • Basic regression • Localized regression • Localized subsets of data • q – smoothing parameter • Bigger q = more smoothing • Assumption: the underlying relationship can be well approximated by a simple model in a small neighborhood • Models False Positives & False Negatives Credit to “Davids Statistics”
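The idea of a localized regression can be sketched in a few lines of plain Python. This is a simplified illustration, not the R `loess` implementation: it assumes a local straight-line fit, tricube weights, and a `q` interpreted as the fraction of points in each neighbourhood. The `loess` function name and the toy data are hypothetical.

```python
def loess(xs, ys, x0, q=0.5):
    """Estimate y at x0 with a locally weighted linear fit.

    q is the fraction of points used in each local neighbourhood.
    """
    n = len(xs)
    k = max(2, min(n, int(round(q * n))))
    # Distance from x0 to every point; the k-th smallest sets the bandwidth.
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1] * (1 + 1e-6) or 1e-12
    # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted least-squares slope around the weighted means.
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]           # exactly linear toy data
print(loess(xs, ys, 4.5, q=0.5))
```

Evaluating `loess` on a grid of `x0` values traces out the smoothed curve; raising `q` widens each neighbourhood, so more points influence every estimate and the curve gets smoother.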

  23. The Data

  24. Simple LOESS

  25. Detects & Non-Detects

  26. First Curve

  27. Second Curve

  28. Smoothing Curve ggplot2 library

  29. Detects & ND’s With Sep Curves & CI’s

  30. Activity • Download the GitHub Project • Get the Python code to run inside of Canopy (Week 4 Folder) • Draft Data • R-Code • Twitter Data – @j_tsar • Looking at various types of interpolation? • How might interpolated data help tell a story? • Now, what other data sets do you want to get?

  31. Motivation Underpants Gnomes With much discourtesy from the US TV Program “South Park”

  32. Motivation Underpants Gnomes

  33. Addressing The Underpants Gnome Postulate

  34. Group Informatics Described • Methodological Approach • Identify Key Information Brokers • Weight Connections Based on Time Distance, Grouped By Topic and informed by analysis of time distance between posts.
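One way the weighting step might look in code, under strong assumptions: the slide does not give a formula, so the 1 / (1 + gap in hours) weight, the rule of linking each poster to the previous poster in the same topic, and the sample posts are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical posts: (topic, author, hours since start). Illustrative only.
posts = [
    ("loess", "ann", 0.0),
    ("loess", "bob", 1.0),
    ("loess", "ann", 4.0),
    ("github", "cat", 2.0),
    ("github", "ann", 2.0),
]

def time_weighted_edges(rows):
    """Link consecutive posters within a topic; closer in time = heavier edge."""
    by_topic = defaultdict(list)
    for topic, author, t in rows:
        by_topic[topic].append((t, author))
    weights = defaultdict(float)
    for thread in by_topic.values():
        thread.sort()
        for (t_prev, a_prev), (t_cur, a_cur) in zip(thread, thread[1:]):
            if a_prev != a_cur:
                # Assumed weighting: 1 / (1 + gap in hours).
                weights[(a_cur, a_prev)] += 1.0 / (1.0 + (t_cur - t_prev))
    return dict(weights)

edges = time_weighted_edges(posts)
for pair in sorted(edges):
    print(pair, round(edges[pair], 3))
```

Summing the weights attached to each person would give one rough indicator of who sits between others in the conversation, which is one possible reading of "key information brokers".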

  35. Week Five – Assignment Two • Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool). • Students will refine and then share their scripts with other students • Included in the assignment is a 500-word explanation of how their script could be improved, optimized and adapted to other data of a similar type. • The “read me” file distributed with the script will explain to another user how to apply the script to the data distributed in assignment one. This will include specific technical specifications.

  36. Upcoming • Readings • Week 5: Data Presentation Tools (We’re a week behind on assignments) • Software Sharing #1 (Share scripts produced in week 3 using an open source software configuration management tool). • Readings and Assignments Due: • Data Visualization Example Presentation • Part Three of “Data Analysis with Open Source Tools”.
