Analyzing data with python

Sarah Guido (@sarah_guido), Reonomy, OSCON 2014. About me: data scientist at Reonomy, University of Michigan graduate, NYC Python organizer, PyGotham organizer. About this talk: a bird’s-eye overview, not a comprehensive explanation of these tools.


Presentation Transcript


  1. Analyzing data with python Sarah Guido @sarah_guido Reonomy OSCON 2014

  2. About me • Data scientist at Reonomy • University of Michigan graduate • NYC Python organizer • PyGotham organizer

  3. About this talk • Bird’s-eye overview: not comprehensive explanation of these tools! • Take data from start-to-finish • Preprocessing: Pandas • Analysis: scikit-learn • Analysis: nltk • Data pipeline: MRjob • Visualization: matplotlib • What next?

  4. Why python? • So many tools • Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability • Community support • “Easy” language to learn • Both a scripting and production-ready language

  5. From point A to point…x? • How to find the best tool(s)? • The 90/10 rule • Simple is better than complex

  6. Why I chose these tools • Available resources • Documentation, tutorials, books, videos • Ease of use (with a grain of salt) • Community support and continuous development • Widely used

  7. Preprocessing • The importance of data preprocessing • AKA wrangling, munging, manipulating, and so on • Preprocessing is also getting to know your data • Missing values? Categorical/continuous? Distribution?

  8. Pandas • Data analysis and modeling • Similar to R and Excel • Easy-to-use data structures • DataFrame • Data wrangling tools • Merging, pivoting, etc

  9. Pandas • Keep everything in Python • Community support/resources • Use for preprocessing • File I/O, cleaning, manipulation, etc • Combinable with other modules • NumPy, SciPy, statsmodels, matplotlib

  10. Pandas • File I/O
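
The code from this slide did not survive in the transcript. As a minimal sketch of Pandas file I/O, the example below reads CSV text into a DataFrame and writes it back out. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "data.csv".
raw_csv = io.StringIO(
    "name,borough,price\n"
    "A,Brooklyn,100\n"
    "B,Queens,\n"
    "C,Brooklyn,250\n"
)

# Read the CSV into a DataFrame; the empty price becomes a missing value (NaN).
df = pd.read_csv(raw_csv)

# Serialize back to CSV text (pass a file path instead to save to disk).
csv_text = df.to_csv(index=False)

print(df.head())
```

`pd.read_csv` also accepts a file path or URL, and Pandas has similar readers for other formats such as Excel and JSON.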

  11. Pandas • Finding missing values
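
The slide's original code is not in the transcript. One common way to find missing values, sketched here on a small made-up DataFrame, is to combine `isnull()` with `sum()` or `any()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", None, "Bronx"],
    "price": [100.0, np.nan, 250.0, np.nan],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Select the rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```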

  12. Pandas • Removing missing values
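
The slide's code is missing from the transcript, so the following is only a sketch of the usual options, again on made-up data: drop every row with a gap, drop rows only when a specific column is missing, or (as an alternative to removal) fill the gap with a value such as the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", None, "Bronx"],
    "price": [100.0, np.nan, 250.0, np.nan],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when "price" is missing.
priced_rows = df.dropna(subset=["price"])

# Alternative to removal: fill missing prices with the column mean.
filled = df.fillna({"price": df["price"].mean()})
```

Which option is appropriate depends on how much data is missing and why. Dropping rows is simplest but can discard a lot of otherwise usable data.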

  13. Pandas • Pivoting
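
The pivoting example itself is not preserved. A minimal sketch with `pivot_table`, using hypothetical sales data, turns one row per (borough, year) into a table with boroughs as rows and years as columns.

```python
import pandas as pd

# Hypothetical long-format data: one row per borough and year.
sales = pd.DataFrame({
    "borough": ["Brooklyn", "Brooklyn", "Queens", "Queens"],
    "year": [2013, 2014, 2013, 2014],
    "price": [100, 120, 80, 90],
})

# Rows become boroughs, columns become years, cells hold the mean price.
table = sales.pivot_table(
    index="borough", columns="year", values="price", aggfunc="mean"
)

print(table)
```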

  14. Pandas • Other things • Statistical methods • Merge/join like SQL • Time series • Has some visualization functionality

  15. Machine Learning • Application of algorithms that learn from examples • Representation and generalization • Useful in everyday life • Especially useful in data analysis

  16. Machine learning • Supervised learning • Classification and regression • Unsupervised learning • Clustering and dimensionality reduction

  17. Scikit-learn • Machine learning module • Open-source • Built-in datasets • Good resources for learning

  18. Scikit-learn • Scikit-learn: your data has to be numeric • Here’s what one observation/label looks like:
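
The observation shown on this slide is not in the transcript. As a stand-in, here is one observation and its label from the iris dataset that ships with scikit-learn. Using iris is an assumption; the talk may have used different data.

```python
from sklearn.datasets import load_iris

# Built-in dataset: 150 flowers, 4 numeric measurements each.
iris = load_iris()

# One observation is a row of numbers; its label is an integer class index.
first_observation = iris.data[0]
first_label = iris.target[0]

print(iris.feature_names)
print(first_observation, first_label, iris.target_names[first_label])
```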

  19. Scikit-learn • Transform categorical values/labels
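
The slide's code is missing. One way to turn string labels into numbers, sketched here with scikit-learn's `LabelEncoder` and made-up labels, may or may not be the approach the talk used.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels.
labels = ["cat", "dog", "cat", "bird"]

encoder = LabelEncoder()

# Each distinct label is mapped to an integer, in sorted order of the labels.
encoded = encoder.fit_transform(labels)

# The mapping is reversible.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_), list(encoded))
```

`LabelEncoder` is intended for target labels. For categorical input features, one-hot encoding (for example with `pandas.get_dummies`) avoids implying an order between categories.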

  20. Scikit-learn • Classification

  21. Scikit-learn • Classification
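
The classification code from these two slides is not preserved. The sketch below shows the general fit/predict pattern on the built-in iris dataset with a k-nearest-neighbors classifier. Both the dataset and the choice of classifier are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 30% of the data to evaluate the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# Fit the classifier on the training set.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict labels for the held-out set and measure accuracy.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)

print(accuracy)
```

Most scikit-learn estimators share this `fit` / `predict` / `score` interface, so swapping in a different classifier usually changes only the line that constructs `clf`.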

  22. Scikit-learn • Other things • Very comprehensive set of machine learning algorithms • Preprocessing tools • Methods for testing the accuracy of your model

  23. Natural Language Processing • Concerned with interactions between computers and human languages • Derive meaning from text • Many NLP algorithms are based on machine learning

  24. nltk • Natural Language ToolKit • Access to over 50 corpora • Corpus: body of text • NLP tools • Stemming, tokenizing, etc • Resources for learning

  25. NLTK • Stopword removal

  26. NLTK • Stopword removal
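
The stopword-removal code is not in the transcript. NLTK provides a stopword list through `nltk.corpus.stopwords.words("english")` once the "stopwords" corpus has been downloaded. To keep this sketch runnable without a download, it uses a tiny hand-written stopword set instead, which is an assumption and far from complete.

```python
# A tiny hand-written stopword set, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "her", "in"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


# Hypothetical example sentence.
print(remove_stopwords("Miss Taylor was the first to wed"))
```

With NLTK's corpus available, `STOPWORDS` could be replaced by `set(stopwords.words("english"))`.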

  27. NLTK • Stemming
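
The stemming example is missing from the transcript. A minimal sketch with NLTK's `PorterStemmer` follows. Which stemmer the talk used is not recorded, so the choice of Porter is an assumption.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical words; a stemmer strips suffixes to reach a common root.
words = ["running", "runs", "weddings"]
stems = [stemmer.stem(word) for word in words]

print(list(zip(words, stems)))
```

Stems are not always dictionary words. Lemmatizing, mentioned on the next slide, is the alternative when real words are needed.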

  28. NLTK • Other things • Lemmatizing, tokenization, tagging, parse trees • Classification • Chunking • Sentence structure

  29. Processing Large Data • Data that takes too long to process on your machine • Not “big data” but larger data • Solution: MapReduce! • Processing large datasets with a parallel, distributed algorithm • Map step • Reduce step

  30. Processing Large Data • Map step • Takes series of key/value pairs • Ex. Word counts: break line into words, return word and count within line • Reduce step • Once for each unique key: iterates through values associated with that key • Ex. Word counts: returns word and sum of all counts
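
The map and reduce steps described above can be sketched in plain Python, with no MapReduce framework involved. The input lines are hypothetical, and the function names `map_step` and `reduce_step` are made up for this illustration.

```python
from collections import Counter, defaultdict


def map_step(line):
    """Break a line into words and return (word, count within line) pairs."""
    return list(Counter(line.lower().split()).items())


def reduce_step(pairs):
    """Once per unique word, sum all the counts associated with it."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical input lines.
lines = ["miss taylor miss", "taylor first wed", "first wed"]

# Map every line, then reduce all the emitted pairs together.
mapped = [pair for line in lines for pair in map_step(line)]
totals = reduce_step(mapped)

print(totals)
```

In a real MapReduce system the map calls run in parallel across machines and the framework does the grouping by key. Here both happen in a single process.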

  31. MRJOB • Write MapReduce jobs in Python • Test code locally without installing Hadoop • Lots of thorough documentation • A few things to know • Keep everything in one class • MRJob program in a separate file • Output to new file if doing something like word counts

  32. mrjob • Stemmed file • Line 1: (‘miss’, 2), (‘taylor’, 1) • Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) • And so on…

  33. MRJob • Map • Line 1: (‘miss’, 2), (‘taylor’, 1) • Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1) • Line 3: (‘first’, 1), (‘wed’, 1) • Line 4: (‘father’, 1) • Line 5: (‘father’, 1) • Reduce • (‘miss’, 2) • (‘taylor’, 2) • (‘first’, 2) • (‘wed’, 2) • (‘father’, 2)

  34. MRJob • Let’s count all words in the Gutenberg file • Map step

  35. MRJob • Reduce (and run) step

  36. MRJob • Results • Mapped counts reduced • Key/val pairs

  37. MRJob • Other things • Run on Hadoop clusters • Can write highly complex jobs • Works with Elasticsearch

  38. Data Visualization • The “final step” • Conveying your results in a meaningful way • Literally see what’s going on

  39. Matplotlib • 2D visualization library • Very VERY widely used • Wide variety of plots • Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc)

  40. Matplotlib • Remember this?

  41. Matplotlib • Bar chart of distribution
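
The chart itself is not in the transcript. A minimal bar chart of a category distribution could be sketched as follows. The categories and counts are made up, and the `Agg` backend is selected only so the example runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical distribution of a categorical variable.
categories = ["low", "medium", "high"]
counts = [12, 30, 8]

fig, ax = plt.subplots()
bars = ax.bar(categories, counts)
ax.set_xlabel("Category")
ax.set_ylabel("Number of observations")
ax.set_title("Distribution of categories")

# Save to an in-memory buffer; pass a file name such as "distribution.png"
# to write the image to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```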

  42. Matplotlib • Let’s graph our word count frequencies • (Hint: It’s a power law distribution!)

  43. matplotlib • High frequency of low numbers, low frequency of high numbers
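
The plot is missing from the transcript. One way to see the pattern described here is to count how many words occur exactly n times and plot that on log-log axes, where a power law appears roughly as a straight line. The word counts below are invented for illustration and are too few to show a real power law.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical word counts, e.g. the output of a word-count job.
word_counts = {
    "the": 120, "of": 60, "emma": 40, "miss": 30, "taylor": 12,
    "wed": 3, "father": 3, "first": 2,
    "governess": 1, "indulgent": 1, "vex": 1,
}

# For each count n, how many distinct words occur exactly n times?
frequency_of_counts = Counter(word_counts.values())
xs = sorted(frequency_of_counts)
ys = [frequency_of_counts[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Times a word occurs")
ax.set_ylabel("Number of words")
ax.set_title("Word count frequencies")
```

Many words occur once or twice (high frequency of low numbers), while only a few words occur very often (low frequency of high numbers).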

  44. Matplotlib • Other things • Many different kinds of graphs • Customizable • Time series

  45. What next? • Phew! • Which tool to choose depends on your needs • Workflow: • Preprocess • Analyze • Visualize

  46. Resources • Pandas • http://pandas.pydata.org/ • scikit-learn • http://scikit-learn.org/ • NLTK • http://www.nltk.org/ • MRJob • http://mrjob.readthedocs.org/ • matplotlib • http://matplotlib.org/

  47. Contact Me! • Twitter • @sarah_guido • LinkedIn • https://www.linkedin.com/in/sarahguido • NYC Python • http://www.meetup.com/nycpython/

  48. The End! Questions?
