
Introduction to Data Mining with Weka

Introduction to Data Mining with Weka. Data Science and Business Analytics Denver Meetup Nancy Abramson Principal Data Scientist. Agenda. Introduction What does Open Source mean? Data Science and Data Mining Open Source Data Mining Tools Weka Overview Profiling Demonstration


Presentation Transcript


  1. Introduction to Data Mining with Weka Data Science and Business Analytics Denver Meetup Nancy Abramson Principal Data Scientist

  2. Agenda • Introduction • What does Open Source mean? • Data Science and Data Mining • Open Source Data Mining Tools • Weka • Overview • Profiling Demonstration • Analysis Demonstration • Summary

  3. Introduction – Who am I? • Datasource Consulting employee for the past 3 years, developing, using and evaluating open source and enterprise Business Intelligence tools • New hire at spotXchange as Principal Data Scientist • Bachelor of Science degree in Computer Science & Mathematics • Masters in Applied Statistics • Experience with databases, ETL, and analytics • Using “Open Source” or “free software” for more than 25 years • Market analysis in aerospace, financial, telephony, and retail

  4. What is Open Source? • A software development project in which code is developed by peer production and collaboration, with the end-product, source-code and documentation available at no cost to the public. • Free Access to Source Code • Free Redistribution • Strong development community • Examples: • Linux • Hadoop • Apache/Tomcat • MySQL • Weka

  5. Data Science and Data Mining • Data Science process defined by Dr. DJ Patil, former head of Data Analytics at LinkedIn • Clean up and prepare the data • Create measurable levers to increase the value of the business • Monitor the state of the metrics for changes • Experiment with the results of the models • Traditional Data Mining is used for… • Profiling data to check for quality, e.g. max, min, data types, and patterns between variables • Finding relationships between variables, e.g. clusters, regressions • Checking variance of a measure over time • Determining whether an experiment produced significant results

  6. Profiling and Heavy Lifting • Fun Stuff • See what you never thought possible • Name: Mr. Ed • Genus: Equus • Address: Apt 302, Manhattan, NY 10033

  7. Data Mining Tools • Reference: http://www.phiresearchlab.org/downloads/OpenSourceDataMining.pdf

  8. Weka Introduction • Waikato Environment for Knowledge Analysis (WEKA) • Developed by the University of Waikato, New Zealand • Java-based, distributed under the GNU General Public License • Explorer • Preprocessing, attribute selection, learning, visualization • Experimenter • Testing and evaluating machine learning algorithms • Knowledge Flow • Data-flow interface to WEKA • SimpleCLI

  9. load filter analyze

  10. Weka Pre-process Demo • Load and view CSV data • Compare pairs of attributes • Examine min/max data values • Compare nominal and numeric values • Save in ARFF format • Derived from census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html
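The profiling steps in this demo (load CSV, tell nominal from numeric columns, inspect min/max) can be sketched outside Weka as well. The following is a minimal illustration in plain Python, not Weka's own implementation; the function name `profile_csv`, the column names, and the three sample rows are assumptions made up for the example.

```python
import csv
import io


def profile_csv(text):
    """Summarize each column as numeric (min/max) or nominal (distinct values)."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value.strip())

    summary = {}
    for name, values in columns.items():
        try:
            # A column counts as numeric only if every value parses as a number.
            numbers = [float(v) for v in values]
            summary[name] = {"type": "numeric", "min": min(numbers), "max": max(numbers)}
        except ValueError:
            summary[name] = {"type": "nominal", "distinct": sorted(set(values))}
    return summary


# Hypothetical sample: column names and rows are illustrative, not the demo file.
SAMPLE = """age,workclass,hours
39, State-gov,40
50, Self-emp-not-inc,13
38, Private,40
"""

summary = profile_csv(SAMPLE)
for column, stats in summary.items():
    print(column, stats)
```

A real profile would also count missing values (the census data marks them with `?`), which this sketch treats as just another nominal value.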

  11. Attribute-Relation File Format

      @relation workers
      @attribute age numeric
      @attribute workclass {' State-gov',' Self-emp-not-inc',' Private',' Federal-gov',' Local-gov',' ?',' Self-emp-inc',' Without-pay',' Never-worked'}
      @attribute ' fnlwgt' numeric
      :
      @attribute ' wage' {' <=50K',' >50K'}
      @data
      39,' State-gov',77516,' Bachelors',13,' Never-married',' Adm-clerical',' Not-in-family',' White',' Male',2174,0,40,' United-States',' <=50K'
      50,' Self-emp-not-inc',83311,' Bachelors',13,' Married-civ-spouse',' Exec-managerial',' Husband',' White',' Male',0,0,13,' United-States',' <=50K'
      38,' Private',215646,' HS-grad',9,' Divorced',' Handlers-cleaners',' Not-in-family',' White',' Male',0,0,40,' United-States',' <=50K'
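To make the ARFF layout concrete, here is a small reader sketch in plain Python. It handles only the pieces visible on the slide (`@relation`, `@attribute` with numeric or `{...}` nominal types, single-quoted values, and `@data` rows) and is an assumption-laden illustration, not Weka's parser; `parse_arff`, `split_quoted`, and the shortened sample are hypothetical.

```python
import csv
import re

# Attribute names may be bare (age) or single-quoted (' wage').
ATTRIBUTE = re.compile(r"@attribute\s+(?:'([^']*)'|(\S+))\s+(.+)", re.IGNORECASE)


def split_quoted(text):
    """Split a comma-separated line whose values may be wrapped in single quotes."""
    return next(csv.reader([text], quotechar="'", skipinitialspace=True))


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple ARFF document."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lowered = line.lower()
        if in_data:
            rows.append(split_quoted(line))
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            match = ATTRIBUTE.match(line)
            name = match.group(1) if match.group(1) is not None else match.group(2)
            kind = match.group(3).strip()
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = split_quoted(kind.strip("{}"))
            attributes.append((name, kind))
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Shortened, hypothetical sample in the style of the slide (fewer attributes).
SAMPLE = """@relation workers
@attribute age numeric
@attribute workclass {' State-gov',' Private'}
@attribute ' wage' {' <=50K',' >50K'}
@data
39,' State-gov',' <=50K'
38,' Private',' <=50K'
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

Note that the leading spaces inside the quotes are part of the values, which is why the slide's nominal values read `' State-gov'` rather than `'State-gov'`.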

  12. Weka Classify Features • 49 data preprocessing tools • 76 classification/regression algorithms • 8 clustering algorithms • 15 attribute/subset evaluators + 10 search algorithms for feature selection • 3 algorithms for finding association rules • Derived from census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html

  13. Linear Regression • Predicted attribute is continuous • The correlation coefficient r indicates how well a line fits the data • It measures the strength and the direction of a linear relationship • -1 ≤ r ≤ +1 • A correlation with magnitude greater than 0.8 is generally described as strong, depending on the type of data • Uses • Forecasting • Exploring factor effects • Demo: cpu.arff
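As a rough illustration of the correlation coefficient and a least-squares line, the sketch below computes both for one predictor in plain Python. The helper names `pearson_r` and `fit_line` and the toy numbers are assumptions for the example; they are not taken from cpu.arff or from Weka's LinearRegression code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Toy data (hypothetical): a perfectly linear and a perfectly inverse relationship.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
slope, intercept = fit_line(xs, rising)
print(r_up, r_down, slope, intercept)
```

The two toy series hit the endpoints of the range: a perfect positive fit gives r = +1 and a perfect negative fit gives r = -1, so the sign carries the direction and the magnitude carries the strength.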

  14. Classification • Predicted attribute is categorical • Implemented methods • Naïve Bayes • decision trees and rules • neural networks • support vector machines • Demo: J48 decision tree with weather.arff
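To give a feel for how a decision tree chooses its first split, the sketch below computes entropy and information gain for one nominal attribute. Two assumptions apply: the (outlook, play) pairs are reproduced from memory of the commonly distributed 14-row weather data and should be checked against weather.arff, and J48 (Weka's C4.5 implementation) ranks splits by gain ratio, whereas this sketch shows the simpler plain information gain. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(pairs):
    """Entropy reduction from splitting (attribute_value, label) pairs by value."""
    labels = [label for _, label in pairs]
    groups = {}
    for value, label in pairs:
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / len(pairs) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Assumed (outlook, play) pairs: sunny 2 yes / 3 no, overcast 4 yes, rainy 3 yes / 2 no.
WEATHER = (
    [("sunny", "yes")] * 2 + [("sunny", "no")] * 3
    + [("overcast", "yes")] * 4
    + [("rainy", "yes")] * 3 + [("rainy", "no")] * 2
)

base_entropy = entropy([label for _, label in WEATHER])
outlook_gain = information_gain(WEATHER)
print(round(base_entropy, 3), round(outlook_gain, 3))
```

An attribute whose values separate the classes cleanly (here `overcast` is always `yes`) lowers the remaining entropy the most, which is why it tends to be picked near the root of the tree.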

  15. That’s All Nancy Abramson nabramson@ieee.org 720-468-1796 ?
