

Data Mining Term Project

Machine Learning with WEKA

Weka Explorer Tutorial

for Version 3.4.3

Svetlana S. Aksenova

Department of Computer Science

California State University, Sacramento

Fall 2004

Machine learning methods for data mining
  • Use techniques from computer science, statistics and probability, and data visualization to search for patterns and relationships in large data sets
  • Allow large amounts of data to be analyzed automatically
  • The results of the analysis can be used to make predictions automatically, faster and more accurately
  • The results of the analysis help make decisions faster and more accurately
About WEKA
  • Developed at the University of Waikato in New Zealand
  • Open source software issued under the GNU General Public License
  • WEKA is a data mining system written in Java
  • Implements data mining algorithms
  • Compatible with most computer platforms
  • Algorithms are applied to a dataset through either the command line or the graphical user interface
Introduction to the Tutorial
  • Created to help in the learning process
  • Consists of 8 parts:
    • Introduction
    • Launching WEKA
    • Preprocessing Data
    • Building “Classifiers”
    • Clustering Data
    • Finding Associations
    • Attribute Selection
    • Data Visualization

Preprocessing

  • Data can be read from:
    • Local filesystem (in ARFF, CSV, C4.5, or binary formats)
    • URL
    • SQL database (using JDBC)
  • File conversion
  • Preprocessing window
  • Preprocessing tools - “filters”

File Conversion

CSV

ARFF

Excel
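
To make the CSV-to-ARFF step concrete, here is a minimal sketch of such a conversion written outside WEKA. It assumes a CSV file with a header row and no missing values; the function name `csv_to_arff` and the sample rows are illustrative and do not come from the tutorial.

```python
import csv
import io


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (with a header row) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [rec[col] for rec in records]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list distinct values in order of appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines += ["", "@data"]
    lines += [",".join(rec) for rec in records]
    return "\n".join(lines)


# Illustrative rows only; not the contents of any file used in the tutorial.
sample = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff = csv_to_arff(sample, relation="weather")
print(arff)
```

The sketch declares a column as numeric only when every value parses as a number; anything else becomes a nominal attribute with its observed values.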

Open File (from a website)

http://gaia.ecs.csus.edu/~aksenovs/

weather.arff

Setting Filters
  • WEKA contains filters for discretization, normalization, resampling, attribute selection, transformation and combination of attributes.
  • Some techniques, such as association rule mining, can only be performed on categorical data.
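
As an illustration of what discretization does to a numeric attribute, the following sketch bins values into equal-width intervals. It is a plain-Python approximation of the idea, not WEKA's filter; the function name, the bin labels, and the sample temperatures are assumptions made for the example.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a categorical label using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single bin
        else:
            # Clamp so the maximum value falls into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
        labels.append("bin%d" % (idx + 1))
    return labels


temps = [64, 70, 75, 85]  # hypothetical temperature readings
binned = equal_width_bins(temps)
print(binned)  # ['bin1', 'bin1', 'bin2', 'bin3']
```

After this step the attribute is categorical, so techniques that require categorical data can be applied to it.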

Filter Configuration Options

Right-click on the filter

Building “Classifiers”
  • Choosing a classifier: J48 (C4.5)
Output the Result

The weather data in “weather.arff” was used for classification.

Exercise
  • Given at the end of the section

Classification Exercise

Use the ID3 algorithm to classify the weather data from the “weather.arff” file. Perform initial preprocessing and create a version of the initial dataset in which all numeric attributes are converted to categorical data.
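
ID3 chooses the attribute to split on by information gain. The sketch below computes that quantity for a single categorical attribute. The rows are an assumption: they reproduce the outlook/play counts of the classic 14-instance weather data set, which the slides reference but do not list.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Assumed counts: sunny 2 yes / 3 no, overcast 4 yes, rainy 3 yes / 2 no.
rows = (
    [{"outlook": "sunny", "play": "yes"}] * 2
    + [{"outlook": "sunny", "play": "no"}] * 3
    + [{"outlook": "overcast", "play": "yes"}] * 4
    + [{"outlook": "rainy", "play": "yes"}] * 3
    + [{"outlook": "rainy", "play": "no"}] * 2
)
gain = information_gain(rows, "outlook", "play")
print(round(gain, 3))
```

ID3 would compute this gain for every attribute, split on the one with the highest value, and repeat on each branch.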

Clustering Data

The clustering schemes available in WEKA are k-Means, EM, Cobweb, X-means, and FarthestFirst.

Customer data from “customers.arff” was used for clustering.

Clustering Data (cont’d)
  • Choosing clustering scheme
    • k-means
    • 5 clusters
  • Setting test options
  • Analyzing results
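
For readers who want to see what the k-means scheme does, here is a minimal sketch of the algorithm. It is not WEKA's implementation: it seeds the centroids with the first k points to stay deterministic, uses two clusters rather than the five chosen in the tutorial, and runs on hypothetical 2-D points instead of the customer data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    centroids = [points[i] for i in range(k)]  # deterministic seeding (assumption)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


points = [(1, 1), (1, 2), (8, 8), (9, 8)]  # hypothetical data
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

On this toy input the two low points end up in one cluster and the two high points in the other.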
Exercise
  • Given at the end of the section

Clustering Exercise

Apply the k-means algorithm to the bank data from the “bank.arff” file. Perform initial preprocessing and create a version of the initial data set in which the ID field is removed and the "children" attribute is converted to categorical data.

Finding Associations
  • Apriori
  • Works only with discrete data
  • Identifies statistical dependencies between groups of attributes
  • Grocery store data from the “grocery.arff” file was used, with confidence 40% and support 30%.
  • Setting test options
  • Analyzing results
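
The roles of the support and confidence thresholds can be shown with a small level-wise Apriori sketch. It is a simplified illustration, not WEKA's implementation; the transactions are hypothetical and stand in for the grocery data, and the thresholds default to the 30% support and 40% confidence used above.

```python
from itertools import combinations


def apriori_rules(transactions, min_support=0.3, min_confidence=0.4):
    """Return (lhs, rhs, support, confidence) for rules meeting both thresholds."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        kept = [s for s in level if support(s) >= min_support]
        for s in kept:
            frequent[s] = support(s)
        # Join frequent sets into candidates that are one item larger.
        size = len(kept[0]) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune candidates that have an infrequent subset.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
        ]

    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((sorted(lhs), sorted(itemset - lhs), sup, conf))
    return rules


# Hypothetical market baskets.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]
rules = apriori_rules(transactions)
for lhs, rhs, sup, conf in rules:
    print(lhs, "->", rhs, "support=%.2f confidence=%.2f" % (sup, conf))
```

Raising either threshold shrinks the rule list: support filters out rare itemsets, while confidence filters out rules whose right-hand side does not follow often enough from the left-hand side.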
Exercise
  • Given at the end of the section

Association Rules Exercise

Use the Apriori algorithm to generate association rules for the Iris data from the “iris.arff” file. Perform initial preprocessing and create a version of the initial data set in which the numeric attributes are converted to categorical data.

Attribute Selection
  • Searches through all possible combinations of attributes
  • Finds which subset of attributes works best for prediction
  • Consists of two parts:
    • a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
    • an evaluation method: correlation-based, wrapper, information gain, chi-squared
  • Weather data from the “weather.arff” file was used
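
The split between a search method and an evaluation method can be sketched with greedy forward selection driven by a caller-supplied scoring function. Everything specific here is hypothetical: the attribute names are assumed from the classic weather data, and the relevance numbers and the per-attribute penalty in `toy_score` are invented purely to make the example run.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves the score; stop when none does."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        candidate = None
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best:
                best, candidate, improved = s, a, True
        if improved:
            selected.append(candidate)
    return selected, best


# Hypothetical evaluation method: summed relevance minus a cost per attribute.
relevance = {"outlook": 0.25, "humidity": 0.15, "windy": 0.02, "temperature": 0.03}


def toy_score(subset):
    return sum(relevance[a] for a in subset) - 0.04 * len(subset)


selected, best = forward_selection(list(relevance), toy_score)
print(selected, round(best, 2))
```

Swapping `toy_score` for a different evaluation function changes which subset wins without touching the search loop, which is the point of keeping the two parts separate.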
Data Visualization
  • visualize a 2-D plot of the current working relation
  • determine difficulty of the learning problem
Selecting Instances
  • A group of points on the graph can be selected in four ways:
    • Select Instance
    • Rectangle
    • Polygon
    • Polyline
Why should we use WEKA?
  • You can solve a machine learning problem with a minimum of programming
  • WEKA includes
    • reading of data,
    • implementation of filtering,
    • result evaluation
Performance
  • Has not been evaluated in this project
  • Can it process large ARFF files (GB)?
  • An answer was found on “wekalist”: large files can be processed with schemes that are either incrementally trainable or can be made to be.
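
What "incrementally trainable" means can be illustrated with a learner that is updated one instance at a time, so the full data set never has to sit in memory. The naive Bayes-style counter below is an assumption chosen for the illustration; the slides do not name a specific scheme, and the streamed instances are hypothetical.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, trained one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)  # (attribute index, value, label) -> count
        self.values_seen = defaultdict(set)   # attribute index -> distinct values

    def update(self, features, label):
        """Fold a single instance into the counts."""
        self.class_counts[label] += 1
        for i, v in enumerate(features):
            self.value_counts[(i, v, label)] += 1
            self.values_seen[i].add(v)

    def predict(self, features):
        total = sum(self.class_counts.values())

        def score(label):
            p = self.class_counts[label] / total
            for i, v in enumerate(features):
                # Laplace smoothing avoids zero probabilities for unseen values.
                num = self.value_counts.get((i, v, label), 0) + 1
                den = self.class_counts[label] + len(self.values_seen[i])
                p *= num / den
            return p

        return max(self.class_counts, key=score)


# Hypothetical stream of (features, label) pairs; in practice these would be
# read line by line from a large file.
stream = [
    (("sunny", "high"), "no"),
    (("sunny", "normal"), "yes"),
    (("overcast", "high"), "yes"),
    (("rainy", "high"), "yes"),
    (("sunny", "high"), "no"),
]
model = IncrementalNaiveBayes()
for features, label in stream:
    model.update(features, label)

print(model.predict(("overcast", "normal")))
print(model.predict(("sunny", "high")))
```

Memory use depends on the number of distinct attribute values and classes, not on the number of instances, which is why schemes of this kind can cope with very large files.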
Future Work
  • The following interfaces were not covered due to time constraints:
  • ‘Simple CLI’ provides a simple command-line interface and allows direct execution of WEKA commands.
  • ‘KnowledgeFlow’ is a Java-Beans-based interface for setting up and running machine learning experiments.