Exploratory Data Mining and Data Preparation
The Data Mining Process

[Figure: the data mining process as a cycle of phases — Business understanding, Data evaluation, Data preparation, Modeling, Evaluation, and Deployment — all revolving around the Data.]

Data Mining

Exploratory Data Mining
  • Preliminary process
  • Data summaries
    • Attribute means
    • Attribute variation
    • Attribute relationships
  • Visualization
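Such data summaries are simple to compute directly. A minimal Python sketch (the toy dataset and attribute names are made up for illustration):

```python
import statistics

# Toy dataset: one dict per instance; None marks a missing value.
# The attribute names "age" and "income" are hypothetical.
data = [
    {"age": 25, "income": 50000}, {"age": 37, "income": None},
    {"age": 45, "income": 72000}, {"age": None, "income": 61000},
]

def summarize(instances, attribute):
    """Mean, variation, and % missing for one numeric attribute."""
    values = [row[attribute] for row in instances]
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
        "missing_pct": 100.0 * (len(values) - len(present)) / len(values),
    }

print(summarize(data, "age"))  # "age" has 1 of 4 values missing (25%)
```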


Summary Statistics

  • Possible problems:
    • Many missing values (16%)
    • No examples of one value

[Screenshot: select an attribute in Weka to view its summary statistics and visualization; the highlighted attribute appears to be a good predictor of the class.]

Exploratory DM Process
  • For each attribute:
    • Look at data summaries
      • Identify potential problems and decide if an action needs to be taken (may require collecting more data)
    • Visualize the distribution
      • Identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
      • Evaluate usefulness of attributes

Weka Filters
  • Weka has many filters that are helpful in preprocessing the data
    • Attribute filters
      • Add, remove, or transform attributes
    • Instance filters
      • Add, remove, or transform instances
  • Process
    • Choose the filter from the drop-down menu
    • Edit parameters (if any)
    • Apply

Data Preprocessing
  • Data cleaning
    • Missing values, noisy or inconsistent data
  • Data integration/transformation
  • Data reduction
    • Dimensionality reduction, data compression, numerosity reduction
  • Discretization

Data Cleaning
  • Missing values
    • Weka reports % of missing values
    • Can use filter called ReplaceMissingValues
  • Noisy data
    • Due to uncertainty or errors
    • Weka reports unique values
    • Useful filters include
      • RemoveMisclassified
      • MergeTwoValues
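Mean imputation — the strategy Weka's ReplaceMissingValues filter uses for numeric attributes — can be sketched in plain Python (this is an illustration, not Weka's code):

```python
def replace_missing_with_mean(column):
    """Impute missing (None) entries with the mean of the observed values,
    mirroring what Weka's ReplaceMissingValues filter does for a numeric
    attribute. Standalone sketch, not Weka code."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

print(replace_missing_with_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```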

Data Transformation
  • Why transform data?
    • Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate
    • Normalizing data. Having attributes on approximately the same scale helps many data mining algorithms (and hence yields better models)
    • Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (and hence yields better models)
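A sketch of the first two ideas: a combined ratio attribute (the debt/income attributes are hypothetical) and min-max normalization, which is what Weka's Normalize filter does for numeric attributes by default:

```python
def min_max_normalize(column):
    """Rescale numeric values to [0, 1], as Weka's Normalize filter does
    (illustrative sketch, not Weka's implementation)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Combining attributes: a (hypothetical) debt-to-income ratio may be a
# more useful attribute than debt and income kept separate.
debt = [10, 20, 30]
income = [100, 50, 60]
ratio = [d / i for d, i in zip(debt, income)]

print(ratio)                      # [0.1, 0.4, 0.5]
print(min_max_normalize([2, 4, 10]))  # [0.0, 0.25, 1.0]
```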

Weka Filters
  • The data transformation filters in Weka include:
    • Add
    • AddExpression
    • MakeIndicator
    • NumericTransform
    • Normalize
    • Standardize

Discretization
  • Discretization reduces the number of values for a continuous attribute
  • Why?
    • Some methods can only use nominal data
      • E.g., the ID3 and Apriori algorithms in Weka
    • Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

Unsupervised Discretization
  • Unsupervised: does not take the class attribute into account
  • Equal-interval binning
  • Equal-frequency binning
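Both schemes fit in a few lines of Python (illustrative sketch, not Weka's Discretize implementation):

```python
def equal_interval_bins(values, k):
    """Assign each value a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins of (roughly) equal size;
    returns the cut points between consecutive bins."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]

print(equal_interval_bins([0, 5, 10], 2))        # [0, 1, 1]
print(equal_frequency_bins(list(range(10)), 2))  # one cut point: [5]
```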


Supervised Discretization
  • Takes the classification into account
  • Uses “entropy” to measure information gain
  • Goal: discretize into “pure” intervals
  • Usually there is no way to get completely pure intervals. Example (the weather data, instances sorted by temperature):

    Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

    Candidate binary splits yield impure intervals, e.g.
    1 yes | 8 yes & 5 no    and    9 yes & 4 no | 1 no
    (the slide labels the resulting intervals A–F).
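The entropy calculation behind comparing such splits can be sketched as follows (illustrative only, not Weka's implementation):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution, e.g. [9, 5] for 9 yes / 5 no."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_entropy(left, right):
    """Weighted average entropy of a binary split; lower means purer intervals."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)

# Split after the first instance: [1 yes] vs [8 yes & 5 no]
print(split_entropy([1, 0], [8, 5]))
```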

Error-Based Discretization
  • Count the number of misclassifications
    • The majority class in each interval determines the prediction
    • Count the instances that differ from it
  • Must restrict the number of intervals (otherwise each instance could get its own trivially pure interval)
  • Complexity
    • Brute-force: exponential time
    • Dynamic programming: linear time
  • Downside: cannot generate adjacent intervals with same label
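The counting step can be sketched in a few lines (illustrative; the class labels are made up):

```python
from collections import Counter

def interval_errors(labels):
    """Misclassifications in one interval when its majority class predicts."""
    counts = Counter(labels)
    return len(labels) - max(counts.values())

def total_errors(intervals):
    """Total misclassification count for a candidate discretization."""
    return sum(interval_errors(iv) for iv in intervals)

# Two intervals: one 'no' is misclassified in the first, one 'yes' in the second.
print(total_errors([["yes", "yes", "no"], ["yes", "no", "no", "no"]]))  # 2
```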

Weka Filter

[Slide: screenshot of a discretization filter in the Weka Explorer.]

Attribute Selection
  • Before inducing a model we almost always do input engineering
  • The most useful part of this is attribute selection (also called feature selection)
    • Select relevant attributes
    • Remove redundant and/or irrelevant attributes
  • Why?

Reasons for Attribute Selection
  • Simpler model
    • More transparent
    • Easier to interpret
  • Faster model induction
    • What about overall time?
  • Structural knowledge
    • Knowing which attributes are important may be inherently important to the application
  • What about the accuracy?

Filters
  • Results in either
    • Ranked list of attributes
      • Typical when each attribute is evaluated individually
      • Must select how many to keep
    • A selected subset of attributes
      • Forward selection
      • Best first
      • Random search such as genetic algorithm

Filter Evaluation Examples
  • Information Gain
  • Gain ratio
  • Relief
  • Correlation
    • High correlation with class attribute
    • Low correlation with other attributes
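Information gain — the measure behind Weka's InfoGainAttributeEval — can be sketched for a nominal attribute as follows (illustration only):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Information gain of a nominal attribute w.r.t. the class labels
    (the idea behind Weka's InfoGainAttributeEval; sketch, not Weka code)."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# A perfectly predictive attribute receives the full class entropy as gain.
print(information_gain(["a", "a", "b", "b"], ["yes", "yes", "no", "no"]))  # 1.0
```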

Wrappers
  • “Wrap around” the learning algorithm
  • Must therefore always evaluate subsets
  • Return the best subset of attributes
  • Apply for each learning algorithm
  • Use same search methods as before

  Wrapper loop:
    1. Select a subset of attributes
    2. Induce the learning algorithm on this subset
    3. Evaluate the resulting model (e.g., accuracy)
    4. Stop? If no, go back to step 1; if yes, return the best subset found
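The loop above can be sketched as an (intentionally naive) exhaustive wrapper; the evaluator and its accuracy scores below are hypothetical stand-ins for inducing and evaluating a real learner:

```python
from itertools import combinations

def wrapper_select(attributes, evaluate):
    """Exhaustive wrapper: evaluate() stands for inducing the learner on a
    subset and estimating its accuracy; the best-scoring subset is returned.
    Real wrappers search with forward selection, best-first, etc. rather
    than enumerating every subset."""
    best_subset, best_score = (), float("-inf")
    for r in range(1, len(attributes) + 1):
        for subset in combinations(attributes, r):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset

# Hypothetical evaluator: pretend accuracy is highest for the pair ('a', 'c').
fake_accuracy = {("a",): 0.7, ("c",): 0.65, ("a", "c"): 0.9}
print(wrapper_select(["a", "c"], lambda s: fake_accuracy.get(s, 0.5)))
```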

How does it help?
  • Naïve Bayes
  • Instance-based learning
  • Decision tree induction

Scalability
  • Data mining mostly uses well-developed techniques (AI, statistics, optimization)
  • Key difference: very large databases
  • How do we deal with scalability problems?
  • Scalability: the capability of handling increased load without adversely affecting performance

Massive Datasets
  • Very large data sets (millions+ of instances, hundreds+ of attributes)
  • Scalability in space and time
    • Data set cannot be kept in memory
      • E.g., processing one instance at a time
    • Learning time very long
      • How does the time depend on the input?
      • Number of attributes, number of instances

Two Approaches
  • Increased computational power
    • Only works if the algorithms can be sped up
    • The computing power must actually be available
  • Adapt the algorithms
    • Automatically scale down the problem so that it is always of approximately the same difficulty

Computational Complexity
  • We want to design algorithms with good computational complexity

[Figure: running time as a function of the number of instances (or the number of attributes), comparing logarithmic, linear, polynomial, and exponential growth.]

Example: Big-Oh Notation
  • Define
    • n = number of instances
    • m = number of attributes
  • Going once through all the instances has complexity O(n)
  • Examples
    • Polynomial complexity: O(mn²)
    • Linear complexity: O(m + n)
    • Exponential complexity: O(2ⁿ)

Classification
  • A problem for which no polynomial-time algorithm is known is called NP-complete
  • Finding the optimal decision tree is an example of an NP-complete problem
  • However, ID3 and C4.5 are polynomial time algorithms
    • Heuristic algorithms to construct solutions to a difficult problem
    • “Efficient” from a computational complexity standpoint but still have a scalability problem

Decision Tree Algorithms
  • Traditional decision tree algorithms assume training set kept in memory
  • Swapping data in and out of main and cache memory is expensive
  • Solution:
    • Partition data into subsets
    • Build a classifier on each subset
    • Combine classifiers
    • Not as accurate as a single classifier

Other Classification Examples
  • Instance-Based Learning
    • Goes through instances one at a time
    • Compares with new instance
    • Polynomial complexity O(mn)
    • Response time may be slow, however
  • Naïve Bayes
    • Polynomial complexity
    • Stores a very large model

Data Reduction
  • Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)
  • Some strategies
    • Dimensionality reduction
    • Data compression
    • Numerosity reduction

Dimensionality Reduction
  • Remove irrelevant, weakly relevant, and redundant attributes
  • Attribute selection
    • Many methods available
    • E.g., forward selection, backwards elimination, genetic algorithm search
  • Often much smaller problem
  • Often little degradation in predictive performance, and sometimes even better performance

Data Compression
  • Also aim for dimensionality reduction
  • Transform the data into a smaller space
  • Principal Component Analysis
    • Normalize the data
    • Compute c orthonormal vectors, the principal components, that provide a basis for the normalized data
    • Sort them in order of decreasing significance
    • Eliminate the weaker components
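For two-dimensional data the first principal component can even be found with power iteration in plain Python. This is a minimal sketch under the assumption of 2-D numeric data; real applications would use a linear algebra library (or Weka's PrincipalComponents):

```python
def principal_component(data, iters=200):
    """Top principal component of 2-D data via power iteration on the
    2x2 covariance matrix. Minimal sketch, not production code."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 sample covariance matrix
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix, then renormalize
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points on the line y = x: the first component is the diagonal direction.
print(principal_component([(0, 0), (1, 1), (2, 2), (3, 3)]))
```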

PCA: Example

[Slide: figure illustrating PCA on example data.]

Numerosity Reduction
  • Replace data with an alternative, smaller data representation
    • Histogram

    Raw data (52 values):
    1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
    15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
    20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

    Histogram counts:
      1–10: 13    11–20: 25    21–30: 14
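Building such an equal-width histogram is a small reduction (sketch):

```python
from collections import Counter

def histogram(values, width):
    """Replace raw values with equal-width bucket counts (e.g., 1-10, 11-20, ...)."""
    buckets = Counter((v - 1) // width for v in values)
    return {f"{b * width + 1}-{(b + 1) * width}": c
            for b, c in sorted(buckets.items())}

data = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
        15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
        20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30]
print(histogram(data, 10))  # {'1-10': 13, '11-20': 25, '21-30': 14}
```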

Other Numerosity Reduction
  • Clustering
    • Data objects (instances) in the same cluster can be treated as a single instance
    • Must use a scalable clustering algorithm
  • Sampling
    • Randomly select a subset of the instances to be used

Sampling Techniques
  • Different samples
    • Sample without replacement
    • Sample with replacement
    • Cluster sample
    • Stratified sample
  • The complexity of sampling is actually sublinear, i.e., O(s) where s is the sample size and s << n
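The first two sampling schemes in plain Python, using only the standard library. Note how fixing the seed makes the sample reproducible, just as specifying the same seed does for Weka's Resample filter:

```python
import random

random.seed(42)  # same seed -> same sample, as with Weka's Resample filter

population = list(range(100))

# Sampling without replacement: no instance can appear twice.
without = random.sample(population, 10)

# Sampling with replacement: instances may repeat.
with_repl = [random.choice(population) for _ in range(10)]

print(without)
print(with_repl)
```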

Weka Filters
  • PrincipalComponents is under the Attribute Selection tab
  • Already talked about filters to discretize the data
  • The Resample filter randomly samples a given percentage of the data
    • If you specify the same seed, you’ll get the same sample again
