## Exploratory Data Mining and Data Preparation


### The Data Mining Process

[Figure: the data mining process as a cycle: business understanding, data understanding, data preparation, modeling, evaluation, and deployment, all centered on the data.]

### Exploratory Data Mining

- Preliminary process
- Data summaries
  - Attribute means
  - Attribute variation
  - Attribute relationships
- Visualization

### Summary Statistics

- Possible problems:
  - Many missing values (16%)
  - No examples of one value

[Figure: Weka screenshot: select an attribute to see its summary statistics and a visualization; a callout marks an attribute that appears to be a good predictor of the class.]

### Exploratory DM Process

- For each attribute:
  - Look at the data summaries
    - Identify potential problems and decide whether action needs to be taken (may require collecting more data)
  - Visualize the distribution
    - Identify potential problems (e.g., one dominant attribute value, an even distribution, etc.)
  - Evaluate the usefulness of the attribute

### Weka Filters

- Weka has many filters that are helpful in preprocessing the data
  - Attribute filters: add, remove, or transform attributes
  - Instance filters: add, remove, or transform instances
- Process:
  - Choose a filter from the drop-down menu
  - Edit its parameters (if any)
  - Apply it

### Data Preprocessing

- Data cleaning
  - Missing values, noisy or inconsistent data
- Data integration/transformation
- Data reduction
  - Dimensionality reduction, data compression, numerosity reduction
- Discretization

### Data Cleaning

- Missing values
  - Weka reports the percentage of missing values
  - Can use the filter called ReplaceMissingValues
- Noisy data
  - Due to uncertainty or errors
  - Weka reports the number of unique values
  - Useful filters include RemoveMisclassified and MergeTwoValues

### Data Transformation

- Why transform data?
  - Combining attributes: for example, the ratio of two attributes might be more useful than keeping them separate
  - Normalizing data: having attributes on approximately the same scale helps many data mining algorithms (hence better models)
  - Simplifying data: for example, working with discrete data is often more intuitive and helps the algorithms (hence better models)

### Weka Filters

- The data transformation filters in Weka include:
  - Add
  - AddExpression
  - MakeIndicator
  - NumericTransform
  - Normalize
  - Standardize

### Discretization

- Discretization reduces the number of values for a continuous attribute
- Why?
  - Some methods can only use nominal data, e.g., the ID3 and Apriori algorithms in Weka
  - Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

### Unsupervised Discretization

- Unsupervised: does not take the class into account
- Equal-interval binning
- Equal-frequency binning

### Supervised Discretization

- Takes the classification into account
- Uses "entropy" to measure information gain
- Goal: discretize into "pure" intervals
- There is usually no way to get completely pure intervals, as in this temperature example:

| Temperature | 64 | 65 | 68 | 69 | 70 | 71 | 72 | 72 | 75 | 75 | 80 | 81 | 83 | 85 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Play | Yes | No | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | Yes | Yes |

[Figure: the values split into candidate intervals A-F, annotated with class counts such as "1 yes", "8 yes & 5 no", "9 yes & 4 no", and "1 no".]
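Entropy-based splitting can be shown concretely. Below is a minimal Python sketch (the lecture itself uses Weka) that scores every candidate cut point on the temperature data above by the information gain it yields; the function names are illustrative, not part of any library:

```python
from collections import Counter
from math import log2

# Temperature values and play labels from the example above.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "no", "yes", "yes", "no", "yes", "yes"]

def entropy(ys):
    """Class entropy, in bits, of a list of labels."""
    n = len(ys)
    return -sum((c / n) * log2(c / n) for c in Counter(ys).values())

def info_after_split(cut):
    """Weighted average entropy after splitting at temperature <= cut."""
    left  = [y for t, y in zip(temps, labels) if t <= cut]
    right = [y for t, y in zip(temps, labels) if t > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

base = entropy(labels)  # entropy before any split
# Try every boundary between adjacent distinct values as a cut point.
for cut in sorted(set(temps))[:-1]:
    print(f"cut at {cut}: information gain = {base - info_after_split(cut):.3f}")
```

The cut with the highest gain is taken first and the procedure recurses on each resulting interval; no single cut produces pure intervals here, which is the slide's point.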
### Error-Based Discretization

- Count the number of misclassifications
  - The majority class in an interval determines its prediction
  - Count the instances that differ from it
- Must restrict the number of intervals
- Complexity:
  - Brute force: exponential time
  - Dynamic programming: linear time
- Downside: cannot generate adjacent intervals with the same label

### Weka Filter

[Figure: Weka screenshot of a discretization filter and its parameters.]

### Attribute Selection

- Before inducing a model we almost always do input engineering
- The most useful part of this is attribute selection (also called feature selection)
  - Select relevant attributes
  - Remove redundant and/or irrelevant attributes
- Why?

### Reasons for Attribute Selection

- Simpler model
  - More transparent
  - Easier to interpret
- Faster model induction
  - What about the overall time?
- Structural knowledge
  - Knowing which attributes are important may be inherently important to the application
- What about accuracy?

### Attribute Selection Methods

[Figure: overview of attribute selection methods, divided into filters and wrappers.]

### Filters

- Result in either:
  - A ranked list of attributes
    - Typical when each attribute is evaluated individually
    - Must select how many to keep
  - A selected subset of attributes
    - Forward selection
    - Best first
    - Random search, such as a genetic algorithm

### Filter Evaluation Examples

- Information gain
- Gain ratio
- Relief
- Correlation
  - High correlation with the class attribute
  - Low correlation with the other attributes

### Wrappers

- "Wrap around" the learning algorithm
- Must therefore always evaluate whole subsets of attributes
- Return the best subset of attributes
- Must be applied anew for each learning algorithm
- Use the same search methods as before
- The loop, sketched in code below:
  1. Select a subset of attributes
  2. Induce the learning algorithm on this subset
  3. Evaluate the resulting model (e.g., its accuracy)
  4. If the stopping criterion is met, stop; otherwise return to step 1
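A minimal Python sketch of this wrapper loop as greedy forward selection, assuming scikit-learn is available; `forward_select`, its defaults, and the choice of Naïve Bayes as the wrapped learner are illustrative, not prescribed by the slides:

```python
# X is a 2-D NumPy feature matrix, y the class labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_select(X, y, estimator=None, cv=5):
    """Greedily add the attribute that most improves cross-validated
    accuracy; stop when no remaining attribute helps."""
    est = estimator or GaussianNB()
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Induce and evaluate the learner on each candidate subset.
        score, j = max(
            (np.mean(cross_val_score(est, X[:, selected + [j]], y, cv=cv)), j)
            for j in remaining
        )
        if score <= best_score:  # stopping criterion: no improvement
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score
```

Because the learner is re-induced and re-evaluated for every candidate subset, wrappers cost far more than filters, but the chosen subset is tuned to that particular algorithm.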
### How does it help?

- Naïve Bayes
- Instance-based learning
- Decision tree induction

### Scalability

- Data mining mostly uses well-developed techniques (AI, statistics, optimization)
- Key difference: very large databases
- How do we deal with scalability problems?
- Scalability: the capability of handling increased load in a way that does not adversely affect performance

### Massive Datasets

- Very large data sets (millions+ of instances, hundreds+ of attributes)
- Scalability in space and time:
  - The data set cannot be kept in memory
    - E.g., processing one instance at a time
  - Learning time is very long
    - How does the time depend on the input (number of attributes, number of instances)?

### Two Approaches

- Increased computational power
  - Only works if the algorithms can be sped up
  - Must have the computing capacity available
- Adapted algorithms
  - Automatically scale down the problem so that it is always of approximately the same difficulty

### Computational Complexity

- We want to design algorithms with good computational complexity

[Figure: running time versus number of instances (or number of attributes) for logarithmic, linear, polynomial, and exponential growth.]

### Example: Big-Oh Notation

- Define:
  - n = number of instances
  - m = number of attributes
- Going once through all the instances has complexity O(n)
- Examples:
  - Polynomial complexity: O(mn²)
  - Linear complexity: O(m + n)
  - Exponential complexity: O(2ⁿ)

### Classification

- A problem for which no polynomial-time algorithm is known, and which is as hard as any problem in NP, is called NP-complete
- Finding the optimal decision tree is an example of an NP-complete problem
- However, ID3 and C4.5 are polynomial-time algorithms
  - Heuristic algorithms that construct solutions to a difficult problem
  - "Efficient" from a computational-complexity standpoint, but they still have a scalability problem

### Decision Tree Algorithms

- Traditional decision tree algorithms assume the training set is kept in memory
  - Swapping in and out of main and cache memory is expensive
- Solution:
  - Partition the data into subsets
  - Build a classifier on each subset
  - Combine the classifiers
  - Not as accurate as a single classifier built from all the data

### Other Classification Examples

- Instance-based learning
  - Goes through the instances one at a time
  - Compares each with the new instance
  - Polynomial complexity O(mn)
  - Response time may be slow, however
- Naïve Bayes
  - Polynomial complexity
  - Stores a very large model

### Data Reduction

- Another approach is to reduce the size of the data before applying a learning algorithm (preprocessing)
- Some strategies:
  - Dimensionality reduction
  - Data compression
  - Numerosity reduction

### Dimensionality Reduction

- Remove irrelevant, weakly relevant, and redundant attributes
- Attribute selection
  - Many methods available, e.g., forward selection, backward elimination, genetic algorithm search
- Often yields a much smaller problem
- Often little degradation in predictive performance, or even better performance

### Data Compression

- Also aims at dimensionality reduction
- Transforms the data into a smaller space
- Principal Component Analysis:
  - Normalize the data
  - Compute c orthonormal vectors, or principal components, that provide a basis for the normalized data
  - Sort them by decreasing significance
  - Eliminate the weaker components

### PCA: Example

[Figure: PCA example plot.]
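The PCA steps listed above can be written out directly. A minimal NumPy sketch (in practice you would use Weka's PrincipalComponents filter or a library routine); it assumes no attribute is constant, and the function name is illustrative:

```python
import numpy as np

def pca_reduce(X, c):
    """Project X onto its first c principal components,
    following the steps on the slide."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize the data
    cov = np.cov(Xn, rowvar=False)              # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]           # sort by decreasing significance
    basis = eigvecs[:, order[:c]]               # eliminate the weaker components
    return Xn @ basis                           # data in the smaller space
```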
### Numerosity Reduction

- Replace the data with an alternative, smaller representation
- Histogram example, for the data

  1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

  [Figure: histogram of counts for the bins 1-10, 11-20, and 21-30.]

### Other Numerosity Reduction

- Clustering
  - Data objects (instances) in the same cluster can be treated as the same instance
  - Must use a scalable clustering algorithm
- Sampling
  - Randomly select a subset of the instances to be used

### Sampling Techniques

- Different kinds of samples:
  - Sample without replacement
  - Sample with replacement
  - Cluster sample
  - Stratified sample
- The complexity of sampling is actually sublinear: O(s), where s is the number of samples and s ≪ n

### Weka Filters

- PrincipalComponents is under the Attribute Selection tab
- We have already talked about the filters that discretize the data
- The Resample filter randomly samples a given percentage of the data
  - If you specify the same seed, you'll get the same sample again
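The effect of a fixed seed is easy to demonstrate. A small Python sketch of seeded sampling with and without replacement, mirroring what the slide says about Weka's Resample filter (the data and variable names are illustrative):

```python
import random

data = list(range(100))        # stand-in for a data set of 100 instances

rng = random.Random(42)        # a fixed seed makes the sample reproducible
without_repl = rng.sample(data, 10)                 # sample without replacement
with_repl = [rng.choice(data) for _ in range(10)]   # sample with replacement

# Re-creating the generator with the same seed yields the same sample again,
# just as specifying the same seed does for Weka's Resample filter.
assert random.Random(42).sample(data, 10) == without_repl
```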