
Chapter 7: Transformations


Presentation Transcript


  1. Chapter 7: Transformations

  2. Attribute Selection • Adding irrelevant attributes confuses learning algorithms---so avoid such attributes • Both divide-and-conquer and separate-and-conquer algorithms suffer from this; Naïve Bayes does not • So first choose the attributes to be considered and then proceed---dimensionality reduction • Scheme-independent selection: • Keep just enough attributes to divide up the instance space in a way that separates all the training instances: for example, in Table 1, if we were to drop outlook, instances 1 and 4 would become inseparable---not good. Exhaustively checking subsets this way is a very tedious procedure (see the sketch below).
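
A minimal sketch of the scheme-independent check described above, assuming attributes are stored as dictionaries; the two-instance weather-style data and the helper names (separates, smallest_separating_subset) are made up for illustration:

    # Check whether a candidate attribute subset still separates every pair of
    # training instances that belong to different classes.
    from itertools import combinations

    def separates(instances, classes, subset):
        seen = {}
        for inst, cls in zip(instances, classes):
            key = tuple(inst[a] for a in subset)
            if key in seen and seen[key] != cls:
                return False          # two instances of different classes collide
            seen.setdefault(key, cls)
        return True

    def smallest_separating_subset(instances, classes, attributes):
        # Exhaustive search, smallest subsets first; the exponential number of
        # candidate subsets is what makes this procedure so tedious.
        for size in range(1, len(attributes) + 1):
            for subset in combinations(attributes, size):
                if separates(instances, classes, subset):
                    return subset
        return tuple(attributes)

    # Hypothetical illustration: dropping "outlook" makes the two instances
    # indistinguishable even though their classes differ.
    X = [{"outlook": "sunny", "temp": "hot"}, {"outlook": "rainy", "temp": "hot"}]
    y = ["no", "yes"]
    print(smallest_separating_subset(X, y, ["outlook", "temp"]))   # ('outlook',)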

  3. Using machine learning algorithms for attribute selection • Decision tree: apply a decision tree learner to all attributes and select only those that are actually used in the decisions---the selected attributes can then be used in another chosen learning algorithm • Use a linear SVM that ranks attributes by their weights to choose the attributes---recursive feature elimination • Using instance-based learning methods: • Sample instances randomly from the training set • Check neighboring instances of the same and different classes (near hits and near misses) • If a near hit has a different value for a certain attribute, that attribute appears to be irrelevant---reduce its weight • If a near miss has a different value, the attribute appears to be relevant and its weight should be increased • After repeating this procedure many times, selection takes place---only attributes with positive weights are chosen.
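
A rough sketch of the instance-based (near hit / near miss) weighting above, assuming numeric attributes on comparable scales; the function name and the Manhattan distance are illustrative choices, not prescribed by the slide:

    import numpy as np

    def instance_based_weights(X, y, n_samples=50, seed=0):
        """Weights drop when a near hit differs on an attribute and rise when
        a near miss differs; attributes with positive weight are kept."""
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_samples):
            i = rng.integers(n)                            # random instance
            dists = np.abs(X - X[i]).sum(axis=1)           # Manhattan distance
            same = np.where((y == y[i]) & (np.arange(n) != i))[0]
            diff = np.where(y != y[i])[0]
            hit = same[np.argmin(dists[same])]             # nearest same-class instance
            miss = diff[np.argmin(dists[diff])]            # nearest other-class instance
            w -= np.abs(X[i] - X[hit])                     # near hit differs -> less relevant
            w += np.abs(X[i] - X[miss])                    # near miss differs -> more relevant
        return w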

  4. Searching the attribute space • Fig 7.1 • Forward selection (start with the empty set and keep adding attributes) • Backward elimination (start with all attributes and eliminate them one by one) • Bidirectional search---a combination of the above two • Scheme-specific selection: • Cross-validation is used to measure the effectiveness of a subset of attributes for the chosen learning scheme
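
A sketch of scheme-specific forward selection, assuming scikit-learn is available and using a decision tree purely as the example scheme; subsets are scored by cross-validated accuracy as the slide suggests:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def forward_select(X, y, cv=5):
        """Greedily add the attribute that most improves the cross-validated
        accuracy of the chosen scheme; stop when no attribute helps."""
        X = np.asarray(X)
        remaining, chosen, best_score = list(range(X.shape[1])), [], 0.0
        while remaining:
            scores = [(cross_val_score(DecisionTreeClassifier(),
                                       X[:, chosen + [a]], y, cv=cv).mean(), a)
                      for a in remaining]
            score, attr = max(scores)
            if score <= best_score:       # no candidate improves the estimate
                break
            chosen.append(attr)
            remaining.remove(attr)
            best_score = score
        return chosen, best_score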

  5. Discretizing Numeric Attributes • Global discretization: used in the 1R learning scheme: sort the instances by the attribute's value and partition the values into ranges at the points where the class value changes---subject to a minimum-coverage criterion per interval • Local discretization: used in decision trees: when a specific attribute is used to split a node, a decision is made about the value at which the break should take place • Transforming a numeric attribute into k binary variables • Unsupervised discretization: does not use the classes of the training set---break the value range into intervals, e.g., equal-interval binning or equal-frequency binning---runs the risk of destroying distinctions within an interval or bin • Supervised discretization: takes the classes into account when forming intervals • Proportional k-interval discretization: the number of bins is chosen in a data-dependent fashion by setting it to the square root of the number of instances, combined with equal-frequency binning.
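
A sketch of the two unsupervised options, assuming a single numeric attribute: equal-frequency binning splits the sorted values into groups of roughly equal size, and proportional k-interval discretization simply sets the number of such groups to the square root of the number of instances:

    import math
    import numpy as np

    def equal_frequency_bins(values, n_bins):
        """Split the sorted values into n_bins groups of (nearly) equal size;
        returns the instance indices belonging to each bin."""
        order = np.argsort(values)
        return np.array_split(order, n_bins)

    def proportional_k_interval(values):
        """Number of bins = square root of the number of instances."""
        n_bins = max(1, round(math.sqrt(len(values))))
        return equal_frequency_bins(values, n_bins)

    # The sorted values shown on the next slide:
    values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    for bin_idx in equal_frequency_bins(values, 3):
        print([values[i] for i in bin_idx])
    # -> [64, 65, 68, 69, 70], [71, 72, 72, 75, 75], [80, 81, 83, 85]
    # proportional_k_interval(values) would use round(sqrt(14)) = 4 bins,
    # as in the proportional binning example on the next slide.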

  6. Sorted attribute values with their classes: 64 Y, 65 N, 68 Y, 69 Y, 70 Y, 71 N, 72 N, 72 Y, 75 Y, 75 Y, 80 N, 81 Y, 83 Y, 85 N
     Proportional binning (number of bins = 4): Bin 1: 64-68 (2Y 1N), Bin 2: 69-71 (2Y 1N), Bin 3: 72-75 (3Y 1N), Bin 4: 80-85 (2Y 2N)
     Equal-frequency binning (number of bins = 3): 64-70 (4Y 1N), 71-75 (3Y 2N), 80-85 (2Y 2N)

  7. Entropy-based Discretization • One approach: order the values of the attribute and, for each possible break-point, determine the information (entropy) value of the resulting split (p. 298-299); split at the point where this value is smallest---i.e., where the information gain is greatest • Call the chosen break-point A • Repeat this procedure for each of the parts formed by breaking at A • Repeat this step recursively until a stopping criterion is met
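
A sketch of one level of the entropy-based procedure: every boundary between successive distinct values is scored by the weighted entropy of the two parts it creates, and the cut with the smallest value (largest information gain) is chosen; the recursion and the stopping criterion are left out:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def best_cut(values, labels):
        pairs = sorted(zip(values, labels))
        vals = [v for v, _ in pairs]
        labs = [l for _, l in pairs]
        best = None
        for i in range(1, len(pairs)):
            if vals[i] == vals[i - 1]:
                continue                       # no cut between equal values
            info = (i * entropy(labs[:i]) +
                    (len(labs) - i) * entropy(labs[i:])) / len(labs)
            cut = (vals[i - 1] + vals[i]) / 2
            if best is None or info < best[0]:
                best = (info, cut)
        return best                            # (weighted entropy, cut point)

    values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    play   = ["Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"]
    print(best_cut(values, play))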

  8. Some Useful Transformations • Examples: • Subtracting one date attribute from another to obtain a new age attribute • Converting two attributes A and B into A/B, a new attribute representing their ratio • Reducing several nominal attributes to one by concatenating their values, producing a single attribute with k1 x k2 values • Principal component analysis: use a special coordinate system that depends on the given cloud of points, as follows: place the first axis in the direction of greatest variance of the points, maximizing the variance along that axis; choose the 2nd axis perpendicular to the first so that it maximizes the remaining variance; and so on in the multi-dimensional case; finally, keep the axes that account for most of the variance---the principal components • http://en.wikipedia.org/wiki/Principal_components_analysis
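
A compact PCA sketch using numpy's eigendecomposition; it follows the description above (centre the points, find the directions of greatest variance, keep the top ones) rather than any particular library routine:

    import numpy as np

    def pca(X, n_components):
        X = np.asarray(X, dtype=float)
        X = X - X.mean(axis=0)                    # centre the cloud of points
        cov = np.cov(X, rowvar=False)             # attribute covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][:n_components]
        components = eigvecs[:, order]            # axes of greatest variance
        return X @ components                     # data in the new coordinates

    # e.g. project a hypothetical 5-attribute dataset onto its 2 principal components:
    # X2 = pca(X, n_components=2)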

  9. Random Projections • Since PCA is expensive (cubic in the number of dimensions), an alternative is to project the data onto a randomly chosen subspace with a predetermined number of dimensions
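
A sketch of a random projection, assuming numeric data in a numpy array; the Gaussian random matrix and the 1/sqrt(k) scaling are one common choice, not the only one:

    import numpy as np

    def random_projection(X, n_dims, seed=0):
        """Project the data onto n_dims random directions; far cheaper than PCA,
        and with high probability distances are approximately preserved."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((X.shape[1], n_dims)) / np.sqrt(n_dims)
        return X @ R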

  10. Text to attribute vector • Convert a document into a vector over the words that occur in it---each component can be the frequency of a word or just its presence/absence • In other words, a document is characterized by the words that appear often in it.
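
A small bag-of-words sketch; the tokenization (lower-casing and splitting on whitespace) is deliberately naive and the example documents are made up:

    from collections import Counter

    def to_word_vectors(documents):
        tokenized = [doc.lower().split() for doc in documents]
        vocab = sorted(set(word for doc in tokenized for word in doc))
        vectors = []
        for doc in tokenized:
            counts = Counter(doc)
            vectors.append([counts[word] for word in vocab])  # word frequencies
        return vocab, vectors

    vocab, vecs = to_word_vectors(["the cat sat", "the dog sat on the mat"])
    print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
    print(vecs)    # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]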

  11. Time series • Sometimes we may replace an attribute's values by the differences between successive values, and similar derived quantities---a standard transformation for time-series data.
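
A one-line differencing sketch with numpy; the readings are hypothetical:

    import numpy as np

    series = np.array([20.1, 20.4, 21.0, 20.8, 21.5])   # hypothetical readings
    differences = np.diff(series)                        # value[t] - value[t-1]
    print(differences)                                   # approx. [0.3, 0.6, -0.2, 0.7]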

  12. Automatic Data Cleansing • Data mining techniques themselves can sometimes help clean corrupted data • Decision trees induced from data can be improved by discarding misclassified instances from the training set, relearning, and repeating until there are no more misclassified instances • Robust regression---linear regression is improved by removing outliers
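
A sketch of the discard-and-relearn idea, using a depth-limited scikit-learn decision tree as the example learner; the depth limit and the round cap are illustrative choices, not part of the slide:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def clean_by_relearning(X, y, max_rounds=10):
        """Fit a tree, drop the training instances it misclassifies, refit,
        and repeat until everything that remains is classified correctly."""
        X, y = np.asarray(X), np.asarray(y)
        keep = np.ones(len(y), dtype=bool)
        for _ in range(max_rounds):
            # depth-limited so the tree cannot simply memorize the noise
            model = DecisionTreeClassifier(max_depth=3).fit(X[keep], y[keep])
            correct = model.predict(X[keep]) == y[keep]
            if correct.all():
                break
            keep[np.where(keep)[0][~correct]] = False   # discard misclassified instances
        return X[keep], y[keep]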

  13. Combining Multiple Models • Bagging, boosting, and stacking are prominent methods for combining multiple models • Bagging: models receive equal weight---the combined output is, for example, a majority vote over the individual models' predictions • Boosting: similar to bagging, except that different weights are assigned to different models' outputs • Option tree (Fig. 7.10) and Fig. 7.11 (negative values mean play = yes; positive values mean play = no)
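
A minimal bagging sketch with scikit-learn decision trees as the base models: each model is trained on a bootstrap sample and all models get an equal vote, as described above; boosting would differ by reweighting the training instances and weighting the models' votes:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_models=10, seed=0):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample (with replacement)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        votes = np.array([m.predict(X) for m in models])  # shape: (n_models, n_instances)
        majority = []
        for col in votes.T:                               # one column per instance
            values, counts = np.unique(col, return_counts=True)
            majority.append(values[np.argmax(counts)])    # each model weighted equally
        return np.array(majority)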
