Lecture 3, Gonca Gulser
Data Pre-processing
What is it? Ideas?
Definition: A series of actions that improve the quality of data to make it ready for any kind of analysis
Possible Problems
Identifying INCOMPLETE data: missing attributes, lack of attribute values
What is Noise? - Random error or variance in the measured data
Sorted data for price= 4,8,15,21,21,24,25,28,34
Partition into Bins:
Smoothing By means
Smoothing By boundaries
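The binning steps for the price data can be sketched as follows. The notes do not state the bin size, so this sketch assumes equal-depth bins of 3 values each; the function names are illustrative, and ties in boundary smoothing are sent to the lower boundary by assumption.

```python
def partition_equal_depth(sorted_values, bin_size):
    """Split already-sorted values into consecutive bins of bin_size items."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = partition_equal_depth(prices, 3)  # bin size 3 is an assumption
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```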
2) Combined Human and Computer Power
Using any suitable algorithm, let the computer produce a list of suspected outliers or noise, called the “surprise” list
Then go over the list and remove the irrelevant data by hand
This is easier and faster than going through the whole data set
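One way the computer could build such a "surprise" list is sketched below. The notes leave the algorithm open, so the interquartile-range rule used here (flag values more than 1.5 IQR outside the quartiles) is only one possible choice, and the sample readings are made up for illustration.

```python
import statistics


def surprise_list(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR fences.

    The list is meant for manual review, not automatic deletion.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements with one obviously suspicious reading.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
surprise = surprise_list(readings)
print(surprise)  # [(6, 95)]
```

A person then only has to inspect the flagged entries instead of every record.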
4) Other methods
Data reduction involving discretization (dividing data into sub-categories such as low/medium/high); for example, a decision tree reduces the data step by step
Concept Hierarchies - a form of discretization that can also be used to smooth noise
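A minimal sketch of discretization into low/medium/high is given below. The cut points (30,000 and 70,000) and the sample incomes are hypothetical values chosen only to illustrate the idea; they do not come from the lecture.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
CUT_POINTS = [30_000, 70_000]
LABELS = ["low", "medium", "high"]


def discretize(value, cut_points=CUT_POINTS, labels=LABELS):
    """Map a numeric value to the label of the interval it falls into."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the matching label.
    return labels[bisect_right(cut_points, value)]


incomes = [12_000, 54_000, 73_600, 98_000]
categories = [discretize(x) for x in incomes]
print(categories)  # ['low', 'medium', 'high', 'high']
```

Replacing each raw value by its category hides small fluctuations inside a category, which is why discretization also acts as a smoothing step.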
INTEGRATION: Merge Data from multiple data sources
TRANSFORMATION: Transform data into an appropriate format for any given data mining algorithm.
Metadata can solve the problem, e.g. cust_id and cust_number are the same thing
Because of different metrics and different perceptions of the data, multiple sources can hold the same data in totally different formats and logic.
e.g. Suppose that the min and max values for the attribute income are $12,000 and $98,000, and we would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to
(73,600-12,000)/(98,000-12,000) x (1.0-0.0) + 0.0 = 0.716
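The min-max mapping for the income example can be sketched as follows. The function and parameter names are illustrative; the formula is the standard min-max rescaling of a value from [old_min, old_max] to [new_min, new_max].

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


income_scaled = min_max_normalize(73_600, old_min=12_000, old_max=98_000)
print(round(income_scaled, 3))  # 0.716
```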
e.g. Suppose that the mean and the standard deviation of income are $54,000 and $16,000 respectively. With z-score normalization, a value of $73,600 is transformed to
(73,600-54,000)/16,000 = 1.225
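The same z-score calculation as a small sketch. The function name is illustrative, and the mean and standard deviation are passed in directly, as in the example, rather than estimated from data.

```python
def z_score(v, mean, std_dev):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std_dev


income_z = z_score(73_600, mean=54_000, std_dev=16_000)
print(income_z)  # 1.225
```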
Vnormalize = V / 10^j, where j is the smallest integer such that max(|Vnormalize|) < 1
e.g. Suppose that the values of A range from -986 to 917. The maximum absolute value for A is 986. To normalize by decimal scaling we divide each value by 1000 (j=3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917
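Decimal scaling can be sketched as below: search for the smallest j such that every scaled value has absolute value below 1, then divide by 10^j. The function names are illustrative, and the sketch assumes a non-empty list of numbers.

```python
def decimal_scaling_exponent(values):
    """Smallest integer j >= 0 such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Return the scaled values together with the exponent j that was used."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986, 917])
print(j)       # 3
print(scaled)  # [-0.986, 0.917]
```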
e.g. adding the attribute area to the data set, computed from height and width
Golden Rule – the time spent on reduction should not exceed the time saved by working on the reduced data (Reduction Time < Saved Time)
Equiwidth – the width of the bucket range is uniform.
Equidepth – the buckets are created so that, roughly, the frequency of each bucket is constant (each bucket contains the same number of contiguous data samples)
V-optimal – the histogram with the least variance. Histogram variance is a weighted sum of the variances of the original values that each bucket represents, where the bucket weight is equal to the number of values in the bucket. (If the data is one-dimensional, V-optimal corresponds to K-means)
MaxDiff – considers the difference between each pair of adjacent values. A bucket boundary is established between the pairs having the K-1 largest differences, where K is the number of buckets specified by the user
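Three of the bucketing rules above can be sketched on the price data from the binning example. The function names are illustrative. The sketch assumes the values are not all equal (otherwise the equiwidth bucket width is zero), and when two gaps tie in MaxDiff the choice between them is arbitrary. V-optimal is left out because it needs a search over all possible boundaries.

```python
def equiwidth_buckets(values, k):
    """k buckets whose value ranges all have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    buckets = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would fall just past the last bucket, so clamp it.
        idx = min(int((v - lo) / width), k - 1)
        buckets[idx].append(v)
    return buckets


def equidepth_buckets(values, k):
    """k buckets of contiguous sorted values with roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic spreads any remainder across the buckets.
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def maxdiff_buckets(values, k):
    """Put the k-1 boundaries at the largest gaps between adjacent sorted values."""
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Positions of the k-1 largest gaps; a boundary goes right after each one.
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(ordered[start:i + 1])
        start = i + 1
    buckets.append(ordered[start:])
    return buckets


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equiwidth_buckets(prices, 3))  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equidepth_buckets(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(maxdiff_buckets(prices, 4))    # [[4, 8], [15], [21, 21, 24, 25, 28], [34]]
```

Equiwidth buckets can end up with very different counts, while equidepth buckets keep the counts equal at the cost of unequal ranges.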
Besides histograms, the following are also used for numerosity reduction