
Data Preprocessing


Presentation Transcript


  1. Data Preprocessing

  2. Data – Things to consider • Type of data: determines which tools can be used to analyze the data • Quality of data: • Tolerate some level of imperfection • Improving the quality of the data improves the quality of the results • Preprocessing: modify the data to better fit the data mining tools: • Change length into short, medium, long • Reduce the number of attributes

  3. Data • Collection of objects or records • Each object described by a number of attributes • Attribute: property of the object, whose value may vary from object to object or from time to time

  4. What kind of Data? • Any data as long as it is meaningful for the target application • Database data • Data warehouse data • Data streams • Sequence data • Graph • Spatial data • Text data

  5. Database data • Data stored in a database management system (DBMS) • A collection of interrelated data and a software system to manage and access the data • Relational model: • Customer (id, firstname, lastname, address, city, state, phone) • Product (id, description, weight, unit) • Order (id, order_number, customer_id, product_id, quantity, price)

  6. Data Warehouses • Data collected from multiple data sources • Stored under a unified schema • Usually residing at a single site • Provide historical information • Used for reporting

  7. Data Streams • Sensor data: collecting GPS/environment readings and sending a reading every tenth of a second • Image data: satellite data, surveillance cameras • Web traffic: a node on the Internet receives streams of IP packets • [Diagram: several input streams (numbers, characters, bits) feeding a Stream Processor, which produces an output stream and writes to an Archive]

  8. Sequence Data • Data obtained by: • connecting all records associated with the same object • sorting them in increasing order of their timestamps • repeating this for every object • Examples: • Web pages visited by a user (object): • {<Homepage>, <Electronics>, <Cameras and Camcorders>, <Digital Cameras>, …, <Shopping Cart>, <Order Confirmation>}, {….} • Transactions made by a customer over a period of time: • {t1, t18, t500, t721}, {t11, t38, t43, t621, t3005}
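
As a minimal sketch (not part of the original slides), the grouping-and-sorting described above might look like this in Python; the record fields object_id, timestamp, and event are hypothetical names chosen for illustration:

    from collections import defaultdict

    def build_sequences(records):
        """Group records by object and sort each group by increasing timestamp."""
        groups = defaultdict(list)
        for r in records:
            groups[r["object_id"]].append(r)   # connect records of the same object
        return {
            obj: [r["event"] for r in sorted(rs, key=lambda x: x["timestamp"])]
            for obj, rs in groups.items()      # sort each object's records by timestamp
        }

    # Example: page visits by two users
    visits = [
        {"object_id": "u1", "timestamp": 3, "event": "Shopping Cart"},
        {"object_id": "u1", "timestamp": 1, "event": "Homepage"},
        {"object_id": "u2", "timestamp": 2, "event": "Electronics"},
    ]
    print(build_sequences(visits))
    # {'u1': ['Homepage', 'Shopping Cart'], 'u2': ['Electronics']}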

  9. Graph Data • Data structure represented by nodes (entities) and edges (relationships) • Examples: • Protein subsequences • Web pages and links • [Diagram: example graph with nodes a, b, c, d, e]
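
As a small illustrative sketch (not from the slides), graph data is often stored as an adjacency list; the nodes a–e follow the slide's example, but the edges below are hypothetical since the original figure is not reproduced here:

    # Adjacency-list representation: nodes are entities, each list holds the related entities (edges)
    graph = {
        "a": ["b", "e"],
        "b": ["a", "c"],
        "c": ["b", "d"],
        "d": ["c"],
        "e": ["a"],
    }
    print(graph["a"])   # entities directly linked to node 'a': ['b', 'e']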

  10. Attribute Types • Nominal: Differentiates between values based on names • Gender, eye color, patient ID • Ordinal: Allows a rank order for sorting data but does not describe the degree of difference • {low, medium, high}, grades {A, B, C, D, F} • Interval: Describes the degree of difference between values • Dates, temperatures in °C and °F • Ratio: Both the degree of difference and the ratio are meaningful • Temperatures in K, lengths, age, mass, …

  11. Attribute Properties • Distinctness: = and ≠ • Order: <, ≤, ≥, and > • Addition: + and - • Multiplication: * and /

  12. Transformations

  13. Discrete and Continuous Attributes • Discrete Attributes: • Finite or countably infinite set of values • Categorical (zipcode, empIDs) or numeric (counts) • Often represented as integers • Special Case: binary attributes (yes/no, true/false, 0/1) • Continuous Attributes: • Real numbers • Examples: temperatures, height, weight, … • Practically, can be measured with limited precision

  14. Asymmetric Attributes • Only the presence of the attribute is considered important • Can be binary, discrete, or continuous • Examples: • Courses taken by students • Items purchased by customers • Words in a document

  15. Handling non-record data • Most data mining algorithms are designed for record data • Non-record data can be transformed into record data • Example: • Input: chemical structure data • Expert knowledge: common substructures • Transformation: for each compound, add a record with the common substructures as binary attributes
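
A minimal sketch of this transformation (not from the original slides), assuming each compound is described by the set of substructures it contains and that the list of common substructures comes from the domain expert:

    def to_record_data(compounds, common_substructures):
        """One record per compound, with a binary attribute per common substructure."""
        records = []
        for name, substructures in compounds.items():
            row = {"compound": name}
            for s in common_substructures:
                row[s] = int(s in substructures)   # 1 if the substructure occurs, else 0
            records.append(row)
        return records

    compounds = {"c1": {"benzene", "hydroxyl"}, "c2": {"hydroxyl"}}
    print(to_record_data(compounds, ["benzene", "hydroxyl", "carboxyl"]))
    # [{'compound': 'c1', 'benzene': 1, 'hydroxyl': 1, 'carboxyl': 0},
    #  {'compound': 'c2', 'benzene': 0, 'hydroxyl': 1, 'carboxyl': 0}]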

  16. Handling non-record data • The transformation may lose some information • Example: spatio-temporal data: a time series for each location on a spatial grid • Are values at times T1 and T2 independent? At locations L1 and L2? • The record transformation does not capture time/location relationships • [Diagram: spatial grid with a time series recorded at each location]

  17. Data Quality Issues • Data collected for purposes other than data mining • May not be able to fix quality issues at the source • To improve quality of data mining results: • Detect and correct the data quality issues • Use algorithms that tolerate poor data quality

  18. Data Quality Issues • Caused by human errors, limitations of measuring devices, flaws in the data collection process, duplicates, missing data, inconsistent data • Measurement error: the recorded value differs from the true value • Data collection error: an attribute/object is omitted, or an object is inappropriately included

  19. Noise • Random component of a measurement error • Distortion of the value or addition of spurious objects

  20. Artifacts • Deterministic distortion of the values • Example: a streak in the same place in a set of photographs

  21. Outliers • Data objects with characteristics different from those of most other data objects • Attribute values that are unusual with respect to the typical values • May be legitimate objects or values • Applications: intrusion detection, fraud detection

  22. Missing Values • Not collected • Not applicable

  24. Missing Values • Eliminate data objects or attributes: • Eliminating too many objects makes the analysis unreliable • Eliminated attributes may be critical to the analysis • Estimate missing values: • Regression • Average of the nearest neighbors (if continuous) • Mean of the samples in the same class (if continuous) • Most common attribute value (if categorical)
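
A hedged sketch of two of the estimation strategies listed above (class mean for continuous attributes, most common value for categorical ones); representing missing entries as None is an assumption made for illustration:

    from collections import Counter
    from statistics import mean

    def impute_class_mean(values, classes):
        """Fill missing continuous values with the mean of observed values in the same class."""
        by_class = {}
        for v, c in zip(values, classes):
            if v is not None:
                by_class.setdefault(c, []).append(v)
        means = {c: mean(vs) for c, vs in by_class.items()}
        return [means[c] if v is None else v for v, c in zip(values, classes)]

    def impute_most_common(values):
        """Fill missing categorical values with the most common observed value."""
        fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
        return [fill if v is None else v for v in values]

    print(impute_class_mean([180, None, 160, None], ["m", "m", "f", "f"]))  # [180, 180, 160, 160]
    print(impute_most_common(["red", None, "red", "blue"]))                 # ['red', 'red', 'red', 'blue']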

  25. Inapplicable Values • Categorical attributes: introduce a special value ‘N/A’ as a new category • Continuous attributes: introduce a special value outside the domain, for example -1 for sales commission

  26. Inconsistent Data • Inconsistent values: • Entry errors, reading errors, multiple sources • Negative values for weight • Zip code and city not matching • Correction: requires information from an external source or redundant information

  27. Duplicate Data • Deduplication: the process of dealing with duplicate data issues • Identify duplicate records • Resolve inconsistencies within duplicate records

  28. Deduplication Methods • Data preparation step: data transformation and standardization • Transform attributes from one data type or form to another, rename fields, standardize units, … • Distance based: • Edit distance: how many edits (insert/delete/replace) are needed to transform one string into another • Affine gap distance: edits also include open gap and extend gap, useful for truncated strings such as “John R. Smith” vs. “Johnathan Richard Smith” • Smith-Waterman distance: gaps at the beginning and end have smaller costs than gaps in the middle • * Elmagarmid et al., “Duplicate Record Detection: A Survey”, TKDE 2007
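
For the edit distance mentioned above, here is a minimal dynamic-programming sketch (plain Levenshtein distance; the affine gap and Smith-Waterman variants are not shown):

    def edit_distance(a, b):
        """Minimum number of insert/delete/replace edits to transform string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # replace (free if characters match)
            prev = curr
        return prev[-1]

    print(edit_distance("Smith", "Smyth"))      # 1 (replace 'i' with 'y')
    print(edit_distance("John", "Johnathan"))   # 5 (insert 'a', 't', 'h', 'a', 'n')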

  29. Deduplication Methods • Clustering based: extends distance/similarity-based methods beyond pairwise comparison

  30. Deduplication Methods • Classification based: • Create a training set of duplicate and non-duplicate pairs of records • Train a classifier to distinguish duplicate from non-duplicate pairs • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks/Cole, 1984 • R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th Int’l Conference on Very Large Data Bases (VLDB), Santiago, Chile, September 1994 • * Sarawagi and Bhamidipaty, “Interactive Deduplication using Active Learning”, KDD 2002
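
A hedged sketch of the classification-based approach, assuming scikit-learn is available; the library and the pair features (an edit distance between names and a zip-code match flag) are illustrative choices, not part of the original slides:

    from sklearn.linear_model import LogisticRegression

    # Each training example describes a PAIR of records: [name_edit_distance, zip_match]
    X_train = [[1, 1], [12, 0], [2, 1], [15, 0]]
    y_train = [1, 0, 1, 0]                      # 1 = duplicate pair, 0 = non-duplicate pair

    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.predict([[3, 1]]))                # a similar pair, likely predicted as a duplicate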

  31. Deduplication Methods • Crowdsourcing based • Uses a combination of human workers and machines to identify duplicate records • Given n records, there are n(n-1)/2 pairs to check for duplicates • Apply a classification approach first to filter out pairs that, with high probability, are non-duplicates; use human workers to label the remaining (uncertain) pairs • Examples of crowdsourcing services include Amazon Mechanical Turk and CrowdFlower • * Wang et al., “CrowdER: Crowdsourcing Entity Resolution”, VLDB 2012

  32. Application Related Issues • Timeliness: • Some data ages quickly after it is collected • Example: purchasing behavior of online shoppers • Patterns/models based on expired data are out of date • Relevance: • The data must contain information relevant to the application • Knowledge about the data: • Correlated attributes • Special values (example: 9999 for unavailable) • Attribute types (example: categorical {1, 2, 3})

  33. Data Preprocessing • Aggregation: combine multiple objects into one • Reduce number of objects: less memory, less processing time • Quantitative attributes are combined using sum or average • Qualitative attributes are omitted or listed as a set • Sampling: • select a subset of the objects to be analyzed • Discretization: • transform values into categorical form • Dimensionality reduction • Variable transformation
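
A small sketch of the aggregation step (not from the original slides): per-transaction records are combined into one record per store, summing the quantitative attribute and keeping the qualitative attribute as a set:

    from collections import defaultdict

    transactions = [
        {"store": "S1", "item": "bread", "sales": 10.0},
        {"store": "S1", "item": "milk",  "sales": 4.5},
        {"store": "S2", "item": "bread", "sales": 7.0},
    ]

    aggregated = defaultdict(lambda: {"sales": 0.0, "items": set()})
    for t in transactions:
        agg = aggregated[t["store"]]
        agg["sales"] += t["sales"]     # quantitative attribute: combine using sum
        agg["items"].add(t["item"])    # qualitative attribute: keep as a set
    print(dict(aggregated))
    # e.g. {'S1': {'sales': 14.5, 'items': {'bread', 'milk'}}, 'S2': {'sales': 7.0, 'items': {'bread'}}}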

  34. The Curse of Dimensionality • Data sets with a large number of features • Document-term matrix • Time series of daily closing stock prices

  35. The Curse of Dimensionality • Data analysis becomes harder as the dimensionality of data increases • As dimensionality increases, data becomes increasingly sparse • Classification: not enough data to create a reliable model • Clustering: distances between points less meaningful

  36. Why Feature Reduction? • Less data, so the data mining algorithm learns faster • Higher accuracy, so the model generalizes better • Simpler results, so they are easier to understand • Fewer features to collect, improving the data collection process • Techniques: • Feature selection • Feature composition/creation

  37. Feature Selection • Select a subset of the existing attributes • Redundant features: contain duplicate information • (price, tax amount) • Irrelevant features: contain no useful information • Phone number, employee ID • Exhaustive enumeration is not practically feasible • Approaches: • Embedded: done as part of the data mining algorithm • Filter: done before the data mining algorithm is run • Wrapper: done by trying different subsets of attributes with the data mining algorithm

  38. Feature Selection • General approach: • Generate a subset of features • Evaluate it: • using an evaluation function (filter approaches) • using the data mining algorithm (wrapper approaches) • If good enough, stop • Otherwise, update the subset and repeat • Needed: a subset generation strategy, an evaluation measure, and a stopping criterion
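
A minimal sketch of a wrapper-style version of this loop using greedy forward selection; the evaluate function is a hypothetical placeholder for, e.g., cross-validated accuracy of the data mining algorithm on the candidate subset:

    def forward_selection(features, evaluate):
        """Grow the feature subset one feature at a time while the evaluation score improves."""
        selected, best_score = [], float("-inf")
        while True:
            candidates = [f for f in features if f not in selected]
            if not candidates:
                return selected
            # Generate candidate subsets (current subset + one new feature) and evaluate each
            score, best_f = max((evaluate(selected + [f]), f) for f in candidates)
            if score <= best_score:                 # stopping criterion: no improvement
                return selected
            selected.append(best_f)
            best_score = score

    # Toy evaluation function: pretend only 'age' and 'income' carry signal
    toy_scores = {frozenset(): 0.5, frozenset({"age"}): 0.7,
                  frozenset({"income"}): 0.65, frozenset({"age", "income"}): 0.8}
    evaluate = lambda subset: toy_scores.get(frozenset(subset), 0.6)
    print(forward_selection(["age", "income", "phone"], evaluate))   # ['age', 'income']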

  39. Principal Component Analysis • Identifies strong patterns in the data • Expresses data in a way that highlights differences and similarities • Reduces the number of dimensions without much loss of information

  40. Principal Component Analysis • The first dimension is chosen to capture as much of the variability of the data as possible • The second dimension is chosen orthogonal to the first so that it captures as much of the remaining variability as possible, and so on … • Each pair of new attributes has covariance 0 • The attributes are ordered in decreasing order of the variance in the data that they capture • Each attribute: • accounts for the maximum variance not accounted for by the preceding attributes • is uncorrelated with the preceding attributes

  41. PCA – Step 1 • Subtract the mean from each of the data dimensions • Obtain a new data set DM • This produces data with mean 0 • [Figure: original data matrix, vector of attribute means, and the mean-adjusted data matrix DM]

  42. PCA – Step 2 • Calculate the covariance matrix for DM: cov(DM)
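
As a hedged note on this step (assuming DM holds one object per row and one attribute per column, with n objects): the sample covariance matrix is cov(DM) = (1 / (n - 1)) · DMᵀ · DM; entry (i, j) is the covariance between attributes i and j, and the diagonal entries are the attribute variances.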

  43. PCA – Step 3 • Calculate the unit eigenvectors and eigenvalues of the covariance matrix • That gives us lines that characterize the data

  44. PCA – Step 4 • Choose components: • The eigenvector with the largest eigenvalue is the principal component of the data set • Order the eigenvectors by eigenvalue from largest to smallest; this gives the components in order of significance • Select the first p eigenvectors; the final data will have p dimensions • Construct the feature vector Af = (eigv1 eigv2 … eigvp)

  45. PCA – Step 5 • Derive the new data set: • DataNew = AfT × DM (the transpose of the feature vector applied to the mean-adjusted data) • Expresses the original data solely in terms of the vectors chosen
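
A compact sketch tying steps 1–5 together, assuming NumPy is available and that the data matrix has one object per row and one attribute per column (so the projection is DM · Af, the transpose of the slide's AfT × DM):

    import numpy as np

    def pca(data, p):
        """Return (projected data, feature vector Af) keeping the first p principal components."""
        dm = data - data.mean(axis=0)              # Step 1: subtract the mean of each attribute
        cov = np.cov(dm, rowvar=False)             # Step 2: covariance matrix of DM
        eigvals, eigvecs = np.linalg.eigh(cov)     # Step 3: eigenvalues and unit eigenvectors
        order = np.argsort(eigvals)[::-1]          # Step 4: order components by decreasing eigenvalue
        a_f = eigvecs[:, order[:p]]                #         feature vector: first p eigenvectors as columns
        return dm @ a_f, a_f                       # Step 5: express the data in terms of the chosen vectors

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
    new_data, a_f = pca(data, p=1)
    print(new_data.shape)   # (5, 1): same five objects, reduced to one dimension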

  46. PCA Assumptions and Characteristics • Linearity: frames the problem as a change of basis • Large variances correspond to important structure • Effective: most of the variability is captured by a small fraction of the dimensions • Principal components are orthogonal • Noise (low-variance directions) becomes weaker

  47. Summary • Data quality affects quality of knowledge discovered • Preprocessing needed: • Improve quality • Transform into suitable form • Issues: • Missing values: not available / not applicable • Inconsistent values • Noise • Duplicate data • Inapplicable form requiring transformation • High dimensionality: PCA
