Data Mining: Concepts and Techniques — Chapter 2 —

  1. Data Mining: Concepts and Techniques (Chapter 2)

  2. What is about Data? • General data characteristics • Basic data description and exploration • Measuring data similarity

  3. What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples: eye color of a person, temperature, etc. • An attribute is also known as a variable, field, characteristic, or feature • A collection of attributes describes an object • An object is also known as a record, point, case, sample, entity, or instance

  4. Important Characteristics of Structured Data • Dimensionality: curse of dimensionality • Sparsity: only presence counts • Resolution: patterns depend on the scale • Similarity: distance measure

  5. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values • Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different • ID has no limit but age has a maximum and minimum value

  6. Types of Attribute Values • Nominal • E.g., profession, ID numbers, eye color, zip codes • Ordinal • E.g., rankings (e.g., army, professions), grades, height in {tall, medium, short} • Binary • E.g., medical test (positive vs. negative) • Interval • E.g., calendar dates, body temperatures • Ratio • E.g., temperature in Kelvin, length, time, counts

  7. Properties of Attribute Values • The type of an attribute depends on which of the following properties it possesses: • Distinctness: = ≠ • Order: < > • Addition: + − • Multiplication: * / • Nominal attribute: distinctness • Ordinal attribute: distinctness & order • Interval attribute: distinctness, order & addition • Ratio attribute: all 4 properties

  8. Attribute Types: Description, Examples, Operations • Nominal (=, ≠): The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test • Ordinal (<, >): The values of an ordinal attribute provide enough information to order objects. Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests • Interval (+, −): For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests • Ratio (*, /): For ratio variables, both differences and ratios are meaningful. Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation
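
As a rough illustration of how the attribute type limits the summaries that make sense, here is a minimal Python sketch using only the standard library. The sample values and variable names are invented for this example and do not come from the slides.

```python
from statistics import geometric_mean, mean, median, mode

# Hypothetical sample values, invented for illustration only.
eye_color = ["brown", "blue", "brown", "green"]  # nominal
grade_rank = [1, 3, 2, 3, 3]                     # ordinal, coded so that 1 < 2 < 3
temp_celsius = [20.0, 22.0, 24.0]                # interval
mass_kg = [1.0, 4.0, 16.0]                       # ratio

nominal_summary = mode(eye_color)         # needs only distinctness (=, !=)
ordinal_summary = median(grade_rank)      # needs order (<, >)
interval_summary = mean(temp_celsius)     # needs meaningful differences (+, -)
ratio_summary = geometric_mean(mass_kg)   # needs meaningful ratios (*, /)
```

Taking the mean of the eye-color labels, by contrast, would not even be defined, which is the point of the classification.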

  9. Discrete vs. Continuous Attributes • Discrete Attribute • Has only a finite or countably infinite set of values • E.g., zip codes, profession, or the set of words in a collection of documents • Sometimes, represented as integer variables • Note: Binary attributes are a special case of discrete attributes • Continuous Attribute • Has real numbers as attribute values • Examples: temperature, height, or weight • Practically, real values can only be measured and represented using a finite number of digits • Continuous attributes are typically represented as floating-point variables

  10. Types of data sets • Record: data matrix, document data, transaction data • Graph: World Wide Web, molecular structures • Ordered: spatial data, temporal data, sequential data, genetic sequence data

  11. Important Characteristics of Structured Data • Dimensionality: curse of dimensionality • Sparsity: only presence counts • Resolution: patterns depend on the scale

  12. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

  13. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
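
A minimal sketch of the m by n layout, assuming a plain list-of-lists representation. The attribute names and values are hypothetical.

```python
# Hypothetical objects that share the same fixed set of numeric attributes.
attributes = ["height_m", "weight_kg", "age_years"]
objects = [
    {"height_m": 1.70, "weight_kg": 65.0, "age_years": 30},
    {"height_m": 1.82, "weight_kg": 80.0, "age_years": 45},
]

# m-by-n data matrix: one row per object, one column per attribute.
data_matrix = [[obj[name] for name in attributes] for obj in objects]
m, n = len(data_matrix), len(data_matrix[0])
```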

  14. Document Data • Each document becomes a 'term' vector • Each term is a component (attribute) of the vector • The value of each component is the number of times the corresponding term occurs in the document
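
The term-vector idea can be sketched as follows. The whitespace tokenizer, the alphabetical vocabulary order, and the two sample documents are simplifying assumptions made for this example.

```python
from collections import Counter

def term_vectors(documents):
    """Return (vocabulary, vectors); each vector holds one count per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical documents, invented for illustration.
vocab, vecs = term_vectors(["team play team score", "score game win"])
```

Most components are zero once the vocabulary grows, which is why document data is usually stored sparsely.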

  15. Transaction Data • A special type of record data, where each record (transaction) involves a set of items • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items
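
One possible in-memory form for transaction data, assuming each transaction is a Python set of item names. The grocery items are invented for the example.

```python
from collections import Counter

# Hypothetical shopping trips: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "bread"},
]

# Number of transactions in which each item appears.
item_counts = Counter(item for basket in transactions for item in basket)
```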

  16. Graph Data • Examples: Generic graph and HTML Links

  17. Chemical Data • Benzene Molecule: C6H6

  18. Ordered Data • Sequences of transactions (figure labels: items/events; an element of the sequence)

  19. Ordered Data • Genomic sequence data

  20. Ordered Data • Spatio-temporal data (figure: average monthly temperature of land and ocean)

  21. General data characteristics • Basic data description and exploration • Measuring data similarity

  22. Mining Data Descriptive Characteristics • Motivation • To better understand the data: central tendency, variation and spread • Data dispersion characteristics • median, max, min, quantiles, outliers, variance, etc. • Numerical dimensions correspond to sorted intervals • Data dispersion: analyzed with multiple granularities of precision • Boxplot or quantile analysis on sorted intervals • Dispersion analysis on computed measures • Folding measures into numerical dimensions • Boxplot or quantile analysis on the transformed cube

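  23. Measuring the Central Tendency • Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population) • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ • Trimmed mean: chopping extreme values • Median: A holistic measure • Middle value if odd number of values, or average of the middle two values otherwise • Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below it, $f_{\text{median}}$ is its frequency, and $c$ is its width • Mode • Value that occurs most frequently in the data • Unimodal, bimodal, trimodal • Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$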
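
The measures on this slide can be tried out with Python's `statistics` module. This is a minimal sketch: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and the sample values are assumptions made for illustration, and the trimmed mean here drops the same fraction from each end, which is only one of several conventions.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Drop `fraction` of the values from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])

def grouped_median(lower, width, n, cum_below, freq):
    """Interpolated median: lower + ((n/2 - cum_below) / freq) * width."""
    return lower + (n / 2 - cum_below) / freq * width

# Illustrative values (e.g. salaries in thousands), not taken from the slide.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

center_mean = mean(salaries)
center_median = median(salaries)      # even count: average of the middle two
center_modes = multimode(salaries)    # two modes, so the data is bimodal
center_trimmed = trimmed_mean(salaries, 0.1)
```

The single large value pulls the mean above the median, while the trimmed mean sits between the two.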

  24. Symmetric vs. Skewed Data • Median, mean and mode of symmetric, positively and negatively skewed data (figure panels: symmetric, positively skewed, negatively skewed)

  25. Measuring the Dispersion of Data • Quartiles, outliers and boxplots • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Inter-quartile range: IQR = Q3 − Q1 • Five-number summary: min, Q1, M, Q3, max • Boxplot: ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1 • Variance and standard deviation (sample: s, population: σ) • Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population) • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
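
A minimal sketch of the five-number summary and the 1.5 × IQR outlier rule, using `statistics.quantiles`. Quartile conventions differ between textbooks and tools; the `"inclusive"` method is chosen here only as one reasonable assumption, and the sample values are invented.

```python
from statistics import pvariance, quantiles, variance

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Illustrative values, not taken from the slide.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = five_number_summary(salaries)
outliers = iqr_outliers(salaries)

sample = [2, 4, 4, 4, 5, 5, 7, 9]
population_var = pvariance(sample)  # divides by N
sample_var = variance(sample)       # divides by n - 1
```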

  26. Boxplot Analysis • Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum • Boxplot • Data is represented with a box • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR • The median is marked by a line within the box • Whiskers: two lines outside the box extend to Minimum and Maximum

  27. Histogram Analysis • Graph displays of basic statistical class descriptions • Frequency histograms • A univariate graphical method • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

  28. Histograms Often Tell More than Boxplots • The two histograms shown on the left may have the same boxplot representation • The same values for: min, Q1, median, Q3, max • But they have rather different data distributions

  29. Quantile Plot • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) • Plots quantile information • For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$% of the data are below or equal to the value $x_i$
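
The slide does not say how f_i is computed. A common convention, assumed here, is f_i = (i − 0.5) / n for the i-th smallest of n values; the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Each tuple is (f_i, x_i): plot f_i on the horizontal axis, x_i on the vertical.
points = quantile_plot_points([40, 10, 30, 20])
```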

  30. Quantile-Quantile (Q-Q) Plot • Graphs the quantiles of one univariate distribution against the corresponding quantiles of another • Allows the user to view whether there is a shift in going from one distribution to another
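
A minimal sketch of the pairing behind a Q-Q plot, under the simplifying assumption that both samples have the same size, so the i-th smallest value of one is matched with the i-th smallest of the other. Unequal sizes would need interpolated quantiles, which this sketch does not attempt.

```python
def qq_points(sample_a, sample_b):
    """Pair matching order statistics of two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(sample_a), sorted(sample_b)))

# Hypothetical samples: the second is the first shifted up by 3.
pairs = qq_points([3, 1, 2], [5, 4, 6])
shifts = [b - a for a, b in pairs]
```

Points lying consistently above the line y = x, as here, indicate that the second distribution is shifted upward relative to the first.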

  31. Scatter plot • Provides a first look at bivariate data to see clusters of points, outliers, etc. • Each pair of values is treated as a pair of coordinates and plotted as points in the plane

  32. Loess Curve • Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence • Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

  33. Positively and Negatively Correlated Data • The left half of the figure is positively correlated • The right half is negatively correlated
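
The slide shows correlation only visually. One way to put a number on it, assumed here, is Pearson's correlation coefficient; the data points are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly positive
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly negative
```

Values near +1 or −1 correspond to the two halves of the figure; values near 0 indicate no linear relationship.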

  34. Not Correlated Data

  35. Data Visualization and Its Methods • Why data visualization? • Gain insight into an information space by mapping data onto graphical primitives • Provide qualitative overview of large data sets • Search for patterns, trends, structure, irregularities, relationships among data • Help find interesting regions and suitable parameters for further quantitative analysis • Provide a visual proof of computer representations derived • Typical visualization methods: • Geometric techniques • Icon-based techniques • Hierarchical techniques

  36. Geometric Techniques • Visualization of geometric transformations and projections of the data • Methods • Landscapes • Projection pursuit technique • Finding meaningful projections of multidimensional data • Scatterplot matrices • Prosection views • Hyperslice • Parallel coordinates

  37. Scatterplot Matrices • Matrix of scatterplots (x-y diagrams) of the k-dimensional data • Used by permission of M. Ward, Worcester Polytechnic Institute

  38. Landscapes • Visualization of the data as perspective landscape • The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data • Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.

  39. Parallel Coordinates • n equidistant axes which are parallel to one of the screen axes and correspond to the attributes • The axes are scaled to the [minimum, maximum] range of the corresponding attribute • Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
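
The geometry described above can be sketched without any plotting library: scale each attribute to [0, 1] by its own minimum and maximum, then emit one polyline per data item. The function name, the vertex format, and the choice to place constant attributes at 0.5 are assumptions of this sketch.

```python
def parallel_coordinates(rows):
    """Return one polyline per row as a list of (axis_index, scaled_value)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, (value, low, high) in enumerate(zip(row, lows, highs)):
            # A constant attribute has no range; put it mid-axis.
            scaled = 0.5 if high == low else (value - low) / (high - low)
            line.append((axis, scaled))
        polylines.append(line)
    return polylines

# Hypothetical data: three items, two attributes.
lines = parallel_coordinates([(1.0, 10.0), (3.0, 20.0), (2.0, 40.0)])
```

Drawing each polyline across equally spaced vertical axes gives the parallel-coordinates view.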

  40. Parallel Coordinates of a Data Set

  41. Icon-based Techniques • Visualization of the data values as features of icons • Methods: • Chernoff Faces • Stick Figures • Shape Coding • Color Icons • TileBars: The use of small icons representing the relevance feature vectors in document retrieval

  42. Chernoff Faces • A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc. • The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson) • References: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993; Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html

  43. Hierarchical Techniques • Visualization of the data using a hierarchical partitioning into subspaces. • Methods • Dimensional Stacking • Worlds-within-Worlds • Treemap • Cone Trees • InfoCube

  44. Tree-Map • Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values • The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes) • Figure: MSR Netscan image
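
The alternating partition can be sketched as a "slice-and-dice" layout. This is a simplified illustration, not necessarily the layout used for the figures in the deck; the `(name, size, children)` tuple format and the sample tree are assumptions.

```python
def slice_and_dice(node, x, y, w, h, split_x=True, out=None):
    """Lay out a (name, size, children) tree inside the rectangle x, y, w, h.

    Leaves receive rectangles; the split direction alternates at each level.
    Returns a list of (name, x, y, width, height) tuples for the leaves.
    """
    if out is None:
        out = []
    name, _size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total
        if split_x:   # partition the x-dimension at this level
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:         # partition the y-dimension at the next level
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out

# Hypothetical hierarchy: sizes could be file sizes in a directory tree.
tree = ("root", 8, [
    ("a", 4, []),
    ("b", 4, [("b1", 1, []), ("b2", 3, [])]),
])
rects = slice_and_dice(tree, 0.0, 0.0, 8.0, 4.0)
```

Each leaf's area comes out proportional to its size, and the leaves together fill the whole rectangle.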

  45. Tree-Map of a File System (Shneiderman)

  46. General data characteristics • Basic data description and exploration • Measuring data similarity (Sec. 7.2)

  47. Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are • Value is higher when objects are more alike • Often falls in the range [0,1] • Dissimilarity (i.e., distance) • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies • Proximity refers to a similarity or dissimilarity

  48. Data Matrix and Dissimilarity Matrix • Data matrix • n data points with p dimensions • Two modes • Dissimilarity matrix • n data points, but registers only the distance • A triangular matrix • Single mode

  49. Example: Data Matrix and Distance Matrix • Data matrix • Distance matrix (i.e., dissimilarity matrix) for Euclidean distance
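
The slide's own example figure did not survive in the transcript. As a stand-in, here is a minimal sketch that derives a lower-triangular Euclidean distance matrix from a small data matrix; the three points are invented and are not the slide's data.

```python
from math import dist

def dissimilarity_matrix(points):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i)."""
    return [[dist(points[i], points[j]) for j in range(i + 1)]
            for i in range(len(points))]

# Hypothetical data matrix: n = 3 points with p = 2 dimensions.
points = [(0, 0), (3, 4), (6, 8)]
matrix = dissimilarity_matrix(points)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.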

  50. Minkowski Distance • Minkowski distance: A popular distance measure, $d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$, where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is the order • Properties • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness) • d(i, j) = d(j, i) (Symmetry) • d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality) • A distance that satisfies these properties is a metric
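
A minimal sketch of the Minkowski distance for order q ≥ 1, with a spot check of the three metric properties on invented points. Checking a few points illustrates the properties; it does not prove them.

```python
def minkowski(a, b, q):
    """Minkowski distance of order q >= 1 between equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    if q < 1:
        raise ValueError("order q must be at least 1")
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1 / q)

# Hypothetical 2-dimensional objects.
i, j, k = (0, 0), (3, 4), (6, 0)

manhattan = minkowski(i, j, 1)   # q = 1: city-block distance
euclidean = minkowski(i, j, 2)   # q = 2: straight-line distance
```

With q = 1 the measure is the Manhattan distance and with q = 2 the Euclidean distance.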
