Data Mining: Concepts and Techniques

Chapter 2

About the Data
• General data characteristics
• Basic data description and exploration
• Measuring data similarity

What is Data?
• A data set is a collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  • Examples: eye color of a person, temperature, etc.
  • An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  • An object is also known as a record, point, case, sample, entity, or instance

Important Characteristics of Structured Data
• Dimensionality
  • Curse of dimensionality
• Sparsity
  • Only presence counts
• Resolution
  • Patterns depend on the scale
• Similarity
  • Distance measure

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  • The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  • Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But the properties of the attribute values can differ: ID has no limit, but age has a maximum and a minimum value

Types of Attribute Values
• Nominal
  • E.g., profession, ID numbers, eye color, zip codes
• Ordinal
  • E.g., rankings (e.g., army, professions), grades, height in {tall, medium, short}
• Binary
  • E.g., medical test (positive vs. negative)
• Interval
  • E.g., calendar dates, body temperatures
• Ratio
  • E.g., temperature in Kelvin, length, time, counts

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  • Distinctness: =, ≠
  • Order: <, >
  • Addition: +, -
  • Multiplication: *, /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness and order
• Interval attribute: distinctness, order, and addition
• Ratio attribute: all four properties

Attribute Types: Description, Examples, Operations
• Nominal
  • Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  • Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  • Operations: mode, entropy, contingency correlation, χ² test
• Ordinal
  • Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  • Examples: hardness of minerals, {good, better, best}, grades, street numbers
  • Operations: median, percentiles, rank correlation, run tests, sign tests
• Interval
  • Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  • Examples: calendar dates, temperature in Celsius or Fahrenheit
  • Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Ratio
  • Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  • Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  • Operations: geometric mean, harmonic mean, percent variation

Discrete vs. Continuous Attributes
• Discrete attribute
  • Has only a finite or countably infinite set of values
  • E.g., zip codes, profession, or the set of words in a collection of documents
  • Sometimes represented as integer variables
  • Note: binary attributes are a special case of discrete attributes
• Continuous attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • In practice, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables

Types of Data Sets
• Record
  • Data matrix
  • Document data
  • Transaction data
• Graph
  • World Wide Web
  • Molecular structures
• Ordered
  • Spatial data
  • Temporal data
  • Sequential data
  • Genetic sequence data


Record Data
• Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
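The m by n layout can be illustrated with a minimal sketch. The attribute names and values below are hypothetical, and a plain list of lists stands in for whatever matrix type a real tool would use.

```python
# Hypothetical data: m = 3 objects (rows), n = 2 numeric attributes (columns).
attributes = ["height_cm", "weight_kg"]
data_matrix = [
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
]

m = len(data_matrix)      # number of objects
n = len(data_matrix[0])   # number of attributes


def column(matrix, j):
    """Return all values of attribute j, one per object."""
    return [row[j] for row in matrix]


print(m, n, column(data_matrix, attributes.index("weight_kg")))
```

Each row is one point in n-dimensional space; reading a column gives the values of one attribute across all objects.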

Document Data
• Each document becomes a "term" vector
  • Each term is a component (attribute) of the vector
  • The value of each component is the number of times the corresponding term occurs in the document
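A term vector can be built with a simple word count. The sketch below assumes whitespace tokenization and lowercasing, which are simplifications; the function name `term_vectors` and the sample documents are hypothetical.

```python
from collections import Counter


def term_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = term_vectors(["data mining data", "mining text"])
print(vocabulary)
print(vectors)
```

Most entries are zero for realistic collections, which is why document data is usually described as sparse.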

Transaction Data
• A special type of record data, where each record (transaction) involves a set of items
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
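One way to hold transaction data is a mapping from a transaction ID to a set of items. The IDs and items below are hypothetical, and `item_counts` is an illustrative helper, not part of any library.

```python
from collections import Counter

# Hypothetical grocery transactions: transaction ID -> set of purchased items.
transactions = {
    1: {"bread", "coke", "milk"},
    2: {"beer", "bread"},
    3: {"beer", "coke", "diaper", "milk"},
}


def item_counts(transactions):
    """Count in how many transactions each item appears."""
    counts = Counter()
    for items in transactions.values():
        counts.update(items)
    return counts


support = item_counts(transactions)
print(support.most_common())
```

Unlike a data matrix row, each transaction can hold a different number of items.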

Graph Data
• Examples: generic graph and HTML links

Chemical Data
• Benzene molecule: C6H6

Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Ordered Data
• Genomic sequence data

Ordered Data
• Spatio-temporal data
[Figure: average monthly temperature of land and ocean]

• General data characteristics
• Basic data description and exploration
• Measuring data similarity

Mining Data Descriptive Characteristics
• Motivation
  • To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  • Median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  • Data dispersion: analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed cube

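Measuring the Central Tendency
• Mean (algebraic measure), sample vs. population: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\mu = \frac{\sum x}{N}$
  • Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  • Trimmed mean: chopping extreme values
• Median: a holistic measure
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation (for grouped data): $\text{median} \approx L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$, where $L_1$ is the lower boundary of the median class, $(\sum f)_l$ is the sum of the frequencies of all classes below it, $f_{\text{median}}$ is the frequency of the median class, and $c$ is the class width
• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula (moderately skewed, unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$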
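These measures can be tried out with Python's `statistics` module. The sketch below is illustrative only: the salary values are hypothetical, and `weighted_mean` and `trimmed_mean` are helper names introduced here, with the trimmed mean assumed to drop the same fraction of values from each end.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


# Hypothetical salaries (in thousands); 110 is an extreme value.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)
median = statistics.median(salaries)    # average of the two middle values
modes = statistics.multimode(salaries)  # two modes, so the data is bimodal
trimmed = trimmed_mean(salaries, 0.1)   # drops 30 and 110

print(mean, median, modes, trimmed)
```

The trimmed mean sits closer to the median than the plain mean does, because the single extreme value no longer pulls it upward.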

Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three panels: symmetric, positively skewed, negatively skewed]

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five-number summary: min, Q1, M, Q3, max
  • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  • Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ for a sample, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ for a population
  • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
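The five-number summary and the 1.5 x IQR rule can be sketched as follows. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the data and helper names are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # hypothetical data with one extreme value
summary = five_number_summary(data)
outliers = iqr_outliers(data)

sample_variance = statistics.variance(data)       # divides by n - 1
population_variance = statistics.pvariance(data)  # divides by N

print(summary, outliers, sample_variance, population_variance)
```

The sample variance is always at least as large as the population variance on the same values, because it divides by n - 1 rather than N.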

Boxplot Analysis
• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to Minimum and Maximum

Histogram Analysis
• Graph displays of basic statistical class descriptions
  • Frequency histograms
    • A univariate graphical method
    • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
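The counts behind a frequency histogram can be computed with a few lines. This is a sketch that assumes equal-width classes spanning the observed range, which is only one possible binning; `histogram` is a hypothetical helper name.

```python
def histogram(values, num_bins):
    """Return (edges, counts) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; equal-width bins are undefined")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value falls in the last bin rather than past it.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    edges = [lo + k * width for k in range(num_bins + 1)]
    return edges, counts


edges, counts = histogram([1, 2, 2, 3, 5, 8, 9, 10], num_bins=3)
print(edges, counts)
```

Each count is the height of one rectangle; the edges give the class boundaries on the horizontal axis.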

Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
  • The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  • For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
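The points of a quantile plot can be generated as below. The slide does not give a formula for f_i; this sketch assumes the common choice f_i = (i - 0.5) / n, and `quantile_plot_points` is a hypothetical helper name.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for the sorted data, with f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([30, 10, 20, 40])
print(points)
```

Plotting f_i on the horizontal axis against x_i on the vertical axis shows every observation, so unusual values remain visible.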

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Allows the user to view whether there is a shift in going from one distribution to another

Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Positively and Negatively Correlated Data
• The left half of the figure is positively correlated
• The right half is negatively correlated
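The sign of the correlation can also be checked numerically. The sketch below uses Pearson's correlation coefficient, which is one common measure and not necessarily the one behind the figure; the data values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


positive = pearson([1, 2, 3], [2, 4, 6])                     # y rises with x
negative = pearson([1, 2, 3], [6, 4, 2])                     # y falls as x rises
uncorrelated = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])   # y = x squared

print(positive, negative, uncorrelated)
```

The last case has a clear dependence between x and y but no linear correlation, so a coefficient near zero does not mean the attributes are unrelated.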

Not Correlated Data

Data Visualization and Its Methods
• Why data visualization?
  • Gain insight into an information space by mapping data onto graphical primitives
  • Provide a qualitative overview of large data sets
  • Search for patterns, trends, structure, irregularities, and relationships among data
  • Help find interesting regions and suitable parameters for further quantitative analysis
  • Provide a visual proof of computer representations derived
• Typical visualization methods:
  • Geometric techniques
  • Icon-based techniques
  • Hierarchical techniques

Geometric Techniques
• Visualization of geometric transformations and projections of the data
• Methods
  • Landscapes
  • Projection pursuit technique
    • Finding meaningful projections of multidimensional data
  • Scatterplot matrices
  • Prosection views
  • Hyperslice
  • Parallel coordinates

Scatterplot Matrices
• Matrix of scatterplots (x-y diagrams) of the k-dimensional data
[Figure: used by permission of M. Ward, Worcester Polytechnic Institute]

Landscapes
• Visualization of the data as a perspective landscape
• The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

Parallel Coordinates
• n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
• The axes are scaled to the [minimum, maximum] range of the corresponding attribute
• Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
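The geometry of a parallel-coordinates display can be sketched without any plotting library. The code below assumes axes numbered 0 to n - 1 and positions normalized to [0, 1] along each axis; the function name and the choice of 0.5 for a constant attribute are assumptions made here.

```python
def parallel_coordinates(rows):
    """Return one polyline per data item as a list of (axis_index, position)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, (value, lo, hi) in enumerate(zip(row, lows, highs)):
            # Scale each attribute to its own [minimum, maximum] range.
            position = (value - lo) / (hi - lo) if hi > lo else 0.5
            line.append((axis, position))
        polylines.append(line)
    return polylines


lines = parallel_coordinates([[1, 10], [3, 30], [2, 20]])
print(lines)
```

Drawing straight segments between consecutive (axis, position) pairs gives the polygonal line for each data item.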

Parallel Coordinates of a Data Set

Icon-based Techniques
• Visualization of the data values as features of icons
• Methods:
  • Chernoff faces
  • Stick figures
  • Shape coding
  • Color icons
  • TileBars: the use of small icons representing the relevance feature vectors in document retrieval

Chernoff Faces
• A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
• The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
• References:
  • Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
  • Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html

Hierarchical Techniques
• Visualization of the data using a hierarchical partitioning into subspaces
• Methods
  • Dimensional stacking
  • Worlds-within-Worlds
  • Treemap
  • Cone trees
  • InfoCube

Tree-Map
• Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
• The x- and y-dimensions of the screen are partitioned alternately according to the attribute values (classes)
[Figure: MSR Netscan image]
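The alternating partitioning can be sketched as a short recursion. This is a minimal "slice-and-dice" layout under stated assumptions: a node is a hypothetical `(name, payload)` tuple whose payload is either a positive size (leaf) or a list of child nodes, and each child receives a share of the rectangle proportional to its total size.

```python
def size(node):
    """Total size of a node: its own size for a leaf, else the sum of its children."""
    _, payload = node
    if isinstance(payload, (int, float)):
        return payload
    return sum(size(child) for child in payload)


def slice_and_dice(node, x, y, w, h, split_x=True):
    """Return a list of (name, x, y, width, height) rectangles for the leaves."""
    name, payload = node
    if isinstance(payload, (int, float)):
        return [(name, x, y, w, h)]
    rects = []
    total = size(node)
    offset = 0.0
    for child in payload:
        share = size(child) / total
        if split_x:   # partition along the x-dimension at this level
            rects += slice_and_dice(child, x + offset * w, y, share * w, h, False)
        else:         # partition along the y-dimension at the next level
            rects += slice_and_dice(child, x, y + offset * h, w, share * h, True)
        offset += share
    return rects


# Hypothetical file system: a file of size 2 and a directory holding two files of size 1.
tree = ("root", [("a", 2), ("dir", [("b", 1), ("c", 1)])])
rects = slice_and_dice(tree, 0, 0, 4, 2)
print(rects)
```

The leaf rectangles tile the whole 4 by 2 screen area, with each area proportional to the leaf size.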

Tree-Map of a File System (Shneiderman)

• General data characteristics
• Basic data description and exploration
• Measuring data similarity (Sec. 7.2)

Similarity and Dissimilarity
• Similarity
  • Numerical measure of how alike two data objects are
  • Value is higher when objects are more alike
  • Often falls in the range [0,1]
• Dissimilarity (i.e., distance)
  • Numerical measure of how different two data objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
• Proximity refers to either a similarity or a dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
  • n data points with p dimensions
  • Two modes
• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode

Example: Data Matrix and Distance Matrix
[Figure: a data matrix and the corresponding distance matrix (i.e., dissimilarity matrix) for Euclidean distance]
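Since the figure is not reproduced here, the following sketch shows how such a distance matrix can be derived from a data matrix. The four two-dimensional points are hypothetical, and only the lower triangle (with the zero diagonal) is stored because the distance is symmetric. `math.dist` requires Python 3.8 or later.

```python
import math


def dissimilarity_matrix(points):
    """Lower-triangular matrix of Euclidean distances; row i holds d(i, 0..i)."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]


# Hypothetical data matrix: n = 4 points with p = 2 dimensions.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
distances = dissimilarity_matrix(points)

for row in distances:
    print([round(d, 3) for d in row])
```

The data matrix has two modes (objects by attributes), while the result has a single mode (objects by objects).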

Minkowski Distance
• Minkowski distance: a popular distance measure, $d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$, where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is the order
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
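A direct implementation of the Minkowski distance is short. The sketch below assumes objects are given as equal-length sequences of numbers and that the order q is at least 1; the function name `minkowski` is introduced here for illustration.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("this sketch assumes an order q of at least 1")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)


a, b, c = (0, 0), (3, 4), (6, 0)

manhattan = minkowski(a, b, 1)   # q = 1: sum of absolute differences
euclidean = minkowski(a, b, 2)   # q = 2: straight-line distance

print(manhattan, euclidean)
print(minkowski(a, c, 2) <= minkowski(a, b, 2) + minkowski(b, c, 2))
```

The last line checks the triangle inequality on one triple of points; it is a spot check, not a proof of the metric properties.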
