
ICS 278: Data Mining Exploratory Data Analysis and Visualization



Presentation Transcript


  1. ICS 278: Data Mining: Exploratory Data Analysis and Visualization. Data Mining Lectures, Lecture on EDA and Visualization. Padhraic Smyth, UC Irvine

  2. Lecture 4 • Project proposals due next Thursday • Today’s lecture • Exploratory Data Analysis and Visualization • Summary statistics • 1- and 2-dimensional data visualization • Higher-dimensional visualization • Examples of complex visualization • Reading: Chapter 3 in the text • Note: slides at the end of the last lecture on covariance, Mahalanobis distance, etc., will be revisited in the next lecture

  3. Exploratory Data Analysis (EDA) • get a general sense of the data • interactive and visual • (cleverly/creatively) exploit human visual power to see patterns • 1 to 5 dimensions (e.g. spatial, color, time, sound) • e.g. plot raw data/statistics, reduce dimensions as needed • data-driven (model-free) • especially useful in early stages of data mining • detect outliers (e.g. assess data quality) • test assumptions (e.g. normal distributions or skewed?) • identify useful raw data & transforms (e.g. log(x)) • http://www.itl.nist.gov/div898/handbook/eda/eda.htm • Bottom line: it is always well worth looking at your data!

  4. Summary Statistics • not visual • sample statistics of data X • mean: $\mu = \sum_i X_i / n$ { $\mu$ minimizes $\sum_i (X_i - \mu)^2$ } • mode: most common value in X • median: X = sort(X), median = $X_{n/2}$ (half below, half above) • quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$ • interquartile range: value(Q3) - value(Q1) • range: max(X) - min(X) = $X_n - X_1$ (for sorted X) • variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$ • skewness: $\sum_i (X_i - \mu)^3 / [\sum_i (X_i - \mu)^2]^{3/2}$ • zero if symmetric; right-skewed more common (e.g. you v. Bill Gates) • number of distinct values for a variable (see unique.m in MATLAB) • Note: all of these are estimates based on the sample at hand; they may be different from the “true” values (e.g., median age in US).
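The statistics on this slide can be sketched in a few lines of Python (chosen here for illustration; the slides themselves use MATLAB). This is a minimal sketch, assuming Python 3.8+ and the standard library only. The function name `summary_statistics` and the sample values are made up, and the quartiles use the `statistics` module's "inclusive" interpolation, which is only one of several conventions and need not match the slide's $X_{0.25n}$ indexing exactly.

```python
import statistics

def summary_statistics(xs):
    """Sample statistics following the slide's definitions (divide by n, not n - 1)."""
    n = len(xs)
    s = sorted(xs)
    mu = sum(s) / n
    ss2 = sum((x - mu) ** 2 for x in s)   # sum of squared deviations
    ss3 = sum((x - mu) ** 3 for x in s)   # sum of cubed deviations
    # Assumption: "inclusive" quartiles; other packages may interpolate differently.
    q1, _, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return {
        "mean": mu,
        "mode": statistics.mode(s),
        "median": statistics.median(s),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": s[-1] - s[0],
        "variance": ss2 / n,
        # Skewness as written on the slide; zero when all values are equal.
        "skewness": ss3 / ss2 ** 1.5 if ss2 > 0 else 0.0,
        "distinct": len(set(s)),
    }

incomes = [1, 2, 2, 3, 12]            # made-up, right-skewed sample
stats = summary_statistics(incomes)
symmetric = summary_statistics([1, 2, 3, 4, 5])
print(stats)
```

For the right-skewed sample the mean is pulled above the median by the single large value, while the symmetric sample has zero skewness.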

  5. Exploratory Data Analysis Tools for Displaying Single Variables

  6. Histogram • Most common form: split the data range into equal-sized bins, then for each bin count the number of points from the data set that fall into it. • Vertical axis: frequency (i.e., counts for each bin) • Horizontal axis: response variable • The histogram graphically shows the following: • center (i.e., the location) of the data; • spread (i.e., the scale) of the data; • skewness of the data; • presence of outliers; and • presence of multiple modes in the data.
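The binning step described on this slide can be sketched as follows. This is a minimal Python illustration using only the standard library; the function name `histogram_counts` and the data are hypothetical, and placing the maximum value in the last bin is one possible convention among several.

```python
def histogram_counts(xs, k):
    """Count how many values fall into each of k equal-width bins over [min, max]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # Degenerate case: all values identical, so put everything in the first bin.
        return [len(xs)] + [0] * (k - 1)
    width = (hi - lo) / k
    counts = [0] * k
    for x in xs:
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    return counts

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10]   # made-up values
counts = histogram_counts(data, 5)
print(counts)
```

Plotting `counts` against the bin positions gives the vertical (frequency) and horizontal (response variable) axes described above.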

  7. Issues with Histograms • For small data sets, histograms can be misleading. Small changes in the data or in the bin boundaries can result in very different histograms. • Interactive bin-width example (online applet) • http://www.stat.sc.edu/~west/javahtml/Histogram.html • For large data sets, histograms can be quite effective at illustrating general properties of the distribution. • Can smooth the histogram using a variety of techniques • E.g., kernel density estimation (pages 59-61 in text) • Histograms effectively only work with 1 variable at a time • Difficult to extend to 2 dimensions, not possible for >2 • So histograms tell us nothing about the relationships among variables
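As an illustration of the smoothing idea mentioned on this slide, here is a minimal kernel density estimation sketch in Python. It assumes a Gaussian kernel and a hand-picked bandwidth `h`; both are choices made for this example, not something the slide specifies, and the data values are made up.

```python
import math

def gaussian_kde(xs, h):
    """Return a function estimating the density of xs with a Gaussian kernel of bandwidth h."""
    n = len(xs)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each data point contributes one Gaussian bump centered on itself.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs)

    return density

density = gaussian_kde([1.0, 2.0, 2.5, 4.0], h=0.5)   # made-up data
# Riemann-sum check that the estimate integrates to roughly 1 over [-5, 10].
total = sum(density(-5 + 0.01 * i) for i in range(1501)) * 0.01
print(density(2.0), density(8.0), total)
```

The bandwidth plays the same role as the bin width in a histogram: a small `h` gives a spiky estimate, a large `h` oversmooths.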

  8. Histogram Example: a classical bell-shaped, symmetric histogram, with most of the frequency counts bunched in the middle and the counts dying off in the tails. From a physical science/engineering point of view, the Normal/Gaussian distribution often occurs in nature (due in part to the central limit theorem).

  9. ZipCode Data: Population (figure: three histogram panels, labeled K = 50, K = 500, and K = 50)

  10. ZipCode Data: Population • MATLAB code:
        X = zipcode_data(:,2);    % second column from zipcode array
        histogram(X, 50)          % histogram of X with 50 bins
        histogram(X, 500)         % 500 bins
        index = X < 5000;         % identify X values lower than 5000
        histogram(X(index), 50)   % now plot just these X values

  11. Histogram Detecting Outlier (Missing Data): blood pressure = 0?

  12. Right Skewness Example: Credit Card Usage • similarly right-skewed are power law distributions ($P_i \sim 1/i^a$, where $a \ge 1$) • e.g. for a = 1 we have “Zipf’s law” for word frequencies in text
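To see how right-skewed a power law is, the probabilities $P_i \propto 1/i^a$ can be normalized over a finite number of ranks. The Python sketch below is illustrative only: the function name `power_law_probabilities` and the vocabulary size of 1000 ranks are assumptions made for this example.

```python
def power_law_probabilities(num_ranks, a=1.0):
    """Normalized probabilities P_i proportional to 1 / i**a for ranks i = 1..num_ranks."""
    weights = [1.0 / (i ** a) for i in range(1, num_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

p = power_law_probabilities(1000, a=1.0)   # Zipf's law over a hypothetical 1000 ranks
top_ten_mass = sum(p[:10])                 # share of probability held by the top 1% of ranks
print(p[:3], top_ten_mass)
```

With a = 1 the first rank is exactly twice as likely as the second, and a small number of top ranks carries a large share of the total probability, which is the long right tail seen in such histograms.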

  13. Box (and Whisker) Plots: Pima Indians Data (healthy vs. diabetic) • box spans Q1 to Q3 and contains the middle 50% of the data • line inside the box marks Q2 (median) • whiskers extend up to 1.5 x (Q3 - Q1) beyond the box (or shorter, if there is no data that far above Q3) • all data outside the whiskers are plotted individually
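The quantities drawn in a box plot can be computed directly. The following Python sketch assumes Python 3.8+ and the "inclusive" quartile convention of the standard `statistics` module; plotting packages may use slightly different quartile rules. The function name `box_plot_summary` and the data are made up for illustration.

```python
import statistics

def box_plot_summary(xs):
    """Quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    s = sorted(xs)
    q1, q2, q3 = statistics.quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in s if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are plotted individually.
        "outliers": [x for x in s if x < lower_fence or x > upper_fence],
    }

box = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # made-up values
print(box)
```

Note that the upper whisker ends at the largest observation within the fence rather than at the fence itself, matching the "or shorter" remark on the slide.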

  14. Time Series Example 1 • annual fees introduced in UK (many users cut back to 1 credit card)

  15. Time Series Example 2 (figure annotations) • steady growth trend • summer peaks • summer bifurcations in air travel (favor early/late) • New Year bumps

  16. Time-Series Example 3 • Scotland experiment: does “milk in kid diet → better health”? • 20,000 kids: 5k raw, 5k pasteurized, 10k control (no supplement) • Plot: mean weight vs mean age for the 10k control group • Would expect a smooth weight growth plot; the plot visually reveals an unexpected pattern (steps), not apparent from the raw data table. • Possible explanations: • Grow less early in the year than later? • No steps in height plots; so why does height increase uniformly but weight in spurts? • Kids weighed in clothes: summer garb lighter than winter?

  17. Non-Stationarity • Stationarity: • (loose definition) A probability distribution p(x | t) is stationary with respect to t if p(x | t) = p(x) for all t, where x is the set of variables of interest, and t is some other varying quantity (usually t = time, but it could represent spatial information, group information, etc.) • Examples: • p(customer demographics today) = p(customer demographics next month)? • p(weights in Scotland) = p(weights in US)? • p(income of customers in Bank 1) = p(income of customers in Bank 2)? • Non-stationarity is common in real data sets • Solutions? • Model the non-stationarity (e.g., increasing trend over time) and extrapolate • Build model only on most recent/most similar data
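One simple way to probe for non-stationarity in the mean, and to apply the "model the trend and extrapolate" idea, is sketched below in Python. The helper names `window_means` and `fit_linear_trend`, the window width, and the synthetic monthly series are all assumptions made for this illustration; a linear trend is just one possible model.

```python
def window_means(ys, width):
    """Mean of each consecutive, non-overlapping window of the series."""
    return [sum(ys[i:i + width]) / width
            for i in range(0, len(ys) - width + 1, width)]

def fit_linear_trend(ts, ys):
    """Least-squares slope and intercept of y against t."""
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * t_bar

# Synthetic monthly series: upward trend of 5 per month plus a small alternating wiggle.
months = list(range(12))
sales = [100 + 5 * t + (1 if t % 2 == 0 else -1) for t in months]

means = window_means(sales, 4)            # drifting window means suggest p(x | t) != p(x)
slope, intercept = fit_linear_trend(months, sales)
forecast = intercept + slope * 12         # extrapolate one month ahead
print(means, slope, intercept, forecast)
```

If the window means were roughly constant, a single model for all of the data would be more defensible; here they rise steadily, so the trend is modeled explicitly.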

  18. Exploratory Data Analysis Tools for Displaying Pairs of Variables

  19. 2D Scatter Plots • standard tool to display relation between 2 variables • e.g. y-axis = response, x-axis = suspected indicator • useful to answer: • x,y related? • no • linearly • nonlinearly • variance(y) depend on x? • outliers present? • MATLAB: • plot(X(1,:),X(2,:),'.'); • figure example: credit card repayment: low-low, high-high

  20. Scatter Plot: No apparent relationship

  21. Scatter Plot: Linear relationship

  22. Scatter Plot: Quadratic relationship

  23. Scatter Plot: Homoscedastic (variation of Y does not depend on X)

  24. Scatter Plot: Heteroscedastic (variation in Y differs depending on the value of X), e.g., Y = annual tax paid, X = income

  25. (figure: scatter plot from US Zip code data; each point = 1 Zip code; units = dollars)

  26. Problems with Scatter Plots of Large Data • 96,000 bank loan applicants • scatter plot degrades into black smudge ... • appears: later apps older; reality: downward slope (more apps, more variance)

  27. Contour Plots Can Help • recall: same 96,000 bank loan apps as before • shows that the change in variance(y) with x is indeed due to horizontal skew in the density • figure annotations: unimodal, skewed

  28. Problems with Scatter Plots of Large Data • number of weeks credit card buys gas vs groceries (10,000 customers) • actual correlation (0.48) is higher than it appears (overprinting) • also demands explanation

  29. A simple data set • X: 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00 • Y: 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68 • Source: Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.

  30. A simple data set • X: 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00 • Y: 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68 • Summary statistics: N = 11; Mean of X = 9.0; Mean of Y = 7.5; Intercept = 3; Slope = 0.5; Residual standard deviation = 1.237; Correlation = 0.816

  31. A simple data set • X: 10.00 8.00 13.00 9.00 11.00 14.00 6.00 4.00 12.00 7.00 5.00 • Y: 8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68

  32. 3 more data sets
         X2     Y2     X3     Y3     X4     Y4
      10.00   9.14  10.00   7.46   8.00   6.58
       8.00   8.14   8.00   6.77   8.00   5.76
      13.00   8.74  13.00  12.74   8.00   7.71
       9.00   8.77   9.00   7.11   8.00   8.84
      11.00   9.26  11.00   7.81   8.00   8.47
      14.00   8.10  14.00   8.84   8.00   7.04
       6.00   6.13   6.00   6.08   8.00   5.25
       4.00   3.10   4.00   5.39  19.00  12.50
      12.00   9.13  12.00   8.15   8.00   5.56
       7.00   7.26   7.00   6.42   8.00   7.91
       5.00   4.74   5.00   5.73   8.00   6.89

  33. Summary Statistics • Data Set 2: N = 11; Mean of X = 9.0; Mean of Y = 7.5; Intercept = 3; Slope = 0.5; Residual standard deviation = 1.237; Correlation = 0.816

  34. Summary Statistics • Data Set 2: N = 11; Mean of X = 9.0; Mean of Y = 7.5; Intercept = 3; Slope = 0.5; Residual standard deviation = 1.237; Correlation = 0.816 • Data Set 3: N = 11; Mean of X = 9.0; Mean of Y = 7.5; Intercept = 3; Slope = 0.5; Residual standard deviation = 1.237; Correlation = 0.816 • Data Set 4: N = 11; Mean of X = 9.0; Mean of Y = 7.5; Intercept = 3; Slope = 0.5; Residual standard deviation = 1.237; Correlation = 0.816
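The claim that all four data sets share the same summary statistics can be checked numerically. The Python sketch below uses the values listed on the slides and an ordinary least-squares fit written out by hand; the names `QUARTET` and `fit_summary` are made up for this illustration, and the residual standard deviation is assumed to use n - 2 degrees of freedom.

```python
import math

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared X for data sets 1 to 3
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_summary(xs, ys):
    """Means, least-squares line, residual standard deviation, and correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "n": n, "mean_x": mx, "mean_y": my,
        "intercept": intercept, "slope": slope,
        "resid_sd": math.sqrt(sse / (n - 2)),   # assumes n - 2 degrees of freedom
        "corr": sxy / math.sqrt(sxx * syy),
    }

results = {k: fit_summary(xs, ys) for k, (xs, ys) in QUARTET.items()}
for k, r in results.items():
    print(k, {name: round(value, 3) for name, value in r.items()})
```

The printed summaries agree to the precision shown on the slides even though the four scatter plots look completely different, which is the point of the example.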

  35. Visualization really helps!

  36. Exploratory Data Analysis Tools for Displaying More than 2 Variables

  37. Multivariate Visualization • Multivariate -> multiple variables • 2 variables: scatter plots, etc • 3 variables: • 3-dimensional plots • Look impressive, but often not used • Can be cognitively challenging to interpret • Alternatives: overlay color-coding (e.g., categorical data) on 2d scatter plot • 4 variables: • 3d with color or time • Can be effective in certain situations, but tricky • Higher dimensions • Generally difficult • Scatter plots, icon plots, parallel coordinates: all have weaknesses • Alternative: “map” data to lower dimensions, e.g., PCA or multidimensional scaling • Main problem: high-dimensional structure may not be apparent in low-dimensional views

  38. Scatter Plot Matrix • For interactive visualization the concept of “linked plots” is generally useful

  39. Trellis Plot (figure panels: Older, Younger; Male, Female)

  40. Using Icons to Encode Information, e.g., Star Plots • Each star represents a single observation. Star plots are used to examine the relative values for a single data point • The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. • Useful for small data sets with up to 10 or so variables • Limitations? • Small data sets, small dimensions • Ordering of variables may affect perception • Variables in the example: 1 Price; 2 Mileage (MPG); 3 1978 Repair Record (1 = Worst, 5 = Best); 4 1977 Repair Record (1 = Worst, 5 = Best); 5 Headroom; 6 Rear Seat Room; 7 Trunk Space; 8 Weight; 9 Length
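The geometry of a single star can be sketched as follows: each variable is rescaled to [0, 1] and drawn along its own equi-angular spoke. This Python sketch is illustrative only; the function name `star_vertices`, the min-max rescaling, and the example values are assumptions made here, not taken from the slide.

```python
import math

def star_vertices(values, mins, maxs):
    """(x, y) endpoints of the spokes for one observation in a star plot."""
    p = len(values)
    points = []
    for k, (v, lo, hi) in enumerate(zip(values, mins, maxs)):
        # Spoke length: value rescaled to [0, 1] using that variable's range.
        r = (v - lo) / (hi - lo) if hi > lo else 0.0
        theta = 2.0 * math.pi * k / p      # equi-angular spokes
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# One hypothetical observation with 4 variables, each ranging from 0 to 10.
star = star_vertices([10, 0, 5, 10], mins=[0] * 4, maxs=[10] * 4)
print(star)
```

Connecting the endpoints in order gives the star outline; changing the order of the variables changes the shape, which is the perception issue noted on the slide.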

  41. Chernoff’s Faces • described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening • Chernoff faces applet http://people.cs.uchicago.edu/~wiseman/chernoff/ • more icon plots http://www.statsoft.com/textbook/glosi.html

  42. Parallel Coordinates (epileptic seizure data from text) • each line is 1 (of n) cases • axes are dimensions (possibly all p of them!), often (re)ordered to better distinguish among interesting subsets of the n total cases • the case shown is a “brushed” one, drawn with a darker line to stand out from the n-1 other cases • interactive “brushing” is useful for seeing such distinctions

  43. More elaborate parallel coordinates example (from E. Wegman, 1999) • 12,000 bank customers with 8 variables • Additional “dependent” variable is profit (green for positive, red for negative)

  44. Interactive “Grand Tour” Techniques • “Grand Tour” idea • Cycle continuously through multiple projections of the data • Cycles through all possible projections (depending on time constraints) • Projections can be 1, 2, or 3d typically (often 2d) • Can link with scatter plot matrices (see following example) • Asimov (1985) • e.g. XGOBI visualization package (available on the Web) • http://public.research.att.com/~stat/xgobi/ • Example on following 2 slides • 7-dimensional physics data, color-coded by group, shown with • Standard scatter matrix • static snapshot of grand tour
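The basic operation behind a grand tour is projecting p-dimensional points onto a 2D plane. The Python sketch below draws one random orthonormal plane and projects onto it. It is a simplification under stated assumptions: a real grand tour moves smoothly through a sequence of planes, whereas this only samples independent random ones, and the names `random_projection_plane` and `project` and the 7-dimensional points are made up for illustration.

```python
import math
import random

def random_projection_plane(p, rng):
    """Two orthonormal p-vectors spanning a randomly oriented 2D plane."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm_u = math.sqrt(sum(a * a for a in u))
    u = [a / norm_u for a in u]
    # Gram-Schmidt step: remove from v its component along u, then normalize.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(b * b for b in v))
    v = [b / norm_v for b in v]
    return u, v

def project(points, u, v):
    """2D coordinates of each point in the plane spanned by u and v."""
    return [(sum(x * a for x, a in zip(pt, u)),
             sum(x * b for x, b in zip(pt, v))) for pt in points]

rng = random.Random(0)                       # fixed seed so the example is repeatable
points = [[rng.random() for _ in range(7)] for _ in range(5)]   # made-up 7-d data
u, v = random_projection_plane(7, rng)
views = project(points, u, v)
print(views)
```

Repeating the last three lines with fresh planes and plotting each `views` list as a scatter plot gives a crude, non-smooth version of the cycling through projections described above.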

  45. (figure: standard scatter plot matrix of the 7-dimensional physics data, color-coded by group)

  46. (figure: static snapshot of the grand tour for the same data)

  47. Example of displaying 4d categorical data, e.g., as used in OLAP/databases

  48. Other aspects (not discussed) • Cognitive and human-factors aspects of visualization • In creating visualizations of data it is important to be aware of how the human brain perceives visual information • E.g., “Rules and principles of scientific data visualization” • http://www.siggraph.org/education/materials/HyperVis/percept/visrules.htm • Artistic aspects of visualization • Classic books by Edward Tufte: http://www.edwardtufte.com/tufte/ • Visualization of other data • 2d, 3d, 4d “volume” data (fluid flow, brain images, etc.) • Network/graph data • Issues: graph layout/drawing, issues of graph size • Many others, e.g., • http://www.cybergeography.org/ • CHI conference, etc.

  49. Visualization of weather states for Kenya • Daily data from 20-year history clustered into 3 different weather “states” • Mean image for each state: wind direction (arrows), wind intensity (size of arrows), rainfall (size of circles), pressure (contours) • S. Kirshner, A. Robertson, P. Smyth, 2004.

  50. Summary of 189k newsgroups and 257M postings • Green = increase in postings in 2004 over 2003, red = decrease • Uses “treemap” technique. Details at http://jcmc.indiana.edu/vol10/issue4/turner.html
