Lecture 06: Data Transform I

Lecture 06: Data Transform I. September 23, 2010 COMP 150-12 Topics in Visual Analytics. Lecture Outline. Data Retrieval Methods for increasing retrieval speed: Pre-computation Pre-fetching and Caching Levels of Detail (LOD) Hardware support Data transform (pre-processing)

Presentation Transcript


  1. Lecture 06: Data Transform I • September 23, 2010 • COMP 150-12 Topics in Visual Analytics

  2. Lecture Outline • Data Retrieval • Methods for increasing retrieval speed: • Pre-computation • Pre-fetching and Caching • Levels of Detail (LOD) • Hardware support • Data transform (pre-processing) • Aggregate (clustering) • Sampling (sub-sampling, re-sampling) • Simplification (dimension reduction) • Appropriate representation (finding underlying mathematical representation)

  3. Problem Statement • All tricks from lecture 5 have been implemented. • However, the data is so large that those tricks alone cannot support interactive response times (<0.1 s). • Example: all search queries at Google, all transactions at Bank of America.

  4. General Concept • If the data size is truly too large, we can find ways to trim the data: • By reducing the number of rows • Subsampling • Clustering • By reducing the number of columns • Dimension reduction • Fit an underlying representation (linear and non-linear)

  5. Challenge • How to maintain the general “characteristics” of the original data • How much can be trimmed? • Does analysis based on the trimmed data still apply to the original raw data?

  6. Disclaimer • Many of these methods are related to or based on machine learning. • Often referred to as “automated analysis” • As opposed to “interactive analysis”

  7. Keim’s visual analytics model • [Figure labels: pre-process, input, interactions] • Image source: Visual Analytics: Definition, Process, and Challenges, Keim et al., LNCS vol. 4950, 2008

  8. Dirty Data • Missing values, or data with uncertainty • Discard bad records • Assign a sentinel value (e.g. -1) • Assign the average value • Assign value based on nearest neighbors • Matrix completion problem • e.g. assuming a low rank matrix
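
The simpler strategies on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, missing values are assumed to be encoded as `None`, and "nearest neighbor" is simplified here to the closest observed entry by position in a 1-D sequence rather than by distance in feature space.

```python
from statistics import mean


def fill_with_sentinel(values, sentinel=-1):
    """Replace missing entries (None) with a sentinel value."""
    return [sentinel if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]


def fill_with_nearest(values):
    """Replace each missing entry with the closest observed entry by index.

    Assumption: "nearest" means nearest position in the list.
    Ties go to the earlier (left) neighbor.
    """
    observed = [(i, v) for i, v in enumerate(values) if v is not None]
    filled = []
    for i, v in enumerate(values):
        if v is None:
            _, v = min(observed, key=lambda pair: abs(pair[0] - i))
        filled.append(v)
    return filled


# Hypothetical sensor readings with two missing values.
readings = [10.0, None, 14.0, 16.0, None]
print(fill_with_sentinel(readings))
print(fill_with_mean(readings))
print(fill_with_nearest(readings))
```

Note that a sentinel such as -1 only works if it cannot occur as a legitimate value, and that mean-filling shrinks the variance of the column.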

  9. From Lecture 3: Data Definition • A typical dataset in visualization consists of n records • (r1, r2, r3, … , rn) • Each record ri consists of m (m >=1) observations or variables • (v1, v2, v3, … , vm) • A variable may be either independent or dependent • Independent variable (iv) is not controlled or affected by another variable • For example, time in a time-series dataset • Dependent variable (dv) is affected by a variation in one or more associated independent variables • For example, temperature in a region • Formal definition: • ri = (iv1, iv2, iv3, … , ivmi, dv1, dv2, dv3, … , dvmd) • where m = mi + md

  10. Rank vs. Dimensionality • How many dimensions are in your data? • What is its true rank? • Example… Pig Chewing
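
The gap between dimensionality and rank can be made concrete with a toy dataset. The sketch below is an assumption-laden illustration (the `matrix_rank` helper and the records are made up for this example): it has three columns, but the third is always a linear combination of the first two, so its rank is lower than its column count.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination, using exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical records: column 3 is always column 1 + 2 * column 2.
records = [
    [1, 2, 5],
    [3, 1, 5],
    [0, 4, 8],
    [2, 2, 6],
]
print(len(records[0]), matrix_rank(records))  # 3 columns, rank 2
```

Real data is noisy, so an exact rank test like this one rarely applies directly; it only illustrates why the number of columns can overstate the number of independent dimensions.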

  11. Example • Adobe Photoshop Content-Aware Fill • http://www.youtube.com/watch?v=NH0aEp1oDOI • Netflix challenge

  12. Questions?

  13. Aggregation / Clustering • Very much related to LOD and its supporting structures. • The idea is to “group” similar data items together

  14. Clustering Algorithms • There are numerous clustering algorithms out there… • Here we look at two popular ones • K-means • Agglomerative hierarchical • Clustering always needs a distance function

  15. K-Means • Inputs: • K: number of clusters • distance function: d(xi, xj) • [Figure: four panels, (1) to (4), illustrating the steps of the algorithm]
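
Given those two inputs, the standard k-means loop (assign each point to its nearest centroid, then recompute the centroids) might be sketched as follows. This is a minimal sketch rather than the lecture's implementation: the function names, the random initialization from the data points, and the `max_iters` and `seed` parameters are assumptions, and the centroid update takes a coordinate-wise mean, which is the natural choice for Euclidean distance.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def k_means(points, k, distance=euclidean, max_iters=100, seed=0):
    """Return (centroids, assignment) for a list of point tuples."""
    rng = random.Random(seed)
    # Assumption: initialize centroids with k distinct data points.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster: converged.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # Leave an empty cluster's centroid where it is.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Hypothetical 2-D data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

The result depends on the initial centroids, which is one reason many variations of the algorithm exist.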

  16. K-Means • http://www.youtube.com/watch?v=74rv4snLl70 • Notes about k-means: • Convergence could be slow (but it’s guaranteed to converge!) • Need to specify k • Adaptive k-means • Lots of variations

  17. Questions?

  18. Agglomerative Hierarchical Clustering Input: distance function: d(xi, xj)

  19. Dendrogram

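  20. Variations • Agglomerative: a bottom-up approach • Divisive: a top-down approach • Linkage of two clusters A and B: • Complete Link: the maximum of d(a, b) over all a in A, b in B • Single Link: the minimum of d(a, b) over all a in A, b in B • Average Link: the mean of d(a, b) over all a in A, b in B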
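
The three linkage criteria, in their standard textbook definitions (maximum, minimum, and mean pairwise distance between the two clusters), can be plugged into a naive bottom-up clustering loop. The sketch below is illustrative only: the function names and the `n_clusters` stopping parameter are assumptions, and the quadratic search over cluster pairs is far too slow for large data.

```python
from itertools import combinations, product


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_link(A, B, d):
    """Distance between the closest pair of points across A and B."""
    return min(d(a, b) for a, b in product(A, B))


def complete_link(A, B, d):
    """Distance between the farthest pair of points across A and B."""
    return max(d(a, b) for a, b in product(A, B))


def average_link(A, B, d):
    """Mean distance over all pairs of points across A and B."""
    dists = [d(a, b) for a, b in product(A, B)]
    return sum(dists) / len(dists)


def agglomerate(points, linkage, d=euclidean, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the merge history (useful for a dendrogram).
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], d),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j], d)))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical 1-D data, written as 1-tuples.
points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = agglomerate(points, single_link, n_clusters=2)
print(clusters)
```

Swapping `single_link` for `complete_link` or `average_link` changes only the merge criterion, which is what makes the resulting cluster shapes differ.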

  21. Examples and Intuitions • Shape of single-link vs. shape of complete-link • What happens with single link? • What happens with complete link?

  22. Questions?

  23. Sampling • Challenge • Can we find a smaller population n’ in the original population n such that n’ exhibits the same (or similar) characteristics as n? • Re-sampling • Sub-sampling

  24. Re-Sampling • Given the original data, create a new (smaller) dataset that replaces the original • Image Processing: • Linear Interpolation • Bilinear Interpolation • Nonlinear (cubic) Interpolation

  25. Linear Interpolation • Original samples: 10, 2, 15, 8 at x = 0.0, 0.3333, 0.6666, 1.0 • Resampled at x = 0.0, 0.25, 0.5, 0.75, 1.0: 10, ??, ??, ?? (values to be interpolated) • Example: the value at x = 0.25 is 4
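
The slide's numbers can be worked through in code. The sketch below assumes the values 10, 2, 15, 8 sit at x = 0, 1/3, 2/3, 1 and that the new samples are wanted at x = 0, 0.25, 0.5, 0.75, 1; the function name `lerp_resample` is hypothetical.

```python
def lerp_resample(xs, ys, new_xs):
    """Piecewise-linear resampling.

    xs must be sorted ascending, and every new x must lie in [xs[0], xs[-1]].
    """
    out = []
    for x in new_xs:
        # Find the segment [xs[i], xs[i + 1]] that contains x.
        i = 0
        while i < len(xs) - 2 and x > xs[i + 1]:
            i += 1
        # t is the fractional position of x within that segment.
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        out.append((1 - t) * ys[i] + t * ys[i + 1])
    return out


xs = [0.0, 1 / 3, 2 / 3, 1.0]
ys = [10, 2, 15, 8]
print(lerp_resample(xs, ys, [0.0, 0.25, 0.5, 0.75, 1.0]))
# approximately [10, 4, 8.5, 13.25, 8]
```

For instance, x = 0.25 lies three quarters of the way from x = 0 to x = 1/3, so the value is 0.25 × 10 + 0.75 × 2 = 4.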

  26. Bilinear Interpolation • [Figure: a grid with top row 10, 2, 15, 8 and bottom row 4, 25, 9, 18; the new samples (??, ??, ??) lie between them] • Similar to linear interpolation. • The new sampled values are a weighted average of the surrounding 4 vertices.
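
The "weighted average of the surrounding 4 vertices" can be sketched as two linear interpolations along x followed by one along y. The layout below is an assumption: the slide's numbers are read as a 2 × 4 grid (top row 10, 2, 15, 8; bottom row 4, 25, 9, 18) with unit spacing, and the new samples are placed at the cell centers. The function name `bilinear` is hypothetical.

```python
def bilinear(grid, x, y):
    """Bilinear interpolation on a grid with unit spacing.

    grid[row][col] holds the value at integer coordinates (col, row);
    x is the column coordinate and y the row coordinate, both non-negative
    and inside the grid.
    """
    # Top-left corner of the cell containing (x, y), clamped to the last cell.
    x0 = min(int(x), len(grid[0]) - 2)
    y0 = min(int(y), len(grid) - 2)
    tx, ty = x - x0, y - y0
    # Interpolate along x on the cell's top and bottom edges.
    top = (1 - tx) * grid[y0][x0] + tx * grid[y0][x0 + 1]
    bottom = (1 - tx) * grid[y0 + 1][x0] + tx * grid[y0 + 1][x0 + 1]
    # Then interpolate between the two edges along y.
    return (1 - ty) * top + ty * bottom


grid = [
    [10, 2, 15, 8],
    [4, 25, 9, 18],
]
# Assumed sample positions: the centers of the three cells.
print([bilinear(grid, x, 0.5) for x in (0.5, 1.5, 2.5)])
# [10.25, 12.75, 12.5]
```

At a cell center all four weights are 1/4, so each new value is simply the average of its four corners, for example (10 + 2 + 4 + 25) / 4 = 10.25.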

  27. Catmull-Rom Interpolation

  28. Catmull-Rom Interpolation
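
The transcript preserves only the slide titles here, so the following is a sketch of the standard uniform Catmull-Rom formula rather than a reconstruction of the slides. It interpolates between the two middle control values `p1` and `p2`, using `p0` and `p3` to set the tangents; the function name and the sample values (borrowed from the linear interpolation slide) are assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom spline value between p1 (t = 0) and p2 (t = 1)."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3
    )


# The curve passes through the middle control values at the endpoints.
print(catmull_rom(10, 2, 15, 8, 0.0))  # 2.0
print(catmull_rom(10, 2, 15, 8, 1.0))  # 15.0
# A point halfway between p1 and p2.
print(catmull_rom(10, 2, 15, 8, 0.5))
```

Unlike linear interpolation, the halfway value is not the plain average of 2 and 15, because the neighbors 10 and 8 bend the curve.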

  29. Sub-Sampling • Use random sampling • Simple random sampling • Systematic sampling • Etc. • Key point is that each element must have an equal non-zero chance of being selected • e.g. Selecting individuals from households • Remember that there could still be potential sampling error
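
The two named schemes can be sketched with Python's standard `random` module. The function names and the `seed` parameter are assumptions made for this illustration. The systematic sampler picks a random start and then takes every k-th element, with k = len(population) // n.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements, every subset of size n equally likely."""
    return random.Random(seed).sample(population, n)


def systematic_sample(population, n, seed=None):
    """Pick a random start in [0, k), then take every k-th element.

    Requires 1 <= n <= len(population). If len(population) is not a multiple
    of n, the trailing elements can never be selected, which violates the
    equal-chance requirement; trim or shuffle the population in that case.
    """
    k = len(population) // n
    start = random.Random(seed).randrange(k)
    return [population[start + i * k] for i in range(n)]


# Hypothetical population of 100 record IDs.
population = list(range(100))
print(simple_random_sample(population, 10, seed=1))
print(systematic_sample(population, 10, seed=1))
```

Systematic sampling is cheap on ordered data, but it can be biased if the data has a periodic pattern whose period matches k.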

  30. Sub-Sampling • If we assume that the population follows a normal distribution • Further assume that the variability of the population is known (as measured by standard deviation σ) • Then the standard error of the sample mean is given by: • σ / √n (where n = sample size)
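
Under the slide's assumptions, the standard result is that the standard error of the sample mean equals σ / √n. The sketch below compares that formula against a small simulation; the function names, the number of trials, and the example parameters are assumptions chosen for illustration.

```python
import math
import random
from statistics import pstdev


def standard_error(sigma, n):
    """Standard error of the sample mean for known population sigma."""
    return sigma / math.sqrt(n)


def simulated_standard_error(mu, sigma, n, trials=2000, seed=0):
    """Estimate the standard error empirically.

    Draws `trials` samples of size n from a normal population and returns
    the standard deviation of their sample means.
    """
    rng = random.Random(seed)
    sample_means = [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n for _ in range(trials)
    ]
    return pstdev(sample_means)


# Hypothetical population: mean 50, standard deviation 10, samples of size 25.
print(standard_error(10, 25))  # 2.0
print(simulated_standard_error(50, 10, 25))
```

Because the error shrinks with the square root of n, quadrupling the sample size only halves the standard error.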

  31. Questions?