
Data & Vis Basics


Presentation Transcript


  1. Data & Vis Basics

  2. Visualization Pipeline

  3. Data Definition
  • A typical dataset in visualization consists of n records (r_1, r_2, r_3, … , r_n)
  • Each record r_i consists of m (m >= 1) observations or variables (v_1, v_2, v_3, … , v_m)
  • A variable may be either independent or dependent
  • An independent variable (iv) is not controlled or affected by another variable; for example, time in a time-series dataset
  • A dependent variable (dv) is affected by variation in one or more associated independent variables; for example, temperature in a region
  • Formal definition: r_i = (iv_1, iv_2, iv_3, … , iv_mi, dv_1, dv_2, dv_3, … , dv_md), where m = m_i + m_d
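
  As a concrete (hypothetical) illustration of this record structure, here is a small NumPy sketch; the variables and values are invented for this example:

    import numpy as np

    # Hypothetical time-series dataset of n = 10 records.
    # Each record r_i = (iv_1, dv_1, dv_2): one independent variable (time)
    # and two dependent variables (temperature, humidity), so m = 1 + 2 = 3.
    time = np.arange(0, 10)                 # independent variable (iv)
    temperature = 20 + 2 * np.sin(time)     # dependent variable (dv)
    humidity = 50 + 5 * np.cos(time)        # dependent variable (dv)

    records = np.column_stack([time, temperature, humidity])
    print(records.shape)                    # (n, m) = (10, 3)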

  4. Basic Data Types: Nominal
  • Def: a set of unordered, non-numeric values
  • Categorical (finite) data, for example {apple, orange, pear} or {red, green, blue}
  • Arbitrary (infinite) data, for example {“12 Main St. Boston MA”, “45 Wall St. New York NY”, …} or {“John Smith”, “Jane Doe”, …}
  • (The basic data types covered here: Nominal, Ordinal, and Scale / Quantitative, the latter split into Interval and Ratio)

  5. Basic Data Types: Ordinal
  • Def: a tuple (an ordered set)
  • Numeric, for example <2, 4, 6, 8>
  • Binary, for example <0, 1>
  • Non-numeric, for example <G, PG, PG-13, R>

  6. Basic Data Types: Scale / Quantitative (Interval, Ratio)
  • Def: a numeric range
  • Interval: ordered numeric elements on a scale that can be mathematically manipulated, but cannot be compared as ratios; for example, dates and times (Sept 14, 2010 cannot be described as a ratio of Jan 1, 2011)
  • Ratio: a numeric scale for which an “absolute zero” exists; for example, height and weight
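
  A rough Python sketch of these four data types; the example values are our own, not from the slides:

    from datetime import date

    nominal = {"apple", "orange", "pear"}             # unordered, non-numeric set
    ordinal = ("G", "PG", "PG-13", "R")               # ordered tuple; only rank matters
    interval = [date(2010, 9, 14), date(2011, 1, 1)]  # differences meaningful, ratios not
    ratio = [1.62, 1.75, 1.80]                        # heights in meters; a true zero exists

    print((interval[1] - interval[0]).days)           # 109 days: a valid interval operation
    print(ratio[1] / ratio[0])                        # ~1.08: ratios are meaningful here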

  7. Dimensionality
  • Scalar: a single value
  • Vector: a collection of scalars
  • Matrix: a 2-dimensional array
  • Tensor: a collection of matrices

  8. Dimensionality (Programming)
  • Scalar: 0-dimensional array
  • Vector: 1-dimensional array
  • Matrix: 2-dimensional array
  • Tensor: array with 3 or more dimensions

  9. Dimensionality (Technically)
  • Scalar: 0th-order tensor
  • Vector: 1st-order tensor
  • Matrix: 2nd-order tensor
  • Tensor: nth-order tensor in general
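
  These three views of dimensionality line up directly with array shapes; a small NumPy sketch for illustration:

    import numpy as np

    scalar = np.array(5.0)               # 0-dimensional array (0th-order tensor)
    vector = np.array([1.0, 2.0, 3.0])   # 1-dimensional array (1st-order tensor)
    matrix = np.eye(3)                   # 2-dimensional array (2nd-order tensor)
    tensor = np.zeros((4, 3, 2))         # 3-dimensional array (3rd-order tensor)

    for a in (scalar, vector, matrix, tensor):
        print(a.ndim, a.shape)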

  10. Data Transform

  11. General Concept
  • If the data size is truly too large, we can find ways to trim the data (sketched below):
  • Reduce the number of rows: subsampling, clustering
  • Reduce the number of columns: dimension reduction, fitting an underlying representation (linear or non-linear)
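
  A hedged sketch of the row-trimming options (random subsampling and k-means cluster representatives, assuming NumPy and scikit-learn); the column-trimming options are illustrated by the PCA and MDS examples later in the deck:

    import numpy as np
    from sklearn.cluster import KMeans      # only needed for the clustering step

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 10))      # hypothetical large dataset (rows x columns)

    # Reduce rows by subsampling: keep a random 1% of the records.
    idx = rng.choice(X.shape[0], size=1_000, replace=False)
    X_sub = X[idx]

    # Reduce rows by clustering: represent the data by k cluster centers.
    centers = KMeans(n_clusters=50, n_init=10, random_state=0).fit(X_sub).cluster_centers_
    print(X_sub.shape, centers.shape)       # (1000, 10) (50, 10)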

  12. Challenge
  • How to maintain the general “characteristics” of the original data?
  • How much can be trimmed?
  • Does analysis based on the trimmed data still apply to the original raw data?

  13. Keim’s visual analytics model [Figure: model diagram showing pre-processing of the input data and user interactions] Image source: Visual Analytics: Definition, Process, and Challenges, Keim et al., LNCS vol. 4950, 2008

  14. Dirty Data
  • Missing values, or data with uncertainty
  • Options (sketched below): discard bad records; assign a sentinel value (e.g. -1); assign the average value; assign a value based on nearest neighbors
  • Matrix completion problem, e.g. assuming a low-rank matrix
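
  A small NumPy sketch of three of these strategies; the example data is invented for illustration:

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [5.0, np.nan]])            # hypothetical data with missing entries

    # Strategy 1: discard records containing a missing value.
    X_drop = X[~np.isnan(X).any(axis=1)]

    # Strategy 2: assign a sentinel value (e.g. -1).
    X_sentinel = np.where(np.isnan(X), -1.0, X)

    # Strategy 3: assign the column average of the observed values.
    col_mean = np.nanmean(X, axis=0)
    X_mean = np.where(np.isnan(X), col_mean, X)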

  15. 8 Visual Variables • Position • Mark • Size • Brightness • Color • Orientation • Texture • Motion

  16. Position

  17. Mark

  18. Size (Length, Area, and Volume)

  19. Brightness

  20. Color

  21. Orientation

  22. Texture

  23. Jacques Bertin “Semiology of Graphics” [1967]

  24. Jacques Bertin: mark types (point, line, area) and visual channels (position, size, grayscale value, texture, color, orientation, shape)

  25. Mackinlay

  26. Tableau

  27. Why Dimension Reduction
  • Computation: the complexity grows exponentially with the dimension
  • Visualization: projection of high-dimensional data to 2D or 3D
  • Interpretation: the intrinsic dimension may be small

  28. Dimension Reduction • Lots of possibilities, but can be roughly categorized into two groups: • Linear dimension reduction • Non-linear dimension reduction • Related to machine learning…

  29. Principal Components Analysis (PCA): approximating a high-dimensional data set with a lower-dimensional linear subspace. [Figure: scatter of data points with the original axes and the first and second principal components]

  30. Imagine a two-dimensional scatter of points that shows a high degree of correlation … orthogonal regression … [Figure: scatter plot with axes x and y and means x̄ and ȳ]

  31. Why bother? • more “efficient” description • 1st var. captures max. variance • 2nd var. captures the max. amount of residual variance, at right angles (orthogonal) to the first • the 1st var. may capture so much of the information content in the original data set that we can ignore the remaining axis

  32. Principal Components Analysis (PCA) why: • clarify relationships among variables • clarify relationships among cases when: • significant correlations exist among variables how: • define new axes (components) • examine correlation between axes and variables • find scores of cases on new axes

  33. Philosophy of PCA
  • PCA is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations.
  • We typically have a data matrix of n observations on p correlated variables x1, x2, …, xp
  • PCA looks for a transformation of the xi into p new variables yi that are uncorrelated.
  • We want to represent x1, x2, …, xp with a few of the yi without losing much information.

  34. PCA
  • We look for a transformation of the data matrix X (n×p) such that Y = aᵀX = a1 X1 + a2 X2 + … + ap Xp
  • where a = (a1, a2, …, ap)ᵀ is a column vector of weights with a1² + a2² + … + ap² = 1

  35. Maximize the variance of the projection of the observations on the Y variables
  • Find a so that Var(aᵀX) = aᵀ Var(X) a is maximal
  • Var(X) is the covariance matrix of the Xi variables

  36. PCA gives
  • New variables Yi that are linear combinations of the original variables xi: Yi = ei1 x1 + ei2 x2 + … + eip xp , i = 1..p
  • The new variables Yi are derived in decreasing order of importance; they are called ‘principal components’

  37. Principal Component Analysis (pseudocode)
  • Pose the data such that each column is a dimension and each row is a data entry (an n×m matrix, n = rows, m = cols)
  • Subtract the mean of each dimension from its values
  • Compute the covariance matrix M
  • Compute the eigenvectors and eigenvalues of M, e.g. via singular value decomposition (SVD): M = U Σ Vᵀ, where U and V are m×m matrices and Σ is an m×m diagonal matrix of non-negative real numbers
  • Sort the eigenvectors (columns of U) by their associated eigenvalues in Σ, from highest to lowest
  • Project the original (mean-centered) data onto the first k (highest-eigenvalue) eigenvectors
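
  A compact NumPy sketch of the above (an illustration, not the lecture’s reference code); it applies SVD to the mean-centered data matrix rather than to the covariance matrix, which yields the same principal directions:

    import numpy as np

    def pca(X, k):
        """Project the n x m data matrix X onto its first k principal components."""
        Xc = X - X.mean(axis=0)              # subtract the mean of each dimension
        # SVD of the centered data: Xc = U @ np.diag(S) @ Vt.
        # Rows of Vt are the eigenvectors of the covariance matrix, already
        # sorted by decreasing singular value (eigenvalue = S**2 / (n - 1)).
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T                 # scores on the first k components

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    Y = pca(X, 2)
    print(Y.shape)                           # (200, 2)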

  38. Multidimensional scaling (MDS)

  39. Multidimensional scaling (MDS) Suppose we are given the distance structure of the following 10 cities, and we have no knowledge of the city locations or of a map of the US. Can we map these cities to a 2D space that best presents their distance structure?

  40. Multidimensional scaling (MDS) MDS deals with the following problem: given a set of observed similarities (or distances) between every pair of N items, find a representation of the items in a few dimensions such that the inter-item proximities “nearly match” the original similarities (or distances). The numerical measure of how closely the low-dimensional distances match the original distances is called the stress (one common form is sketched below).
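
  For concreteness, one common normalized form of the stress (a sketch; conventions differ on the exact normalization, and the slide does not fix one):

    import numpy as np

    def stress(D_orig, D_low):
        """Normalized stress between original and low-dimensional pairwise-distance matrices."""
        return np.sqrt(np.sum((D_orig - D_low) ** 2) / np.sum(D_orig ** 2))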

  41. MDS

  42. MDS

  43. MDS Mapping to 3D is possible but more difficult to visualize and interpret.

  44. MDS
  • MDS attempts to map objects into a visible 2D or 3D Euclidean space. The goal is to preserve the distance structure as well as possible after the mapping.
  • The original data can live in a high-dimensional or even non-metric space; the method only cares about the distance (dissimilarity) structure.
  • The resulting mapping is not unique: any rotation or reflection of a mapping solution is also a solution.
  • It can be shown that the results of PCA are exactly those of classical MDS when the distances calculated from the data matrix are Euclidean.
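
  A sketch of classical (metric) MDS in NumPy, consistent with the note above about its relation to PCA; the double-centering construction is standard, but the function and variable names are ours:

    import numpy as np

    def classical_mds(D, k=2):
        """Embed N items into k dimensions from an N x N pairwise distance matrix D."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
        B = -0.5 * J @ (D ** 2) @ J            # double-centered squared distances
        eigvals, eigvecs = np.linalg.eigh(B)   # eigendecomposition (ascending order)
        top = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
        scale = np.sqrt(np.maximum(eigvals[top], 0.0))
        return eigvecs[:, top] * scale         # N x k coordinates

  For the 10-city example above, D would be the 10×10 matrix of inter-city distances, and classical_mds(D, 2) would return 2D coordinates that can be plotted directly.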

  45. Self-Organizing Maps (pseudocode)
  • Assume input of n rows of m-dimensional data
  • Define some number of nodes (e.g. a 40×40 grid)
  • Give each node m values (a vector of size m), and randomize those values
  • Loop k times:
    • Select one of the n rows of data as the “input vector” D(t)
    • Find, within the grid of nodes, the one most similar to the input vector (call this node the Best Matching Unit, BMU)
    • Find the neighbors of the BMU on the grid
    • Update the BMU and its neighbors with: Wv(t+1) = Wv(t) + θ(t) · α(t) · (D(t) - Wv(t))
      • θ(t) is a Gaussian function of grid distance from the BMU (decays over time)
      • α(t) is the learning rate (decays over time)
      • D(t) is the input vector, and Wv is grid node v’s vector
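
  A minimal NumPy sketch of this training loop; grid size, decay schedules, and iteration count are illustrative choices rather than values from the slide:

    import numpy as np

    def train_som(X, grid=(40, 40), iters=10_000, seed=0):
        """Train a self-organizing map on the n x m data matrix X (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.random((grid[0], grid[1], m))              # random m-vector for every grid node
        gy, gx = np.mgrid[0:grid[0], 0:grid[1]]            # grid coordinates of each node
        for t in range(iters):
            frac = t / iters
            alpha = 0.5 * (1.0 - frac)                     # learning rate, decays over time
            sigma = max(grid) / 2 * (1.0 - frac) + 1.0     # neighborhood radius, decays over time
            x = X[rng.integers(n)]                         # pick an input vector
            d = np.linalg.norm(W - x, axis=2)              # distance from every node to the input
            bmu = np.unravel_index(np.argmin(d), d.shape)  # Best Matching Unit (BMU)
            grid_d2 = (gy - bmu[0]) ** 2 + (gx - bmu[1]) ** 2
            theta = np.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood function
            W += alpha * theta[..., None] * (x - W)        # Wv <- Wv + theta * alpha * (x - Wv)
        return W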

  46. Isomap Image courtesy of Wikipedia: Nonlinear Dimensionality Reduction
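
  If scikit-learn is available, Isomap can be applied in a couple of lines; the data here is a random placeholder, purely for illustration:

    import numpy as np
    from sklearn.manifold import Isomap      # assumes scikit-learn is installed

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))           # placeholder high-dimensional data
    X_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    print(X_2d.shape)                        # (500, 2)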

  47. Many Others! • To name a few: • Latent Semantic Indexing • Support Vector Machine • Linear Discriminant Analysis (LDA) • Locally Linear Embedding • “manifold learning” • Etc.
