
FODAVA-Lead: Dimension Reduction and Data Reduction: Foundations for Visualization


Presentation Transcript


  1. FODAVA-Lead: Dimension Reduction and Data Reduction: Foundations for Visualization. Haesun Park, Division of Computational Science and Engineering, College of Computing, Georgia Institute of Technology. FODAVA Kick-off Meeting, Sep. 2008

  2. FODAVA-Lead Proposed Research
  Fundamental challenge: two important constraints on Data and Visual Analytics systems.
  • Speed: necessary for real-time, interactive use. Even back-end data analysis and transformation operations must appear essentially instantaneous to users; massive data sizes make this challenging.
  • Screen space: the number of available pixels is a fundamentally limiting constraint.
  Goal: effective representation and efficient transformation of large data sets by data reduction and dimension reduction.

  3. FODAVA-Lead Research Goals
  Development of fundamental theory and algorithms in data representations and transformations to enable visual understanding:
  • Dimension Reduction: feature selection by sparse recovery; manifold learning; dimension reduction with prior info/interpretability constraints; …
  • Data Reduction: multi-resolution data approximation; anomaly cleaning and detection; data fusion; …
  • Fast Algorithms: large-scale optimization problems/matrix decompositions; dynamic and time-varying data; integration with DAVA systems (e.g., Text Analysis and Jigsaw)

  4. Research Interests (H. Park)
  • Efficient and effective numerical algorithm development and analysis
  • Algorithms for massive data analysis: dimension reduction, clustering and classification, adaptive methods
  • Applications: microarray analysis (gene selection, missing value estimation), protein structure prediction, biometric recognition, text analysis
  Effective Dimension Reduction with Prior Knowledge
  • Dimension reduction for clustered data: Linear Discriminant Analysis (LDA), Generalized LDA (LDA/GSVD), Orthogonal Centroid Method (OCM), fast adaptive algorithms
  • Dimension reduction for nonnegative data: Nonnegative Matrix Factorization (NMF)
  • Applications: text classification, face recognition, fingerprint classification, gene clustering in microarray analysis, …

  5. 2D Representation: Utilize Cluster Structure if Known
  [Figure: 2D representations of a 700×1000 data set with 7 clusters, comparing LDA+PCA(2), SVD(2), and PCA(2).]
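As an illustration of the comparison above, the following Python sketch computes the three 2D projections with scikit-learn on synthetic blobs of the same 700×1000 shape with 7 clusters (a stand-in, not the original data set); only LDA uses the cluster labels:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the 700x1000 data set with 7 clusters
X, y = make_blobs(n_samples=700, n_features=1000, centers=7, random_state=0)

Z_pca = PCA(n_components=2).fit_transform(X)            # PCA(2): ignores labels
Z_svd = TruncatedSVD(n_components=2).fit_transform(X)   # SVD(2): ignores labels
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses labels
```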

  6. Dimension Reduction for Clustered Data (LDA/GSVD) (Howland, Jeon, Park SIMAX 03; Howland & Park TPAMI 04)
  Measure for cluster quality: $A = [a_1 \dots a_n] \in \mathbb{R}^{m \times n}$ is the clustered data, $N_i$ = set of items in class $i$ with $|N_i| = n_i$, $r$ classes in total, $c_i$ = centroid of class $i$, $c$ = global centroid.

  $$S_w = \sum_{i=1}^{r} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T$$
  $$S_b = \sum_{i=1}^{r} \sum_{j \in N_i} (c_i - c)(c_i - c)^T$$
  $$S_t = \sum_{i=1}^{n} (a_i - c)(a_i - c)^T$$

  High-quality clusters have small trace($S_w$) and large trace($S_b$).
  Want $G \in \mathbb{R}^{m \times q}$ that minimizes trace($G^T S_w G$) and maximizes trace($G^T S_b G$), which leads to the generalized eigenvalue problem
  $$S_w^{-1} S_b x = \lambda x, \quad \text{i.e.,} \quad S_b x = \lambda S_w x, \quad \text{i.e.,} \quad \alpha^2 H_b H_b^T x = \beta^2 H_w H_w^T x,$$
  solved via the GSVD: $U^T H_b^T X = D_1$, $V^T H_w^T X = D_2$.
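A minimal Python sketch of these quantities, with a small regularization of $S_w$ standing in for the GSVD (which is what actually handles a singular $S_w$ in LDA/GSVD):

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(A, labels, q=2):
    """Build S_w and S_b from the columns of A (m x n) and solve
    S_b x = lambda S_w x; the top-q eigenvectors form G (m x q)."""
    labels = np.asarray(labels)
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)                  # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for k in np.unique(labels):
        Ak = A[:, labels == k]
        ck = Ak.mean(axis=1, keepdims=True)            # class centroid
        D = Ak - ck
        Sw += D @ D.T                                  # within-class scatter
        Sb += Ak.shape[1] * (ck - c) @ (ck - c).T      # between-class scatter
    # Generalized symmetric eigenproblem; regularize so S_w is invertible.
    vals, vecs = eigh(Sb, Sw + 1e-8 * np.eye(m))
    return vecs[:, ::-1][:, :q]                        # largest eigenvalues first
```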

  7. QRD Preprocessing in Dimension Reduction (Distance-Preserving Dimension Reduction)
  For under-sampled data $A \in \mathbb{R}^{m \times n}$ with $m \gg n$:
  $$A = [Q_1 \; Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R,$$
  where $Q_1$ is an orthonormal basis for range($A$) when rank($A$) = $n$.
  Dimension reduction of $A$ by $Q_1^T$: $Q_1^T A = R \in \mathbb{R}^{n \times n}$.
  $Q_1^T$ preserves distances in the $L_2$ norm, $\|a_i\|_2 = \|Q_1^T a_i\|_2$ and $\|a_i - a_j\|_2 = \|Q_1^T (a_i - a_j)\|_2$, and in cosine distance, $\cos(a_i, a_j) = \cos(Q_1^T a_i, Q_1^T a_j)$.
  • Applicable to PCA, LDA, LDA/GSVD, regularized LDA, Isomap, LLE, …
  • Updating and downdating can be done fast, which is important for iterative visualization.
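The distance-preservation claims are easy to check with NumPy's reduced QR; the sizes below are illustrative:

```python
import numpy as np

m, n = 5000, 300                                   # under-sampled: m >> n
A = np.random.default_rng(0).standard_normal((m, n))

Q1, R = np.linalg.qr(A)        # reduced QR: Q1 is m x n, R is n x n
B = Q1.T @ A                   # equals R up to rounding; columns are n-dimensional

i, j = 0, 1
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
assert np.isclose(np.linalg.norm(A[:, i] - A[:, j]),
                  np.linalg.norm(B[:, i] - B[:, j]))             # L2 distance preserved
assert np.isclose(cos(A[:, i], A[:, j]), cos(B[:, i], B[:, j]))  # cosine preserved
```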

  8. Speed-Up with QRD Preprocessing
  [Figure: computation time comparison.]

  9. LDA for Data with Sub-clusters
  [Figure: examples with sub-cluster structure, e.g., facial recognition (Person #1, Person #2, Person #3) and cross-language processing (English, Korean; topics such as Sports, Sentiment #1, Sentiment #2, Technology).]
  • The unimodal Gaussian assumption for each cluster in LDA may not hold when sub-cluster structure exists.

  10. Dimension Reduction for Visualization of Clustered Data
  • max trace$((G^T S_w G)^{-1} (G^T S_b G))$ → LDA (Fisher 36, Rao 48)
  • max trace$(G^T S_b G)$ → Orthogonal Centroid (Park et al. 03); IN-SPIRE uses OC with rank($G$) = 2, which can be updated easily and nonlinearized (see the sketch below)
  • max trace$(G^T (S_w + S_b) G)$ → PCA (Hotelling 33)
  • max trace$(G^T (A A^T) G)$ → LSI (Deerwester et al. 90)
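For the Orthogonal Centroid criterion, a minimal sketch, assuming (per the OCM literature) that the maximizer of trace$(G^T S_b G)$ over orthonormal $G$ is an orthonormal basis of the centroid matrix, computed by QR:

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """Reduce the columns of A (m x n) to r dimensions, where r is the
    number of classes, using an orthonormal basis of the centroid matrix."""
    labels = np.asarray(labels)
    centroids = np.stack([A[:, labels == k].mean(axis=1)
                          for k in np.unique(labels)], axis=1)  # m x r
    G, _ = np.linalg.qr(centroids)      # G: m x r with orthonormal columns
    return G.T @ A                      # r x n reduced representation
```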

  11. Nonlinear Discriminant Analysis by Kernel Functions
  [Figure: nonlinear feature mapping into a feature space F with 2D visualization; fingerprint classes: Left Loop, Right Loop, Whorl, Arch, Tented Arch.]
  Construction of directional images by DFT:
  1. Compute directionality in a local neighborhood by FFT
  2. Compute the dominant direction
  3. Find the core point for unified centering of fingerprints within the same class
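A simplified sketch of steps 1 and 2 (block-wise dominant direction from the peak of the local 2D FFT magnitude); the block size and peak-picking rule here are assumptions, not the exact construction in the paper:

```python
import numpy as np

def block_directions(img, block=16):
    """Estimate one dominant ridge direction per block x block patch
    from the peak of its centered FFT magnitude spectrum."""
    H, W = img.shape[0] // block, img.shape[1] // block
    angles = np.zeros((H, W))
    for bi in range(H):
        for bj in range(W):
            patch = img[bi*block:(bi+1)*block, bj*block:(bj+1)*block]
            spec = np.abs(np.fft.fftshift(np.fft.fft2(patch - patch.mean())))
            fy, fx = np.unravel_index(np.argmax(spec), spec.shape)
            vy, vx = fy - block // 2, fx - block // 2   # peak frequency vector
            # Ridges run perpendicular to the dominant frequency direction.
            angles[bi, bj] = (np.arctan2(vy, vx) + np.pi / 2) % np.pi
    return angles
```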

  12. Fingerprint Classification Results on NIST Fingerprint Database 4 (C. Park and H. Park, Pattern Recognition, 06)
  KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions

  Classification accuracy (%) at each rejection rate (%):

  Method                          0       1.8     8.5
  KDA/GSVD                        90.7    91.3    92.8
  kNN & NN (Jain et al., 99)      -       90.0    91.2
  SVM (Yao et al., 03)            -       90.0    92.2

  4000 fingerprint images of size 512×512; by KDA/GSVD, the dimension is reduced from 105×105 to 4.

  13. Nonnegativity-Preserving Dimension Reduction: Nonnegative Matrix Factorization (NMF)
  (Paatero & Tapper 94, Lee & Seung NATURE 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim & Park 06 Bioinformatics, Kim & Park 08 SIAM Journal on Matrix Analysis and Applications, …)

  Given nonnegative $A \in \mathbb{R}^{m \times n}$, find $W \ge 0$ and $H \ge 0$ such that $A \approx WH$:
  $$\min_{W \ge 0,\, H \ge 0} \| A - WH \|_F$$

  Why nonnegativity constraints?
  • Trade-off between better approximation and better representation/interpretation
  • Nonnegativity constraints are often physically meaningful
  • Interpretation of analysis results possible
  • Fastest algorithm for NMF, with theoretical convergence
  • Can be used as a clustering algorithm
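As a minimal, self-contained illustration of the NMF objective, here is a sketch using Lee & Seung's classic multiplicative updates (not the faster ANLS-type algorithm of Kim & Park that the slide refers to):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Approximately minimize ||A - WH||_F with W >= 0, H >= 0 via
    multiplicative updates; A must be nonnegative, W is m x k, H is k x n."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```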

  14. How This Research Will Influence FODAVA
  • Better representation and transformation of data: improved theory and methods that more accurately incorporate prior knowledge
  • Capacity to process more data faster: fast and scalable algorithms that can represent and transform larger data sets in less time
  • Improved visual interaction capability: fast algorithms for efficient handling of dynamic and transient data
  • Information synthesis: visual representation of information of different types on one map

  15. Developing New Understanding
  • Dimension reduction in DAVA requires new modeling, optimization criteria, and algorithms
  • Design efficient and effective algorithms for data representation and transformation, balancing speed and accuracy
  • More on community-building plans will be covered tomorrow. Thank you!
