This study focuses on estimating statistical dependency in high-dimensional data using factorization tests and significance measures. It explores methods to analyze data dependencies without strong modeling assumptions. The research presents experiments and conclusions on measuring significance in high-dimensional datasets.
Estimating Dependency and Significance for High-Dimensional Data
Michael R. Siracusa*, Kinh Tieu*, Alexander T. Ihler§, John W. Fisher*§, Alan S. Willsky§
* Computer Science and Artificial Intelligence Laboratory
§ Laboratory for Information and Decision Systems
Premise: In many high-dimensional data sources, statistical dependency can be well explained by a lower-dimensional latent variable:
• Intuition: The complexity of the problem is driven more by the hypothesis than by the data.
• How do we estimate the dependency?
• From a single realization?
• How do we avoid strong modeling assumptions?
• How do we estimate significance?
Dependency Structure (Graphical Model) / Parameterization (Nuisance)
Dependence: An Example
Asymptotics: Statistical Dependence vs. Model Differences
Independent vs. Some Dependency:
1. Under H0: the data is independent.
2. We don't have the true distributions.
3. We are only given a single realization.
Factorization Test (cont)
• Questions:
• How do we obtain samples under each factorization?
• How do we estimate the KL divergence D(·||·) when x is high-dimensional?
• How do we estimate significance?
Drawing Samples From a Single Realization
• We only have one realization from which to estimate the joint.
• But we can obtain N! sample draws under H0 via permutations.
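A minimal sketch of this permutation idea (my own illustrative code, not the authors' implementation): shuffling one variable's sample order destroys the pairing between sources while preserving both marginals, so any dependency statistic evaluated on shuffled data is a draw from its distribution under the independence hypothesis H0.

```python
import numpy as np

def permutation_draws(x, y, statistic, n_perm=1000, rng=None):
    """Draw `n_perm` samples of `statistic` under H0 (independence)
    by permuting y's sample order: pairing is destroyed, marginals kept."""
    rng = np.random.default_rng(rng)
    null_stats = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(y))
        null_stats[i] = statistic(x, y[perm])
    return null_stats

# Usage with a simple (hypothetical) dependency statistic: |correlation|.
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = x + 0.1 * rng.standard_normal(200)          # strongly dependent pair
stat = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
null = permutation_draws(x, y, stat, n_perm=200, rng=1)
```

For dependent data the observed statistic should land far in the tail of these permutation draws, which is exactly what makes the test work from a single realization.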
High-Dimensional Data
From the Data Processing Inequality: dependence measured after deterministic lower-dimensional projections can only be less than or equal to the full-dimensional divergence.
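The slide's equation did not survive extraction; the bound is presumably the standard data-processing inequality for KL divergence, stated here for two sources with deterministic projections f and g:

```latex
D\bigl(p(x,y)\,\big\|\,p(x)\,p(y)\bigr)
\;\ge\;
D\bigl(p(f(x),\,g(y))\,\big\|\,p(f(x))\,p(g(y))\bigr)
```

Maximizing the right-hand side over projections then gives the tightest lower bound on the full dependency that is computable in low dimensions.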
High-Dimensional Data (cont)
Sufficiency: for high-dimensional data, maximize the left side of the bound.
• Gaussian w/ linear projections
• Closed-form solution (eigenvalue problem): Kullback 68
• Nonparametric
• Gradient descent: Ihler and Fisher 03
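In the Gaussian case the optimal linear projections reduce to an eigenvalue problem; canonical correlation analysis is one standard formulation of it. A hedged sketch (the function name and the small regularization constant are my own; this is not the paper's code):

```python
import numpy as np

def top_canonical_pair(X, Y):
    """Leading canonical correlation pair via an eigenvalue problem.
    X, Y: (n, dx) and (n, dy) data matrices. Returns projection
    directions a, b and the canonical correlation rho."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Sxx = X.T @ X / n + 1e-8 * np.eye(X.shape[1])   # tiny ridge for stability
    Syy = Y.T @ Y / n + 1e-8 * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    # Eigenvalues of Sxx^-1 Sxy Syy^-1 Syx are squared canonical correlations.
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    vals, vecs = np.linalg.eig(M)
    i = int(np.argmax(vals.real))
    a = vecs[:, i].real                              # projection for x
    b = np.linalg.solve(Syy, Sxy.T @ a)              # projection for y
    rho = float(np.sqrt(max(vals[i].real, 0.0)))
    return a, b, rho

# Usage: two 2-D sources sharing a latent variable in their first dimension.
rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)
X = np.column_stack([z + 0.75 * rng.standard_normal(n), rng.standard_normal(n)])
Y = np.column_stack([z + 0.75 * rng.standard_normal(n), rng.standard_normal(n)])
a, b, rho = top_canonical_pair(X, Y)
```

For Gaussian projections, the resulting 1-D divergence is -1/2 log(1 - rho^2), so maximizing rho maximizes the left side of the bound.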
[Figure: Swiss roll example: 3D data, its PCA 2D projection, and the MaxKL 2D optimization.]
Measuring Significance: p-value
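The permutation draws double as a null distribution for significance: the p-value is the fraction of permutation statistics at least as large as the observed one. A minimal sketch (the add-one smoothing convention is my choice; it keeps the estimate away from an impossible p = 0):

```python
import numpy as np

def permutation_pvalue(observed, null_stats):
    """Fraction of permutation-null statistics >= the observed
    dependency statistic, with add-one smoothing."""
    null_stats = np.asarray(null_stats)
    return (1 + np.sum(null_stats >= observed)) / (1 + len(null_stats))
```

For example, an observed statistic exceeding all 99 permutation draws yields p = 1/100 = 0.01.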
Synthetic Data
High-dimensional observations: a low-dimensional latent variable embedded in a high-dimensional space, plus distracter noise.
Dependency via M: controls the number of dimensions over which the dependency information is uniformly distributed.
D: controls the total dimensionality of our K observations.
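A hypothetical generator in the spirit of this slide (the function name, parameter choices, and exact construction are my assumptions, not the paper's): a shared low-dimensional latent variable drives M of the D dimensions of each of K observed sources, and the remaining dimensions are independent distracter noise.

```python
import numpy as np

def synthetic_sources(n, K=2, D=20, M=4, dependent=True, rng=None):
    """Generate K high-dim sources of n samples each.
    If dependent, a shared 1-D latent z is spread over the first M
    of the D dimensions; all other variation is independent noise."""
    rng = np.random.default_rng(rng)
    z = rng.standard_normal((n, 1))            # shared low-dim latent variable
    sources = []
    for _ in range(K):
        X = rng.standard_normal((n, D))        # distracter noise in high dim
        if dependent:
            W = rng.standard_normal((1, M))    # spread dependency over M dims
            X[:, :M] += z @ W
        sources.append(X)
    return sources

srcs = synthetic_sources(1000, K=3, D=10, M=2, dependent=True, rng=0)
```

Sweeping M and D then probes how concentrated vs. diffuse dependency interacts with total dimensionality.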
Experiments
• 100 trials with samples of dependent data
• 100 trials with samples of independent data
• Each trial gives a statistic and a significance p-value
Conclusions
• We presented a method for estimating statistical dependency across high-dimensional measurements via factorization tests.
• Exploited a bound on lower-dimensional projections.
• We made use of permutations for drawing from the alternate hypothesis given a single realization.
• We also made use of permutations to obtain reliable significance estimates.
• This was done using a small number of samples relative to the dimensionality of the data.
• Finally, we presented some brief analysis of synthetic and real data.
Thank You Questions?
Problem Statement Given N i.i.d. observations for K sources Determine if the K sources are independent or not: • Obtain a dependency measure • Estimate the significance of this measurement
Hypothesis Test
Two Hypotheses: H0 (the sources are independent) vs. H1 (they are dependent).
Assuming we know the distributions, given N i.i.d. observations: [likelihood-ratio statistic not recovered in extraction]
Factorization Test
Two Factorizations: the full joint vs. the product of marginals.
But we don't know the distributions. Our best approximation (like a GLR) plugs in estimated distributions under each factorization.
Notation simplification: [not recovered in extraction]
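One concrete instance of the plug-in statistic, under a Gaussian assumption (a sketch of the idea, not the paper's nonparametric estimator): the estimated divergence between the joint and the product of marginals has the closed form 1/2 log(det(Sxx) det(Syy) / det(S)).

```python
import numpy as np

def gaussian_factorization_stat(X, Y):
    """Plug-in Gaussian estimate of D(p(x,y) || p(x)p(y)):
    0.5 * log( det(Sxx) * det(Syy) / det(S) ), where S is the
    sample covariance of the stacked data. X: (n, dx), Y: (n, dy)."""
    Z = np.column_stack([X, Y])
    S = np.cov(Z, rowvar=False)
    dx = X.shape[1]
    Sxx, Syy = S[:dx, :dx], S[dx:, dx:]
    return 0.5 * (np.log(np.linalg.det(Sxx))
                  + np.log(np.linalg.det(Syy))
                  - np.log(np.linalg.det(S)))
```

Independent data drives the statistic toward zero; dependence inflates it, which is what the permutation null then calibrates.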
Factorization Test (cont)
[Slide diagram: decomposition of the estimated statistic into terms relating the estimated joint and estimated product distributions to the true joint and true independent distributions.]
Applications • What Vision Problems Can We Solve w/ Accurate Measures of Dependency? • Data Association, Correspondence • Feature Selection • Learning Structure • We will specifically discuss: • Correspondence (for multi-camera tracking) • Audio-visual Association
Audio-Visual Association • Useful For: • Speaker Localization • - Help improve Human-Computer Interaction • - Help Source Separation • Automatic Transcription of Archival Video • - Who is speaking? • - Are they seen by the camera?
Hypotheses: Camera X vs. Camera Y
Distributions of Transition Times
[Figure: histograms over transition time.]
Discussion and Future Work • Dependence underlies various vision related problems. • We studied a framework for measuring dependence. • Measure significance (how confident are you) • Make it more robust.
Math (oh no!): the 2-variable case
Outline • Applications: (for computer vision) • Problem Formulation: (Hypothesis Testing) • Computation: (Non-parametric entropy estimation) • Curse of Dimensionality: (Informative Statistics) • Correspondence: (Markov Chain Monte Carlo)
Previous Talks
• Greg: Model dependence between features and class
• Kristen: Model dependence between features and a scene
• Ariadna: Model dependency between intra-class features
• Wanmei: Dependency between protocol signal and voxel response
• Chris: Audio and video dependence with events
• Antonio: Contextual dependence
• Corey: “Inferring Dependencies”