**Dimensionality Reduction in Unsupervised Learning of Conditional Gaussian Networks** Authors: Peña, J.M., Lozano, J.A., Larrañaga, P., and Inza, I. In IEEE Trans. on PAMI, 23(6), 2001. Summarized by Kyu-Baek Hwang

**Abstract** • Feature selection for unsupervised learning of Gaussian networks • Unsupervised learning for Bayesian networks? • Which feature is good for the learning task? • Assessment of the relevance of the feature for learning process • How to determine the threshold for cutting? • Accelerate the learning time and still obtain reasonable models • Two artificial datasets • Two benchmark datasets from the UCI repository

**Unsupervised Learning for Conditional Gaussian Networks** • Data clustering → learning the probabilistic graphical model from the unlabeled data • Cluster membership → a hidden variable • Conditional Gaussian networks • The cluster variable is an ancestor of all the other variables. • The joint probability distribution over all the other variables given the cluster membership is multivariate Gaussian. • Feature selection in classification vs. feature selection in clustering • In clustering, all the features are eventually considered, to describe the domain.

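**Conditional Gaussian Distribution** • Data clustering • X = (Y, C) = (Y1, …, Yn, C) • Conditional Gaussian distribution • Pdf for Y given C = c is $f(\mathbf{y} \mid C = c) = (2\pi)^{-n/2} |\Sigma_c|^{-1/2} \exp\{-\tfrac{1}{2}(\mathbf{y} - \mu_c)^T \Sigma_c^{-1} (\mathbf{y} - \mu_c)\}$ whenever p(c) = p(C = c) > 0 • $\mu_c$: mean vector; $\Sigma_c$: variance matrix, positive definite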
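The conditional density on this slide is a multivariate Gaussian with one mean vector and one positive definite variance matrix per cluster. A minimal sketch of evaluating it, and the joint density of (y, c), is given below. The function names and argument layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def conditional_gaussian_logpdf(y, mean_c, cov_c):
    """Log density of y under N(mean_c, cov_c), i.e. of Y given C = c."""
    y = np.asarray(y, dtype=float)
    mean_c = np.asarray(mean_c, dtype=float)
    cov_c = np.asarray(cov_c, dtype=float)
    n = y.shape[0]
    # Cholesky raises LinAlgError unless cov_c is positive definite
    chol = np.linalg.cholesky(cov_c)
    z = np.linalg.solve(chol, y - mean_c)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + z @ z)

def joint_logpdf(y, c, priors, means, covs):
    """Log of p(c) * f(y | C = c); defined only when p(c) > 0."""
    if priors[c] <= 0:
        raise ValueError("the conditional density requires p(c) > 0")
    return np.log(priors[c]) + conditional_gaussian_logpdf(y, means[c], covs[c])
```

Working in log space avoids underflow when n is large, which is why the sketch returns log densities rather than densities.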

**Conditional Gaussian Networks** • Factorization of the conditional Gaussian distribution • Conditional independencies among all the variables is encoded by the network structure s. • Local probability distribution

**An Example of CGNs** [Figure: example CGN with cluster variable C]

**Learning CGNs from Data** • Incomplete dataset d • Structural EM algorithm

**Structural EM Algorithm** Expected score Relaxed version:

**Scoring Metrics for the Structural Search** • The log marginal likelihood of the expected complete data

**Feature Selection** • Large databases • Many instances • Many attributes • Dimensionality reduction required • Select features based on some criterion. • The criterion depends on the purpose of learning. • Learning speed, accurate predictions, and the comprehensibility of the learned models • Non-exhaustive search ($2^n$ feature subsets) • Sequential selection (forward or backward) • Evolutionary, population-based, randomized search based on the EDA.

**Wrapper and Filter** • Wrapper • Feature subsets tailored to the performance function of the learning process • Predictive accuracy on the test data set. • Filter • Based on the intrinsic properties of the data set. • Correlation between the class label and each attribute • Supervised learning • Two problems in unsupervised learning • Absence of the class label → different criterion for the feature selection • No standard accepted performance task → multiple predictive accuracy or class prediction

**Feature Selection in Learning CGNs** • Data analysis (clustering) → description, not prediction • All the features are necessary for the description. • CGN learning with many features is a time-consuming task. • Preprocessing: feature selection • Learning CGNs • Postprocessing: addition of the other features as conditionally independent given the cluster membership • The goal → how to measure the relevance • Fast learning time • Accuracy → log likelihood for the test data
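The postprocessing step adds each discarded feature back as conditionally independent of the others given the cluster membership. One way this could be done is sketched below, under the assumption that each such feature gets a univariate Gaussian per cluster, fitted with the cluster membership probabilities as weights. The estimator and all names are assumptions for illustration; the slides do not state how the parameters are obtained.

```python
import numpy as np

def fit_independent_features(values, responsibilities):
    """Per-cluster mean and variance of features treated as independent given C.

    values:           (m, k) array holding the k discarded features
    responsibilities: (m, K) array, row i holding p(c | case i) for K clusters
    """
    values = np.asarray(values, dtype=float)
    resp = np.asarray(responsibilities, dtype=float)
    weights = resp / resp.sum(axis=0)       # each cluster's weights sum to 1
    means = weights.T @ values              # (K, k) weighted means
    second = weights.T @ values**2          # (K, k) weighted second moments
    variances = second - means**2
    return means, variances
```

With hard assignments (each row of `responsibilities` is a one-hot vector) this reduces to the ordinary per-cluster sample mean and maximum likelihood variance.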

**Relevance** • Those features that exhibit low correlation with the rest of the features can be considered irrelevant for the learning process. • Conditionally independent given the cluster membership. • First trial in the continuous domain

**Relevance Measure** • The relevance measure: • Null hypothesis (edge exclusion test) • $r^2_{ij|rest}$ • $r_{ij|rest}$: the sample partial correlation of Yi and Yj given the rest of the variables • Computed from the maximum likelihood estimates (MLEs) of the elements of the inverse variance matrix
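The sample partial correlations can be read off the inverse of the maximum likelihood variance matrix: with W denoting that inverse, the partial correlation of Yi and Yj given the rest is -w_ij / sqrt(w_ii w_jj). The sketch below computes them and then scores each feature. The scoring step assumes the relevance of Yi is the average over j of the edge exclusion statistic -m log(1 - r²), with m the number of cases; that formula is not legible in this summary, so treat the averaging as an assumption.

```python
import numpy as np

def partial_correlations(data):
    """Sample partial correlation of every pair (Yi, Yj) given the rest."""
    # bias=True gives the maximum likelihood estimate (divides by m)
    cov = np.cov(data, rowvar=False, bias=True)
    w = np.linalg.inv(cov)                  # MLE of the inverse variance matrix
    d = np.sqrt(np.diag(w))
    r = -w / np.outer(d, d)
    np.fill_diagonal(r, 1.0)
    return r

def relevance_scores(data):
    """Assumed relevance: mean edge exclusion statistic over the other features."""
    m, n = data.shape
    r2 = partial_correlations(data) ** 2
    np.fill_diagonal(r2, 0.0)               # a feature is not tested against itself
    stat = -m * np.log1p(-r2)               # -m * log(1 - r^2)
    return stat.sum(axis=1) / (n - 1)
```

A feature that is nearly uncorrelated with all the others given the rest gets a score close to zero, which matches the notion of irrelevance stated on the previous slide.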

**Graphical Gaussian Models (1/2)**

**Graphical Gaussian Models (2/2)**

**Relevance Threshold** • Distribution of the test statistic • G(x): pdf of a $\chi^2_1$ random variable • 5 percent test • The resolution of the above equation → optimization
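The equation that defines the threshold is not legible in this summary. As a simplified stand-in, the sketch below finds the 5 percent critical value of a chi-square variable with one degree of freedom, which is the large-sample reference distribution for a single edge exclusion test. Using only this asymptotic distribution, and solving by bisection, are assumptions made for illustration; the paper's threshold may include further terms.

```python
import math

def chi2_1_cdf(x):
    """Distribution function of a chi-square variable with 1 degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def chi2_1_quantile(prob, lo=0.0, hi=100.0, tol=1e-10):
    """Solve chi2_1_cdf(x) = prob for x by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_1_cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 5 percent test: reject edge exclusion when the statistic exceeds this value
threshold = chi2_1_quantile(0.95)
```

Features whose relevance falls below the chosen threshold would be the candidates for removal before learning.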

**Learning Scheme**

**Experimental Settings** • Model specifications • Tree augmented Naïve Bayes (TANB) models • Predictive attributes may have, at most, one other predictive attribute as a parent. • An example
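The TANB restriction can be expressed as a simple check on a structure: every predictive attribute has the cluster variable as a parent, at most one other predictive attribute as a parent, and the attribute-to-attribute arcs contain no cycle. The sketch below assumes a hypothetical representation in which `parents` maps each predictive attribute to the list of its parents; this encoding is not from the paper.

```python
def is_tanb(parents, cluster="C"):
    """Return True if the structure satisfies the TANB restrictions."""
    for node, pa in parents.items():
        if cluster not in pa:
            return False                    # every attribute depends on the cluster
        others = [p for p in pa if p != cluster]
        if len(others) > 1 or any(p not in parents for p in others):
            return False                    # at most one known predictive parent
    # With at most one predictive parent per node, a cycle shows up as a
    # revisited node when walking up the parent chain.
    for start in parents:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return False
            seen.add(node)
            others = [p for p in parents[node] if p != cluster]
            node = others[0] if others else None
    return True
```

For example, `is_tanb({"Y1": ["C"], "Y2": ["C", "Y1"], "Y3": ["C", "Y1"]})` describes a valid structure in which Y1 is the predictive parent of both Y2 and Y3.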

**Data Sets** • Synthetic data sets (4000:1000) • TANB model with 25 (15:14[-1, 1]) attributes, (0, 4, 8), 1 • C: uniform, (0, 1) • TANB model with 30 (15:14[-1, 1]) attributes, (0, 4, 8), 2 • C:uniform, (0, 5) • Waveform (artificial data) (4000:1000) • 3 clusters, 40 attributes, the last 19 are noise attributes • Pima • 768 cases (700:68) • 8 attributes

**Performance Criteria** • The log marginal likelihood of the training data • The multiple predictive accuracy • A probabilistic approach to the standard multiple predictive accuracy • Runtime • 10 independent runs for the synthetic data sets and the waveform data • 50 independent runs for the Pima data • On a Pentium 366 machine

**Relevance Ranking**

**Likelihood Plots for Synthetic Data**

**Likelihood Plots for Real Data**

**Runtime**

**Automatic Dimensionality Reduction**

**Conclusions and Future Work** • Relevance assessment for feature selection in unsupervised learning and continuous domain • Reasonable learning performance • Extension to categorical domain • Redundant feature problem • Relaxation of the model structure • More realistic data set