Handling of High-Dimensional Data Sets

Yen-Jen Oyang, Dept. of Computer Science and Information Engineering

Importance of Feature Selection
• Including features that are not correlated with the classification decision can make the problem harder.
• For example, in the data set shown on the following page, including the feature on the y-axis causes a 3NN classifier to mispredict the marked test instance.

[Figure: “o” and “x” instances plotted against axes x and y, with the line x=10 marked.]
• The “o”s and “x”s are clearly separated by the line x=10. If only the attribute on the x-axis were selected, the 3NN classifier would predict the class of the marked test instance correctly.
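The effect described above can be reproduced with a small sketch. The data points, the query, and the helper name `knn_predict` are hypothetical and chosen only for illustration; they are not the data from the slide's figure. The class depends only on x (boundary at x=10), while y is an uninformative feature with a much larger spread.

```python
from collections import Counter
import math

def knn_predict(train, labels, query, k=3, features=None):
    """Return the majority label among the k nearest training points.

    `features` lists the coordinate indices used in the Euclidean
    distance; None means all coordinates are used.
    """
    idx = range(len(query)) if features is None else features

    def dist(p):
        return math.sqrt(sum((p[i] - query[i]) ** 2 for i in idx))

    nearest = sorted(range(len(train)), key=lambda j: dist(train[j]))[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]

# Hypothetical toy data: "o" has x < 10, "x" has x > 10; y is noise.
train = [(8, 0), (9, 20), (7, 40), (9.5, 60),
         (11, 58), (12, 62), (13, 61), (12, 5)]
labels = ["o", "o", "o", "o", "x", "x", "x", "x"]
query = (9, 59)  # true class is "o" because x < 10

with_y = knn_predict(train, labels, query)                   # uses x and y
without_y = knn_predict(train, labels, query, features=[0])  # uses x only
print(with_y, without_y)
```

With both features, two of the three nearest neighbours are "x" instances that happen to be close along y, so the query is misclassified; with x alone, all three nearest neighbours are "o".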

Feature Selection for Microarray Data Analysis
• In microarray data analysis, it is highly desirable to identify the genes that are correlated with the classes of the samples.
• For example, the Leukemia data set contains 7129 genes. We want to identify the genes that lead to different disease types.

Test of Equality of Several Means
• Assume that we conduct k experiments and that the outcomes of all k experiments are normally distributed with a common variance. The question is whether these k normal distributions, $N(\mu_1,\sigma^2), N(\mu_2,\sigma^2), \ldots, N(\mu_k,\sigma^2)$, have a common mean, i.e. $\mu_1=\mu_2=\cdots=\mu_k$.

• One application of this type of statistical test is to determine whether the students in several schools have similar academic performance.
• The hypothesis of the test is $H_0: \mu_1=\mu_2=\cdots=\mu_k$.

• Let $n_i$ denote the number of samples that we take from distribution $N(\mu_i,\sigma^2)$. As a result, we have the following random variables:
$X_{11}, X_{12}, \ldots, X_{1n_1}$: samples from $N(\mu_1,\sigma^2)$.
$X_{21}, X_{22}, \ldots, X_{2n_2}$: samples from $N(\mu_2,\sigma^2)$.
…
$X_{k1}, X_{k2}, \ldots, X_{kn_k}$: samples from $N(\mu_k,\sigma^2)$.
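The transcript does not show the test statistic for this hypothesis. A standard choice, assumed here rather than taken from the slides, is the one-way ANOVA F ratio, which under the hypothesis of equal means follows an F distribution with (k-1, n-k) degrees of freedom, where n is the total number of samples. The function name `one_way_f` and the sample values are hypothetical.

```python
def one_way_f(groups):
    """One-way ANOVA F ratio; `groups` holds one list of samples per distribution."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of each sample around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical samples from k = 3 distributions.
f_different = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
f_identical = one_way_f([[1, 2, 3], [1, 2, 3]])
print(f_different, f_identical)
```

A large F ratio is evidence against equal means; comparing it with a critical value of the F distribution gives the actual test.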

Feature Selection Based on Univariate Analysis
[Figure with labels Class 1, Class 2, and Class 3.]

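T Distribution
• Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and unknown variance. Then the random variable defined by $T=\frac{\bar{X}-\mu}{S/\sqrt{n}}$, where $\bar{X}$ is the sample mean and $S^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$ is the sample variance, has the so-called T distribution with $r=n-1$ degrees of freedom.

• The parameter $r$ is called the number of degrees of freedom of the T distribution.
• Note that as $r\to\infty$, the distribution of $T$ approaches $N(0,1)$.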

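Test of the Equality of Two Normal Distributions
• Let X and Y have normal distributions $N(\mu_X,\sigma^2)$ and $N(\mu_Y,\sigma^2)$, respectively. Assume that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are random samples of X and Y, respectively.
• Then $\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{\sigma\sqrt{1/n+1/m}}$ is $N(0,1)$.

• We know that $\frac{(n-1)S_X^2+(m-1)S_Y^2}{\sigma^2}$ is $\chi^2(n+m-2)$, where $S_X^2$ and $S_Y^2$ are the unbiased sample variances of the two samples.
• Therefore, $T=\frac{(\bar{X}-\bar{Y})-(\mu_X-\mu_Y)}{S_p\sqrt{1/n+1/m}}$, with pooled variance $S_p^2=\frac{(n-1)S_X^2+(m-1)S_Y^2}{n+m-2}$, has a T distribution with n+m-2 degrees of freedom.

• The hypothesis of the statistical test is that $\mu_X=\mu_Y$.
• Accordingly, we can determine whether a feature should be removed based on the following T statistic: $T=\frac{\bar{X}-\bar{Y}}{S_p\sqrt{1/n+1/m}}$. A feature whose $|T|$ is small shows no significant difference between the two classes and is a candidate for removal.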
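As a minimal sketch of how the T statistic could be used to score features, the code below computes the pooled two-sample statistic for every feature and ranks the features by its absolute value. The function names `pooled_t` and `rank_features` and the expression values are hypothetical; the slides do not prescribe a particular ranking procedure or threshold.

```python
import math
from statistics import mean, variance

def pooled_t(xs, ys):
    """Two-sample T statistic with pooled variance (equal-variance assumption)."""
    n, m = len(xs), len(ys)
    # statistics.variance is the unbiased sample variance (divides by n - 1).
    sp2 = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / n + 1 / m))

def rank_features(class_a, class_b):
    """Return (feature indices sorted by decreasing |T|, list of |T| per feature)."""
    d = len(class_a[0])
    scores = [abs(pooled_t([row[j] for row in class_a],
                           [row[j] for row in class_b])) for j in range(d)]
    return sorted(range(d), key=lambda j: -scores[j]), scores

# Hypothetical data: each row is a sample, each column a feature (e.g. a gene).
class_a = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0)]
class_b = [(7.0, 4.0), (8.0, 5.0), (9.0, 3.0)]
order, scores = rank_features(class_a, class_b)
print(order, scores)
```

Feature 0 differs strongly between the two classes and gets a large |T|, while feature 1 has the same mean in both classes and gets |T| = 0, so it would be the first candidate for removal.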

Blind Spot of the Univariate Analysis
• Univariate analysis is not able to identify crucial features in the following two examples:
[Figure: two example plots with axes x and y.]
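The figures for the two examples did not survive in this transcript. The sketch below uses a hypothetical XOR-style layout, which is an assumption and not necessarily either of the original examples, to show one way the blind spot can arise: each feature has the same mean in both classes, so its T statistic is zero, yet the two features together separate the classes perfectly.

```python
import math
from statistics import mean, variance

def pooled_t(xs, ys):
    """Two-sample T statistic with pooled variance (equal-variance assumption)."""
    n, m = len(xs), len(ys)
    sp2 = ((n - 1) * variance(xs) + (m - 1) * variance(ys)) / (n + m - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / n + 1 / m))

# Hypothetical XOR-style data: class A sits near (0,0) and (1,1),
# class B near (0,1) and (1,0).
class_a = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9)]
class_b = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.9, 0.1)]

# Univariate view: each feature on its own.
t_x = pooled_t([p[0] for p in class_a], [p[0] for p in class_b])
t_y = pooled_t([p[1] for p in class_a], [p[1] for p in class_b])

# Joint view: a derived feature that combines x and y.
def joint(p):
    return (p[0] - 0.5) * (p[1] - 0.5)

t_joint = pooled_t([joint(p) for p in class_a], [joint(p) for p in class_b])
print(t_x, t_y, t_joint)
```

Both univariate statistics are zero, so a filter based on them would discard x and y, while the combined feature has a large |T|.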

Multivariate Analysis
• Because of the limitations of univariate analysis illustrated above, researchers have been investigating multivariate analysis for feature selection.
