
Handling of High-Dimensional Data Sets

Yen-Jen Oyang

Dept. of Computer Science and Information Engineering

Importance of Feature Selection
  • Including features that are not correlated with the classification decision may make the problem more complicated.
  • For example, in the data set shown on the following page, including the feature corresponding to the y-axis causes a 3NN classifier to mispredict the class of the marked test instance.
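[Figure: scatter plot of “o” and “x” instances in the x-y plane, with the vertical line x=10 separating the two classes]

  • It is apparent that the “o”s and “x”s are separated by the line x=10. If only the attribute corresponding to the x-axis were selected, the 3NN classifier would predict the class of the marked test instance correctly.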
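The effect can be reproduced with a minimal Python sketch. The coordinates below are made up for illustration (they are not the data from the slide), and `knn_predict` is a hypothetical helper rather than a library function. The two classes are separated by x=10, but the query point lies close to the "x" instances along the y-axis.

```python
import math
from collections import Counter

# Hypothetical toy data: "o" instances have x < 10, "x" instances have x > 10.
train = [
    ((8.0, 0.0), "o"), ((9.0, 1.0), "o"), ((9.5, 2.0), "o"),
    ((11.0, 20.0), "x"), ((12.0, 21.0), "x"), ((11.0, 22.0), "x"),
]
query = (9.0, 20.0)  # x < 10, so its true class is "o"

def knn_predict(train, query, dims, k=3):
    """Majority vote among the k nearest training points, using only `dims`."""
    def dist(point):
        return math.sqrt(sum((point[d] - query[d]) ** 2 for d in dims))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

both = knn_predict(train, query, dims=(0, 1))  # uses x and y
x_only = knn_predict(train, query, dims=(0,))  # uses x only
print(both, x_only)
```

With both features, the three nearest neighbours are all "x" instances, so the query is misclassified; with the x feature alone, the three nearest neighbours are all "o" instances.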

Feature Selection for Microarray Data Analysis
  • In microarray data analysis, it is highly desirable to identify the genes that are correlated with the classes of the samples.
  • For example, the Leukemia data set contains 7129 genes. We want to identify the genes that lead to different disease types.
Test of Equality of Several Means
  • Assume that we conduct k experiments and that the outcomes of all k experiments are normally distributed with a common variance. Our concern is whether these k normal distributions, $N(\mu_1, \sigma^2), N(\mu_2, \sigma^2), \dots, N(\mu_k, \sigma^2)$, have a common mean, i.e. $\mu_1 = \mu_2 = \dots = \mu_k$.
One application of this type of statistical test is to determine whether the students in several schools have similar academic performance.
  • The hypothesis of the test is $H_0: \mu_1 = \mu_2 = \dots = \mu_k$.
Let $n_i$ denote the number of samples that we take from the distribution $N(\mu_i, \sigma^2)$.

As a result, we have the following random variables:

$X_{11}, X_{12}, \dots, X_{1n_1}$: samples from $N(\mu_1, \sigma^2)$.

$X_{21}, X_{22}, \dots, X_{2n_2}$: samples from $N(\mu_2, \sigma^2)$.

… … … … …

$X_{k1}, X_{k2}, \dots, X_{kn_k}$: samples from $N(\mu_k, \sigma^2)$.
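The test statistic itself does not survive in this transcript. A standard choice for this hypothesis, under the stated assumptions (normal outcomes, common variance), is the one-way ANOVA F statistic: the ratio of the between-group mean square to the within-group mean square. The sketch below assumes that choice; `f_statistic` and the sample values are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a list of samples (one list per group)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical scores from k = 3 groups.
scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_value = f_statistic(scores)
print(f_value)  # 3.0
```

Under the hypothesis of equal means, this statistic follows an F distribution with k-1 and N-k degrees of freedom, where N is the total number of samples, so large values are evidence against a common mean.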

T Distribution
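  • Let $X_1, X_2, \dots, X_n$ be random samples from a normal distribution with mean $\mu$ and unknown variance. Then, the random variable defined by

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}},$$

where $\bar{X}$ is the sample mean and $S^2$ is the sample variance, has the so-called T distribution with $n-1$ degrees of freedom.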

Test of the Equality of Two Normal Distributions
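  • Let X and Y have normal distributions $N(\mu_X, \sigma^2)$ and $N(\mu_Y, \sigma^2)$, respectively, with a common variance $\sigma^2$. Assume that $X_1, X_2, \dots, X_n$ and $Y_1, Y_2, \dots, Y_m$ are independent random samples of X and Y, respectively.

  • Then,

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\sigma^2/n + \sigma^2/m}}$$

is N(0,1).

We know that

$$U = \frac{(n-1)S_X^2}{\sigma^2} + \frac{(m-1)S_Y^2}{\sigma^2}$$

is $\chi^2(n+m-2)$, where $S_X^2$ and $S_Y^2$ are the sample variances.

  • Therefore,

$$T = \frac{Z}{\sqrt{U/(n+m-2)}} = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)}}$$

has a T distribution with n+m-2 degrees of freedom.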

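The hypothesis of the statistical test is that $\mu_X = \mu_Y$.
  • Accordingly, we can determine whether a feature should be removed, based on the following T statistic:

$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)}}$$

where $\bar{X}, \bar{Y}$ and $S_X^2, S_Y^2$ are the sample means and sample variances of the feature in the two classes, with n and m samples respectively. A feature whose |T| is small shows no significant difference between the class means and is a candidate for removal.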
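As a rough illustration of how such a statistic could be used to rank features, the sketch below computes the pooled-variance two-sample T statistic for each feature and sorts the features by |T|. The function name `pooled_t`, the gene names, and the expression values are all hypothetical; the pooled-variance form assumes the two classes share a common variance.

```python
import math

def pooled_t(xs, ys):
    """Two-sample T statistic with pooled variance (n + m - 2 degrees of freedom)."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (m - 1)
    pooled = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
    return (mean_x - mean_y) / math.sqrt(pooled * (1 / n + 1 / m))

# Hypothetical expression levels of two genes, measured in two classes of samples.
class_a = {"gene_1": [1.0, 2.0, 3.0], "gene_2": [5.0, 6.0, 7.0]}
class_b = {"gene_1": [7.0, 8.0, 9.0], "gene_2": [5.5, 6.0, 6.5]}

# Larger |T| means the class means differ more relative to the spread.
scores = {g: abs(pooled_t(class_a[g], class_b[g])) for g in class_a}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Here `gene_1` has clearly different class means and receives a large |T|, while `gene_2` has identical class means and receives |T| = 0, so it would be the first candidate for removal.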
Blind Spot of the Univariate Analysis
  • The univariate analysis is not able to identify crucial features in the following two examples:

[Figure: example scatter plots with axes x and y]
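The original example figures are not recoverable from this transcript. One commonly used illustration of such a blind spot, assumed here, is an XOR-like layout: each feature taken alone has the same mean in both classes, so a per-feature T statistic is zero, yet the two features together determine the class exactly. The data points and the names `pooled_t` and `joint_rule` below are hypothetical.

```python
import math

def pooled_t(xs, ys):
    """Two-sample T statistic with pooled variance."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (m - 1)
    pooled = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 2)
    return (mean_x - mean_y) / math.sqrt(pooled * (1 / n + 1 / m))

# Hypothetical XOR-like data: class 0 on one diagonal, class 1 on the other.
class_0 = [(1, 1), (2, 2), (8, 8), (9, 9)]
class_1 = [(1, 8), (2, 9), (8, 1), (9, 2)]

# Univariate view: one T statistic per feature (index 0 is x, index 1 is y).
t_per_feature = [
    pooled_t([p[d] for p in class_0], [p[d] for p in class_1])
    for d in (0, 1)
]

def joint_rule(p):
    """Multivariate view: class 0 if x and y fall on the same side of 5."""
    return 0 if (p[0] > 5) == (p[1] > 5) else 1

joint_correct = (all(joint_rule(p) == 0 for p in class_0)
                 and all(joint_rule(p) == 1 for p in class_1))
print(t_per_feature, joint_correct)
```

Both per-feature statistics are zero, so a univariate filter would discard both features, even though a rule that looks at them jointly separates the classes without error.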

Multivariate Analysis
  • Because of the limitations of univariate analysis illustrated above, researchers have been investigating multivariate analysis for feature selection.