Dimension Augmenting Vector Machine (DAVM): A New General Classifier System for the Large p, Small n Problem
Dipak K. Dey
Department of Statistics
University of Connecticut, Currently Visiting SAMSI
This is joint work with S. Ghosh, IUPUI
and Y. Wang
University of Connecticut
Classifiers can be Linear (LDA, LR) or Non-Linear (QDA, KLR).
A few points to note:
$K(\cdot,\cdot)$ is a kernel function
$\|f\|_{\mathcal{H}_K}$ is the (second order) norm of $f$ in the RKHS

Introduction to RKHS (in one page)
Suppose our training data set is $\{(x_i, y_i)\}_{i=1}^{n}$, with $x_i \in \mathbb{R}^p$ and binary class labels $y_i$.
A general class of regularization problems is given by
$$\min_{f \in \mathcal{H}} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda J(f),$$
where $\lambda\,(>0)$ is a regularization parameter and $\mathcal{H}$ is the space of functions on which $J(f)$ is defined.
By the Representer theorem of Kimeldorf and Wahba, the solution to the above problem is finite dimensional:
$$\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i K(x, x_i).$$
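As an illustration only, the sketch below fits this representer form with a squared error loss (kernel ridge), for which the coefficients have the closed form $\alpha = (K + \lambda I)^{-1} y$; the function names and the Gaussian kernel parameterization are ours, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, theta=1.0):
    # K(a, b) = exp(-theta * ||a - b||^2) for every pair of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-theta * sq)

def fit_kernel_ridge(X, y, lam=0.1, theta=1.0):
    # With squared error loss the regularized solution keeps the representer
    # form f(x) = sum_i alpha_i K(x, x_i), with alpha = (K + lam I)^{-1} y.
    K = gaussian_kernel(X, X, theta)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_new, theta=1.0):
    return gaussian_kernel(X_new, X_train, theta) @ alpha
```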
SVM based Classification (hinge loss)
In SVM we have a special loss (the hinge loss) and roughness penalty:
$$\min_{f} \sum_{i=1}^{n} \big[1 - y_i f(x_i)\big]_{+} + \frac{\lambda}{2}\,\|f\|_{\mathcal{H}_K}^{2}.$$
By the Representer theorem of Kimeldorf and Wahba, the optimal solution to the above problem is
$$\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i K(x, x_i).$$
However, for SVM most of the $\hat\alpha_i$ are zero, resulting in huge data compression (in n).
In short, kernel based SVM performs classification by representing the original function as a linear combination of basis functions in a higher dimensional space.
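A minimal illustration of this sparsity in n, using scikit-learn's SVC (an assumption on our part; any SVM implementation would do): only the support vectors enter the fitted f(x).

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is available

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
# Only the support vectors carry nonzero alpha_i in f(x) = sum_i alpha_i K(x, x_i) + b;
# the remaining training points drop out, which is the compression in n.
print("support vectors:", len(clf.support_), "out of n =", len(y))
```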
To overcome the drawbacks of SVM, Zhu & Hastie (2005) introduced the IVM (import vector machine), based on KLR.
In IVM we replace the hinge loss with the NLL of the binomial distribution. Then we get a natural estimate of the classification probability,
$$\hat p(x) = \Pr(y = 1 \mid x) = \frac{e^{\hat f(x)}}{1 + e^{\hat f(x)}}.$$
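A one-line sketch of this probability estimate (the logistic link applied to the fitted f); the function name is illustrative.

```python
import numpy as np

def class_probability(f_x):
    # Natural probabilistic output of (kernel) logistic regression:
    # p(y = 1 | x) = exp(f(x)) / (1 + exp(f(x)))
    return 1.0 / (1.0 + np.exp(-np.asarray(f_x)))

print(class_probability([-2.0, 0.0, 2.0]))  # roughly [0.12, 0.50, 0.88]
```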
These advantages come at a cost: unlike SVM, KLR does not automatically set most of the $\alpha_i$ to zero. IVM therefore employs an algorithm to filter out only the few significant observations (in n) which help the classification most.
However, when n << p, it is much more meaningful if the compression is in p. (Why?)
Essentially, we are talking about simultaneous variable selection and classification in high dimensions.
Are there existing methods which already do that?
What about LASSO?
LASSO is a popular L1-penalized least squares method proposed by Tibshirani (1996) in the regression context.
Owing to the nature of the penalty and the choice of the bound t (≥ 0), LASSO produces a thresholding rule, setting many small β's exactly to zero.
Replacing the squared error loss by the NLL of the binomial distribution, LASSO can also do probabilistic classification.
The nonzero β's are the selected dimensions (compression in p).
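A small sketch, assuming scikit-learn is available, of L1-penalized logistic regression zeroing out most coefficients; the toy data and penalty strength are illustrative, not from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumption: scikit-learn

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))                 # n = 50 observations, p = 20 dimensions
beta = np.zeros(20)
beta[:3] = [2.0, -2.0, 1.5]                   # only the first three dimensions matter
y = (X @ beta + rng.normal(size=50) > 0).astype(int)

# L1-penalized logistic regression: binomial NLL plus an L1 penalty on beta
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("selected dimensions:", np.flatnonzero(lasso_lr.coef_[0]))
```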
However, LASSO is restricted to linear decision boundaries, which is a severe restriction.
We are going to propose a method based on KLR
and IVM which does not suffer from this drawback.
We will essentially exchange the roles of n and p in the IVM problem to achieve compression in terms of p.
For high dimensional problems many dimensions are just noise, hence filtering them out makes sense. (But how?)
Theorem 1: If the training data are separable in a subspace S, then they will be separable in any space $S' \supseteq S$.
Rough Sketch of Proof
Completely separable case:
[Figures: the maximal separating hyperplane in only one dimension vs. in two dimensions.]
For a non-linear, kernel based transformation this is not so obvious and the proof is a little technical.
Theorem 2: The distance (and hence the margin) between any two points is a non-decreasing function of the number of dimensions.
The proof is straightforward: each added coordinate contributes a non-negative term to the squared Euclidean distance.
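A quick numerical check of this monotonicity under the ordinary Euclidean distance (illustrative, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=10), rng.normal(size=10)

# Using only the first k coordinates, each added coordinate contributes the
# non-negative term (a_k - b_k)^2 to the squared distance, so the distance
# can only grow (or stay the same) as dimensions are added.
dists = [np.linalg.norm(a[:k] - b[:k]) for k in range(1, 11)]
assert all(d1 <= d2 + 1e-12 for d1, d2 in zip(dists, dists[1:]))
```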
We use the regularized KLR criterion
$$\min_{f} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \frac{\lambda}{2}\,\|f\|_{\mathcal{H}_K}^{2},$$
where $L$ is the binomial deviance (negative log-likelihood).
For an arbitrary subset of $q$ dimensions we may define the Gaussian kernel as
$$K_q(x_i, x_j) = \exp\Big\{-\theta \sum_{l=1}^{q} (x_{il} - x_{jl})^2\Big\},$$
where the sum runs over the selected dimensions and $\theta > 0$ is the kernel width parameter.
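A possible implementation of such a dimension-restricted Gaussian kernel (a sketch; the function name and vectorized form are ours):

```python
import numpy as np

def gaussian_kernel_subset(A, B, dims, theta=1.0):
    # Gaussian kernel evaluated only on the selected dimensions `dims`:
    # K_q(a, b) = exp(-theta * sum_{l in dims} (a_l - b_l)^2)
    As, Bs = A[:, dims], B[:, dims]
    sq = np.sum(As**2, 1)[:, None] + np.sum(Bs**2, 1)[None, :] - 2.0 * As @ Bs.T
    return np.exp(-theta * sq)
```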
Our optimization problem for dimension selection is then
$$\min_{\text{dims},\,\alpha}\; \sum_{i=1}^{n} L\Big(y_i, \sum_{j=1}^{n} \alpha_j K_q(x_i, x_j)\Big) + \frac{\lambda}{2}\,\alpha^{\top} K_q\, \alpha,$$
where $K_q$ is the Gaussian kernel evaluated on the currently selected dimensions.
Starting from a single dimension, we move towards the full set of p dimensions, augmenting one dimension at a time, until the desired accuracy is obtained.
At the heart of DAVM lies KLR. More specifically, to find the optimum value of $\alpha$ we may adopt any standard optimization method (e.g. Newton-Raphson) until some convergence criterion is satisfied.
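A sketch of KLR fitted by Newton-Raphson under the representer form $f = K\alpha$; the {0, 1} label coding and the small ridge added for numerical stability are our assumptions, not part of the slides.

```python
import numpy as np

def fit_klr(K, y, lam=0.1, n_iter=25):
    # Kernel logistic regression by Newton-Raphson, with y coded in {0, 1}:
    # minimize sum_i [-y_i f_i + log(1 + exp(f_i))] + (lam/2) * alpha' K alpha,
    # where f = K alpha (representer form).
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_iter):
        f = K @ alpha
        p = 1.0 / (1.0 + np.exp(-f))
        W = p * (1.0 - p)
        grad = K @ (p - y) + lam * (K @ alpha)
        hess = K @ (W[:, None] * K) + lam * K
        alpha -= np.linalg.solve(hess + 1e-8 * np.eye(n), grad)
    return alpha
```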
Optimality Theorem: If the training data are separable in the subspace S, and the solutions of the equivalent KLR problems in the subspace S and the full space L are $\hat f_S$ and $\hat f_L$ respectively, then $\hat f_S \to \hat f_L$ as $\lambda \to 0$.
Note: to show the optimality of the submodel, we are assuming the kernel is rich enough to completely separate the training data.
The convergence criterion used in IVM is not suitable for our purpose.
For the k-th iteration, define $a_k$ = the proportion of correctly classified training observations with k imported dimensions.
The algorithm stops if the ratio
$$\frac{|a_k - a_{k-1}|}{a_k} < \epsilon,$$
where $\epsilon$ is a prechosen small number (e.g. 0.001).
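Putting the pieces together, a hedged sketch of the dimension-augmenting loop with this stopping rule; it reuses the gaussian_kernel_subset and fit_klr sketches above and a greedy one-dimension-at-a-time search, which is our reading of the algorithm, not the authors' exact implementation.

```python
import numpy as np

def davm_select_dims(X, y, theta=1.0, lam=0.1, eps=0.001):
    # Greedy forward search over dimensions: at each step import the dimension
    # that most improves training accuracy (via KLR on the restricted Gaussian
    # kernel) and stop once the relative gain falls below eps.
    selected, remaining, acc_history = [], list(range(X.shape[1])), []
    while remaining:
        best_dim, best_acc = None, -1.0
        for d in remaining:
            K = gaussian_kernel_subset(X, X, selected + [d], theta)
            alpha = fit_klr(K, y, lam)
            acc = np.mean(((K @ alpha) > 0).astype(float) == y)
            if acc > best_acc:
                best_dim, best_acc = d, acc
        selected.append(best_dim)
        remaining.remove(best_dim)
        acc_history.append(best_acc)
        if len(acc_history) > 1:
            a_prev, a_curr = acc_history[-2], acc_history[-1]
            if abs(a_curr - a_prev) / max(a_curr, 1e-12) < eps:
                break
    return selected, acc_history
```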
We choose the optimal value of $\lambda$ (the regularization parameter) by decreasing it from a larger value to a smaller value until we reach the smallest misclassification error rate on the training set.
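An illustrative grid search for $\lambda$ in this spirit, on toy data; it reuses the gaussian_kernel_subset and fit_klr sketches above (assumptions, not the authors' code).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
K = gaussian_kernel_subset(X, X, dims=[0, 1], theta=1.0)

best_lam, best_err = None, np.inf
for lam in 2.0 ** np.arange(10, -11, -1):                 # from 2^10 down to 2^-10
    alpha = fit_klr(K, y, lam=lam)
    err = np.mean(((K @ alpha) > 0).astype(float) != y)   # training misclassification rate
    if err < best_err:
        best_lam, best_err = lam, err
print("chosen lambda:", best_lam, "training error:", best_err)
```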
We have tested our algorithm on three data sets to study the performance of DAVM.
We deliberately added eight more dimensions and filled them with white noise.
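An illustrative toy version of such a design, two informative dimensions plus eight white-noise dimensions (not the actual data used in the talk):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
informative = rng.normal(size=(n, 2))                        # two informative dimensions
labels = (informative[:, 0]**2 + informative[:, 1]**2 > 1.5).astype(float)
noise = rng.normal(size=(n, 8))                              # eight white-noise dimensions
X = np.hstack([informative, noise])                          # p = 10, only the first two matter
```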
We chose ε = 0.05. For θ we searched over $[2^{-6}, 2^{6}]$, and for λ we searched over $[2^{-10}, 2^{10}]$.
Only the dimensions selected by DAVM (i.e. the first two) are used for the final classification of the test data.
DAVM correctly selects the two informative dimensions.
DAVM performs better than SVM on all occasions.
Key Features of DAVM