New Algorithms for Efficient High-Dimensional Nonparametric Classification

Ting Liu, Andrew W. Moore, and Alexander Gray

- Introduction
- k Nearest Neighbors (k-NN)
- KNS1: conventional k-NN search

- New algorithms for k-NN classification
- KNS2: for skewed-class data
- KNS3: “are at least t of the k-NN positive?”

- Results
- Comments

- k-NN
- Nonparametric classification method.
- Given a data set of n data points, it finds the k points closest to a query point q and chooses the label held by the majority of them.
- The computational cost is too high in many applications, especially in the high-dimensional case.
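
As a reference point for the cost being discussed, here is a minimal brute-force sketch of k-NN classification. The function name `knn_classify` and the toy data are illustrative assumptions, not part of the presentation; each query scans all n points.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Brute-force k-NN: rank all n points by distance, keep the k closest, vote."""
    # Sort point indices by Euclidean distance to the query (O(n log n) per query).
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    nearest = order[:k]
    # Majority vote over the labels of the k nearest points.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): three negative points near the origin, two positive far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
labels = ["neg", "neg", "neg", "pos", "pos"]
print(knn_classify(points, labels, (0.05, 0.05), k=3))
```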

- KNS1:
- Conventional k-NN search with ball-tree.
- Ball-Tree (binary):
- Root node represents full set of points.
- Leaf node contains some points.
- Non-leaf node has two children nodes.
- Pivot of a node: one of the points in the node, or the centroid of the points.
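- Radius of a node: the distance from the pivot to the farthest point in the node, $\mathrm{Radius}(Node) = \max_{x \in Node} |x - \mathrm{Pivot}(Node)|$.

- Bound the distance from a query point q to any point x in the node: $|q - \mathrm{Pivot}(Node)| - \mathrm{Radius}(Node) \le |x - q| \le |q - \mathrm{Pivot}(Node)| + \mathrm{Radius}(Node)$.
- Trade off the cost of construction against the tightness of the radius of the balls.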
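
The ball-tree structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the pivot is the centroid, the radius is the distance from the pivot to the farthest point in the node, and the split rule (two far-apart anchor points) is a hypothetical choice, not necessarily the construction used by the authors. The class name `BallNode` and the `leaf_size` parameter are likewise made up for the example.

```python
import math

class BallNode:
    def __init__(self, points, leaf_size=2):
        self.points = points
        dim = len(points[0])
        # Pivot: here the centroid of the points owned by the node.
        self.pivot = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
        # Radius: distance from the pivot to the farthest owned point.
        self.radius = max(math.dist(self.pivot, p) for p in points)
        self.children = []
        if len(points) > leaf_size:
            # Hypothetical split rule: pick two far-apart anchors a and b,
            # then give each point to the closer anchor.
            a = max(points, key=lambda p: math.dist(p, points[0]))
            b = max(points, key=lambda p: math.dist(p, a))
            left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
            right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
            if left and right:  # stay a leaf if all points coincide
                self.children = [BallNode(left, leaf_size), BallNode(right, leaf_size)]

    def distance_bounds(self, q):
        """Lower and upper bound on |x - q| for every point x in this node."""
        d = math.dist(q, self.pivot)
        return max(d - self.radius, 0.0), d + self.radius

root = BallNode([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)])
print(root.distance_bounds((10.0, 10.0)))
```

Both bounds follow from the triangle inequality; a cheaper construction generally yields looser balls and therefore weaker bounds.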

- Recursive procedure: PSout = BallKNN(PSin, Node)
- PSin consists of the k-NN of q in V (the set of points searched so far).
- PSout consists of the k-NN of q in V and Node.
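
The recursive procedure can be sketched as below. This is an illustrative reading, not the authors' code: the dictionary layout of a node, the helper `build`, the centroid pivot, and the closer-child-first visiting order are all assumptions made for the example. Only the contract (PSin holds the k-NN of q in V, PSout holds the k-NN of q in V and Node) comes from the slide.

```python
import heapq
import math

def build(points, leaf_size=2):
    """Build a ball-tree node as a dict (hypothetical layout, centroid pivot)."""
    dim = len(points[0])
    pivot = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    node = {"points": points, "pivot": pivot,
            "radius": max(math.dist(pivot, p) for p in points), "children": []}
    if len(points) > leaf_size:
        a = max(points, key=lambda p: math.dist(p, points[0]))
        b = max(points, key=lambda p: math.dist(p, a))
        left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
        if left and right:
            node["children"] = [build(left, leaf_size), build(right, leaf_size)]
    return node

def ball_knn(ps_in, node, q, k):
    """Return PSout given PSin, a list of (distance, point) pairs for the k-NN of q in V."""
    d_pivot = math.dist(q, node["pivot"])
    # Distance to the current k-th neighbor, or infinity if fewer than k are known.
    d_sofar = max(d for d, _ in ps_in) if len(ps_in) >= k else math.inf
    # Prune: every point in the ball is at least d_pivot - radius away from q,
    # so none of them can beat the current k-th neighbor.
    if d_pivot - node["radius"] >= d_sofar:
        return ps_in
    if not node["children"]:
        # Leaf: merge the node's points with PSin and keep the k closest.
        candidates = ps_in + [(math.dist(q, p), p) for p in node["points"]]
        return heapq.nsmallest(k, candidates)
    # Visit the child with the closer pivot first so the bound tightens early.
    ps = ps_in
    for child in sorted(node["children"], key=lambda c: math.dist(q, c["pivot"])):
        ps = ball_knn(ps, child, q, k)
    return ps

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (9.0, 9.0)]
print(ball_knn([], build(pts), (0.2, 0.1), k=3))
```

Starting the recursion with an empty PSin at the root returns the k-NN of q over the whole data set.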

- KNS2:
- For skewed-class data: one class is much more frequent than the other.
- Find how many of the k nearest neighbors are in the positive class without explicitly finding the k-NN set.
- Basic idea:
- Build two ball-trees: Postree (small), Negtree
- “Find positive”: search Postree to find the k-NN set Possetk using KNS1.
- “Insert negative”: search Negtree, using Possetk as bounds to prune nodes that are far away and to estimate the number of negative points to be inserted into the true nearest-neighbor set.

- Definitions:
- Dists = {Dist1, …, Distk}: the distances to the k nearest positive neighbors of q, sorted in increasing order.
- V: the set of points in the negative balls visited so far.
- (n, C): n is the number of positive points in the k-NN of q; C = {C1, …, Cn}, where Ci is the number of negative points in V that are closer to q than the ith positive neighbor.

Step 2, “insert negative”, is implemented by the recursive function

(nout, Cout) = NegCount(nin, Cin, Node, jparent, Dists)

(nin, Cin) summarize the interesting negative points for V;

(nout, Cout) summarize the interesting negative points for V and Node.
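
To make the quantities (n, C) concrete, here is a brute-force reference sketch. It is not NegCount: it uses no ball-tree and no pruning, and it takes V to be the entire negative set. The function name, the toy data, and the rule "the ith positive neighbor belongs to the k-NN when i + Ci <= k" (which ignores distance ties) are assumptions introduced for illustration.

```python
import bisect
import math

def count_positive_in_knn(pos, neg, q, k):
    """Return (n, C) with V taken to be all negative points (no pruning)."""
    # Step 1, "find positive": distances to the k nearest positive points, ascending.
    dists = sorted(math.dist(q, p) for p in pos)[:k]
    # Step 2, "insert negative": C[i] counts negatives strictly closer than dists[i].
    neg_d = sorted(math.dist(q, x) for x in neg)
    C = [bisect.bisect_left(neg_d, d) for d in dists]
    # The i-th positive neighbor (1-based) is among the k-NN iff i + C_i <= k,
    # because exactly (i - 1) + C_i points are closer to q than it is.
    n = 0
    while n < len(dists) and (n + 1) + C[n] <= k:
        n += 1
    return n, C[:n]

# Toy 2-D data (hypothetical): positives at distance 1 and 4, negatives at 0.5, 2, 3, 6.
pos = [(1.0, 0.0), (4.0, 0.0)]
neg = [(0.5, 0.0), (2.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
print(count_positive_in_knn(pos, neg, (0.0, 0.0), k=3))  # prints (1, [1])
```

A tree-based version would reach the same n while skipping negative balls that the distance bounds show cannot change any Ci that matters.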

- KNS3
- “Are at least t of the k nearest neighbors positive?”
- No skewness constraint on the classes.
- Proposition:
- Instead of directly computing the exact values, we compute lower and upper bounds, since

m+t=k+1

P is a set of balls from Postree, N consists of balls from Negtree.
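
The statement of the proposition is missing from this transcript. One reading consistent with m+t=k+1, offered here as an assumption rather than as the authors' wording, is that at least t of the k nearest neighbors are positive exactly when the t-th nearest positive point is no farther from q than the m-th nearest negative point. The brute-force sketch below compares that test with a direct count; distance ties are ignored, t is assumed to lie between 1 and k, and the function names are hypothetical.

```python
import math

def at_least_t_positive(pos, neg, q, k, t):
    """Direct answer: count positives among the k nearest points overall."""
    labeled = sorted([(math.dist(q, p), 1) for p in pos] +
                     [(math.dist(q, x), 0) for x in neg])
    return sum(label for _, label in labeled[:k]) >= t

def via_distance_test(pos, neg, q, k, t):
    """Assumed reading: compare the t-th positive with the m-th negative distance."""
    m = k + 1 - t
    pos_d = sorted(math.dist(q, p) for p in pos)
    neg_d = sorted(math.dist(q, x) for x in neg)
    if t > len(pos_d):
        return False  # fewer than t positive points exist at all
    # With fewer than m negatives, the m-th negative is treated as infinitely far.
    dist_m_neg = neg_d[m - 1] if m <= len(neg_d) else math.inf
    return pos_d[t - 1] <= dist_m_neg

# 1-D toy data (hypothetical) with distinct distances from q = 0.
pos = [(1.0,), (4.0,), (7.0,), (10.0,)]
neg = [(2.0,), (3.0,), (5.0,), (6.0,), (8.0,), (9.0,)]
print(via_distance_test(pos, neg, (0.0,), k=9, t=5))
```

Under this reading only two distances need to be compared, which is why lower and upper bounds on them can settle the question without computing either exactly.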

- Real data

k = 9, t = ceiling(k/2).

Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points).

Train on the remaining 87372 data points.
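
The sampling scheme above could be reproduced along these lines. The function name `split_indices`, the fixed seed, the rounding rule, and the synthetic labels are assumptions made for the sketch; the real data set and its exact sampling procedure are not given here.

```python
import math
import random

def split_indices(labels, pos_frac=0.5, neg_frac=0.01, seed=0):
    """Per-class random split: pos_frac of positives and neg_frac of negatives go to test."""
    rng = random.Random(seed)
    test = []
    for cls, frac in ((1, pos_frac), (0, neg_frac)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Rounding to the nearest integer is an assumption of this sketch.
        test += rng.sample(idx, round(frac * len(idx)))
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)

k = 9
t = math.ceil(k / 2)  # t = 5 for k = 9

# Synthetic skewed labels (hypothetical): 1000 negatives, 20 positives.
labels = [0] * 1000 + [1] * 20
train, test = split_indices(labels)
print(len(train), len(test), t)
```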

- Why k-NN? Baseline
- No free lunch:
- For uniformly distributed high-dimensional data, the algorithms bring no benefit.
- The results on real data suggest that its intrinsic dimensionality is much lower than its nominal dimensionality.