# New Algorithms for Efficient High-Dimensional Nonparametric Classification


### New Algorithms for Efficient High-Dimensional Nonparametric Classification

Ting Liu, Andrew W. Moore, and Alexander Gray

Overview

• Introduction

• k Nearest Neighbors (k-NN)

• KNS1: conventional k-NN search

• New algorithms for k-NN classification

• KNS2: for skewed-class data

• KNS3: "are at least t of the k-NN positive?"

• Results

Introduction: k-NN

• k-NN

• Nonparametric classification method.

• Given a data set of n data points, it finds the k points closest to a query point q and assigns q the label held by the majority of them.

• The computational cost is too high for many applications, especially in the high-dimensional case.
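To make the cost concrete, here is a minimal brute-force sketch of k-NN classification. The function name `knn_classify` and the toy data are illustrative assumptions, not taken from the slides; every query scans all n points, which is the cost the later algorithms try to avoid.

```python
import math
from collections import Counter

def knn_classify(points, labels, q, k):
    """Brute-force k-NN: rank all n points by distance to q, then vote."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], q))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): three negatives near the origin, two positives near (1, 1).
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.2, 0.1)]
labels = ["neg", "neg", "pos", "pos", "neg"]
print(knn_classify(points, labels, (0.15, 0.15), k=3))
```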

Introduction: KNS1

• KNS1:

• Conventional k-NN search with ball-tree.

• Ball-Tree (binary):

• Root node represents full set of points.

• Leaf node contains some points.

• Non-leaf node has two children nodes.

• Pivot of a node: one of the points in the node, or the centroid of the points.
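The structure described in the bullets above can be sketched as follows. This is only one possible construction: the centroid pivot is one of the two options the slide mentions, while the `BallNode` class, the `leaf_size` parameter, and the far-pair split heuristic are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class BallNode:
    pivot: Point                       # here: centroid of the node's points
    radius: float                      # max distance from pivot to any point in the node
    points: List[Point]                # non-empty only at leaves
    left: Optional["BallNode"] = None
    right: Optional["BallNode"] = None

def build_ball_tree(pts, leaf_size=2):
    pivot = tuple(sum(c) / len(pts) for c in zip(*pts))
    radius = max(math.dist(pivot, p) for p in pts)
    if len(pts) <= leaf_size:
        return BallNode(pivot, radius, list(pts))
    # Assumed split heuristic: pick two far-apart points, send each point to the closer one.
    a = max(pts, key=lambda p: math.dist(pivot, p))
    b = max(pts, key=lambda p: math.dist(a, p))
    left = [p for p in pts if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in pts if math.dist(p, a) > math.dist(p, b)]
    if not right:                      # all points coincide, so stop splitting
        return BallNode(pivot, radius, list(pts))
    return BallNode(pivot, radius, [],
                    build_ball_tree(left, leaf_size),
                    build_ball_tree(right, leaf_size))
```

A smaller `leaf_size` gives tighter balls at the leaves at the price of a deeper tree and a more expensive build.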

Introduction: KNS1

• Bound the distance from a query point q:

• Trade off the cost of construction against the tightness of the radius of the balls.
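The bound itself did not survive in this transcript. A standard form, which follows from the triangle inequality, is given here as an assumption about what the slide showed; the names Node.Pivot and Node.Radius are illustrative, with Node.Radius taken to be large enough to cover every point owned by the node.

```latex
\[
\max\bigl(\lVert q - \mathrm{Node.Pivot}\rVert - \mathrm{Node.Radius},\; 0\bigr)
\;\le\; \lVert x - q \rVert \;\le\;
\lVert q - \mathrm{Node.Pivot}\rVert + \mathrm{Node.Radius}
\qquad \text{for every } x \in \mathrm{Node}.
\]
```

A tighter radius makes the lower bound larger, so more nodes can be skipped during search; that is the tightness side of the trade-off.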

Introduction: KNS1

• Recursive procedure: PSout = BallKNN(PSin, Node)

• PSin consists of the k-NN of q in V (the set of points searched so far).

• PSout consists of the k-NN of q in V ∪ Node.
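The recursion above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the dictionary-based tree, the compact `build` helper, the heap representation of PSin, and the nearer-child-first visiting order are all assumptions.

```python
import heapq
import math

def build(pts, leaf_size=2):
    """Tiny ball-tree builder (assumed: centroid pivot, far-pair split)."""
    pivot = tuple(sum(c) / len(pts) for c in zip(*pts))
    node = {"pivot": pivot, "radius": max(math.dist(pivot, p) for p in pts),
            "points": pts, "children": []}
    a = max(pts, key=lambda p: math.dist(pivot, p))
    b = max(pts, key=lambda p: math.dist(a, p))
    left = [p for p in pts if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in pts if math.dist(p, a) > math.dist(p, b)]
    if len(pts) > leaf_size and right:
        node["children"] = [build(left, leaf_size), build(right, leaf_size)]
    return node

def ball_knn(ps_in, node, q, k):
    """PS_out = BallKNN(PS_in, Node). ps_in holds the k-NN of q among the
    points visited so far, stored as a max-heap of (-distance, point)."""
    lower = max(math.dist(q, node["pivot"]) - node["radius"], 0.0)
    # Prune: no point inside this ball can beat the current k-th neighbor.
    if len(ps_in) == k and lower >= -ps_in[0][0]:
        return ps_in
    if not node["children"]:           # leaf: test every point it holds
        for p in node["points"]:
            d = math.dist(q, p)
            if len(ps_in) < k:
                heapq.heappush(ps_in, (-d, p))
            elif d < -ps_in[0][0]:
                heapq.heapreplace(ps_in, (-d, p))
        return ps_in
    # Visit the child with the nearer pivot first to tighten the bound early.
    for child in sorted(node["children"], key=lambda c: math.dist(q, c["pivot"])):
        ps_in = ball_knn(ps_in, child, q, k)
    return ps_in

def knn(tree, q, k):
    heap = ball_knn([], tree, q, k)
    return [p for _, p in sorted(heap, reverse=True)]  # nearest first
```

Calling `knn(build(points), q, k)` should return the same neighbors as a full scan; the saving comes only from the nodes the prune test skips.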

KNS2

• KNS2:

• For skewed-class data: one class is much more frequent than the other.

• Find the number of the k-NN that belong to the positive class without explicitly finding the k-NN set.

• Basic idea:

• Build two ball-trees: Postree (small), Negtree

• “Find positive”: Search Postree with KNS1 to find the k nearest positive neighbors, Posset_k.

• “Insert negative”: Search Negtree, using Posset_k as bounds to prune distant nodes and to estimate the number of negative points to be inserted into the true nearest-neighbor set.

KNS2

• Definitions:

• Dists = {Dist_1, …, Dist_k}: the distances from q to its k nearest positive neighbors, sorted in increasing order.

• V: the set of points in the negative balls visited so far.

• (n, C): n is the number of positive points in the k-NN of q.

• C = {C_1, …, C_n}, where C_i is the number of negative points in V closer to q than the i-th positive neighbor.
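As a reference for what these quantities mean, the sketch below computes them by brute force, taking V to be the full set of negative points (the state once the whole Negtree has been accounted for). It shows what KNS2 is after, not how KNS2 computes it; the function name `kns2_reference` and the toy data are assumptions, and distance ties are ignored.

```python
import math

def kns2_reference(pos, neg, q, k):
    """Brute-force reference for the quantities KNS2 tracks.

    Returns (dists, counts, n_pos) where
      dists[i]  = distance from q to its (i+1)-th nearest positive point,
      counts[i] = number of negative points strictly closer to q than that,
      n_pos     = number of positive points among the k-NN of q.
    """
    dists = sorted(math.dist(q, p) for p in pos)[:k]
    neg_d = [math.dist(q, x) for x in neg]
    counts = [sum(1 for d in neg_d if d < dist_i) for dist_i in dists]
    # The i-th positive (1-based) has rank counts[i-1] + i among all points,
    # so it belongs to the k-NN exactly when that rank is at most k.
    n_pos = sum(1 for i, c in enumerate(counts, start=1) if c + i <= k)
    return dists, counts, n_pos

# Toy skewed data (hypothetical): two positives, four negatives on a line.
pos = [(0.0, 0.0), (3.0, 0.0)]
neg = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
dists, counts, n_pos = kns2_reference(pos, neg, (0.4, 0.0), k=3)
print(counts, n_pos)
```

Only the counts matter for the final answer, which is why KNS2 can avoid materializing the negative members of the k-NN set.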

KNS2

Step 2, “Insert negative”, is implemented by the recursive function

(n_out, C_out) = NegCount(n_in, C_in, Node, j_parent, Dists)

(n_in, C_in) summarize the interesting negative points for V;

(n_out, C_out) summarize the interesting negative points for V and Node.

KNS3

• KNS3

• “are at least t of k nearest neighbors positive?”

• No assumption that the classes are skewed.

• Proposition:

• Instead of directly computing the exact values, we compute lower and upper bounds, since

m+t=k+1
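The proposition's statement is missing from this transcript. The sketch below assumes it has the following form: at least t of the k-NN of q are positive exactly when the distance to the t-th nearest positive point is no larger than the distance to the m-th nearest negative point, with m + t = k + 1. Both function names are hypothetical, the data set is assumed to hold at least k points, and distance ties are ignored.

```python
import math

def at_least_t_positive(pos, neg, q, k, t):
    """Direct definition: take the k-NN of q and count the positives."""
    tagged = [(math.dist(q, p), 1) for p in pos] + [(math.dist(q, x), 0) for x in neg]
    return sum(label for _, label in sorted(tagged)[:k]) >= t

def at_least_t_positive_via_distances(pos, neg, q, k, t):
    """Assumed form of the proposition: compare the t-th nearest positive
    distance with the m-th nearest negative distance, where m + t = k + 1."""
    m = k + 1 - t
    pos_d = sorted(math.dist(q, p) for p in pos)
    neg_d = sorted(math.dist(q, x) for x in neg)
    # A class with too few points can never supply its t-th (or m-th) neighbor.
    dist_t_pos = pos_d[t - 1] if len(pos_d) >= t else math.inf
    dist_m_neg = neg_d[m - 1] if len(neg_d) >= m else math.inf
    return dist_t_pos <= dist_m_neg
```

Because only the comparison matters, any lower bound on one distance that exceeds an upper bound on the other settles the question without computing either distance exactly, which is the point of working with bounds.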

KNS3

P is a set of balls from Postree, N consists of balls from Negtree.

Experimental results

• Real data

Experimental results

k = 9, t = ceiling(k/2)

Randomly pick 1% of the negative records and 50% of the positive records as the test set (986 points)

Train on the remaining 87372 data points

• Why k-NN? Baseline

• No free lunch:

• For uniform high-dimensional data, the algorithms bring no benefit.

• The results suggest that the intrinsic dimensionality of the real data is much lower than its nominal dimensionality.