
Presented by: Ding-Ying Chiu Date: 2008/10/24

This presentation discusses index methods, the curse of dimensionality, classifiers, and their applications in classification and image retrieval. The focus is on dimension pruning and the use of RCE-network as a classifier. Further, experiments, including active learning and feedback, are explored.


Presentation Transcript


  1. Index methods, Curse of Dimensionality, and classifiers Presented by: Ding-Ying Chiu Date: 2008/10/24

  2. Outline • Motivation • Index methods • Estimate a lower bound • Classifier: RCE-network • Curse of Dimensionality • VA-file • Our method - dimension pruning • Experiments

  3. Classification & Image Retrieval [16] - Active Learning • Feedback

  4. Classification & Image Retrieval [16] - Two classes

  5. Classification & Image Retrieval [16] - Repeatedly

  6. Classification & Image Retrieval [16] - Concept

  7. Classification & Image Retrieval [16] - Terminate

  8. Related work - Index methods • Coordinate-based • Space-partitioning: K-D-tree [12], K-D-B-tree [3] • Data-partitioning: R-tree [12], SS-tree [5] • Distance-based: multiple reference points [7][8], M-tree [9]

  9. Coordinate-based, space-partitioning [12] - Grid • Main idea: estimate a lower bound for each block, so the 1-nearest neighbor is found with few disk accesses • Example pruning rate: 30/40 = 0.75

  10. Coordinate-based, data-partitioning [12] • Main idea: estimate a lower bound from each node's bounding region to find the 1-nearest neighbor of the query • The number of data contained in a node lies between m and M, with m ≤ M/2 (example: m = 3, M = 6)

  11. Distance-based - Multiple reference points [7][8] (dimension = 30) • Main idea: estimate a lower bound from a reference point r; by the triangle inequality, |d(r, x1) - d(r, q)| ≤ d(q, x1)

  12. Distance-based - Multiple reference points [7][8] • The same bound applies to x2: |d(r, x2) - d(r, q)| ≤ d(q, x2)

  13. Distance-based - Multiple reference points [7][8] • In general, |d(r, xi) - d(r, q)| ≤ d(q, xi), so xi can be pruned whenever the bound already exceeds the current best distance

  14. Distance-based - Multiple reference points [7][8] • With several reference points, the largest of the per-reference lower bounds is used for pruning
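
The pruning rule behind slides 11-14 is the triangle inequality: |d(r, x) - d(r, q)| is a lower bound on d(q, x), so x can be skipped whenever that bound already exceeds the best distance found so far. A minimal sketch of a 1-NN scan with several reference points (function and variable names are ours, not the structures of [7][8]):

```python
import numpy as np

def nn_with_reference_points(query, data, refs):
    """1-NN search sketch: precomputed distances to reference points give
    the triangle-inequality lower bound |d(r, x) - d(r, q)| <= d(q, x),
    which lets many full distance computations be skipped."""
    # In a real index these distances are stored, not recomputed per query.
    d_ref_data = np.linalg.norm(data[:, None, :] - refs[None, :, :], axis=2)
    d_ref_query = np.linalg.norm(refs - query, axis=1)

    best_dist, best_idx = np.inf, -1
    for i, x in enumerate(data):
        # Best (largest) lower bound over all reference points
        lower = np.max(np.abs(d_ref_data[i] - d_ref_query))
        if lower >= best_dist:
            continue  # x cannot beat the current best: pruned, no full distance
        d = np.linalg.norm(query - x)
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist

# Example: 10,000 points in a 30-dimensional space, 5 reference points
rng = np.random.default_rng(0)
data = rng.random((10_000, 30))
refs = rng.random((5, 30))
print(nn_with_reference_points(rng.random(30), data, refs))
```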

  15. Advantages & disadvantages - R-tree & Grid with static data (example: m = 2, M = 5)

  16. Advantages & disadvantages - R-tree & Grid with dynamic data (example: m = 2, M = 5; application: moving cars)

  17. The advantages of distance-based index - Unknown coordinates [11] • Distance-based index methods can be used in a space in which the coordinates are unknown • The dimension of the space F, to which the instances are projected, can be very high, possibly infinite

  18. The advantages of distance-based index - Characteristics of the space F [11] • There are no explicit coordinates; the only operator is the kernel Kn(x1, x2), which measures the similarity between any two instances • Since Kn(x, x) = 1, all instances lie on the surface of a unit hypersphere in F
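
Because Kn(x, x) = 1, distances in F can be computed from kernel values alone: d(x1, x2)² = K(x1, x1) + K(x2, x2) - 2 K(x1, x2) = 2 - 2 K(x1, x2). A small sketch using an RBF kernel as a stand-in (the actual kernel in [11] may differ):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # K(x, x) = exp(0) = 1, so every image lies on a unit hypersphere in F
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def kernel_distance(x1, x2, kernel=rbf_kernel):
    """Distance in F without explicit coordinates:
    d(x1, x2)^2 = K(x1, x1) + K(x2, x2) - 2 K(x1, x2) = 2 - 2 K(x1, x2)."""
    return np.sqrt(max(0.0, 2.0 - 2.0 * kernel(x1, x2)))

print(kernel_distance([0.0, 0.0], [1.0, 1.0]))
```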

  19. The advantages of distance-based index - Query processing [11] • 1-nearest neighbor

  20. Classifier and index methods • Neural network: weights are adjusted step by step, x'(new) = x'(old) + Δx • SVM: the separating hyperplane f(x) is found with Lagrange multipliers; with the classes at y = 1 and y = -1, the margin is |d+| + |d-|

  21. RCE-network - Constraints • The RCE-network uses circles to cover the training data, subject to: (a) every training datum must be covered by a circle; (b) the training data covered by one circle must all be in the same class

  22. RCE-network - Structure • Input layer, hidden layer, and output layer • Each hidden unit stores a center (w11, w21, …, wn1) and a radius r1, and feeds the output unit of its class (e.g., C1)

  23. RCE-network algorithm - Rajan's algorithm [1] • Input: training data, an initial radius, and a radius reduction rate α (0 < α < 1) • Output: an RCE-network

  24. RCE-network algorithm - Rajan's algorithm [1] • Same input and output as above • Drawback: the algorithm needs to scan the data many times

  25. =0.5 RCE-network algorithm Mu’s algorithm [2] Input: training data, radius reduction rate  (0<1) Output: An RCE-network The algorithm will produce a large number of circles

  26. RCE-network algorithm - Our method: Radius Expansion algorithm (RE) • Built on 1-NN queries and range queries over the training data

  27. Curse of Dimensionality - (Max-Min)/Min (from Introduction to Data Mining) • When dimensionality increases, data become increasingly sparse in the space they occupy • Definitions of density and of distance between points, which are critical for clustering and outlier detection, become less meaningful • Experiment: randomly generate 500 points and compute (Max - Min)/Min, the relative difference between the maximum and minimum distance over all pairs of points
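
The 500-point experiment is easy to reproduce; a short simulation (points drawn from a uniform unit cube, which is our assumption) shows (Max - Min)/Min collapsing toward 0 as the dimension grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    pts = rng.random((500, dim))     # 500 random points, as on the slide
    d = pdist(pts)                   # all pairwise Euclidean distances
    print(dim, (d.max() - d.min()) / d.min())
```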

  28. (Max-Min)/Min - 1D & 2D, 11 nodes • Randomly generate 11 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 1, so Max - Min = 9 and (Max - Min)/Min = 9 • 2D: Max ≈ 14.14 (the diagonal); solving (14.14 - x)/x = 9 gives x ≈ 1.414 • π x² = 3.14159 × 1.414² ≈ 6.28, and 6.28/4 ≈ 1.57 (the area of a disc of radius x/2 around each node); 1.57/(10 × 10) = 0.0157 is the fraction of the square each node's neighborhood fills

  29. (Max-Min)/Min - 1D & 2D, 101 nodes • Randomly generate 101 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 0.1, so Max - Min = 9.9 and (Max - Min)/Min = 9.9/0.1 = 99 • 2D: Max ≈ 14.14; solving (14.14 - x)/x = 99 gives x ≈ 0.1414 • π x² = 3.14159 × 0.1414² ≈ 0.0628, and 0.0628/4 ≈ 0.0157; 0.0157/(10 × 10) = 0.000157

  30. (Max-Min)/Min - 1D & 3D, 11 nodes • Randomly generate 11 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 1, Max - Min = 9 • 3D: Max ≈ 17.32; solving (17.32 - x)/x = 9 gives x ≈ 1.732 • (4/3) π x³ = (4/3) × 3.14159 × 1.732³ ≈ 21.76, and 21.76/8 ≈ 2.72 (the volume of a ball of radius x/2 around each node); 2.72/(10 × 10 × 10) = 0.00272
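
Reading the divisions by 4 and 8 on slides 28 and 30 as the area or volume of a ball of radius Min/2 around each node (our interpretation of the figures), the three calculations can be checked in a few lines:

```python
import math

# 11 evenly spread nodes on [0, 10]: Max = 10, Min = 1, ratio = (10 - 1)/1 = 9
for dim in (1, 2, 3):
    max_d = 10 * math.sqrt(dim)          # cube diagonal of [0, 10]^dim
    x = max_d / (9 + 1)                  # solve (max_d - x)/x = 9 for x
    # Ball of radius x/2 around each node, as a fraction of the cube's volume
    # (prints 0.1, 0.0157, 0.00272 as on the slides)
    if dim == 1:
        ball = x
    elif dim == 2:
        ball = math.pi * (x / 2) ** 2            # = pi * x^2 / 4
    else:
        ball = (4 / 3) * math.pi * (x / 2) ** 3  # = (4/3) * pi * x^3 / 8
    print(dim, round(x, 4), ball / 10 ** dim)
```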

  31. (Max-Min)/Min - Coordinate-based, data-partitioning • When (Max - Min)/Min → 0, the bounding regions of an R-tree can no longer prune

  32. (Max-Min)/Min - Distance-based • When (Max - Min)/Min → 0, d(r, x1) ≈ d(r, q) for every datum, so the reference-point lower bound |d(r, x1) - d(r, q)| approaches 0 and prunes nothing

  33. Related work • A Survey on High Dimensional Spaces and Indexing • COMP 530: Database Architecture and Implementation (The Hong Kong University of Science and Technology) • Wu Hai Liang, Lam Man Lung, Lo Ming Fun, Yuen Chi Kei, Ng Chun Bong

  34. Index method - Distance-based • Experiment with 10,000 uniform data

  35. VA-file [13] - Main idea: bit vector • For every partitioning and clustering method there is a dimensionality d such that, on average, all blocks are accessed if the number of dimensions exceeds d • The VA-file therefore gives up on partitioning and performs a linear scan over compact bit-vector approximations of the data

  36. VA-file [13] - Bounds • During the linear scan, a lower bound and an upper bound on the true distance are computed from each point's approximation cell without any disk access (in the figure, a lower bound of 0.45 for the point approximated near (0.7, 0.3))
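
A sketch of how such bounds can be computed from a cell approximation; the function name and the (cell_lo, cell_hi) representation are ours, not the exact VA-file layout of [13]:

```python
import numpy as np

def va_bounds(query, cell_lo, cell_hi):
    """VA-file style bounds from a grid-cell approximation [cell_lo, cell_hi]:
    the true distance from the query to any point inside the cell lies between
    the returned (lower, upper) pair, so candidates can be filtered during a
    linear scan of the approximations with no disk access."""
    query, cell_lo, cell_hi = map(np.asarray, (query, cell_lo, cell_hi))
    # Per-dimension distance to the nearest / farthest edge of the cell
    near = np.maximum(0.0, np.maximum(cell_lo - query, query - cell_hi))
    far = np.maximum(np.abs(query - cell_lo), np.abs(query - cell_hi))
    return np.sqrt((near ** 2).sum()), np.sqrt((far ** 2).sum())

# Example: a 2D query against one cell of a 4x4 grid
print(va_bounds([0.1, 0.1], [0.5, 0.25], [0.75, 0.5]))
```

A point's page is read from disk only if its lower bound beats the current k-th best upper bound.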

  37. Training data - No disk access

  38. Our pruning method • Step 1: find an approximate nearest neighbor A of the query Q • Step 2: dimension pruning: find the 1-nearest neighbor of Q among the training data whose classes are different from that of A

  39. Dimension pruning - The number of the computed dimensions • Given a query Q and its approximate nearest neighbor A: when Σᵢ₌₁..ₖ (qᵢ - pᵢ)² > DIS(Q, A)², where 1 ≤ k ≤ n, P cannot become the nearest datum of Q, since DIS(Q, A) is smaller than DIS(Q, P) • The smallest such k is called the number of the computed dimensions (NCD) of P • Example: Q = (1, 1, 1, 1, 1, 1, 1), A = (2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7; P = (1, 3, 2, 1, 3, 5, 1). The partial sums of (qᵢ - pᵢ)² are 0, 4, 5, 5, 9, …, and they first exceed 7 at k = 5, so the NCD of P is 5
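
A direct implementation of the NCD with partial sums, reproducing the worked example:

```python
import math

def ncd(q, p, dis_qa):
    """Smallest k such that sum_{i<=k} (q_i - p_i)^2 > DIS(Q, A)^2:
    after k dimensions, P is provably farther than A and can be dropped."""
    bound = dis_qa ** 2
    partial = 0.0
    for k, (qi, pi) in enumerate(zip(q, p), start=1):
        partial += (qi - pi) ** 2
        if partial > bound:
            return k
    return len(q)        # never pruned: all n dimensions were computed

Q = (1, 1, 1, 1, 1, 1, 1)
A = (2, 2, 2, 2, 2, 2, 2)
P = (1, 3, 2, 1, 3, 5, 1)
print(ncd(Q, P, math.dist(Q, A)))   # -> 5, matching the slide
```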

  40. Dimension pruning - ANCD • Given a data set X, the Average Number of the Computed Dimensions (ANCD) of X is the sum of the NCDs of the data in X divided by the number of data in X • Example: Q = (1, 1, 1, 1, 1, 1, 1), A = (2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7; P1 = (1, 3, 2, 1, 3, 5, 1), P2 = (5, 1, 4, 2, 7, 1, 2), P3 = (1, 3, 1, 3, 7, 1, 1), P4 = (3, 3, 1, 8, 1, 2, 5). The NCDs are 5, 1, 4, 2, so the ANCD of X is (5 + 1 + 4 + 2)/4 = 3
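
The same partial-sum routine reproduces the ANCD example:

```python
import math

def ncd(q, p, bound_sq):
    s = 0.0
    for k, (qi, pi) in enumerate(zip(q, p), start=1):
        s += (qi - pi) ** 2
        if s > bound_sq:
            return k
    return len(q)

Q = (1, 1, 1, 1, 1, 1, 1)
A = (2, 2, 2, 2, 2, 2, 2)
X = [(1, 3, 2, 1, 3, 5, 1),   # NCD 5
     (5, 1, 4, 2, 7, 1, 2),   # NCD 1
     (1, 3, 1, 3, 7, 1, 1),   # NCD 4
     (3, 3, 1, 8, 1, 2, 5)]   # NCD 2
bound_sq = math.dist(Q, A) ** 2       # DIS(Q, A)^2 = 7
print(sum(ncd(Q, p, bound_sq) for p in X) / len(X))   # -> 3.0
```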

  41. Dimension pruning • Property 4.3: In an n-dimensional space, given a query Q and its approximate nearest neighbor A, suppose X and Y are two node sets: X = {x | x is a node and DIS(Q, x) = d} and Y = {y | y is a node and DIS(Q, y) = r·d} • The data of X are uniformly distributed on the surface of a hypersphere with center Q and radius d, and the data of Y are uniformly distributed on the surface of a hypersphere with center Q and radius r·d • If the ANCD of X is m, with m < n, then the ANCD of Y is m/r²

  42. Dimension pruning • Dimension = 1000; for data at the distance of the approximate nearest neighbor, ANCD = 999.9 • At growing radii the ANCD falls as m/r²: 250, 111.11, 62.5, 40, 27.78, 20.4 (for r = 2, …, 7)
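
These figures follow directly from Property 4.3 with m ≈ 1000 near the query:

```python
# In 1000 dimensions the ANCD of data at distance d from Q is m ~ 1000
# (999.9 on the slide); data at distance r*d then have ANCD m / r^2.
m = 1000
for r in range(2, 8):
    print(r, round(m / r ** 2, 2))
# -> 250.0, 111.11, 62.5, 40.0, 27.78, 20.41 (the values on slide 42)
```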

  43. Dimension pruning - Average value of each dimension • Dimension = 6, d = 10, so the squared distance d² = 100 is distributed over the six dimensions, e.g., 30+30+10+10+10+10 = 100, 20+20+30+15+10+5 = 100, 10+25+30+20+10+5 = 100, 30+20+20+15+10+5 = 100 • On average, each dimension contributes d²/n = 100/6 ≈ 16.67

  44. Dimension pruning - Concept • Dimension = 100 • Figure comparing the bounds d = 9, d = 10, and d = 20: a tighter bound is exceeded after fewer computed dimensions (1 dimension vs. 4 dimensions in the figure)

  45. Dimension pruning - Experiments • In a 100-dimensional space, we generate a query Q and its approximate nearest datum A with DIS(Q, A) = 90 • We produce data on the surfaces of five hyperspheres centered at Q, with radii ranging from 100 to 500 • For each hypersphere, 100,000 uniform data are produced on its surface

  46. Dimension pruning • Variance affects dimension pruning • We can analyze the variance of each dimension and change the computation order of the Euclidean distance, accumulating the high-variance dimensions first
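
A sketch of this variance-ordered evaluation (names are ours): dimensions are visited in decreasing-variance order, so the partial sum tends to cross the pruning bound after fewer dimensions:

```python
import numpy as np

def variance_order(X):
    """Indices of the dimensions sorted by decreasing variance; computing
    high-variance dimensions first accumulates large squared differences
    early, so the pruning bound is exceeded sooner on average."""
    return np.argsort(-X.var(axis=0))

def pruned_distance(q, p, bound_sq, order):
    s = 0.0
    for k, i in enumerate(order, start=1):
        s += (q[i] - p[i]) ** 2
        if s > bound_sq:
            return None, k            # pruned after k computed dimensions
    return np.sqrt(s), len(order)     # survived: full distance computed

# Example: 1,000 points in 100 dimensions with uneven per-dimension variance
rng = np.random.default_rng(0)
X = rng.random((1000, 100)) * np.linspace(0.1, 10, 100)
order = variance_order(X)
```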

  47. 1-NN query - Approximate 1-NN: random selection • From N = 6 candidates, δ = 3 are selected at random • Expected minimum rank of the selected candidates: (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75, where 10, 6, 3, 1 count the 3-subsets whose best candidate has rank 1, 2, 3, 4, out of 20 subsets in total

  48. 1-NN query - Approximate 1-NN • In general, if δ of N data are selected, the expected minimum rank is (N + 1)/(δ + 1) • Check: (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75 = (6 + 1)/(3 + 1)
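
The 1.75 figure and the closed form (N + 1)/(δ + 1) can be verified both combinatorially and by simulation:

```python
import random
from math import comb

N, delta = 6, 3

# Exact expectation: P(min rank = i) = C(N - i, delta - 1) / C(N, delta)
exact = sum(i * comb(N - i, delta - 1)
            for i in range(1, N - delta + 2)) / comb(N, delta)
print(exact)          # -> 1.75 = (1*10 + 2*6 + 3*3 + 4*1)/20 = (N+1)/(delta+1)

# Monte Carlo check of the same quantity
trials = 100_000
avg = sum(min(random.sample(range(1, N + 1), delta))
          for _ in range(trials)) / trials
print(avg)            # -> approximately 1.75
```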

  49. 1-NN query - Approximate 1-NN: greedy selection • Candidates A, B, C are selected greedily using a reference point R and the query Q

  50. Approximate 1-NN - Experiment • The number of dimensions is 256 for USPS (7,291 data), 784 for MNIST, and 16 for LETTER • Based on Formula (16), the average minimum rank of the random selection method on the 7,291 USPS data with 200 selected is (7291 + 1)/(200 + 1) ≈ 36.27, versus 11.47 observed in the experiment
