Approximate NN queries on Streams with Guaranteed Error/performance Bounds


  1. Approximate NN queries on Streams with Guaranteed Error/performance Bounds
     Nick Koudas @ AT&T labs-research
     Beng Chin Ooi, Kian-Lee Tan, Rui Zhang @ National University of Singapore

  2. Problem
     • Problem: kNN search.
     • Environment: data stream (one scan; memory constraint).
     • Approximate solution: e-approximate kNN (ekNN).
     • Motivation: applications in which absolute error is preferable or more straightforward.
       IP: 137.132.48.120 137.132.48.121 …

  3. Two Optimization Problems
     • Memory optimization for a given error bound: given an error bound e, use as little memory as possible to answer ekNN queries.
     • Error minimization for a given memory size: given a fixed amount of memory, achieve the best accuracy for ekNN queries.
     • Requirements:
       • One-scan algorithm.
       • Satisfies the constraints.
       • Efficient updates and query processing.

  4. A Framework
     • Divide the space into equal square-shaped cells.
     • Maintain at most K points in each cell.
     • For any k ≤ K, the absolute error of the kNN distance is bounded by d_M, the maximum distance within a cell.
     • For Euclidean distance (taking a unit data space, so that each cell has side 1/u): $d_M = \sqrt{d}/u$, where d is the dimensionality and u is the number of cells each dimension is divided into.
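
The grid bound on this slide can be illustrated with a minimal Python sketch. It assumes the data space is the unit hypercube [0, 1)^d, so each cell has side 1/u; the function names `cell_index` and `max_cell_distance` are hypothetical and do not come from the presentation.

```python
import math

def cell_index(point, u):
    """Map a point in [0, 1)^d to integer cell coordinates on a grid
    with u cells per dimension (assumed unit data space)."""
    return tuple(min(int(x * u), u - 1) for x in point)

def max_cell_distance(d, u):
    """Euclidean diagonal of a d-dimensional cell with side 1/u.
    This is the largest distance between two points of one cell."""
    return math.sqrt(d) / u

# Two points that fall into the same cell are never farther apart
# than the cell diagonal.
a, b = (0.0, 0.0), (0.24, 0.24)
same_cell = cell_index(a, 4) == cell_index(b, 4)
within_bound = math.dist(a, b) <= max_cell_distance(2, 4)
```

A finer grid (larger u) tightens the bound but creates more cells, which is the trade-off behind the two optimization problems.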

  5. Maintenance of the Points: aDaptive Indexing on Streams by space-filling Curves (DISC)
     • Cells are not explicitly maintained, only points.
     • Cells are linearized according to the Z-curve.
     • The Z-value of its cell is the key of a point.
     • Points are maintained in a B*-tree.
     • An efficient merge-cell algorithm is possible.
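
One common way to compute a Z-value is to interleave the bits of the cell coordinates. The sketch below shows that technique; the function name `z_value` and the choice of which dimension supplies the lowest bit are assumptions, since the slide does not specify an encoding.

```python
def z_value(cell, m):
    """Interleave the m low-order bits of each integer cell coordinate
    into a single integer key (Z-order / Morton code)."""
    d = len(cell)
    z = 0
    for bit in range(m):
        for dim, c in enumerate(cell):
            # Bit `bit` of coordinate `dim` goes to position bit*d + dim.
            z |= ((c >> bit) & 1) << (bit * d + dim)
    return z
```

With this layout, dropping the lowest d bits of a key (`z >> d`) yields the key of the enclosing cell on a grid with half the resolution, which is what makes merging cells cheap on a key-ordered structure.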

  6. Algorithm: Build index
     • m: the order of the Z-curve, giving $2^m$ cells in each dimension.
     • If e is given, requiring $\sqrt{d}/2^m \le e$ gives $m \ge \log_2(\sqrt{d}/e)$; $m_e$ is an integer, so $m_e = \lceil \log_2(\sqrt{d}/e) \rceil$.
     • If a memory constraint is given, set a large enough m.
     • Build index:
       1. Initialize m.
       2. Read a record P, calculate its Z-value, search the B*-tree, and find Nc, the number of existing points in the cell P belongs to.
       3. If Nc < K, insert P into the B*-tree; else discard one point and insert P.
       4. If memory runs out (this only happens for the error minimization problem), merge cells and let m = m - 1.
       5. Go back to Step 2 (read the next record).
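
The build loop above can be sketched in Python as follows. This is an illustration under several assumptions, not the authors' implementation: a plain dict stands in for the B*-tree, the data space is the unit hypercube, memory is counted in stored points (`capacity`), the discarded point is the oldest one in the cell, and merging simply re-keys every stored point. The names `order_for_error`, `z_key` and `GridIndex` are hypothetical.

```python
import math
from collections import deque

def order_for_error(e, d):
    """Smallest integer m with sqrt(d) / 2**m <= e (unit data space)."""
    m = 0
    while math.sqrt(d) / (1 << m) > e:
        m += 1
    return m

def z_key(point, m):
    """Z-value of the cell containing `point` on a 2**m-per-dimension grid."""
    u, d, z = 1 << m, len(point), 0
    cell = [min(int(x * u), u - 1) for x in point]
    for bit in range(m):
        for dim, c in enumerate(cell):
            z |= ((c >> bit) & 1) << (bit * d + dim)
    return z

class GridIndex:
    def __init__(self, m, K, capacity=None):
        self.m, self.K, self.capacity = m, K, capacity
        self.cells = {}  # Z-value -> deque of points (stand-in for the B*-tree)
        self.size = 0

    def insert(self, p):
        bucket = self.cells.setdefault(z_key(p, self.m), deque())
        if len(bucket) >= self.K:
            bucket.popleft()  # assumption: discard the oldest point in the cell
            self.size -= 1
        bucket.append(p)
        self.size += 1
        # Memory exhausted: coarsen the grid until the points fit again.
        while (self.capacity is not None and self.size > self.capacity
               and self.m > 0):
            self._merge()

    def _merge(self):
        """Lower the order by one and re-key all points, at most K per cell."""
        self.m -= 1
        old, self.cells, self.size = self.cells, {}, 0
        for bucket in old.values():
            for p in bucket:
                nb = self.cells.setdefault(z_key(p, self.m), deque())
                if len(nb) < self.K:
                    nb.append(p)
                    self.size += 1
```

For the memory optimization problem one would construct `GridIndex(order_for_error(e, d), K)` with no capacity; for the error minimization problem one would start from a large `m` and let the capacity trigger merges.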

  7. Algorithm: Merge Cells
     • General Merge-Cell
       • Applies to any structure.
       • For each new cell, find all the points of the old cells in it, and merge them.
     • Bulk Merge-Cell
       • Applies only to DISC.
       • Scans all the leaf pages once.
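
A single-scan merge in the spirit of Bulk Merge-Cell might look like the sketch below. It assumes bit-interleaved Z-values, so the 2^d old cells that form one new cell have consecutive keys and the new key is `z >> d`. A sorted list of `(z, point)` pairs stands in for the leaf pages, and the name `bulk_merge` is hypothetical.

```python
from itertools import groupby, islice

def bulk_merge(entries, d, K):
    """Merge cells in one pass over entries sorted by Z-value.

    entries: list of (z, point) pairs at order m, sorted by z.
    Returns the list of (z, point) pairs at order m - 1, still sorted,
    keeping at most K points per merged cell.
    """
    merged = []
    for parent, group in groupby(entries, key=lambda e: e[0] >> d):
        # Keep the first K points of each merged cell, drop the rest.
        for _, p in islice(group, K):
            merged.append((parent, p))
    return merged
```

Because the shift is monotone, the output stays sorted by key and can be written back sequentially without re-sorting.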

  8. Algorithm: kNN search
     • W: a window query centered at the center of the cell that Q is in, with gradually increasing side length s.
     • Find the kNN to Q within W.
     • If the kNN distance is no larger than the distance from Q to the nearest side of W, the search terminates.
     • Else increase s by 1/u.
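
The expanding-window search can be sketched as below. Several details are assumptions: the window starts at the size of one cell, the data space is the unit hypercube, a linear filter over a list stands in for the window query on the index, and the loop also stops once the window covers the whole space (a guard the slide does not mention). The name `knn_window` is hypothetical.

```python
import math

def knn_window(points, q, k, u):
    """Return up to k nearest stored points to q by growing a square window."""
    # Center of the cell that contains q.
    center = [(min(int(x * u), u - 1) + 0.5) / u for x in q]
    s = 1.0 / u  # assumption: start with a window the size of one cell
    while True:
        lo = [c - s / 2 for c in center]
        hi = [c + s / 2 for c in center]
        inside = [p for p in points
                  if all(l <= x <= h for l, x, h in zip(lo, p, hi))]
        nearest = sorted(inside, key=lambda p: math.dist(p, q))[:k]
        # Distance from q to the nearest side of the window.
        margin = min(min(x - l, h - x) for l, x, h in zip(lo, q, hi))
        covers_all = all(l <= 0.0 and h >= 1.0 for l, h in zip(lo, hi))
        found = len(nearest) == k and math.dist(nearest[-1], q) <= margin
        if found or covers_all:
            return nearest
        s += 1.0 / u
```

The termination test is safe because any point outside W is at least `margin` away from Q, so it cannot beat a k-th neighbor that is already within `margin`.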

  9. Experiments

  10. Questions ?
