
Redundant Bit Vectors for Fast High-Dim Lookup

Jonathan Goldstein, John Platt, Chris Burges. Database & CCSP Research Groups.


Presentation Transcript


  1. Redundant Bit Vectors for Fast High-Dim Lookup
     Database & CCSP Research Groups
     Jonathan Goldstein, John Platt, Chris Burges

     Problem Statement
     • Find all data spheres that contain the query point.
     • Low, non-zero error is OK.

     Our Three Key Ideas
     • Use redundancy to combat high dimensionality.
     • Use bit vector indices to keep the representation small.
     • Use redundancy by partitioning the queries, not the data.

     Partitioning the Queries
     [Figure: a query point and data spheres S1 to S6 in two dimensions (dim 1, dim 2); the bins along the dimensions carry bit vectors BV1 to BV8. Example: construct BV6: 101101.]

     Bit Vector Indices
     • Redundancy: a point is stored as a "yes" result for multiple bins per dimension.
     • When querying, a single bit vector index is picked per dimension, and the picked indices are ANDed together.
     • The results are post-filtered using the actual data spheres.
     [Figure: bit vector indices, one bit per point ID, combined with AND.]

     Applications
     • Server for audio fingerprinting: look up matches for audio
     • Fingerprinting in general
     • Pub/sub filtering

     Existing Technology
     • Space partitioning with no redundancy
     • R-trees, SS-trees, etc.
     • Worse than linear scan for truly high-dimensional data!

     Advantages
     • Small: 1 bit/point/query part
     • Fast: 1 CPU cycle operates on 32 points in parallel!

     Results for RARE
     • 56x faster than linear scan
     • Introduces no false negatives
     • Post-filtering 1000x faster than full linear scan
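The lookup described under "Bit Vector Indices" (redundant per-bin bit vectors, one vector picked per dimension, an AND across dimensions, then post-filtering against the actual data spheres) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `build_index` and `query`, the fixed bin edges, and the use of Python integers as bit vectors are all assumptions made for illustration; the poster does not say how bins are chosen, and a real system would pack bits into machine words.

```python
from bisect import bisect_right
from math import dist


def build_index(centers, radii, edges):
    """Build one bit vector per bin per dimension (hypothetical layout).

    edges[j] is a sorted list of bin boundaries for dimension j.
    Bin b holds the values x with edges[j][b-1] <= x < edges[j][b];
    the first and last bins are open-ended.
    """
    index = []
    for j, e in enumerate(edges):
        bins = [0] * (len(e) + 1)
        for i, (c, r) in enumerate(zip(centers, radii)):
            # Redundancy: sphere i sets its bit in every bin that its
            # extent [c[j] - r, c[j] + r] touches in this dimension.
            first = bisect_right(e, c[j] - r)
            last = bisect_right(e, c[j] + r)
            for b in range(first, last + 1):
                bins[b] |= 1 << i
        index.append(bins)
    return index


def query(q, centers, radii, edges, index):
    """Return the IDs of all spheres that contain the point q."""
    # Start with every sphere as a candidate, then AND in the single
    # bit vector selected by q's bin in each dimension.
    mask = (1 << len(centers)) - 1
    for j, e in enumerate(edges):
        mask &= index[j][bisect_right(e, q[j])]
    # Post-filter: candidates may be false positives, so check the
    # actual sphere for each surviving bit.
    hits = []
    i = 0
    while mask:
        if mask & 1 and dist(q, centers[i]) <= radii[i]:
            hits.append(i)
        mask >>= 1
        i += 1
    return hits


# Illustrative data only: three spheres in two dimensions.
centers = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.5)]
radii = [1.0, 1.0, 1.0]
edges = [[-2.0, 0.0, 2.0, 4.0, 6.0], [-2.0, 0.0, 2.0, 4.0, 6.0]]
index = build_index(centers, radii, edges)
print(query((0.5, 0.2), centers, radii, edges, index))
```

In this sketch a sphere that contains the query always has its bit set in the query's bin for every dimension, so the AND step cannot drop a true match; only the post-filter removes candidates. Finer bins would produce fewer false positives at the cost of more bit vectors.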
