1 / 19

Redundant Bit Vectors for the Audio Fingerprinting Server

Redundant Bit Vectors for the Audio Fingerprinting Server . John Platt Jonathan Goldstein Chris Burges. Structure of Talk. Problem Statement Problems with Existing Techniques Bit Vectors Partitioning the Query Space Results Future Extensions. Problem Statement.

Antony
Download Presentation

Redundant Bit Vectors for the Audio Fingerprinting Server

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges

  2. Structure of Talk • Problem Statement • Problems with Existing Techniques • Bit Vectors • Partitioning the Query Space • Results • Future Extensions

  3. Problem Statement • Find all hyperspheres that overlap a query point • Centers of spheres = undistorted fingerprint of song • Radius of sphere = acceptable distortion of fingerprint • Sphere overlaps query = query song is same as dB song S1 S2 S3 Query is song 2 Query S4 S5 Data sphere

  4. Existing Techniques • Linear scan • For each sphere, check if query is inside • Linear effort in size of dB • Previous best known technique • Data partitioning (R-trees, SS-trees, etc) • Store data in a tree of shapes • Shapes chop up space • Descend and backtrack in tree, finding all leaf nodes that could match query • Performs worse than linear scan!!!

  5. Key New Ideas • New Algorithm: Redundant Bit Vectors (RBV) • Partition the queries, not the data • Avoids hopeless task • Store each data point redundantly • Combats high dimensionality • Use bit vectors to index database • Small & fast representation

  6. Data Structure: Bit Vectors • Bit vectors represent every data point as one bit Point 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 Most examples are excluded from a final linear scan & & & → Point N 0 means exclude this example from linear search 1 means linear search must still look at example

  7. Why Bit Vectors are Good • Small memory footprint • 1 bit per example per bit vector • Fast on modern CPUs • 1 CPU cycle operates on 32 examples per clock cycle • Compare to Euclidean distance • 3 operations/example/dimension • Potential speed up: 96x !!! • we use 1 bit vector per dimension in lookup

  8. Partition the Queries, not the Data Query bin • For each query dimension, • dimension indexed into bins • bit vector associated with each bin • when query falls into bin … • use the corresponding bit vector • AND together bit vectors for some • or all dimensions • Each dimension trims examples • Perform linear scan on survivors 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0

  9. How We Compute the RBVs At index building time, construct the bit vectors: S1 Project spheres into each dimension S5 S4 S2 S3 S6 i th vector, j th bit = does sphere j overlap bin i ? 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1

  10. How We Decide on the Bin Edges • Two equivalent heuristics • each bin should have ~ same number of spheres • adjacent bit vectors should be ~ constant Hamming distance apart • Place bin edges equal num. of sphere edges apart 6 3 9 12 S1 S5 S2 S4 S3 S6

  11. Improve Selectivity by Shrinking Boxes • Fingerprinting with hyperspheres (L2 norm) • Low, but non-zero false negative rate • Bit vectors implement hyperrectangles (L∞norm) bit vectors guaranteed to never introduce false positives bit vectors empirically found to introduce no extra false positives We shrink the hyperrectangles to speed up final linear search

  12. Speed Comparisons • Ran 1000 queries against fingerprint database • Database size = 240K • 64 dimensional points • 14 bit vector dimensions used • chosen to optimize bit vector + linear scan speed • more dim: bit vectors slow down, linear scan speeds up • 32 bins per dimension • chosen to optimize memory/speed tradeoff • Pentium 4, 2.2 GHz

  13. Results • all linear scans used early bailing • measured code by itself, not in context of SQL or IIS

  14. Code Details • ~600 lines of C++ (pretty simple) • Not integrated with SQL Server or IIS, etc. • Running as part of audio fingerprinting demo

  15. How Fast Is It Really? • You tell us: it depends on several factors • Linear time in size of database • Linear time in amount of resistance to cropping • Sorting by popularity may help substantially

  16. location of fingerprint true start of song largest acceptable crop to recognize need to search this portion of song Resistance to Cropping • How much “crop slop” do you need? • current system = 5.4 traces / second of slop • possible to reduce by 2x • server load linear in number of traces

  17. Popularity Sorting May Help Order database by approximate popularity of music Split search into different sections 5000 most popular first search here May yield substantial speed gain if not found, search here 50000 next most popular if not found, search here the rest

  18. Memory Performance • Bit Vectors can stay resident in memory • For 240K songs • All fingerprints live in 128M • Bit vector indices only require 13M • We can store fingerprints as 2-byte short, save 2x mem. • Bit vector search blows out of cache • speed depends on memory bandwidth of server

  19. Summary • Bit Vectors are Simple, Small, and Fast • Must be used to get good server-side performance

More Related