Redundant Bit Vectors for the Audio Fingerprinting Server

Redundant Bit Vectors for the Audio Fingerprinting Server John Platt Jonathan Goldstein Chris Burges

Structure of Talk • Problem Statement • Problems with Existing Techniques • Bit Vectors • Partitioning the Query Space • Results • Future Extensions

Problem Statement • Find all hyperspheres that overlap a query point • Centers of spheres = undistorted fingerprint of song • Radius of sphere = acceptable distortion of fingerprint • Sphere overlaps query = query song is same as dB song S1 S2 S3 Query is song 2 Query S4 S5 Data sphere

Existing Techniques • Linear scan • For each sphere, check if query is inside • Linear effort in size of dB • Previous best known technique • Data partitioning (R-trees, SS-trees, etc) • Store data in a tree of shapes • Shapes chop up space • Descend and backtrack in tree, finding all leaf nodes that could match query • Performs worse than linear scan!!!

Key New Ideas • New Algorithm: Redundant Bit Vectors (RBV) • Partition the queries, not the data • Avoids hopeless task • Store each data point redundantly • Combats high dimensionality • Use bit vectors to index database • Small & fast representation

Data Structure: Bit Vectors • Bit vectors represent every data point as one bit Point 1 1 0 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 0 1 Most examples are excluded from a final linear scan & & & → Point N 0 means exclude this example from linear search 1 means linear search must still look at example

Why Bit Vectors are Good • Small memory footprint • 1 bit per example per bit vector • Fast on modern CPUs • 1 CPU cycle operates on 32 examples per clock cycle • Compare to Euclidean distance • 3 operations/example/dimension • Potential speed up: 96x !!! • we use 1 bit vector per dimension in lookup

Partition the Queries, not the Data Query bin • For each query dimension, • dimension indexed into bins • bit vector associated with each bin • when query falls into bin … • use the corresponding bit vector • AND together bit vectors for some • or all dimensions • Each dimension trims examples • Perform linear scan on survivors 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 0

How We Compute the RBVs At index building time, construct the bit vectors: S1 Project spheres into each dimension S5 S4 S2 S3 S6 i th vector, j th bit = does sphere j overlap bin i ? 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1

How We Decide on the Bin Edges • Two equivalent heuristics • each bin should have ~ same number of spheres • adjacent bit vectors should be ~ constant Hamming distance apart • Place bin edges equal num. of sphere edges apart 6 3 9 12 S1 S5 S2 S4 S3 S6

Improve Selectivity by Shrinking Boxes • Fingerprinting with hyperspheres (L2 norm) • Low, but non-zero false negative rate • Bit vectors implement hyperrectangles (L∞norm) bit vectors guaranteed to never introduce false positives bit vectors empirically found to introduce no extra false positives We shrink the hyperrectangles to speed up final linear search

Speed Comparisons • Ran 1000 queries against fingerprint database • Database size = 240K • 64 dimensional points • 14 bit vector dimensions used • chosen to optimize bit vector + linear scan speed • more dim: bit vectors slow down, linear scan speeds up • 32 bins per dimension • chosen to optimize memory/speed tradeoff • Pentium 4, 2.2 GHz

Results • all linear scans used early bailing • measured code by itself, not in context of SQL or IIS

Code Details • ~600 lines of C++ (pretty simple) • Not integrated with SQL Server or IIS, etc. • Running as part of audio fingerprinting demo

How Fast Is It Really? • You tell us: it depends on several factors • Linear time in size of database • Linear time in amount of resistance to cropping • Sorting by popularity may help substantially

location of fingerprint true start of song largest acceptable crop to recognize need to search this portion of song Resistance to Cropping • How much “crop slop” do you need? • current system = 5.4 traces / second of slop • possible to reduce by 2x • server load linear in number of traces

Popularity Sorting May Help Order database by approximate popularity of music Split search into different sections 5000 most popular first search here May yield substantial speed gain if not found, search here 50000 next most popular if not found, search here the rest

Memory Performance • Bit Vectors can stay resident in memory • For 240K songs • All fingerprints live in 128M • Bit vector indices only require 13M • We can store fingerprints as 2-byte short, save 2x mem. • Bit vector search blows out of cache • speed depends on memory bandwidth of server

Summary • Bit Vectors are Simple, Small, and Fast • Must be used to get good server-side performance

Redundant Bit Vectors for the Audio Fingerprinting Server

Redundant Bit Vectors for the Audio Fingerprinting Server

Presentation Transcript

Intel 64-bit Server Technology

Redundant Bit Vectors for Fast High-Dim Lookup

Audio Fingerprinting

AUDIO STREAMING SERVER

A Bit About Audio Description

Fingerprinting the Datacenter

24-bit Audio CODEC

Fingerprinting

Fingerprinting

Fingerprinting

Compressed Domain Audio Fingerprinting

Fingerprinting

the Vectors

SoundConnect Audio Conferencing for Lync Server

Fingerprinting

Fingerprinting

STP: A Decision Procedure for Bit-vectors and Arrays

Detecting and Defending against Web-Server Fingerprinting

Satisfiability modulo the Theory of Bit Vectors

Audio Fingerprinting as a New Task for MIREX-2014

Fingerprinting the Datacenter

Fingerprinting