Audio fingerprinting
Wes Hatch, MUMT-614, Mar. 13, 2003


What is Audio Fingerprinting?

  • a small, unknown segment of audio data (it can be as short as just a couple of seconds) is used to identify the original audio file from which it came

Applications


  • Broadcast monitoring

    • playlist generation

      • royalty collection

      • ad verification

  • Connected Audio

    • general term for consumer applications

  • Other

    • Napster: use of fingerprinting systems to block the transmission of copyrighted material

    • Finding desired content efficiently in “an overwhelming amount of audio material”


  • Automated search for illegal content on the Internet

    • examines the real audio information rather than just tag information

  • For the consumer

    • make the meta-data of songs in a library consistent, allowing for easy organization

    • can guarantee that what is downloaded is actually what it says it is

    • will allow consumers to record signatures of sound and music on small handheld devices

Two principal components

  • Compute the fingerprint

  • Compare it to a database of previously computed fingerprints

    • A text example: “…in a box. I will not eat them with a fox. I…”
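
The two steps can be illustrated with the text analogy above. The sketch below is a toy, not an audio method: the "fingerprint" is just case- and whitespace-normalised text, the database titles and passages are made up for illustration (one passage is built around the snippet from the slide), and matching is plain substring search.

```python
def text_fingerprint(text):
    """Toy 'fingerprint': lower-case the text and collapse whitespace."""
    return " ".join(text.lower().split())

# Hypothetical database: title -> previously computed fingerprint.
DATABASE = {
    "story_a": text_fingerprint("The quick brown fox jumps over the lazy dog."),
    "story_b": text_fingerprint(
        "I will not eat them in a box. I will not eat them with a fox. "
        "I will not eat them here or there."
    ),
}

def identify(snippet):
    """Return the titles whose stored fingerprint contains the snippet."""
    query = text_fingerprint(snippet)
    return [title for title, fp in DATABASE.items() if query in fp]

print(identify("in a box.  I will NOT eat them with a fox. I"))  # ['story_b']
```

Exact substring search only works here because text has no noise; the audio case needs a similarity measure instead of equality.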

Details to worry about

  • Robustness (to noise, distortion)

  • Reliability

  • Fingerprint size (reduced dimensionality)

  • Granularity

  • Search speed and scalability

  • Computational efficiency

  • Resulting features must be informative about the audio content

  • Semantic or non-semantic features?

  • Hash table or vector representation?

Computing the fingerprint

  • Compare to hash functions…?

    • compare computed hash value with that stored in a database

  • Drawback

    • need to worry about perceptual similarity and not mathematical similarity

      • PCM audio vs. MP3: both sound alike but mathematically (i.e. spectral content) are quite different

    • perceptual similarity is not transitive

      • mathematical equality is transitive, so it is not possible to design a system which computes mathematically equal fingerprints for perceptually similar objects
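
The drawback can be seen in a few lines. This is a minimal sketch, assuming a byte string as a stand-in for a block of audio data: flipping one bit changes a cryptographic hash completely, while a bit-error rate still reports the two inputs as nearly identical. The stand-in data and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib

def bit_error_rate(a, b):
    """Fraction of differing bits between two equal-length byte strings."""
    assert len(a) == len(b)
    errors = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return errors / (8 * len(a))

original = bytes(range(256)) * 4      # stand-in for a block of audio data
distorted = bytearray(original)
distorted[10] ^= 0x01                 # flip a single bit
distorted = bytes(distorted)

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(distorted).hexdigest()

print(digest_a == digest_b)                  # False: the hashes do not match
print(bit_error_rate(original, distorted))   # one differing bit out of 8192
```

A fingerprint therefore has to be compared by distance (close enough), not by equality.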

Techniques (general)

  • Any ‘x’ number of seconds may be used to compute the fingerprint

  • Audio gets separated into frames

    • Features computed for each frame:

      • Fourier coefficients

      • MFCC, LPC

      • Spectral flatness

      • sharpness

  • “features mapped into a more compact representation by using …HMM, or quantization”
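
As a rough illustration of framing and per-frame features, the sketch below splits a signal into overlapping frames and computes spectral flatness for each one. The frame length, hop size, and test signals are arbitrary assumptions; a naive DFT is used only to keep the example self-contained, and flatness is taken as the geometric mean of the power spectrum divided by its arithmetic mean.

```python
import cmath
import math
import random

def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Naive O(n^2) DFT power for bins 1 .. n/2 - 1 (DC and Nyquist skipped)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = [p + eps for p in power_spectrum(frame)]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))

random.seed(0)
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
noise = [random.uniform(-1.0, 1.0) for _ in range(256)]

tone_flatness = [spectral_flatness(f) for f in split_into_frames(tone, 64, 32)]
noise_flatness = [spectral_flatness(f) for f in split_into_frames(noise, 64, 32)]
print(max(tone_flatness), min(noise_flatness))
```

A pure tone concentrates its energy in one bin and gives a flatness near 0; noise spreads energy across bins and gives a much larger value.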

Techniques (Haitsma, Kalker)

  • one 32-bit sub-fingerprint every 11.6 ms

    • A block consists of 256 sub-fingerprints

      • Corresponds to a granularity of only 3 seconds

    • Large overlap (31/32), so subsequent sub-fingerprints are similar and vary slowly in time

    • worst-case scenario: the frame boundaries used during identification are 5.8 ms off from those in the database
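
The numbers on this slide follow from one another. The arithmetic below is a sketch under two assumptions that the slide does not state outright: the hop between frames equals the 11.6 ms sub-fingerprint interval, and the worst-case misalignment is half of that hop.

```python
SUB_FINGERPRINT_INTERVAL_MS = 11.6   # one 32-bit sub-fingerprint per interval
SUB_FINGERPRINTS_PER_BLOCK = 256
FRAME_OVERLAP = 31 / 32

# Duration covered by one block of sub-fingerprints.
block_seconds = SUB_FINGERPRINTS_PER_BLOCK * SUB_FINGERPRINT_INTERVAL_MS / 1000

# If frames overlap by 31/32, the hop is 1/32 of the frame length.
frame_ms = SUB_FINGERPRINT_INTERVAL_MS / (1 - FRAME_OVERLAP)

# Frame boundaries can be at most half a hop away from the database's.
worst_case_offset_ms = SUB_FINGERPRINT_INTERVAL_MS / 2

# Total number of bits in one fingerprint block.
block_bits = 32 * SUB_FINGERPRINTS_PER_BLOCK

print(f"block duration: {block_seconds:.2f} s")                    # 2.97 s
print(f"frame length: {frame_ms:.1f} ms")                          # 371.2 ms
print(f"worst-case misalignment: {worst_case_offset_ms:.1f} ms")   # 5.8 ms
print(f"bits per block: {block_bits}")                             # 8192
```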

Techniques (Haitsma, Kalker)

  • Data from each frame is sent through a filterbank

    • 33 filters, logarithmically spaced (to correspond roughly to the Bark scale)

      • between 300 and 2000 Hz

    • phase is neglected (perceptual reasons)
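
A possible way to lay out the bands and turn band energies into a 32-bit sub-fingerprint is sketched below. The geometric spacing of the band edges is an assumption about what "logarithmically spaced" means here. The slides do not say how the bits are derived; the sign-of-energy-difference rule used in `sub_fingerprint` is recalled from the Haitsma and Kalker paper and should be read as an assumption, and both function names are hypothetical.

```python
def log_band_edges(f_lo=300.0, f_hi=2000.0, n_bands=33):
    """Edges of n_bands contiguous bands with a constant frequency ratio."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    return [f_lo * ratio ** i for i in range(n_bands + 1)]

def sub_fingerprint(prev_energies, energies):
    """Pack 32 bits from 33 band energies of two consecutive frames.

    Bit m is 1 when the energy difference between bands m and m+1
    increased relative to the previous frame (assumed rule).
    """
    bits = 0
    for m in range(len(energies) - 1):
        diff = ((energies[m] - energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits

edges = log_band_edges()
print(len(edges) - 1, "bands from", edges[0], "Hz to", round(edges[-1]), "Hz")

previous = [0.0] * 33
current = [float(e) for e in range(33, 0, -1)]   # energy falls with frequency
print(hex(sub_fingerprint(previous, current)))   # 0xffffffff
```

Using 33 bands gives 32 adjacent-band differences, which matches the 32-bit sub-fingerprint size.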

Techniques (Burges, Platt)

  • downsampled to 11.025 kHz, split into frames with an overlap factor of 2 (each frame overlaps its neighbour by half)

    • MCLT is then applied to each frame. A 128-sample log spectrum is generated by taking the log modulus of each MCLT coefficient

Techniques (Burges, Platt)

  • Use prior knowledge to define form of the feature extractor

  • Features computed by a “linear, convolutional” neural network

  • convert signal into a feature vector

    • uses principal component analysis (PCA) to find a set of projections

    • generates a vector of 128 values for every 11.6 ms interval

      • dimensionality-reduction method (i.e. lots of math)

Techniques (Burges, Platt)

  • 3 layers of Oriented PCA (OPCA)

    • operates on a frame of 128 values

      • layer 1: generates 10 values for each frame

      • layer 2: takes 42 ‘layer 1 outputs’ and produces 20 values

      • layer 3: takes 40 ‘layer 2 outputs’ and produces 64 values (11K inputs --> 64 outputs)

Searching the Database

  • Look for the most similar (not necessarily exact) fingerprint

    • 10,000 5-min. songs → 250 million sub-fingerprints

    • brute force takes in excess of 20 minutes on a very fast PC

      • brute force computes bit-error rate for every possible position in the database
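
The size estimate and the brute-force search can be sketched as follows. The database here is random toy data, and the 11.6 ms interval is taken from the Haitsma and Kalker slides; the function names are hypothetical.

```python
import random

BITS_PER_SUB_FINGERPRINT = 32

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two equal-length blocks."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (BITS_PER_SUB_FINGERPRINT * len(block_a))

def brute_force_search(database, query):
    """Compute the bit-error rate at every position; return the best one."""
    best_position, best_ber = -1, 1.1
    for position in range(len(database) - len(query) + 1):
        ber = bit_error_rate(database[position:position + len(query)], query)
        if ber < best_ber:
            best_position, best_ber = position, ber
    return best_position, best_ber

# Size estimate: 10,000 songs of 5 minutes, one sub-fingerprint per 11.6 ms.
total = 10_000 * 5 * 60 / 0.0116
print(f"{total / 1e6:.0f} million sub-fingerprints")   # 259 million

# Toy database and a mildly distorted 256-sub-fingerprint query.
random.seed(1)
database = [random.getrandbits(32) for _ in range(2000)]
clean = database[700:700 + 256]
# Flip one bit in every fourth sub-fingerprint to imitate distortion.
query = [v ^ (1 << (i % 32)) if i % 4 == 0 else v for i, v in enumerate(clean)]
print(brute_force_search(database, query))             # (700, 0.0078125)
```

The estimate comes out near 259 million, which the slide rounds to 250 million. The search cost grows with the database length times the block length, which is why brute force is impractical at that scale.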

Searching the Database

  • assume that at least one of the 256 sub-fingerprints is error-free

    • then, use a hash table (as opposed to a more memory-intensive look-up table)

    • 800,000 times faster
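
The idea can be sketched as an index from sub-fingerprint value to database positions. Only positions where some query sub-fingerprint matches exactly are checked with the full bit-error rate. This is an illustrative sketch on random toy data, not the authors' implementation; the 0.35 acceptance threshold is recalled from the Haitsma and Kalker paper rather than taken from the slides, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def bit_error_rate(block_a, block_b):
    """Fraction of differing bits between two blocks of 32-bit values."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (32 * len(block_a))

def build_index(database):
    """Map each sub-fingerprint value to the positions where it occurs."""
    index = defaultdict(list)
    for position, value in enumerate(database):
        index[value].append(position)
    return index

def indexed_search(database, index, query, threshold=0.35):
    """Check only candidate positions implied by exact sub-fingerprint hits."""
    for offset, value in enumerate(query):
        for hit in index.get(value, []):
            start = hit - offset
            if start < 0 or start + len(query) > len(database):
                continue
            ber = bit_error_rate(database[start:start + len(query)], query)
            if ber < threshold:
                return start, ber
    return None

random.seed(1)
database = [random.getrandbits(32) for _ in range(2000)]
index = build_index(database)
clean = database[700:700 + 256]
# Corrupt every sub-fingerprint except the one at offset 100.
query = [v if i == 100 else v ^ (1 << (i % 32)) for i, v in enumerate(clean)]
print(indexed_search(database, index, query))   # (700, 0.0311279296875)
```

If no sub-fingerprint in the query survives without errors, this lookup finds nothing, which is the price of the speed-up.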


  • false-positive rate of 3.6 × 10^-20 (Haitsma, Kalker)

  • On tests with a large (500,000) set of input traces

    • has a “low” false-positive and false-negative rate. (Burges, Platt)

    • didn’t test on time compression, expansion

  • can withstand distortions occurring from transmission over mobile phones.