
Video Fingerprinting: Features for Duplicate and Similar Video Detection and Query-based Video Retrieval
Anindya Sarkar, Pratim Ghosh, Emily Moxley and B. S. Manjunath
Presented by: Anindya Sarkar, Vision Research Lab, Department of Electrical & Computer Engg, University of California, Santa Barbara



Presentation Transcript


  1. Video Fingerprinting: Features for Duplicate and Similar Video Detection and Query-based Video Retrieval. Anindya Sarkar, Pratim Ghosh, Emily Moxley and B. S. Manjunath. Presented by: Anindya Sarkar, Vision Research Lab, Department of Electrical & Computer Engg, University of California, Santa Barbara. January 30, 2008

  2. Problem Definition
  • Duplicate video and similar video detection: we represent a video compactly (a fingerprint), for efficient storage and faster search, without compromising retrieval accuracy
  • Query-based video retrieval:
    – Input: a short query video (1-2% of the full video's length)
    – Output: the actual “big” video from which the query is taken

  3. Generation of Duplicate Videos
  • Dataset: BBC rushes dataset, provided for the TRECVID-2007 video summarization task
  • Operations performed:
  • Image processing (per frame) based:
    – Blurring using 3x3 and 5x5 windows
    – Gamma correction by +20% and -20%
    – Gaussian noise addition at SNRs of -20, 0, 10, 20, 30 and 40 dB
    – JPEG compression at QF = 10, 30, 50, 70 and 90
  • Frame drop based errors: frame drops of 20%, 40% and 60% of the original video, for both the random and the bursty case
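The per-frame degradations and frame-drop errors listed on this slide can be sketched as below. This is an illustrative numpy-only stand-in, not the authors' generation code: the exact gamma convention (taken here as exponent 1 ± 0.2 for ±20%) and the box-blur choice are assumptions.

```python
# Sketches of the duplicate-generation operations from the slide:
# blur, gamma correction, Gaussian noise at a target SNR, and frame drops.
import numpy as np

def blur(frame, k=3):
    """Box blur with a k x k window (k = 3 or 5 on the slide)."""
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame.astype(float), pad, mode="edge")
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def gamma_correct(frame, pct=20):
    """Gamma correction by +/-20%: assumed here as exponent 1 +/- 0.2."""
    g = 1.0 + pct / 100.0
    return 255.0 * (frame / 255.0) ** g

def add_gaussian_noise(frame, snr_db=20):
    """Add white Gaussian noise at a given signal-to-noise ratio (dB)."""
    sig_power = np.mean(frame.astype(float) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return frame + np.random.normal(0.0, np.sqrt(noise_power), frame.shape)

def drop_frames(frames, fraction=0.2, bursty=False, rng=None):
    """Drop a fraction of frames, either at random or as one burst."""
    rng = rng or np.random.default_rng(0)
    n = len(frames)
    n_drop = int(fraction * n)
    if bursty:
        start = rng.integers(0, n - n_drop + 1)   # one contiguous gap
        keep = [i for i in range(n) if not (start <= i < start + n_drop)]
    else:
        drop = set(rng.choice(n, size=n_drop, replace=False))
        keep = [i for i in range(n) if i not in drop]
    return [frames[i] for i in keep]
```

JPEG compression at a chosen quality factor would normally be applied with an imaging library (e.g. re-encoding each frame), so it is omitted from this sketch.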

  4. Interpretation of Similar Videos
  • Different takes of the same scene are considered “similar” videos
  • These videos are similar in content
  • However, due to human variability at both the cameraman and actor level (camera angles, cuts, and actor performance), the videos may look similar but are still different
  • The BBC rushes dataset has unedited footage of the different retakes – hence, it is ideally suited for generating similar videos

  5. Keyframe-based Video Fingerprint
  [Diagram: N frames in the actual video → video summarization and key-frame extraction → K key-frames → video fingerprint, a K x d matrix with a d-dimensional signature computed per key-frame]
  • Features used for fingerprint creation:
    1. Compact Fourier-Mellin Transform (CFMT)
    2. Scale Invariant Feature Transform (SIFT)

  6. Log-Polar Transformation
  • Maps any 2-D matrix from Cartesian coordinates (x, y) to log-polar coordinates (m, n), with m = 0, ..., M-1 and n = 0, ..., N-1:
    x = e^(m·Δr) · cos(n·Δθ), y = e^(m·Δr) · sin(n·Δθ)
  • First fix the values of M and N; then Δr = log(R)/M and Δθ = 2π/N
  • M is the number of concentric circles, N is the number of diverging radial lines, and R is the maximum radius of the in-circle
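The slide's sampling formulas can be turned into a minimal log-polar resampler. This is a sketch using nearest-neighbour sampling and the image centre as the origin; both choices are assumptions, since the slide does not specify the interpolation or the origin placement.

```python
# Minimal log-polar resampling following the slide's formulas:
# x = e^(m*dr) cos(n*dth), y = e^(m*dr) sin(n*dth),
# with dr = log(R)/M and dth = 2*pi/N.
import numpy as np

def log_polar(img, M=32, N=32):
    H, W = img.shape
    cy, cx = H // 2, W // 2          # assume the origin is the image centre
    R = min(cy, cx)                  # maximum radius of the in-circle
    dr = np.log(R) / M
    dth = 2 * np.pi / N
    out = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            r = np.exp(m * dr)       # radius grows exponentially with m
            x = cx + r * np.cos(n * dth)
            y = cy + r * np.sin(n * dth)
            out[m, n] = img[int(round(y)) % H, int(round(x)) % W]
    return out
```

A rotation of the input image becomes (approximately) a circular shift along the n axis, and a scaling becomes a shift along the m axis, which is what makes the subsequent |FFT| step rotation- and scale-insensitive.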

  7. CFMT Feature Extraction
  [Pipeline: log-polar image → |FFT| → retain the coefficients carrying 50% of the A.C. energy → normalization & vectorization → PCA → quantization]
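A hedged sketch of this pipeline follows. The "50% A.C. energy" selection rule is approximated here by keeping a fixed low-frequency block after centring the spectrum, and PCA is fit with a plain SVD; both are simplifications of the slide's pipeline, not the authors' implementation.

```python
# Sketch of the CFMT pipeline: |FFT| of the log-polar image, keep
# low-frequency A.C. coefficients, normalize & vectorize, then PCA.
import numpy as np

def cfmt_feature(log_polar_img, keep=8):
    F = np.abs(np.fft.fft2(log_polar_img))   # |FFT| of the log-polar image
    F[0, 0] = 0.0                            # drop the DC term (A.C. only)
    block = np.fft.fftshift(F)               # move low frequencies to centre
    c0, c1 = block.shape[0] // 2, block.shape[1] // 2
    h = keep // 2
    vec = block[c0 - h:c0 + h, c1 - h:c1 + h].ravel()
    n = np.linalg.norm(vec)
    return vec / n if n > 0 else vec         # normalization & vectorization

def fit_pca(features, d=12):
    """Fit PCA on a stack of feature vectors; return mean and top-d basis."""
    X = np.asarray(features)
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]

def project(vec, mu, basis):
    return basis @ (vec - mu)                # low-dimensional CFMT descriptor
```

The final quantization step (mapping each PCA coefficient to a small number of levels) is omitted; it would follow the projection.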

  8. SIFT Feature
  • Generally used for object recognition – hence, it can be used as an image similarity measure
  • Distance between SIFT features: the number of descriptor comparisons makes it computationally prohibitive
  • Speed-up: quantize the descriptors to a finite vocabulary (consisting of words)
  • Each image is then a weighted vector of the word frequencies

  9. Straight Vocabulary and Vocabulary Tree
  • Straight vocabulary: created by flat clustering of the image descriptors into words – e.g., a 12-word vocabulary needs 12 clusters
  • Vocabulary tree: created using hierarchical k-means on SIFT features; each feature belongs to one “word” at each level, from the most general words (M = 1) through M = 3 to the most specific words (M = 9); final vocabulary size = 3 + 9 = 12
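The straight-vocabulary case can be sketched as a flat k-means over descriptors followed by word-frequency counting. This is a toy numpy implementation with 2-D points standing in for 128-dim SIFT descriptors; a real system would use a library k-means and the actual descriptors.

```python
# Toy bag-of-words quantization: cluster descriptors into a flat "straight"
# vocabulary, then represent each image as a word-frequency vector.
import numpy as np

def kmeans(X, k, iters=20):
    # initialize centres spread evenly across the descriptor set
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest centre (its "word")
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def word_histogram(descriptors, centers):
    """Map an image's descriptors to words; return the frequency vector."""
    labels = np.argmin(
        ((descriptors[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()     # normalized word-frequency vector
```

A vocabulary tree would instead run this same k-means recursively within each cluster (hierarchical k-means), assigning every descriptor one word per level.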

  10. Straight Vocabulary vs Vocabulary Tree
  • The straight vocabulary does not consider the relationship between words
  • That is, it ignores that certain words are closer to each other than other words
  • At a very coarse level (dictionary size ~10-20), additional words are more descriptive than the relationships among words; therefore, the straight vocabulary outperforms the vocabulary tree
  • In our experiments, low-dimensional SIFT features obtained using the straight vocabulary perform much better as “fingerprints” than tree-based features

  11. Non-keyframe-based Video Fingerprint
  • Feature used for fingerprint creation: YCbCr histogram based feature
  [Diagram: the N video frames are split into K windows of P = N/K consecutive frames each; a 125-dimensional YCbCr histogram is computed per window, giving a K x 125 video fingerprint and avoiding key-frame extraction]
  • The whole color space is quantized into 125 bins (5 bins for each of Y, Cb and Cr)
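This windowed histogram can be sketched directly. The uniform 5-level quantization of each channel (bin width ~52 over [0, 256)) is an assumption; the slide only specifies 5 bins per channel.

```python
# Sketch of the non-keyframe fingerprint: one 125-bin joint YCbCr
# histogram per window of P = N/K consecutive frames -> K x 125 matrix.
import numpy as np

def ycbcr_fingerprint(frames_ycbcr, K):
    """frames_ycbcr: list of (H, W, 3) arrays with values in [0, 256)."""
    N = len(frames_ycbcr)
    P = N // K                               # frames per window
    fp = np.zeros((K, 125))
    for k in range(K):
        for frame in frames_ycbcr[k * P:(k + 1) * P]:
            # quantize each channel into 5 bins -> joint bin index in [0, 125)
            q = np.clip((frame // 52).astype(int), 0, 4)
            idx = q[..., 0] * 25 + q[..., 1] * 5 + q[..., 2]
            fp[k] += np.bincount(idx.ravel(), minlength=125)
        s = fp[k].sum()
        if s > 0:
            fp[k] /= s                       # normalize per-window histogram
    return fp
```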

  12. Signature Distance Computation
  • For two (K x d) fingerprints X and Y, where X(i) = ith feature vector of X:
    d(X, Y) = (1/K) * sum_{i=1}^{K} min_{1<=j<=K} || X(i) - Y(j) ||
  • Properties of this distance function:
    – d(X, Y) = 0 is possible even if X ≠ Y
    – d(X, Y) ≠ d(Y, X), in general
  • Such a distance relation is called a “quasi-distance”
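The closest-overlap quasi-distance, d(X, Y) = (1/K) * sum_i min_j ||X(i) - Y(j)|| as reconstructed here, is a few lines of numpy. The example below also exhibits the two stated properties: the distance can be zero for distinct (reordered) fingerprints, and it is asymmetric.

```python
# Closest-overlap quasi-distance between two fingerprints (rows = frames):
# each row of X is matched to its nearest row of Y, and the matched
# distances are averaged.
import numpy as np

def quasi_distance(X, Y):
    # pairwise L2 distances between every row of X and every row of Y
    pair = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return pair.min(axis=1).mean()

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
reordered = X[[2, 0, 1]]        # same frames, different temporal order
subset = X[:1]                  # as if most frames were dropped

print(quasi_distance(X, reordered))   # 0: reordering is a duplicate
print(quasi_distance(subset, X))      # 0, but d(X, subset) > 0: asymmetric
```

The min over j is what buys robustness to frame reordering and frame drops, at the price of symmetry; hence "quasi-distance" rather than a metric.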

  13. Motivation Behind the Distance Function
  This closest-overlap based distance is robust to:
  • Frame reordering: the temporal sequence may not be maintained between two signatures – e.g., a video consisting of a reordering of scenes from the same video is still regarded as a duplicate
  • Frame drops: if frame drops occur, or some video frames are corrupted by noise, the distance between duplicate videos should still be small

  14. Experiments and Results
  • We present precision-recall plots for both similarity and duplicate detection, over 3888 videos
  • CFMT for dimensions 36/24/20/12/4
  • SIFT for dimensions 781/341/33/21/12
  • CFMT vs the best-performing SIFT for duplicate detection; SIFT vs the best-performing CFMT for similarity detection
  • CFMT performs better for duplicate detection; SIFT performs better for similarity detection
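For reference, a precision-recall curve over candidate pairs can be traced by sweeping a distance threshold, counting true and false positives at each value. This is a generic sketch, not the paper's evaluation code.

```python
# Trace a precision-recall curve from pairwise fingerprint distances and
# boolean ground-truth labels (True = the pair really is a duplicate).
import numpy as np

def precision_recall(distances, is_match):
    """distances, is_match: parallel arrays over candidate pairs."""
    order = np.argsort(distances)        # smallest distance = most confident
    matches = np.asarray(is_match)[order]
    tp = np.cumsum(matches)              # true positives at each threshold
    fp = np.cumsum(~matches)             # false positives at each threshold
    precision = tp / (tp + fp)
    recall = tp / matches.sum()
    return precision, recall
```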

  15. [Figures] Precision-recall curves for different dimensional CFMT, for duplicate detection and for similarity detection

  16. [Figures] Precision-recall curves for different dimensional SIFT, for duplicate detection and for similarity detection

  17. [Figures] Precision-recall curves comparing the different descriptors, for duplicate detection and for similarity detection

  18. Full-length Video Retrieval with Clip Querying
  • Generation of the short query: we put together 4 different scenes from a full-length video to create our input query
    – Each individual scene is represented by 8 keyframes
    – So a single query has 4 x 8 = 32 keyframes
  • We experiment with different features for the query representation
  • The repository consists of full-length video signatures (65 videos)
    – The number of keyframes used to create the signature for each “large” video is varied from 1% to 4% of the video length

  19. Algorithm
  • Step 1: The input query signature X_query is a (32 x d) matrix
  • Step 2: Its distance from each stored “large video” signature X_large is computed as:
    Delta(i) = min_j || X_query(i) - X_large(j) ||, for 1 <= i <= 32
    D(X_query, X_large) = (1/32) * sum_{i=1}^{32} Delta(i)
  • Step 3: The best-matched video (smallest D) is returned
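The three steps, matching each of the 32 query keyframes to its closest keyframe in every stored signature and returning the video with the smallest averaged distance, can be sketched as:

```python
# Clip-query retrieval: D(X_query, X_large) = (1/32) * sum_i Delta(i),
# with Delta(i) the distance from query keyframe i to its closest
# keyframe in the large video's signature.
import numpy as np

def clip_distance(X_query, X_large):
    pair = np.linalg.norm(
        X_query[:, None, :] - X_large[None, :, :], axis=2)
    return pair.min(axis=1).mean()

def retrieve(X_query, repository):
    """repository: list of (video_id, signature) pairs; returns best id."""
    scores = [(clip_distance(X_query, sig), vid) for vid, sig in repository]
    return min(scores, key=lambda t: t[0])[1]
```

Because the query rows only need some close match inside the large signature, the query clip can be a small, reordered subset of the full video and still retrieve it.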

  20. [Figures] Retrieval results for 1% and for 4% summary lengths for the “large” videos

  21. Conclusions
  • CFMT features provide quick and accurate retrieval for duplicate videos
  • SIFT features perform better for similar video detection
  • Future work:
    – Expanding the domain of “similar” videos (non-retakes that are still similar?)
    – Importance of an efficient summary for creating the video signature (strategic keyframes vs random keyframes?)

  22. Thanks for your patience. Questions?
