
Event retrieval in large video collections with circulant temporal encoding




  1. Event retrieval in large video collections with circulant temporal encoding (CVPR 2013 Oral)

  2. Outline • 1. Introduction • 2. EVVE: an event retrieval dataset • 3. Frame description • 4. Circulant temporal aggregation • 5. Indexing strategy and complexity • 6. Experiments • 7. Conclusion

  3. 1. Introduction • This paper introduces an approach for retrieving specific events in large collections of videos. • Related tasks: video copy detection [13], which finds deformed copies of a video, and event category recognition [16]. • The method measures the similarity between two sequences for all possible temporal alignments: frame descriptors are jointly encoded in the frequency domain, and frequency pruning, regularization and quantization make the search efficient.

  4. 1. Introduction • Contributions: 1. Encode the frame descriptors of a video into a temporal representation. 2. Exploit the properties of circulant matrices to compare videos in the frequency domain. 3. A dataset named EVVE for specific event retrieval in large user-generated video content.

  5. 2. EVVE: an event retrieval dataset • Several of the events are localized precisely in time and space. • EVVE also includes events for which relevant videos might not correspond to the same instance in place or time. • Human annotators marked the videos as either positive or negative; ambiguous videos were removed. • A set of 100,000 distractor videos was also retrieved.

  6. 2. EVVE: an event retrieval dataset

  7. 3. Frame description • Pre-processing: frames are sampled at a fixed rate of 15 fps and resized to a maximum of 120k pixels. • Local description: SIFT descriptors are extracted on a dense grid; we square-root the SIFT components and reduce the descriptor to 32 dimensions with PCA. • Descriptor aggregation: the SIFT descriptors of a frame are encoded with MultiVLAD (VLAD: Vector of Locally Aggregated Descriptors). Two VLAD descriptors are obtained from two different codebooks of size 128 and concatenated; power-law normalization is applied to the vector and it is reduced by PCA to dimension d (the VLAD step is sketched below).
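A minimal numpy sketch of the single-codebook VLAD step under the parameters above; MultiVLAD concatenates two such vectors built from different codebooks before the final PCA. The function name and the exact normalization order are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def vlad(local_descs, centroids):
    """local_descs: (N, 32) root-SIFT after PCA; centroids: (128, 32)."""
    # Hard-assign each local descriptor to its nearest centroid.
    d2 = ((local_descs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                    # (N,)
    v = np.zeros_like(centroids)
    for x, a in zip(local_descs, assign):
        v[a] += x - centroids[a]                  # accumulate residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))           # power-law normalization
    return v / (np.linalg.norm(v) + 1e-12)        # L2-normalize
```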

  8. 4. Circulant temporal aggregation • The query is a sequence of frame descriptors $q = [q_1, \dots, q_m]$ and the database video is $b = [b_1, \dots, b_n]$, with the convention $q_t = 0$ if $t < 1$ or $t > m$ and $b_t = 0$ if $t < 1$ or $t > n$. The score at shift $\delta$ sums the frame similarities: $s_\delta(q, b) = \sum_t \langle q_t, b_{t+\delta} \rangle$ (sketched below). • Assumption 1: there is no (or limited) temporal acceleration. • Assumption 2: the inner product is a good similarity between individual frames. • Assumption 3: the sum of similarities between the frame descriptors reflects the similarity of the sequences. (In practice, this assumption is not well satisfied, because videos are very self-similar in time.)
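A direct transcription of the score above, before any Fourier speedup; the sign convention for the shift delta is one possible choice, and the function name is illustrative.

```python
import numpy as np

def score_at_shift(q, b, delta):
    """q: (m, d), b: (n, d) frame descriptors (0-indexed here);
    frames outside either sequence count as zero vectors."""
    m, n = len(q), len(b)
    return sum(q[t] @ b[t + delta] for t in range(m) if 0 <= t + delta < n)
```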

  9. 4. Circulant temporal aggregation • Circulant encoding of vector sequences. Per-dimension notation: $q^j = [q^j_1, \dots, q^j_m]$ collects the j-th component of every frame descriptor. • Assuming sequences of equal length (n = m, after zero-padding), the whole score vector is a sum of per-dimension correlations, which the convolution theorem turns into element-wise multiplications of Fourier transforms: $\mathcal{F}(s(q,b)) = \sum_{j=1}^{d} \mathcal{F}(q^j)^{*} \odot \mathcal{F}(b^j)$ (implemented below).
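A sketch of this frequency-domain computation with numpy. It returns the score for every circular shift at once, so both sequences should be zero-padded to a common length of at least m + n to avoid wraparound (the padding convention and function name are assumptions).

```python
import numpy as np

def all_shift_scores(q, b):
    """q, b: (n, d), zero-padded to equal length. Returns s(q, b)[delta]
    for every circular shift delta, using d 1-D FFTs per video."""
    Q = np.fft.fft(q, axis=0)            # F(q^j) for each dimension j
    B = np.fft.fft(b, axis=0)
    S = (np.conj(Q) * B).sum(axis=1)     # element-wise products, summed over j
    return np.fft.ifft(S).real           # back to the temporal domain
```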

  10. 4. Circulant temporal aggregation • Regularized comparison metric (which is more efficient to compute). • We consider hereafter that the sequences have been preprocessed (zero-padded) to the same length $m = n = 2^k$. • Because of the temporal consistency, and more generally the self-similarity of frames in videos, the values of the score vector s(q, b) are noisy and its peak over δ is not precisely localized.

  11. 4. Circulant temporal aggregation • The regularization is applied in the Fourier domain through an additional per-frequency filter $W_i$. • We compute $W_i$ assuming that the contributions are shared equally across the d descriptor dimensions.

  12. 4. Circulant temporal aggregation • One major drawback is that the denominator in Eqn. 8 may be close to zero. • This issue is solved by adding a regularization coefficient λ to the denominator (see the sketch below).
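A hedged sketch of what such a λ-stabilized score could look like: the raw per-frequency score is divided by the query's per-frequency energy, with one filter per frequency shared across dimensions as the previous slide suggests. Eqn. 8 itself is not reproduced in the transcript, so this is an assumption in the spirit of regularized correlation, not the paper's exact formula.

```python
import numpy as np

def regularized_scores(q, b, lam=0.1):
    """q, b: (n, d) zero-padded sequences; lam keeps the per-frequency
    denominator away from zero (hypothetical form of Eqn. 8)."""
    Q = np.fft.fft(q, axis=0)
    B = np.fft.fft(b, axis=0)
    num = (np.conj(Q) * B).sum(axis=1)          # raw per-frequency score
    den = (np.abs(Q) ** 2).sum(axis=1) + lam    # query energy + regularizer
    return np.fft.ifft(num / den).real
```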

  13. 4. Circulant temporal aggregation • Boundary detection. The time shift δ maximizing the score gives the optimal alignment. • In some applications such as video alignment (see Section 6), we also need the boundaries of the matching segments: the database descriptors are reconstructed in the temporal domain, and the matching sequence is defined as a set of contiguous t for which the per-frame scores $S_t$ are high enough (a simple version is sketched below).
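A minimal stand-in for that rule: keep the longest contiguous run of frames whose score passes a threshold. The "longest run" criterion and the threshold are illustrative choices, not the paper's exact procedure.

```python
def matching_segment(S, thresh):
    """S: sequence of per-frame scores S_t. Returns (start, end) of the
    longest contiguous run with S_t >= thresh, or None if no frame passes."""
    best, start = None, None
    for t, s in enumerate(S):
        if s >= thresh and start is None:
            start = t                             # a run begins
        if s < thresh and start is not None:
            if best is None or t - start > best[1] - best[0]:
                best = (start, t)                 # a run ends; keep if longest
            start = None
    if start is not None and (best is None or len(S) - start > best[1] - best[0]):
        best = (start, len(S))                    # run reaching the last frame
    return best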

  14. 5. Indexing strategy and complexity • The goal is to implement the method presented in Section 4 in an approximate manner. • Frequency-domain representation: a video b of length n is stored as a complex matrix containing the Fourier transform of each of its d descriptor dimensions. • Frequency pruning: the video representation is reduced by keeping only a fraction of the low-frequency vectors (sketched below).
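A sketch of the pruning step. The kept fraction beta is a free parameter here, and using rfft (which drops the redundant conjugate-symmetric half of a real signal's spectrum, leaving the lowest frequencies first) is an implementation assumption.

```python
import numpy as np

def pruned_fourier_descriptor(b, beta=0.25):
    """b: (n, d) zero-padded frame descriptors, n a power of two.
    Returns only a fraction beta of the lowest-frequency vectors."""
    B = np.fft.rfft(b, axis=0)            # (n//2 + 1, d), DC first
    keep = max(1, int(beta * B.shape[0]))
    return B[:keep]                       # low frequencies only
```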

  15. 5. Indexing strategy and complexity • Complex PQ-codes and metric optimization. • Descriptor sizes: we pre-compute a Fourier descriptor for different zero-padded versions of the query, by noticing that the Fourier descriptor of the concatenation of a signal with itself is the original spectrum (doubled) at the even frequencies, with zeros at the odd frequencies (checked numerically below). • The product quantization technique [9] is used: a compression technique that enables efficient compressed-domain comparison and search.
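The concatenation property is easy to verify numerically; the doubling at even bins follows directly from the DFT definition.

```python
import numpy as np

a = np.random.randn(8)
A = np.fft.fft(a)
AA = np.fft.fft(np.concatenate([a, a]))    # signal concatenated with itself
print(np.allclose(AA[0::2], 2 * A))        # True: even bins = doubled spectrum
print(np.allclose(AA[1::2], 0))            # True: odd bins vanish
```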

  16. 5. Indexing strategy and complexity • Each database vector is split into p sub-vectors (j = 1…p). • The sub-vectors are separately quantized using k-means quantizers, producing a vector of p indexes. • The comparison between a query descriptor x and the database vectors is performed in two stages. First, the squared distances between each query sub-vector $x^j$ and all the possible centroids are computed and stored in a table.

  17. 5. Indexing strategy and complexity • Second, the squared distance between x and a database vector y is approximated from these tables, which only requires p look-ups and additions (both stages are sketched below). • Summary of search procedure and complexity: 1. The video is pre-processed and each frame is described as a d-dimensional MultiVLAD descriptor. 2. This descriptor sequence is padded with zeros to the next power of two, and mapped to the Fourier domain using d independent 1-dimensional FFTs.
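A sketch of standard product quantization with asymmetric distance computation, for real vectors. The slides use a complex product quantizer, which one could emulate, for instance, by stacking real and imaginary parts; that is an assumption, not the paper's stated construction. Codebooks are assumed to be learned offline with k-means.

```python
import numpy as np

def pq_encode(y, codebooks):
    """y: (D,) database vector; codebooks: list of p arrays of shape
    (k, D/p), D divisible by p. Returns p centroid indexes."""
    subs = np.split(y, len(codebooks))
    return [int(((C - s) ** 2).sum(axis=1).argmin())
            for s, C in zip(subs, codebooks)]

def pq_asymmetric_distance(x, codes, codebooks):
    """Stage 1: one table of squared sub-distances per sub-vector of x.
    Stage 2: the distance to a stored code is p look-ups and additions."""
    subs = np.split(x, len(codebooks))
    tables = [((C - s) ** 2).sum(axis=1) for s, C in zip(subs, codebooks)]
    return sum(table[c] for table, c in zip(tables, codes))
```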

  18. 5. Indexing strategy and complexity 3. High frequencies are pruned: only a fraction of the frequency vectors are kept. After this step, the video is represented by a reduced set of d-dimensional complex vectors. 4. These vectors are separately encoded with a complex product quantizer, producing a compressed representation of a fixed number of bytes for the whole video. • Complexity analysis of the search procedure.

  19. 5. Indexing strategy and complexity

  20. 6. Experiments

  21. 6. Experiments • For TRECVID, the evaluation measure is NDCR (lower = better).

  22. 6. Experiments

  23. 6. Experiments

  24. 7. Conclusion • This video representation provides an efficient search scheme. • Extensive experiments on two video copy detection benchmarks show that our approach improves over the state of the art with respect to accuracy, search time and memory usage.
