
Automatic Summarization of Rushes Video using Bipartite Graphs



  1. Automatic Summarization of Rushes Video using Bipartite Graphs. Liang Bai, Songyang Lao, National University of Defense Technology, PR China; Alan F. Smeaton, Noel E. O'Connor, Dublin City University, Ireland

  2. Contents
     • Introduction
       • Video summaries
       • Video summarization for rushes in TRECVID
       • Various approaches in 2007 and 2008
     • Our Proposed Approach
       • Video Structuring
       • Useless Content Removal
       • Re-take Shot Detection and Removal
       • Representative Shots and Summary Generation
     • Experiments and Our Results
       • Ground Truth
       • Experimental Results
     • Conclusion and Discussion

  3. Video Summarisation
     • A summary is a condensed version of something, allowing judgments about the full item to be made with less time and effort than consuming the whole.
     • In a world of information overload, summaries have widespread application: as surrogates for search results, as previews, and for familiarisation with unknown collections.
     • Video summaries can be keyframes (static storyboards, dynamic slideshows), skims (fixed or variable speed) or multi-dimensional browsers.
     • The literature and previous work show interest in evaluating summaries, but datasets have always been small, single-site and closed.
     • TVS'07 and TVS'08 were "tracks" in TRECVID; they have energised work in the area and made data, evaluation metrics, etc. available.

  4. Summarisation Data
     • BBC provided 11 boxes of BETA SP tapes: 250 hours of rushes from dramatic series… Casualty, House of Elliot, Jonathan Creek, Ancient Greece, Between the Lines and other miscellaneous material.
     • Rushes video is pre-production footage with lots of noise and redundancy; it was digitised into MPEG-1 and used for training and testing.

  5. TRECVid Measures and Process
     • TRECVid created ground truth and invited summaries of 4% (2007) and 2% (2008) of the original video duration.
     • The task is to reduce the time needed to grasp content; the constraint is single playback, with no interaction except play/pause.
     • Subjective measures:
       • Fraction of (12 items of) ground truth found
       • Ease of use
       • Amount of near-redundancy
     • Objective measures:
       • Assessment time to judge included ground truth
       • Summary duration
       • Summary creation compute time

  6. Evaluation Process (diagram)
     • 39 rushes videos (each approx. 30 min); 31 participants each submitted a 2% summary per video, giving 39 × 31 = 1209 video summaries in total.
     • For each rushes video, a list of 24 ground-truth items (e.g. "A man standing beside...") was drawn up after watching the original video; 12 GT items per video were randomly selected for judging.
     • All summaries of each video were grouped, randomised and assigned to 3 of the 10 assessors.
     • Measures: fraction of GT included; lack of junk video; lack of redundancy; pleasant tempo and rhythm; GT assessment time.

  7. 22 Participating groups 2007
     • AT&T: shot clustering to remove redundancy; use shot with most speech/faces
     • Brno Univ.: cluster shots using PCA, remove junk shots
     • CMU: k-means clustering using iterative colour matching, audio coherence
     • City UHK: object detection, camera motion, keypoint matching for repetitive shots
     • Columbia: duplicate shot detection and ASR
     • COST292: face, camera motion, audio excitement
     • Curtin U: shot clustering using SIFT matching
     • DCU: amount of motion & faces for keyframe selection
     • FXPAL: colour distribution, camera motion, for repetition detection
     • HUT: SOMs for shot pruning to eliminate redundancy
     • HKPU: junk shot removal, visual & aural redundancy

  8. 22 Participating groups 2007 (cont.)
     • Eurecom: determine the most non-redundant shots
     • Joanneum: variant of LCSS to cluster re-takes of the same scene
     • KDDI: use only low-level features for fast summarisation
     • LIP6: eliminate repeating shots using a 'stacking' technique
     • NII: feature extraction and clustering
     • Natl. Taiwan U: low-level shot similarity & motion vectors, then cluster
     • Tsinghua/Intel: keyframe clustering, repetitive segments, main scenes/actors
     • UCSB: k-means clustering on high-level features, speech, camera motion
     • Glasgow: 0-1 knapsack optimisation problem, shot clustering
     • UA Madrid: single pass for realtime clustering on-the-fly, colour-based
     • Sheffield: concatenate some frames from the middle of each shot

  9. 31 Participating groups 2008
     • Almost all groups explicitly searched for and removed junk frames.
     • The majority used some form of clustering of shots/scenes in order to detect redundancy.
     • Several groups included face detection as a component.
     • Most groups used visual features only, though some also used audio in selecting segments to include in the summary.
     • Camera motion/optical flow was used by some.
     • Most groups used the whole frame for selection, though some also used frame regions.

  10. Summary generation
     • Even more variety among techniques for summary generation than for summary selection.
     • Many groups used FF or VS/FF video playback.
     • Several groups incorporated visual indicator(s) of offset into the original video source within the summary.
     • Some used an overall storyboard of keyframes.
     • Presentation styles seen: plain keyframes; plain clips; clips of 1s duration; FF and VSFF clips; main scene/actor re-caps; clips with indicators of offset/re-takes; clips with picture-in-picture; clips in 4-windows.

  11. Fraction GT/ease of use 2007

  12. What is the best combination? From a high-level analysis of participant approaches and performances in 2007, we were able to "pick" promising-looking techniques … which we did.

  13. Our Approach (take 2!)

  14. Our Approach: Video Structuring
     We used a mutual information (MI) measure for shot detection. The probability that a pixel with gray level i in frame f_t has gray level j in frame f_{t+1} is given by the normalised co-occurrence matrix $C^R_{t,t+1}(i,j)$ (shown for the R component).
     The mutual information of frames f_k, f_l for the R component is expressed as:
       $I^R_{k,l} = \sum_{i}\sum_{j} C^R_{k,l}(i,j) \log \frac{C^R_{k,l}(i,j)}{C^R_k(i)\, C^R_l(j)}$
     The total mutual information between frames f_k and f_l is defined as:
       $I_{k,l} = I^R_{k,l} + I^G_{k,l} + I^B_{k,l}$
     Local mutual information mean values on a temporal window W of size N_w for frame f_t are calculated as:
       $\bar{I}_t = \frac{1}{N_w} \sum_{f_j \in W} I_{j,j+1}$

  15. Mutual information measure for shot detection (cont.)
     The standard deviation of mutual information on the window is calculated as:
       $\sigma_t = \sqrt{\frac{1}{N_w} \sum_{f_j \in W} (I_{j,j+1} - \bar{I}_t)^2}$
     Step 1: calculate the mutual information time series $I_{t,t+1}$ for all consecutive frame pairs.
     Step 2: calculate $\bar{I}_t$ and $\sigma_t$ at each temporal window in which f_t is the first frame.
     Step 3: if $(\bar{I}_t - I_{t,t+1}) / \sigma_t > T$ for a decision threshold T, frame f_t is determined to be a shot boundary.
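
To make slides 14-15 concrete, here is a minimal Python sketch of MI-based shot boundary detection, assuming OpenCV-style RGB frames as NumPy arrays; the bin count, window size N_w and threshold T are illustrative assumptions, not values from the paper.

```python
# Sketch of the mutual-information shot boundary detector (slides 14-15).
import numpy as np

NBINS = 64  # gray levels per channel (assumed quantisation)

def channel_mi(a, b):
    """Mutual information between one colour channel of two frames."""
    # Joint histogram C(i, j): probability that a pixel with level i in
    # frame t has level j in frame t+1.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=NBINS,
                                 range=[[0, 256], [0, 256]])
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of frame t
    pb = joint.sum(axis=0, keepdims=True)   # marginal of frame t+1
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def frame_mi(f1, f2):
    """Total MI = sum over R, G, B components (slide 14)."""
    return sum(channel_mi(f1[:, :, c], f2[:, :, c]) for c in range(3))

def detect_shot_boundaries(frames, n_w=15, t=3.0):
    """Adaptive thresholding over a sliding window (slide 15).

    A frame is flagged when its MI drops t standard deviations below
    the local window mean -- our reading of the slide's decision rule;
    t is an assumed threshold.
    """
    mi = np.array([frame_mi(frames[i], frames[i + 1])
                   for i in range(len(frames) - 1)])
    boundaries = []
    for i in range(len(mi) - n_w):
        window = mi[i:i + n_w]       # window with f_t as first frame
        mean, std = window.mean(), window.std()
        if std > 0 and (mean - mi[i]) / std > t:
            boundaries.append(i)
    return boundaries
```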

  16. Sub-shot partitioning
     • Divide each frame into an 8x8 pixel grid and calculate the mean and variance of RGB colour in each cell.
     • Use Euclidean distance to measure the difference between neighbouring frames.
     • Within one sub-shot the cumulative frame difference shows only gradual change, so high-curvature points on the cumulative frame-difference curve are very likely to indicate sub-shot boundaries.
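
A minimal sketch of this partitioning step, assuming the grid is 8x8 cells per frame and that "high curvature" can be approximated by a large discrete second difference of the cumulative curve; the curvature threshold is illustrative.

```python
# Sketch of sub-shot partitioning (slide 16): per-cell colour statistics,
# Euclidean frame distance, curvature test on the cumulative curve.
import numpy as np

def grid_features(frame, grid=8):
    """Mean and variance of RGB colour in each cell of a grid x grid layout."""
    h, w, _ = frame.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            cell = frame[r * h // grid:(r + 1) * h // grid,
                         c * w // grid:(c + 1) * w // grid]
            feats.extend(cell.mean(axis=(0, 1)))
            feats.extend(cell.var(axis=(0, 1)))
    return np.array(feats)

def subshot_boundaries(frames, curv_thresh=50.0):
    """High-curvature points on the cumulative frame-difference curve."""
    feats = [grid_features(f) for f in frames]
    diffs = [np.linalg.norm(feats[i + 1] - feats[i])
             for i in range(len(feats) - 1)]
    cum = np.cumsum(diffs)
    # Approximate curvature with the discrete second difference (assumption).
    curvature = np.abs(np.diff(cum, 2))
    return [i + 1 for i in np.flatnonzero(curvature > curv_thresh)]
```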

  17. Proposed Approach: Useless Content Removal
     • Shots shorter than 1 second are removed outright.
     • Four features are extracted from frames: colour layout, scalable colour, edge histogram and homogeneous texture.
     • SVM classifiers were built for detecting colour bars and monochromatic shots.
     • The Near-Duplicate Keyframe (NDK) detection algorithm described in [C. W. Ngo] is used to detect clapperboards.
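
A brief sketch of how such an SVM junk classifier could be trained with scikit-learn, assuming the four descriptors have already been extracted and concatenated into one vector per keyframe; the RBF kernel, scaling step and training data are assumptions, not details from the slides.

```python
# Sketch of the SVM junk-shot classifier (slide 17).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_junk_classifier(features, labels):
    """features: (n_keyframes, d) array of concatenated colour-layout,
    scalable-colour, edge-histogram and homogeneous-texture descriptors;
    labels: 1 for junk (colour bars / monochromatic), 0 otherwise."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(features, labels)
    return clf

# Usage (hypothetical data):
# clf = train_junk_classifier(X_train, y_train)
# keep = [kf for kf, x in zip(keyframes, X) if clf.predict([x])[0] == 0]
```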

  18. Proposed Approach: Re-take Shot Detection and Removal
     In rushes video, the same shot can be re-taken many times in order to eliminate actor or filming mistakes.

  19. Re-take Shot Detection and Removal: the principle
     • The similarity between shots can be measured from the similarity of the key frames extracted from the corresponding shots.
     • Re-take shots can be detected by modelling the continuity of similarity of key frames.
     • We use maximal matching in bipartite graphs for similarity detection between video shots; patterns in these graphs indicate shot re-takes.
     • The similarity measure between video shots is divided into two phases: key frame similarity and shot similarity.

  20. Re-take Shot Detection and Removal
     Key frame similarity component: a video shot is partitioned into several sub-shots and one key frame is extracted from each sub-shot, so the similarity between the corresponding key frames stands in for the similarity between sub-shots. Key frame similarity is measured using spatial colour histogram and texture features.
     Shot similarity using maximal matching in bipartite graphs: a shot is expressed as $S = \{k_1, k_2, \dots, k_n\}$, where $k_i$ represents the i-th key frame. For two shots $S_x$ and $S_y$, the similar key frames between them can be expressed by a bipartite graph $G = (V, E)$, where $V = S_x \cup S_y$ and an edge $(k^x_i, k^y_j) \in E$ indicates that $k^x_i$ is similar to $k^y_j$.

  21. Re-take Shot Detection and Removal
     There can be many similar pairs of key frames between two re-take shots, and often within a single re-take shot as well. This results in one-to-many, many-to-one and many-to-many relations in the bipartite graph, and in that case many similar key frame pairs can also be found between two dissimilar shots.

  22. Re-take Shot Detection and Removal
     So, the similarity between two shots is measured by the maximal matching M of similar key frames, computed with the Hungarian algorithm. If |M| is sufficiently large relative to min(n, m), where n, m are the numbers of key frames in the two shots, one shot is determined to be similar to the other (see the sketch below).
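
A small sketch of this shot-similarity test, using SciPy's Hungarian-style solver (linear_sum_assignment) over a 0/1 adjacency built from pairwise key frame similarity; the similarity function sim, its threshold, and the relative matching threshold t are assumptions rather than values from the slides.

```python
# Sketch of the bipartite-matching shot similarity test (slides 20-22).
import numpy as np
from scipy.optimize import linear_sum_assignment

def shots_similar(keyframes_x, keyframes_y, sim, sim_thresh=0.8, t=0.5):
    """True if the maximal matching M of similar key frame pairs is
    large relative to min(n, m) -- our reading of slide 22's criterion.
    sim(kx, ky) is an assumed key frame similarity in [0, 1]."""
    n, m = len(keyframes_x), len(keyframes_y)
    # Edge (i, j) exists when the two key frames are similar enough.
    adj = np.array([[1.0 if sim(kx, ky) >= sim_thresh else 0.0
                     for ky in keyframes_y] for kx in keyframes_x])
    rows, cols = linear_sum_assignment(adj, maximize=True)
    matched = int(adj[rows, cols].sum())   # |M|: matched similar pairs
    return matched >= t * min(n, m)
```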

  23. Selecting Representative Shots and Summary Generation
     • With a 4% duration limit, the most representative clips must be selected to generate the final summary.
     • Motion and face factors rank how representative each remaining sub-shot is in the context of the overall video; sub-shot duration is also important, so a simple weighting combines the factors.
     • The sub-shots with the highest scores are selected, subject to the summary duration limit.
     • Finally, a 1-second clip centred on the keyframe of each selected sub-shot is extracted to generate the final summary.
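
A minimal sketch of this selection step, assuming a linear weighting of normalised motion, face and duration scores and a greedy fill of the duration budget; the weight values are illustrative, not from the paper.

```python
# Sketch of representative-clip selection (slide 23).
def select_subshots(subshots, budget_sec, w_motion=0.4, w_face=0.4, w_dur=0.2):
    """subshots: list of dicts with normalised 'motion', 'face' and
    'duration' scores in [0, 1] plus 'keyframe_time' in seconds; each
    selected sub-shot contributes a 1-second clip centred on its keyframe."""
    ranked = sorted(subshots,
                    key=lambda s: (w_motion * s["motion"] +
                                   w_face * s["face"] +
                                   w_dur * s["duration"]),
                    reverse=True)
    clips, used = [], 0.0
    for s in ranked:
        if used + 1.0 > budget_sec:   # stop at the summary duration limit
            break
        clips.append((s["keyframe_time"] - 0.5, s["keyframe_time"] + 0.5))
        used += 1.0
    return clips
```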

  24. Experiments: Experimental Setup
     The seven criteria set by the TRECVid guidelines for summarization evaluation are:
     • EA: easy to understand (1 strongly disagree - 5 strongly agree)
     • RE: little duplicate video (1 strongly disagree - 5 strongly agree)
     • IN: fraction of inclusions found in the summary (0 - 1)
     • DU: duration of the summary (sec)
     • XD: difference between target and actual summary size (sec)
     • TT: total time spent judging the inclusions (sec)
     • VT: video play time (vs. pause) to judge the inclusions (sec)
     Ten participants were selected to review the summaries under exactly the same guidelines as provided by NIST and to give their scores for the four subjective criteria.

  25. Experiments: Investigating subjective variation in the evaluation process
     By running our own evaluation we could potentially introduce new subjective variations into the evaluation process, so we first evaluated three sets of results: the two TRECVid baselines and our own original submission.

  26. Experimental results on our summaries
     Our enhanced approach yields a big improvement in IN (0.40) with slightly longer summaries (0.71 sec longer) compared with our original approach. Our enhanced approach's XD is 18.83 sec, which is 8.5 sec longer than the mean of the other 22 teams. (IN: fraction of inclusions; DU: duration; XD: target vs. actual duration difference.)

  27. Experimental results on all of our summaries
     We obtained very encouraging results for EA and RE. The results show our enhanced approach performs competitively compared with the other teams and the baselines. (EA: ease of understanding; RE: little duplication; TT: time taken to judge; VT: video play time.)

  28. Conclusions
     • Rushes videos include many useless and redundant shots and are organised on the basis of (filmic) shot structure.
     • Shot and sub-shot detection for video structuring really seems to help in selecting material for rushes summaries.
     • The SVM-based method for removing useless content is useful.
     • We introduced modelling the similarity of key frames between two shots with bipartite graphs and measuring shot similarity by maximal matching for re-take shot detection, plus MI for shot boundary detection.
     • Selecting the most representative clips based on motion, faces and duration of sub-shots seems to work.
     • We obtained improvements compared to our original approach and, more importantly, compared to the other participating teams and the TRECVID baselines.
     • Video summarization clearly still remains challenging: most submissions cannot significantly outperform the two baselines … we were lucky, we made good, informed guesses!
     • A deeper semantic understanding of the content would help in this regard.

  29. Thank You
