Color map for flow visualization
(a) A selected frame
(b) Layer labeling
(c) User-annotated motion
(d) Ground-truth from 
(e) Difference between (c) and (d)
Figure 3. For the RubberWhale sequence in , we labeled 20 layers in (b) and obtained the annotated motion in (c). The “ground-truth” motion from  is shown in (d). The error between (c) and (d) is 3.21° in average angular error (AAE) and 0.104 in average endpoint error (AEP), excluding the outliers (black dots) in (d).
The Hebrew University of Jerusalem
Ce Liu William T. Freeman Edward H. Adelson
Massachusetts Institute of Technology
- Existing motion databases are either synthetic or limited to indoor, experimental setups. Can we obtain ground-truth motion for arbitrary, real-world videos?
- Humans are experts at segmenting moving objects and perceiving differences between two frames. Can we build a computer vision system that quantifies human perception of motion and generates ground-truth for motion analysis?
- Several issues need to be addressed:
- Is human labeling reliable (compared to the veridical ground-truth) and consistent (across subjects)?
- How can we efficiently label every pixel in every frame for hundreds of real-world videos?
Figure 1. The graphical user interface (GUI) of our system: (a) main window for labeling contours and feature points; (b) depth controller to change depth value; (c) magnifier; (d) optical flow viewer; (e) control panel.
- Our work
- We designed a human-in-the-loop system to annotate motion for real-world videos:
- Semiautomatic layer segmentation: The user labels contours using polygons, and the system automatically propagates the contours to other frames. The system can also propagate the user's corrections across frames.
- Automatic layer-wise optical flow: The system automatically computes dense optical flow fields for every layer at every frame using user-specified parameters. For each layer, the user picks the flow that best yields the correct matching and agrees with the smoothness and discontinuities of the image.
- Semiautomatic motion labeling: When flow estimation fails, the user can label sparse correspondences between two frames, and the system automatically interpolates them to a dense flow field.
- Automatic full-frame motion composition.
- We validated our methodology by comparing against veridical ground-truth data and through user studies.
- We created a ground-truth motion database consisting of 10 real-world video sequences (still growing). This database can be used for evaluating motion analysis algorithms as well as other vision and graphics applications.
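The poster does not spell out how the full-frame motion composition step works. A minimal sketch of one plausible reading follows: each layer carries a depth value (as set with the depth controller in Figure 1), and at every pixel the flow of the nearest covering layer wins. The function name, the dictionary keys, and the convention that a smaller depth means closer to the camera are all assumptions made for illustration, not the system's actual data structures.

```python
def compose_flow(layers, height, width, background=(0.0, 0.0)):
    """Compose per-layer flows into one full-frame flow field.

    layers: list of dicts with hypothetical keys
        "depth": number; smaller means closer to the camera (assumption)
        "mask":  height x width nested list of booleans
        "flow":  height x width nested list of (u, v) tuples
    """
    frame = [[background for _ in range(width)] for _ in range(height)]
    # Paint far-to-near so nearer layers overwrite the ones behind them.
    for layer in sorted(layers, key=lambda item: item["depth"], reverse=True):
        for y in range(height):
            for x in range(width):
                if layer["mask"][y][x]:
                    frame[y][x] = layer["flow"][y][x]
    return frame


# Toy 1 x 2 frame: a near layer covers only the left pixel.
near = {"depth": 1, "mask": [[True, False]], "flow": [[(1.0, 0.0), (1.0, 0.0)]]}
far = {"depth": 5, "mask": [[True, True]], "flow": [[(0.0, 2.0), (0.0, 2.0)]]}
composed = compose_flow([far, near], height=1, width=2)
```

In this toy case the left pixel takes the near layer's flow and the right pixel falls through to the far layer.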
- We applied our system to annotate a veridical example from  (Figure 3). Our annotation is very close to theirs: 3.21° AAE, 0.104 AEP. The main differences lie along occluding boundaries.
- We tested the consistency of human annotation across subjects (Figure 2). The mean error is 0.989° AAE, 0.112 AEP. The error magnitude correlates with the blurriness of the image.
- We created a ground-truth motion database containing 10 real-world videos with 341 frames (Figure 5, Table 1) for both indoor and outdoor scenes. The statistics of the ground-truth motion are plotted in Figure 4.
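The AAE and AEP figures quoted above can be computed along the following lines. This sketch assumes the conventional definitions, which the poster does not state: the angular error is the angle between the space-time vectors (u, v, 1) of the two flows, and the endpoint error is the Euclidean distance between the two flow vectors. The function name and the optional `valid` mask for skipping outliers are illustrative.

```python
import math


def flow_errors(flow, truth, valid=None):
    """Return (AAE in degrees, AEP in pixels) over the valid pixels.

    flow, truth: lists of (u, v) tuples, one per pixel.
    valid: optional list of booleans; False marks an outlier to skip.
    """
    if valid is None:
        valid = [True] * len(flow)
    angles, endpoints = [], []
    for (u, v), (ug, vg), ok in zip(flow, truth, valid):
        if not ok:
            continue
        # Angle between the space-time vectors (u, v, 1) and (ug, vg, 1).
        dot = u * ug + v * vg + 1.0
        norm = math.sqrt(u * u + v * v + 1.0) * math.sqrt(ug * ug + vg * vg + 1.0)
        cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
        angles.append(math.degrees(math.acos(cos_angle)))
        endpoints.append(math.hypot(u - ug, v - vg))
    if not angles:
        raise ValueError("no valid pixels")
    return sum(angles) / len(angles), sum(endpoints) / len(endpoints)


# Two pixels; the second is masked out as an outlier.
aae, aep = flow_errors(
    [(0.0, 0.0), (5.0, 5.0)],
    [(1.0, 0.0), (0.0, 0.0)],
    valid=[True, False],
)
```

Appending the constant 1 keeps the angle well defined for zero flow, which is why a zero vector compared against a one-pixel motion yields a finite angular error.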
Figure 2. The consistency of nine subjects’ annotation. Clockwise from top left: the image frame, mean labeled motion, mean absolute error (red: higher error, white: lower error), and error histogram.
Figure 5. Some frames of the ground-truth motion database we created. We obtained ground-truth flow fields that are consistent with object boundaries, as shown in columns (3) and (4). In comparison, the output of an optical flow algorithm  is shown in column (5). From Table 1, the performance of this algorithm on our database is worse than its performance on the Yosemite sequence (1.723° AAE, 0.071 AEP).
- System Features
- We used state-of-the-art computer vision algorithms to design our system. Many of the objective functions in contour tracking, flow estimation and flow interpolation use L1 norms for robustness. Techniques such as iteratively reweighted least squares (IRLS), pyramid-based coarse-to-fine search and occlusion/outlier detection were used extensively to optimize these nonlinear objective functions.
- The system was written in C++, and Qt™ 4.3 was used for GUI design (Figure 1). Our system has all the components needed to make annotation simple and easy, and it also gives the user full freedom to label motion manually.
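To illustrate how IRLS handles an L1 objective, here is a toy sketch that is far simpler than the contour-tracking and flow objectives in the system. It estimates a single value x minimizing the sum of |x - d| over the data, whose minimizer is a median. Each iteration replaces |r| with a weighted square w r² using w = 1/|r| from the previous estimate, then solves the weighted least-squares problem in closed form. The function name and the `eps` floor are assumptions made for this example.

```python
def irls_l1_location(data, iterations=100, eps=1e-9):
    """Minimize sum(|x - d| for d in data) by IRLS; the minimizer is a median."""
    x = sum(data) / len(data)  # start from the least-squares answer (the mean)
    for _ in range(iterations):
        # Reweight: an L1 term |r| is treated as w * r**2 with w = 1 / |r|.
        weights = [1.0 / max(abs(x - d), eps) for d in data]
        # Solve the resulting weighted least-squares problem in closed form.
        x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
    return x


samples = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier
robust = irls_l1_location(samples)
mean = sum(samples) / len(samples)
```

The estimate settles near the median of the samples while the plain mean is dragged toward the outlier, which is the kind of robustness the L1 norm is chosen for.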
Table 1. The performance of an optical flow algorithm  on our database
Figure 4. The marginal ((a)~(h)) and joint ((i)~(n)) statistics of the ground-truth motion from the database we created (log histogram). The symbols u and v denote horizontal and vertical motion, respectively. These statistics show that horizontal motion dominates vertical motion; vertical motion is sparser than horizontal motion; flow fields are sparser than natural images; and spatial derivatives are sparser than temporal derivatives.