Advanced multimedia
Download
1 / 83

Advanced Multimedia - PowerPoint PPT Presentation


  • 117 Views
  • Uploaded on

Advanced Multimedia. Tamara Berg Video & Tracking. Video. A video is a sequence of frames captured over time Now our image data is a function of space (x, y) and time (t). Slide Credit: Svetlana Lazebnik. 5.1 Types of Video Signals. Component video

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Advanced Multimedia' - luce


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Advanced multimedia

Advanced Multimedia

Tamara Berg

Video & Tracking


Video
Video

  • A video is a sequence of frames captured over time

  • Now our image data is a function of space (x, y) and time (t)

Slide Credit: Svetlana Lazebnik


5 1 types of video signals
5.1 Types of Video Signals

  • Component video

  • • Component video: Higher-end video systems make use of three separate video signals for the red, green, and blue image planes. Each color channel is sent as a separate video signal.

    • (a) Most computer systems use Component Video, with separate signals for R, G, and B signals.

    • (b) For any color separation scheme, Component Video gives the best color reproduction since there is no “crosstalk” between the three channels.

    • (c) This is not the case for S-Video or Composite Video, discussed next. Component video, however, requires more bandwidth and good synchronization of the three components.

Li & Drew


Composite video 1 signal
Composite Video — 1 Signal

• Composite video: color (“chrominance”) and intensity (“luminance”) signals are mixed into a single carrier wave.

a) Chrominance is a composition of two color components (U and V).

b) In NTSC TV, e.g., U and V are combined into a chroma signal, and a color subcarrier is then employed to put the chroma signal at the high-frequency end of the signal shared with the luminance signal.

c) The chrominance and luminance components can be separated at the receiver end and then the two color components can be further recovered.

d) When connecting to TVs or VCRs, Composite Video uses only one wire and video color signals are mixed, not sent separately. The audio and sync signals are additions to this one signal.

• Since color and intensity are wrapped into the same signal, some interference between the luminance and chrominance signals is inevitable.

Li & Drew


Yuv color space
YUV color space

Example of U-V color plane,

Y' value = 0.5.


S video 2 signals
S-Video — 2 Signals

  • • S-Video: as a compromise, (separated video, or Super-video, e.g., in S-VHS) uses two wires, one for luminance and another for a composite chrominance signal.

  • • As a result, there is less crosstalk between the color information and the crucial gray-scale information.

  • • The reason for placing luminance into its own part of the signal is that black-and-white information is most crucial for visual perception.

    • – In fact, humans are able to differentiate spatial resolution in grayscale images with a much higher acuity than for the color part of color images.

    • – As a result, we can send less accurate color information than must be sent for intensity information — we can only see fairly large blobs of color, so it makes sense to send less color detail.

Li & Drew


5 2 analog video
5.2 Analog Video

  • • An analog signal f(t) samples a time-varying image. So-called “progressive” scanning traces through a complete picture (a frame) row-wise for each time interval.

  • • In TV, and in some monitors and multimedia standards as well, another system, called “interlaced” scanning is used:

    • a) The odd-numbered lines are traced first, and then the even-numbered

    • lines are traced. This results in “odd” and “even” fields — two fields

    • make up one frame.

Li & Drew


  • Fig. 5.1: Interlaced raster scan

  • b) Figure 5.1 shows the scheme used. First the solid (odd) lines are traced, P to Q, then R to S, etc., ending at T; then the even field starts at U and ends at V.

  • c) The jump from Q to R, etc. in Figure 5.1 is called the horizontal retrace, during which the electronic beam in the CRT is blanked. The jump from T to U or V to P is called the vertical retrace.

Li & Drew


  • • Because of interlacing, the odd and even lines are displaced in time from each other — generally not noticeable except when very fast action is taking place on screen, when blurring may occur.

  • • For example, in the video in Fig. 5.2, the moving helicopter is blurred more than is the still background.

Li & Drew


Fig. 5.2: Interlaced scan produces two fields for each frame. (a) The

video frame, (b) Field 1, (c) Field 2, (d) Difference of Fields

(a)

(b)

(c)

(d)

Li & Drew


  • • Since it is sometimes necessary to change the frame rate, resize, or even produce stills from an interlaced source video, various schemes are used to “de-interlace” it.

    • a) The simplest de-interlacing method consists of discarding one field and duplicating the scan lines of the other field. The information in one field is lost completely using this simple technique.

    • b) Other more complicated methods that retain information from both fields are also possible.

Li & Drew


Ntsc video
NTSC Video rate, resize, or even produce stills from an interlaced source video, various schemes are used to “de-interlace” it.

• NTSC (National Television System Committee) TV standard is mostly used in North America and Japan. It uses the familiar 4:3 aspect ratio (i.e., the ratio of picture width to its height) and uses 525 scan lines per frame at 30 frames per second (fps).

a) NTSC follows the interlaced scanning system, and each frame is divided into two fields, with 262.5 lines/field.

b) Thus the horizontal sweep frequency is 525×29.97 ≈ 15, 734 lines/sec, so that each line is swept out in 1/15.734 × 103 sec ≈ 63.6μsec.

c) Since the horizontal retrace takes 10.9 μsec, this leaves 52.7 μsec for the active line signal during which image data is displayed (see Fig.5.3).

Li & Drew


Fig. 5.3 Electronic signal for one NTSC scan line. rate, resize, or even produce stills from an interlaced source video, various schemes are used to “de-interlace” it.

Li & Drew


  • • NTSC video is an analog signal with no fixed horizontal resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel output.

  • • A “pixel clock” is used to divide each horizontal line of video into samples. The higher the frequency of the pixel clock, the more samples per line there are.

  • • Different video formats provide different numbers of samples per line, as listed in Table 5.1.

  • Table 5.1: Samples per line for various video formats

Li & Drew


Pal video
PAL Video resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel

• PAL (Phase Alternating Line) is a TV standard widely used in Western Europe, China, India, and many other parts of the world.

• PAL uses 625 scan lines per frame, at 25 frames/second, with a 4:3 aspect ratio and interlaced fields.

(a) PAL uses the YUV color model. It uses an 8 MHz channel and allocates a bandwidth of 5.5 MHz to Y, and 1.8 MHz each to U and V. The color subcarrier frequency is fsc≈ 4.43 MHz.

(b) In order to improve picture quality, chroma signals have alternate signs (e.g., +U and -U) in successive scan lines, hence the name “Phase Alternating Line”.

(c) This facilitates the use of a (line rate) comb filter at the receiver — the signals in consecutive lines are averaged so as to cancel the chroma signals (that always carry opposite signs) for separating Y and C and obtaining high quality Y signals.

Li & Drew


Secam video
SECAM Video resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel

• SECAM stands for Système Electronique Couleur Avec Mémoire, the third major broadcast TV standard.

• SECAM also uses 625 scan lines per frame, at 25 frames per second, with a 4:3 aspect ratio and interlaced fields.

• SECAM and PAL are very similar. They differ slightly in their color coding scheme:

(a) In SECAM, U and V signals are modulated using separate color subcarriers at 4.25 MHz and 4.41 MHz respectively.

(b) They are sent in alternate lines, i.e., only one of the U or V signals will be sent on each scan line.

Li & Drew


• Table 5.2 gives a comparison of the three major analog resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel broadcast TV systems.

Table 5.2: Comparison of Analog Broadcast TV Systems

Li & Drew


5 3 digital video
5.3 Digital Video resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel

• The advantages of digital representation for video are many. For example:

(a) Video can be stored on digital devices or in memory, ready to be processed (noise removal, cut and paste, etc.), and integrated to various multimedia applications;

(b) Direct access is possible, which makes nonlinear video editing achievable as a simple, rather than a complex, task;

(c) Repeated recording does not degrade image quality;

(d) Ease of encryption and better tolerance to channel noise.

Li & Drew


Chroma subsampling
Chroma Subsampling resolution. Therefore one must decide how many times to sample the signal for display: each sample corresponds to one pixel

• Since humans see color with much less spatial resolution than they see black and white, it makes sense to “decimate” the chrominance signal.

• Interesting (but not necessarily informative!) names have arisen to label the different schemes used.

• To begin with, numbers are given stating how many pixel values, per four original pixels, are actually sent:

(a) The chroma subsampling scheme “4:4:4” indicates that no chroma subsampling is used: each pixel’s Y, Cb and Cr values are transmitted, 4 for each of Y, Cb, Cr.

Li & Drew


(b) The scheme “4:2:2” indicates horizontal subsampling of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averaging is used).

(c) The scheme “4:1:1” subsamples horizontally by a factor of 4.

(d) The scheme “4:2:0” subsamples in both the horizontal and vertical dimensions by a factor of 2. Theoretically, an average chroma pixel is positioned between the rows and columns as shown Fig.5.6.

• Scheme 4:2:0 along with other schemes is commonly used in JPEG and MPEG (see later chapters in Part 2).

Li & Drew


  • Fig. 5.6: Chroma subsampling of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Li & Drew


Hdtv high definition tv
HDTV (High Definition TV) of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

• The main thrust of HDTV (High Definition TV) is not to increase the “definition” in each unit area, but rather to increase the visual field especially in its width.

(a) The first generation of HDTV was based on an analog technology developed by Sony and NHK in Japan in the late 1970s.

(b) MUSE (MUltiple sub-Nyquist Sampling Encoding) was an improved NHK HDTV with hybrid analog/digital technologies that was put in use in the 1990s. It has 1,125 scan lines, interlaced (60 fields per second), and 16:9 aspect ratio.

(c) Since uncompressed HDTV will easily demand more than 20 MHz bandwidth, which will not fit in the current 6 MHz or 8 MHz channels, various compression techniques are being investigated.

(d) It is also anticipated that high quality HDTV signals will be transmitted using more than one channel even after compression.

Li & Drew


• A brief history of HDTV evolution: of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

(a) In 1987, the FCC decided that HDTV standards must be compatible with the existing NTSC standard and be confined to the existing VHF (Very High Frequency) and UHF (Ultra High Frequency) bands.

(b) In 1990, the FCC announced a very different initiative, i.e., its preference for a full-resolution HDTV, and it was decided that HDTV would be simultaneously broadcast with the existing NTSC TV and eventually replace it.

(c) Witnessing a boom of proposals for digital HDTV, the FCC made a key decision to go all-digital in 1993. A “grand alliance” was formed that included four main proposals, by General Instruments, MIT, Zenith, and AT&T, and by Thomson, Philips, Sarnoff and others.

(d) This eventually led to the formation of the ATSC (Advanced Television Systems Committee) — responsible for the standard for TV broadcasting of HDTV.

(e) In 1995 the U.S. FCC Advisory Committee on Advanced Television Service recommended that the ATSC Digital Television Standard be adopted.

Li & Drew


• The salient difference between conventional TV and HDTV: of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

(a) HDTV has a much wider aspect ratio of 16:9 instead of 4:3.

(b) HDTV moves toward progressive (non-interlaced) scan. The rationale is that interlacing introduces serrated edges to moving objects and flickers along horizontal edges.

Li & Drew


Computer vision video
Computer Vision & Video of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Many slides adapted from S. Seitz, R. Szeliski, M. Pollefeys, S. Lazebnik, K. Grauman


Video applications
Video Applications of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Background subtraction

    • A static camera is observing a scene

    • Goal: separate the static background from the moving foreground

How to come up with background frame estimate without access to “empty” scene?


Video applications1
Video Applications of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Shot boundary detection

    • Commercial video is usually composed of shots or sequences showing the same objects or scene

    • Goal: segment video into shots for summarization and browsing (each shot can be represented by a single keyframe in a user interface)

    • Difference from background subtraction: the camera is not necessarily stationary


Video applications2
Video Applications of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Shot boundary detection

    • For each frame

      • Compute the distance between the current frame and the previous one

        • Pixel-by-pixel differences

        • Differences of color histograms

        • Block comparison

      • If the distance is greater than some threshold, classify the frame as a shot boundary


Video applications3
Video Applications of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Background subtraction

  • Shot boundary detection

  • Motion segmentation

    • Segment the video into multiple coherently moving objects


Motion and perceptual organization
Motion and perceptual organization of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Sometimes, motion is the only cue


Motion and perceptual organization1
Motion and perceptual organization of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Sometimes, motion is foremost cue


Motion and perceptual organization2
Motion and perceptual organization of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Even “impoverished” motion data can evoke a strong percept

http://www.youtube.com/watch?v=FAAyB5jGx_4


Uses of motion
Uses of motion of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Estimating 3D structure

  • Segmenting objects based on motion cues

  • Learning dynamical models

  • Recognizing events and activities

  • Improving video quality (motion stabilization)


Motion field
Motion field of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • The motion field is the projection of the 3D scene motion into the image


Motion field camera motion
Motion field + camera motion of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Length of flow vectors inversely proportional to depth Z of 3d point

points closer to the camera move more quickly across the image plane

Figure from Michael Black, Ph.D. Thesis


Motion field camera motion1
Motion field + camera motion of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Zoom out

Zoom in

Pan right to left


Motion estimation techniques
Motion estimation techniques of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Feature-based methods

    • Extract visual features (corners, textured areas) and track them over multiple frames

    • Sparse motion fields, but more robust tracking

    • Suitable when image motion is large (10s of pixels)

  • Direct methods

    • Directly recover image motion at each pixel from spatio-temporal image brightness variations

    • Dense motion fields, but sensitive to appearance variations

    • Suitable for video and when image motion is small


Feature based matching for motion
Feature-based matching for motion of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Best matching neighborhood

Interesting point

Time t

Time t+1


A camera mouse
A Camera Mouse of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Video interface: use feature tracking as mouse replacement

  • User clicks on the feature to be tracked

  • Take the 15x15 pixel square of the feature

  • In the next image do a search to find the 15x15 region with the highest correlation

  • Move the mouse pointer accordingly

  • Repeat in the background every 1/30th of a second

James Gips and Margrit Betke

http://www.bc.edu/schools/csom/eagleeyes/


A camera mouse1
A Camera Mouse of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Specialized software for communication, games, making music

James Gips and Margrit Betke

http://www.bc.edu/schools/csom/eagleeyes/meta-elements/wmv/EEMovie02.wmv


A camera mouse2
A Camera Mouse of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Specialized software for communication, games

James Gips and Margrit Betke

http://www.bc.edu/schools/csom/eagleeyes/


Eagle eyes
Eagle Eyes of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Specialized software for communication, games, making music

http://www.bc.edu/schools/csom/eagleeyes/meta-elements/wmv/EEMovie02.wmv


What are good features to track
What are good features to track? of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Corners (Harris corner detector)

  • Can measure quality of features from just a single image

  • Automatically select candidate “templates”


Reminder homework 3a
Reminder: Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

On course webpage. Due Tues, March 31. Start now and come ask me questions asap. My solution is < 20 lines of code, but hardest part will be conceptually figuring out how to do it.

Homework 3B will build on the implementation of 3A. So, I will release a solution to 3A on Thurs, April 2 and no late homeworks will be accepted after that.

Overview: Implement image warping and composition in matlab.


Homework 3a
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source


Homework 3a1
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

User clicks on 4 points in target, 4 points in source.


Homework 3a2
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

User clicks on 4 points in target, 4 points in source.

Find best affine transformation between source subimg and target subimg.


Homework 3a3
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

User clicks on 4 points in target, 4 points in source.

Find best affine transformation between source subimg and target subimg.

Transform the source subimg accordingly.


Homework 3a4
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

User clicks on 4 points in target, 4 points in source.

Find best affine transformation between source subimg and target subimg.

Transform the source subimg accordingly.

Insert the transformed source subimg into the target image.


Homework 3a5
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

User clicks on 4 points in target, 4 points in source.

Find best affine transformation between source subimg and target subimg.

Transform the source subimg accordingly.

Insert the transformed source subimg into the target image.

Useful matlab functions: ginput, cp2transform, imtransform. Look these up!


Homework 3a6
Homework 3A of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

+

Target

Source

=

Result


Homework 3b out on tuesday
Homework 3B – out on Tuesday of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Extend Homework 3A to work on video.

  • Same insertion technique, but automated tracking.

  • In first frame, click on corners (colored guide dots will be provided).

  • Use correlation based tracking to track the dots over successive frames.

  • Hand correct if tracking starts to drift.

  • Insert source subimgs into each


Motion estimation techniques1
Motion estimation techniques of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Feature-based methods

    • Extract visual features (corners, textured areas) and track them over multiple frames

    • Sparse motion fields, but more robust tracking

    • Suitable when image motion is large (10s of pixels)

  • Direct methods

    • Directly recover image motion at each pixel from spatio-temporal image brightness variations

    • Dense motion fields, but sensitive to appearance variations

    • Suitable for video and when image motion is small


Optical flow
Optical flow of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Definition: optical flow is the apparent motion of brightness patterns in the image

  • Ideally, optical flow would be the same as the motion field

  • Have to be careful: apparent motion can be caused by lighting changes without any actual motion


Apparent motion motion field
Apparent motion ~= motion field of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Figure from Horn book


Estimating optical flow
Estimating optical flow of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Given two subsequent frames, estimate the apparent motion field between them.

I(x,y,t–1)

I(x,y,t)

  • Key assumptions

    • Brightness constancy: projection of the same point looks the same in every frame

    • Small motion: points do not move very far

    • Spatial coherence: points move like their neighbors


The aperture problem
The aperture problem of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Perceived motion


The aperture problem1
The aperture problem of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Actual motion


The barber pole illusion
The barber pole illusion of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

http://en.wikipedia.org/wiki/Barberpole_illusion


Solving the aperture problem grayscale image
Solving the aperture problem (grayscale image) of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • How to get more equations for a pixel?

  • Spatial coherence constraint: pretend the pixel’s neighbors have the same (u,v)

    • If we use a 5x5 window, that gives us 25 equations per pixel


Low texture region
Low-texture region of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • gradients have small magnitude

  • small l1, small l2


High texture region
High-texture region of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • gradients are different, large magnitudes

  • large l1, large l2


Edge of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • gradients very large or very small

  • large l1, small l2


Recognizing action at a distance

Recognizing Action at a Distance of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

A.A. Efros, A.C. Berg, G. Mori, J. Malik

UC Berkeley


Looking at people

3-pixel man of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Blob tracking

vast surveillance literature

300-pixel man

Limb tracking

e.g. Yacoob & Black, Rao & Shah, etc.

Looking at People

Near field

Far field


Medium field recognition

The 30-Pixel Man of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Medium-field Recognition


Appearance vs motion

Jackson Pollock of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Number 21 (detail)

Appearance vs. Motion


Goals
Goals of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Recognize human actions at a distance

    • Low resolution, noisy data

    • Moving camera, occlusions

    • Wide range of actions (including non-periodic)


Our approach
Our Approach of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Motion-based approach

    • Non-parametric; use large amount of data

    • Classify a novel motion by finding the most similar motion from the training set

  • Related Work

    • Periodicity analysis

      • Polana & Nelson; Seitz & Dyer; Bobick et al; Cutler & Davis; Collins et al.

    • Model-free

      • Temporal Templates [Bobick & Davis]

      • Orientation histograms [Freeman et al; Zelnik & Irani]

      • Using MoCap data [Zhao & Nevatia, Ramanan & Forsyth]


Gathering action data
Gathering action data of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Tracking

    • Simple correlation-based tracker

    • User-initialized


Figure centric representation
Figure-centric Representation of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Stabilized spatio-temporal volume

    • No translation information

    • All motion caused by person’s limbs

      • Good news: indifferent to camera motion

      • Bad news: hard!

  • Good test to see if actions, not just translation, are being captured


Remembrance of things past

run of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

jog

swing

walk right

walk left

motion analysis

database

Remembrance of Things Past

  • “Explain” novel motion sequence by matching to previously seen video clips

    • For each frame, match based on some temporal extent

input sequence

Challenge: how to compare motions?


How to describe motion
How to describe motion? of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • Appearance

    • Not preserved across different clothing

  • Gradients (spatial, temporal)

    • same (e.g. contrast reversal)

  • Edges/Silhouettes

    • Too unreliable

  • Optical flow

    • Explicitly encodes motion

    • Least affected by appearance

    • …but too noisy


Spatial motion descriptor

blurred of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Spatial Motion Descriptor

Image frame

Optical flow


Spatio temporal motion descriptor

Temporal extent of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averagE

E

A

A

E

I matrix

E

B

B

E

frame-to-frame

similarity matrix

motion-to-motion

similarity matrix

blurry I

Spatio-temporal Motion Descriptor

Sequence A

S

Sequence B

t


Football actions matching
Football Actions: matching of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

Input

Sequence

Matched

Frames

input

matched


Football actions classification
Football Actions: classification of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

10 actions; 4500 total frames; 13-frame motion descriptor


Classifying ballet actions
Classifying Ballet Actions of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

16 Actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.


Classifying tennis actions
Classifying Tennis Actions of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

6 actions; 4600 frames; 7-frame motion descriptor

Woman player used as training, man as testing.


Querying the database
Querying the Database of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

run

jog

swing

walk right

walk left

Action Recognition:

run

walk left

swing

walk right

jog

Joint Positions:

input sequence

database


2d skeleton transfer
2D Skeleton Transfer of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

We annotate database with 2D joint positions

After matching, transfer data to novel sequence

Ajust the match for best fit

Input sequence:

Transferred 2D skeletons:


Actor replacement
Actor Replacement of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag


Conclusions
Conclusions of the Cb, Cr signals by a factor of 2. That is, of four pixels horizontally labelled as 0 to 3, all four Ys are sent, and every two Cb’s and two Cr’s are sent, as (Cb0, Y0)(Cr0, Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averag

  • In medium field action is about motion

  • What we propose:

    • A way of matching motions at coarse scale

  • What we get out:

    • Action recognition

    • Skeleton transfer

    • Synthesis: “Do as I Do” & “Do as I say”

  • What we learned?

    • A lot to be said for the “little guy”!


ad