Extracting features from spatio-temporal volumes (STVs) for activity recognition
Dheeraj Singaraju
Reading group: 06/29/06
Motivation for dealing with STVs
• Optical-flow-based methods capture only first-order motion.
• Methods that use HMMs deal with single-point trajectories, which carry motion information but no spatial information.
• We aim at a direct scheme for event detection and classification that does not require feature tracking, segmentation or computation of optical flow.
• We want to detect points in the space-time volume that have significant local variation in both space and time.
Approaches that we shall discuss
• On Space-Time Interest Points; Ivan Laptev
  • Local image features, e.g. corners, provide compact and abstract representations of images.
  • Extend the concept of a spatial corner detector to a spatio-temporal corner detector.
• Actions as Objects: A Novel Action Representation; Alper Yilmaz and Mubarak Shah
  • Concepts of differential geometry: extract features from the STV based on local variations in the curvatures of points on the volume.
  • The curvatures are invariant to rotation and translation.
Detecting interest points in space
• An image can be modeled by its linear scale-space representation as follows:
• To look for interest points, one analyzes the matrix of second moments:
• A more familiar form of the matrix:
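The equations for this slide are not reproduced in the text. A reconstruction in the usual Harris-style notation might read as follows; the symbol names (σ_l² for the local scale, σ_i² for the integration scale) are assumptions, not taken from the slide:

```latex
L^{sp}(x, y; \sigma_l^2) = g^{sp}(x, y; \sigma_l^2) * f^{sp}(x, y),
\qquad
g^{sp}(x, y; \sigma_l^2) = \frac{1}{2\pi\sigma_l^2}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma_l^2}\right)

\mu^{sp}(\cdot; \sigma_l^2, \sigma_i^2)
  = g^{sp}(\cdot; \sigma_i^2) *
    \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}
```

Here L_x and L_y are the first-order derivatives of L^{sp}, and the Gaussian g^{sp}(·; σ_i²) plays the role of the summation window in the familiar Harris matrix.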
Detecting interest points in space (contd.)
• We want to choose corners in the image, since they have significant spatial variation.
• We therefore detect positive maxima of the following function:
• How do we detect interest points in space-time?
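For the spatial case, the corner function is presumably the Harris measure H = det(μ) − k·trace²(μ). The sketch below computes it with NumPy only; the function names, the scale values and k = 0.04 are illustrative assumptions rather than values from the slides.

```python
import numpy as np

def gaussian_smooth(arr, var, axis):
    """Smooth along one axis with a sampled Gaussian of variance `var`."""
    radius = int(np.ceil(3 * np.sqrt(var)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2.0 * var))
    kernel /= kernel.sum()
    pad = [(0, 0)] * arr.ndim
    pad[axis] = (radius, radius)
    padded = np.pad(arr, pad, mode="edge")  # replicate border values
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def smooth2d(img, var):
    """Separable 2-D Gaussian smoothing."""
    return gaussian_smooth(gaussian_smooth(img, var, 0), var, 1)

def harris_response(img, var_local=1.0, var_integration=2.0, k=0.04):
    """Harris corner function det(mu) - k * trace(mu)^2 (k is an assumed value)."""
    L = smooth2d(np.asarray(img, dtype=float), var_local)
    Ly, Lx = np.gradient(L)  # derivatives along rows (y) and columns (x)
    mxx = smooth2d(Lx * Lx, var_integration)
    mxy = smooth2d(Lx * Ly, var_integration)
    myy = smooth2d(Ly * Ly, var_integration)
    return (mxx * myy - mxy**2) - k * (mxx + myy)**2
```

On an image of a bright square, the response is positive near the four corners and negative along the straight edges, which is why only positive maxima are kept.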
Results of detecting interest points in space
• Detecting interest points in space alone also yields interest points in the stationary background.
• We want to find interest points that carry information in the spatial as well as the temporal domain.
Detecting interest points in space-time
• A spatio-temporal image sequence can be modeled by its linear scale-space representation as follows:
• Note that the spatial and the temporal domains have different scale parameters, i.e. σ_l² and τ_l² respectively.
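The slide's equation is missing from the text. A form consistent with a separable space-time Gaussian, with the symbols σ_l² and τ_l² assumed to denote the spatial and temporal variances, would be:

```latex
L(\cdot; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) * f(\cdot),
\qquad
g(x, y, t; \sigma_l^2, \tau_l^2)
  = \frac{1}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}}
    \exp\!\left(-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right)
```

The separate temporal variance is needed because the temporal and spatial extents of an event are in general independent.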
Detecting interest points in space-time (contd.)
• To look for interest points, one analyzes the matrix of second moments:
• We therefore look for the maxima of the following spatio-temporal corner function:
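As a rough illustration, the spatio-temporal second-moment matrix is the 3×3 analogue of the spatial one (built from L_x, L_y, L_t), and the corner function is presumably H = det(μ) − k·trace³(μ). The NumPy sketch below assumes a volume indexed as `vol[t, y, x]`, integration scales equal to `s` times the local scales, and k = 0.005; all names and parameter values are assumptions.

```python
import numpy as np

def gaussian_smooth(arr, var, axis):
    """Smooth along one axis with a sampled Gaussian of variance `var`."""
    radius = int(np.ceil(3 * np.sqrt(var)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2.0 * var))
    kernel /= kernel.sum()
    pad = [(0, 0)] * arr.ndim
    pad[axis] = (radius, radius)
    padded = np.pad(arr, pad, mode="edge")  # replicate border values
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)

def smooth3d(vol, var_space, var_time):
    """Separable smoothing: axis 0 is time, axes 1 and 2 are y and x."""
    out = gaussian_smooth(vol, var_time, 0)
    out = gaussian_smooth(out, var_space, 1)
    return gaussian_smooth(out, var_space, 2)

def st_harris_response(vol, var_space=1.0, var_time=1.0, s=2.0, k=0.005):
    """Space-time corner function det(mu) - k * trace(mu)^3."""
    L = smooth3d(np.asarray(vol, dtype=float), var_space, var_time)
    Lt, Ly, Lx = np.gradient(L)
    avg = lambda a: smooth3d(a, s * var_space, s * var_time)
    mxx, myy, mtt = avg(Lx * Lx), avg(Ly * Ly), avg(Lt * Lt)
    mxy, mxt, myt = avg(Lx * Ly), avg(Lx * Lt), avg(Ly * Lt)
    # Determinant of the symmetric 3x3 matrix, expanded along the first row.
    det = (mxx * (myy * mtt - myt**2)
           - mxy * (mxy * mtt - myt * mxt)
           + mxt * (mxy * myt - myy * mxt))
    return det - k * (mxx + myy + mtt)**3
```

A static square gives no positive response because L_t vanishes and the determinant is zero, whereas a square that appears abruptly produces positive maxima at its corners around the onset time.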
Results of detecting interest points in the STV
• Consider a synthetic sequence of a ball moving towards a wall and colliding with it.
• An interest point is detected at the collision point.
Results of detecting interest points in the STV
• Consider a synthetic sequence of two balls moving towards each other.
• Different interest points are detected at different spatial and temporal scales.
(Figure label: coarser scale)
Effects of scales on interest point detection
• Long temporal events are detected for large values of τ_l², while short events are detected for small values of τ_l².
• Large spatial events are detected for large values of σ_l², while small events are detected for small values of σ_l².
Scale selection in space-time
• We consider a prototype event modeled by a spatio-temporal Gaussian blob.
• The scale-space representation of f is hence given by:
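The missing expressions can plausibly be written as follows, assuming the blob has spatial variance σ₀² and temporal variance τ₀² and that g denotes the separable space-time Gaussian (the symbol names are assumptions):

```latex
f(x, y, t) = g(x, y, t; \sigma_0^2, \tau_0^2)

L(\cdot; \sigma^2, \tau^2) = g(\cdot; \sigma^2, \tau^2) * f(\cdot)
  = g(\cdot; \sigma_0^2 + \sigma^2, \tau_0^2 + \tau^2)
```

The second equality uses the fact that convolving two Gaussians adds their variances.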
Scale selection in space-time (contd.)
• We want to find a differential operator that assumes simultaneous extrema over the spatial and temporal scales that are characteristic of this Gaussian prototype event.
• To recover the spatio-temporal extent of f, we consider second-order derivatives of L normalized by the scales as:
• Requiring that these normalized second-order derivatives assume their extrema at the scales σ² = σ₀² and τ² = τ₀² gives a = 1, b = ¼, c = ½ and d = ¾.
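The exponents can be checked numerically. The sketch below assumes the normalization L_xx,norm = σ^(2a)·τ^(2b)·L_xx and L_tt,norm = σ^(2c)·τ^(2d)·L_tt, and evaluates both at the centre of a Gaussian blob with variances σ₀² and τ₀², where the derivatives have a closed form. Function and parameter names are hypothetical.

```python
import numpy as np

def normalized_second_derivatives(sigma_sq, tau_sq, sigma0_sq, tau0_sq,
                                  a=1.0, b=0.25, c=0.5, d=0.75):
    """Scale-normalized L_xx and L_tt at the centre of a Gaussian blob.

    The blob has variances (sigma0_sq, tau0_sq); smoothing with variances
    (sigma_sq, tau_sq) gives a Gaussian with the summed variances.
    """
    s2 = sigma0_sq + sigma_sq
    t2 = tau0_sq + tau_sq
    g0 = (2.0 * np.pi) ** -1.5 / (s2 * np.sqrt(t2))  # Gaussian value at origin
    Lxx = -g0 / s2                                   # d^2/dx^2 at the origin
    Ltt = -g0 / t2                                   # d^2/dt^2 at the origin
    Lxx_norm = sigma_sq**a * tau_sq**b * Lxx         # sigma^(2a) * tau^(2b) * L_xx
    Ltt_norm = sigma_sq**c * tau_sq**d * Ltt         # sigma^(2c) * tau^(2d) * L_tt
    return Lxx_norm, Ltt_norm

def scale_of_extremum(values, sigma_sq_grid, tau_sq_grid):
    """Return the (sigma^2, tau^2) grid point where |values| is largest."""
    i, j = np.unravel_index(np.argmax(np.abs(values)), values.shape)
    return float(sigma_sq_grid[i, j]), float(tau_sq_grid[i, j])
```

On a grid of scales that contains (σ₀², τ₀²), both normalized derivatives peak in magnitude exactly at that grid point when a = 1, b = ¼, c = ½, d = ¾.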
Scale selection in space-time (contd.)
• We therefore define a normalized spatio-temporal Laplace operator as follows:
• The following plots show that the zero crossings correspond to the maxima that are detected at σ² = σ₀² and τ² = τ₀².
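With the exponents a = 1, b = ¼, c = ½ and d = ¾, and assuming the normalization multiplies the spatial derivatives by σ^(2a)τ^(2b) and the temporal derivative by σ^(2c)τ^(2d), the operator would take the form:

```latex
\nabla^2_{norm} L = L_{xx,norm} + L_{yy,norm} + L_{tt,norm}
  = \sigma^{2} \tau^{1/2} \left( L_{xx} + L_{yy} \right)
  + \sigma \, \tau^{3/2} L_{tt}
```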
Scale-adapted space-time interest points
• So far we have found events that are local extrema in the space-time volume for a particular choice of spatial and temporal scales.
• We would like to detect interest points that are extrema over the space-time volume as well as extrema of the scale-normalized Laplace operator over scales.
• The reason for doing so is that different events in general have different spatial and temporal extents.
Results on a previously used synthetic example
• Note that all the extrema are detected irrespective of their spatial and temporal extents.
• DOUBT: Why are these points not detected as interest points?
Results of the algorithm on real sequences
• Note that events of all spatial and temporal extents are captured.
• The size of the circle shows the spatial extent of the event.
Results of interest point detection
• Note that the regularity and extent of the spatio-temporal interest points are representative of the true events in time.
Classification of events
• Every interest point is described by its local spatio-temporal neighborhood, and we compare the neighborhoods of events to classify them.
• The neighborhood of an interest point is described by evaluating the following event descriptors:
• This normalization guarantees the invariance of the derivative response to image scaling.
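The descriptor formula is not reproduced in the text. One plausible reading, assuming the descriptors are vectors of scale-normalized Gaussian derivatives up to fourth order evaluated at the interest point and at its selected scales σ and τ, is:

```latex
j = \left( L_x, L_y, L_t, L_{xx}, \ldots, L_{tttt} \right),
\qquad
L_{x^m y^n t^k} = \sigma^{m+n} \tau^{k}
  \left( \partial_{x^m y^n t^k} g \right) * f
```

The factor σ^(m+n)τ^k is the normalization that the last bullet refers to: it compensates for the change in derivative magnitude when the pattern is rescaled in space or time.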
Classification of events (contd.)
• To compare two events, we compute the Mahalanobis distance between their descriptors.
• To detect similar events in the given data, we apply k-means clustering to the event descriptors and thus detect groups of interest points with similar spatio-temporal neighborhoods.
• Once the cluster centers have been estimated from the training data, we evaluate the distance of each new event from the cluster centers. If the distance from every center is above a threshold, we declare it a background event.
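The three steps above can be sketched as follows, taking the squared Mahalanobis distance as d²(j₁, j₂) = (j₁ − j₂) Σ⁻¹ (j₁ − j₂)ᵀ. The descriptor covariance Σ is treated as given; the plain Lloyd-style k-means, the function names and the use of −1 as the background label are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_sq(j1, j2, cov_inv):
    """Squared Mahalanobis distance between two descriptors."""
    d = np.asarray(j1, dtype=float) - np.asarray(j2, dtype=float)
    return float(d @ cov_inv @ d)

def kmeans(X, k, cov_inv, n_iter=20, seed=0):
    """Plain k-means using the Mahalanobis metric for assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy of k samples
    for _ in range(n_iter):
        diff = X[:, None, :] - centers[None, :, :]
        d2 = np.einsum("nkd,de,nke->nk", diff, cov_inv, diff)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels

def classify(j, centers, cov_inv, threshold):
    """Index of the nearest center, or -1 (background) if all are too far."""
    d2 = [mahalanobis_sq(j, c, cov_inv) for c in centers]
    best = int(np.argmin(d2))
    return best if d2[best] <= threshold else -1
```

The threshold is applied to the squared distance here, so it has to be chosen on that scale.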
Recognizing gaits
• We extract the following features from the spatio-temporal volume:
  • Positions of the interest points
  • The corresponding scales
  • The classes of the interest points
• We introduce a state for the model, determined by a vector whose variables are:
  • Position of the person in the image
  • His/her size
  • Frequency of the gait
  • Phase of the gait at the current moment
  • Temporal variations of the position and size
Recognizing gaits (contd.)
• We then have the following model for walking:
• Such a model helps handle translations as well as uniform rescaling in the image and the temporal domain.
Recognizing gaits (contd.)
• Given a model state X, a current time, a time-window length, and a set of data features detected in the most recent time window, the match between the model and the data is defined by a weighted sum of distances h between the model features and the data features.
• Each model feature is paired with the data feature that minimizes the distance h to it, and the weights come from an exponential function with a given variance.
Recognizing gaits (contd.)
• To find the best match between the model and the data, we search for the model state that minimizes this match measure.
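A much simplified sketch of this matching step is given below. It assumes features are (x, y, t) triples, that h is the squared Euclidean distance, that the weight of a model feature decays as exp(−(t − t₀)²/variance) with its distance from the current time t₀, and that the state reduces to an image translation searched over a finite candidate list. All of these are simplifying assumptions and not the model from the slides.

```python
import numpy as np

def match_score(model_feats, data_feats, t0, var):
    """Weighted sum over model features of the distance to the closest data feature."""
    diff = model_feats[:, None, :] - data_feats[None, :, :]
    h = (diff**2).sum(axis=2)            # h for every (model, data) pair
    h_min = h.min(axis=1)                # best data feature per model feature
    weights = np.exp(-(model_feats[:, 2] - t0)**2 / var)
    return float((weights * h_min).sum())

def place_model(template, state):
    """Hypothetical state: an image translation (dx, dy) of the template."""
    dx, dy = state
    return template + np.array([dx, dy, 0.0])

def best_state(template, data_feats, candidates, t0, var):
    """Exhaustive search for the candidate state with the smallest score."""
    scores = [match_score(place_model(template, s), data_feats, t0, var)
              for s in candidates]
    return candidates[int(np.argmin(scores))]
```

Because every model feature only looks for its nearest data feature, extra data features from clutter do not increase the score.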
Summary of the approach
• An interest point detector is developed that finds local image features that show high variation of the image values in space and in time.
• The spatio-temporal extents of detected events can be estimated by using a normalized Laplacian operator.
• The neighborhoods of the events are described using scale-invariant spatio-temporal descriptors.
• Different actions are then compared by checking for matches between the event descriptors.
Actions as objects: Action sketches
• This method analyzes the spatio-temporal volume using differential geometric surface properties such as peaks, pits, valleys and ridges.
• The authors claim that these are important action descriptors, as they capture both spatial and temporal properties.
• These descriptors are related to the convex and concave parts of the object contours and/or to the maxima in the spatio-temporal curvature of a trajectory, and are hence view invariant.
STV: a collection of contours
• In this approach the spatio-temporal volume is a hollow solid object whose boundary is defined by the contours of a person in every image frame.
• It is assumed that the STV can be considered a manifold, which lets us treat small neighborhoods around a point as nearly flat.
• Since the STV is the time evolution of a contour, we can define a 2D parametric representation using the arc length s of the contour and time t.
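One way to picture the (s, t) parameterization is to resample every frame's closed contour at equal arc-length steps and stack the results. The sketch below does this with NumPy; it assumes each contour is an ordered list of (x, y) points whose starting point corresponds across frames, and it normalizes time to [0, 1]. The function names are hypothetical.

```python
import numpy as np

def resample_closed_contour(points, n_samples):
    """Resample a closed polygonal contour at equal arc-length steps."""
    pts = np.asarray(points, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # repeat the first point
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])      # cumulative arc length
    s = np.linspace(0.0, arc[-1], n_samples, endpoint=False)
    x = np.interp(s, arc, closed[:, 0])
    y = np.interp(s, arc, closed[:, 1])
    return np.stack([x, y], axis=1)

def build_stv(contours, n_samples=64):
    """Stack per-frame contours into a surface of shape (T, n_samples, 3).

    Entry [i, j] holds (x, y, t) for frame i and arc-length sample j,
    with t normalized to the range [0, 1].
    """
    xy = np.stack([resample_closed_contour(c, n_samples) for c in contours])
    n_frames = xy.shape[0]
    t = np.linspace(0.0, 1.0, n_frames)[:, None, None]
    t = np.broadcast_to(t, (n_frames, n_samples, 1))
    return np.concatenate([xy, t], axis=2)
```

Fixing the second index and varying the first traces one contour point through time; fixing the first index gives the contour in a single frame.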
STV: a collection of contours (contd.)
(Figure panels: t varying, s fixed; s varying, t fixed)
• The STV is a continuous representation in the normalized time scale, and it does not require any time warping for matching two sequences of different lengths.
Action descriptors
• We want to compute action descriptors that correspond to changes in direction, speed and shape of parts of the contour.
• Changes in these quantities are reflected on the surface of the STV and can be computed using differential geometry by identifying different landmarks.
• These landmarks can be classified on the basis of the local curvatures at points on the STV.
Action descriptors (contd.)
• Differential geometry gives us the concepts of Gaussian curvature K and mean curvature H, which can be evaluated at points on the manifold of the STV. These curvatures are invariant to transformations such as translation and rotation.
• Local extrema of these curvatures can therefore be used to identify interest points for describing actions.
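For reference, the textbook expressions for the two curvatures of a parametric surface B(s, t) in terms of the first and second fundamental forms are given below. The coefficient names E, F, G, L, M, N are the standard ones and are not taken from the slides.

```latex
E = B_s \cdot B_s, \quad F = B_s \cdot B_t, \quad G = B_t \cdot B_t,
\qquad
n = \frac{B_s \times B_t}{\lVert B_s \times B_t \rVert}

L = B_{ss} \cdot n, \quad M = B_{st} \cdot n, \quad N = B_{tt} \cdot n

K = \frac{LN - M^2}{EG - F^2},
\qquad
H = \frac{EN + GL - 2FM}{2\,(EG - F^2)}
```

K is the product and H the mean of the two principal curvatures, so neither depends on where the surface sits or how it is rotated; the sign of H depends on the chosen direction of the normal n.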
Action descriptors (contd.)
• The following table shows the different surface types and their associated curvatures:
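The table itself is missing from the text. The sketch below uses the classification by the signs of K and H that is common in the range-image literature and is assumed here to match the slide. It takes the fundamental-form coefficients E, F, G (first form) and L, M, N (second form) as inputs and uses the sign convention in which a peak has H < 0; the names and the tolerance are illustrative.

```python
def curvatures(E, F, G, L, M, N):
    """Gaussian curvature K and mean curvature H from the fundamental forms."""
    denom = E * G - F**2
    K = (L * N - M**2) / denom
    H = (E * N + G * L - 2 * F * M) / (2 * denom)
    return K, H

def surface_type(K, H, eps=1e-6):
    """Classify a surface point by the signs of K and H (assumed table)."""
    k = 0 if abs(K) < eps else (1 if K > 0 else -1)
    h = 0 if abs(H) < eps else (1 if H > 0 else -1)
    table = {
        (1, -1): "peak", (0, -1): "ridge", (-1, -1): "saddle ridge",
        (0, 0): "flat", (-1, 0): "minimal",
        (1, 1): "pit", (0, 1): "valley", (-1, 1): "saddle valley",
    }
    # K > 0 with H = 0 cannot occur, since H^2 >= K always holds.
    return table.get((k, h), "undefined")
```

For a height field z = f(x, y) at a point with zero slope, E = G = 1, F = 0 and L, M, N are simply f_xx, f_xy, f_yy, so f = −x² − y² gives `curvatures(1, 0, 1, -2, 0, -2)`, which classifies as a peak.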
Analysis of action descriptors
• We consider three types of contours: concave contours, convex contours and straight contours.
• These contours generate typical landmarks in the spatio-temporal volume:
  • Straight contour: ridge, valley or flat surface
  • Convex contour: peak, ridge or saddle ridge
  • Concave contour: pit, valley or saddle valley
(Figure caption: shapes generated from straight contours)
STVs corresponding to hand motion
• The STV generated by a hand that stays still. Such a motion (or lack of it) creates a ridge.
STVs corresponding to hand motion
• The STV created by a hand that first moves downwards and then upwards. Note that a saddle ridge is created at the point where the motion changes.
Properties of the event descriptors
• The landmarks discussed so far are essentially produced by stable motion or by a change in stable motion. The stability of the motion ensures that the STV is smooth enough for local planar neighborhoods at its points to be valid.
• Some of the landmarks are related to the curvature of the point trajectories and body contours as follows:
View invariance of event descriptors
• Since the landmarks are associated with extrema of local curvatures, the transformed landmarks remain extrema in the new STV even when the view changes.
• DOUBT: Not very confident about the derivation of the above.
• Due to this view invariance, comparing two STVs is equivalent to checking whether there is a valid fundamental matrix relating the sets of event descriptors in the two given action volumes.
(Figure caption: derived formula relating curvatures of corresponding points in two different views)
Comparing two actions
• We check whether a linear system of the following kind is satisfied by the event descriptors in both actions:
• This boils down to checking whether the last (smallest) singular value of A is 0.
• From the set of possible matches between the input action sketch and the known action sketches, we select the action with the minimum matching score.
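The linear system is not reproduced in the text. The sketch below assumes it has the standard epipolar form, where each correspondence (x, y) ↔ (x′, y′) contributes one row of A so that A·f = 0 for the nine stacked entries f of the fundamental matrix, and it takes the smallest singular value of A as the matching score. The function names and this choice of score are assumptions.

```python
import numpy as np

def epipolar_matrix(pts1, pts2):
    """Rows of the system A f = 0 from the constraint x2^T F x1 = 0."""
    x, y = pts1[:, 0], pts1[:, 1]
    xp, yp = pts2[:, 0], pts2[:, 1]
    return np.stack(
        [xp * x, xp * y, xp, yp * x, yp * y, yp, x, y, np.ones_like(x)],
        axis=1)

def matching_score(pts1, pts2):
    """Smallest singular value of A; close to 0 for a consistent pair of views."""
    singular_values = np.linalg.svd(epipolar_matrix(pts1, pts2), compute_uv=False)
    return float(singular_values[-1])

def project(points3d, R, t):
    """Pinhole projection (normalized coordinates) used to build test data."""
    cam = points3d @ R.T + t
    return cam[:, :2] / cam[:, 2:3]
```

With noise-free correspondences from two views of the same points, the score is zero up to rounding; if the correspondences are shuffled, no single fundamental matrix explains them and the score becomes clearly nonzero. With real, noisy landmarks the score would be compared across candidate actions rather than tested against exactly 0.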
Summary of the approach
• Using concepts of differential geometry, extract interest points (action sketches) that carry local spatio-temporal information by virtue of being local extrema of curvatures in space-time.
• These event descriptors are associated with uniform motion or stable changes in uniform motion.
• Since the action sketches are view invariant, comparing two actions is equivalent to checking whether there is a valid fundamental matrix relating the positions of the action sketches of the individual actions.