Modeling Facial Shape and Appearance
Outline: Modeling Shape and Changes in the Texture. Parametric Face Modeling and Tracking. Illumination Modeling.
Shape and Changes in the Texture
Parametric Face Modeling and Tracking
To interpret images of faces, it is important to have a model of how the face can appear. Changes can be broken down into two parts: changes in shape and changes in texture (patterns of pixel values) across the face.
The lecture describes a powerful method of generating compact models of shape and texture variation and shows how such models can be used to interpret images of faces.
Statistical Shape Analysis
To build models of facial appearance and its variation one can adopt a statistical approach, learning the ways in which the shape and texture of the face vary across a range of images.
The method relies on obtaining a suitably large, representative training set of facial images, each of which is annotated with a set of feature points defining correspondences across the set.
The positions of the feature points are used to define the shape of the face and are analyzed to learn the ways in which the shape can vary.
The patterns of intensities are then analyzed to learn the ways in which the texture can vary.
Building a statistical model requires a set of training images. The set should be chosen so that it covers the types of variation one wishes the model to represent.
For instance, if we are interested only in faces with neutral expressions, we should include only neutral expressions in the model.
If however, we wish to be able to synthesize and recognize a range of expressions, the training set should include images of people smiling, frowning, winking and so on.
Each face in the training set must be annotated with a set of points defining the key facial features. These points are used to define the correspondences across the training set and to represent the shape of the face in the image. Thus the same number of points should be placed on each image, with the same set of labels.
The number of such points can vary from a few to a few thousand, and they can be 2D or 3D points.
Example of 68 points defining facial features.
There is considerable literature on methods of aligning shapes into a common coordinate frame, the most popular approach being Procrustes analysis. It transforms each shape in a set, xi, so that the sum of squared distances of the shapes to the mean, D = Σ |xi − x̄|², is minimized. The problem is poorly defined unless constraints are placed on the alignment of the mean (for instance, ensuring that it is centered on the origin and has unit scale and some fixed but arbitrary orientation).
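A minimal sketch of this iterative alignment in Python (numpy only); the function names and the iteration count are illustrative choices, not from the lecture:

```python
import numpy as np

def align_to(shape, target):
    """Similarity-align one (n, 2) shape to a target shape (least squares)."""
    a = shape - shape.mean(axis=0)
    b = target - target.mean(axis=0)
    U, sing, Vt = np.linalg.svd(a.T @ b)     # SVD of the cross-covariance
    R = U @ Vt                               # optimal rotation
    s = sing.sum() / (a ** 2).sum()          # optimal scale
    return s * a @ R + target.mean(axis=0)

def procrustes_align(shapes, n_iter=10):
    """Iteratively align all shapes to their evolving mean. The mean is
    re-normalized each round (centered, unit scale) so the problem stays
    well defined, as noted above."""
    aligned = list(shapes)
    mean = aligned[0]
    for _ in range(n_iter):
        mean = np.mean(aligned, axis=0)
        mean -= mean.mean(axis=0)            # center the mean on the origin
        mean /= np.linalg.norm(mean)         # fix the scale of the mean
        aligned = [align_to(x, mean) for x in aligned]
    return aligned, mean
```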
Suppose we have s sets of n points xi in d dimensions (usually two or three) that are aligned into a common coordinate frame.
These vectors form a distribution in nd-dimensional space. If we can model this distribution, we can generate new examples similar to those in the original training set, and we can examine new shapes to determine whether they are plausible examples.
The approach is as follows: apply principal component analysis (PCA) to the aligned shape vectors, so that any shape x can be approximated as x ≈ x̄ + P b, where x̄ is the mean shape, the columns of P are the leading eigenvectors (modes) of the covariance matrix, and b is a vector of shape parameters.
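As a concrete illustration, here is a minimal sketch of such a point distribution model in Python; the function names and the 98% variance-retained threshold are illustrative choices, not from the lecture:

```python
import numpy as np

def build_shape_model(shapes, var_kept=0.98):
    """shapes: (s, 2n) array, one flattened aligned shape per row."""
    mean = shapes.mean(axis=0)
    X = shapes - mean
    # Eigenvectors of the covariance matrix via SVD of the data matrix.
    _, sing, Vt = np.linalg.svd(X, full_matrices=False)
    eigvals = sing ** 2 / (len(shapes) - 1)
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    return mean, Vt[:k].T, eigvals[:k]       # mean, modes P, variances

def synthesize_shape(mean, P, eigvals, b):
    """Generate a plausible shape; clamp b to +/- 3 standard deviations."""
    b = np.clip(b, -3 * np.sqrt(eigvals), 3 * np.sqrt(eigvals))
    return mean + P @ b
```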
The figure shows the two most significant modes of face shape variation of a model built from examples of a single individual with different viewpoints and expressions. The model has learned that the 2D shape change caused by 3D head rotation causes the largest shape change.
Two modes of a face shape model (parameters varied by ±2σ from the mean).
To build a statistical model of the texture (intensity or color over an image patch) one can warp (modify) each example image so its feature points match a reference shape (typically the mean shape).
The warping can be achieved by using any continuous deformation, such as piece-wise affine using a triangulation of the region or an interpolating spline. Warping to a reference shape removes spurious texture variation due to shape differences that would occur if we simply performed eigenvector decomposition on the un-normalized face patches (as in the eigenface approach).
The intensity information is sampled from the shape-normalized image over the region covered by the mean shape to form a texture vector gim.
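A minimal sketch of this shape normalization step, assuming scikit-image is available (its PiecewiseAffineTransform implements the triangulated piece-wise affine warp mentioned above); landmark arrays are assumed to be in (x, y) order, and all names are illustrative:

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def shape_normalize(image, points, mean_shape):
    """Warp `image` so its landmarks `points` map onto `mean_shape`."""
    tform = PiecewiseAffineTransform()
    # warp() needs the map from output (reference) coords to input coords,
    # so the mean shape is the source and the image landmarks the target.
    tform.estimate(mean_shape, points)
    return warp(image, tform, output_shape=image.shape)

# The texture vector g_im is then the pixels of the warped patch inside
# the mean-shape region, e.g. shape_normalize(img, pts, mean)[mask]
# for a fixed boolean mask covering the mean shape.
```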
Although the main shape changes due to smiling have been removed, there is considerable texture difference from a purely neutral face. By varying the elements of the texture parameter vector bg within limits learned from the training set, one can generate a variety of plausible shape-normalized face textures.
Example of a labeled face image and the face patch warped into the mean shape.
Goal: to find the best pose and shape parameters to match a model instance x to a new set of image points Y.
Minimizing the sum of squared distances between corresponding model and image points is equivalent to minimizing the expression

E = |Y − St(x̄ + Φ b)|²,

where St(·) is the pose (similarity) transformation with parameters t, b is the vector of shape parameters, and Φ is the matrix of shape modes. More generally, one can allow different weights for different points by using a weighted norm.
If the allowed global transformation St(.) is more complex than a simple translation, this is a nonlinear equation with no analytic solution. However, a good approximation can be found rapidly using a two-stage iterative approach.
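A hedged sketch of that two-stage iteration, reusing the shape model arrays (mean, P, eigvals) from the earlier sketch: stage one fixes the shape and solves for the similarity transform; stage two fixes the pose, projects the image points into the model frame, and updates b. All names are illustrative.

```python
import numpy as np

def similarity_fit(x, Y):
    """Least-squares scale s, rotation A, translation t with Y ~ s x A + t."""
    xc, Yc = x - x.mean(axis=0), Y - Y.mean(axis=0)
    U, sing, Vt = np.linalg.svd(xc.T @ Yc)
    A = U @ Vt                               # optimal rotation
    s = sing.sum() / (xc ** 2).sum()         # optimal scale
    t = Y.mean(axis=0) - s * x.mean(axis=0) @ A
    return s, A, t

def fit_shape_to_points(Y, mean, P, eigvals, n_iter=20):
    b = np.zeros(P.shape[1])
    for _ in range(n_iter):
        x = (mean + P @ b).reshape(-1, 2)    # current model instance
        s, A, t = similarity_fit(x, Y)       # stage 1: pose, shape fixed
        y = ((Y - t) @ A.T) / s              # invert the pose transform
        b = P.T @ (y.reshape(-1) - mean)     # stage 2: shape, pose fixed
        b = np.clip(b, -3 * np.sqrt(eigvals), 3 * np.sqrt(eigvals))
    return b, (s, A, t)
```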
The figure demonstrates the Active Shape Model (ASM) failing. The main facial features have been found, but the local models searching for the edges of the face have failed to locate their correct positions, perhaps because they are too far away. The ASM is a local method and prone to local minima.
Example of ASM search failure. The search profiles are not long enough to locate the edges of the face.
The performance can be significantly improved using a multi-resolution implementation, in which we start searching on a coarse level of a Gaussian image pyramid and progressively refine the fit at successively finer levels.
If a facial appearance model is trained on a sufficiently general set of data, it is able to synthesize faces similar to those in target images. If we can find the model parameters that generate a face similar to the target, those parameters imply the position of the facial features and can be used directly for face interpretation.
Both models and update matrices can be estimated at a range of image resolutions (training on a Gaussian image pyramid). We can then use a multiresolution search algorithm in which we start at a coarse resolution and iterate to convergence at each level before projecting the current solution to the next level of the model. This is more efficient and can converge to the correct solution from farther away than search at a single resolution.
To improve the efficiency and robustness of the algorithm, it can be implemented in a multiresolution framework.
This involves first searching for the object in a coarse image and then refining the location in a series of finer resolution images.
This leads to a faster algorithm and one that is less likely to get stuck on the wrong image structure.
Local models for each point are trained on each level of a Gaussian image pyramid.
The Gaussian Pyramid is a hierarchy of low-pass filtered versions of the original image, such that successive levels correspond to lower frequencies.
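For instance, such a pyramid can be built in a few lines with OpenCV (an assumed dependency; cv2.pyrDown low-pass filters and halves the image at each step):

```python
import cv2

def gaussian_pyramid(image, levels=4):
    """pyramid[0] is the original; each later level is blurred and halved."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid
```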
Search along sampled profile to find best fit of gray-level model.
Example of multi-resolution approach at highest resolution.
Left to right: Initial, after 5 iterations, final model
Open questions regarding the models include
In the previous section, models for describing the (2D) appearance and geometry of faces were discussed.
Let us now look at three-dimensional models and how they are used for face tracking.
Whether we want to analyze a facial image (face detection, tracking, recognition) or synthesize one (computer graphics, face animation), we need a model for the appearance and/or structure of the human face.
Depending on the application, the model can be simple (e.g. just an oval shape) or complex (e.g. thousands of polygons in layers simulating bone and layers of skin and muscles).
We usually wish to control appearance, structure and motion of the model with a small number of parameters, chosen so as to best represent the variability likely to occur in the application.
When analyzing a sequence of images (or frames), showing a moving face, the model might describe not only the static appearance of the face but also its dynamic behavior (i.e. the motion).
To perform any further analysis of a facial image (e.g., reconstruction), knowing the position of the face in the image is helpful, as is its pose (i.e., the 3D position and orientation).
The process of estimating position and pose parameters from each frame in a sequence is called tracking.
In contrast to face detection, tracking can utilize knowledge of the position, pose, and so on of the face in the previous image in the sequence.
This section explains the basics of parametric face models used for face tracking as well as fundamental strategies and methodologies for tracking.
FotoNation Face Tracker
Stereo tracking with two web cameras
Images captured by two cameras are used in self calibration
Affordable 3D Face Tracking Using Projective Vision
D.O. Gorodnichy, S. Malik, G. Roth Computational Video Group, Ottawa
The StereoTracker at work. The orientation and scale of the virtual man (at the bottom right) is controlled by the position of the observed face.
INRIA MIRAGES Lab research (France)
In the very beginning, the user creates, for each image, a camera that is then manually positioned in front of the image plane so that the projection of the generic model approximately matches the person's face in this image.
INRIA MIRAGES Lab research (France)
User manually positions key points on the image
Model is adapted to changes
INRIA MIRAGES Lab research (France)
Bezier curves (green) drawn by the user and computer generated model silhouettes (red)
Reconstruction system interface (right)
Cha Zhang (Microsoft Research) uses background segmentation for face identification and tracking
A plethora of face trackers are available in the literature. They differ in how they model the face, how they track changes from one frame to the next, whether and how changes in illumination and structure are handled, whether they are susceptible to drift, and whether real-time performance is possible. The presentation here is limited to monocular systems (in contrast to stereo vision) and 3D tracking.
Li et al. estimated face motion in a simple 3D model by a combination of prediction and a model-based least-squares solution to the optical flow constraint equation.
LaCascia et al. used a cylindrical face model with a parameterized texture being a linear combination of texture warping templates and orthogonal illumination templates. The 3D head pose was derived by registering the texture map captured from the new frame with the model texture. Stable tracking was achieved via regularized, weighted least-squares minimization of the registration error.
There are many ways to parameterize and model the appearance and behavior of the human face. The choice depends on, among other things, the application, the available resources, and the display device. The many kinds of variability being modeled/parameterized include the following:
The space spanned by the eigenfaces is called the face space.
Unfortunately, the manifold (distribution) of facial images has a highly nonlinear structure.
For face tracking, it has been more popular to linearize the face manifold by warping the facial images to a standard pose and/or shape, thereby creating shape-free, geometrically normalized, or shape normalized images and eigenfaces (texture templates, texture modes) that can be warped to any face shape or texture-mapped onto a wireframe face model.
During the 1960s and 1970s, a system for parameterizing minimal facial actions was developed by psychologists trying to analyze facial expressions. The system is called the Facial Action Coding System (FACS) and describes each facial expression as a combination of around 50 action units (AUs). Each AU represents the activation of one facial muscle.
FACS has been a popular tool not only for psychology studies but also for computerized facial modeling. Other models are also available in the literature.
FACS itself is purely descriptive and includes no inferential labels.
By converting FACS codes to EMFACS or similar systems, face images may be coded for emotion-specified expressions as well as for more molar categories of positive or negative emotion.
MPEG-4, since 1999 an international standard for coding and representation of audiovisual objects, contains definitions of face model parameters. There are two sets of parameters: facial definition parameters (FDPs), which describe the static appearance of the head, and facial animation parameters (FAPs), which describe the dynamics.
The FAPs describe the motion of certain feature points, such as lip corners. Points on the face model not directly affected by the FAPs are then interpolated according to the face model’s own motion model, which is not defined by MPEG-4 (complete face models can also be specified and transmitted).
Typically, the FAP coefficients are used as morph target weights, provided the face model has a morph target for each FAP. The FDPs describe the static shape of the face by the 3D coordinates of each feature point (MPEG-4 defines 84 feature points) and the texture as an image with the corresponding texture coordinates.
When synthesizing faces using computer graphics, the most common model is a wireframe model or a polygonal mesh. The face is then described as a set of vertices connected with lines forming polygons (usually triangles). The polygons are shaded or texture-mapped, and illumination is added. The texture could be parameterized or fixed – in the latter case, facial appearance is changed by moving the vertices only.
To achieve life-like animation of the face, a large number (thousands) of vertices and polygons are commonly used. Each vertex can move in three dimensions, so the model requires a large number of degrees of freedom. To reduce this number, some kind of parameterization is needed.
A commonly adopted solution is to create a set of morph targets and blend between them. A morph target is a predefined set of vertex positions, where each morph target represents, for example, a facial expression or a viseme.
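A minimal sketch of such blending (illustrative names); with MPEG-4 animation, the FAP coefficients described earlier would supply the weights:

```python
import numpy as np

def blend_morph_targets(neutral, targets, weights):
    """neutral: (n, 3) vertices; targets: (k, n, 3) morph targets;
    weights: (k,) blend coefficients, typically in [0, 1]."""
    offsets = targets - neutral              # per-target vertex displacements
    return neutral + np.tensordot(weights, offsets, axes=1)
```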
Candide is a simple face model that has been a popular research tool for many years. It was originally created by Rydfalk and later extended by Welsh to cover the entire head (Candide-2) and by Ahlberg to correspond better to MPEG-4 facial animation (Candide-3). The simplicity of the model makes it a good pedagogic tool.
Candide is a wireframe model with 113 vertices connected by lines forming 184 triangular surfaces. The geometry (shape, structure) is determined by the 3D coordinates of the vertices in a model-centered coordinate system (x, y, z). To modify the geometry, Candide-1 and Candide-2 implement a set of action units from FACS. Each action unit is implemented as a list of vertex displacements, an action unit vector, describing the change in face geometry when the action unit is fully activated.
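In equation form, the deformed geometry is g = ḡ + Aα, where each column of A is an action unit vector; Candide-3 adds static shape units, g = ḡ + Sσ + Aα. A hedged numpy sketch with illustrative array shapes:

```python
import numpy as np

def candide_geometry(g_mean, A, alpha, S=None, sigma=None):
    """g_mean: (3n,) stacked vertex coordinates (n = 113 for Candide);
    A: (3n, k) action unit vectors; alpha: (k,) activation levels in [0, 1];
    S, sigma: optional static shape unit vectors/parameters (Candide-3)."""
    g = g_mean + A @ alpha                   # dynamic deformation (expression)
    if S is not None:
        g = g + S @ sigma                    # static deformation (identity)
    return g
```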
There are general projection models representing the camera. The camera parameters may be known (calibrated camera) or unknown (uncalibrated). Skew and rotation can sometimes play a role as well. Both perspective projection and weak perspective projection (an approximation of perspective projection valid when the depth variation is small) are used.
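A minimal sketch contrasting the two projection models, assuming points given in camera coordinates with focal length f (names illustrative):

```python
import numpy as np

def perspective(X, f):
    """X: (n, 3) points in camera coordinates; divide by per-point depth."""
    return f * X[:, :2] / X[:, 2:3]

def weak_perspective(X, f):
    """All points share the mean depth -- a good approximation when the
    depth variation across the face is small relative to its distance."""
    return f * X[:, :2] / X[:, 2].mean()
```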
Example of the CMU S2 3D head tracking, including re-registration after losing the head.
A face tracking system estimates the rigid or nonrigid motion of a face through a sequence of image frames.
Tracking systems can be said to be either motion-based or model-based, sometimes referred to as feed-forward or feed-back motion estimation.
A motion-based tracker estimates the displacements of pixels (or blocks of pixels) from one frame to another. The displacements might be estimated using optical flow methods (giving a dense optical flow field), block-based motion estimation methods (giving a sparse field but using less computation power), or motion estimation in a few image patches only (giving a few motion vectors but at a very low computational cost).
The estimated motion field is then used to compute the motion of the object model. The motion estimation in such a method is consequently dependent on the pixels in two frames; the object model is used only for transforming the 2D motion vectors to 3D object model motion. The problem with such methods is the drifting or the long sequence motion problem. A tracker of this kind accumulates motion errors and eventually loses track of the face.
In general, the word model refers to any prior knowledge about the 3D structure, the 3D motion/dynamics, and the 2D facial appearance.
First-frame models: One of the main issues when designing a model-based tracker is the appearance model. An obvious approach is to capture a reference image of the object at the beginning of the sequence.
The image could then be geometrically transformed according to the estimated motion parameters, so one can compensate for changes in scale and rotation (and possibly nonrigid motion). Because the image is captured, the appearance model is deterministic, object-specific, and accurate.
A drawback with such a first-frame model is the lack of flexibility: it is difficult to generalize from one sample only. Another property is that the tracker does not know what it is tracking.
A different approach is a statistical model-based tracker. Here, the appearance model relies on previously captured images combined with knowledge of which parts or positions of the images correspond to the various facial features. When the model is transformed to fit the new frame, we thus obtain information about the estimated positions of those specific facial features.
The problem of finding the optimal parameters is a high-dimensional search problem and thus of high computational complexity. By using clever heuristics (e.g., the active appearance models), we can reduce the search time.
An appearance-based or featureless tracker matches a model of the entire facial appearance with the input image, trying to exploit all available information in the model as well as the image.
A feature-based tracker, on the other hand, chooses a few facial features that are, supposedly, easily and robustly tracked. Features such as color, specific points or patches, and edges can be used.
Typically, a tracker based on feature points tries, in the rigid motion case, to estimate the 2D positions of a set of points and from these compute the 3D pose of the face.
The tracker described next tracks a set of feature points in an image sequence and uses the 2D measurements to calculate the 3D structure and motion of the head.
The tracker is based on the structure from motion (SfM) algorithm by Azarbayejani and Pentland. The face tracker was then developed by Jebara and Pentland and further by Strom et al.
The tracker estimates the 3D pose and structure of a rigid object as well as the camera's focal length. With the terminology above, it is a first-frame model-based, feature-based tracker.
A Kalman filter is used to estimate the dynamic changes of a state vector of which only a function can be observed. When the function is nonlinear, we must use an extended Kalman filter (EKF).
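For reference, a generic EKF predict/update step looks as follows; this is a textbook sketch, not the tracker's actual state parameterization, and the transition/measurement functions f, h and their Jacobians F, H are assumptions supplied by the caller:

```python
import numpy as np

def ekf_step(x, P, z, f, F, h, H, Q, R):
    """x, P: state mean and covariance; z: measurement; Q, R: process
    and measurement noise covariances; F(x), H(x): Jacobians of f, h."""
    # Predict: propagate the state and covariance through the dynamics.
    x_pred = f(x)
    Fx = F(x)
    P_pred = Fx @ P @ Fx.T + Q
    # Update: correct with the measurement via the linearized model.
    Hx = H(x_pred)
    y = z - h(x_pred)                            # innovation
    S = Hx @ P_pred @ Hx.T + R                   # innovation covariance
    K = P_pred @ Hx.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ Hx) @ P_pred
    return x_new, P_new
```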
The tracker must be initialized, for example, by letting the user place his head in a certain position with the face toward the camera, or by using a face detection algorithm. The model texture is captured from the image and stored as a reference, and feature points are automatically extracted. To select feature points that can be reliably tracked, points where the determinant of the Hessian of the image intensity,

det H = Ixx Iyy − Ixy²,

is large are used. The determinant is weighted with the cosine of the angle between the model surface normal and the camera direction. The number of feature points to select is limited only by the available computational power and the real-time requirements. At least seven points are needed for the tracker to work, and more are preferable. Strom used 24 feature points and was able to achieve real-time performance.
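A hedged sketch of that selection rule, using scipy to form the smoothed second derivatives (the smoothing scale and all names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_determinant(image, sigma=2.0):
    """det H = Ixx * Iyy - Ixy^2 from Gaussian-smoothed second derivatives."""
    Ixx = gaussian_filter(image, sigma, order=(0, 2))
    Iyy = gaussian_filter(image, sigma, order=(2, 0))
    Ixy = gaussian_filter(image, sigma, order=(1, 1))
    return Ixx * Iyy - Ixy ** 2

def select_feature_points(image, normals, view_dir, k=24):
    """normals: (h, w, 3) model surface normals; view_dir: (3,) unit vector
    toward the camera. Returns (row, col) positions of the k best points."""
    # Weight the determinant by the cosine between normal and view direction.
    score = hessian_determinant(image) * np.clip(normals @ view_dir, 0, None)
    idx = np.argsort(score.ravel())[-k:]
    return np.column_stack(np.unravel_index(idx, score.shape))
```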
Using the face model and the values from the normalized template matching, the measurement noise covariance matrix can be estimated, making the Kalman filter rely on some measurements more than others.
Note that this also tells the Kalman filter in which directions in the image the measurements are reliable. For example, a feature point on an edge (e.g. the mouth outline) can reliably be placed in the direction perpendicular to the edge but less reliably along the edge.
Patches from the rendered image (lower left) are matched with the incoming video. The two-dimensional feature point trajectories are fed through the structure-from-motion (SfM) extended Kalman filter, which estimates the pose information needed to render the next model view. For clarity, only 4 of the 24 patches are shown.
Tracking results on two test sequences. Every tenth frame is shown.
The initial test shows that the system is able to track a previously unseen person in a subjectively accurate way. Some important issues to be addressed are:
To optimize the algorithm, three potentially time-consuming parts within each iteration need to be taken care of:
Whereas motion-based trackers may suffer from drifting, model-based trackers do not have that problem.
Appearance and feature-based trackers follow different basic principles and have different characteristics.
Changes in lighting can produce large variability in the appearance of faces. One way to measure the difficulty presented by lighting, or any variability, is the number of degrees of freedom needed to describe it. For example, the pose of a face relative to the camera has six degrees of freedom: three rotations and three translations. Facial expression has tens of degrees of freedom if one considers the number of muscles that may contract to change expression.
To describe the light that strikes a face, we must describe the intensity of light hitting each point on the face from each direction.
Light is a function of position and direction, meaning that light has an infinite number of degrees of freedom. However, effective systems can account for the effects of lighting using fewer than 10 degrees of freedom. This can have considerable impact on the speed and accuracy of recognition systems.
Support for low-dimensional models is both empirical and theoretical. Principal Component Analysis (PCA) on images of a face obtained under various lighting conditions shows that this image set is well approximated by a low-dimensional, linear subspace of the space of all images. Experimentation shows that algorithms that take advantage of this observation can achieve high performance.
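A minimal sketch of that empirical check (image loading omitted; the input is assumed to be one vectorized image per row, all of the same face under different lighting):

```python
import numpy as np

def subspace_energy(images, k=9):
    """Fraction of the total variance captured by the k leading
    principal components of the image set."""
    X = images - images.mean(axis=0)
    sing = np.linalg.svd(X, compute_uv=False)
    return (sing[:k] ** 2).sum() / (sing ** 2).sum()
```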
Building truly accurate models of the way the face reflects light is a complex task. This is in part because skin is not homogeneous; light striking the face may be reflected by oils or water on the skin, by melanin in the epidermis, or by hemoglobin in the dermis.
Based on empirical measurements of skin, Marschner et al. state: "The BRDF (bidirectional reflectance distribution function) itself is quite unusual; at small incidence angles it is almost Lambertian, but at higher angles strong forward scattering emerges."
Furthermore, light entering the skin at one point may scatter below the surface of the skin, and exit from another point. This phenomenon, known as subsurface scattering, cannot be modeled by a bidirectional reflectance function (BRDF), which assumes that light leaves a surface from the point that it strikes it. Jensen et al. presented one model of subsurface scattering.
For purposes of realistic computer graphics, this complexity must be confronted in some way. For example, Borshukov and Lewis reported that in The Matrix Reloaded, they began by modeling face reflectance using a Lambertian diffuse component and a modified Phong model to account for a Fresnel-like effect. “As production progressed it became increasingly clear that realistic skin rendering couldn’t be achieved without subsurface scattering simulations”.
However, simpler models may be adequate for face recognition. This suggests that even if one wishes to model face reflectance more accurately, simple models may provide useful, approximate algorithms that can initialize more complex ones.
Thus, one can discuss an analytically derived representation of the images produced by a convex, Lambertian object illuminated by distant light sources. Restricting consideration to convex objects lets us ignore the effects of shadows cast by one part of the object on another. One can also assume that the surface of the object reflects light according to Lambert's law, which states that a material reflects incident light uniformly in all directions, so the observed intensity depends only on the albedo and on the angle between the surface normal and the light direction.
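A minimal sketch of Lambertian shading under a single distant source; the clamp to zero accounts for attached shadows (points facing away from the light):

```python
import numpy as np

def lambertian_intensity(albedo, normals, light):
    """albedo: (n,) surface reflectance; normals: (n, 3) unit normals;
    light: (3,) distant-source direction scaled by its intensity."""
    return albedo * np.clip(normals @ light, 0.0, None)
```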
Other researchers (Z. Zhang, Microsoft Research) deal with Face Re-Lighting from a Single Image under Harsh Lighting Conditions and with modeling synthetic illumination/reflection conditions.
Left: real image; right: synthetic image.
This lecture presented topics in modeling facial shape and appearance, parametric face modeling and tracking, and illumination modeling.