
Who’s in the Picture

This research investigates the correspondence between faces and names in news images and its importance for organizing large image collections. The study explores a semi-supervised approach that uses EM (Expectation Maximization) to link faces and names, and proposes a generative model that assigns names based on face appearance and the context of names within captions. It demonstrates that incorporating a language model substantially improves clustering accuracy.


Presentation Transcript


  1. Who’s in the Picture (presented by Wanxue Dong)

  2. Research Question • Who is depicted in the associated image? Find a correspondence between the faces in an image and the names in its caption. • Why is this important? Such correspondences have been used to browse museum collections and to organize large image collections. The goal is to take a large collection of news images and captions as semi-supervised input and produce a fully supervised dataset of faces labeled with names.

  3. Linking a face and language model with EM (Expectation Maximization) • Consider name assignment as a hidden variable problem where the hidden variables are the correct name-face correspondences for each picture. • The EM procedure iterates between computing the expected values of the set of face-name correspondences and updating the face clusters and language model given those correspondences (a toy sketch follows).
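Below is a minimal, self-contained sketch of this hidden-variable EM on toy data. It keeps only 1-D Gaussian face clusters with unit variance and omits the language model entirely; the pictures, names, and features are invented for illustration, and this is not the authors' implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Each "picture" has face features (1-D stand-ins) and candidate names.
pictures = [
    {"faces": [0.1, 2.0], "names": ["A", "B"]},
    {"faces": [0.0],      "names": ["A", "C"]},
    {"faces": [2.1],      "names": ["B"]},
]
names = ["A", "B", "C"]
mu = {n: rng.normal() for n in names}   # one Gaussian mean per name (unit variance)

def gauss(x, m):
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

for _ in range(30):
    stats = {n: [0.0, 0.0] for n in names}          # weighted sum, total weight
    for pic in pictures:
        faces, cand = pic["faces"], pic["names"]
        k = min(len(faces), len(cand))
        assigns, weights = [], []
        # E-step: score every one-to-one assignment of names to the first k faces
        for sub in itertools.permutations(cand, k):
            w = np.prod([gauss(f, mu[n]) for f, n in zip(faces, sub)])
            assigns.append(sub)
            weights.append(w)
        total = sum(weights)
        # accumulate expected sufficient statistics under the soft assignments
        for sub, w in zip(assigns, weights):
            w /= total
            for f, n in zip(faces, sub):
                stats[n][0] += w * f
                stats[n][1] += w
    # M-step: refit each name's face-cluster mean from the soft counts
    for n in names:
        if stats[n][1] > 0:
            mu[n] = stats[n][0] / stats[n][1]

print({n: round(float(m), 2) for n, m in mu.items()})
```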

  4. Generative Model

  5. Name Assignment The likelihood of picture x_i, under assignment a_i of names to faces, under our generative model is

  P(x_i, a_i) = ∏_α P(f_{σ(α)} | n_α) P(pictured | c_α) × ∏_β P(not pictured | c_β) × ∏_γ P(f_γ)

  where α indexes into the names that are pictured, σ(α) indexes into the faces assigned to the pictured names, β indexes into the names that are not pictured, and γ indexes into the faces without assigned names.
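As a sanity check of the formula, here is a direct transcription of the likelihood for one picture under one assignment. The probability functions (p_face, p_pictured, p_bg) and the toy inputs are illustrative stand-ins, not the paper's fitted models.

```python
def assignment_likelihood(faces, names, assignment, p_face, p_pictured, p_bg):
    """faces: list of face features; names: list of (name, context) pairs;
    assignment: dict {name index alpha -> face index sigma(alpha)}."""
    like = 1.0
    assigned = set(assignment.values())
    for alpha, (name, ctx) in enumerate(names):
        if alpha in assignment:                    # pictured names (alpha terms)
            like *= p_face(faces[assignment[alpha]], name) * p_pictured(ctx)
        else:                                      # names not pictured (beta terms)
            like *= 1.0 - p_pictured(ctx)
    for gamma, f in enumerate(faces):              # unassigned faces (gamma terms)
        if gamma not in assigned:
            like *= p_bg(f)
    return like

# Toy usage with made-up probabilities:
print(assignment_likelihood(
    faces=[0.2, 1.5], names=[("A", "cue1"), ("B", "cue2")], assignment={0: 0},
    p_face=lambda f, n: 0.8 if n == "A" else 0.1,
    p_pictured=lambda c: 0.9 if c == "cue1" else 0.3,
    p_bg=lambda f: 0.05))
```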

  6. Name Assignment EM procedure: • E-step: update P_ij to the normalized probability of picture i under assignment j. • M-step: maximize the parameters P(face | name) and P(pictured | context).
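A minimal numeric sketch of one E/M round for a single picture; the likelihood values and the "pictured" indicator below are illustrative, with the likelihoods standing in for the output of the function sketched above.

```python
import numpy as np

likes = np.array([0.012, 0.003, 0.0008])     # likelihood of each assignment a_j
P_ij = likes / likes.sum()                   # E-step: normalized posteriors

# M-step (illustrative): soft counts weighted by P_ij feed the parameter
# updates, e.g. the expected number of times name "A" is pictured.
pictured_in = np.array([1.0, 1.0, 0.0])      # is "A" pictured under assignment a_j?
expected_pictured = float((P_ij * pictured_in).sum())
print(P_ij.round(3), round(expected_pictured, 3))
```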

  7. Modeling the Appearance of Faces: P(face | name) • P(face | name) is modeled with Gaussians of fixed covariance, one per name • Need a representation for faces in a feature space • Rectify all faces to a canonical pose • Train five support vector machines as facial feature detectors • Use kPCA to reduce the dimensionality of the data, then compute linear discriminants (a rough pipeline sketch follows)
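A rough scikit-learn analogue of this feature pipeline, run on random stand-in data rather than rectified face images; at the paper's scale, kPCA would need an approximate computation rather than the exact one shown here.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # stand-in for 200 rectified face images
y = np.arange(200) % 5               # stand-in identity labels for 5 names

kpca = KernelPCA(n_components=50, kernel="rbf")
Z = kpca.fit_transform(X)            # nonlinear dimensionality reduction

lda = LinearDiscriminantAnalysis(n_components=4)
F = lda.fit_transform(Z, y)          # discriminant coordinates used for clustering
print(F.shape)                       # (200, 4)
```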

  8. Language Model P(pictured | context) • Assigns a probability to each name based on its context within the caption • The distributions are learned from counts of how often each context appears describing an assigned name versus an unassigned name • Each context cue is modeled independently with its own distribution (a counting sketch follows)
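A counting sketch of such a language model, with independently modeled binary context cues combined naive-Bayes style under Laplace smoothing. The cue names and training examples are invented, and the paper's actual cue set and estimator may differ in detail.

```python
from collections import defaultdict

cue_counts = defaultdict(lambda: [1.0, 1.0])   # cue -> [count | pictured, count | not]
prior = [1.0, 1.0]                             # [pictured, not pictured] totals

def update(active_cues, weight):
    """Accumulate soft counts from one (context, P(pictured)) training example."""
    prior[0] += weight
    prior[1] += 1.0 - weight
    for cue in active_cues:
        cue_counts[cue][0] += weight
        cue_counts[cue][1] += 1.0 - weight

def p_pictured(active_cues):
    """Posterior that a name is pictured, treating each cue independently."""
    p1, p0 = prior[0], prior[1]
    for cue in active_cues:
        c1, c0 = cue_counts[cue]
        p1 *= c1 / prior[0]    # approx. P(cue | pictured)
        p0 *= c0 / prior[1]    # approx. P(cue | not pictured)
    return p1 / (p1 + p0)

update({"name_first_in_caption"}, 0.9)   # toy soft-labeled examples
update({"followed_by_said"}, 0.2)
print(round(p_pictured({"name_first_in_caption"}), 2))
```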

  9. Comparison of the EM & MM procedures • The Maximal Assignment (MM) procedure: • M1: set the maximal P_ij to 1 and all others to 0 • M2: maximize the parameters P(face | name) and P(pictured | context) • For both methods, incorporating a language model greatly improves the resulting clustering.
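The difference between the two procedures in miniature, with illustrative numbers: the EM E-step spreads probability mass over assignments, while the maximal-assignment step M1 puts all mass on the single best one.

```python
import numpy as np

likes = np.array([0.012, 0.003, 0.0008])   # likelihood of each assignment
soft = likes / likes.sum()                 # EM: P_ij proportional to likelihood
hard = np.zeros_like(likes)
hard[likes.argmax()] = 1.0                 # MM step M1: winner takes all
print(soft.round(3), hard)
```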

  10. Datasets • Approximately half a million news pictures and captions collected from Yahoo! News over a period of roughly two years • Faces: 44,773 large, well-detected face images • Names: an open-source named entity recognizer detects proper names in the captions • Filtering: face images that cannot be rectified satisfactorily are rejected, leaving 34,623; the study then concentrates on images whose captions contain detectable proper names, leaving 30,281.

  11. Experimental Results • Results of applying the learned language model to a test set of 430 captions (text alone) • Test set: each detected name was hand-labeled IN/OUT according to whether the named person was pictured in the corresponding image • The experiment tests how well the language model alone can predict those labels.

  12. Conclusion • The study couples language and images, using language to learn about images and images to learn about language • Analyzing language more carefully produces a much better clustering • The learned language model serves as a natural language classifier that can determine who is pictured from text alone.

  13. Critiques and Future Work • The test set is limited because it is entirely hand-labeled • There is no comparison against other models • Next step: learn a language model for free text on web pages to improve Google image search results.

  14. Reference • Tamara L. Berg, Alexander C. Berg, Jaety Edwards, and David A. Forsyth. Who's in the Picture? Neural Information Processing Systems (NIPS), 2004. Thank you!
