
Words and Pictures



Presentation Transcript


  1. Words and Pictures Rahul Raguram

  2. Motivation • Huge datasets where text and images co-occur • ~3.6 billion photos

  3. Motivation • Huge datasets where text and images co-occur

  4. Motivation • Huge datasets where text and images co-occur • Photos in the news

  5. Motivation • Huge datasets where text and images co-occur • Subtitles

  6. Motivation • Interacting with large image datasets • Image content ‘Blobworld’ [Carson et al., 99]

  7. Motivation • Interacting with large photo collections • Image content ‘Blobworld’ [Carson et al., 99]

  8. Motivation • Interacting with large photo collections • Image content ‘Blobworld’ [Carson et al., 99]

  9. Motivation • Interacting with large photo collections • Image content Query by sketch [Jacobs et al., 95]

  10. Motivation • Interacting with large photo collections • Image content Query by sketch [Jacobs et al., 95]

  11. Motivation • Interacting with large photo collections • Large disparity between user needs and what technology provides (Armitage and Enser 1997, Enser 1993, Enser 1995, Markkula and Sormunen 2000) • Queries based on image histograms, texture, overall appearance, etc. are vanishingly rare

  12. Motivation • Interacting with large photo collections • Text queries

  13. Motivation • Text and images may be separately ambiguous; jointly they tend not to be • Image descriptions often leave out what is visually obvious (e.g., the colour of a flower) • ...but often include properties that are difficult to infer using vision (e.g., the species of the flower)

  14. Linking words and pictures: Applications • Automated image annotation • Auto illustration • Browsing support [Figure: example labels "tiger", "cat", "mouth", "teeth"; example text “statue of liberty”]

  15. Learning the Semantics of Words and Pictures Barnard and Forsyth, ICCV 2001

  16. Key idea • Model the joint distribution of words and image features [Figure: a joint probability model for text and image features rates example inputs as impossible (random bits), unlikely, or reasonable; example keyword sets: "apple tree" and "sky water sun"] Slide credit: David Forsyth

  17. Input Representation • Extract keywords • Segment the image into a set of ‘blobs’

  18. EM revisited: Image segmentation Examples from: http://www.eecs.umich.edu/~silvio/teaching/

  19. EM revisited: Image segmentation • Generative model: the image is produced by Segment 1, Segment 2, ..., Segment k • Problem: you don’t know the parameters, the mixing weights, or the segmentation

  20. EM revisited: Image segmentation • If you knew the segmentation, then you could find the parameters easily • Compute maximum likelihood estimates for the parameters of each segment • The fraction of the image in the segment gives the mixing weight

  21. EM revisited: Image segmentation • If you knew the segmentation, then you could find the parameters easily • If you knew the parameters, you could easily determine the segmentation (calculate the posteriors) • Solution: iterate
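
The iteration described on slides 19 to 21 can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: it assumes each segment is a 1-D Gaussian over pixel intensity, and the function name `em_segment`, the quantile initialisation, and the fixed iteration count are all choices made for this example.

```python
import math

def em_segment(pixels, k, n_iter=50):
    """Fit a k-component 1-D Gaussian mixture to pixel intensities with EM.

    Returns (weights, means, variances, labels); labels[i] is the most
    probable segment for pixels[i] under the final E-step.
    """
    x = [float(p) for p in pixels]
    n = len(x)
    srt = sorted(x)
    # Initialise means at evenly spaced quantiles (an illustrative choice).
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    mu = sum(x) / n
    var0 = sum((p - mu) ** 2 for p in x) / n + 1e-6
    variances = [var0] * k
    weights = [1.0 / k] * k
    post = []
    for _ in range(n_iter):
        # E-step: given the parameters, compute the posterior probability
        # of each segment for each pixel (the soft segmentation).
        post = []
        for p in x:
            logs = [math.log(weights[j])
                    - 0.5 * math.log(2 * math.pi * variances[j])
                    - 0.5 * (p - means[j]) ** 2 / variances[j]
                    for j in range(k)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            post.append([u / total for u in unnorm])
        # M-step: given the soft segmentation, compute maximum likelihood
        # estimates; the mixing weight is the fraction of pixels assigned.
        for j in range(k):
            nj = sum(r[j] for r in post) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * p for r, p in zip(post, x)) / nj
            variances[j] = sum(r[j] * (p - means[j]) ** 2
                               for r, p in zip(post, x)) / nj + 1e-6
    labels = [max(range(k), key=lambda j: r[j]) for r in post]
    return weights, means, variances, labels
```

Real segmenters of this kind typically work on richer per-pixel features (colour, texture, position) with multivariate Gaussians; the scalar version keeps the E-step and M-step visible.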

  22. EM revisited: Image segmentation Image from: http://www.ics.uci.edu/~dramanan/teaching/

  23. Input Representation • Segment the image into a set of ‘blobs’ • Each region/blob represented by a vector of 40 features (size, position, colour, texture, shape)

  24. Modeling image dataset statistics • Generative, hierarchical model • Extension of Hofmann’s model for text (1998) • Each node emits blobs and words • Higher nodes emit more general words and blobs • Middle nodes emit moderately general words and blobs • Lower nodes emit more specific words and blobs [Figure: tree with example node labels "sky", "sun", "waves"]

  25. Modeling image dataset statistics • Generative, hierarchical model • Extension of Hofmann’s model for text (1998) • Following a path from root to leaf generates an image and its associated text [Figure: path through nodes labelled "sky", "sun", "waves"; generated text "sun sky waves"]

  26. Modeling image dataset statistics • Generative, hierarchical model • Extension of Hofmann’s model for text (1998) • Each cluster is associated with a path from the root to a leaf [Figure: a cluster of images]

  27. Modeling image dataset statistics • Generative, hierarchical model • Extension of Hofmann’s model for text (1998) • Each cluster is associated with a path from the root to a leaf • Adjacent clusters [Figure: tree with node labels "sky", "sun, sea", "waves", "rocks"; two adjacent clusters with text "sun sea sky waves" and "sun sea sky rocks"]

  28. Modeling image dataset statistics • Each cluster is associated with a path from a leaf to the root • D = blobs ∪ words [Equation annotations: conditional independence of the items; nodes along the path from leaf to root]
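
The equation itself did not survive in this transcript. One form consistent with the annotations on the slide (a sum over clusters, a product over conditionally independent items, and an inner sum over the nodes along the path) would be the following. The symbols c (cluster), l (level), i (item), and d (document) are borrowed from slide 30; the exact conditioning is an assumption and may differ from the slide.

```latex
P(D \mid d) \;=\; \sum_{c} P(c) \prod_{i \in D} \Bigl[ \sum_{l} P(i \mid l, c)\, P(l \mid c, d) \Bigr]
```

Read right to left: each item is emitted by some level on the cluster's path, items are independent given the cluster, and the cluster itself is marginalised out.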

  29. Modeling image dataset statistics • For blobs • For words • Tabulate word frequencies

  30. Modeling image dataset statistics • Model fitting: EM • Missing data is the path and the nodes that generated each data element • Two hidden variables: (1) document d is in cluster c; (2) item i of document d was generated at level l • If the path and node were known for each data element, it would be easy to get the maximum likelihood estimate of the parameters • Given a parameter estimate, the path and node are easy to figure out

  31. Results • Clustering • Does text+image clustering have an advantage? Only text

  32. Results • Clustering • Does text+image clustering have an advantage? Only blob features

  33. Results • Clustering • Does text+image clustering have an advantage? Both text and image segments

  34. Results • Clustering • Does text+image clustering have an advantage? • User study: • Generate 64 clusters for 3000 images • Generate 64 random clusters from the same images • Present random cluster to user, ask to rate coherence (yes/no) • 94% accuracy

  35. Results • Image search • Supply a combination of text + image features • Approach: compute, for each candidate image, the probability of emitting the query items, where Q is the set of query items and d is the candidate document
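
A toy sketch of this ranking step follows. It is an illustration under stated assumptions, not the authors' code: query items are treated as discrete tokens (words or quantised blob ids), level weights depend only on the cluster, and each candidate document is summarised by a posterior over clusters. The names `query_likelihood`, `rank`, and every number in the toy model are hypothetical.

```python
from math import prod

def query_likelihood(query, cluster_post, level_weights, emit):
    """Approximate P(Q | d) as
    sum_c P(c | d) * prod_{q in Q} sum_l P(l | c) * P(q | l, c)."""
    total = 0.0
    for c, p_c in cluster_post.items():
        per_item = [
            sum(w * emit[c][l].get(q, 0.0)
                for l, w in enumerate(level_weights[c]))
            for q in query
        ]
        total += p_c * prod(per_item)
    return total

def rank(query, docs, level_weights, emit):
    """Return document names sorted by decreasing query likelihood."""
    scores = {name: query_likelihood(query, post, level_weights, emit)
              for name, post in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy two-level model: a shared root node and one leaf per cluster.
root = {"sky": 0.6, "sun": 0.4}
emit = {
    "beach": [root, {"waves": 0.7, "sand": 0.3}],
    "forest": [root, {"tree": 0.8, "leaves": 0.2}],
}
level_weights = {"beach": [0.5, 0.5], "forest": [0.5, 0.5]}
docs = {
    "A": {"beach": 0.9, "forest": 0.1},  # P(c | d) for document A
    "B": {"beach": 0.1, "forest": 0.9},
}
ranking = rank(["sky", "waves"], docs, level_weights, emit)
```

Because "sky" lives at the shared root and "waves" only at the beach leaf, the document that is mostly explained by the beach cluster ranks first.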

  36. Results • Image search Image credit: David Forsyth

  37. Results • Image search Image credit: David Forsyth

  38. Results • Image search Image credit: David Forsyth

  39. Results • Auto-annotation • Compute:

  40. Results • Auto-annotation • Quantitative performance: • Use 160 Corel CDs, each with 100 images (grouped by theme) • Select 80 of the CDs, split into training (75%) and test (25%). Remaining 80 CDs are a ‘harder’ test set • Model scoring: n – number of words for the image; r – number of words predicted correctly; w – number of words predicted incorrectly; N – vocabulary size • All words that exceed a threshold are predicted
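
The scoring formula is missing from the transcript. A common normalised score built from exactly these four quantities is r/n − w/(N − n); the sketch below assumes that form, which may not match the slide. The function name `normalized_score` is hypothetical.

```python
def normalized_score(predicted, actual, vocab_size):
    """Assumed score r/n - w/(N - n) for one image.

    predicted  -- words the model emits (all words above the threshold)
    actual     -- ground-truth words for the image
    vocab_size -- N, the size of the vocabulary
    """
    predicted, actual = set(predicted), set(actual)
    n = len(actual)
    r = len(predicted & actual)   # words predicted correctly
    w = len(predicted - actual)   # words predicted incorrectly
    return r / n - w / (vocab_size - n)
```

Under this form, predicting nothing and predicting the entire vocabulary both score 0, and a perfect prediction scores 1, so a threshold that is too loose gains nothing.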

  41. Results • Auto-annotation • Quantitative performance: • Use 160 Corel CDs, each with 100 images (grouped by theme) • Select 80 of the CDs, split into training (75%) and test (25%). Remaining 80 CDs are a ‘harder’ test set • Model scoring: n – number of words for the image; r – number of words predicted correctly • Model predicts n words • Can do surprisingly well just by using the empirical word frequency!
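
The empirical-frequency baseline mentioned on this slide can be sketched as follows: ignore the image entirely and predict the n most frequent training words. This is an illustrative reading of the slide; the function names are hypothetical and the fraction r/n is assumed to be the per-image measure.

```python
from collections import Counter

def empirical_baseline(train_annotations, n):
    """Predict the n most frequent words in the training annotations."""
    counts = Counter(w for words in train_annotations for w in words)
    return [w for w, _ in counts.most_common(n)]

def fraction_correct(predicted, actual):
    """r / n: the share of an image's words that were predicted."""
    return len(set(predicted) & set(actual)) / len(actual)

train = [["sky", "water"], ["sky", "sun"], ["sky", "water", "tree"]]
guess = empirical_baseline(train, n=2)
```

On a collection where a few words such as "sky" or "water" appear in many annotations, this image-blind guess is already right a fair share of the time, which is why it serves as the reference point for the model's score.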

  42. Results • Auto-annotation • Quantitative performance: Score of 0.1 indicates roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical model)

  43. Names and Faces in the News Berg et al., CVPR 2004

  44. Motivation President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

  45. Motivation President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

  46. Motivation • Organize news photographs for browsing and retrieval • Build a large ‘real-world’ face dataset • Datasets captured in lab conditions do not truly reflect the complexity of the problem

  47. Motivation • Organize news photographs for browsing and retrieval • Build a large ‘real-world’ face dataset • Datasets captured in lab conditions do not truly reflect the complexity of the problem • In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)

  48. Motivation • Top left 100×100 pixels of the first 10 individuals in the color FERET dataset. The IDs of the subjects are listed to the right of the images

  49. Dataset • Download news photos and captions • ~500,000 images from Yahoo News, over a period of two years • Run a face detector • 44,773 faces • Resized to 86x86 pixels • Extract names from the captions • Identify two or more capitalized words followed by a present tense verb • Associate every face in the image with every detected name • Goal is to label each face detector output with the correct name
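
The name-extraction heuristic and the all-pairs association can be sketched with a regular expression. This is a rough illustration, not the pipeline from the paper: the verb list `PRESENT_TENSE_VERBS` is a tiny hypothetical stand-in for a real verb lexicon, and the token pattern is an assumption about what counts as a capitalized word.

```python
import re
from itertools import product

# Hypothetical, deliberately tiny list of present-tense verbs.
PRESENT_TENSE_VERBS = ("makes", "looks", "speaks", "waves", "arrives", "says")

# A capitalized token such as "George", "W." or "O'Neill" (assumed pattern).
_TOKEN = r"[A-Z][A-Za-z.'-]*"
_NAME_RE = re.compile(
    rf"\b({_TOKEN}(?:\s+{_TOKEN})+)\s+(?:{'|'.join(PRESENT_TENSE_VERBS)})\b"
)

def extract_names(caption):
    """Return runs of two or more capitalized words followed by a verb."""
    return [m.group(1) for m in _NAME_RE.finditer(caption)]

def candidate_pairs(face_ids, names):
    """Associate every detected face with every detected name."""
    return list(product(face_ids, names))

caption = ("President George W. Bush makes a statement in the Rose Garden "
           "while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003.")
names = extract_names(caption)
pairs = candidate_pairs(["face_0", "face_1"], names)
```

On the caption from slide 44 this yields "President George W. Bush" and "Defense Donald Rumsfeld": titles and stray capitalized words leak into the strings, and every face is paired with every name, which is why a later step is needed to assign the correct name to each face.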

  50. Dataset Properties • Diverse • Large variation in lighting and pose • Broad range of expressions
