Improved Video Categorization from Text Metadata and User Comments


ACM SIGIR 2011: Research and Development in Information Retrieval

- Katja Filippova

- Keith B. Hall

  • Presenter: Viraja Sameera Bandhakavi





Contributions
  • Analyze sources of text information such as title, description, and comments, and show that they provide valuable indications of the topic
  • Show that a text-based classifier trained on the imperfect predictions of a weakly supervised, video content-based classifier is not redundant
  • Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently


Research questions not answered by related work
  • Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to that of the original classifier? Can a combination of the two classifiers outperform either one?
  • Do the video-based and text-based classifiers capture different semantics?
  • How useful is user-provided text metadata? Which source is the most helpful?
  • Can reliable predictions be made from user comments? Can comments improve the performance of the classifier?


  • Builds on top of the predictions of Video2Text
  • Uses Video2Text:
    • Requires no labeled data other than video metadata
    • Clusters similar videos and generates a text label for each cluster
    • The resulting label set is larger and better suited for categorization of video content on YouTube


  • Starts from a set of weak labels based on the video metadata
  • Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)
  • Every concept is associated with a binary classifier trained from a large set of audio and video signals
  • Positive instances: videos that mention the concept in the metadata
  • Negative instances: videos that don’t mention the concept in the metadata
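
The weak-labeling step above can be illustrated with a small sketch. It is not the Video2Text implementation: the `videos` dictionary layout, the whitespace tokenization, and the function name `weak_labels` are assumptions made for illustration only.

```python
def weak_labels(videos, concept):
    """Split videos into weak positives and negatives for one concept.

    `videos` maps a video id to its metadata text (hypothetical layout).
    A video is a positive instance if the concept, a unigram or bigram,
    appears in its metadata, and a negative instance otherwise.
    """
    concept = concept.lower()
    positives, negatives = [], []
    for video_id, metadata in videos.items():
        tokens = metadata.lower().split()
        # Unigrams plus bigrams built from adjacent tokens.
        grams = set(tokens) | {" ".join(pair) for pair in zip(tokens, tokens[1:])}
        if concept in grams:
            positives.append(video_id)
        else:
            negatives.append(video_id)
    return positives, negatives


videos = {
    "v1": "Xbox gameplay walkthrough",
    "v2": "Cooking pasta at home",
    "v3": "New xbox console review",
}
pos, neg = weak_labels(videos, "xbox")
```

In the approach described on the slide, these weak labels would then supervise a binary classifier over audio and video signals; that part is not sketched here.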


  • A binary classifier is trained for every concept in the vocabulary
    • Accuracy is assessed on a portion of a validation dataset
    • Each iteration uses a subset of unseen videos from the validation set
    • The classifier and concept are retained if precision and recall are above a threshold (0.7 in this paper)
  • The remaining classifiers are used to update the feature vectors of all videos
  • Repeated until the vocabulary size doesn’t change much or the maximum number of iterations is reached
  • Finer grained concepts are learned from concepts added in the previous iteration
  • Labels related to news, sports, film, etc. are grouped together, resulting in the final set of 75 two-level categories
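
The retention rule in this loop (keep a concept only if both precision and recall exceed the threshold) can be sketched as follows. The data layout and the function names are hypothetical, and the classifier training and feature-vector update are left out.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for two parallel lists of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def retain_concepts(validation_results, threshold=0.7):
    """Keep the concepts whose classifier clears the threshold on both metrics.

    `validation_results` maps a concept to (predicted, actual) label lists
    obtained on held-out validation videos (assumed layout).
    """
    kept = []
    for concept, (predicted, actual) in validation_results.items():
        precision, recall = precision_recall(predicted, actual)
        if precision > threshold and recall > threshold:
            kept.append(concept)
    return kept


validation_results = {
    "soccer": ([1, 1, 1, 0, 0], [1, 1, 1, 0, 0]),  # precision 1.0, recall 1.0
    "music": ([1, 1, 0, 0, 0], [1, 0, 1, 1, 0]),   # precision 0.5, recall 1/3
    "cats": ([1, 1, 1, 1, 0], [1, 1, 1, 0, 0]),    # precision 0.75, recall 1.0
}
kept = retain_concepts(validation_results)
```

Here "music" is dropped because both metrics fall below 0.7, while the other two concepts are kept.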


Categorization with Video2Text
  • Use Video2Text to assign two-level categories to videos
  • Total number of binary classifiers (hence labels) limited to 75
  • Output of Video2Text represented as a list of strings: (v_i, c_j, s_ij)


Distributed MaxEnt
  • Approach automatically generates training examples for the category classifier
  • Uses the conditional maximum entropy optimization criterion to train the classifiers
  • Results in a conditional probability model over the classes given the YouTube videos.
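
For reference, a conditional maximum entropy model is usually written in the form below. The notation is an assumption and does not come from the slides: v is a video, c a category, f_k a feature function over the video's tokens and the category, and λ_k its learned weight.

```latex
P(c \mid v) =
  \frac{\exp\left(\sum_{k} \lambda_k f_k(v, c)\right)}
       {\sum_{c'} \exp\left(\sum_{k} \lambda_k f_k(v, c')\right)}
```

The denominator sums over all categories c', so the scores for one video form a probability distribution over the classes.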


Data and Models
  • Text models differ in the text sources from which the features are extracted: title, description, comments, etc.
  • All features are token based
  • Infrequent tokens are filtered out to reduce the feature space
  • Token frequencies are calculated over 150K videos
  • Every unique token is counted once per video
  • A token frequency threshold of 10 is used
  • Each token is prefixed with the first letter of the source in which it was found
  • e.g., T:xbox, D:xbox, U:xbox, C:xbox, etc.
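
A minimal sketch of this feature extraction, under stated assumptions: each video is a dictionary from a one-letter source prefix to its text, tokens are split on whitespace, counts are taken over prefixed tokens, and a token is kept when it occurs in at least `min_count` videos. None of these details are given on the slide, and the function names are hypothetical.

```python
from collections import Counter


def prefixed_tokens(fields):
    """Return the set of source-prefixed tokens for one video.

    `fields` maps a one-letter source prefix (e.g. "T" for title) to text.
    Using a set means each unique token counts once per video.
    """
    return {
        f"{prefix}:{token}"
        for prefix, text in fields.items()
        for token in text.lower().split()
    }


def document_frequencies(videos):
    """Count, for each prefixed token, the number of videos containing it."""
    frequencies = Counter()
    for fields in videos:
        frequencies.update(prefixed_tokens(fields))
    return frequencies


def build_vocabulary(videos, min_count=10):
    """Keep tokens seen in at least `min_count` videos (assumed inclusive)."""
    frequencies = document_frequencies(videos)
    return {token for token, count in frequencies.items() if count >= min_count}


videos = [
    {"T": "xbox review", "D": "xbox xbox gameplay"},
    {"T": "xbox tips", "C": "nice video"},
]
vocabulary = build_vocabulary(videos, min_count=2)
```

With the toy threshold of 2, only `T:xbox` survives: `D:xbox` appears three times in the first description but is counted once, because it occurs in a single video.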


Combined Classifier
  • Used to see whether combining the two views (video-based and text-based) is beneficial
  • A simple meta classifier is used, which ranks the video categories based on the predictions of the two classifiers
  • Video-based predictions are converted to a probability distribution
  • The distribution from the video-based predictions and the distribution from MaxEnt (the maximum entropy classifier) are multiplied
  • This approach proved to be effective
  • Idea: each classifier has veto power
  • The final prediction for each video is the one with the highest product score
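
The product combination can be sketched as below. How the video-based scores are turned into a distribution is not stated on the slide; dividing by their sum is an assumption here, as are the function names and the toy numbers.

```python
def normalize(scores):
    """Turn non-negative scores into a distribution by dividing by their sum."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {category: score / total for category, score in scores.items()}


def combine(video_scores, text_probs):
    """Multiply the two distributions and return the best category.

    A category missing from either classifier gets probability 0 there,
    so its product is 0: this is the veto effect.
    """
    video_probs = normalize(video_scores)
    categories = set(video_probs) | set(text_probs)
    product = {
        c: video_probs.get(c, 0.0) * text_probs.get(c, 0.0) for c in categories
    }
    best = max(product, key=product.get)
    return best, product


video_scores = {"sports": 0.9, "news": 0.6}
text_probs = {"sports": 0.3, "news": 0.5, "music": 0.2}
best, product = combine(video_scores, text_probs)
```

In this toy case the video classifier prefers "sports", but the product favors "news", and "music" is vetoed because the video classifier never proposed it.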


Experiments: Evaluation of Text Models
  • The training data set contains 100K videos that receive a high-scoring prediction
  • Correct prediction: a score of at least 0.85 from Video2Text
  • A text-based prediction must be in the set of video-assigned categories
  • Evaluation was done on two sets of videos:
    • Videos with at least one comment
    • Videos with at least 10 comments
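
One possible reading of this evaluation protocol is sketched below: the categories that Video2Text assigns with a score of at least 0.85 form the reference set for a video, and a text-based prediction counts as correct when it falls in that set. The triple layout and function names are assumptions, not details from the slides.

```python
def video_assigned_categories(triples, min_score=0.85):
    """Collect, per video, the categories Video2Text scores at or above min_score.

    `triples` is a list of (video, category, score) tuples (assumed layout).
    """
    assigned = {}
    for video, category, score in triples:
        if score >= min_score:
            assigned.setdefault(video, set()).add(category)
    return assigned


def text_accuracy(text_predictions, assigned):
    """Fraction of evaluated videos whose text prediction is video-assigned."""
    evaluated = [video for video in text_predictions if video in assigned]
    if not evaluated:
        return 0.0
    correct = sum(text_predictions[video] in assigned[video] for video in evaluated)
    return correct / len(evaluated)


triples = [
    ("v1", "sports", 0.90),
    ("v1", "news", 0.86),
    ("v2", "music", 0.95),
    ("v2", "film", 0.50),
]
assigned = video_assigned_categories(triples)
accuracy = text_accuracy({"v1": "news", "v2": "film"}, assigned)
```

Here the prediction for v1 is counted as correct and the one for v2 is not, since "film" scored below 0.85.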


Experiments: Evaluation of Text Models (contd.)
  • The best model is TDU+YT+C for both sets
  • This model is used for comparison against Video2Text model with human raters
  • This model is also used in the Combination model


Experiments with Human Raters
  • A total of 750 videos are extracted, an equal number from each of the 15 YouTube categories
  • A human rater rates each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0)
  • Every pair received ratings from 3 human raters
  • The three ratings are summed, normalized by dividing by 9, and rounded to give the final score
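
The scoring arithmetic can be written out as a short sketch. The slide does not say to what precision the score is rounded, so the value is left unrounded here; the 0.5 cut-off is the one used for a correct category in the deck's results. Function names are hypothetical.

```python
def normalized_score(ratings):
    """Sum three ratings on the 0-3 scale and divide by the maximum sum, 9."""
    if len(ratings) != 3 or any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("expected exactly three ratings, each in 0..3")
    return sum(ratings) / 9


def is_correct(ratings, threshold=0.5):
    """Treat a (video, category) pair as correct at or above the threshold."""
    return normalized_score(ratings) >= threshold


score = normalized_score((3, 2, 1))  # 6 / 9
```

For example, ratings of (2, 2, 1) give 5/9 and count as correct, while (1, 1, 2) give 4/9 and do not.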


Experiments with Human Raters (contd.)
  • A score of at least 0.5 counts as a correct category
  • The text-based model performs significantly better than the video model
  • The combination model improves accuracy further
  • The accuracy of all models increases with the number of comments


  • Text-based approach for assigning categories to videos
  • Competitive classifier trained on high-scoring predictions made by a weakly supervised classifier (video features)
  • Text and video models provide complementary views on the data
  • Simple combination model outperforms each model on its own
  • Accurate predictions from user comments
  • Reasons for impact of comments:
    • Substitute for a proper title
    • Disambiguate the category
    • Help correct wrong predictions
  • Future work: Investigate usefulness of user comments for other tasks