Improved Video Categorization from Text Metadata and User Comments

ACM SIGIR 2011: Research and Development in Information Retrieval. Authors: Katja Filippova, Keith B. Hall. Presenter: Viraja Sameera Bandhakavi.


Improved Video Categorization from Text Metadata and User Comments

ACM SIGIR 2011: Research and Development in Information Retrieval

- Katja Filippova

- Keith B. Hall

  • Presenter: Viraja Sameera Bandhakavi




Contributions
  • Analyze sources of text information such as title, description, and comments, and show that they provide valuable indications of the topic
  • Show that a text-based classifier trained on the imperfect predictions of a weakly supervised, video content-based classifier is not redundant
  • Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently


Research Questions Not Answered by Related Work
  • Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to that of the original classifier? Can a combination of the two classifiers outperform either one?
  • Do the video-based and text-based classifiers capture different semantics?
  • How useful is user-provided text metadata? Which source is the most helpful?
  • Can reliable predictions be made from user comments? Can comments improve the performance of the classifier?


  • Builds on top of the predictions of Video2Text
  • Uses Video2Text:
    • Requires no labeled data other than video metadata
    • Clusters similar videos and generates a text label for each cluster
    • The resulting label set is larger and better suited for categorization of video content on YouTube


  • Starts from a set of weak labels based on the video metadata
  • Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)
  • Every concept is associated with a binary classifier trained from a large set of audio and video signals
  • Positive instances: videos that mention the concept in the metadata
  • Negative instances: videos that don’t mention the concept in the metadata
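
The weak labeling step above can be sketched as follows. This is only an illustration under stated assumptions: the function name `weak_labels`, the dictionary layout of `videos`, and whitespace tokenization are hypothetical and not taken from the slides.

```python
def weak_labels(concept, videos):
    """Split videos into positive and negative instances for one concept.

    `videos` maps a video id to its metadata text (hypothetical layout).
    """
    concept_tokens = concept.lower().split()
    n = len(concept_tokens)
    positives, negatives = [], []
    for video_id, metadata in videos.items():
        tokens = metadata.lower().split()
        # A unigram or bigram concept matches if its tokens occur contiguously.
        mentioned = any(tokens[i:i + n] == concept_tokens
                        for i in range(len(tokens) - n + 1))
        (positives if mentioned else negatives).append(video_id)
    return positives, negatives


# Toy metadata, invented for illustration only.
videos = {
    "v1": "Xbox 360 gameplay trailer",
    "v2": "How to bake bread",
    "v3": "New gameplay trailer for racing fans",
}
pos, neg = weak_labels("gameplay trailer", videos)
```

Here `pos` holds the videos whose metadata mentions the bigram and `neg` holds the rest; the binary classifier for the concept would then be trained on audio and video signals of these two sets.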


  • A binary classifier is trained for every concept in the vocabulary
    • Accuracy is assessed on a portion of a validation dataset
    • Each iteration uses a subset of unseen videos from the validation set
    • The classifier and its concept are retained if both precision and recall are above a threshold (0.7 in this paper)
  • The retained classifiers are used to update the feature vectors of all videos
  • The process is repeated until the vocabulary size no longer changes much or the maximum number of iterations is reached
  • Finer-grained concepts are learned from concepts added in the previous iteration
  • Labels related to news, sports, film, etc. are grouped together, resulting in the final set of 75 two-level categories
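
A minimal sketch of the retention rule described above, assuming each concept's classifier has already produced binary predictions on validation videos. The helper names and the `validation_results` layout are hypothetical; only the 0.7 threshold on precision and recall comes from the slides.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for binary predictions (1 = concept present)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def retain_concepts(validation_results, threshold=0.7):
    """Keep only the concepts whose classifier clears the threshold on both metrics."""
    kept = []
    for concept, (predicted, actual) in validation_results.items():
        precision, recall = precision_recall(predicted, actual)
        if precision > threshold and recall > threshold:
            kept.append(concept)
    return kept


# Toy validation outcomes, invented for illustration only.
validation_results = {
    "soccer": ([1, 1, 1, 1, 0], [1, 1, 1, 1, 0]),   # precision 1.0, recall 1.0
    "cooking": ([1, 1, 0, 0, 0], [1, 0, 1, 1, 0]),  # precision 0.5, recall 1/3
}
kept = retain_concepts(validation_results)
```

In the full procedure this filter would run once per iteration, with the surviving classifiers feeding new features into the next round.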


Categorization with Video2Text
  • Use Video2Text to assign two-level categories to videos
  • Total number of binary classifiers (hence labels) limited to 75
  • Output of Video2Text is represented as a list of strings: (v_i, c_j, s_ij)


Distributed MaxEnt
  • The approach automatically generates training examples for the category classifier
  • Uses the conditional maximum entropy optimization criterion to train the classifiers
  • Results in a conditional probability model over the classes given the YouTube videos
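
For reference, a conditional maximum entropy model is usually written in the log-linear form below. The notation is assumed here rather than taken from the slides: v is a video, c a category, f_k a feature function (for example, an indicator that a given token occurs with a given category), and λ_k its learned weight.

```latex
p(c \mid v) = \frac{\exp\left(\sum_{k} \lambda_k f_k(v, c)\right)}
                   {\sum_{c'} \exp\left(\sum_{k} \lambda_k f_k(v, c')\right)}
```

The denominator sums over all candidate categories c', so the scores for one video form a probability distribution over the classes.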


Data and Models
  • Text models differ in the text sources from which the features are extracted: title, description, comments, etc.
  • All features are token based
  • Infrequent tokens are filtered out to reduce the feature space
  • Token frequencies are calculated over 150K videos
  • Every unique token is counted once per video
  • A token frequency threshold of 10 is used
  • Each token is prefixed with the first letter of the source in which it was found
  • e.g., T:xbox, D:xbox, U:xbox, C:xbox, etc.
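
The feature extraction described above might look roughly like this. It is a sketch under assumptions: whitespace tokenization, field names such as `title` and `description`, counting frequency across all sources together, and treating the threshold as "at least 10" are all choices made here for illustration, not details given in the slides.

```python
from collections import Counter


def document_frequencies(videos):
    """Count each unique token once per video across the collection."""
    df = Counter()
    for fields in videos:
        unique = {tok for text in fields.values() for tok in text.lower().split()}
        df.update(unique)
    return df


def extract_features(fields, df, min_count=10):
    """Prefix every sufficiently frequent token with its source's first letter."""
    features = set()
    for source, text in fields.items():
        prefix = source[0].upper()  # "title" -> "T", "description" -> "D", ...
        for tok in text.lower().split():
            if df[tok] >= min_count:
                features.add(f"{prefix}:{tok}")
    return sorted(features)


# Toy collection, invented for illustration: "xbox" and "review" occur in
# ten videos each, while every "clipN" token occurs in only one.
videos = [{"title": "xbox review", "description": f"clip{i} xbox"} for i in range(10)]
df = document_frequencies(videos)
features = extract_features(videos[0], df)
```

Because of the prefix, the same word yields distinct features depending on where it appears, so the model can weight a title mention differently from a description or comment mention.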


Combined Classifier
  • Used to test whether combining the two views (video based and text based) is beneficial
  • A simple meta-classifier ranks the video categories based on the predictions of the two classifiers
  • Video-based predictions are converted to a probability distribution
  • The distribution from the video-based predictions and the one from MaxEnt (the maximum entropy classifier) are multiplied
  • This approach proved to be effective
  • Idea: each classifier has veto power
  • The final prediction for each video is the category with the highest product score
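
The product combination above can be sketched as follows. The slides do not say how video-based scores are turned into a distribution, so normalizing by their sum is an assumption here, as are the function names and the category labels.

```python
def to_distribution(scores):
    """Normalize positive scores so they sum to 1 (assumed conversion)."""
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}


def combine(video_scores, text_probs):
    """Multiply the two distributions and return the top category and all products."""
    video_probs = to_distribution(video_scores)
    product = {
        category: video_probs.get(category, 0.0) * text_probs.get(category, 0.0)
        for category in set(video_probs) | set(text_probs)
    }
    best = max(product, key=product.get)
    return best, product


# Toy predictions, invented for illustration only.
video_scores = {"sports": 0.9, "news": 0.6, "music": 0.5}
text_probs = {"sports": 0.2, "news": 0.7, "music": 0.1}
best, product = combine(video_scores, text_probs)
```

In this toy case the video view prefers "sports" but the text view assigns it low probability, so the product favors "news". A category that either classifier scores near zero cannot win, which is the veto effect mentioned above.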


Experiments: Evaluation of Text Models
  • The training set contains 100K videos that receive a high-scoring prediction
  • Correct prediction: score of at least 0.85 from Video2Text
  • The text-based prediction must be in the set of video-assigned categories
  • Evaluation was done on two sets of videos:
    • Videos with at least one comment
    • Videos with at least 10 comments
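
One way to picture the evaluation setup above, assuming Video2Text output arrives as (video, category, score) triples. The function names and the category strings are hypothetical; only the 0.85 cutoff and the set-membership criterion come from the slides.

```python
def high_scoring_categories(triples, min_score=0.85):
    """Map each video to the categories Video2Text assigns with a high score."""
    assigned = {}
    for video_id, category, score in triples:
        if score >= min_score:
            assigned.setdefault(video_id, set()).add(category)
    return assigned


def is_correct(text_prediction, video_categories):
    """A text-based prediction is correct if it is among the video-assigned categories."""
    return text_prediction in video_categories


# Toy triples, invented for illustration only.
triples = [
    ("v1", "sports/soccer", 0.93),
    ("v1", "news/politics", 0.40),
    ("v2", "music/rock", 0.85),
    ("v3", "film/trailer", 0.60),
]
assigned = high_scoring_categories(triples)
```

Videos with no category at or above the cutoff, such as `v3` here, simply drop out of the resulting mapping.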


Experiments: Evaluation of Text Models Contd…
  • The best model is TDU+YT+C for both sets
  • This model is used for the comparison against the Video2Text model with human raters
  • This model is also used in the combination model


Experiments with Human Raters
  • A total of 750 videos are extracted, an equal number from each of the 15 YouTube categories
  • Human raters rate each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0)
  • Every pair is rated by 3 human raters
  • The three ratings are summed, normalized (divided by 9), and rounded to give the final score
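
The scoring arithmetic above can be sketched in a few lines. The slides do not state the rounding precision, so two decimal places is an assumption, and the function name is hypothetical.

```python
def normalized_score(ratings, digits=2):
    """Sum three ratings on the 0-3 scale and normalize by the maximum total of 9."""
    if len(ratings) != 3 or any(r not in (0, 1, 2, 3) for r in ratings):
        raise ValueError("expected three ratings, each between 0 and 3")
    return round(sum(ratings) / 9, digits)


# Ratings of 3, 2 and 1 sum to 6, and 6 / 9 rounds to 0.67.
score = normalized_score([3, 2, 1])
```

Three "fully correct" ratings give the maximum score of 1.0, and three "off topic" ratings give 0.0.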


Experiments with Human Raters Contd…
  • A score of at least 0.5 counts as a correct category
  • The text-based model performs significantly better than the video model
  • The combination model improves accuracy further
  • The accuracy of all models increases with the number of comments


Conclusions
  • Text-based approach for assigning categories to videos
  • A competitive classifier can be trained on high-scoring predictions made by a weakly supervised classifier (video features)
  • Text and video models provide complementary views on the data
  • A simple combination model outperforms each model on its own
  • Accurate predictions can be made from user comments
  • Reasons for the impact of comments:
    • Substitute for a proper title
    • Disambiguate the category
    • Help correct wrong predictions
  • Future work: Investigate usefulness of user comments for other tasks