Improved Video Categorization from Text Metadata and User Comments

ACM SIGIR 2011: Research and Development in Information Retrieval

- Katja Filippova

- Keith B. Hall

  • Presenter: Viraja Sameera Bandhakavi



Contributions

  • Analyze sources of text information (title, description, comments, etc.) and show that they provide valuable indications of the topic

  • Show that a text-based classifier trained on the imperfect predictions of a weakly supervised, video content-based classifier is not redundant

  • Demonstrate that a simple model combining the predictions of the two classifiers outperforms each of them taken independently


Research Questions Not Answered by Related Work

  • Can a classifier learn from the imperfect predictions of a weakly supervised classifier? Is its accuracy comparable to the original one? Can a combination of the two classifiers outperform either one?

  • Do the video and text based classifiers capture different semantics?

  • How useful is user provided text metadata? Which source is the most helpful?

  • Can reliable predictions be made from user comments? Can it improve the performance of the classifier?



  • Builds on top of the predictions of Video2Text

  • Uses Video2Text:

    • Requires no labeled data other than video metadata

    • Clusters similar videos and generates a text label for each cluster

    • The resulting label set is larger and better suited for categorization of video content on YouTube



  • Starts from a set of weak labels based on the video metadata

  • Creates a vocabulary of concepts (unigrams or bigrams from the video metadata)

  • Every concept is associated with a binary classifier trained from a large set of audio and video signals

  • Positive instances: videos that mention the concept in the metadata

  • Negative instances: videos that don’t mention the concept in the metadata
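As a sketch of this weak-labeling step (function and field names are hypothetical, and a substring test stands in for proper token matching against the metadata):

```python
def weak_labels(videos, concept):
    """Split videos into positive and negative instances for one concept,
    based only on whether the concept appears in the metadata.
    No manually labeled data is needed for this step."""
    positives = [v for v in videos if concept in v["metadata"]]
    negatives = [v for v in videos if concept not in v["metadata"]]
    return positives, negatives

videos = [
    {"id": 1, "metadata": "xbox gameplay trailer"},
    {"id": 2, "metadata": "cat compilation"},
]
pos, neg = weak_labels(videos, "xbox")
```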



  • Binary classifier is trained for every concept in the vocabulary

    • Accuracy is assessed on a portion of a validation dataset

    • Each iteration uses a subset of unseen videos from the validation set

    • The classifier and concept are retained if precision and recall are above a threshold (0.7 in this paper)

  • The remaining classifiers are used to update the feature vectors of all videos

  • Repeated until the vocabulary size doesn’t change much or the maximum number of iterations is reached

  • Finer grained concepts are learned from concepts added in the previous iteration

  • Labels related to news, sports, film, etc. are grouped together, resulting in the final set of 75 two-level categories
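The iterative filtering described above might be sketched as follows; `train` and `validate` are hypothetical callables standing in for training on audio/video signals and validating precision/recall on held-out videos:

```python
def refine_vocabulary(vocabulary, train, validate, threshold=0.7, max_iters=10):
    """Train a binary classifier per concept, keep it only if both
    precision and recall clear the threshold, and repeat until the
    vocabulary stops shrinking or the iteration cap is reached.
    (Updating every video's feature vector with the retained
    classifiers, as the paper describes, is omitted here.)"""
    kept = {}
    for _ in range(max_iters):
        kept = {}
        for concept in vocabulary:
            clf = train(concept)
            precision, recall = validate(clf)
            if precision >= threshold and recall >= threshold:
                kept[concept] = clf
        if len(kept) == len(vocabulary):  # vocabulary stabilized
            return kept
        vocabulary = list(kept)  # retained concepts seed the next iteration
    return kept
```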


Categorization with Video2Text

  • Use Video2Text to assign two-level categories to videos

  • Total number of binary classifiers (hence labels) limited to 75

  • Output of Video2Text is represented as a list of triples (vi, cj, sij): video vi, assigned category cj, and confidence score sij


Distributed MaxEnt

  • Approach automatically generates training examples for the category classifier

  • Uses the conditional maximum entropy optimization criterion to train the classifiers

  • Results in a conditional probability model over the classes given the YouTube videos.
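A conditional MaxEnt model of this form assigns each category a probability via a softmax over linear scores of the features. A minimal sketch of the prediction step (training by conditional-likelihood maximization is omitted; the weight and feature names are illustrative):

```python
import math

def maxent_predict(weights, features):
    """Conditional probability over categories given a video's sparse
    token features: softmax of per-category linear scores, the standard
    MaxEnt / multinomial-logistic form. `weights` maps each category to
    a {feature: weight} dict."""
    scores = {c: sum(w.get(f, 0.0) for f in features) for c, w in weights.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```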


Data and Models

  • Text models differ in the text sources from which the features are extracted: title, description, comments, etc.

  • Features used are all token based

  • Infrequent tokens are filtered out to reduce feature space

  • Token frequencies are calculated over 150K videos

  • Every unique token is counted once per video

  • Threshold token frequency of 10 is used

  • Tokens are prefixed with the first letter of the source where they were found

  • e.g., T:xbox, D:xbox, U:xbox, C:xbox
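The feature extraction above can be sketched as follows; the field names, and the reading of the T/D/U/C prefixes as title, description, user tags, and comments, are assumptions:

```python
def extract_features(video, doc_freq, min_freq=10):
    """Prefix each token with the first letter of its source and keep
    only tokens whose document frequency (counted once per video over
    the corpus) reaches the threshold of 10. The prefix mapping below
    is an assumption about what T/D/U/C abbreviate."""
    prefixes = {"title": "T", "description": "D", "user_tags": "U", "comments": "C"}
    feats = set()
    for field, prefix in prefixes.items():
        for token in video.get(field, "").lower().split():
            feat = f"{prefix}:{token}"
            if doc_freq.get(feat, 0) >= min_freq:
                feats.add(feat)
    return feats
```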


Combined Classifier

  • Used to see whether combining the two views (video-based and text-based) is beneficial

  • A simple meta classifier is used, which ranks the video categories based on predictions of the two classifiers

  • Video based predictions are converted to a probability distribution

  • The distribution from the video-based prediction and the one from the MaxEnt (Maximum Entropy) classifier are multiplied

  • This approach proved to be effective

  • Idea: each classifier has veto power

  • The final prediction for each video is the one with the highest product score
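The product-of-distributions combination fits in a few lines; `combine` is a hypothetical helper, not the paper's code:

```python
def combine(p_video, p_text):
    """Multiply the two per-category probability distributions and
    return the category with the highest product score. A near-zero
    probability from either model effectively vetoes a category."""
    product = {c: p_video.get(c, 0.0) * p_text.get(c, 0.0) for c in p_video}
    return max(product, key=product.get)
```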


Experiments: Evaluation of Text Models

  • Training data set contains 100K videos that received a high-scoring prediction

  • Correct prediction: score of at least 0.85 from Video2Text

  • Text-based prediction must be in the set of video-assigned categories

  • Evaluation was done on two sets of videos:

    • Videos with at least one comment

    • Videos with at least 10 comments
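The training-set selection described above can be sketched as follows; the function name and the (video, category, score) triple format are illustrative:

```python
def select_training_videos(predictions, threshold=0.85):
    """Keep only (video, category) pairs whose Video2Text confidence
    score reaches the threshold; these high-confidence predictions
    serve as (noisy) labels for training the text-based classifier."""
    return [(v, c) for (v, c, s) in predictions if s >= threshold]
```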


Experiments: Evaluation of Text Models (contd.)

  • The best model is TDU+YT+C for both sets

  • This model is used for comparison against the Video2Text model in the experiments with human raters

  • This model is also used in the Combination model


Experiments with Human Raters

  • A total of 750 videos are sampled, 50 from each of the 15 YouTube categories

  • Human raters rate each (video, category) pair as fully correct (3), partially correct (2), somewhat related (1), or off topic (0)

  • Every pair receives ratings from 3 human raters

  • The three ratings are summed and normalized (by dividing by 9) and rounded off to get the resultant score
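The normalization step is simple arithmetic; a sketch (the rounding granularity, two decimals here, is an assumption):

```python
def rating_score(ratings):
    """Sum the three raters' 0-3 judgments and normalize by the
    maximum possible total of 9 to get the final per-(video, category)
    score in [0, 1]."""
    return round(sum(ratings) / 9, 2)
```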


Experiments with Human Raters (contd.)

  • Score of at least 0.5: category counted as correct

  • The text-based model performs significantly better than the video model

  • Combination model improved accuracy

  • Accuracy of all models increases with number of comments



  • Text based approach for assigning categories to videos

  • Competitive classifier trained on high-scoring predictions made by a weakly supervised classifier (video features)

  • Text and video models provide complementary views on the data

  • Simple combination model outperforms each model on its own

  • Accurate predictions from user comments

  • Reasons for impact of comments:

    • Substitute for a proper title

    • Disambiguate the category

    • Help correct wrong predictions

  • Future work: Investigate usefulness of user comments for other tasks