Machine Learning for multimedia information retrieval

Machine Learning for multimedia information retrieval Zengchang Qin (zengchang.qin@gmail.com) Intelligent Computing and Machine Learning Beihang University IHSMC, Hangzhou, Aug. 2013

About Us ICMLL @ BUAA: icmll.buaa.edu.cn

Outline • Motivation & Introduction • Semantic Gaps • Text Modeling • Content-based Image Retrieval • Cross-modal Information Retrieval • Future Work

Motivation • Massive explosion of “content” on the web. • Content rich in multiple modalities — Text, Images, Videos, Music etc. • There is a need for retrieval systems that are robust to any modalities. • Cross modal text query, eg. retrieval of images from photoblogs using textual query. • Finding images to go along with a text article • Finding music to enhance videos. • Position an image in the text. • Etc.

Motivation • Current retrieval systems are predominantly uni-modal. • The query and retrieved results are from the same modality • Is Google Image search cross-modal retrieval? • No, text is matched to text metadata for the image • The operation would fail, in absence of text modality for the retrieval set. Text Text Images Images Music Music Videos Videos

BBC’s Problem

Related Works • Several multi-modalsystems have been proposed [TRECVID, ImageCLEF, Iria’09, Wang’09, Escalante’08, Pham’07, Snoek’05, Westerveld’02, etc.] • Given a query consisting of multiple modalities, retrieve examples containing the same multiple modalities. • Eg. Combining the modalities into a single modality, combining the outputs of multiple uni-modal systems. Text Text Images Images Music Music Videos Videos • Annotations systems [TRECVID, ImageCLEF, Carneiro’07, Feng’04, Lavrenko’03, Barnard’03, etc] • Given a query from a modality (say image), assign text labels. • Are true cross-modal systems. • However, text modality is constrained to a few keywords.

Cross-modality • Given query from modality A, retrieve results from modality B. • The query and retrieved items are not required to share a common modality. • In this work – I restrict to text and image modalities • Although similar ideas can be applied to other modalities. • Thus, • the retrieval of text in response to a query image. • And, the retrieval of images in response to a query text. Text Text . . . Images Images Music Music Videos Videos

Semantic Gap

An Example – Search Engine

Conceptual Idea

Multimedia Information • Multimedia information is universal. The basic idea is how to convey the semantic concept hidden in different modalities.

Semantic Gap • Population of the Capital of China. • Population, Capital, China

Text Modeling

Road Map • Rock – (geology, music)? • Context could help to solve the polysemy of words. • Similarity:

Latent Semantic Index (Analysis) • A is the original term-by-document matrix, S is a diagonal matrix, T and D are orthonormal matrices Ref: K. Gimpel (2006), Modeling topics, CMU Technical Report

LSI Results

Generative Models • Topics for generating observable words Ref: M. Steyvers, and T. Griffiths (2004), Finding scientific topics, PNAS, 101 (sup. 1), 5228-5235.

Simplex Representation • Topic models --hierarchical Bayesian models for language modeling and document analysis. • A document can be represented by a mixture of latent topics and each topic is a distribution over the vocabulary.

Topic Model Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993-1022 (cited over 3107, from Google scholar)

LDA- Latent Dirichlet Allocation • Latent Dirichlet Allocation (LDA) is one of the most popular and commonly used topic models Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993-1022 (cited over 3107, from Google scholar)

Semantic Distance KL-Distance Query can be extended using WordNet to Improve the quality of topic modeling. Ref: Z. Qin, M. Thint and Z. Huang (2009), IEA/AIE, LNCS 5579, pp. 103-112.

Chinese Language • Its basic structural unit is character instead of word. • Chinese words are written without spaces between them • “计算机会下象棋” • （Computer can play chess） • (Computing the chance of playing chess) • Character is expressive in meaning though word has more specified meaning. • “病” (illness) “人” (person) “病人”(patient) • If two words share the same character, they potentially have sort of semantic relations. • “比赛” (game) “锦标赛(tournament)” Ref: T. Xu (2001), Fundamental structural principles of Chinese semantic syntax in terms of Chinese characters, Applied Linguistics, 1, 3-13.

Test Data • NEWS1W: • This corpus is a news archive for classification provided by the Sogou laboratory. • We selected 10000 documents with 1000 documents per class from the corpus as our experiment data which is named as NEWS1W • BIL3200: • A Chinese-English bilingual corpus collected from the Yeeyan.com. We select 3200 documents for experiments and named the corpus as BIL3200

Chinese Characters and Words

Bilingual Experiments • Train LDA model based on Chinese character, Chinese word and English respectively. • Extracted topics on bilingual corpus BIL3200 Top 20 terms of 3 topics extracted independently from 3 models. These 3 topics are semantically close which are talking about internet service or technology

Word-Character Relations • Former comparison experiments showed topic models based on Chinese character can also capture the semantic meaning of text. • Chinese words “学习”(study) and “学生”(student) are literally related by sharing the same character “学”. • And these two words are semantically related and may have a high probability to occur in the same context. • New words are created constantly while the characters constitute the new word probably already appeared in the history docs • We proposed Chinese character-word topic model (CWTM) to consider the character-word relation by placing an asymmetric prior on the topic-word distribution of the standard LDA.

Word-Character Relation Model The document-topic (left) part of the model is exactly the same as LDA. Wd,n-- nth word in the dth document. -- kth topic over Chinese word -- kth topic over Chinese character R -- Character-Word relation Ref: Q. Zhao, Z. Qin and T. Wan (2011), ICONIP.

Discussions • Some of the words relevant to sports, even not appeared in the training data, were assigned relatively higher probabilities in topics about sports by CWTM. • For example, the word “主客场” (home and away), though never appeared in training documents, its component characters “主”, “ 客”, “场”are the components of those words appeared in training data. Therefore, it obtained a higher probability in CWTM. • On the other hand, not surprisingly, none of these words were ranked into top1000 words in the corresponding topic extracted by standard LDA model.

Topics

Classification Results

Topic-based Browser

Content-based image retrieval

Bag-of-Features Ref: X. Yuan, J. Yu, Z. Qin and T. Wan, A SIFT-LBP image retrieval model based on a bag of features, ICIP 2011

SIFT-LBP Feature Our Framework Retrieval Results Ref: Jing Yu, Zengchang Qin, Tao Wan and Xi Zhang, Feature Integration Analysis of Bag-of-Features Model for Image Retrieval, Neurocomputing., 2013.

What Color is an Object?

Single Label and Multi-labels

Cross-modal information retrieval

Basic Idea • No natural correspondence between representations of different modalities. • For example, we use Bag-of-Features and Topic Model to represent • images and texts, respectively. • Images: vectors over visual textures ( ) • Text: vectors of topics ( ) • How do we compute similarity? Image Space Text Space The population of Turkey stood at 71.5 million with a growth rate of 1.31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70.5%. People within the 15–64 age group constitute 66.5% of the total population, the 0–14 age group corresponds 26.4% of th In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the Like most of the UK, the Manchester area mobilised extensively during World War II. For example, casting and machining expertise at Beyer, Peacock and Company's locomotive works in Gorton was switched to bomb making; Dunlop's rubber works in Chorlton-on-Medlock made barrage balloons; ? ? SkyBomb Terrorist India Success Weather Prime President Navy Army Sun Books Music Food Poverty Iran America Martin Luther King's presence in Birmingham was not welcomed by all in the black community. A black attorney was quoted in ''Time'' magazine as saying, "The new administration should have been given a chance to confer with the various groups interested in change. … In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the New Theatre (renamed the Noël Coward Theatre in 2006), his first full-length play in the West End.Thaxter, John. British Theatre Guide, 2009 Neville Cardus's praise in ''The Manchester Guardian''

Semantic Space • Learn mappings thatmaps different modalities into intermediate spaces . • Given an image query in the Topics of Features Space, the cross-modal retrieval system finds the nearest neighbor of text in the Semantic Space. • Similarly for image query. • The task now is to design these mappings.

System Overview Semantic Concept 1 Original Image Image Classifiers Image Space Bag-of-Features Semantic Concept V Topic Model Correlated Semantic Space Original Text Text Classifiers Semantic Concept 2 Text Space Retrieval Procedure

Sample Results

Data Sample A number of variants were built on the same chassis as the TAM tank. The original program called for the design of an infantry fighting vehicle, and in 1977 the program finished manufacturing the prototype of the ''Vehículo de Combate Transporte de Personal'' (Personnel Transport Combat Vehicle), or VCTP. The VCTP is able to transport a squad of 12 men, including the squad leader and nine riflemen. The squad leader is situated in the turret of the vehicle; one rifleman sits behind him and another six are seated in the chassis, the eighth manning the hull machine gun and the ninth situated in the turret with the gunner. All personnel can fire their weapons from inside the vehicle, and the VCTP's turret is armed with Rheinmetall's Rh-202 20 millimeter (.79 in) autocannon. The VCTP holds 880 rounds for the autocannon, including subcaliber armor-piercing DM63 rounds. It is also armed with a 7.62 millimeter FN MAG 60-20 mounted on the turret roof. Infantry can dismount through a door on the rear of the hull. (…) Warfare The population of Turkey stood at 71.5 million with a growth rate of 1.31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70.5%. People within the 15–64 age group constitute 66.5% of the total population, the 0–14 age group corresponds 26.4% of the population, while 65 years and higher of age correspond to 7.1% of the total population. Life expectancy stands at 70.67 years for men and 75.73 years for women, with an overall average of 73.14 years for the populace as a whole. Education is compulsory and free from ages 6 to 15. The literacy rate is 95.3% for men and 79.6% for women, with an overall average of 87.4%. The low figures for women are mainly due to the traditional customs of the Arabs and Kurds who live in the southeastern provinces of the country. Article 66 of the Turkish Constitution defines a "Turk" as "anyone who is bound to the Turkish state through the bond of citizenship"; therefore, the legal use of the term "Turkish" as a citizen of Turkey is different from the ethnic definition. (…) Geography and Places Despite agreeing on most issues regarding the protection of national parks, friction between the NPA and NPS was seemingly unavoidable. Mather and Yard disagreed on many issues; whereas Mather was not interested in the protection of wildlife and accepted the Biological Survey's efforts to exterminate predators within parks, Yard vehemently criticized the program as early as 1924 (Fox, p. 204). Yard was also highly critical of Mather's administration of the parks. Mather advocated plush accommodations, city comforts and various entertainments to encourage park visitation. These plans clashed with Yard's ideals, and he considered such urbanization of the nation's parks misguided. While visiting Yosemite National Park in 1926, he stated that the valley was "lost" after finding crowds, automobiles, jazz music and even a bear show (Sutter, p. 126). In 1924, the United States Forest Service initiated a program to set aside "primitive areas" in the national forests that protected wilderness while opening it to use. (…) Culture and Society

Data for Experiments • Test the model on a dataset English Wikipedia built by UCSD SVCL using Wikipedia’s featured articles • 2700 articles, selected and reviewed by Wikipedia’s editors since 2009. • The articles are accompanied by one or more pictures from the Wikimedia Commons • Each article is split into sections that may or may not have an assigned image (sections without images were dropped) • Each article is categorized into one of 29 categories (only the 10 most populated categories were chosen) • Each ‘document’ in the proposed set is a ‘section of Wikipedia featured article’ and its ‘associated image’. • Test the model on a dataset TVGraz built by I. Khan et. al. using Google image search results. • Containing 2596 image-text pairs coming from 10 categories .

TVGraz • TVGraz , built by I. Khan et. al. using Google image search results, contains 2596 image-text pairs coming from 10 categories .

Experiments on the English Datasets Experimental Results Mean Average Precision (MAP) of the Retrieval Performance [1] [1] [1] N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimediaretrieval[A]. In ACM Multimedia[C], pp. 251-260,2010.

Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on theEn-Wikipedia dataset. (Left) MAP of the query images; (Middle) MAPof the query texts; (Right) Average performance for both the queryimages an the query texts.

Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on theTVGraz dataset. (Left) MAP of the query images; (Middle) MAPof the query texts; (Right) Average performance for both the queryimages an the query texts.

Machine Learning for multimedia information retrieval

Machine Learning for multimedia information retrieval

Presentation Transcript

Statistical Learning Methods for Information Retrieval

Design and Creation of Ontologies for Environmental (Multimedia) Information Retrieval *

Support Vector Machine Active Learning for Image Retrieval

Multimedia Information Retrieval

Learning Techniques for Information Retrieval

Multimedia Information Retrieval Systems

CSE 635 Multimedia Information Retrieval

Genetic Learning for Information Retrieval

A Machine Learning Approach for Improved BM25 Retrieval

Machine Learning and Information Retrieval

CSE 635 Multimedia Information Retrieval

Learning to Rank for Information Retrieval

Introduction to Machine Learning for Information Retrieval

Machine Learning for Information Extraction

Multimedia Information Retrieval

Multimedia Information Retrieval

Machine Learning for Personal Information Management