1 / 59

Machine Learning for multimedia information retrieval

Machine Learning for multimedia information retrieval. Zengchang Qin (zengchang.qin@gmail.com) Intelligent Computing and Machine Learning Beihang University IHSMC, Hangzhou, Aug. 2013. About Us. ICMLL @ BUAA: icmll.buaa.edu.cn. Outline. Motivation & Introduction Semantic Gaps

jerrod
Download Presentation

Machine Learning for multimedia information retrieval

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Machine Learning for multimedia information retrieval Zengchang Qin (zengchang.qin@gmail.com) Intelligent Computing and Machine Learning Beihang University IHSMC, Hangzhou, Aug. 2013

  2. About Us ICMLL @ BUAA: icmll.buaa.edu.cn

  3. Outline • Motivation & Introduction • Semantic Gaps • Text Modeling • Content-based Image Retrieval • Cross-modal Information Retrieval • Future Work

  4. Motivation • Massive explosion of “content” on the web. • Content rich in multiple modalities — Text, Images, Videos, Music etc. • There is a need for retrieval systems that are robust to any modalities. • Cross modal text query, eg. retrieval of images from photoblogs using textual query. • Finding images to go along with a text article • Finding music to enhance videos. • Position an image in the text. • Etc.

  5. Motivation • Current retrieval systems are predominantly uni-modal. • The query and retrieved results are from the same modality • Is Google Image search cross-modal retrieval? • No, text is matched to text metadata for the image • The operation would fail, in absence of text modality for the retrieval set. Text Text Images Images Music Music Videos Videos

  6. BBC’s Problem

  7. Related Works • Several multi-modalsystems have been proposed [TRECVID, ImageCLEF, Iria’09, Wang’09, Escalante’08, Pham’07, Snoek’05, Westerveld’02, etc.] • Given a query consisting of multiple modalities, retrieve examples containing the same multiple modalities. • Eg. Combining the modalities into a single modality, combining the outputs of multiple uni-modal systems. Text Text Images Images Music Music Videos Videos • Annotations systems [TRECVID, ImageCLEF, Carneiro’07, Feng’04, Lavrenko’03, Barnard’03, etc] • Given a query from a modality (say image), assign text labels. • Are true cross-modal systems. • However, text modality is constrained to a few keywords.

  8. Cross-modality • Given query from modality A, retrieve results from modality B. • The query and retrieved items are not required to share a common modality. • In this work – I restrict to text and image modalities • Although similar ideas can be applied to other modalities. • Thus, • the retrieval of text in response to a query image. • And, the retrieval of images in response to a query text. Text Text . . . Images Images Music Music Videos Videos

  9. Semantic Gap

  10. An Example – Search Engine

  11. An Example – Search Engine

  12. An Example – Search Engine

  13. Conceptual Idea

  14. Multimedia Information • Multimedia information is universal. The basic idea is how to convey the semantic concept hidden in different modalities.

  15. Semantic Gap • Population of the Capital of China. • Population, Capital, China

  16. Text Modeling

  17. Road Map • Rock – (geology, music)? • Context could help to solve the polysemy of words. • Similarity:

  18. Latent Semantic Index (Analysis) • A is the original term-by-document matrix, S is a diagonal matrix, T and D are orthonormal matrices Ref: K. Gimpel (2006), Modeling topics, CMU Technical Report

  19. LSI Results

  20. Generative Models • Topics for generating observable words Ref: M. Steyvers, and T. Griffiths (2004), Finding scientific topics, PNAS, 101 (sup. 1), 5228-5235.

  21. Simplex Representation • Topic models --hierarchical Bayesian models for language modeling and document analysis. • A document can be represented by a mixture of latent topics and each topic is a distribution over the vocabulary.

  22. Topic Model Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993-1022 (cited over 3107, from Google scholar)

  23. LDA- Latent Dirichlet Allocation • Latent Dirichlet Allocation (LDA) is one of the most popular and commonly used topic models Ref: D. Blei, A. Ng, and M. Jordan (2003), Latent Dirichlet allocation, Journal of Machine Learning Research, 3:993-1022 (cited over 3107, from Google scholar)

  24. Semantic Distance KL-Distance Query can be extended using WordNet to Improve the quality of topic modeling. Ref: Z. Qin, M. Thint and Z. Huang (2009), IEA/AIE, LNCS 5579, pp. 103-112.

  25. Chinese Language • Its basic structural unit is character instead of word. • Chinese words are written without spaces between them • “计算机会下象棋” • (Computer can play chess) • (Computing the chance of playing chess) • Character is expressive in meaning though word has more specified meaning. • “病” (illness) “人” (person) “病人”(patient) • If two words share the same character, they potentially have sort of semantic relations. • “比赛” (game) “锦标赛(tournament)” Ref: T. Xu (2001), Fundamental structural principles of Chinese semantic syntax in terms of Chinese characters, Applied Linguistics, 1, 3-13.

  26. Test Data • NEWS1W: • This corpus is a news archive for classification provided by the Sogou laboratory. • We selected 10000 documents with 1000 documents per class from the corpus as our experiment data which is named as NEWS1W • BIL3200: • A Chinese-English bilingual corpus collected from the Yeeyan.com. We select 3200 documents for experiments and named the corpus as BIL3200

  27. Chinese Characters and Words

  28. Bilingual Experiments • Train LDA model based on Chinese character, Chinese word and English respectively. • Extracted topics on bilingual corpus BIL3200 Top 20 terms of 3 topics extracted independently from 3 models. These 3 topics are semantically close which are talking about internet service or technology

  29. Word-Character Relations • Former comparison experiments showed topic models based on Chinese character can also capture the semantic meaning of text. • Chinese words “学习”(study) and “学生”(student) are literally related by sharing the same character “学”. • And these two words are semantically related and may have a high probability to occur in the same context. • New words are created constantly while the characters constitute the new word probably already appeared in the history docs • We proposed Chinese character-word topic model (CWTM) to consider the character-word relation by placing an asymmetric prior on the topic-word distribution of the standard LDA.

  30. Word-Character Relation Model The document-topic (left) part of the model is exactly the same as LDA. Wd,n-- nth word in the dth document. -- kth topic over Chinese word -- kth topic over Chinese character R -- Character-Word relation Ref: Q. Zhao, Z. Qin and T. Wan (2011), ICONIP.

  31. Discussions • Some of the words relevant to sports, even not appeared in the training data, were assigned relatively higher probabilities in topics about sports by CWTM. • For example, the word “主客场” (home and away), though never appeared in training documents, its component characters “主”, “ 客”, “场”are the components of those words appeared in training data. Therefore, it obtained a higher probability in CWTM. • On the other hand, not surprisingly, none of these words were ranked into top1000 words in the corresponding topic extracted by standard LDA model.

  32. Topics

  33. Classification Results

  34. Topic-based Browser

  35. Content-based image retrieval

  36. Bag-of-Features Ref: X. Yuan, J. Yu, Z. Qin and T. Wan, A SIFT-LBP image retrieval model based on a bag of features, ICIP 2011

  37. SIFT-LBP Feature Our Framework Retrieval Results Ref: Jing Yu, Zengchang Qin, Tao Wan and Xi Zhang, Feature Integration Analysis of Bag-of-Features Model for Image Retrieval, Neurocomputing., 2013.

  38. What Color is an Object?

  39. Single Label and Multi-labels

  40. Cross-modal information retrieval

  41. Basic Idea • No natural correspondence between representations of different modalities. • For example, we use Bag-of-Features and Topic Model to represent • images and texts, respectively. • Images: vectors over visual textures ( ) • Text: vectors of topics ( ) • How do we compute similarity? Image Space Text Space The population of Turkey stood at 71.5 million with a growth rate of 1.31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70.5%. People within the 15–64 age group constitute 66.5% of the total population, the 0–14 age group corresponds 26.4% of th In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the Like most of the UK, the Manchester area mobilised extensively during World War II. For example, casting and machining expertise at Beyer, Peacock and Company's locomotive works in Gorton was switched to bomb making; Dunlop's rubber works in Chorlton-on-Medlock made barrage balloons; ? ? SkyBomb Terrorist India Success Weather Prime President Navy Army Sun Books Music Food Poverty Iran America Martin Luther King's presence in Birmingham was not welcomed by all in the black community. A black attorney was quoted in ''Time'' magazine as saying, "The new administration should have been given a chance to confer with the various groups interested in change. … In 1920, at the age of 20, Coward starred in his own play, the light comedy ''I'll Leave It to You''. After a tryout in Manchester, it opened in London at the New Theatre (renamed the Noël Coward Theatre in 2006), his first full-length play in the West End.Thaxter, John. British Theatre Guide, 2009 Neville Cardus's praise in ''The Manchester Guardian''

  42. Semantic Space • Learn mappings thatmaps different modalities into intermediate spaces . • Given an image query in the Topics of Features Space, the cross-modal retrieval system finds the nearest neighbor of text in the Semantic Space. • Similarly for image query. • The task now is to design these mappings.

  43. System Overview Semantic Concept 1 Original Image Image Classifiers Image Space Bag-of-Features Semantic Concept V Topic Model Correlated Semantic Space Original Text Text Classifiers Semantic Concept 2 Text Space Retrieval Procedure

  44. Sample Results

  45. Data Sample A number of variants were built on the same chassis as the TAM tank. The original program called for the design of an infantry fighting vehicle, and in 1977 the program finished manufacturing the prototype of the ''Vehículo de Combate Transporte de Personal'' (Personnel Transport Combat Vehicle), or VCTP. The VCTP is able to transport a squad of 12 men, including the squad leader and nine riflemen. The squad leader is situated in the turret of the vehicle; one rifleman sits behind him and another six are seated in the chassis, the eighth manning the hull machine gun and the ninth situated in the turret with the gunner. All personnel can fire their weapons from inside the vehicle, and the VCTP's turret is armed with Rheinmetall's Rh-202 20 millimeter (.79 in) autocannon. The VCTP holds 880 rounds for the autocannon, including subcaliber armor-piercing DM63 rounds. It is also armed with a 7.62 millimeter FN MAG 60-20 mounted on the turret roof. Infantry can dismount through a door on the rear of the hull. (…) Warfare The population of Turkey stood at 71.5 million with a growth rate of 1.31% per annum, based on the 2008 Census. It has an average population density of 92 persons per km². The proportion of the population residing in urban areas is 70.5%. People within the 15–64 age group constitute 66.5% of the total population, the 0–14 age group corresponds 26.4% of the population, while 65 years and higher of age correspond to 7.1% of the total population. Life expectancy stands at 70.67 years for men and 75.73 years for women, with an overall average of 73.14 years for the populace as a whole. Education is compulsory and free from ages 6 to 15. The literacy rate is 95.3% for men and 79.6% for women, with an overall average of 87.4%. The low figures for women are mainly due to the traditional customs of the Arabs and Kurds who live in the southeastern provinces of the country. Article 66 of the Turkish Constitution defines a "Turk" as "anyone who is bound to the Turkish state through the bond of citizenship"; therefore, the legal use of the term "Turkish" as a citizen of Turkey is different from the ethnic definition. (…) Geography and Places Despite agreeing on most issues regarding the protection of national parks, friction between the NPA and NPS was seemingly unavoidable. Mather and Yard disagreed on many issues; whereas Mather was not interested in the protection of wildlife and accepted the Biological Survey's efforts to exterminate predators within parks, Yard vehemently criticized the program as early as 1924 (Fox, p. 204). Yard was also highly critical of Mather's administration of the parks. Mather advocated plush accommodations, city comforts and various entertainments to encourage park visitation. These plans clashed with Yard's ideals, and he considered such urbanization of the nation's parks misguided. While visiting Yosemite National Park in 1926, he stated that the valley was "lost" after finding crowds, automobiles, jazz music and even a bear show (Sutter, p. 126). In 1924, the United States Forest Service initiated a program to set aside "primitive areas" in the national forests that protected wilderness while opening it to use. (…) Culture and Society

  46. Data for Experiments • Test the model on a dataset English Wikipedia built by UCSD SVCL using Wikipedia’s featured articles • 2700 articles, selected and reviewed by Wikipedia’s editors since 2009. • The articles are accompanied by one or more pictures from the Wikimedia Commons • Each article is split into sections that may or may not have an assigned image (sections without images were dropped) • Each article is categorized into one of 29 categories (only the 10 most populated categories were chosen) • Each ‘document’ in the proposed set is a ‘section of Wikipedia featured article’ and its ‘associated image’. • Test the model on a dataset TVGraz built by I. Khan et. al. using Google image search results. • Containing 2596 image-text pairs coming from 10 categories .

  47. TVGraz • TVGraz , built by I. Khan et. al. using Google image search results, contains 2596 image-text pairs coming from 10 categories .

  48. Experiments on the English Datasets Experimental Results Mean Average Precision (MAP) of the Retrieval Performance [1] [1] [1] N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimediaretrieval[A]. In ACM Multimedia[C], pp. 251-260,2010.

  49. Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on theEn-Wikipedia dataset. (Left) MAP of the query images; (Middle) MAPof the query texts; (Right) Average performance for both the queryimages an the query texts.

  50. Experiments on the English Datasets Experimental Results Comparison of retrieval MAP on all the categories on theTVGraz dataset. (Left) MAP of the query images; (Middle) MAPof the query texts; (Right) Average performance for both the queryimages an the query texts.

More Related