
Toward Multimedia: A String Pattern-based Passage Ranking Model for Video Question Answering



    1. Toward Multimedia: A String Pattern-based Passage Ranking Model for Video Question Answering

    2. 2 Introduction With the rapid expansion of video data, there is an increasing demand for retrieving and browsing videos. Current video retrieval techniques support only retrieving related “documents”. Providing multimedia Q/A implies: video content extraction (objects, sounds, speech, images, motions, etc.) and text-based Q/A (pinpointing exact answers rather than returning whole documents).

    3. 3 Related Works (Video) Extracting video content is a very difficult but important task: objects, sounds, speech, images, motions, etc. Among these, text in videos, especially closed captions, is the most powerful feature; common OCR (optical character recognition) outperforms SR (speech recognition). The well-known Informedia project (Wactlar, 2000) and the TRECVID tracks (Over et al., 2005) serve simple retrieval only, e.g., “Find shots of [a ship or boat]”.

    4. 4 Related Works (Text Q/A) TREC-QA provided the pilot competition on extracting answers from huge document corpora. Most top-performing Q/A systems required combining many domain- and language-dependent resources: parsers (Charniak, 2002), named entity taggers (Florian et al., 2003), elaborate ontologies (Yang et al., 2003), and WordNet (almost every Q/A study). These are very difficult to port to different languages or domains.

    5. 5 Related Works (Video Q/A) Lin et al. (2001) presented the earliest work: simple OCR techniques combined with simple term weighting schemes; the OCR was not advanced and the thesaurus was hand-created. Yang et al. (2003) proposed an early video Q/A system that made use of many linguistic resources (NER, parser, WordNet, WWW, ...), applied news articles to correct speech errors, and used keyword-frequency-based answer selection. Cao et al. (2004) designed a domain-dependent Q/A system for online education with pattern-based (manually constructed) answer selection. Wu et al. (2004) showed the first cross-language video Q/A system, applying a density-based method for answer selection and converting each language into English (it supports English queries only). Zhang and Nunamaker (2004) developed a video Q/A technique based on retrieving short clips; the clips were segmented manually and a simple TF-IDF-like weighting was applied.

    6. 6 In this paper We propose a passage ranking algorithm that extends text Q/A to video Q/A. Users interact with our system through natural language questions, and passages that answer the question are returned. Lin et al. (2003) showed that users prefer passages over short answers since passages contain context. Our method is multilingually portable and effective.

    7. 7 Outline Introduction Related works Our videoQ/A Method Video Processing Passage Ranking Algorithm Experiments Settings Results Conclusion

    8. 8 System Architecture

    9. 9 Video Processing

    10. 10 Video Processing Text localization: Purpose: Localize the text areas in frames Related works: Top-down (Cai et al., 2002) Bottom-up (Fan et al., 2001)

    11. 11 Video Processing Extraction & Tracking: Purpose: Extract text color and multi-frame integration Related works: Text extraction (Ryu et al., 2005) Multi-frame integration

    12. 12 Video Processing OCR Purpose: Recognizing the characters in text components Related works: Simple OCR (Wu et al., 2004; Hong et al., 1995)

    13. 13 System Architecture

    14. 14 Chinese word segmentation There is no explicit boundary between words in most Asian languages (Chinese, Japanese, Korean, etc.). We can adopt two approaches to extract words from such text: a well-trained Chinese word segmenter (SIGHAN bakeoff; see Levow, 2006) or N-grams (widely used for NTCIR cross-language retrieval; see Kishida et al., 2007).
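The character-N-gram alternative above can be sketched as follows (a minimal illustration, not the authors' implementation; the function name and defaults are ours):

```python
def char_ngrams(text, n=2):
    """Split unsegmented text into overlapping character n-grams.

    A simple alternative to a trained word segmenter for languages
    without explicit word boundaries (illustrative sketch only).
    """
    # Drop whitespace so n-grams never span token gaps
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
```

Applied to an unsegmented Chinese sentence, this yields overlapping character bigrams that can serve directly as index terms, with no trained segmenter required.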

    15. 15 An example

    16. 16 System Architecture

    17. 17 What is a sentence?

    18. 18 Document Retrieval and Passage Segmentation Passage segmentation: a sliding window of size 3 with a one-sentence overlap with the previous passage. Initial retrieval model: Okapi BM25 (Robertson et al., 2000; Savoy, 2005), keeping the top-1000 relevant passages for further re-ranking. One can replace BM25 with better retrieval models.
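A minimal sketch of the sliding-window segmentation described above, assuming sentences arrive as a list of strings (the function name and joining are our assumptions):

```python
def segment_passages(sentences, size=3, overlap=1):
    """Group sentences into overlapping fixed-size passages.

    With size=3 and overlap=1, each passage repeats the last
    sentence of the previous passage as its first sentence.
    """
    step = size - overlap
    passages = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + size]
        passages.append(" ".join(window))
        if start + size >= len(sentences):
            break  # last window already reaches the end of the document
    return passages
```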

    19. 19 System Architecture

    20. 20 Ranking Algorithm Related works Introduction Limitations The importance of N-gram and word density Our method Suffix Tree Algorithms for finding the best match sequence Preprocessing Re-tokenization and Weighting

    21. 21 Ranking Algorithm (Related works) The ranking model receives the segmented passages and ranks the top-N passages in response to the question. Tellex et al. (2003) compared seven portable passage retrieval algorithms; density-based methods are the best. Cui et al. (2005) further improved the density-based method by 17% in relative MRR (mean reciprocal rank), but it requires preparing training data, WordNet, and parsers first.

    22. 22 Ranking Algorithm (Related works) Parsing is very complex work, particularly for Chinese: word segmentation, part-of-speech tagging, and constituent/dependency parsing. Full parsing is also very slow: one sentence costs 0.8-1.3 seconds (Charniak's parser). Besides, developing labeled corpora is laborious, and porting a trained passage ranker to another language is also very expensive. And what about OCR errors?

    23. 23 Ranking Algorithm (N-gram) Traditional ranking algorithms are biased toward giving more weight to high-frequency words than to N-grams. N-grams are useful and far less ambiguous than their individual unigrams: for example, the N-gram “Optical Character Recognition” is much more specific than the separate words “Optical”, “Character”, and “Recognition”.

    24. 24 Ranking Algorithm (Density) A dense “distinct” word distribution is useful: if a passage contains abundant “identical” question words, potential answer words might occur nearby (the basic assumption of density-based algorithms). Note that the “distinct” word distribution differs from the classic word distribution: classical density-based methods simply count matched word occurrences (the first term of SiteQ's method is keyword frequency). In comparison, we focus on finding the single best-fit match word for each question term.

    25. 25 Ranking Algorithm (Frequency) Frequency is not always useful: a passage usually contains Chinese stopwords and punctuation, and in our case many unrecognizable or false-alarm OCR words also appear.

    26. 26 Ranking Algorithm Our ranking algorithm takes both “views” into account: in other words, it finds the best match sequence for the passage, yielding “long” N-gram matches and a “dense” N-gram distribution. In addition, each match word is restricted to appear at most once in the sentence.

    27. 27 Ranking Algorithm Unfortunately, finding the best match is an NP-complete problem => O(2^n) (match or mismatch for each word). Thus we propose an algorithm that approximately finds the best-fit match sequence to be scored. Outline: a probabilistic view to score the importance of a passage; an algorithm to find the match sequence; an example of estimating the score; time complexity analysis; comparison with the “density” and “frequency” methods.

    28. 28 System Architecture

    29. 29 Question Analysis First, we remove all Chinese stopwords from the given question using a maximum-N-gram matching algorithm: decreasingly check the N-gram, (N-1)-gram, ..., 2-gram, and 1-gram in the sentence. The stoplist was built by estimating N-gram (N = 1, 2, 3) frequencies, sorting them, and having a native Chinese expert make the selection: 897 entries = 571 (English stopwords) + 326 (semi-manual).
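The longest-first matching step above can be sketched as follows (an illustrative sketch only; the tokenization and space-joined stoplist format are our assumptions):

```python
def remove_stopwords(tokens, stoplist, max_n=3):
    """Greedy maximum-n-gram stopword removal.

    At each position, try to match the longest stopword n-gram
    (up to max_n tokens) before falling back to shorter ones.
    """
    kept, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 0, -1):
            gram = tokens[i:i + n]
            if len(gram) == n and " ".join(gram) in stoplist:
                i += n  # skip the matched stopword n-gram
                break
        else:
            kept.append(tokens[i])  # no stopword matched here
            i += 1
    return kept
```

Matching longest-first means a multi-word stop phrase is removed as a unit before its individual words are tested, which is the point of checking the N-gram before the (N-1)-gram.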

    30. 30 Question Suffix Tree

    31. 31 Passage Suffix Tree

    32. 32 String Matching By inserting the question string into the Passage Suffix Tree, we can find the common subsequences of the question string.

    33. 33 String Matching Hence we observe the following common subsequences. Similarly, we can insert the passage string into the Question Suffix Tree to obtain:
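The effect of this suffix-tree matching can be illustrated with a brute-force stand-in that returns the maximal common substrings of two strings (the system itself uses suffix trees for efficiency; this sketch and its names are ours):

```python
def common_substrings(question, passage, min_len=2):
    """Find maximal common substrings shared by question and passage.

    A brute-force stand-in for suffix-tree matching: same output,
    much worse asymptotic cost.
    """
    found = set()
    for i in range(len(question)):
        for j in range(len(passage)):
            # Extend the match at (i, j) as far as it goes
            k = 0
            while (i + k < len(question) and j + k < len(passage)
                   and question[i + k] == passage[j + k]):
                k += 1
            if k >= min_len:
                found.add(question[i:i + k])
    # Keep only maximal matches (drop substrings of longer matches)
    return {s for s in found
            if not any(s != t and s in t for t in found)}
```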

    34. 34 Scoring Function The passage score combines two terms; λ is used to adjust the importance of the density score. QW_Density(Q, P) estimates the question-word density in P, and QW_Weight(Q, P) measures the sum of the weights of the matched question words in P.
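One plausible reading of this combination is a linear interpolation of the two terms (the exact form and the default λ below are assumptions; the slide only states that λ weights the density score):

```python
def passage_score(qw_density, qw_weight, lam=0.5):
    """Combine the density and weight terms into one passage score.

    The interpolation form and the default lam are illustrative
    assumptions, not the paper's published formula.
    """
    return lam * qw_density + (1.0 - lam) * qw_weight
```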

    35. 35 QW_Density Quantifies the weighted word density distribution; it modifies SiteQ's second term. In line with our hypothesis, we favor long string patterns.

    36. 36 Discriminative Power The discriminative power should also be taken into consideration.

    37. 37 QW_Density By re-tokenizing and re-weighting, the QW_Density can be computed as follows

    38. 38 QW_Weight This term estimates how much content information the passage carries given the question.

    39. 39 Combining Density and Weight We further take the first two or last two sentences into account, since answers might occur before/after the sentences that contain useful term matches.

    40. 40 Outline Introduction Related works Our videoQ/A method Experiments Settings Results Conclusion

    41. 41 Settings The test question set (about 250 questions) was mainly collected from Web logs. We use MRR, precision, and pattern-recall scores to evaluate the proposed Q/A method (pattern-recall: the number of answer patterns found in the top-5 ranks). To compare with the state of the art, we adopted six effective and multilingually portable ranking algorithms: TFIDF, BM25, language model, INQUERY, cosine, and SiteQ.

    42. 42 For askers

    43. 43 For askers

    44. 44 Statistics of the collected Discovery videos

    45. 45 Comparison

    46. 46 Results (character-level)

    47. 47 Results (word-level)

    48. 48 Large-scale experiments

    49. 49 Auto-Translate into English

    50. 50 Re-ranking the six retrieval models

    51. 51 Conclusion This paper proposes a new passage ranking algorithm for Chinese video Q/A. 250 collected questions were evaluated over 75.6 hours of video; the method outperforms BM25, the language model, INQUERY, etc. Applying word segmentation to video Q/A is not a good idea: it drops scores by about 10% on average for most retrieval models. Can we parse OCR transcripts as ordinary articles? Word segmentation (0.94), POS tagging (0.91-0.92), parsing (0.846).

    52. 52 Future Directions Speech is another important clue; we are now investigating some well-known toolkits (CMU's Sphinx, Cambridge's HTK). Other directions: effectively parsing the transcripts (especially for Asian languages), improving mis-recognized and false-alarm words, and domain adaptation (from news articles to video).

    53. 53 Thanks Prof. Yue-Shi Lee and Prof. Chia-Hui Chang gave a great amount of comments and support. Thanks also to the Database Lab (National Central Univ.) and the Data Mining Lab (Ming Chuan Univ.) for usability testing and candid comments.

    54. 54 References
    Yang, H., Chaisorn, L., Zhao, Y., Neo, S. Y., & Chua, T. S. (2003). VideoQA: question answering on news video. In Proceedings of the 11th ACM International Conference on Multimedia (ACM MM) (pp. 632-641).
    Zhang, D., & Nunamaker, J. (2004). A natural language approach to content-based video indexing and retrieval for interactive e-learning. IEEE Transactions on Multimedia, 6, 450-458.
    Wu, Y. C., Lee, Y. S., & Chang, C. H. (2004). CLVQ: cross-language video question/answering system. In Proceedings of the 6th IEEE International Symposium on Multimedia Software Engineering (MSE) (pp. 294-301).
    Lyu, M. R., Song, J., & Cai, M. (2005). A comprehensive method for multilingual video text detection, localization, and extraction. IEEE Transactions on Circuits and Systems for Video Technology, 15, 243-255.
    Lienhart, R., & Wernicke, A. (2002). Localizing and segmenting text in images and videos. IEEE Transactions on Circuits and Systems for Video Technology, 12, 256-268.
    Lin, C. J., Liu, C. C., & Chen, H. H. (2001). A simple method for Chinese video OCR and its application to question answering. Computational Linguistics and Chinese Language Processing, 6, 11-30.
    Cao, J., & Nunamaker, J. F. (2004). Question answering on lecture videos: a multifaceted approach. In Proceedings of the Joint Conference on Digital Libraries (JCDL) (pp. 214-215).
    Cao, J., Roussinov, D., Robles, J., & Nunamaker, J. F. (2005). Automated question answering from videos: NLP vs. pattern matching. In Proceedings of the Hawaii International Conference on System Sciences (HICSS).
    Kishida, K., Chen, K. H., Lee, S., Kuriyama, K., Kando, N., Chen, H. H., & Myaeng, S. H. (2007). Overview of the CLIR task at the sixth NTCIR workshop. In Proceedings of the 6th NTCIR Workshop.

    55. 55 SPVQA system

    56. 56 SPVQA system

    57. 57 SPVQA system

    58. 58 SPVQA system

    59. 59 Online-Demonstration

    60. 60 Discussions Our method outperforms TF-based and density-based methods and is suitable for video OCR transcripts, even when OCR errors appear within keywords. (A Chinese question/answer example followed; its characters were lost in the transcript.)

    61. 61

    62. 62 Error Analysis OCR errors in key question words: for “Where is the headquarters of the FBI?”, most occurrences of “FBI” were incorrectly recognized as “98I” or “28I”. Synonyms and anaphora: our method focuses on surface terms and fails to resolve “it”, “he”, “she”, etc.

    63. 63 Error Analysis Lack of language-dependent analysis: Chinese word tokenization and Chinese stopword removal (the Chinese examples were lost in the transcript). Machine translation errors: out-of-vocabulary words, e.g., “Hotshepsut” was translated as “conspicuous Zhai”.

    64. 64 In natural language, this is quite uncommon

    65. 65 Repeated patterns do not hurt Q/A Most repeated words are meaningless words, punctuation, or OCR false-alarm words. Experiments also demonstrate that employing a simple longest common subsequence (LCS) performs the same as the proposed method (or as enumerating all of the state sequences).
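For reference, the simple longest-common-subsequence baseline mentioned above can be computed with the classic dynamic program (a textbook sketch, not the authors' code):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # Classic O(len(a) * len(b)) dynamic-programming table:
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]
```

Here `a` and `b` can be the question and passage as token lists (or character strings), so the score naturally tolerates insertions such as OCR false alarms between matched words.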

    66. 66 Video Processing (Experiments) We use a small subset of the Discovery videos: 30 short clips, NTSC 352x240 MPEG-1, 1684 frames (sampled at 2 frames per second), and 2166 text areas.

    67. 67 Experimental Result (Text detection)

    68. 68 Experimental Result (OCR)

    69. 69 VideoOCR Efficiency Analysis
