Extraction of Text Objects in Video Documents: Recent Progress Jing Zhang and Rangachar Kasturi University of South Florida Department of Computer Science and Engineering
Acknowledgements The work presented here is that of numerous researchers from around the world. We thank them for their contributions towards the advances in video document processing. In particular we would like to thank the authors of papers whose work is cited in this presentation and in our paper.
Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
Introduction • Since the 1990s, with the rapid growth of available multimedia documents and the increasing demand for information indexing and retrieval, much effort has been devoted to text extraction in images and videos.
Introduction • Text Extraction in Video • Text consists of words that are well-defined models of concepts for human communication. • Text objects embedded in video contain much semantic information related to the multimedia content. • Text extraction techniques play an important role in content-based multimedia information indexing and retrieval.
Introduction • Extracting text in video presents unique challenges compared with extracting text from scanned documents:
Introduction • Caption text is artificially superimposed on the video at the time of editing. • Scene text occurs naturally in the field of view of the camera during video capture. • Extracting scene text is a much harder task because of varying lighting, complex movement, and transformation. (Figure: examples of scene text and caption text.)
Introduction • Five stages of text extraction in video: 1) Text Detection: finding regions in a video frame that contain text; 2) Text Localization: grouping text regions into text instances and generating a set of tight bounding boxes around all text instances; 3) Text Tracking: following a text event as it moves or changes over time and determining the temporal and spatial locations and extents of text events; 4) Text Binarization: binarizing the text bounded by text regions and marking text as one binary level and background as the other; 5) Text Recognition: performing OCR on the binarized text image.
Introduction • Processing flow: Video Clips → Text Detection → Text Localization → Text Tracking → Text Binarization → Text Recognition → Text Objects
Introduction • The goal of text detection, text localization, and text tracking is to generate accurate bounding boxes for all text objects in video frames and to assign a unique identity to each text event, which is composed of the same text object appearing in a sequence of consecutive frames.
Introduction • This presentation concentrates on the approaches proposed for text extraction in videos over the past five years, and summarizes and discusses the recent progress in this research area.
Introduction • Region Based Approach utilizes the different region properties between text and background to extract text objects. • Bottom-up: separating the image into small regions and then grouping character regions into text regions. • Color features, edge features, and connected component methods • Texture Based Approach uses distinct texture properties of text to extract text objects from background. • Top-down: extracting texture features of the image and then locating text regions. • Spatial variance, Fourier transform, Wavelet transform, and machine learning methods.
Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
Recent Progress • Text extraction in video documents, as an important research branch of content-based information retrieval and indexing, continues to be a topic of much interest to researchers. • A large number of newly proposed approaches in the literature have contributed to impressive progress in text extraction techniques.
Recent Progress • Now • Temporal redundancy of video is utilized by almost all recent text extraction approaches. • Scene text extraction is being extensively studied. • A comprehensive performance evaluation framework has been developed. • Prior to 2003 • Only a few text extraction approaches considered the temporal nature of video. • Very little work was done on scene text. • Objective performance evaluation metrics were scarce.
Recent Progress • The progress of text extraction in video can be categorized into three types: • New and improved text extraction approaches • Text extraction techniques adopted from other research fields • Text extraction approaches proposed for specific text types and specific genres of video documents
Recent Progress • New and improved text extraction approaches: The new and improved approaches play an important role in the recent progress of text extraction technique for videos. These new approaches introduce not only new algorithms but also new understanding of the problem.
Recent Progress-New and improved text extraction approaches H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005 A text string is modeled as its center line and the skeletons of characters by ridges at different hierarchical scales. First line: Images with rectangle showing the text region. Second line: Zoom on text regions. Third line: ridges detected at two scales (red in high level, blue in small level) in the text region that represent local structures of text lines whatever the type of text.
H. Tran, A. Lux, H.L. Nguyen T. and A. Boucher, A novel approach for text detection in images using structural features, The 3rd International Conference on Advances in Pattern Recognition, LNCS Vol. 3686, pp. 627-635, 2005 • Abstract. We propose a novel approach for finding text in images by using ridges at several scales. A text string is modelled by a ridge at a coarse scale representing its center line and numerous short ridges at a smaller scale representing the skeletons of characters. Skeleton ridges have to satisfy geometrical and spatial constraints such as the perpendicularity or non-parallelism to the central ridge. In this way, we obtain a hierarchical description of text strings, which can provide direct input to an OCR or a text analysis system. The proposed method does not depend on a particular alphabet, it works with a wide variety in size of characters and does not depend on orientation of text string. The experimental results show a good detection. • X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008. Abstract: This paper proposes an approach based on the statistical modeling and learning of neighboring characters to extract multilingual texts in images. The case of three neighboring characters is represented as the Gaussian mixture model and discriminated from other cases by the corresponding ‘pseudo-probability’ defined under Bayes framework. Based on this modeling, text extraction is completed through labeling each connected component in the binary image as character or non-character according to its neighbors, where a mathematical morphology based method is introduced to detect and connect the separated parts of each character, and a Voronoi partition based method is advised to establish the neighborhoods of connected components.
We further present a discriminative training algorithm based on the maximum–minimum similarity (MMS) criterion to estimate the parameters in the proposed text extraction approach. Experimental results in Chinese and English text extraction demonstrate the effectiveness of our approach trained with the MMS algorithm, which achieved the precision rate of 93.56% and the recall rate of 98.55% for the test data set. In the experiments, we also show that the MMS provides significant improvement of overall performance, compared with influential training criterions of the maximum likelihood (ML) and the minimum classification error (MCE).
Recent Progress-New and improved text extraction approaches • X. Liu, H. Fu and Y. Jia, Gaussian Mixture Modeling and learning of Neighbor Characters for Multilingual Text Extraction in Images, Pattern Recognition, Vol. 41, pp. 484-493, 2008. The GMM-based algorithm represents the features of three neighboring characters as a Gaussian mixture model to extract text objects. An example of neighborhood computation: (a) a binary image, where black dots denote the centroids of connected components (CCs); (b) the Delaunay triangulation of the centroids, where each triangle corresponds to a neighbor set. However, the neighborhoods of characters cannot be completely reflected in the Delaunay triangulation. (c) The solution: taking every three nodes that are joined one by one on the convex hull of the centroid set as neighbor sets.
Recent Progress-New and improved text extraction approaches P. Dubey, Edge Based Text Detection for Multi-purpose Application, Proceedings of International Conference Signal Processing, IEEE, Vol. 4, 2006 Only the vertical edge features are utilized to find text regions based on the observation that vertical edges can enhance the characteristic of text and eliminate most irrelevant information. (a) (b) (c) (d) (a) Original image, (b) detected group of vertical lines, (c) extracted text region, (d) result
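The observation that vertical edges emphasize text can be illustrated with a minimal sketch. It assumes a grayscale frame stored as a list of rows and uses a plain central difference along x with a hypothetical threshold of 40; the filter and the threshold are illustrative choices, not the detector described in the cited paper.

```python
def vertical_edge_map(gray, threshold=40):
    """Mark pixels whose horizontal intensity change is at least `threshold`.

    `gray` is a list of rows of 0..255 intensities. The threshold value is
    an illustrative assumption, not a value taken from the cited paper.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(1, width - 1):
            # A difference along x responds to vertical edges and ignores
            # purely horizontal ones.
            gx = gray[y][x + 1] - gray[y][x - 1]
            if abs(gx) >= threshold:
                edges[y][x] = 1
    return edges


def edge_count_per_row(edges):
    """Count edge pixels in each row; rows crossing text tend to score high."""
    return [sum(row) for row in edges]
```

Rows with many vertical-edge pixels are natural candidates for text lines; grouping them into regions would be a separate step.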
Recent Progress-New and improved text extraction approaches • K. Subramanian, P. Natarajan, M. Decerbo, and D. Castanon, Character-Stroke Detection for Text-Localization and Extraction, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 33-37, 2007 Character-stroke is used to extract text objects by utilizing three line scans (a set of pixels along the horizontal line of an intensity image) to detect image intensity changes. (a) Original image, (b) Intensity plots along the blue line l, l-2, and l+2, is the stroke width, (c) threshold Ig 0.35, (d) The thresholded image after morphological operations and connected component analysis.
Recent Progress-New and improved text extraction approaches D. Crandall, S. Antani, R. Kasturi, Extraction of special effects caption text events from digital video, International Journal on Document Analysis and Recognition, Vol. 5, pp. 138-157, 2003 8×8 block-wise DCT is applied on each video frame. For each block, 19 optimal coefficients that best correspond to the properties of text are determined empirically. The sum of the absolute values of these coefficients is computed and regarded as a measure of the “text energy” of that block. The motion vectors of MPEG-compressed videos are used for text object tracking. (a) Original image (b) Text energy (c) Tracking result
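The block-wise "text energy" described above can be sketched as follows. The slide does not list the 19 coefficients, so `SELECTED_COEFFS` below is a hypothetical placeholder set chosen only for illustration; the DCT itself is the standard orthonormal 2-D DCT-II.

```python
import math

N = 8  # block size used in the description above


def dct2_block(block):
    """Orthonormal 2-D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


# Hypothetical selection of low-frequency AC coefficients. These are NOT the
# 19 empirically determined coefficients of the cited paper.
SELECTED_COEFFS = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0), (3, 0), (1, 1)]


def text_energy(block, selected=SELECTED_COEFFS):
    """Sum of absolute values of the selected DCT coefficients of one block."""
    coeffs = dct2_block(block)
    return sum(abs(coeffs[u][v]) for u, v in selected)
```

A flat block yields an energy near zero because only its DC coefficient is nonzero, while a block with strong intensity alternation yields a large value. When working on MPEG streams the coefficients would normally be read from the compressed data rather than recomputed as done here.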
Recent Progress-New and improved text extraction approaches In addition to new approaches, many earlier text extraction approaches have recently been enhanced and extended to overcome their limitations. By extracting and integrating more comprehensive characteristics of text objects, these improved approaches can provide more robust performance than their predecessors.
Recent Progress-New and improved text extraction approaches S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005 Color-related detector, wavelet-based texture detector, edge-based contour detector and temporal invariance principle are adopted to detect candidate caption regions. Then a parallel fusion strategy C. Mancas-Thillou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006. Euclidean distance based and cosine similarity based clustering methods are applied to the RGB color space in a complementary fashion to partition the original image into three clusters: textual foreground, textual background, and noise. Overview of the proposed algorithm combining color and spatial information.
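Why Euclidean distance and cosine similarity can complement each other on RGB values is easiest to see on a small example. The sketch below only defines the two measures and compares them on made-up pixel values; it does not reproduce the clustering procedure of the cited paper.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two RGB triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between two RGB triples (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Illustrative pixel values (assumptions): the same hue at two brightness levels.
bright = (200, 100, 50)
dark = (100, 50, 25)
```

For these two pixels the Euclidean distance is large, because the brightness differs, while the cosine similarity is essentially 1, because the color direction is the same. A measure based on direction is therefore less sensitive to illumination changes, and a measure based on distance still separates colors that differ mainly in intensity.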
Recent Progress-New and improved text extraction approaches M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005. The sequential multi-resolution paradigm removes the redundancy of the parallel multi-resolution paradigm: no text edge appears more than once across the different resolution levels. Sequential multiresolution paradigm
Recent Progress-New and improved text extraction approaches J. Gllavata, E. Qeli and B. Freisleben, Detecting Text in Videos Using Fuzzy Clustering Ensembles, Proceedings of the Eighth IEEE International Symposium on Multimedia, pp. 283-290, 2006. Fuzzy C-means based individual frame clustering is replaced by the fuzzy clustering ensemble (FCE) based multi-frame clustering to utilize temporal redundancy. Fuzzy cluster ensemble for text detection in videos
Recent Progress 2. Text extraction techniques adopted from other research fields: Another encouraging development is that more and more techniques that have been successfully applied in other research fields have been adapted for text extraction. Because these approaches were not initially designed for the text extraction task, many unique characteristics of their original research fields are embedded in them intrinsically. Therefore, by using these approaches from other fields, we can view the text extraction problem from the viewpoints of other related research fields and benefit from them. It is a promising way to find good solutions for the text extraction task.
Recent Progress-Text extraction techniques adopted from other research fields K.I. Kim, K. Jung and J.H. Kim, Texture-based approach for text detection in image using support vector machine and continuously adaptive mean shift algorithm, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, pp. 1631-1638, 2003. The continuously adaptive mean shift algorithm (CAMSHIFT) was initially used to detect and track faces in a video stream. Example of text detection using CAMSHIFT. (a) input image (540×400), (b) initial window configuration for CAMSHIFT iteration (5×5-sized windows located at regular intervals of (25, 25)), (c) texture classified region marked as white and gray level (white: text region, gray: non-text region), and (d) final detection result
Recent Progress-Text extraction techniques adopted from other research fields H.B. Aradhye and G.K. Myers, Exploiting Videotext “Events” for Improved Videotext Detection, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 894-898, 2007. The multiscale statistical process control (MSSPC) was originally proposed for detecting changes in univariate and multivariate signals. Substeps involved in the use of MSSPC for videotext event detection
Recent Progress-Text extraction techniques adopted from other research fields D. Liu and T. Chen, Object Detection in Video with Graphical Models, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 5, pp. 14-19, 2006. Discriminative Random Fields (DRF) were initially applied to detect man-made buildings in 2D images. (a) 2D DRF, with state s_i and one of its neighbors s_j. (b) 3D DRF, with multiple 2D DRFs stacked over time. (c) 2D DRF-HMM type (A), with intra-frame dependencies modelled by undirected DRFs, and inter-frame dependencies modelled by HMMs. States are shared between the two models.
Recent Progress-Text extraction techniques adopted from other research fields W. M. Pan, T. D. Bui, and C. Y. Suen, Text Segmentation from Complex Background Using Sparse Representations, Proceedings of Ninth International Conference on Document Analysis and Recognition, IEEE, pp. 412-416, 2007. Sparse representation was initially used for research on the receptive fields of simple cells. (a) Camera-captured image; (b) foreground text generated by image decomposition via sparse representations; (c) binarized result of (b) using Otsu’s method.
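Otsu's method, mentioned in the caption above, picks the gray level that maximizes the between-class variance of the two resulting pixel groups. A minimal sketch on a flat list of 8-bit intensities follows; treating brighter pixels as level 1 is an assumption made for this example, since text polarity depends on the image.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) maximizing between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # intensity sum of the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map pixels above the threshold to 1 and the rest to 0 (assumed polarity)."""
    return [1 if p > t else 0 for p in pixels]
```

On a clearly bimodal input the returned threshold falls between the two modes, which is the behavior binarization relies on.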
Recent Progress 3. Text extraction approaches proposed for specific text types and specific genres of video documents: Besides general text extraction approaches, an increasing number of approaches have been proposed for specific text types. Based on domain knowledge, these specific approaches can take advantage of unique properties of a specific text type or video genre and often achieve better performance than general approaches.
Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents W. Wu, X. Chen and J. Yang, Detection of text on road signs from video, IEEE Transactions on Intelligent Transportation Systems, Vol. 6, pp. 378-390, 2005. This approach is composed of two stages: 1. localizing road signs; 2. detecting text. Architecture of the proposed framework
Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents C. Choudary and T. Liu, Summarization of Visual Content in Instructional Videos, IEEE Transactions on Multimedia, Vol. 9, pp. 1443-1455, 2007. A content fluctuation curve based on the number of chalk pixels is used to measure the content in each frame of instructional videos. The frames with enough chalk pixels are extracted as key frames. Hausdorff distance and connected-component decomposition are adopted to reduce the redundancy of key frames by matching the content and mosaicking the frames. Comparison of our summary frames with the key frames obtained using different key frame selection methods in a test video. (a) our summarization algorithm; (b) fixed sampling; (c) dynamic clustering; (d) tolerance band. Our summary frames are rich in content and more appealing.
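The Hausdorff distance used for content matching measures how far two point sets are from each other in the worst case. The sketch below implements the standard definition on small lists of (x, y) points, which could stand for chalk-pixel coordinates; how the cited paper applies the distance to key frames is not specified on the slide and is not reproduced here.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Two frames whose content points give a small Hausdorff distance show nearly the same writing, so one of them can be treated as redundant. The brute-force search above is quadratic in the number of points and is meant only for illustration.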
Recent Progress-Text extraction approaches proposed for specific text types and specific genre of video documents Additional References: • C. Mancas-Thillou, B. Gosselin, Spatial and Color Spaces Combination for Natural Scene Text Extraction, Proceedings of IEEE International Conference on Image Processing, pp. 985-988, 2006. • D. Q. Zhang and S. F. Chang, Learning to Detect Scene Text Using a Higher-order MRF with Belief Propagation, IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2004. • L. Tang and J.R. Kender, A unified text extraction method for instructional videos, Proceedings of IEEE International Conference on Image Processing, Vol. 3, pp. 11-14, 2005. • M.R. Lyu, J. Song, M. Cai, A Comprehensive method for multilingual video text detection, localization, and extraction, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, pp. 243-255, 2005. • S. Lefevre, N. Vincent, Caption localization in video sequences by fusion of multiple detectors, Proceedings of Eighth International Conference on Document Analysis and Recognition, IEEE, pp. 106-110, 2005. • C.C. Lee, Y.C. Chiang, C.Y. Shih, H.M. Huang, Caption localization and detection for news videos using frequency analysis and wavelet features, Proceedings of IEEE International Conference on Tools with Artificial Intelligence, Vol. 2, pp. 539-542, 2007. • …
Outline • Introduction • Recent Progress • Performance Evaluation • Discussion
Performance Evaluation Evaluation Metrics: Video Analysis and Content Extraction (VACE) R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, Framework for Performance Evaluation of Face, Text, and Vehicle Detection and Tracking in Video: Data, Metrics, and Protocol, to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. (http://doi.ieeecomputersociety.org/10.1109/TPAMI.2008.57)
Text: Task Definition • Detection Task: Spatially locate the blocks of text in each video frame in a video sequence. Text blocks (objects) contain all words in a particular line of text where the font and size are the same. • Tracking Task: Spatially/temporally locate and track the text objects in a video sequence. • Recognition Task: Transcribe the words in each frame, including their spatial location (detection implied).
Task Definition Highlights • Annotate an oriented bounding rectangle around text objects. (The reference annotation was done by VideoMining Inc., State College, PA.) • Detection and Tracking task: line-level annotation with IDs maintained; rules based on similarity of font, proximity, and readability levels. • Recognition task: word level (IDs maintained). • Documents: annotation guidelines, evaluation protocol. • Tools: ViPER (annotation), USF-DATE (scoring).
Data Resources • VIDEO • Micro-corpus: a small amount of data that was created after extensive discussions with the research community to act as a seed for initial annotation experiments and to provide new participants with a concrete sampling of the datasets and the tasks.
Data Resources • These discussions were coordinated as a series of weekly teleconferences with VACE contractors and other eminent members of the CV community. The discussions made the research community a partner in the evaluations and helped us in: selecting the video recordings to be used in the evaluations, creating the specifications for the ground truth annotations and scoring tools, and defining the evaluation infrastructure for the program.
Data Resources • MPEG–2 standard, progressive scanned at 720 × 480 resolution. GOP (Group of Pictures) of 12 for the broadcast news corpus where the frame-rate was 29.97 fps (frames per second) and GOP of 10 for the surveillance dataset where the frame-rate was 25 fps. * Distributed by the Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu ** i-LIDS [Multiple Camera Tracking/Parked Vehicle Detection/Abandoned Baggage Detection] scenario datasets were developed by the UK Home Office and CPNI. (http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imagingtechnology/video-based-detection-systems/i-lids/)
Reference Annotations • Text Ground Truth: Every new text area was marked with a box when it appeared in the video. The box was moved and scaled to fit the text as it moved in successive frames. This process was done at the text line level until the text disappeared from the frame. • Three readability levels: READABILITY = 0 (white): completely unreadable text; READABILITY = 1 (gray): partially readable text; READABILITY = 2 (black): clearly readable text.
Reference Annotations • Text regions were tagged based on a comprehensive set of rules: • All text within a selected block must have the same readability level and type. • Blocks of text must have the same size and font. • The bounding box should be tight to the extent that there is no space between the box and the text. • Text boxes may not overlap other text boxes unless the characters themselves are superimposed atop one another.
Detection Metric: Frame Detection Accuracy (FDA) • The Frame Detection Accuracy (FDA) measure calculates the spatial overlap between the ground truth and system output objects as the ratio of the spatial intersection between the two objects to their spatial union. The sum of all of the overlaps is normalized over the average of the number of ground truth and detected objects. • G_i denotes the ith ground truth object at the sequence level and G_i(t) denotes the ith ground truth object in frame t. D_i denotes the ith detected object at the sequence level and D_i(t) denotes the ith detected object in frame t. N_G(t) and N_D(t) denote the number of ground truth objects and the number of detected objects in frame t, respectively.
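The FDA description above can be turned into a short sketch: sum the intersection-over-union ratios of the mapped object pairs and divide by the average of the two object counts. Two simplifications are assumptions of this example: boxes are axis-aligned (x1, y1, x2, y2) tuples rather than the oriented rectangles used in the annotations, and the mapping between ground truth and detected objects is supplied by the caller.

```python
def overlap_ratio(g, d):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0, min(g[2], d[2]) - max(g[0], d[0]))
    ih = max(0, min(g[3], d[3]) - max(g[1], d[1]))
    inter = iw * ih
    union = ((g[2] - g[0]) * (g[3] - g[1])
             + (d[2] - d[0]) * (d[3] - d[1]) - inter)
    return inter / union if union else 0.0


def frame_detection_accuracy(mapped_pairs, num_gt, num_det):
    """FDA for one frame.

    mapped_pairs: list of (ground_truth_box, detected_box) pairs that were
    already matched one-to-one (the matching step is assumed, not shown).
    num_gt, num_det: total object counts in the frame, including objects
    that were missed or are false alarms.
    """
    if num_gt == 0 and num_det == 0:
        return None  # nothing to score in this frame
    total_overlap = sum(overlap_ratio(g, d) for g, d in mapped_pairs)
    return total_overlap / ((num_gt + num_det) / 2)
```

Missed objects and false alarms contribute no overlap but still increase the denominator, so both lower the score.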
Detection Metric: Sequence Frame Detection Accuracy (SFDA) • The Sequence Frame Detection Accuracy (SFDA) is essentially the average of the FDA measure over all of the relevant frames in the sequence. • Range: 0 to 1 (higher is better). • N_frames is the number of frames in the sequence.
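Averaging FDA over a sequence can be sketched as below. The sketch assumes that a "relevant" frame is one containing at least one ground truth or detected object; that reading is an assumption of this example, and the per-frame FDA values are taken as given.

```python
def sequence_frame_detection_accuracy(frames):
    """Average FDA over relevant frames.

    frames: list of (fda, num_gt, num_det) tuples, one per frame. Frames
    with no ground truth and no detections are skipped, under the assumption
    stated above about what makes a frame relevant.
    """
    relevant = [fda for fda, num_gt, num_det in frames
                if num_gt > 0 or num_det > 0]
    if not relevant:
        return None  # no frame in the sequence can be scored
    return sum(relevant) / len(relevant)
```

Because empty frames are excluded, a long stretch of video without text does not inflate or dilute the score.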
Tracking Metric: Sequence Track Detection Accuracy (STDA) and Average Tracking Accuracy (ATA) • The Average Tracking Accuracy (ATA) is a spatio-temporal measure which penalizes fragmentations in both the temporal and spatial dimensions while accounting for the number of objects detected and tracked, missed objects, and false positives. • Range: 0 to 1 (higher is better). • N_G and N_D denote the number of unique ground truth objects and the number of unique detected objects in the given sequence, respectively. Uniqueness is defined by object IDs.
Example Detection Scoring • Spatial alignment error (ratio = .4505) • 3 false alarm objects • Correctly detected object with perfect overlap (ratio = 1.0) • 3 missed objects • Legend: green = detected box; red = ground truth box; yellow = overlap in mapped boxes.
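A sketch of a tracking score in the spirit of STDA and ATA follows. The slide does not reproduce the formulas, so the definitions used here are assumptions based on the description above: for each mapped track pair, the per-frame overlap ratios are summed and divided by the number of frames in which either track exists, these values are added up to give STDA, and ATA divides STDA by the average of N_G and N_D. Boxes are again simplified to axis-aligned tuples, and the track mapping is supplied by the caller.

```python
def overlap_ratio(g, d):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0, min(g[2], d[2]) - max(g[0], d[0]))
    ih = max(0, min(g[3], d[3]) - max(g[1], d[1]))
    inter = iw * ih
    union = ((g[2] - g[0]) * (g[3] - g[1])
             + (d[2] - d[0]) * (d[3] - d[1]) - inter)
    return inter / union if union else 0.0


def track_overlap(gt_track, det_track):
    """Spatio-temporal overlap of one mapped pair of tracks.

    Each track is a dict mapping frame index -> box. Frames where only one
    of the two tracks exists add nothing to the sum but still count in the
    denominator, which penalizes temporal fragmentation.
    """
    frames_either = set(gt_track) | set(det_track)
    frames_both = set(gt_track) & set(det_track)
    if not frames_either:
        return 0.0
    total = sum(overlap_ratio(gt_track[t], det_track[t]) for t in frames_both)
    return total / len(frames_either)


def average_tracking_accuracy(mapped_tracks, num_gt, num_det):
    """ATA = STDA / ((N_G + N_D) / 2), under the assumptions stated above."""
    if num_gt + num_det == 0:
        return None
    stda = sum(track_overlap(g, d) for g, d in mapped_tracks)
    return stda / ((num_gt + num_det) / 2)
```

A detected track that covers only half of the frames of its ground truth object scores 0.5 even with perfect spatial overlap, and unmatched tracks on either side lower the result further through the denominator.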