
A Search Engine for Historical Manuscript Images


Presentation Transcript


  1. A Search Engine for Historical Manuscript Images Toni M. Rath, R. Manmatha and Victor Lavrenko Center for Intelligent Information Retrieval University of Massachusetts SIGIR2004

  2. Introduction • The first known automatic retrieval system for handwritten historical manuscripts • The obvious approach to this problem is handwriting recognition, but its error rates exceed 50% • This system searches handwritten manuscripts using text queries, without recognition

  3. Introduction • View the problem as an image annotation or cross-lingual retrieval problem: text words are used to query word images • Learn a statistical relevance model by training on a set of pages with word-for-word transcriptions • Two models • Probabilistic Annotation Model • Direct Retrieval Model

  4. An Example An example image from the George Washington collection

  5. Related Work • Obvious approach • Handwriting recognition + text search engine • Image annotation • Duygulu – translation model • Blei – latent Dirichlet allocation model • Jeon – cross-media relevance model (CMRM) • Lavrenko – cross-lingual relevance model

  6. Related Work (2) • Handwriting recognition + text search engine • Advantage • Can be used for every English word • Disadvantage • Well-known segmentation and recognition errors • Image annotation (convert the problem into an annotation task)

  7. Related Work (3) • Differences between image annotation and their model • Use shape features instead of color and texture features • Do not use clusters or blobs • Learn the relation between features and English words instead of between blobs and English words

  8. Image Annotation

  9. Model In This Paper

  10. System Overview • Probabilistic Annotation Model 1. Train relations between features and English words 2. Each word image in the test set is annotated with every term in the annotation vocabulary and a corresponding probability 3. The results of step 2 are stored in an inverted list for quick access, so typical query times are less than one second
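
Not from the paper: a minimal Python sketch of step 3, where the hypothetical helpers build_inverted_list and query show how per-image annotation probabilities could be stored in an inverted list keyed by annotation term for sub-second lookup.

```python
# Minimal sketch (not the authors' code): build an inverted list that maps
# each annotation term to the word images annotated with it, sorted by the
# annotation probability for fast query-time access.
from collections import defaultdict

def build_inverted_list(annotations):
    """annotations: dict image_id -> dict term -> P(term | image features)."""
    inverted = defaultdict(list)
    for image_id, term_probs in annotations.items():
        for term, prob in term_probs.items():
            inverted[term].append((prob, image_id))
    for term in inverted:
        # highest annotation probability first
        inverted[term].sort(key=lambda pair: pair[0], reverse=True)
    return inverted

def query(inverted, term, top_k=20):
    """Return the top-k word images for a single-term text query."""
    return inverted.get(term, [])[:top_k]
```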

  11. Probabilistic Annotation Model

  12. System Overview • Direct Retrieval 1. Train relations between features and English words 2. Use the query to estimate a distribution over the feature vocabulary that one would expect to observe jointly with the query 3. Compare this distribution with the feature-vocabulary distribution of each word image using Kullback-Leibler divergence; all word images in the test set can then be ranked at query time
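
As a hedged illustration of step 3 (not the authors' code), the sketch below ranks word images by the Kullback-Leibler divergence between a query-induced feature distribution and each image's feature distribution; all function names are hypothetical.

```python
# Sketch: rank word images by KL divergence from the query's feature
# distribution (smaller divergence = better match).
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared feature vocabulary; p, q: dict term -> prob."""
    return sum(pv * math.log((pv + eps) / (q.get(t, 0.0) + eps))
               for t, pv in p.items() if pv > 0)

def rank_word_images(query_feature_dist, image_feature_dists):
    """Return (KL, image_id) pairs sorted so the best matches come first."""
    scored = [(kl_divergence(query_feature_dist, dist), image_id)
              for image_id, dist in image_feature_dists.items()]
    return sorted(scored, key=lambda pair: pair[0])
```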

  13. Direct Retrieval

  14. Demo A demo at http://ciir.cs.umass.edu/research/wordspotting

  15. Word Image Representation • Simple shape features, such as width and height; a total of 5 such features are used • Fourier coefficients of profile features: more detailed descriptions of a word's shape can be obtained from profile features, such as the upper and lower profiles (see the picture); each profile contributes 7 Fourier coefficients, giving 3x7 = 21 profile features • Each word image therefore has a 21 + 5 = 26-dimensional continuous-space feature vector
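
A rough Python/NumPy sketch of how upper and lower profiles and their low-order Fourier coefficients might be computed; the exact profile definitions and normalization in the paper may differ, and the third profile would be handled analogously to reach the full 3x7 = 21 features.

```python
import numpy as np

def profile_features(binary_word_image, n_coeffs=7):
    """Sketch: upper and lower profiles of a binarized word image (ink = 1),
    each summarized by the magnitudes of its first low-order DFT coefficients.
    The third profile of the paper would be treated the same way."""
    h, w = binary_word_image.shape
    cols = [np.flatnonzero(binary_word_image[:, x]) for x in range(w)]
    upper = np.array([c[0] if c.size else h for c in cols], dtype=float)
    lower = np.array([c[-1] if c.size else 0 for c in cols], dtype=float)
    feats = []
    for profile in (upper, lower):
        coeffs = np.fft.rfft(profile)
        feats.extend(np.abs(coeffs[:n_coeffs]))
    return np.array(feats)   # here: 2 profiles x 7 coefficients
```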

  16. Word Image Representation • Divide the range of observed values in each feature dimension into 10 bins of equal size, and associate a unique feature-vocabulary term with each bin • Repeat the process with 9 bins • Each word image is then represented by 2x26 = 52 feature terms • There are (10+9)x26 = 494 terms in the feature vocabulary
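
A hedged sketch of the two-resolution binning described above, assuming equal-width bins over the observed range of each of the 26 dimensions; the function name and data layout are illustrative only.

```python
import numpy as np

def discretize(feature_vectors, bins=(10, 9)):
    """Sketch: each continuous dimension is split into 10 and into 9
    equal-width bins, and every (dimension, resolution, bin) triple becomes
    one term of the discrete feature vocabulary. Returns one term set per
    word image (52 terms each; 494 possible terms for 26 dimensions)."""
    X = np.asarray(feature_vectors, dtype=float)      # shape (n_images, 26)
    lo, hi = X.min(axis=0), X.max(axis=0)
    terms_per_image = [set() for _ in range(X.shape[0])]
    for n_bins in bins:
        # bin index in 0..n_bins-1 for every value of every dimension
        idx = np.floor((X - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
        idx = np.clip(idx, 0, n_bins - 1)
        for i, row in enumerate(idx):
            for dim, b in enumerate(row):
                terms_per_image[i].add((dim, n_bins, b))
    return terms_per_image
```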

  17. Model Formulation • Probabilistic Annotation Model • w: an English word • f: a feature term • k: number of feature terms = 52 • I: a word image in the training set • i: position in the training set • |T|: number of word images in the training set
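
The equation itself appears only as an image on the slide; the LaTeX below is a reconstruction of the relevance-model estimate implied by this notation, so the exact form and indexing may differ from the original.

```latex
% Joint probability of an annotation word w and the feature terms
% f_1, ..., f_k of a word image, averaged over the |T| training word
% images I_i (reconstruction; may differ from the slide's equation).
P(w, f_1, \dots, f_k) \;=\; \frac{1}{|T|} \sum_{i=1}^{|T|}
    P(w \mid I_i) \prod_{j=1}^{k} P(f_j \mid I_i)
```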

  18. Model Formulation • V: vocabulary of the training set • Smoothing: δ(x ∈ {w_i, f_i1, …, f_ik}) = 1 if x ∈ {w_i, f_i1, …, f_ik}, and 0 otherwise • x is either a word w or a feature term f • λ ∈ [0, 1]
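
A reconstructed LaTeX form of the smoothing described by this legend, with λ written for the garbled symbol; treating the background term as the relative frequency of x over the training set T is an assumption.

```latex
% Smoothed estimate of a word or feature term x for training image I_i:
% the indicator of x occurring in I_i, interpolated with the relative
% frequency of x over the training set T (assumption), lambda in [0,1].
P(x \mid I_i) \;=\; (1-\lambda)\,
    \delta\bigl(x \in \{w_i, f_{i1}, \dots, f_{ik}\}\bigr)
    \;+\; \lambda\, P(x \mid T)
```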

  19. Model Formulation • Page Retrieval • Pg: a page • Q: a text query q1 … qm
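
One natural way to write the page-retrieval score under this notation, assuming the query terms are scored independently given the page; this is a sketch, not a transcription of the slide's equation image.

```latex
% Score of page Pg for query Q = q_1 ... q_m, assuming the query terms
% are generated independently from the page's annotation model (sketch).
P(Q \mid Pg) \;=\; \prod_{j=1}^{m} P(q_j \mid Pg)
```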

  20. Model Formulation • Direct Retrieval • Q: a query word • W: a word image • Equation (6) is used to estimate P(Q|W) and P(f|W)
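
A sketch of the ranking criterion referred to here: word images W are ordered by the Kullback-Leibler divergence between the query-induced feature distribution and each image's feature distribution (smaller is better).

```latex
% Rank word images W by the KL divergence between the feature
% distribution induced by the query Q and the distribution of W.
\mathrm{KL}\bigl(P(\cdot \mid Q) \,\|\, P(\cdot \mid W)\bigr)
  \;=\; \sum_{f} P(f \mid Q) \,\log \frac{P(f \mid Q)}{P(f \mid W)}
```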

  21. Reordering • Each word image can be represented as a vector with 494 entries, 52 of which are 1 • The retrieved images and the training images for the given query are converted to this form • Reordering is performed using the average dot product between each retrieved image and the training images for the given query
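
A small NumPy sketch of this reordering step under the stated representation (494-dimensional binary vectors); the function name and array layout are illustrative assumptions.

```python
import numpy as np

def reorder(retrieved_vecs, training_vecs_for_query):
    """Sketch: re-rank retrieved word images by their average dot product
    with the training images of the query word (higher = better)."""
    R = np.asarray(retrieved_vecs, dtype=float)           # (n_retrieved, 494)
    T = np.asarray(training_vecs_for_query, dtype=float)  # (n_training, 494)
    scores = R @ T.T                  # pairwise dot products
    avg = scores.mean(axis=1)         # average similarity to training examples
    order = np.argsort(-avg)          # highest average similarity first
    return order, avg[order]
```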

  22. Data Collection • George Washington collection at the Library of Congress • Contains 150,000 pages • Images were digitized from these pages from microfilm at 300 dpi, 8-bit grayscale • Training set • 100 pages (24,665 words, 3,087-word vocabulary) • Test set • 987 pages (234,754 words)

  23. Experimental Eval. - Queries • A mixture of proper names, places, nouns, and numbers in the form of a year • Chosen to be reasonably frequent words in the training set • It is possible that some of the query words do not occur in the test set

  24. Eval. – Word Image Retrieval • A number of words are incorrectly segmented • The direct retrieval model did not retrieve any instances of "deserter" and "disobedience", while the probabilistic annotation model found one instance of "disobedience" • The low yield may be caused either by insufficient training data or by a lack of relevant images in the test collection

  25. Eval. - Page Image Retrieval For one-word queries the performance is quite good, even higher than in single-word retrieval without reordering. For two-word queries the results seem low, but the authors believe that a more thorough evaluation with ground-truth data would yield better results.

  26. Conclusions • The results show that retrieval can be done even though handwriting recognition remains a challenging task • Adapting statistical relevance models produces good results, but much remains to be done; better models are needed • Large datasets can be handled either by using a cluster of processors or by improving the efficiency of both the feature processing and retrieval model stages

  27. Conclusions • The lack of training data requires attention; the authors are currently investigating synthetic training data as a possible solution
