
A System for Understanding Imaged Infographics and Its Applications

Weihua Huang, Chew Lim Tan, School of Computing, National University of Singapore


Presentation Transcript


  1. A System for Understanding Imaged Infographics and Its Applications Weihua Huang, Chew Lim Tan School of Computing National University of Singapore

  2. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  3. Introduction • Information graphics (infographics) are frequently used in various kinds of documents. • Recognition and interpretation of infographics are important for automatic document processing and information retrieval. • Recognition task: what are the elements/components in an infographic? • Interpretation task: what does an infographic try to tell? • This paper focuses on one type of infographic: scientific charts

  4. Introduction • Imaged infographics are harder to recognize and interpret because everything is in pixels!

  5. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  6. Scientific Charts • Syntactic elements: chart title, X-axis title, X-axis label, X-axis ticks, X-axis end, Y-axis label, Y-axis unit, Y-axis ticks, Y-axis end, origin, data components

  7. Scientific Charts • Semantic information: tabular data, graphical representation, intended message (comparison, trend, distribution, etc.) • Recognition and interpretation are the reverse process

  8. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  9. Chart Recognition • Preprocessing • Text/graphics separation: connected component analysis • Edge detection: Canny edge detector • [Figure: the original image, text components, graphical image, edge map]
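
The text/graphics separation step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the slides only name connected component analysis, so the size criterion (`max_text_size`), the function names, and the toy image below are all assumptions. Edge detection with the Canny detector is not covered here.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground components of a binary image.

    `img` is a list of rows; a truthy cell is a foreground pixel.
    Each component is a list of (row, col) pixels.
    """
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not img[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = []
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (x_min, y_min, x_max, y_max) of a component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


def split_text_graphics(img, max_text_size=3):
    """Assumed heuristic: small components are text, large ones graphics."""
    text_boxes, graphic_boxes = [], []
    for pixels in connected_components(img):
        x0, y0, x1, y1 = bounding_box(pixels)
        longest_side = max(x1 - x0 + 1, y1 - y0 + 1)
        if longest_side <= max_text_size:
            text_boxes.append((x0, y0, x1, y1))
        else:
            graphic_boxes.append((x0, y0, x1, y1))
    return text_boxes, graphic_boxes


# Toy image (made up): one long horizontal line and one small blob.
IMAGE = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
TEXT_BOXES, GRAPHIC_BOXES = split_text_graphics(IMAGE)
```

On the toy image the long line ends up in `GRAPHIC_BOXES` and the small blob in `TEXT_BOXES`. A real system would need a more careful criterion than a single size threshold.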

  10. Chart Recognition • Graphical symbol construction • Vectorization • Detection of coordinate lines • Geometric constraint between candidate lines • Coverage of other lines in the candidate plot area • Attachment of text blocks • [Figure: edge map, DSCC, straight segments, ellipse fitting, circular arcs and elliptic arcs]

  11. Chart Recognition • Graphical symbol construction (cont.) • Construction of data components • Bottom-up process with the vectorized edges and intersections • Model-based parsing rules using the domain knowledge • Example: BarChart = {x-axis, y-axis, BarSet}, where BarSet = {Bar} with at least 2 elements, and Bar = {l1, l2, l3 | l1 ⊥ l3, l2 ⊥ l3, l3 || x-axis, CE(l1, l3), CE(l2, l3), EL(l1, x-axis), EL(l2, x-axis)} • Constraints: • a || b: line a is parallel to line b. • a ⊥ b: line a is perpendicular to line b. • CE(a, b): shapes a and b share one common endpoint. • EL(a, b): one endpoint of shape a lies on shape b.
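
The Bar rule above can be checked mechanically once line segments are available. The sketch below is one possible reading of the four constraints; the angle and distance tolerances, the function names, and the test segments are assumptions, since the slides do not say how exact the geometric tests are.

```python
import math

# A segment is a pair of endpoints: ((x1, y1), (x2, y2)).


def _angle(seg):
    """Direction of a segment in degrees, folded into [0, 180)."""
    (x1, y1), (x2, y2) = seg
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0


def _angle_diff(a, b):
    """Smallest angle between two undirected segments, in [0, 90]."""
    d = abs(_angle(a) - _angle(b)) % 180.0
    return min(d, 180.0 - d)


def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def _point_to_segment(p, seg):
    """Distance from point p to the closest point on segment seg."""
    (x1, y1), (x2, y2) = seg
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return _dist(p, seg[0])
    t = ((p[0] - x1) * dx + (p[1] - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return _dist(p, (x1 + t * dx, y1 + t * dy))


def parallel(a, b, tol_deg=5.0):
    """a || b, within an assumed angular tolerance."""
    return _angle_diff(a, b) <= tol_deg


def perpendicular(a, b, tol_deg=5.0):
    """a ⊥ b, within an assumed angular tolerance."""
    return abs(_angle_diff(a, b) - 90.0) <= tol_deg


def common_endpoint(a, b, tol_px=3.0):
    """CE(a, b): some endpoint of a coincides with some endpoint of b."""
    return any(_dist(p, q) <= tol_px for p in a for q in b)


def endpoint_on(a, b, tol_px=3.0):
    """EL(a, b): some endpoint of a lies on b."""
    return any(_point_to_segment(p, b) <= tol_px for p in a)


def is_bar(l1, l2, l3, x_axis):
    """Check the Bar rule: two sides l1, l2 and a top l3 standing on the x-axis."""
    return (perpendicular(l1, l3) and perpendicular(l2, l3)
            and parallel(l3, x_axis)
            and common_endpoint(l1, l3) and common_endpoint(l2, l3)
            and endpoint_on(l1, x_axis) and endpoint_on(l2, x_axis))


# Idealized example segments (made up for illustration).
X_AXIS = ((0, 100), (200, 100))
LEFT = ((20, 100), (20, 40))
RIGHT = ((50, 100), (50, 40))
TOP = ((20, 40), (50, 40))
```

With these segments `is_bar(LEFT, RIGHT, TOP, X_AXIS)` holds, while a slanted top or a side that does not reach the axis is rejected. The tolerances matter in practice because vectorized edges from scanned images are rarely exactly axis-aligned.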

  12. Chart Recognition • Text grouping • Yuan’s method is used to group connected components • Text recognition • Omnipage Scansoft Capture SDK 12.0 • Errors are manually corrected.

  13. Chart Recognition • Sample result: • Type: bar chart • Green: bars • Bar1: (281,249), (345,248), (346,301), (281,302) • Bar2: (430,109), (494,108), (499,298), (435,299) • Bar3: (581,134), (645,132), (648,296), (585,298) • …… • Red: axes • X: (239,304) to (994,290) • Y: (239,304) to (236,100)

  14. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  15. Chart Interpretation • Associating text with graphics • Assign a syntactic role to each text block • Label graphical symbols using the text blocks • 11 roles of text in scientific charts are identified • The problem is modeled as classification of text blocks

  16. Chart Interpretation • Associating text with graphics (cont.) • To train the classifier and classify a new text block, the following features are defined: • Distance to the nearest graphical symbol • Type of the nearest graphical symbol • Relative position of the text block and the graphical symbol • Type of the text string itself • Centricity of a text block • The C4.5 learning algorithm is used to build the decision tree.
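
C4.5 chooses splits by gain ratio, that is, information gain divided by the split information of the candidate feature. The sketch below computes that criterion for one categorical feature. It is not a full C4.5 implementation, and the feature values and role names in the toy data are made up, since the slides do not list the 11 roles.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of splitting `labels` on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical training data: relative position of a text block and its role.
POSITIONS = ["left", "left", "below", "below"]
ROLES = ["y-axis label", "y-axis label", "x-axis label", "x-axis label"]
# A feature that carries no information about the role.
NOISE = ["a", "b", "a", "b"]
```

Here `gain_ratio(POSITIONS, ROLES)` is 1.0 because the position separates the two roles perfectly, while `gain_ratio(NOISE, ROLES)` is 0. A decision tree learner would pick the feature with the highest ratio at each node.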

  17. Chart Interpretation • Obtaining the tabular data • Assign a label to each data entry if its label is not directly presented. • D1: distance to the nearest label on the left (L1). • D2: distance to the nearest label on the right (L2). • If D1 < D2, label = L1; else if D1 > D2, label = L2; else label = L1 + L2
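
The label assignment rule translates directly into code. In this sketch the input format (x-positions paired with label text), the handling of a missing neighbour on one side, and the space used to join two labels in the tie case are assumptions; the slide only gives the three-way comparison.

```python
def assign_label(entry_x, labels, joiner=" "):
    """Pick a label for a data entry from nearby axis labels.

    `entry_x` is the x-position of the data entry and `labels` is a list of
    (x_position, text) pairs. Returns None if there are no labels at all.
    """
    left = [(entry_x - x, text) for x, text in labels if x <= entry_x]
    right = [(x - entry_x, text) for x, text in labels if x > entry_x]
    if not left and not right:
        return None
    if not right:            # assumed fallback: only left labels exist
        return min(left)[1]
    if not left:             # assumed fallback: only right labels exist
        return min(right)[1]
    d1, l1 = min(left)       # D1 and L1: nearest label on the left
    d2, l2 = min(right)      # D2 and L2: nearest label on the right
    if d1 < d2:
        return l1
    if d1 > d2:
        return l2
    return l1 + joiner + l2  # tie: the entry sits midway between two labels


# Hypothetical axis labels at x = 100 and x = 200.
AXIS_LABELS = [(100, "1984"), (200, "1986")]
```

For example, `assign_label(120, AXIS_LABELS)` returns `"1984"`, `assign_label(180, AXIS_LABELS)` returns `"1986"`, and an entry exactly midway at 150 gets the combined label `"1984 1986"`.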

  18. Chart Interpretation • Obtaining the tabular data (cont.) • Calculate the value for each data entry if its value is not directly presented. • H1: data height • H2: unit height • Value per unit height: 30 • Data value: H1 * 30 / H2
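
The value calculation is a simple proportion between pixel heights and axis units. A small worked sketch follows; the pixel heights used in the example (120 and 40) are invented for illustration, and only the value per unit height of 30 comes from the slide.

```python
def data_value(data_height, unit_height, value_per_unit):
    """Convert a bar's pixel height into a data value.

    data_height    -- H1, height of the data component in pixels
    unit_height    -- H2, pixel distance between two adjacent axis ticks
    value_per_unit -- value difference between two adjacent axis ticks
    """
    if unit_height <= 0:
        raise ValueError("unit_height must be positive")
    return data_height * value_per_unit / unit_height


# Hypothetical measurement: a bar 120 px tall, ticks 40 px apart, 30 per tick.
EXAMPLE_VALUE = data_value(120, 40, 30)
```

With these numbers the bar spans three tick intervals, so the recovered value is 3 × 30 = 90.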

  19. Chart Interpretation • Generating chart description • XML format description • Keeping data in the tabular form • Good for querying on data value or label • Natural language description • Fact-based sentences generated from templates • Good for factoid questions
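
Both description formats can be produced from the same table. The sketch below uses Python's standard `xml.etree.ElementTree`; the element and attribute names, the sentence template, and the data values are all hypothetical, because the slides do not show the actual XML schema or templates.

```python
import xml.etree.ElementTree as ET


def chart_to_xml(chart_type, title, x_title, y_title, rows):
    """Serialize chart data as XML (hypothetical schema)."""
    chart = ET.Element("chart", {"type": chart_type})
    ET.SubElement(chart, "title").text = title
    ET.SubElement(chart, "xAxis").text = x_title
    ET.SubElement(chart, "yAxis").text = y_title
    data = ET.SubElement(chart, "data")
    for label, value in rows:
        ET.SubElement(data, "entry", {"label": str(label), "value": str(value)})
    return ET.tostring(chart, encoding="unicode")


def chart_to_sentences(x_title, y_title, rows):
    """Generate one fact-based sentence per data entry from a template."""
    template = "In {x} {label}, the {y} was {value}."
    return [template.format(x=x_title, label=label, y=y_title, value=value)
            for label, value in rows]


# Made-up data for illustration only.
ROWS = [("1984", 10), ("1985", 25)]
XML_TEXT = chart_to_xml("bar", "Fatalities per year", "year",
                        "number of fatalities", ROWS)
SENTENCES = chart_to_sentences("year", "number of fatalities", ROWS)
```

The XML keeps the label/value pairs in a queryable form, while each generated sentence (for instance "In year 1984, the number of fatalities was 10.") can be appended to the document text.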

  20. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  21. Applications • Enriching OCR output • Traditional OCR output: Text (electronic) + Figures (image format) • The information in figures is not extracted • The proposed system helps to extract more information • The tabular data obtained can be used to reproduce the document in machine-readable form.

  22. Applications • Enriching OCR output (cont.) • Approach: [Figure: workflow with scanned document, segmentation, layout information, imaged text, OCR, electronic text, imaged infographic, the proposed system, XML description, document reproduction] • Question: where to insert the infographics? • Clue: look for the figure number in the text.
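
The figure-number clue can be approximated with a regular expression over the OCR text. This is only a sketch under assumptions: the reference patterns ("Figure N", "Fig. N"), the choice to insert after the first paragraph that mentions the figure, and the fallback to the end of the document are not taken from the slides.

```python
import re

# Matches references such as "Figure 2", "Fig. 2" or "fig 2".
FIGURE_REF = re.compile(r"\bFig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


def insertion_index(paragraphs, figure_number):
    """Index at which to insert a figure's description.

    Returns the position right after the first paragraph that refers to
    `figure_number`, or len(paragraphs) if no paragraph mentions it.
    """
    for index, paragraph in enumerate(paragraphs):
        for match in FIGURE_REF.finditer(paragraph):
            if int(match.group(1)) == figure_number:
                return index + 1
    return len(paragraphs)


# Made-up OCR output, one string per paragraph.
PARAGRAPHS = [
    "Road safety has been studied for decades.",
    "As shown in Figure 2, the number of fatalities rose sharply.",
    "See Fig. 1 for the survey area.",
]
```

With this text, the description of figure 2 would be inserted at index 2 and that of figure 1 at index 3, for example via `PARAGRAPHS.insert(insertion_index(PARAGRAPHS, 2), description)`.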

  23. Applications • Assisting QA systems • Question type 1: factoid question • Example: “How many fatalities were there in the year 1984?” • Solution: Add the NL description of the infographics into the original text • Question parsing and answer extraction: Cui et al.’s method based on soft pattern matching

  24. Applications • Assisting QA systems (cont.) • Question type 2: query-like question • Example: “What is the maximum number of fatalities among all years?” • Solution: Translate the question into one of the pre-defined queries • Question translation: Semantic parser proposed by Mooney et al.
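
Once a question has been translated, answering it reduces to running a pre-defined query over the extracted table. The sketch below shows what such queries might look like; the query names and the data are hypothetical, and the translation step itself (the semantic parser) is not implemented here.

```python
def run_query(name, rows, label=None):
    """Run one of a few pre-defined queries over (label, value) rows."""
    if name == "max":
        return max(rows, key=lambda row: row[1])   # (label, value) of largest value
    if name == "min":
        return min(rows, key=lambda row: row[1])   # (label, value) of smallest value
    if name == "total":
        return sum(value for _, value in rows)
    if name == "lookup":
        for row_label, value in rows:
            if row_label == label:
                return value
        return None                                # label not present in the table
    raise ValueError(f"unknown query: {name}")


# Made-up table of fatalities per year, for illustration only.
FATALITIES = [("1984", 10), ("1985", 25), ("1986", 15)]
```

For the example question about the maximum number of fatalities, `run_query("max", FATALITIES)` returns `("1985", 25)` on this invented table, and a factoid-style lookup such as `run_query("lookup", FATALITIES, label="1984")` returns 10.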

  25. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  26. Experiment Results • Chart recognition and classification: using 200 scientific chart images collected

  27. Experiment Results • Text block classification: using 200 scientific chart images collected

  28. Experiment Results • Question answering: using 10 scanned document pages from the UW database I

  29. Outline • Introduction • Syntactic and semantic information in scientific charts • Chart recognition • Chart interpretation • Applications • Experiment results • Conclusion

  30. Conclusion • A system for recognizing and interpreting imaged infographics is introduced. • Current focus is on scientific charts, a commonly used type of infographic. • The system can be generalized to handle a wider variety of infographics. • The system can be enhanced to handle more complex layouts, special effects, etc.

  31. Thank you! Questions?
