Document Analysis: Segmentation & Layout Analysis

Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008. Outline: objectives of layout analysis; classification of layout analysis methods; splitting methods; grouping methods; text-graphics-image separation.

Presentation Transcript


  1. Prof. Rolf Ingold, University of Fribourg • Master course, spring semester 2008 • Document Analysis: Segmentation & Layout Analysis

  2. Outline • Objectives of layout analysis • Classification of layout analysis methods • Splitting methods • Grouping methods • Text-Graphics-Image Separation • Text line segmentation • Word and character segmentation • Field extraction from forms

  3. Objectives of layout analysis and segmentation • The role of segmentation is to split a document image into regions of interest • Regions of interest may be of different granularity levels: graphics or text blocks, text lines, words, characters • The goal of layout analysis is to get a hierarchical description of the segmented objects

  4. Segmentation strategies • Segmentation produces a hierarchy of physical objects • Two strategies can be used: top-down segmentation (starting with the entire image, split it recursively down to elementary shapes) and bottom-up segmentation (starting at pixel level, detect connected components and group them hierarchically) • Hybrid methods combine both strategies • Segmentation methods can be data-driven, using only data properties (without contextual knowledge), or model-driven, i.e., using contextual knowledge

  5. Top-down methods • Top-down methods decompose the entire page into a hierarchy of rectangular regions • Top-down approaches use: recursive XY-cuts; horizontal and vertical projection profile analysis; white streams (spaces) analysis; the run length smearing algorithm (RLSA)

  6. Recursive XY-Cut • The page is cut alternately horizontally and vertically along white spaces • Robust for most modern printed documents • Assumes page images are unskewed • Does not work for all kinds of layouts: non-rectangular formatting, complex mosaics (illustration next) • The resulting hierarchy may not reflect the natural structure (illustration below)
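
A minimal sketch of a recursive XY-cut, assuming a binary page stored as a list of rows with 1 for black and 0 for white. The function names, the (top, left, bottom, right) box convention and the `min_gap` parameter are illustrative choices, not taken from the slides. At each level it tries one cut direction first and falls back to the other, then alternates for the sub-regions.

```python
def profile(img, box, axis):
    """Black-pixel counts per row (axis=0) or per column (axis=1) inside box."""
    top, left, bottom, right = box
    if axis == 0:
        return [sum(img[y][left:right]) for y in range(top, bottom)]
    return [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]


def trim(img, box):
    """Shrink box to the bounding box of its black pixels (None if it is all white)."""
    top, left, bottom, right = box
    rows = [y for y in range(top, bottom) if any(img[y][left:right])]
    if not rows:
        return None
    cols = [x for x in range(left, right) if any(img[y][x] for y in range(top, bottom))]
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def widest_gap(values, min_gap):
    """Longest interior run of zeros with length >= min_gap, as (start, end), or None."""
    best, start = None, None
    for i, v in enumerate(values):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(img, box=None, min_gap=2, axis=0):
    """Return leaf boxes (top, left, bottom, right); axis is the cut direction tried first."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    box = trim(img, box)
    if box is None:
        return []
    top, left, bottom, right = box
    for ax in (axis, 1 - axis):
        gap = widest_gap(profile(img, box, ax), min_gap)
        if gap is None:
            continue
        a, b = gap
        if ax == 0:   # horizontal white band: split into upper and lower parts
            parts = [(top, left, top + a, right), (top + b, left, bottom, right)]
        else:         # vertical white band: split into left and right parts
            parts = [(top, left, bottom, left + a), (top, left + b, bottom, right)]
        return [leaf for part in parts for leaf in xy_cut(img, part, min_gap, 1 - ax)]
    return [box]
```

The sketch returns only the leaves of the hierarchy. It relies on straight white bands, which is why skewed pages and non-rectangular layouts defeat it.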

  7. Top-Down Segmentation • Recursive splitting can be performed by horizontal and vertical profile analysis • Images need to be unskewed (deskewed) first!

  8. Top-Down Segmentation (2) • The order in which X-Y cuts are performed is critical

  9. White streams analysis • Principle: detect maximal rectangular white blocks, then split regions recursively according to thresholds

  10. Run Length Smearing Algorithm (RLSA) • RLSA is a morphological operator • It replaces white runs whose length is less than or equal to a given threshold with black runs • It can be applied horizontally as well as vertically
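
A minimal sketch of the smearing rule on a single scan line, assuming 1 stands for black and 0 for white. The name `rlsa_row` is illustrative. Leaving white runs that touch the line border untouched is an assumption of this sketch; the slide does not say how borders are handled.

```python
def rlsa_row(row, threshold):
    """Turn white runs (0) of length <= threshold into black (1)."""
    out = list(row)
    n = len(row)
    i = 0
    while i < n:
        if row[i] == 0:
            j = i
            while j < n and row[j] == 0:
                j += 1
            # Assumption: only runs enclosed by black pixels on both sides are smeared.
            if i > 0 and j < n and j - i <= threshold:
                out[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return out
```

With a threshold of 2, the line `[1, 0, 0, 1, 0, 0, 0, 1, 0]` keeps its run of three white pixels and its trailing white pixel, while the run of two is filled.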

  11. RLSA based segmentation • RLSA can be used to segment a page into blocks in three steps: apply RLSA horizontally; apply RLSA vertically; combine the two results with a logical AND • Threshold values are critical and have to be chosen according to the document class, using statistical white space analysis
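
The three steps can be sketched as follows, again assuming 1 for black and 0 for white. The names `smear` and `rlsa_blocks` and the two threshold parameters `t_h` and `t_v` are illustrative, and border runs are left white as a simplifying assumption.

```python
def smear(line, t):
    """Fill white runs of length <= t that are enclosed by black on both sides."""
    out = list(line)
    run_start = None
    for i, v in enumerate(line):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and run_start > 0 and i - run_start <= t:
                out[run_start:i] = [1] * (i - run_start)
            run_start = None
    return out


def rlsa_blocks(img, t_h, t_v):
    """Horizontal smearing, vertical smearing, then a pixel-wise logical AND."""
    horiz = [smear(row, t_h) for row in img]
    smeared_cols = [smear(list(col), t_v) for col in zip(*img)]   # work on columns
    vert = [list(row) for row in zip(*smeared_cols)]               # transpose back
    return [[h & v for h, v in zip(h_row, v_row)] for h_row, v_row in zip(horiz, vert)]
```

A pixel ends up black only if both passes blacken it, so a wide column gutter that survives the vertical pass keeps two blocks apart even when the horizontal threshold is large.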

  12. Bottom-up methods • Bottom-up methods start at pixel level and group pixels into a hierarchy of multi-rectangular regions (shapes delimited by horizontal and vertical segments) or arbitrary shapes • Bottom-up methods use connected component extraction and region grouping

  13. Connected components • In a binary image, a connected component is a maximal set of black pixels connected by 4- or 8-adjacency • Illustrated example (figure not reproduced): five 4-connected components, two 8-connected components
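
To illustrate how the adjacency choice changes the count, here is a small flood-fill sketch. It assumes a binary image given as a list of rows with 1 for black; the function name `count_components` is illustrative.

```python
from collections import deque


def count_components(img, connectivity=4):
    """Count black connected components under 4- or 8-adjacency."""
    h, w = len(img), len(img[0])
    steps = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    if connectivity == 8:
        steps += [(1, 1), (1, -1), (-1, 1), (-1, -1)]   # add the diagonal neighbors
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                count += 1                     # new component: flood it
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in steps:
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count
```

Two pixels that touch only at a corner form two components under 4-adjacency and a single one under 8-adjacency.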

  14. Extraction of connected components • Connected components can be extracted by different algorithms: by a one-pass scan of the full image, from top to bottom and from left to right; or by a border following algorithm, starting from a border pixel assumed to be known

  15. Scanning based CC Extraction
      for each scan line l_y
        for each black run r
          if on line l_(y-1) there is no run k-adjacent to r
            create a new component containing r
          else if on line l_(y-1) there exists one run r' k-adjacent to r
            add r to the component containing r'
          else if on line l_(y-1) there exist several runs r_i k-adjacent to r
            merge all components containing such an r_i
            add r to that component
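
A sketch of the scan-line procedure, assuming a binary image as a list of rows (1 = black). Runs are stored as (y, first_x, last_x) tuples, and merging is done with a small union-find table; the union-find structure and the names `runs_of` and `label_runs` are choices of this sketch, not something the slide prescribes.

```python
def runs_of(row):
    """Return the black runs of one scan line as (first_x, last_x) pairs."""
    runs, start = [], None
    for x, v in enumerate(list(row) + [0]):     # trailing 0 closes the last run
        if v and start is None:
            start = x
        elif not v and start is not None:
            runs.append((start, x - 1))
            start = None
    return runs


def label_runs(img, k=8):
    """Group runs into k-connected components (k = 4 or 8); returns lists of runs."""
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]       # path halving
            a = parent[a]
        return a

    slack = 1 if k == 8 else 0                  # 8-adjacency also accepts diagonal contact
    prev = []
    for y, row in enumerate(img):
        cur = [(y, s, e) for s, e in runs_of(row)]
        for r in cur:
            parent[r] = r                       # new component containing r
            for p in prev:
                if p[1] <= r[2] + slack and r[1] <= p[2] + slack:
                    parent[find(p)] = find(r)   # add r, merging components if several match
        prev = cur
    comps = {}
    for r in parent:
        comps.setdefault(find(r), []).append(r)
    return list(comps.values())
```

The three cases of the pseudocode collapse into one loop here: a run with no adjacent run above stays alone, one adjacent run triggers a single union, and several adjacent runs trigger several unions that merge their components.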

  16. Border following algorithm
      consider P0 ∈ S having a 4-neighbor Q0 ∉ S
      P ← P0; Q ← Q0; d ← direction of Q according to P
      repeat
        let Ri be the neighbor of P in direction (d+i) mod 8
        if R2 ∉ S then Q ← R2; d ← (d+2) mod 8
        else if R1 ∉ S then P ← R2; Q ← R1
        else P ← R1; d ← (d−2) mod 8
        add P to the contour
      until P = P0 and Q = Q0
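
A sketch of this border follower, with S given as a set of (x, y) pixel coordinates. It assumes the two membership tests read "R2 not in S" and "R1 not in S" and that the last update is d − 2, which is the reading that keeps Q outside S throughout. The direction numbering in `DIRS` and the function name are illustrative. Unlike the pseudocode, the sketch records P only when it moves and drops the closing repeat of P0.

```python
# Eight directions in rotation order, so that d+1 lies between d and d+2.
DIRS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def follow_border(S, p0, q0):
    """Trace the border of S from p0 (in S) with q0 a 4-neighbor of p0 outside S."""
    def neighbor(p, d):
        dx, dy = DIRS[d % 8]
        return (p[0] + dx, p[1] + dy)

    d = DIRS.index((q0[0] - p0[0], q0[1] - p0[1]))   # direction of Q as seen from P
    p, q = p0, q0
    contour = [p0]
    while True:
        r1, r2 = neighbor(p, d + 1), neighbor(p, d + 2)
        if r2 not in S:            # next 4-neighbor is background: rotate Q around P
            q, d = r2, (d + 2) % 8
        elif r1 not in S:          # step to R2; R1 is its background neighbor in direction d
            p, q = r2, r1
            contour.append(p)
        else:                      # step diagonally to R1; Q now lies in direction d-2
            p, d = r1, (d - 2) % 8
            contour.append(p)
        if p == p0 and q == q0:
            break
    if len(contour) > 1 and contour[-1] == p0:
        contour.pop()              # drop the closing repeat of the start pixel
    return contour
```

Because the first branch rotates Q past a diagonal neighbor without stepping onto it, this variant follows S as a 4-connected set.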

  17. Illustration of connected components

  18. Connected components from RLSA • Connected components can be used to detect characters • Words can be located using RLSA

  19. Grouping components • Grouping connected components is nontrivial • Grouping rules are based on relative positioning, distances and thresholds, and component classification • Parameters can be estimated statistically

  20. Allen's relations in 2D space • The relative positioning of two rectangles generates 169 configurations (13 interval relations on each axis, 13 × 13 = 169)
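
The count of 169 comes from applying Allen's 13 interval relations independently to the x and y extents of the two rectangles. A small sketch, assuming proper intervals (start < end) and rectangles given as (x1, y1, x2, y2); the function names and relation labels are illustrative.

```python
def allen(a, b):
    """Name the Allen relation between intervals a = (a1, a2) and b = (b1, b2)."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


def rect_relation(r, s):
    """Pair of Allen relations (x axis, y axis) between rectangles r and s."""
    return (allen((r[0], r[2]), (s[0], s[2])), allen((r[1], r[3]), (s[1], s[3])))
```

For example, a character box lying to the left of another on the same baseline and with the same height yields ("before", "equals").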

  21. Threshold estimation • Thresholds can be estimated from statistical distributions of: horizontal spacing, for grouping characters into words and words into text lines; vertical spacing, for grouping text lines into text blocks
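
One simple way to turn a spacing distribution into a threshold is sketched below. Placing the threshold in the widest jump between sorted gap values is a heuristic chosen for this sketch, not the estimator used in the course; it assumes the gaps form two well-separated groups (inter-character and inter-word) and that at least two gaps are available. Boxes are (left, right) extents on one text line, and all names are illustrative.

```python
def gaps_between(boxes):
    """Horizontal gaps between consecutive boxes, each box given as (left, right)."""
    boxes = sorted(boxes)
    return [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]


def estimate_threshold(gaps):
    """Midpoint of the widest jump between sorted gap values (needs >= 2 gaps)."""
    values = sorted(gaps)
    i = max(range(1, len(values)), key=lambda j: values[j] - values[j - 1])
    return (values[i - 1] + values[i]) / 2


def group(boxes, threshold):
    """Start a new group whenever the gap to the previous box exceeds the threshold."""
    boxes = sorted(boxes)
    groups = [[boxes[0]]]
    for prev, cur in zip(boxes, boxes[1:]):
        if cur[0] - prev[1] > threshold:
            groups.append([])
        groups[-1].append(cur)
    return groups
```

The same three functions apply to vertical spacing if the boxes hold (top, bottom) extents of text lines instead.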

  22. Distributions of component sizes • Components can be classified according to their size into symbols, letters, hairlines, and punctuation

  23. Region grouping

  24. Docstrum • The docstrum method [O'Gorman] uses a graph that connects each connected component to its k closest neighbors
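
A sketch of the neighbor graph, assuming each connected component is represented by its centroid (x, y). The brute-force search, the function name and the default k = 5 are choices of this sketch. Each edge carries the distance and angle between the two centroids, the kind of pairwise measurement that can then be analyzed statistically.

```python
import math


def docstrum_pairs(centroids, k=5):
    """For each centroid, list (i, j, distance, angle in degrees) to its k nearest neighbors."""
    pairs = []
    for i, (xi, yi) in enumerate(centroids):
        ranked = sorted(
            (math.hypot(x - xi, y - yi), j)
            for j, (x, y) in enumerate(centroids)
            if j != i
        )
        for dist, j in ranked[:k]:
            xj, yj = centroids[j]
            angle = math.degrees(math.atan2(yj - yi, xj - xi))
            pairs.append((i, j, dist, angle))
    return pairs
```

The brute-force search is quadratic in the number of components; a spatial index would be the usual remedy on full pages.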

  25. Model driven layout analysis [Azokly95]

  26. Generic macrostructures • In a model-driven approach, generic macrostructures are used • A formal language describes margins and separators

  27. Formal description of macrostructures
      VOLUME Article IS
        WIDTH = 160; HEIGHT = 240;
        PAGE Garde IS ... END;
        PAGE Paire IS
          HSEP hs1 = (4, 3, LEFT, RIGHT, BLANK);
          LAYER Principal IS
            VSEP vs1 = (40, 65, TOP, hs1, BLANK);
            VSEP vs2 = ([50,60], 4, hs1, BOTTOM, BLANK);
            REGION Centre = (vs2, RIGHT, hs1, BOTTOM, ANY, NORMAL);
            REGION Marge = (LEFT, vs2, hs1, BOTTOM, TEXT, SMALL);
            ...
          END;
          LAYER Secondaire IS
            HSEP hs2 = ([10,220], 2, LEFT, RIGHT, BLANK) SUBST hs1;
            HSEP hs3 = ([20,240], 2, LEFT, RIGHT, BLANK) SUBST BOTTOM;
            REGION Figure = (LEFT, RIGHT, hs2, hs3, {TABLE, GRAPHICS});
          END;
        END;
        PAGE Impaire IS ... END;
      END;

  28. Evaluation of segmentation results • Segmentation is rarely perfect; it generates • undersegmentation: real components are merged • oversegmentation: a single component is split • Special metrics have been developed to evaluate a segmentation result • Scientific contests were organized at ICDAR'03 and ICDAR'05
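
The two error types can be detected with a simple overlap count, sketched below. This is only an illustration of the definitions, not one of the metrics used in the ICDAR contests. Regions are assumed to be axis-aligned boxes (x1, y1, x2, y2), and all names are illustrative.

```python
def overlap(a, b):
    """Area of the intersection of two boxes (x1, y1, x2, y2); 0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def segmentation_errors(truth, found):
    """List ground-truth regions that were split and detected regions that merge several."""
    split = [t for t in truth if sum(overlap(t, f) > 0 for f in found) > 1]
    merged = [f for f in found if sum(overlap(t, f) > 0 for t in truth) > 1]
    return {"oversegmented": split, "undersegmented": merged}
```

A practical metric would also weigh how much area is involved, since a one-pixel overlap counts here as much as a full one.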

  29. Conclusion • Segmentation is a crucial step in document analysis • Segmentation is almost solved for • printed documents with regular layout • form analysis • Results are rarely perfect • Contextual knowledge may improve the results • Advanced pattern recognition methods are required • Segmentation remains an open problem for uncontrolled handwriting and graphical documents

  30. Component hierarchy
