Inferring Structure Information from Typography

Inferring Structure Information from Typography. Christian Fuß Dipl.-Inform. Felix Gatzemeier Michael Kirchhof Dipl.-Inform. Oliver Meyer Department of Computer Science III, RWTH Aachen. Overview. Context Deriving Structure Information: Partitioning Typographic abstraction

Presentation Transcript


  1. Inferring Structure Information from Typography Christian Fuß Dipl.-Inform. Felix Gatzemeier Michael Kirchhof Dipl.-Inform. Oliver Meyer Department of Computer Science III, RWTH Aachen

  2. Overview • Context • Deriving Structure Information: • Partitioning • Typographic abstraction • Determine Type • Conclusion • Cooperation project of • Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)

  3. Today’s Publication Chain [Figure: publication chain diagram; labels: Author, Writing, Proprietary document format, Conversion, Standard format, Publisher, Copy Editing, Typesetting, Web Publ., Reader, Reading]

  4. Classification of Submissions [Figure: classification diagram; labels: Submissions, TEX, MS Word, Unformatted, Formatted, Somehow Formatted, Correctly Formatted, Structured (XML)]

  5. Basic Assumptions • Known target document type • Textual nature • Typographic markup • Consistent markup

  6. Deriving Structure Information • In: MS Word document • Record Formatting (Format Tuples) • Locate the Elements • Reduce Format Tuples to Patterns • Determine Types • Out: XML document • Also interactively

  7. Format Tuples • The basic typographic abstraction • FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman] • Here: Font, Size, Weight, Variation • Planned: Search expressions modulo Text • More general: Including regular expressions of text content or context.
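The format tuple abstraction can be illustrated with a minimal sketch. It assumes the document has already been read into a stream of (character, format) pairs; the names `FormatTuple`, `record_runs`, and the sample stream are hypothetical and not taken from the prototype.

```python
from itertools import groupby
from typing import NamedTuple


class FormatTuple(NamedTuple):
    """The four attributes named on the slide: font, size, weight, variation."""
    font: str
    size: str
    weight: str
    variation: str


def record_runs(chars):
    """Merge consecutive characters that share a format tuple into runs.

    chars: iterable of (character, FormatTuple) pairs (assumed input shape).
    Returns a list of (text, FormatTuple) pairs.
    """
    runs = []
    for fmt, group in groupby(chars, key=lambda pair: pair[1]):
        runs.append(("".join(ch for ch, _ in group), fmt))
    return runs


plain = FormatTuple("Times", "22pt", "regular", "roman")
bold = plain._replace(weight="bold")

# Hypothetical character stream: the word "this" is set in bold.
stream = (
    [(c, plain) for c in "Is "]
    + [(c, bold) for c in "this"]
    + [(c, plain) for c in " a dagger?"]
)
runs = record_runs(stream)
```

Each run then carries exactly one format tuple, which is the unit the later steps work on.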

  8. Locate the Elements • Tree-Partitioning of Formatted Character Streams on • Format Tuple changes • Paragraph breaks • Nesting of Inline Elements • Is this a dagger? <ft1> • Is this a dagger? <ft1 <ft2> ft1> • Is this a dagger? <ft1 <ft2> > • Is this a dagger? <ft1 <ft2 <ft3> > > • Format-To-Type Map (FormatTuple → ElementType): ft1 (times, 22pt, reg, roman) → dummyType1; ft2 (times, 22pt, bold, roman) → dummyType2; ft3 (times, 22pt, reg, italic) → dummyType3
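One possible way to obtain the nestings shown on this slide is a stack-based heuristic, sketched below. The slide does not state the actual partitioning rule, so this is an assumption: a run whose format is already open on the stack returns to that element, and any other format opens a new nested inline element. The functions `partition` and `render` and the string labels `ft1` to `ft3` are illustrative only.

```python
def partition(runs):
    """Build a nested element tree from one paragraph's runs.

    runs: non-empty list of (text, format) pairs.
    Returns a node {"format": ..., "children": [str or node, ...]}.
    Assumed heuristic: the first run's format is the paragraph-level element.
    """
    if not runs:
        raise ValueError("a paragraph needs at least one run")
    root = {"format": runs[0][1], "children": []}
    stack = [root]
    for text, fmt in runs:
        open_formats = [node["format"] for node in stack]
        if fmt in open_formats:
            # Close every element nested inside the one that uses this format.
            del stack[open_formats.index(fmt) + 1:]
        else:
            # A format change to an unseen format opens a nested inline element.
            node = {"format": fmt, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        stack[-1]["children"].append(text)
    return root


def render(node):
    """Show the tree in a bracket notation similar to the slide's."""
    inner = "".join(
        child if isinstance(child, str) else render(child)
        for child in node["children"]
    )
    return f"<{node['format']} {inner}>"


one_inline = render(partition(
    [("Is ", "ft1"), ("this", "ft2"), (" a dagger?", "ft1")]
))
two_levels = render(partition(
    [("Is ", "ft1"), ("this ", "ft2"), ("a", "ft3"), (" dagger?", "ft1")]
))
```

Paragraph breaks are not modeled here; the sketch would be applied to each paragraph separately.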

  9. Format Patterns • Identity too restrictive → wildcard generalization • Is this a dagger? (,,): Font: Times, Times, Times → *; Size: 22pt, 22pt, 22pt → *; Weight: regular, bold, regular → bold; Variation: roman, roman, roman → * • (, a, b) = (a, a, b); (a, b, ) = (a, b, b) • (, a, ) propagated to paragraph level • Format-To-Type Map (FormatPattern → ElementType): fp1 (*, *, regular, *) → dummyType1; fp2 (*, *, bold, *) → dummyType2; fp2b (*, *, bold, roman) → dummyType2; fp3 (*, *, regular, italic) → dummyType3
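The wildcard generalization on this slide can be read as follows, though the transcript has lost the operator symbols, so this is an interpretation rather than the authors' definition: an attribute is kept only where the element differs from a neighbouring run, and everything it shares with its surroundings becomes `*`. The sketch below also assumes that the blank in `(, a, b) = (a, a, b)` stands for a missing neighbour. The name `generalize` is hypothetical.

```python
from typing import Optional, Tuple

Fmt = Tuple[str, str, str, str]  # (font, size, weight, variation)
WILDCARD = "*"


def generalize(prev: Optional[Fmt], cur: Fmt, nxt: Optional[Fmt]) -> Fmt:
    """Reduce a format tuple to a pattern relative to its neighbours.

    Assumption: a missing neighbour is replaced by the current tuple,
    mirroring the slide's (, a, b) = (a, a, b) and (a, b, ) = (a, b, b).
    """
    prev = cur if prev is None else prev
    nxt = cur if nxt is None else nxt
    return tuple(
        # Keep the value only if it differs from at least one neighbour.
        value if (value != p or value != n) else WILDCARD
        for p, value, n in zip(prev, cur, nxt)
    )


plain = ("Times", "22pt", "regular", "roman")
bold = ("Times", "22pt", "bold", "roman")

pattern = generalize(plain, bold, plain)
```

For a run with no neighbours at all the sketch returns only wildcards; the slide instead propagates that case to the paragraph level, which is not modeled here.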

  10. Determine Types • Replace dummy types in Format-To-Type Map • Preconfiguration by publisher • Controlled Learning from the author • Format-To-Type Map (FormatPattern → ElementType): (*, *, regular, *) → Body; (*, *, bold, *) → FirstTerm; (*, *, bold, roman) → FirstTerm; (*, *, regular, italic) → Emphasis
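A lookup against the Format-To-Type Map from this slide might look like the sketch below. The map entries are copied from the slide. The tie-breaking rule is an assumption, since the slide does not say how overlapping patterns are resolved: here the matching pattern with the fewest wildcards wins. The names `matches` and `determine_type` are hypothetical.

```python
# Format-To-Type Map as shown on the slide: (pattern, element type).
FORMAT_TO_TYPE = [
    (("*", "*", "regular", "*"), "Body"),
    (("*", "*", "bold", "*"), "FirstTerm"),
    (("*", "*", "bold", "roman"), "FirstTerm"),
    (("*", "*", "regular", "italic"), "Emphasis"),
]


def matches(pattern, fmt):
    """A pattern matches if every non-wildcard position equals the tuple's value."""
    return all(p == "*" or p == v for p, v in zip(pattern, fmt))


def determine_type(fmt, table=FORMAT_TO_TYPE):
    """Return the element type for a format tuple, or None if nothing matches.

    Assumption: among several matching patterns, the most specific one
    (fewest wildcards) is preferred.
    """
    candidates = [(pattern, etype) for pattern, etype in table if matches(pattern, fmt)]
    if not candidates:
        return None
    _, etype = min(candidates, key=lambda c: c[0].count("*"))
    return etype


emphasis = determine_type(("Times", "22pt", "regular", "italic"))
body = determine_type(("Times", "22pt", "regular", "roman"))
```

Without the specificity rule, the italic tuple would also match the `Body` pattern, which is why some ordering or preference between patterns is needed.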

  11. Further useable information • Allowed context from the DTD • Paragraph standard format • Text patterns • Bullets • Enumeration • Whitespace • ASCII Markup (Is *this* a dagger?) • Format pattern match confidence
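The ASCII markup item can be made concrete with a small text-pattern sketch. The regular expression and the function name `find_ascii_emphasis` are assumptions for illustration; the slide only gives the example string.

```python
import re

# Assumed convention: a span wrapped in single asterisks marks emphasis.
ASCII_EMPHASIS = re.compile(r"\*([^*]+)\*")


def find_ascii_emphasis(text):
    """Return (start, end, inner_text) for each *emphasized* span in text."""
    return [(m.start(), m.end(), m.group(1)) for m in ASCII_EMPHASIS.finditer(text)]


spans = find_ascii_emphasis("Is *this* a dagger?")
```

Spans found this way could be treated like an additional format change when locating elements, alongside the typographic evidence.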

  12. Motivational aspects • Quick feedback on formal correctness • Publication preview while keeping format freedom • (Via XSL) flexible previews of other formats • New structure-based functionality: • Structure editing • Structure evaluation • Document templates

  13. Conclusion • Summary • 4-step inference • Record format tuples • Locate the elements • Reduce tuples to patterns • Determine types • Increase efficiency of publication chain • Provide unobtrusive structuring for non-expert authors • Plans • Cautious extension of inference • Validation of document • Evaluation with authors
