Open standards in use in localisation

Open standards in use in localisation - an engineering approach. Andrés Vega, LRC XIII Localisation4All, Dublin, Ireland, 2nd October 2008. About the Author - Andrés Vega. 8+ years of experience as a Localisation Engineer with Tek Translation International.


Open standards in use in localisation - an engineering approach
Andrés Vega, LRC XIII Localisation4All, Dublin, Ireland, 2nd October 2008

About the Author - Andrés Vega

  • 8+ years of experience as a Localisation Engineer with Tek Translation International.

  • Specializing in complex project engineering with special focus on CMS, encodings and complex scripts.

  • Previous work as a programming languages teacher: OO programming, C and Java.

  • Background in Chemistry and Healthcare.


  • Why Standards?

  • Unicode

  • OpenType Fonts

  • XML

  • CMS

  • TMX


  • XLIFF

  • TBX and SRX

  • Final thoughts and Q&A

Why Standards?

  • Allow faster technology development

    • Assembling standard components

    • Concentrating effort on specialisation

    • Increase competence, focused on features (not compatibility)

  • Facilitate inter-operability

    • Open standards allow information to be shared

      (Not locked on proprietary standards)

    • Complementary tools may be developed

    • Choose tool/resource for each job

    • Guarantee future compatibility

  • Provide conformance validation mechanisms

    • Standard verification serves as QA procedure


Unicode

  • Challenges

    • Too Many Character sets:

      Three great ‘families’ (ANSI, DBCS, BiDi): three application types

    • Multilingual data (storage, display, processing)

    • Cross-platform and character set inter-conversion issues

  • What Unicode is

    • Universal character encoding standard by the Unicode Consortium

    • 21-bit character set with 3 main encoding forms (UTF-32, UTF-16, UTF-8)

    • Not just the character set

      • Character properties (Name, Category, Casing, Decomposition, …)

      • Annexes, Technical Reports: (Comparison, Sorting, Hyphenation, …)

  • What Unicode is not

    • Glyph repertoire: glyphs provided are examples, not canonical!

    • Unicode alone does not provide language support!
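The difference between the character set, its encoding forms and its character properties can be explored with Python's standard `unicodedata` module. The following is a minimal sketch; the helper name `describe` and the sample characters are illustrative choices, not part of the original slides.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, a few Unicode properties and the encoded size of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        # Same character, three encoding forms, three different byte counts.
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),
        "utf32_bytes": len(ch.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("A", "é", "中", "😀"):
        print(describe(sample))
```

The properties come from the Unicode Character Database shipped with the Python interpreter, so names and categories depend on the Unicode version of that interpreter.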

Unicode (Benefits and Issues)

  • Unicode benefits

    • One vendor neutral encoding standard for all languages

    • Stable, but it keeps evolving

    • Multilingual rendering/storage/transfer (No conversion - No corruption)

    • Unified content processes (Globalized, Web enabled)

    • Internationalisation

    • Easy conversion from/to/between legacy codepages

  • Issues or drawbacks with Unicode

    • Size (ANSI: 1 byte, DBCS: 1-2 bytes, UTF-8: 1-4 bytes, UTF-16: 2 or 4 bytes)

    • UniHan related (Font dependence, ‘Gaiji’ and variants)

    • Inconsistencies on implementation choices across scripts

    • Several ways to represent the same character (pre-composed vs. combining sequences)

  • Implementation issues

    • Script Enabling requires: Input, Display, Storage, Retrieval, Output

    • Bidirectional support, Complex Scripts issues

  • Implementation status
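Two of the drawbacks listed above, storage size and multiple representations of the same character, can be illustrated in a few lines. This is a sketch under stated assumptions: `cp1252` stands in for "ANSI" and `cp932` for a DBCS codepage, and the function names are hypothetical.

```python
import unicodedata

PRECOMPOSED = "\u00e9"   # é as a single code point
COMBINING = "e\u0301"    # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def encoded_sizes(text: str) -> dict:
    """Byte length of text in several encodings; None if the codepage cannot store it."""
    sizes = {}
    for codec in ("cp1252", "cp932", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


if __name__ == "__main__":
    print(PRECOMPOSED == COMBINING, same_text(PRECOMPOSED, COMBINING))
    print(encoded_sizes("日本語"))
```

Normalising before comparison is one common way to keep TM lookups and searches from missing visually identical strings.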

Unicode (Transition Issues)

  • Transition issues

    • Mixed content: legacy and UTF8 (FrameMaker)

      (Table labels, in extraction order: FM7; FM8 + update; Import old; corrupted; Filter version; English seen OK; vars & template variables; corrupts ANSI)

    • Localisation tools, filters, etc not fully adapted or tested

      Example: Style names containing extended characters

      New filter for FrameMaker 8: English names are OK (UTF-8 = ASCII)

      German designed file: Filter does not accept UTF-8 Style names

    • Backwards conversions: Unicode version saved as non-Unicode version
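The mixed-content and backward-conversion problems listed above can be reproduced outside FrameMaker. The sketch below is not the behaviour of any specific filter; it only simulates a tool that reads UTF-8 bytes as Windows-1252 ("ANSI") and a save into a legacy codepage. Function names and sample strings are illustrative.

```python
def misread_as_ansi(text: str) -> str:
    """Simulate a legacy tool that decodes UTF-8 bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


def save_as_legacy(text: str, codepage: str = "cp1252") -> str:
    """Simulate a backward conversion: unsupported characters become '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)


if __name__ == "__main__":
    # ASCII-only names survive, because UTF-8 and ASCII share the same bytes.
    print(misread_as_ansi("Heading"))
    # Names with extended characters are corrupted.
    print(misread_as_ansi("Überschrift"))
    # Saving Cyrillic text into a Western codepage loses it.
    print(save_as_legacy("Привет Größe"))
```

This matches the observation that English style names pass through unchanged while a German-designed file does not.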

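[Diagram: four FrameMaker file states, each listed as content / variables / template]

  • UTF-8 content, ANSI variables, ANSI template

  • ANSI content, ANSI variables, ANSI template

  • UTF-8 content, ANSI variables, ANSI template

  • UTF-8 content, corrupt variables, ANSI template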


Unicode Workflow

  • Pre-Unicode Workflow (FrameMaker)

    Character corruption risks in all orange (middle 3 groups) steps

    Final document presents issues in TOC and index generation and in searches

  • Unicode Workflow:

[Workflow diagrams; only the labels survive]

  • Stages: Files to localize, File Preparation, Translation & Review, DTP and Merge, Back Conversion

  • Codepage-specific files: Western, CE, Cyrillic, Turkish, Greek and Baltic RTF and fonts; FM with Design, CE, Cyrillic, Turkish, Greek and Baltic fonts

  • Multilingual target document with several ANSI fonts

  • Document & Design Fonts

  • UTF-8 FM with original design fonts

  • UTF-8 XML

  • UTF-16 TTX and fonts

OpenType fonts

  • Challenges

    • Two font families (TrueType and PostScript), two font technologies

    • Inter-platform issues

  • Benefits of OpenType

    • Support large character sets (Unicode, multiscript)

    • Glyph variants supported: Solves Unicode UniHan ambiguities

    • Supports advanced typography

    • Font embedding control

  • Features

    • Contain both TrueType and PostScript outline data

    • Glyph substitution

    • Glyph positioning

    • Script and language information


XML

  • eXtensible Markup Language (Meta-language for markup languages)

  • Used to define, share and validate information (data and structure)

  • An XML document contains

    • XML declaration : <?xml version='1.1' encoding='UTF-8' standalone='yes'?>

    • Document Type declaration(s) <!DOCTYPE root SYSTEM "rootDTD.dtd">

    • Elements <element attribute="value">Content</element> or <element/>

    • Other: comments, entities/NCRs, instructions, conditional sections

  • Specific Syntax (well-formed XML)

    • Only one root element

    • Tags in nested open/close pairs: <tag> </tag>

    • Element names obey certain conventions

    • Elements may contain attributes

  • DTD (Valid XML)

    • Defines rules on structure, valid tags and attributes and valid data

    • Guarantees reliable data exchange between different systems

    • Can be included in each XML, but is normally external
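The well-formedness rules above (single root, properly nested tag pairs) can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; note that this module checks well-formedness only and does not validate against a DTD, which would need a validating parser. The function name is illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<root><item id='1'>Text</item><empty/></root>"))
    print(is_well_formed("<root><a></root>"))   # tags not closed in nested pairs
    print(is_well_formed("<a/><b/>"))           # more than one root element
```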

XML (Benefits)

  • Benefits

    • Simple (XML is plain text) but can embed any content type

    • Platform independent, Unicode encoded

    • Content is easily validated cross-platform: data transfer is safer

    • Structured (defines structural relationships within data)

    • Open, extensible and well-supported standard

    • Metadata and version control capable

    • Format independent

    • Powerful data transformation tools (XSL): Multiple outputs

XML (Localisation benefits and issues)

  • Localisation benefits

    • Structured: Content detached & merged (updates handling)

    • XML support easily implemented in Localisation processes/tools

    • Easy validation versus DTD

    • Extensible: XML based localisation standards: XLIFF, TMX, TBX,...

    • Metadata (source/target version control, updates, element status)

    • Format independent

      • Single-sourcing (localized once, published into many formats)

      • Source content and formatting changes are not interdependent

      • Content localisation and proofreading before formatting (DTP)

  • Issues

    • Transition needs to be well planned and performed

    • Segmentation issues (DTD needs to be multilingual aware)
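The "content detached & merged" idea can be sketched as an extract step and a merge step over the same XML structure. Everything in this example is an assumption made for illustration: the sample topic, the `translate='no'` attribute convention, and the positional keys. It also ignores mixed content (tail text), which a real filter would need to handle.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<topic id='t1'><title>Safety</title>"
    "<p>Read the manual.</p><p translate='no'>TEK-100</p></topic>"
)


def extract(root: ET.Element) -> dict:
    """Collect translatable element text, keyed by document position and tag."""
    segments = {}
    for index, elem in enumerate(root.iter()):
        if elem.get("translate") == "no":
            continue  # protected content stays in the structure
        if elem.text and elem.text.strip():
            segments[f"{index}:{elem.tag}"] = elem.text
    return segments


def merge(root: ET.Element, translations: dict) -> ET.Element:
    """Write translated text back into the untouched structure."""
    for index, elem in enumerate(root.iter()):
        key = f"{index}:{elem.tag}"
        if key in translations:
            elem.text = translations[key]
    return root


if __name__ == "__main__":
    tree = ET.fromstring(SOURCE)
    print(extract(tree))
    merge(tree, {"1:title": "Seguridad", "2:p": "Lea el manual."})
    print(ET.tostring(tree, encoding="unicode"))
```

Because the structure never leaves the source file, a formatting change in the template does not invalidate the translated segments.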


CMS

  • What are Content Management Systems?

    • Set of tools configured around a data repository (database)

    • Designed to manage information in small meaningful bits

    • Information is isolated from format

    • Have workflow capabilities, version control and change tracking

    • Store localized content layers (as other alternative content layers)

  • General benefits

    • Granularity (no redundancy)

    • Reuse (content reuse and multi output)

    • Improved Quality and Consistency

    • Single-source and multi-publishing

    • Easy rebranding/reformatting

    • Metadata info and version control

    • Workflow and Automation

  • Localisation benefits

    • Workflow status control features

    • Localisation of updates via content deltas: improved time-to-market

    • Localisation independent from output format (better matching)
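Localising updates via content deltas amounts to comparing the current content units with the last localised version and sending out only what changed. A minimal sketch, assuming each content unit has a stable identifier (the ids and texts below are invented for the example):

```python
def content_delta(previous: dict, current: dict) -> dict:
    """Return the units that are new or changed since the last localised version."""
    return {uid: text for uid, text in current.items() if previous.get(uid) != text}


def removed_units(previous: dict, current: dict) -> list:
    """Return the ids of units that no longer exist in the current version."""
    return sorted(uid for uid in previous if uid not in current)


if __name__ == "__main__":
    v1 = {"intro": "Welcome.", "step1": "Open the lid.", "step2": "Press start."}
    v2 = {"intro": "Welcome.", "step1": "Open the cover.", "step3": "Close the cover."}
    print(content_delta(v1, v2))
    print(removed_units(v1, v2))
```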

CMS (Issues)

[Diagram: handling of line feeds (LF) in XML sent for translation]

  • Translation in XML: LF not visible

  • Broken segmentation: LF also formats lists

  • Remove meaningless LF; export remaining as tags

  • LF converted to tag; meaningful tags internal

  • Issues

    • Authoring for reuse (topic model, single-source, cross-reference)

    • Segmentation issues

      LF Chars (0A) No Validation! Segmentation issue

    • Localisation readiness

      • CMS must be multilingual enabled (storage, I/O, processing)

      • Localisation workflow support

      • Strong version control and version rollback

      • Capability to export up-to-date paired TM content

      • Integration with LQA tools

    • Does not increase ROI in the short run (DTP is still needed!)
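The line-feed problem can be handled in a pre-processing step: line feeds that only wrap text are removed, and line feeds that carry meaning (for example list formatting) are exported as an inline tag so that the translation tool does not segment on them. This is a sketch only; the `<lf/>` placeholder and the function name are hypothetical, and deciding which line feeds are meaningful depends on the CMS content model.

```python
import re

LF_TAG = "<lf/>"  # hypothetical inline placeholder for a meaningful line feed


def prepare_segment(text: str, keep_line_breaks: bool) -> str:
    """Normalise line feeds (0x0A) in one content unit before translation."""
    if keep_line_breaks:
        # Meaningful LF: keep it, but as a tag the translation tool treats as inline.
        return text.strip("\n").replace("\n", LF_TAG)
    # Meaningless LF: collapse it, with any surrounding whitespace, to one space.
    return re.sub(r"\s*\n\s*", " ", text).strip()


if __name__ == "__main__":
    print(prepare_segment("Press the\nbutton to\n start.", keep_line_breaks=False))
    print(prepare_segment("Item one\nItem two", keep_line_breaks=True))
```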




CMS Localisation Workflow


[Workflow diagram; step labels in extraction order]

  • Client Validators

  • Select only delta content

  • Translation (TTX format)

  • Revision (TTX format)

  • Content Validation in Tracked-changes RTF

  • Prepared for Proofreading (Colour-coded RTF format)

  • Insertion of Validation changes (TTX & TMs)

  • Full document in XML

  • Preprocessing of XML

  • Layout & Consistency Validation in PDF file

  • Import to FrameMaker

  • Delivery in FrameMaker

  • DTP in FrameMaker


TMX

  • What is TMX?

    • Translation Memory eXchange

    • Standard by LISA (Localization Industry Standards Association)

    • Provides a standard method for TM data description

    • XML-compliant (validated against its TMX DTD)

    • Uses other ISO standards for date, time, lang, country

    • Consists of

      • Container format specification

        • Translation unit elements <tu>

        • Optional format description elements (font change,...)

        • Subflows (footnotes, index entries)

      • Low-level meta-markup format for segment content

        • Segment element <seg>
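The container structure described above can be made concrete with a small round trip: write translation units into a TMX document and read them back. This is a sketch based on the TMX 1.4 element names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`); the header values such as `creationtool="example-script"` are placeholders, and inline format elements and subflows are left out.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang


def build_tmx(pairs, srclang: str, tgtlang: str) -> ET.Element:
    """Build a minimal TMX tree from (source, target) segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # placeholder values
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((srclang, source), (tgtlang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


def read_tmx(tmx: ET.Element) -> list:
    """Return one {language: segment text} dict per translation unit."""
    return [
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        for tu in tmx.iter("tu")
    ]


if __name__ == "__main__":
    doc = build_tmx([("Open the file.", "Abra el archivo.")], "en-US", "es")
    text = ET.tostring(doc, encoding="unicode")
    print(text)
    print(read_tmx(ET.fromstring(text)))
```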

TMX (Benefits and Drawbacks)

  • Benefits

    • Transfer TM assets across tools/vendors

    • Provides clients with control over their translated assets

      • Non-proprietary and vendor neutral

      • Can be integrated with LQA tools

    • Provides Translators/Vendors with freedom of tool choice

      • Specialized tools share TM assets

      • Tools may be outdated, assets will not

      • Facilitates work distribution/outsourcing

  • Issues

    • Tag handling

      • TMX DTD cannot validate inline codes

      • TMX compliance level

    • Segmentation issues


XLIFF

  • XML Localisation Interchange File Format

  • Standard by OASIS (XLIFF Technical Committee)

  • Tool-neutral XML-based standard localisation resource container format

  • To store/transfer/manipulate localizable content, context and other info

  • Has built-in support for CAT tools and related standards (TBX, TMX)

  • Features:

    • Translation suggestions (TM, Glossary, MT) to approve or edit

    • Metadata: Translate, notes, context info, version

    • Hierarchical data structures

    • Abstraction of formatting and inline codes:

      • Structural formatting stored in the skeleton file

      • Inline formatting can be dealt with in two ways

        Replaced by g (paired) and x (isolated) tags (OpenTag style)

        Encapsulated into bpt, ept (paired), it or ph (isolated) tags

Xliff description
XLIFF (Description)

  • Separates localizable and non-localizable content

    • Non-localisable: Skeleton (separate or embedded)

    • Localizable 'file' Elements with Header (metadata) and Body

  • Body can contain 'trans-unit' and 'bin-unit' elements

  • Each trans-unit can have

    <trans-unit id="abc123" resname="resourceID" restype="string" translate="yes">

    unique id, resource id, resource type, translate yes/no

    <source xml:lang="en-US">Translatable content.</source>

    Translatable content source and language

    <target xml:lang="es" state="needs-review-translation">Traducción.</target>

    Currently validated translation

    <alt-trans match-quality="100%" tool="TM"> <source>Translatable content.</source> <target xml:lang="es">Contenido traducible.</target> </alt-trans>

    alt-trans translation suggestion(s)

    </trans-unit> (closing tag)
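The trans-unit structure above can be read with a standard XML parser. The sketch below assumes an XLIFF 1.2 file (namespace `urn:oasis:names:tc:xliff:document:1.2`); the sample document, the file name `ui.properties` and the second unit are invented for the example, and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="ui.properties" source-language="en-US"
        target-language="es" datatype="plaintext">
    <body>
      <trans-unit id="abc123" resname="resourceID" restype="string" translate="yes">
        <source xml:lang="en-US">Translatable content.</source>
        <target xml:lang="es" state="needs-review-translation">Traducción.</target>
      </trans-unit>
      <trans-unit id="abc124" translate="no">
        <source xml:lang="en-US">TEK-100</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def translatable_units(xliff_text: str) -> list:
    """Return (id, source, target, state) for every unit not marked translate="no"."""
    root = ET.fromstring(xliff_text)
    units = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        if unit.get("translate", "yes") == "no":
            continue
        target = unit.find("x:target", NS)
        units.append((
            unit.get("id"),
            unit.findtext("x:source", namespaces=NS),
            target.text if target is not None else None,
            target.get("state") if target is not None else None,
        ))
    return units


if __name__ == "__main__":
    for row in translatable_units(SAMPLE):
        print(row)
```

A review tool could filter on the `state` value to list only the units that still need attention.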

XLIFF (Benefits and Drawbacks)

  • Benefits: For the translation process

    • One common format on which to translate

    • Control on Translatable/Non-translatable content

    • Better information handling (context, notes, metadata)

    • Better TM matching due to formatting abstraction

    • Concurrent tool processing visible at review stage

    • Support for all localisation phases

    • Supports metrics info on each trans-unit

  • Benefits: For localisation tool developers

    • Common platform for tool developers to write to

    • Easy adoption of new formats (new filters to XLIFF)

    • All generic XML processing benefits

  • Drawbacks

    • Conversion tools needed into XLIFF and back

    • Many XLIFF features are not implemented by most tools

    • Segmentation is inherent to XLIFF file generation

    • As opposed to tailored tools, WYSIWYG is difficult to attain

XLIFF Workflow

  • No XLIFF Scenario

  • XLIFF Scenario

[Diagram labels: Translator A, Translator B, Reviewer A, Reviewer B, SGML Editor, Software Editor, "Many Formats!", "Many Filters!"]

Other LISA standards: TBX, SRX

  • TBX

    • What is TBX?

      • Term Base eXchange standard by LISA

      • XML based, vendor-neutral, open standard

    • Benefits

      • Better control of terminology (source consistency)

      • Reduced glossarisation effort (localisation phase)

      • Platform and tool independent glossaries (global consistency)

    • Current status

      • TBX Basic (Lighter approach)

      • TBX Checker

  • SRX

    • What is SRX?

      • Segmentation Rules eXchange format

      • Describes how localisation tools segment text for processing

    • Benefits

      • Standardises segmentation process (avoid segmentation issues)
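SRX describes segmentation as an ordered list of rules, each marked as break or no-break and made of a "before break" and an "after break" regular expression; the first rule that matches at a position decides. The sketch below is modelled loosely on that idea and does not read SRX files. The two rules and the abbreviation list are illustrative assumptions.

```python
import re

# (break?, before-break regex, after-break regex); checked in order, first match wins.
RULES = [
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # exception: no break after these abbreviations
    (True, r"[.?!]", r"\s+"),              # break after final punctuation plus whitespace
]


def segment(text: str, rules=RULES) -> list:
    """Split text into segments using ordered break / no-break rules."""
    compiled = [
        (brk, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for brk, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for brk, before, after in compiled:
            # The before pattern must end exactly at pos, the after pattern must start there.
            if before.search(text, 0, pos) and after.match(text, pos):
                if brk:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides for this position
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


if __name__ == "__main__":
    print(segment("Dr. Vega spoke in Dublin. Was it useful? Yes!"))
```

When two tools share the same rule list they should produce the same segments, which is what keeps TM matches stable across tools.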

Final Thoughts

  • Unicode

    • Use Always: If tool does not support it, convert at end stage

  • XML

    • Powerful for single-source, multi-output requirements

  • CMS

    • Costly. Depends on volume. First consider XML model, then migrate

  • TMX

    • Use for safe tool-to-tool TM transfer, especially from software into documentation


  • XLIFF

    • Not fully implemented. Good alternative for Java or Web content.

    • Use it to unify side processes (LQA)

  • TBX

    • Use to exchange glossary info. Good for clients

  • SRX

    • Very much needed, but lacks implementation.

About Tek: Multilingual translation and localisation business solutions designed to meet the needs of Life Sciences, IT and Manufacturing

  • Since 1961

  • Over 65 languages

  • Expert Resources and Service

  • Located in US, Spain, Brazil, China, Ireland, UK, Denmark

  • Scalability

  • Simplification and standardisation

  • ISO 9001:2000 certification

  • Follow-the-sun

  • Solutions-based approach for best business value

  • TekOneWorld Platform for your language & industry needs

  • Business Intelligence

  • Language Quality Solutions

  • Open Connectivity, WW Collaboration

Thank You

Q & A

Andrés Vega Muñoz
Localisation Engineer
Tek Translation International
Email: [email protected]