EACC to Unicode Migration



OCLC CJK Users Group 2006 Annual Meeting, April 8, 2006, San Francisco. EACC to Unicode Migration. Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library, lblkt@ust.hk.


OCLC CJK Users Group 2006 Annual Meeting

April 8 2006, San Francisco

EACC to Unicode Migration

Ki Tat LAM

Head of Library Systems

The Hong Kong University of Science and Technology Library

lblkt@ust.hk


Contents

  • Migrating systems from EACC to Unicode environments

    • Why migrate?

    • What has been done?

    • HKIUG Unicode Initiatives

  • Issues

    • EACC/Unicode mapping table

    • Round-trip cross-walk

    • Improving searching with TSVCC Linking

    • Font display




曆法 / 历法

[System for determining the beginning, length and divisions of a year]


曆法 was incorrectly displayed as 歷法. Is it a data entry error, a display problem, or something else?


Why Migrate?

  • EACC (East Asian Character Code, ANSI Z39.64-1989) was introduced into the CJK library community by RLG in the early 1980s (known as REACC at that time)

  • It was an important milestone: for the first time, we had a unified C-J-K standard with a relatively large character set (about 16,000 characters) for use in bibliographic records


Why Migrate? [cont.]

  • By adopting EACC as an alternate character set in MARC 21 (at that time it was called USMARC), libraries with East Asian collections were able to share and use CJK cataloging records via the OCLC and RLIN cataloging platforms

  • However, great effort is required for integrated library systems (ILS) to make use of the EACC-based CJK data in the records


Why Migrate? [cont.]

  • Communicating in EACC is extremely difficult because EACC never gained support in the mainstream IT environment

    • EACC has hardly ever been supported by operating systems, fonts, input methods, editors, etc., either in the old days or today

    • EACC is also unlikely to be supported by web browsers in the current Internet era

      Why? – EACC’s three-byte coding structure is alien to the binary computing world


Why Migrate? [cont.]

  • Due to its unpopularity, EACC became a frozen standard, and there is no way to fix errors or add characters

  • If EACC is stored natively in the bibliographic database, then in order to input and display CJK characters at the application layers (such as OPAC and record editor), ILS will have to rely on lossy mapping tables to map EACC to other character encodings (e.g. BIG5, GB, JIS, KSC and UTF-8)


Why Migrate? [cont.]

  • Unicode comes to the rescue

    • Single standard for written texts of almost all languages in the world

    • Has more than 96,000 characters, most of them CJK

    • An active standard, with constant updates

    • Widely adopted and supported in the current IT environment – major operating systems and web browsers, plus many devices and applications, speak the Unicode language


Why Migrate? [cont.]

  • After more than 25 years of EACC's influence, it is unlikely that all library systems and data can be migrated overnight to the Unicode mainstream

  • It is anticipated that there will be a period of parallel operation, with co-existing EACC and Unicode bibliographic data interchanging among systems, resulting in confusion and data loss

  • Even if systems have migrated to Unicode, there are still problems that require attention


What has been done?

  • MARC 21 specifications for MARC-8 and UCS/Unicode environment

  • LC’s code tables for mapping between MARC-8 and Unicode

  • OCLC WorldCat migration to Unicode platform

  • OCLC Connexion’s Unicode support

  • LC’s Voyager upgrade

  • INNOPAC/Millennium

  • HKIUG Unicode Initiatives


MARC 21 Specifications

  • In 2000, the Library of Congress issued:

    Specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html]

  • MARC-8 means characters are encoded either in one 8-bit byte (e.g. ASCII) or in three 8-bit bytes (e.g. EACC)



21 62 62 21 39 25 21 30 21

黃 大 一

A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in EACC in MARC-8 environment
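
The hex dump above can be read three bytes at a time, one EACC code per character. The following is a minimal sketch, assuming the bytes have already been extracted from the record and that the MARC-8 escape sequences that switch into the CJK set have been stripped. The function name `split_eacc` is hypothetical.

```python
def split_eacc(hex_dump: str) -> list[str]:
    """Group a space-separated hex dump into three-byte EACC codes."""
    octets = hex_dump.split()
    if len(octets) % 3 != 0:
        raise ValueError("EACC data must be a multiple of three bytes")
    # Join every run of three octets into one six-digit code.
    return ["".join(octets[i:i + 3]).upper() for i in range(0, len(octets), 3)]


codes = split_eacc("21 62 62 21 39 25 21 30 21")
print(codes)
```

The three resulting codes correspond, in order, to the three characters that the slide shows beneath the dump.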


MARC 21 Specifications [cont.]

  • UCS/Unicode Environment[http://www.loc.gov/marc/specifications/speccharucs.html]

    • Use UTF-8 as character encoding

    • Leader position 9 contains value “a”

    • Field 066 (Character Sets Present) is not needed

    • The script identification information in subfield 6 (Linkage) can be dropped

    • Lengths specified by number of 8-bit bytes, rather than number of characters.
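
Two of the points above lend themselves to a quick check: the leader flag and the byte-based lengths. The sketch below assumes the 24-character leader is available as a string with 0-based positions. The sample leader values and function names are made up for illustration.

```python
def is_unicode_record(leader: str) -> bool:
    """Return True when leader position 9 (0-based) contains 'a'."""
    return len(leader) == 24 and leader[9] == "a"


def length_in_bytes(data: str) -> int:
    """Lengths are counted in UTF-8 bytes, not in characters."""
    return len(data.encode("utf-8"))


unicode_leader = "00714cam a2200205 a 4500"  # hypothetical sample leader
marc8_leader = "00714cam  2200205 a 4500"    # same leader with position 9 blank

print(is_unicode_record(unicode_leader), is_unicode_record(marc8_leader))
print(len("黃大一"), length_in_bytes("黃大一"))
```

Each of the three CJK characters takes three bytes in UTF-8, so the byte length is three times the character count here. That the MARC-8 byte count for the same three characters would also be nine is a coincidence of this example, not a general rule.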


MARC 21 Specifications [cont.]

  • Unicode combining rule for diacritics, i.e. combining marks follow rather than precede the character they modify
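
The reordering implied by this rule can be sketched as follows. The sketch assumes each MARC-8 character has already been mapped to its Unicode code point but is still in MARC-8 order, with marks before the base character. The function name `marks_after_base` is hypothetical.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move combining marks so that they follow the base character they modify."""
    out, pending = [], []
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)   # hold marks until their base character arrives
        else:
            out.append(ch)
            out.extend(pending)  # emit held marks after the base
            pending.clear()
    out.extend(pending)          # stray marks with no base are kept at the end
    return "".join(out)


marc8_order = "\u0301e"          # combining acute accent, then "e"
unicode_order = marks_after_base(marc8_order)
print(unicode_order == "e\u0301")
```

Once the marks follow their base, standard normalization (for example NFC) can compose the pair into a single precomposed character where one exists.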



A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in UCS/Unicode environment


MARC 21 Specifications [cont.]

  • LC issued code tables for mapping between MARC-8 and UCS/Unicode:

    • Not only for EACC, but also for other Latin and non-Latin scripts such as ANSEL, Hebrew, Cyrillic, Arabic and Greek

    • Provide essential information for ILS’s Unicode implementation


MARC 21 Specifications [cont.]

  • UNICODE-MARC Discussion List[http://listserv.loc.gov/listarch/unicode-marc.html]

    • Since July 2005

    • Active discussion on issues concerning Unicode implementation in MARC 21

    • Some of the discussion was summarized as MARC Proposal 2006-04, “Technique for conversion of Unicode to MARC-8,” which was approved by MARBI in January 2006, with changes. [http://www.loc.gov/marc/marbi/2006/2006-04.html]


OCLC WorldCat and Connexion

  • WorldCat – migrated to Oracle with Unicode support

  • Released Connexion client software

    • Unicode-based, running on Windows

    • Comprehensive CJK support

    • Rely on Windows’ IME for input of CJK characters

    • Export and import of records in both MARC-8 and UCS/Unicode environments.


LC’s Catalog

  • Its Voyager system was upgraded recently to provide Unicode support

  • Capable of displaying and searching CJK data in 880 fields

  • Allows export of records in MARC-8 and Unicode environments

  • Issued a cataloging policy position paper for the Unicode implementation at LC (March 2006), with details on current implementation and future opportunities[http://www.loc.gov/catdir/cpso/unicode.pdf]


INNOPAC/Millennium

  • INNOPAC has been supporting EACC, and CJK in general, since its implementation at HKUST Library 15 years ago

  • Millennium clients run on Windows XP with Unicode support

  • CJK records are stored in EACC internally, but the system provides an option to migrate the storage to Unicode

  • HKIUG Unicode Task Force is working with the vendor to improve the Unicode storage


HKIUG Unicode Initiatives

  • HKIUG – Hong Kong Innovative Users Group

    • Founded in 1996

    • Members from all 15 INNOPAC libraries in Hong Kong and Macau, including the eight Hong Kong government-funded universities

  • HKIUG Unicode Initiatives – since 2003, to work closely with the ILS vendor (Innovative Interfaces Inc.) to improve INNOPAC / Millennium’s CJK support


HKIUG Unicode Initiatives [cont.]

  • Achievements:

    • Developed HKIUG Version of the EACC to Unicode mapping table

    • Resolved EACC to Unicode multi-mapping problem

    • Developed TSVCC (Traditional, Simplified, Variant Chinese Characters) linking tables

  • HKIUG Unicode Task Force - to maintain the Unicode and TSVCC tables and to assist the vendor on Unicode migration; members from CUHK, CITYU, HKUST and HKU


Migration Issues

  • The need for an EACC/Unicode mapping table

  • Multi-mapping and round trip failure problems

  • TSVCC linking

  • Font display problem


HKIUG EACC/Unicode Table

  • First released in September 2003; last revised in August 2005

  • Contains:

    • 15672 EACC characters

    • 7043 pure CCCII characters

  • Mapping for EACC characters - follows LC as much as possible

  • The 7043 “Pure CCCII” characters have no EACC equivalent; they are included to avoid too many missing characters


HKIUG EACC/Unicode Table [cont.]

  • Identified:

    • 160 multi-mapping linked cases

    • 49 multi-mapping unlinked cases

  • Causing failure in round-trip crosswalk
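
A multi-mapping is a Unicode code point that is the target of more than one EACC code. Below is a minimal sketch of how such cases could be detected, using a four-row sample built from codes quoted elsewhere in this presentation. The dictionary layout and the function name are assumptions, not the format of the HKIUG table.

```python
from collections import defaultdict

# Sample rows only; the real table has thousands of entries.
EACC_TO_UNICODE = {
    "214349": "\u66c6",  # 曆
    "274349": "\u5386",  # 历
    "21462A": "\u6b77",  # 歷
    "27462A": "\u5386",  # 历
}


def find_multi_mappings(table):
    """Return the Unicode characters reached from more than one EACC code."""
    by_target = defaultdict(list)
    for eacc, char in table.items():
        by_target[char].append(eacc)
    return {char: sorted(codes) for char, codes in by_target.items() if len(codes) > 1}


multi = find_multi_mappings(EACC_TO_UNICODE)
print(multi)
```

In this sample only U+5386 is reported, because both 274349 and 27462A map to it.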


Round-trip Crosswalk Failure

Records pass from the library (EACC) to OCLC WorldCat (Unicode) and back:

  1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆

  2. Import to OCLC: Connexion finds {274349} in the mapping table and stores 历 in Unicode U+5386

  3. Export from OCLC: Connexion finds {274349} and {27462A} in the mapping table and decides to output 历 in EACC {27462A}

  4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷
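
The failure described above can be reproduced with a two-row table. This is a sketch only: the rule for choosing among several candidate EACC codes on export (here, the highest code wins) is an assumption picked to match the outcome shown, not Connexion's actual logic, and all names are hypothetical.

```python
# Sample forward table: both EACC codes map to U+5386.
EACC_TO_UNICODE = {
    "274349": "\u5386",  # 历 as the simplified form of 曆
    "27462A": "\u5386",  # 历 as the simplified form of 歷
}


def build_reverse(forward):
    """Build a Unicode-to-EACC table that keeps one EACC code per character.

    Assumption: when several EACC codes share a target, the highest code wins.
    """
    reverse = {}
    for eacc in sorted(forward):
        reverse[forward[eacc]] = eacc  # later (higher) codes overwrite earlier ones
    return reverse


def round_trip(eacc, forward, reverse):
    """Import an EACC code into Unicode, then export it back to EACC."""
    return reverse[forward[eacc]]


UNICODE_TO_EACC = build_reverse(EACC_TO_UNICODE)
for code in EACC_TO_UNICODE:
    result = round_trip(code, EACC_TO_UNICODE, UNICODE_TO_EACC)
    print(code, "->", result, "ok" if result == code else "changed")
```

Whichever tie-breaking rule is used, one of the two EACC codes cannot survive the round trip, because the Unicode side stores a single code point for both.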


U+5386

Export

Export output is {27 46 2A} – incorrect!


TSVCC Linking

  • When searching for 历法 (“Li fa”), you would want to retrieve records that contain either:

    • 历法

    • 曆法

      where 曆 and 历 have a Traditional–Simplified relationship

  • Similarly, when searching for 屏, you would want to retrieve its variant 屛

  • Requires linking T,S,V forms during searching
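
One common way to link T, S and V forms at search time is to fold every member of a link group to a single representative before comparing. The sketch below assumes that approach and uses two hypothetical link groups; it is not the HKIUG implementation, and the real tables contain more members per group.

```python
# Hypothetical link groups; each set holds forms that should match one another.
LINK_GROUPS = [
    {"曆", "历"},  # Traditional / Simplified
    {"屏", "屛"},  # character / Variant
]


def build_key_map(groups):
    """Map every character in a group to one representative (the lowest code point)."""
    key_of = {}
    for group in groups:
        key = min(group)
        for ch in group:
            key_of[ch] = key
    return key_of


def normalize(text, key_of):
    """Fold linked characters to their representative; leave others unchanged."""
    return "".join(key_of.get(ch, ch) for ch in text)


def search(query, titles, key_of):
    """Return titles that contain the query after both sides are folded."""
    folded_query = normalize(query, key_of)
    return [t for t in titles if folded_query in normalize(t, key_of)]


KEY_OF = build_key_map(LINK_GROUPS)
titles = ["曆法", "历法", "歷史"]
print(search("历法", titles, KEY_OF))
```

The folded form is meant for matching only. The stored titles are returned unchanged, so the display still shows what was catalogued.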


In LC’s Online Catalog, searching title 曆法 will retrieve 3 hits.

Searching with 历, the simplified form of 曆, will however retrieve 3 other hits.

慈禧太後? Excuse me, is this a typo? Shouldn’t it be 慈禧太后?

Google is capable of linking 餘 and 余


TSVCC Linking [cont.]

  • HKIUG Unicode Task Force constructed two versions of TSVCC Linking tables

    • EACC Version [released November 2004]

    • Unicode Version [draft created March 2006]

      for ILSs that store characters in EACC and in Unicode respectively


TSVCC Linking [cont.]

  • EACC Version

    • Table M (80 entries) – linking relationship is not purely from EACC, e.g.

      214349 曆| 274349 历| 2D4349 暦| 21462A 歷| 27462A 历| 4B462A 歴| #U+5386 multi-mapped 27462A,274349

    • Table V (3065 entries) – linking relationship is purely from EACC, e.g.

      21306C 仇| 2D306C 讎| 33306C 讐| 4B306C 雠
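
The two sample entries above share a simple layout: `code character` cells separated by `|`, with an optional `#` comment at the end. A minimal parser for that layout might look like the following. The on-disk format of the released tables is not given here, so the layout is assumed from the slide and `parse_link_entry` is a hypothetical name.

```python
def parse_link_entry(line):
    """Split one linking-table entry into (code, character) pairs and a comment."""
    body, _, comment = line.partition("#")
    members = []
    for cell in body.split("|"):
        cell = cell.strip()
        if not cell:
            continue  # skip the empty cell left by a trailing "|"
        code, char = cell.split()
        members.append((code, char))
    return members, comment.strip()


entry_v = "21306C 仇| 2D306C 讎| 33306C 讐| 4B306C 雠"
entry_m = ("214349 曆| 274349 历| 2D4349 暦| 21462A 歷| 27462A 历| 4B462A 歴| "
           "#U+5386 multi-mapped 27462A,274349")

members_v, comment_v = parse_link_entry(entry_v)
members_m, comment_m = parse_link_entry(entry_m)
print(len(members_v), len(members_m), comment_m)
```

In the Table V entry all members share the last four hex digits of the EACC code (306C), which illustrates how the linking relationship can come purely from EACC. The Table M entry joins two such families through the multi-mapped U+5386.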


TSVCC Linking [cont.]

  • Unicode Version

    • Still in draft construction

    • So far has 3061 entries, e.g.

      U+5C5B 屛| U+5C4F 屏| U+6452 摒| #EACC link ([27/21]415A) AND Variant form of U+5C4F is U+5C5B

      U+965D 陝| U+965C 陜| U+9655 陕| #EACC link ([23/29]4A44) AND Simplified form of U+965D is U+9655


TSVCC Linking [cont.]

  • Plan to include linking of New/Old forms in the TSVCC Unicode Version, e.g.


TSVCC Linking [cont.]

  • Results of implementing TSVCC Linking:

    • Improvement in searching – higher recall

    • Trade-off – lower precision

    • If search results are sorted/displayed in TSVCC normalized form, misleading and inaccurate display may occur - such as the OCLC Connexion browse list display problem mentioned previously


Font Issues

  • Do not believe in “what you see is what you have”, because what you see varies with fonts!

  • For example, the following glyphs have different code points in EACC:


Font Issues [cont.]

  • But in Unicode, they are assigned the same code points. Depending on the font in use, you will see different glyphs:


Conclusion

  • How far are we?

    • Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records

    • ILS vendors are working very hard to implement and enhance the Unicode support

    • Libraries and CJK experts are providing advice and suggesting solutions


Conclusion [cont.]

  • We have reviewed various migration issues:

    • The need for an accurate EACC/Unicode mapping table

    • Extending to non-EACC characters

    • Multi-mappings and round-trip failure

    • TSVCC Linking

    • Font display issues


Conclusion [cont.]

  • The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only happen when the majority of systems store and use data natively in Unicode.

  • Unlike EACC, Unicode does not have a built-in linking relationship. Implementing TSVCC linking is essential for improving searching.


Additional References

  • Assessment of Options for Handling Full Unicode Character Encodings in MARC 21 -- Part 1: New Scripts (January 2004) and Part 2: Issues (June 2005). [http://www.loc.gov/marc/marbi/list-report.html]

  • Joan M. Aliprand. The structure and content of MARC 21 records in the Unicode environment. Information technology and libraries, v.24, no.4, December 2005, p.170-179.

  • Wong, Philip and K.T. Lam. HKIUG’s Unicode projects : untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]


Thank You!