
HKIUG Unicode Task Force and the EACC to Unicode Migration

7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library. HKIUG Unicode Task Force and the EACC to Unicode Migration. Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library, lblkt@ust.hk.


Presentation Transcript


  1. 7th Annual Hong Kong Innovative Users Group Meeting, 11 and 12 December 2006, HKUST Library • HKIUG Unicode Task Force and the EACC to Unicode Migration • Ki Tat LAM, Head of Library Systems, The Hong Kong University of Science and Technology Library, lblkt@ust.hk

  2. Contents • HKIUG Unicode Task Force • CJK/Unicode Resources and the Unicode Version of TSVCC Table • Migrating INNOPAC’s storage environment from EACC to Unicode • MARC-8 and Unicode Environments • Outstanding Issues

  3. Observations …

  4. 曆法 / 历法 [System for determining the beginning, length and divisions of a year]

  5. 曆法 was incorrectly displayed as 歷法. Is it a data entry error? A display problem? Or something else?

  6. Observation #1: • Although OCLC WorldCat’s storage environment has been migrated to Unicode and its Connexion client is Unicode-based, the work is not yet finished. There are still problems that require attention • How about INNOPAC and its Unicode Storage Environment? How ready is it for existing EACC-based sites to migrate to?

  7. U+5386

  8. Export (in MARC-8)

  9. Export output is {27 46 2A} – incorrect!

  10. Round-trip Crosswalk Failure (diagram: EACC Library ⇄ OCLC WorldCat, Unicode; slide label "Step 2: U+7CFB 系") • 1. Library contributes 历 in EACC {274349}, which is the simplified form of 曆 • 2. Import to OCLC: Connexion finds {274349} in mapping table and stores 历 in Unicode U+5386 • 3. Export from OCLC: Connexion finds {274349} and {27462A} in mapping table and decides to output 历 in EACC {27462A} • 4. Library receives 历 in EACC {27462A}, which is the simplified form of 歷
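
The round-trip failure on this slide comes from a many-to-one mapping: two EACC codes point to the same Unicode code point, so the reverse lookup has to pick one. The following is a minimal sketch of that effect. The two table entries are taken from the slide. The function names and the tie-breaking rule (take the last candidate in sorted order) are assumptions for illustration and are not Connexion's actual logic.

```python
# Two-entry excerpt of an EACC-to-Unicode table (values from the slide).
EACC_TO_UNICODE = {
    "274349": 0x5386,  # 历 as the simplified form of 曆
    "27462A": 0x5386,  # 历 as the simplified form of 歷
}


def to_unicode(eacc: str) -> int:
    """Import step: map an EACC code to its Unicode code point."""
    return EACC_TO_UNICODE[eacc]


def to_eacc(codepoint: int) -> str:
    """Export step: map a code point back to EACC.

    Several EACC codes may share one code point. The choice made here
    (last candidate in sorted order) is a hypothetical tie-break.
    """
    candidates = sorted(e for e, u in EACC_TO_UNICODE.items() if u == codepoint)
    return candidates[-1]


def round_trip(eacc: str) -> str:
    """EACC -> Unicode -> EACC."""
    return to_eacc(to_unicode(eacc))


if __name__ == "__main__":
    sent = "274349"
    received = round_trip(sent)
    print(sent, "->", received, "(lossless)" if sent == received else "(changed)")
```

Whatever tie-break is chosen, at least one of the two EACC codes cannot survive the round trip, because the information that distinguished them was discarded on import.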

  11. Observation #2: • The failure of round-trip crosswalk between systems will continue to be a problem until everyone interchanges MARC records purely in Unicode. This will only be achieved when the majority of systems store and use data natively in Unicode • There is an immediate need for INNOPAC sites to migrate to the Unicode storage environment!

  12. HKIUG Unicode Task Force • In 2003-2004, an ad hoc group of systems librarians and catalogers from member libraries worked closely with Innovative Interfaces, Inc. (III) on issues related to CJK and the EACC to Unicode mappings. • Developed HKIUG Version of the EACC to Unicode mapping table • Resolved EACC to Unicode multi-mapping problem • Began drafting TSVCC (Traditional, Simplified, Variant Chinese Characters) table

  13. HKIUG Unicode Task Force [2] • February 2005, the HKIUG Unicode Task Force was officially established to: • maintain the CJK/Unicode resources produced in 2003-2004; • develop new resources, such as the Unicode Version of the TSVCC table; • facilitate the searching, display and retrieval of CJK records in library catalogs; and • assist member libraries in migrating from EACC-based character encoding to Unicode

  14. HKIUG Unicode Task Force [3] • Members of the Task Force: • CHAN Wai Ming (Secretary), University of Hong Kong • HO Yee Ip, Chinese University of Hong Kong • LAM Ki Tat (Chair), The Hong Kong University of Science and Technology • Joanna PONG, City University of Hong Kong • SUN Zehua, The Hong Kong University of Science and Technology • Mr. Philip WONG, City University of Hong Kong • Recruiting new members – we welcome colleagues to join forces …

  15. HKIUG Unicode Task Force [4] • Achievements in 2006: • July 2006 - finished and released the Unicode Version of the TSVCC Table • August 2006 - released the CJK/Unicode Resources developed over the past three years to the Internet for open access [http://hkiug.ln.edu.hk/unicode/] • November 2006 – visited Hong Kong Shue Yan College (HKSYC) Library to study its Unicode Storage Environment; and reported outstanding issues to III.

  16. TSVCC Table - Unicode Version • When searching 历法 “Li fa”, you would expect to retrieve records that have: • 历法 • 曆法 where 曆 and 历 have a Traditional – Simplified relationship • Similarly, when searching 屏, you would expect to retrieve its Variant 屛 • This requires linking the T, S and V forms during searching

  17. TSVCC Table - Unicode Version [2] • Results of implementing TSVCC Linking: • Improvement in searching – higher recall • Trade-off – lower precision • If search results are sorted/displayed in TSVCC normalized form, misleading and inaccurate display may occur - such as the OCLC Connexion browse list display problem mentioned previously

  18. TSVCC Table - Unicode Version [3] • HKIUG Unicode Task Force constructed two versions of TSVCC tables • EACC Version [1.0 released August 2005] • Unicode Version [1.0 released July 2006] for INNOPAC systems that store characters in EACC and in Unicode respectively

  19. TSVCC Table - Unicode Version [4] • TSVCC link cases collected in the Unicode Version are: • derived from the EACC Version, e.g. EACC link, U+XXXX multi-mapped; • harvested from Unicode Consortium’s Unihan Database, e.g. kSimplifiedVariant, kZVariant; • proposed by the Unicode Task Force members, e.g. hkiugSimplifiedVariant, hkiugZVariant

  20. TSVCC Table - Unicode Version [5] • Examples of Link Cases in Unicode Version:
      U+66C6 曆 | U+5386 历 | U+66A6 暦 | U+6B77 歷 | U+6B74 歴 | U+F98B 曆 | U+F98C 歷 | #EACC link ([21/27/2D]4349),([21/27/4B]462A) AND U+5386 multi-mapped 27462A,274349 AND kZVariant of U+F98B is U+66C6 AND kZVariant of U+F98C is U+6B77
      U+5C5B 屛 | U+5C4F 屏 | U+6452 摒 | #EACC link ([27/21]415A) AND hkiugZVariant of U+5C4F is U+5C5B
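
One way to picture how such link cases could drive searching is to fold every member of a link case onto a single representative before comparing strings. The sketch below assumes that approach. The two link cases are copied from the slide (code points only), while `build_index`, `normalize` and the choice of the first entry as representative are hypothetical and do not describe how INNOPAC implements TSVCC linking.

```python
import re

# Code points of the two link cases shown on the slide.
LINK_CASES = [
    "U+66C6 | U+5386 | U+66A6 | U+6B77 | U+6B74 | U+F98B | U+F98C",
    "U+5C5B | U+5C4F | U+6452",
]


def build_index(link_cases):
    """Map every character in a link case to that case's first character."""
    index = {}
    for line in link_cases:
        codepoints = [int(h, 16) for h in re.findall(r"U\+([0-9A-F]{4,6})", line)]
        representative = chr(codepoints[0])  # assumption: first entry wins
        for cp in codepoints:
            index[chr(cp)] = representative
    return index


def normalize(text, index):
    """Replace each linked character by its representative form."""
    return "".join(index.get(ch, ch) for ch in text)


if __name__ == "__main__":
    index = build_index(LINK_CASES)
    simplified = "\u5386\u6cd5"   # 历法
    traditional = "\u66c6\u6cd5"  # 曆法
    other = "\u6b77\u6cd5"        # 歷法
    print(normalize(simplified, index) == normalize(traditional, index))
    print(normalize(other, index) == normalize(traditional, index))
```

Both comparisons match, which illustrates the trade-off named on slide 17: linking raises recall, but 歷法 now matches a search for 曆法 as well, so precision drops.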

  21. TSVCC Table - Unicode Version [6] • Support linking of CJK Compatibility Ideographs • e.g. [U+F92F 勞] in the previous screen dump, a variant from KS C5601-1987 • Support linking of forms used differently in Mainland China and in Hong Kong, for example:

  22. TSVCC Table - Unicode Version [7] • We welcome contributions from CJK experts and colleagues of member libraries to enhance the TSVCC tables • e.g. projects to establish TSVCC links from Hangul Syllables, Hiragana and Katakana to CJK ideographs

  23. MARC-8 and Unicode Environments • In 2000, the Library of Congress issued: Specifications to distinguish the encoding of MARC 21 records in the original (MARC-8) environment and in the new UCS/Unicode environment [http://www.loc.gov/marc/specifications/speccharintro.html] • In the MARC-8 environment, a character is encoded either in one 8-bit byte (e.g. ASCII) or in three 8-bit bytes (e.g. EACC)

  24. [Screenshot] A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in EACC in MARC-8 environment • Highlighted bytes: 21 62 62 (黃), 21 39 25 (大), 21 30 21 (一)

  25. MARC-8 and Unicode Environments [2] • UCS/Unicode Environment [http://www.loc.gov/marc/specifications/speccharucs.html] • Use UTF-8 as character encoding • Leader position 9 contains value “a” • Field 066 (Character Sets Present) is not needed • The script identification information in subfield 6 (Linkage) can be dropped • Lengths are specified by number of 8-bit bytes, rather than number of characters.
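
Two of the points on this slide are easy to check programmatically: the encoding flag in Leader position 9, and the fact that lengths count bytes rather than characters. The sketch below assumes raw ISO 2709 records held as `bytes`. The helper names and the sample leaders are made up for illustration.

```python
def marc_environment(record: bytes) -> str:
    """Report the character coding scheme from Leader position 9.

    "a" means UCS/Unicode and a blank means MARC-8. Anything else is
    reported as "unknown".
    """
    if len(record) < 24:
        raise ValueError("a MARC 21 leader is 24 bytes long")
    flag = record[9:10]
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


def field_length(text: str) -> int:
    """Length of field data in the Unicode environment: UTF-8 bytes."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical 24-byte leaders that differ only at position 9.
    unicode_leader = b"00000nam" + b" " + b"a" + b"2200000 a 4500"
    marc8_leader = b"00000nam" + b" " + b" " + b"2200000 a 4500"
    print(marc_environment(unicode_leader))
    print(marc_environment(marc8_leader))
    print(len("\u4e2d\u570b"), field_length("\u4e2d\u570b"))  # characters vs bytes
```

For 中國 the character count is 2 but the UTF-8 byte count is 6, which is why directory lengths and offsets must be recomputed when a record moves between environments.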

  26. MARC-8 and Unicode Environments [3] • Unicode combining rule for diacritics, i.e. combining marks follow rather than precede the character they modify
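
The ordering rule can be seen with Python's standard `unicodedata` module: decomposing a precomposed letter yields the base character first and the combining mark second. This is a small illustration only; the sample character é is an assumption and does not come from the slides.

```python
import unicodedata

# Decompose U+00E9 (e with acute) into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u00e9")

for ch in decomposed:
    # combining() returns 0 for base characters and a non-zero
    # canonical combining class for combining marks.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
```

The base letter `e` comes out before U+0301 COMBINING ACUTE ACCENT, so a converter from MARC-8, where the diacritic precedes the letter it modifies, has to swap the order.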

  27. A MARC 21 bibliographic record in ISO2709 format viewed in Notepad, showing CJK characters encoded in UTF-8 in UCS/Unicode environment

  28. Migrating from EACC to Unicode • The following INNOPAC systems are in Unicode Storage Environment: • HKSYC (Hong Kong Shue Yan College) • HKALL (the INN-Reach system for the eight universities in Hong Kong) • HKUST Tool Testing Database

  29. Migrating from EACC to Unicode [2] • HKSYC Visit • A group of systems librarians and catalogers from member libraries visited HKSYC Library in November 2006 to learn how its INNOPAC system works in Unicode Storage Environment • A number of outstanding issues were identified and/or confirmed • If you have migrated to Unicode storage or plan to migrate now, you might also face the same problems

  30. Migrating from EACC to Unicode [3] • Outstanding Issues • TSVCC Linking not turned on; and even if turned on, it would not be using the latest HKIUG version • When entering CJK characters via Millennium Editor, such as U+8AAC 説 and U+7CB5 粵, and saving the record, these characters would be stripped away and not saved - destructive bug awaiting fixing

  31. Migrating from EACC to Unicode [4] • Export from INNOPAC - only export in MARC-8 Environment was provided. There should be an option for users to export in Unicode Environment • III replied that this option is available • Import (Load) into INNOPAC - only import in MARC-8 Environment was provided. There should be an option for users to load MARC records in Unicode Environment (i.e. in UTF-8). • III replied that this option is available

  32. Migrating from EACC to Unicode [5] • It seemed that sorting at HKSYC is still EACC-based • The sorting key seemed to be constructed from: [No. of strokes][EACC code value] • For example, as observed from WebPAC’s URL, the sorting key for 中國 is: “04{213034}11{21376f}”. It should instead be sorted by Unicode code value, i.e. “04{u4e2d}11{u570b}”
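
The proposed key format, [No. of strokes][Unicode code value], can be sketched as follows. The stroke counts and the expected key string are the ones quoted on the slide. The function name and the tiny stroke table are hypothetical, and the real INNOPAC key construction is not documented here.

```python
# Stroke counts for the two characters used on the slide.
STROKES = {
    "\u4e2d": 4,   # 中
    "\u570b": 11,  # 國
}


def unicode_sort_key(text: str, strokes: dict) -> str:
    """Build a key of the form NN{uXXXX} per character.

    NN is the two-digit stroke count and XXXX the lowercase hexadecimal
    Unicode code point, following the format suggested on the slide.
    """
    return "".join("%02d{u%04x}" % (strokes[ch], ord(ch)) for ch in text)


if __name__ == "__main__":
    print(unicode_sort_key("\u4e2d\u570b", STROKES))
```

For 中國 this produces the string given on the slide, with no EACC value involved at any point.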

  33. Migrating from EACC to Unicode [6] • Also need to fix the illogical sorting order found in HKUST’s Tool Testing Database: (1) ASCII space/punctuation (e.g. :); (2) ASCII numerals (e.g. 1); (3) CJK characters with pinyin (e.g. 中); (4) ASCII alphabets (e.g. a); (5) CJK characters without pinyin (e.g. を)

  34. Migrating from EACC to Unicode [7] • Pure Unicode Storage Environment • Once migrated to Unicode Storage Environment, there should be no need to map back and forth between EACC and Unicode, except in some necessary conversion routines • In order to maintain a natively Unicode environment, EACC dependencies should be identified and eliminated

  35. Conclusion • How far are we towards native Unicode? • Both LC and OCLC have done enormous work in enabling and promoting the use of Unicode in MARC records • ILS vendors including III are working very hard to implement and enhance the Unicode support • Libraries and CJK experts are providing advice and suggesting solutions

  36. Conclusion [2] • Migrating INNOPAC to Unicode • We have reviewed various outstanding issues as found in INNOPAC’s Unicode Storage Environment • We hope these issues will be resolved quickly so that HKIUG member libraries can start to migrate their systems to Unicode • HKIUG Unicode Task Force will continue to work closely with III to enable a smooth migration

  37. Additional Readings • K.T. Lam. EACC to Unicode migration. OCLC-CJK Users Group 2006 Annual Meeting. [http://hdl.handle.net/1783.1/2500] • Wong, Philip and K.T. Lam. HKIUG’s Unicode projects: untangling the chaotic codes. HKIUG Annual Meeting 2005. [http://hdl.handle.net/1783.1/2429]

  38. Thank You!
