
Charset to UTF

kurt

Presentation Transcript


  1. Charset to UTF

  2. Good Old Old Days • Is there any other language but American? • EBCDIC • ASCII

  3. Good Old Days • ASCII: 0-127 – Latin • 128-255 – French, Italian, German, etc., or Greek, or Hebrew, or Russian, etc. (one 8-bit code page at a time)
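
To make the code-page problem concrete, here is a minimal sketch that decodes the same byte under four ISO-8859 code pages with Perl's core Encode module. The helper name `code_point_in` and the choice of byte 0xE0 are illustrative assumptions, not part of the original slides.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the slides): the Unicode code point that
# a single byte stands for when read under the given 8-bit charset.
sub code_point_in {
    my ($charset, $byte) = @_;
    return ord decode($charset, $byte);
}

# One byte, four meanings: Latin, Cyrillic, Greek, Hebrew.
for my $charset (qw(iso-8859-1 iso-8859-5 iso-8859-7 iso-8859-8)) {
    printf "%-10s 0xE0 -> U+%04X\n", $charset, code_point_in($charset, "\xE0");
}
```

The byte itself carries no hint about which code page it belongs to, so the reader has to know the charset from somewhere else. That is why mixing these languages in one file was not possible with 8-bit code pages.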

  4. Multibyte • Japanese – SJIS, EUC • Chinese – Big5, GB • Korean

  5. Babel’s Tower http://www.i18nguy.com/unicode/codepages.html#czyborra

  6. Many Languages • Hebrew • Japanese • Arabic • All in the same doc/line/screen

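  7. Unicode • All languages • Each char – 2 bytes – 65,536 possible values in the original 16-bit design (Unicode has since grown beyond 16 bits, up to U+10FFFF) • Problem: text is no longer a byte string but a wide-char string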

  8. UTF-8 • One to one with Unicode • 1-3 bytes per char within the 16-bit range, 4 bytes beyond it; ASCII chars stay 1 byte • Well-defined algorithm
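
The "well defined algorithm" can be sketched by hand in a few lines of Perl. This is an illustration only: the helper name `utf8_bytes` is hypothetical, and real programs would normally use the Encode module or a PerlIO layer instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): encode one Unicode code point
# as a UTF-8 byte string, following the standard bit layout.
sub utf8_bytes {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);

    # 1 byte:  0xxxxxxx (plain ASCII)
    return pack("C", $cp) if $cp < 0x80;

    # 2 bytes: 110xxxxx 10xxxxxx
    return pack("C2", 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F))
        if $cp < 0x800;

    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return pack("C3", 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F))
        if $cp < 0x10000;

    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return pack("C4", 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

# "A", HEBREW LETTER ALEF, EURO SIGN, and one code point above U+FFFF.
for my $cp (0x41, 0x05D0, 0x20AC, 0x1F600) {
    my $hex = join " ", map { sprintf "%02X", $_ } unpack("C*", utf8_bytes($cp));
    printf "U+%04X -> %s\n", $cp, $hex;
}
```

Under these assumptions the loop prints `41`, `D7 90`, `E2 82 AC` and `F0 9F 98 80`, so ASCII text is unchanged while a Hebrew letter takes two bytes.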

  9. Hebrew to Unicode
     05D0  60  HEBREW LETTER ALEF
     05D1  61  HEBREW LETTER BET
     05D2  62  HEBREW LETTER GIMEL
     05D3  63  HEBREW LETTER DALET
     05D4  64  HEBREW LETTER HE
     05D5  65  HEBREW LETTER VAV
     05D6  66  HEBREW LETTER ZAYIN
     05D7  67  HEBREW LETTER HET
     05D8  68  HEBREW LETTER TET
     05D9  69  HEBREW LETTER YOD
     05DA  6A  HEBREW LETTER FINAL KAF
     05DB  6B  HEBREW LETTER KAF
     05DC  6C  HEBREW LETTER LAMED
     05DD  6D  HEBREW LETTER FINAL MEM
     05DE  6E  HEBREW LETTER MEM
     and likewise for each charset

  10. Need for Conversion • Existing data • New data: editors work in specific charsets, not in UTF/Unicode

  11. Brute Force • For each org_char: look up its Unicode code point, then convert to UTF
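
A minimal sketch of the brute-force approach, using only the 15 rows listed on the "Hebrew to Unicode" slide. It assumes the second column of that table is the source byte value (0x60 to 0x6E); the helper name `brute_force_to_utf8` and the pass-through rule for unmapped bytes are also assumptions made for illustration.

```perl
use strict;
use warnings;

# Lookup table built from the rows shown on the "Hebrew to Unicode" slide.
# Assumption: source bytes 0x60..0x6E map to U+05D0..U+05DE, in order.
my %to_unicode = map { $_ => 0x05D0 + ($_ - 0x60) } 0x60 .. 0x6E;

# Hypothetical helper: convert a byte string one character at a time.
# Bytes missing from the table keep their own number as the code point.
sub brute_force_to_utf8 {
    my ($bytes) = @_;
    my $text = join "",
        map { chr($to_unicode{$_} // $_) } unpack("C*", $bytes);
    utf8::encode($text);    # character string -> UTF-8 bytes, in place
    return $text;
}

# ALEF, BET, a space, and the digit 1.
my $utf8 = brute_force_to_utf8("\x60\x61 1");
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $utf8)), "\n";
# prints: D7 90 D7 91 20 31
```

This works, but it needs a hand-built table for every charset, which is what the Encode-based versions on the following slides avoid.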

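  12. Perl way 1
      use strict;
      use warnings;
      use Encode;    # module name is case-sensitive: Encode, not ENCODE

      my ($if, $of) = @ARGV;

      # The input layer decodes ISO-8859-8; the output layer encodes UTF-8.
      open my $in,  "<:encoding(iso-8859-8)", $if or die "Cannot read $if: $!";
      open my $out, ">:encoding(UTF-8)",      $of or die "Cannot write $of: $!";

      while (<$in>) { print $out $_; }

      close $in;
      close $out or die "Cannot close $of: $!";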

  13. Perl way 2
      perl -MEncode -e '($if, $of) = @ARGV; open my $in, "<:encoding(iso-8859-8)", $if; open my $out, ">:encoding(utf8)", $of; while (<$in>) { print $out $_; }' infile outfile
