CORRECTION - Re: [ILUG] Editing unicode text files.

Brian Foster blf at blf.utvinternet.ie
Tue Feb 20 08:27:23 GMT 2007


  | Date: Mon, 19 Feb 2007 22:15:36 +0000
  | From: "Francis Daly" <francisdaly at gmail.com>
  |[ ... ]
  | >  for practical purposes, UTF-16 and UCS-4 (also
  | >  called UTF-32) also both roundtrip.
  | 
  | By the same analogy [ difference between UCS-2 and UTF-16 ...],
  | there probably is a difference between UCS-4 and UTF-32; but it
  | will only kick in many bits above the 21 that Unicode uses.
  | So for this discussion, and for anything we're ever likely to
  | care about, they're the same.

 correct, but this is now diving into politics,
 and in particular, Redmond vs. RoW:  UTF-16 can
 only encode the initial 2^21 UCS codepoints (what
 I'll call the “Unicode range”); encoding anything
 larger is technically impossible.†  in contrast,
 UCS-4 and UTF-32 (and UTF-8) can encode everything
 (the complete range of 2^31 UCS/ISO-10646
 codepoints).
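
 a minimal sketch of why UTF-16 tops out where it
 does, assuming the standard surrogate ranges
 (D800-DBFF and DC00-DFFF).  the function names
 are made up for illustration:

```python
# Surrogate ranges used by UTF-16 to reach beyond the 16-bit range.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def utf16_max_codepoint():
    """Largest codepoint reachable with one high + one low surrogate."""
    high_count = HIGH_END - HIGH_START + 1  # 1024 high surrogates
    low_count = LOW_END - LOW_START + 1     # 1024 low surrogates
    # Pairs address 1024 * 1024 = 2^20 codepoints starting at U+10000.
    return 0x10000 + high_count * low_count - 1


def encode_surrogate_pair(cp):
    """Split a codepoint above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= utf16_max_codepoint():
        raise ValueError("codepoint not representable as a surrogate pair")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)
```

 the pair space is exhausted at U+10FFFF, so a
 larger codepoint simply has nowhere to go and
 encode_surrogate_pair() raises ValueError.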

 UCS-4 and UTF-32 are bit-for-bit identical.
 so why the two names?

 in a word, M$.  M$ (at least) is (or at least was,
 I'm not sure what the current status is) pushing a
 definition of UTF-8 which  (1) is used to encode
 only the 2^21 Unicode range;  and  (2) must start
 with a BOMb.  (in M$'s world, BOMb-less UTF-8 is
 called UTF-8N, but IIRC, is still used only for
 the 2^21 Unicode range despite being capable of
 encoding the full 2^31.)
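
 to illustrate the “capable of encoding the full
 2^31” point, here is a rough sketch of the original
 UTF-8 scheme (up to 6 bytes per codepoint, as in
 RFC 2279; the later RFC 3629 restricts UTF-8 to
 4 bytes and U+10FFFF).  the function name is
 hypothetical, and the sketch does not reject
 surrogates:

```python
def utf8_encode_31bit(cp):
    """Encode a codepoint using the original 1-to-6-byte UTF-8 scheme.

    Illustrative only: modern UTF-8 (RFC 3629) stops at 4 bytes.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("codepoint outside the 31-bit UCS range")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte
    # (total bytes, exclusive upper limit, lead-byte prefix)
    forms = [
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    ]
    for nbytes, limit, prefix in forms:
        if cp < limit:
            tail = []
            for _ in range(nbytes - 1):
                # Each continuation byte carries 6 payload bits.
                tail.append(0x80 | (cp & 0x3F))
                cp >>= 6
            return bytes([prefix | cp] + tail[::-1])
```

 within the Unicode range this agrees with ordinary
 UTF-8; the 5- and 6-byte forms are only needed
 above 2^21.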

 similarly, M$ is pushing a definition of UTF-32
 which is UCS-4 but used to encode only the 2^21
 Unicode range (and, I assume, starts with a BOMb).
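
 for the BOMb question, a small sketch using
 Python's standard codecs module (the "utf-8-sig"
 codec writes the BOM on encode and strips it on
 decode).  strip_utf8_bom() is a made-up helper
 name:

```python
import codecs

text = "abc"

# UTF-8 with and without the leading BOM (EF BB BF).
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# UTF-32 little-endian with an explicit BOM prepended by hand.
utf32_le_with_bom = codecs.BOM_UTF32_LE + text.encode("utf-32-le")
```

 note that decoding BOM-prefixed bytes with plain
 "utf-8" keeps the BOM as a U+FEFF character at
 the start of the string, which is where much of
 the practical trouble comes from.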

 those are not the ISO definitions.  (however, I
 vaguely recall they have crept into the Unicode
 Consortium's terminology?)

 having said that, except for the BOMb issue, it
 doesn't really matter:  ISO has agreed to not
 define codepoints larger than Unicode's 2^21 cutoff
 (actually, it's U+10FFFF, i.e. 17 planes of 2^16
 codepoints, or roughly 2^20.1, but that's neither
 here nor there).
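
 a quick check of the U+10FFFF arithmetic, as a
 sketch.  is_valid_codepoint() is a hypothetical
 helper that leans on Python's chr(), which rejects
 anything above U+10FFFF:

```python
import math

MAX_UNICODE = 0x10FFFF

# Number of codepoints from U+0000 through U+10FFFF inclusive.
count = MAX_UNICODE + 1

# Each plane holds 2^16 codepoints.
planes = count // 0x10000

# Size of the range expressed as a power of two (not a whole number).
bits = math.log2(count)


def is_valid_codepoint(cp):
    """True if Python accepts cp as a Unicode codepoint."""
    try:
        chr(cp)
        return True
    except (ValueError, OverflowError):
        return False
```

 so the range is 2^20 + 2^16 codepoints: more than
 20 bits' worth, but comfortably inside 21.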

 upshot is UCS-4 and UTF-32 are indeed “the same”.
 both do indeed “cover everything”.  so does UTF-8,
 and (in practice) UTF-16, but not UCS-2.

cheers!
	-blf-

  †  you could, I suppose, use an extension trick
     similar to the one that builds UTF-16 from
     UCS-2, but no such extension has been defined.
     ergo, it's (currently) technically impossible
     to encode codepoints larger than U+10FFFF (the
     "2^21") in UTF-16.

  | And they both cover "everything".
  | 
  | Cheers,
  | 
  | 	f
-- 
Experienced (>25 yrs) kernel/software Eng: | Brian Foster   Montpellier,
 • Unix, embedded, &tc;  • Linux;  • doc;  | blf at utvinternet.ie   FRANCE
 • IDL, automated testing, process, &tc.   |  Stop E$$o (ExxonMobil)!
Résumé (CV) http://www.blf.utvinternet.ie  |     http://www.stopesso.com


