CORRECTION - Re: [ILUG] Editing unicode text files.
blf at blf.utvinternet.ie
Tue Feb 20 08:27:23 GMT 2007
| Date: Mon, 19 Feb 2007 22:15:36 +0000
| From: "Francis Daly" <francisdaly at gmail.com>
|[ ... ]
| > for practical purposes, UTF-16 and UCS-4 (also
| > called UTF-32) also both roundtrip.
| By the same analogy [ difference between UCS-2 and UTF-16 ...],
| there probably is a difference between UCS-4 and UTF-32; but it
| will only kick in many bits above the 21 that Unicode uses.
| So for this discussion, and for anything we're ever likely to
| care about, they're the same.
correct, but this is now diving into politics,
and in particular, Redmond vs. RoW: UTF-16 can
only encode the initial 2^21 UCS codepoints (what
I'll call the “Unicode range”) but none larger;
anything beyond that is technically impossible.† in
contrast, UCS-4 and UTF-32 (and UTF-8 as originally
defined) can encode everything (the complete range
of 2^31 UCS/ISO-10646 codepoints).
UCS-4 and UTF-32 are bit-for-bit identical.
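
to make the UTF-16 ceiling concrete, here is a minimal
sketch of the surrogate-pair arithmetic in Python. the
names utf16_units and MAX_UTF16 are made up for this
illustration; the constants are the standard surrogate
ranges. two surrogates carry 10 bits each, so the pair
can address 2^20 values on top of the first 0x10000,
and that is where U+10FFFF comes from.

```python
# sketch: how UTF-16 maps a codepoint onto 16-bit code units,
# and why the scheme tops out at U+10FFFF.

def utf16_units(cp):
    """Return the list of 16-bit UTF-16 code units for codepoint cp."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values are not codepoints in their own right")
    if cp < 0x10000:
        return [cp]                      # one unit, same value as in UCS-2
    if cp > 0x10FFFF:
        raise ValueError("no surrogate pair exists for U+%X" % cp)
    v = cp - 0x10000                     # 20 bits to spread over two units
    return [0xD800 | (v >> 10),          # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]        # low surrogate: bottom 10 bits

# largest value a pair can carry: 20 one-bits plus the 0x10000 offset
MAX_UTF16 = 0x10000 + (1 << 20) - 1

if __name__ == "__main__":
    print([hex(u) for u in utf16_units(0x1F600)])
    print(hex(MAX_UTF16))
```

asking utf16_units for anything above MAX_UTF16 raises
ValueError, which is the "technically impossible" part.
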
so why the two names?
in a word, M$. M$ (at least) is (or at least was,
I'm not sure what the current status is) pushing a
definition of UTF-8 which (1) is used to encode
only the 2^21 Unicode range; and (2) must start
with a BOMb. (in M$'s world, BOMb-less UTF-8 is
called UTF-8N, but IIRC, is still used only for
the 2^21 Unicode range despite being capable of
encoding the full 2^31.)
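
for anyone who wants to see the BOMb on the wire, a
small Python sketch follows. it assumes nothing beyond
the standard library: the "utf-8-sig" codec writes the
three-byte signature EF BB BF in front of otherwise
ordinary UTF-8, and strips it again on decode. the
sample string is arbitrary.

```python
import codecs

text = "caf\u00e9"

plain = text.encode("utf-8")          # no BOM
with_bom = text.encode("utf-8-sig")   # same bytes, prefixed with EF BB BF

# "utf-8-sig" accepts input with or without the signature and drops it;
# plain "utf-8" keeps a leading signature as a U+FEFF character.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_sig_no_bom = plain.decode("utf-8-sig")
decoded_plain = with_bom.decode("utf-8")

if __name__ == "__main__":
    print(with_bom[:3] == codecs.BOM_UTF8)
    print(repr(decoded_plain))
```
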
similarly, M$ is pushing a definition of UTF-32
which is UCS-4 but used to encode only the 2^21
Unicode range (and, I assume, starts with a BOMb).
those are not the ISO definitions. (however, I
vaguely recall they have crept into the Unicode
having said that, except for the BOMb issue, it
doesn't really matter: ISO has agreed to not
define codepoints larger than Unicode's 2^21 cutoff
(actually, it's U+10FFFF (IIRC), i.e. 17 planes of
2^16 codepoints, a little over 2^20, but that's
neither here nor there).
upshot is UCS-4 and UTF-32 are indeed “the same”.
both do indeed “cover everything”. so does UTF-8,
and (in practice) UTF-16, but not UCS-2.
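
a quick way to check the roundtrip claim, sketched in
Python under a couple of assumptions: the sample
strings are arbitrary, and since Python ships no
"ucs-2" codec, fits_ucs2 is a hypothetical stand-in
that only asks whether every codepoint fits in a
single 16-bit unit.

```python
import math

# samples from one UTF-8 byte up to the last codepoint, U+10FFFF
samples = ["A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"]

def roundtrips(s, encoding):
    """True if encoding and then decoding s gives s back unchanged."""
    return s.encode(encoding).decode(encoding) == s

results = {enc: all(roundtrips(s, enc) for s in samples)
           for enc in ("utf-8", "utf-16", "utf-32")}

def fits_ucs2(s):
    """Stand-in for UCS-2: one 16-bit unit per codepoint, no surrogates."""
    return all(ord(ch) <= 0xFFFF for ch in s)

# size of the range U+0000..U+10FFFF, and its base-2 logarithm
size = 0x10FFFF + 1
bits = math.log2(size)

if __name__ == "__main__":
    print(results)
    print([fits_ucs2(s) for s in samples])
    print(size, bits)
```

the last two samples lie above U+FFFF, so they are the
ones a plain 16-bit UCS-2 unit cannot hold.
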
† you could, I suppose, do an extension trick
similar to the one that builds UTF-16 from UCS-2,
but no such extension has been defined. ergo, it's
(currently) technically impossible to encode
codepoints larger than U+10FFFF (the "2^21") in
UTF-16.
| And they both cover "everything".
Experienced (>25 yrs) kernel/software Eng: | Brian Foster Montpellier,
• Unix, embedded, &tc; • Linux; • doc; | blf at utvinternet.ie FRANCE
• IDL, automated testing, process, &tc. | Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie | http://www.stopesso.com