CORRECTION - Re: [ILUG] Editing unicode text files.

Francis Daly francisdaly at gmail.com
Mon Feb 19 22:15:36 GMT 2007


On 19/02/07, Brian Foster <blf at blf.utvinternet.ie> wrote:
>   | Date: Sun, 18 Feb 2007 14:29:58 +0000
>   | From: "Francis Daly" <francisdaly at gmail.com>
>   |

>  good essay!
>  just one pedantic semi-correction ...

Thanks. And pedantry is good when you're trying to be complete, and
especially so when you're trying to be correct. So also thanks for
pointing out this bit.

>   | On the conversions, utf-8 and ucs-2 are reversible in both directions
>   | since they are just encodings of unicode [ ... ]

>  UCS-2 can only roundtrip if
>  all the characters are in the first 2^16 UCS
>  codepoints (U+0000..U+FFFF).  (and that is also
>  why UCS-2 is obsolete, replaced by UTF-16.)

Yes, you're right.

ascii defines 128 characters that fit in one octet. utf-8 is identical
to ascii for those 128 characters, and uses only the remaining 128
octet values (the ones with the high bit set) to encode "all" other
codepoints -- where "all" is "enough to cover all of Unicode (which
needs 21 bits)".

ucs-2 defines 62k characters that fit in two octets. utf-16 is
identical to ucs-2 for those 62k characters, and uses only the
remaining 2k two-octet values (the surrogates, 0xD800..0xDFFF), taken
in pairs, to encode "all" other codepoints -- where "all" also covers
the 21 bits needed for all of Unicode.

So ascii does not cover everything, and utf-8 is not a synonym for
ascii; but if you stay in the (limited) ascii range of "unaccented
English", their encodings are identical.
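
To make that concrete, here is a minimal sketch in Python (my choice
of language for illustration; the helper name utf8_octets is made up).
It checks that plain ascii text gives the same octets under both
encodings, and that a codepoint outside ascii is encoded using only
octets with the high bit set.

```python
def utf8_octets(text: str) -> list[int]:
    """Return the octets of text encoded as UTF-8 (hypothetical helper)."""
    return list(text.encode("utf-8"))

plain = "unaccented english"
# Inside the ascii range the two encodings produce identical octets.
same = plain.encode("ascii") == plain.encode("utf-8")

# U+00E9 (e with acute accent) is outside ascii, so utf-8 needs more
# than one octet for it, and each of those octets is 0x80 or above.
accented = utf8_octets("\u00e9")
high_bit_only = all(octet >= 0x80 for octet in accented)

print(same, [hex(o) for o in accented], high_bit_only)
```

Because the multi-octet sequences never reuse a value below 0x80, an
ascii octet in a utf-8 stream always means the ascii character.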

And ucs-2 does not cover everything, and utf-16 is not a synonym for
ucs-2; but if you stay in the (limited, but less limited than ascii)
ucs-2 range of the "Basic Multilingual Plane", their encodings are
identical.
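
The same kind of sketch works for utf-16, again in Python with made-up
helper names (utf16_units, surrogate_pair). It shows a codepoint
inside the Basic Multilingual Plane taking one two-octet unit, and one
outside it taking a pair of surrogates. The arithmetic is the standard
utf-16 one: subtract 0x10000, then split the remaining 20 bits into
two halves of 10 bits each.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def surrogate_pair(codepoint: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    offset = codepoint - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)            # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)           # bottom 10 bits
    return high, low

bmp = utf16_units("\u20ac")                   # EURO SIGN, inside the BMP
astral = utf16_units("\U0001D11E")            # MUSICAL SYMBOL G CLEF, outside it

print([hex(u) for u in bmp], [hex(u) for u in astral])
```

The 2k surrogate values split into 1024 high and 1024 low ones, so the
pairs cover 1024 * 1024 = 2^20 extra codepoints, which is exactly the
range U+10000..U+10FFFF.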

>  in practice, most characters/codepoints are in
>  that range, but IIRC, Klingon (as an example)
>  is not.  if yer text did contain Klingon,
>  converting to UCS-2 would be a disaster.

Also true.

To make a safe roundtrip for a particular codepoint, the thing you're
tripping to must be able to encode it; for a random codepoint, that
means "an encoding that covers everything"; but if you already know
the limits of possible initial codepoints, you may get away with an
incomplete encoding.
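
A rough way to test that in Python is to encode, decode, and compare.
The helper names here (roundtrips, fits_ucs2) are made up for the
example. Python ships no "ucs-2" codec, so the ucs-2 check is only an
assumption-level stand-in that tests whether every codepoint is in
U+0000..U+FFFF.

```python
def roundtrips(text: str, encoding: str) -> bool:
    """True if text survives encode-then-decode in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The encoding cannot represent some codepoint in text.
        return False

def fits_ucs2(text: str) -> bool:
    """Stand-in for a ucs-2 check: every codepoint is at most U+FFFF."""
    return all(ord(ch) <= 0xFFFF for ch in text)

# "cafe" with an accented e, followed by a codepoint outside the BMP.
sample = "caf\u00e9 \U0001D11E"

for name in ("ascii", "latin-1", "utf-8", "utf-16", "utf-32"):
    print(name, roundtrips(sample, name))
print("ucs-2 range", fits_ucs2(sample))
```

If you already know the text is limited, say to "caf\u00e9" alone,
then latin-1 roundtrips it fine even though latin-1 is far from
complete.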

>  for practical purposes, UTF-16 and UCS-4 (also
>  called UTF-32) also both roundtrip.

By the same analogy above, there is a difference between UCS-4 and
UTF-32: UCS-4 was defined over a 31-bit space, while UTF-32 stops at
U+10FFFF, the top of the 21 bits that Unicode uses. That difference
only kicks in above anything Unicode can assign. So for this
discussion, and for anything we're ever likely to care about, they're
the same.
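
As a small illustration, assuming Python's "utf-32-be" codec and two
made-up helper names: every codepoint comes out as exactly four
octets holding its own number, and values above the Unicode ceiling of
U+10FFFF are refused on decode even though they fit in four octets.

```python
def utf32_value(ch: str) -> int:
    """Return the 32-bit value that UTF-32 stores for one character."""
    raw = ch.encode("utf-32-be")
    assert len(raw) == 4                      # always four octets
    return int.from_bytes(raw, "big")

def decodes_as_utf32(value: int) -> bool:
    """True if the 32-bit value is accepted by the UTF-32 decoder."""
    try:
        value.to_bytes(4, "big").decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False

print(hex(utf32_value("\U0001D11E")))         # the codepoint itself
print(decodes_as_utf32(0x10FFFF), decodes_as_utf32(0x110000))
```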

And they both cover "everything".

Cheers,

	f


