[ILUG] Editing unicode text files.

Brian Foster blf at blf.utvinternet.ie
Sat Feb 17 10:42:04 GMT 2007


  | Date: Fri, 16 Feb 2007 16:31:43 +0000
  | From: "Aine Douglas" <aine.douglas at gmail.com>
  | 
  | Can anyone recommend a commandline text editor that is capable of
  | editing unicode text files?
  | 
  | I've got some webpages to edit which contain chinese script, and when
  | I open them in vi i get long strings of @@@@@^^^???@@ etc, and its a
  | pain downloading them for really small edits.

 I don't quite grok what it is you want to do.
 First, “Unicode” is ambiguous to the point of being meaningless;
 what matters is the encoding, not what is encoded.
 ( Briefly:  Every character is in the UCS (Universal
  Character Set, ISO-10646, also called “Unicode”†).
  A character's binary representation is an encoding.
  US-ASCII, e.g., is the first 128 characters of the UCS;
  ISO-8859-1 is the first 256; ISO-8859-15 is a slightly
  different set of 256; UTF-8 is all two billion; and
  there are many other encodings. )
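
 To make the character-vs-encoding distinction concrete, here
 is a small Python sketch (mine, purely illustrative) that takes
 one UCS character at a time and shows the bytes each encoding
 uses for it.  The function name encoded_forms() and the choice
 of sample characters are assumptions made for the example;
 the codec names are the ones Python's standard library uses.

```python
# Illustration only: one UCS character, several binary representations.
# encoded_forms() is a hypothetical helper, not part of any library.

def encoded_forms(text, encodings=("ascii", "iso-8859-1", "iso-8859-15",
                                   "utf-8", "shift_jis")):
    """Map each encoding name to the bytes for `text`,
    or to None when that encoding cannot represent it."""
    forms = {}
    for name in encodings:
        try:
            forms[name] = text.encode(name)
        except UnicodeEncodeError:
            forms[name] = None
    return forms


if __name__ == "__main__":
    # "A", e-acute, the euro sign, and the CJK ideograph U+4E2D.
    for ch in ("A", "\u00e9", "\u20ac", "\u4e2d"):
        print(f"U+{ord(ch):04X}")
        for name, raw in encoded_forms(ch).items():
            shown = raw.hex(" ") if raw is not None else "(not representable)"
            print(f"  {name:12} {shown}")
```

 The euro sign is the handy case:  it has no byte at all in
 ISO-8859-1, one byte in ISO-8859-15, and three in UTF-8,
 yet it is the same character throughout.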

 Second, how will the editor be used without downloading
 the files in question?

 And third, by “command line” do you mean something like
 sed(1), or just an editor you can launch from the shell
 (like the vi(1) mentioned)?

 Editors that can handle the full UCS/Unicode in a variety
 of encodings include vim(1), mined, and yudit.  Some other
 editors, such as joe(1), handle UTF-8 but not necessarily
 an arbitrary encoding.

 I've only used `vim' in anger (in several senses! ;-) ):
 `vim', at least, will autodetect the file's encoding and
 map it to yer locale's, and hence you can use `vim' to
 edit a SJIS file on a UTF-8 system.  The file is saved
 in its original encoding.  Almost needless to say, this
 mapping works best if the system/locale uses UTF-8 (on
 Linux), since UTF-8 round-trips the full UCS.  Result is,
 provided you are displaying UTF-8 correctly (mostly a
 matter of fonts), `vim' works quite well (albeit keying
 in non-keyboard characters can be a pain:  I tend to use
 gucharmap(1) and copy-and-paste).
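
 The detect / edit / save-in-the-original-encoding cycle
 described above can be sketched in a few lines of Python.
 This is emphatically not `vim's implementation, just the same
 idea under simple assumptions:  the candidate encodings are
 tried in order and the first that decodes cleanly wins.  The
 names detect_and_decode() and edit_bytes(), and the candidate
 list itself, are made up for the example.

```python
# Sketch only: guess a file's encoding, edit the decoded text,
# then write it back in the encoding it arrived in.
# Both function names are hypothetical.

def detect_and_decode(raw, candidates=("utf-8", "shift_jis", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes `raw`.
    ISO-8859-1 accepts any byte sequence, so it acts as the fallback."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the data")


def edit_bytes(raw, edit, candidates=("utf-8", "shift_jis", "iso-8859-1")):
    """Decode `raw`, apply `edit` (a str -> str function), and re-encode
    the result in the detected encoding.  Returns (new_bytes, encoding)."""
    text, encoding = detect_and_decode(raw, candidates)
    return edit(text).encode(encoding), encoding


if __name__ == "__main__":
    # Three kanji (U+65E5 U+672C U+8A9E) stored as Shift_JIS bytes.
    original = "\u65e5\u672c\u8a9e".encode("shift_jis")
    updated, encoding = edit_bytes(original, lambda t: t + "!")
    print(encoding, updated.hex(" "))
```

 Note the final encode() raises UnicodeEncodeError if the edit
 introduces a character the original encoding lacks, which is
 the flip side of UTF-8 being the one that round-trips everything.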

cheers!
	-blf-
 
  †  Pedantically, “Unicode” means three different things,
    and is not a synonym for the UCS.

-- 
Experienced (>25 yrs) kernel/software Eng: | Brian Foster   Montpellier,
 • Unix, embedded, &tc;  • Linux;  • doc;  | blf at utvinternet.ie   FRANCE
 • IDL, automated testing, process, &tc.   |  Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie  |     http://www.stopesso.com


