help-gnu-emacs

Re: UTF-8 in path / filename


From: Peter Dyballa
Subject: Re: UTF-8 in path / filename
Date: Sat, 26 Aug 2006 11:36:34 +0200


On 26.08.2006 at 01:09, Miles Bader wrote:

> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".


        pete 39 /\ .
        /Users/pete
        pete 40 /\ env | egrep -i 'LC|LANG'
        LANG=de_DE.UTF-8
        LC_CTYPE=de_DE.UTF-8
        pete 41 /\  /usr/local/bin/emacs-22.0.50 -Q &

Files with UTF-8 characters in their names are shown in dired (the mode-line has -u:, i.e. it uses UTF-8) as <vowel><empty box>. Some UTF-8 characters such as ß or € show up as themselves. They appear the same way in the buffer's mode-line once the file is visited, and in the list of buffers (C-x b). In the Buffers menu of the menu bar they are completely unreadable, and unreadable in yet another way in the "Buffer Menu" pop-up. The font used for the vowels, the empty boxes, and the other characters is taken from the Java SDK and is quite rich (1425 mapped characters, mostly for European and some Near Eastern scripts):

-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)

Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF), Unicode (#x20AC), and something else (#x308). Or are some of these representations just abbreviations that omit the leading zeros?

The other information from C-u C-x = on the examples is:

  character: a (97, #o141, #x61, U+0061)
    charset: ascii (ASCII (ISO646 IRV))
 code point: #x61
     syntax: w  which means: word
   category: a:ASCII l:Latin
buffer code: #x61
  file code: #x61 (encoded by coding system mule-utf-8)

  character:  (332488, #o1211310, #x512c8, U+0308)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x25 #x48
     syntax: w  which means: word
   category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
  file code: #xCC #x88 (encoded by coding system mule-utf-8)

  character: ß (2271, #o4337, #x8df, U+00DF)
    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x5F
     syntax: w  which means: word
   category: l:Latin
buffer code: #x81 #xDF
  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

  character: € (342604, #o1235114, #x53a4c, U+20AC)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x74 #x4C
     syntax: w  which means: word
buffer code: #x9C #xF4 #xF4 #xCC
  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
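
As a cross-check outside Emacs, here is a minimal Python sketch (not part of the original session; the helper name describe is made up for illustration). It treats the four values from the font lines as Unicode code points, prints them padded to four hex digits, and encodes each one as UTF-8. If the reading is right, #x61 and #xDF are simply U+0061 and U+00DF written without leading zeros, since ISO 8859-1 coincides with the first 256 Unicode code points, and the octets should match the "file code" lines above.

```python
import unicodedata

# The four values shown next to the font names, read as Unicode code points.
CODE_POINTS = [0x61, 0x308, 0xDF, 0x20AC]


def describe(cp):
    """Return (U+XXXX label, Unicode character name, UTF-8 octets in hex)."""
    ch = chr(cp)
    label = f"U+{cp:04X}"  # pad to four hex digits, as C-u C-x = does
    name = unicodedata.name(ch)
    octets = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return label, name, octets


for cp in CODE_POINTS:
    print(*describe(cp), sep="  ")
```

The "buffer code" values are Emacs-internal (emacs-mule) and are not reproduced by this sketch; only the code points and the UTF-8 file codes are.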

An excerpt from the fontset's description (I am missing ISO 8859-16!):

Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE   FONT NAME
---------------------   ---------
ascii                   -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
     [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1         -b&h-lucidatypewriter-*-iso10646-1
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2         -*-iso8859-2
latin-iso8859-3         -*-iso8859-3
latin-iso8859-4         -*-iso8859-4
thai-tis620             -*-*-*-tis620-*
greek-iso8859-7         -*-iso8859-7
arabic-iso8859-6        -*-iso8859-6
hebrew-iso8859-8        -*-iso8859-8
katakana-jisx0201       -*-jisx0201-*
latin-jisx0201          -*-jisx0201-*
cyrillic-iso8859-5      -*-iso8859-5
latin-iso8859-9         -*-iso8859-9
latin-iso8859-15        -*-iso8859-15
latin-iso8859-14        -*-iso8859-14
...
mule-unicode-2500-33ff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff  -b&h-lucidatypewriter-*-iso10646-1
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...

IMO the display of UTF-8 characters is not sufficient.


> If that doesn't work, I dunno, maybe it's something screwy about the mac.


There is something special, possibly screwy, in the way Mac OS X (or rather HFS+, its file system) stores UTF-8 characters in file names: they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. (Only file names are affected; a file's contents are not decomposed. What would such a JPEG picture look like?) So the two or three octets of the composed character are expanded on disk to a pair of one octet and (mostly?) two octets.

GNU Emacs should be able to detect that: a character from the category (see above) "Combining diacritic or mark" cannot stand alone by nature, but must be combined with the character to its left in a left-to-right writing system or with the character to its right in a right-to-left writing system. (I have no idea what the rules are in a top-to-bottom writing system like Mongolian, or whether such systems have combining characters.) And it should be able to handle the character categories correctly.
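
To make the decomposition concrete, here is a small Python sketch. It uses standard Unicode normalization (NFD to decompose, NFC to recompose) as a stand-in. That is an assumption: HFS+ applies its own variant of decomposition, so results may differ for some characters. The file name and the helper has_combining_marks are made up for illustration.

```python
import unicodedata

precomposed = "\u00e4.txt"  # "ä.txt" with ä as the single code point U+00E4

# Decompose roughly the way the file system is described to: ä -> a + U+0308.
on_disk = unicodedata.normalize("NFD", precomposed)


def has_combining_marks(text):
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


print([f"U+{ord(ch):04X}" for ch in on_disk])
print(len(precomposed.encode("utf-8")), "octets composed,",
      len(on_disk.encode("utf-8")), "octets decomposed")

# A program that sees a combining mark can recompose before display.
if has_combining_marks(on_disk):
    restored = unicodedata.normalize("NFC", on_disk)
    print(restored == precomposed)
```

For ä the two octets C3 A4 become three: 61 for the base letter plus CC 88 for the combining diaeresis, which is the U+0308 seen in the dired listing.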

--
Greetings

  Pete

What's the difference between OS X and Vista?

Microsoft employees are excited about OS X…






