Re: UTF-8 in path / filename

help-gnu-emacs

From:	Peter Dyballa
Subject:	Re: UTF-8 in path / filename
Date:	Sat, 26 Aug 2006 11:36:34 +0200


On 26.08.2006 at 01:09, Miles Bader wrote:

> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>
>> There won't be a perfect solution with GNU Emacs in the near future ...
>
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using
> emacs 22).
>
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
>
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".


        pete 39 /\ .
        /Users/pete
        pete 40 /\ env | egrep -i 'LC|LANG'
        LANG=de_DE.UTF-8
        LC_CTYPE=de_DE.UTF-8
        pete 41 /\  /usr/local/bin/emacs-22.0.50 -Q &
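
For reference, the locale check in the transcript above can be sketched in Python. This is only an illustration: the helper names `effective_ctype` and `mentions_utf8` are made up here, and the sketch assumes the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG). It says nothing about how Emacs itself reads the locale.

```python
import os


def effective_ctype(env=None):
    """Return the locale value that governs character encoding.

    Assumed POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Empty values count as unset.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return ""


def mentions_utf8(locale_name):
    """True if the codeset part of the locale name is UTF-8 (any spelling)."""
    # "de_DE.UTF-8@euro" -> codeset "UTF-8"
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    ctype = effective_ctype()
    print(ctype, mentions_utf8(ctype))
```

With the environment shown above (LANG and LC_CTYPE both set to de_DE.UTF-8), the check reports a UTF-8 locale, so the locale setting alone does not explain the display problems described below.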

Files with UTF-8 characters in their names are shown in dired (which
has -u: in the mode-line, i.e. it uses UTF-8) à la <vowel><empty box>.
Some UTF-8 characters like ß or € show up as themselves. They appear
the same way in the buffer's mode-line once visited, and in the
list-of-buffers buffer (C-x b); they are completely unreadable in the
Buffers menu from the menu bar, and unreadable in yet another way in
the "Buffer Menu" pop-up. The font used for the vowels, the empty
boxes, and the other characters is taken from the Java SDK and is quite
rich (1425 mapped characters for mostly European and some Near Eastern
scripts):

-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)

Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC), and something else (#x308). Or are some
representations just abbreviations that leave off the leading zeros?


The other information from C-u C-x = on the examples is:

  character: a (97, #o141, #x61, U+0061)
    charset: ascii (ASCII (ISO646 IRV))
 code point: #x61
     syntax: w  which means: word
   category: a:ASCII l:Latin
buffer code: #x61
  file code: #x61 (encoded by coding system mule-utf-8)

  character:  (332488, #o1211310, #x512c8, U+0308)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x25 #x48
     syntax: w  which means: word
   category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
  file code: #xCC #x88 (encoded by coding system mule-utf-8)

  character: ß (2271, #o4337, #x8df, U+00DF)
    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x5F
     syntax: w  which means: word
   category: l:Latin
buffer code: #x81 #xDF
  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

  character: € (342604, #o1235114, #x53a4c, U+20AC)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x74 #x4C
     syntax: w  which means: word
buffer code: #x9C #xF4 #xF4 #xCC
  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
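
On the "leading zeros" question, a small Python check may help. It is only a sketch: it takes the four code points reported by C-u C-x = above and prints their UTF-8 encodings. The helper name `utf8_hex` is made up for this illustration. If the output matches the "file code" lines, that is consistent with the numbers in the font lines being plain Unicode code points written without leading zeros (#x61 = U+0061, #x308 = U+0308, #xDF = U+00DF), since the font's registry is ISO10646-1.

```python
# Code points reported by C-u C-x = in the message above.
CODE_POINTS = [0x61, 0x308, 0xDF, 0x20AC]


def utf8_hex(cp):
    """Return the UTF-8 encoding of a code point as '#xNN' octets."""
    return " ".join("#x%02X" % b for b in chr(cp).encode("utf-8"))


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print("U+%04X  (#x%X)  ->  %s" % (cp, cp, utf8_hex(cp)))
    # U+0061  (#x61)  ->  #x61
    # U+0308  (#x308)  ->  #xCC #x88
    # U+00DF  (#xDF)  ->  #xC3 #x9F
    # U+20AC  (#x20AC)  ->  #xE2 #x82 #xAC
```

The octets agree with the "file code" values shown by mule-utf-8, so the on-disk encoding itself looks correct; the "buffer code" values are Emacs's internal representation and are not reproduced by this sketch.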

An excerpt from the fontset's description (I am missing ISO 8859-16!):

Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE   FONT NAME
---------------------   ---------

ascii                   -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
     [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1         -b&h-lucidatypewriter-*-iso10646-1
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]

latin-iso8859-2         -*-iso8859-2
latin-iso8859-3         -*-iso8859-3
latin-iso8859-4         -*-iso8859-4
thai-tis620             -*-*-*-tis620-*
greek-iso8859-7         -*-iso8859-7
arabic-iso8859-6        -*-iso8859-6
hebrew-iso8859-8        -*-iso8859-8
katakana-jisx0201       -*-jisx0201-*
latin-jisx0201          -*-jisx0201-*
cyrillic-iso8859-5      -*-iso8859-5
latin-iso8859-9         -*-iso8859-9
latin-iso8859-15        -*-iso8859-15
latin-iso8859-14        -*-iso8859-14
...
mule-unicode-2500-33ff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff  -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff  -b&h-lucidatypewriter-*-iso10646-1
     [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
     [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]

...

IMO the display of UTF-8 characters is not sufficient.

> If that doesn't work, I dunno, maybe it's something screwy about
> the mac.

There is something special, possibly screwy, in the way Mac OS X (or
better: HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a¨, an à becomes a`, etc. Only
file names are affected; a file's contents are not decomposed (what
would such a JPEG picture look like?). So two or three octets in the
string on disk are expanded to a pair of one octet and (mostly?) two
octets.

GNU Emacs should be able to detect that: if a character is from the
category (see above) "Combining diacritic or mark", it can't stand
alone by nature, but must be combined with the character on its left in
a left-to-right writing system, or with the character on its right in a
right-to-left writing system. (I have no idea of the rules in a
top-to-bottom writing system like Mongolian, or whether these have
combining characters.) And it should be able to handle the character
categories correctly.
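
The decomposition described above can be illustrated with a short Python sketch. It uses standard Unicode normalization (NFD/NFC) from the `unicodedata` module as a stand-in. As far as I know, HFS+ applies its own variant of canonical decomposition that is close to, but not identical with, NFD, so treat this as an approximation. The helper names are made up for the illustration.

```python
import unicodedata


def decompose(name):
    """Split precomposed characters into base letter + combining marks (NFD)."""
    return unicodedata.normalize("NFD", name)


def compose(name):
    """Recombine base letters and combining marks where possible (NFC)."""
    return unicodedata.normalize("NFC", name)


def has_combining_marks(text):
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) for ch in text)


if __name__ == "__main__":
    precomposed = "\u00e4"                 # ä as a single code point
    decomposed = decompose(precomposed)    # a followed by U+0308
    print([hex(ord(c)) for c in decomposed])   # ['0x61', '0x308']
    # Octet counts in UTF-8: 2 for the precomposed form, 1 + 2 when decomposed.
    print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3
    print(has_combining_marks(decomposed))     # True
    print(compose(decomposed) == precomposed)  # True
```

A program listing such file names could test for combining marks in this way and either recompose the name for display or render each mark together with its base character.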


--
Greetings

  Pete

What's the difference between OS X and Vista?

Microsoft employees are excited about OS X
