help-gnu-emacs

Re: character encoding confusion


From: Pascal J. Bourguignon
Subject: Re: character encoding confusion
Date: Wed, 08 Dec 2010 15:18:00 -0000
User-agent: Gnus/5.101 (Gnus v5.10.10) Emacs/23.2 (gnu/linux)

patrol <patrol_boat@hotmail.com> writes:

> On Jul 7, 6:37 pm, p...@informatimago.com (Pascal J. Bourguignon)
> wrote:
>>
>> Remember that C only deals with integer.  There is no character type in C.
>
> I thought there was a char data type. Well, not exactly sure the
> relevence of that...

Yes, but char is defined as an integer type whose values range from
CHAR_MIN to CHAR_MAX (see <limits.h>).

In C, there is no character type.

There are several ways to write literal integers,  such as 0101, 65,
'A', or 0x41, and there are several ways to write vector literals,
such as: {65,66,67,0} or "ABC", but there is no character.  And
therefore no string.

Until you write:

    typedef struct {
       unsigned code;
    }   character;

    typedef struct {
       int allocated;
       int length;
       character* contents;
    }   string;

and the associated functions.
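
The claim about literals can be checked directly.  Here is a minimal
sketch, assuming an ASCII-based execution character set (which the
language does not require); the function names are made up for this
illustration:

```c
#include <limits.h>
#include <string.h>

/* Returns 1 when the four spellings denote the same integer.
   Assumption: the execution character set is ASCII-based. */
int literal_spellings_agree(void)
{
    return 0101 == 65 && 65 == 'A' && 'A' == 0x41;
}

/* Returns 1 when "ABC" occupies the same bytes as {65,66,67,0}. */
int string_literal_is_byte_vector(void)
{
    const char vector[] = {65, 66, 67, 0};
    return sizeof vector == sizeof "ABC"
        && memcmp(vector, "ABC", sizeof vector) == 0;
}

/* char is just a small integer range; C guarantees at least 0..127. */
int char_is_integer_range(void)
{
    return CHAR_MIN <= 0 && CHAR_MAX >= 127;
}
```

All three functions should return 1 on such a system: nothing in them
involves a character type, only integers and vectors of integers.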



>> So, what happens when you call: printf("%c",176); ?
>
> Well as I said in my post, I get a shaded square. 

When you call  printf("%c",176);, one byte of value 176 is sent to the
output stream (file or terminal).  That's all, as far as C is concerned.
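
To see that for yourself without involving the terminal, here is a
small sketch that formats into a buffer instead of the output stream
and looks at the byte that comes out (the function name is
hypothetical):

```c
#include <stdio.h>

/* Formats `value` with "%c" into a buffer and returns the single
   byte that printf would have sent to the output stream. */
unsigned char byte_written_by_percent_c(int value)
{
    char buffer[2];                      /* one byte plus the terminator */
    snprintf(buffer, sizeof buffer, "%c", value);
    return (unsigned char)buffer[0];
}
```

Calling it with 176 gives back 176, and with 248 gives back 248.  What
glyph appears for that byte is decided later, by whatever reads the
stream.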



> But printf("%c", 248) yields the degree sign. But like I said, under
> Latin-1 and UTF-8, 176 is the degree sign, not 248.

So what?

(Alternatively, you may try to find in what coding system 248 is a
degree sign:

CL-USER> (do-external-symbols (cs :charset) 
           (ignore-errors 
             (let ((str (ext:convert-string-from-bytes #(248) cs)))
               (when (string-equal  (char-name (character str)) "DEGREE_SIGN")
                 (print (list cs str))))))

(CHARSET:CP866 "°") 
(CHARSET:CP860 "°") 
(CHARSET:CP861 "°") 
(CHARSET:CP862 "°") 
(CHARSET:CP863 "°") 
(CHARSET:CP863-IBM "°") 
(CHARSET:CP869 "°") 
(CHARSET:CP861-IBM "°") 
(CHARSET:CP862-IBM "°") 
(CHARSET:CP437 "°") 
(CHARSET:CP852-IBM "°") 
(CHARSET:CP857 "°") 
(CHARSET:CP850 "°") 
(CHARSET:CP869-IBM "°") 
(CHARSET:CP852 "°") 
(CHARSET:CP775 "°") 
(CHARSET:CP860-IBM "°") 
(CHARSET:CP865-IBM "°") 
(CHARSET:CP737 "°") 
(CHARSET:CP437-IBM "°") 
(CHARSET:CP865 "°") 
NIL

But this only tells us that your terminal is configured to convert the
bytes it receives using some Microsoft-specific coding system.
)
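
To make the difference between a byte and a character concrete, here
is a rough sketch in C.  It assumes the terminal uses CP437 (one of
the coding systems listed above), covers only the two bytes discussed
in this thread, and uses function names invented for the example:

```c
#include <stddef.h>

/* Partial CP437 table, for illustration only: just the two bytes
   from this thread.  Returns the Unicode code point, or 0 for any
   byte this sketch does not cover. */
unsigned long cp437_sample_to_unicode(unsigned char byte)
{
    switch (byte) {
    case 0xB0: return 0x2591;  /* 176: LIGHT SHADE, the "shaded square" */
    case 0xF8: return 0x00B0;  /* 248: DEGREE SIGN */
    default:   return 0;
    }
}

/* Encodes one code point as UTF-8 into out[]; returns the number of
   bytes written, or 0 if the code point is above U+10FFFF.
   (Surrogates are not rejected in this sketch.) */
size_t utf8_encode(unsigned long code, unsigned char out[4])
{
    if (code < 0x80) {
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (code >> 12));
        out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    if (code < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (code >> 18));
        out[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that the degree sign is code point 176 (U+00B0) in both Latin-1
and Unicode, but as UTF-8 it is the two bytes 0xC2 0xB0, not the
single byte 176.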



>> Have a look at setlocale, LC_ALL, etc, and libiconv.
>
> I don't have any experience with this, but I did printf("%d", LC_ALL),
> which returned 0. Don't know what that means, but I'm not sure why
> locale settings should matter. Aren't Latin-1 and UTF-8 universal
> encodings? If a file is encoded in Latin-1, wouldn't the degree sign
> map to 176 regardless of locale?

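LC_ALL is an environment variable in a POSIX system that tells
libraries and programs what language and character encoding the
current user and terminal expect.  There is a set of associated
variables (LC_CTYPE, LC_MESSAGES, LANG, etc).  The LC_ALL you printed
is the constant of the same name from <locale.h>, which selects a
category for setlocale; its numeric value means nothing by itself.

setlocale(3) is a library function that lets you indicate which
language and character encoding should be used for the current user
and terminal.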

Type:
    man 3 setlocale
and also read all the manual pages listed in its SEE ALSO section.

Also type:
    man 3 iconv
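
As a rough idea of what calling setlocale looks like, here is a
minimal sketch.  It uses only the standard C setlocale; the wrapper
names are hypothetical:

```c
#include <locale.h>
#include <stddef.h>

/* Returns the name of the locale currently in effect.
   Passing NULL queries the locale without changing it. */
const char *current_locale_name(void)
{
    return setlocale(LC_ALL, NULL);
}

/* Switches every category to the named locale.  The empty string ""
   means "whatever the user's environment asks for".
   Returns 1 on success, 0 if the locale is not available. */
int use_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}
```

A program that wants to honour the user's settings would typically
call use_locale("") once at startup.  A C program starts in the "C"
locale until it does so.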


For example, in my ~/.bash_env, I have:

    LC_CTYPE=en_US.UTF-8
    export LC_CTYPE

This defines an environment variable named LC_CTYPE whose value is
en_US.UTF-8.  This value indicates that I want characters classified
according to USA English conventions, and encoded in UTF-8 (the
language of messages is selected separately, by LC_MESSAGES).  So a
program may call getenv(3) to get the values of these environment
variables and pass them to setlocale (or call setlocale(LC_ALL, ""),
which reads them itself) to inform the libraries what encoding to use,
and it can itself use iconv(3) to convert its own strings from their
original encoding to the encoding required by the terminal.
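
Here is a sketch of the iconv(3) part, assuming a POSIX system whose
iconv knows the names "ISO-8859-1" and "UTF-8" (glibc does; on some
systems you need to link with -liconv).  The function name is made up,
and a real program would get the target encoding from the locale
instead of hard-coding it:

```c
#include <iconv.h>
#include <stddef.h>

/* Converts in_len bytes of ISO-8859-1 text to UTF-8.
   Returns the number of bytes written to out, or (size_t)-1 on
   failure (unknown encoding, or out_cap too small). */
size_t latin1_to_utf8(const char *in, size_t in_len,
                      char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  /* to, from */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv takes char **, but it only reads the input bytes. */
    char *src = (char *)in;
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Given the single Latin-1 byte 176 (the degree sign), this should
produce the two UTF-8 bytes 0xC2 0xB0.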



So, it seems you're writing your program on a Microsoft system.  I
was oblivious to that fact, and I don't know anything about
programming Microsoft systems.  When I have to use a Microsoft system
(temporarily), I download http://www.cygwin.com/setup.exe and use
cygwin, which gives me a POSIX environment, and if I have to develop a
program, I install Linux instead.  Perhaps the POSIX API I mention
here doesn't apply to Microsoft programs.

In any case, coding systems may vary depending on the output device.
If your program writes to a terminal or console, you have to deal with
the coding system configured in the terminal.  If it displays text
through a GUI, you have to deal with the coding system expected by the
GUI toolkit you're using.  If it writes a file, you have to deal with
the encoding that must be generated in that file.

C compilers just take bytes and store bytes (nothing in the language
says otherwise, and compilers are usually just C programs
themselves!), so if you encode your sources in ISO-8859-1, you will
have ISO-8859-1 literal bytes in your program.  If you need to output
to a device that expects another encoding, then your program will have
to find out what encoding is expected, and it will have to convert the
strings.
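
A small sketch of what that means for a degree sign in a string
literal.  The bytes are spelled with escapes here so that the example
does not depend on how this message itself got encoded; the function
names are hypothetical:

```c
#include <string.h>

/* What the compiler stores for a degree sign typed in a source file
   saved as ISO-8859-1: the single byte 0xB0. */
size_t degree_sign_bytes_latin1(void)
{
    return strlen("\xB0");
}

/* The same degree sign typed in a source file saved as UTF-8:
   the two bytes 0xC2 0xB0. */
size_t degree_sign_bytes_utf8(void)
{
    return strlen("\xC2\xB0");
}
```

The first returns 1 and the second returns 2: same character in the
editor, different bytes in the program.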


-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

