From: Christian Jullien
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Date: Fri, 1 Sep 2017 05:54:28 +0200

Hello,

I'm not sure you can assume that a character with a code >= 0x80 is part of a 
UTF-8 sequence. Beyond the "basic character set", which is essentially 7-bit 
ASCII, there is the "extended character set", which is implementation-defined.

For example, the euro sign € is part of ISO 8859-15, where it is encoded in a 
single byte, 0xA4. See https://en.wikipedia.org/wiki/ISO/IEC_8859-15
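
To make the point concrete, here is a minimal sketch of mapping an ISO 8859-15 
byte to its Unicode code point. The function name latin9_to_unicode is 
hypothetical and is not part of tcc; the table covers the eight positions where 
ISO 8859-15 differs from ISO 8859-1, and every other byte maps to the code 
point with the same value.

```c
#include <stdint.h>

/* Hypothetical helper, not part of tcc: map one ISO 8859-15 (Latin-9) byte
   to its Unicode code point. Only eight positions differ from ISO 8859-1;
   all other bytes map to the code point with the same numeric value. */
static uint32_t latin9_to_unicode(unsigned char b)
{
    switch (b) {
    case 0xA4: return 0x20AC; /* EURO SIGN */
    case 0xA6: return 0x0160; /* S WITH CARON (capital) */
    case 0xA8: return 0x0161; /* s with caron (small) */
    case 0xB4: return 0x017D; /* Z WITH CARON (capital) */
    case 0xB8: return 0x017E; /* z with caron (small) */
    case 0xBC: return 0x0152; /* LIGATURE OE (capital) */
    case 0xBD: return 0x0153; /* ligature oe (small) */
    case 0xBE: return 0x0178; /* Y WITH DIAERESIS (capital) */
    default:   return b;      /* identical to ISO 8859-1 */
    }
}
```

With this table the single byte 0xA4 yields U+20AC, whereas a decoder that 
assumes UTF-8 would see 0xA4 as a stray continuation byte and reject it.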

Microsoft VC++ has the following flags:

/utf-8 set source and execution character set to UTF-8
/validate-charset[-] validate UTF-8 files for only legal characters

These control how the source code is assumed to be encoded.

gcc (more specifically cpp, the C preprocessor) processes source files as 
UTF-8 by default but, like VC++, has a flag to control the input character set:

       -finput-charset=charset
           Set the input character set, used for translation from the
           character set of the input file to the source character set used by
           GCC.  If the locale does not specify, or GCC cannot get this
           information from the locale, the default is UTF-8.  This can be
           overridden by either the locale or this command-line option.
           Currently the command-line option takes precedence if there's a
           conflict.  charset can be any encoding supported by the system's
           "iconv" library routine.

Now, tcc should be compatible with both. I mean:

- The native Windows tcc port should NOT assume characters are UTF-8 encoded, 
and a -utf-8 flag should change this behavior (+ -finput-charset=xxx for gcc 
compatibility)
- Other ports (Linux and the like) should assume characters are UTF-8 encoded, 
and a -finput-charset=xxx flag should change this behavior (+ -utf-8 for VC++ 
compatibility)

To summarize, we should add both -utf-8 and -finput-charset=xxx support and 
set the default behavior based on the native port.
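
As a rough sketch of what I mean (all names below are hypothetical, none of 
them exist in tcc today, and actually converting a named charset would still 
need something like iconv), the selection logic could look like this:

```c
#include <string.h>

/* Hypothetical sketch, not tcc code: how the input charset could be chosen. */
enum input_charset {
    INPUT_CHARSET_RAW_BYTES, /* do not interpret bytes >= 0x80 as UTF-8 */
    INPUT_CHARSET_UTF8,      /* decode the source as UTF-8 */
    INPUT_CHARSET_NAMED      /* some other charset, given by name */
};

struct charset_choice {
    enum input_charset kind;
    const char *name;        /* set only for INPUT_CHARSET_NAMED */
};

/* Default depends on the port: raw bytes on native Windows, UTF-8 elsewhere. */
static struct charset_choice default_charset(int windows_port)
{
    struct charset_choice c;
    c.kind = windows_port ? INPUT_CHARSET_RAW_BYTES : INPUT_CHARSET_UTF8;
    c.name = NULL;
    return c;
}

/* Apply one command-line argument. Returns 1 if it was a charset option,
   0 otherwise. Both spellings are accepted on every port. */
static int apply_charset_option(struct charset_choice *c, const char *arg)
{
    static const char prefix[] = "-finput-charset=";
    const char *name;

    if (strcmp(arg, "-utf-8") == 0) {
        c->kind = INPUT_CHARSET_UTF8;
        c->name = NULL;
        return 1;
    }
    if (strncmp(arg, prefix, sizeof prefix - 1) != 0)
        return 0;
    name = arg + sizeof prefix - 1;
    if (*name == '\0')
        return 0; /* empty charset name: not accepted */
    if (strcmp(name, "UTF-8") == 0 || strcmp(name, "utf-8") == 0) {
        c->kind = INPUT_CHARSET_UTF8;
        c->name = NULL;
    } else {
        c->kind = INPUT_CHARSET_NAMED;
        c->name = name;
    }
    return 1;
}
```

The driver would call default_charset() once with the port it was built for, 
then run each argument through apply_charset_option(), so the last charset 
option on the command line wins.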

Wdyt?

Christian


-----Original Message-----
From: Tinycc-devel [mailto:address@hidden On Behalf Of ???
Sent: Wednesday, 30 August 2017 09:31
To: address@hidden
Subject: [Tinycc-devel] BUG: wide char in wide string literal handled 
incorrectly

Hello,

   I found that when TCC processes a wide string literal, it behaves as if it 
directly casts each char in the source file to wchar_t and stores the results 
in the wide string. This works for ASCII chars. However, it might not work for 
real wide chars. For example:
   The euro sign (€, U+20AC) stored in UTF-8 is "E2 82 AC". In GCC, this char 
is stored in a wide string as "000020AC". However, in TCC, it is stored as 
3 wide chars: "000000E2 00000082 000000AC".
   I have attached a patch, a test program and two screenshots that illustrate 
this problem. I solved the problem by assuming that the input charset is 
UTF-8. Although it's not a perfect solution, it's still better than directly 
casting char to wchar_t. I'm wondering whether that is appropriate, so please 
review the code carefully.
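
To illustrate the difference described above, here is a minimal sketch of 
decoding one UTF-8 sequence into a single code point. It is not the attached 
patch; the helper name utf8_decode is hypothetical and chosen only for this 
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch or from tcc: decode one
   UTF-8 sequence starting at s, with at most n bytes available. Stores the
   code point in *out and returns the number of bytes consumed, or 0 if the
   sequence is malformed, truncated, overlong, or not a valid scalar value. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {               /* ASCII: one byte, value unchanged */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        cp = s[0] & 0x1F; len = 2; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        cp = s[0] & 0x0F; len = 3; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        cp = s[0] & 0x07; len = 4; min = 0x10000;
    } else {
        return 0;                    /* stray continuation or invalid lead */
    }

    if (n < len)
        return 0;                    /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return len;
}
```

Given the bytes E2 82 AC, this consumes 3 bytes and produces 0x20AC, which is 
the single wide char GCC stores, instead of the three separate values E2, 82 
and AC.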

Thanks
Zhang Boyang



