From: Christian Jullien
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Date: Sun, 3 Sep 2017 07:50:45 +0200
Managing UTF-8 (and Unicode) correctly on all platforms is a nightmare. I did
it only partially for my Lisp.
It's hard to say whether your code is correct, but I have the impression it
is not, since you use neither MB_LEN_MAX nor MB_CUR_MAX. Hence you don't handle
every possible multibyte character length.
There is a system-dependent constant named MB_LEN_MAX that gives the maximum
number of bytes in a multibyte character (see for example
http://man7.org/linux/man-pages/man3/MB_LEN_MAX.3.html).
As you can read there, it must be used together with MB_CUR_MAX, a
locale-dependent value.
With the "most common" locales you can live with 5 to 6 bytes, but I'm
discovering that MB_LEN_MAX is now 16 on Linux!
From Linux <limits.h>:
/* Maximum length of any multibyte character in any locale.
We define this value here since the gcc header does not define
the correct value. */
#define MB_LEN_MAX 16
From VC++ 14:
#define MB_LEN_MAX 5 // max. # bytes in multibyte char
The ISO C standard defines two macros that provide this information.
Macro: int MB_LEN_MAX
MB_LEN_MAX specifies the maximum number of bytes in the multibyte sequence for
a single character in any of the supported locales. It is a compile-time
constant and is defined in limits.h.
Macro: int MB_CUR_MAX
MB_CUR_MAX expands into a positive integer expression that is the maximum
number of bytes in a multibyte character in the current locale. The value is
never greater than MB_LEN_MAX. Unlike MB_LEN_MAX this macro need not be a
compile-time constant, and in the GNU C Library it is not.
MB_CUR_MAX is defined in stdlib.h.
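To make the difference between the two macros concrete, here is a minimal sketch. The helper names encode_one and current_max_bytes are made up for illustration and are not part of tcc or of any patch in this thread. It only relies on MB_LEN_MAX being a compile-time constant (so it can size an array) and on MB_CUR_MAX being evaluated at run time.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcrtomb, mbstate_t */

/* Hypothetical helper: encode one wide character in the current locale.
 * The caller's buffer is sized with MB_LEN_MAX, which is large enough
 * for any locale. Returns the number of bytes written, or (size_t)-1
 * if wc has no representation in the current locale. */
size_t encode_one(wchar_t wc, char out[MB_LEN_MAX])
{
    mbstate_t st;

    assert(out != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    return wcrtomb(out, wc, &st);
}

/* Hypothetical helper: run-time upper bound for the current locale.
 * MB_CUR_MAX may expand to a function call, so it cannot size a
 * static array the way MB_LEN_MAX can. */
size_t current_max_bytes(void)
{
    return (size_t)MB_CUR_MAX;
}
```

A caller would declare `char buf[MB_LEN_MAX];` and pass it to encode_one. The value returned by current_max_bytes changes with setlocale, while the buffer size does not.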
If it helps, you can adapt and use:
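/*
 * Returns the number of bytes in the UTF-8 sequence whose lead byte is c,
 * or 0 if c cannot start a sequence.
 * OLMB_LEN_MAX is not defined in this excerpt. If it is left undefined the
 * preprocessor treats it as 0, so the table variant keeps the 5- and 6-byte
 * entries while the branch variant stops at 4.
 */
#include <stddef.h> /* size_t */
#define OLMBCLEN_USES_TABLE
#if defined( OLMBCLEN_USES_TABLE )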
static const unsigned char olbytesForUTF8[256] = {
/* ASCII 7bit char -> 0xxxxxxx */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 00 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 10 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 20 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 30 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 40 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 50 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 60 */
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 70 */
/* continuation byte, invalid as a lead byte -> 10xxxxxx */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 80 */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 90 */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* A0 */
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* B0 */
/* (c & 0xE0) == 0xC0 -> 110xxxxx */
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* C0 */
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* D0 */
/* (c & 0xF0) == 0xE0 -> 1110xxxx */
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* E0 */
/* (c & 0xF8) == 0xF0 -> 11110xxx */
#if (OLMB_LEN_MAX == 4)
4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0 /* F0 */
#else
4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0 /* F0 */
#endif
};
size_t
olmbclen( int c )
{
return( (size_t)olbytesForUTF8[ c & 0xFF ] );
}
#else
size_t
olmbclen( int c )
{
if ((c & 0x80) == 0x00) {
return( 1 );
} else if( (c & 0xE0) == 0xC0) {
return( 2 );
} else if( (c & 0xF0) == 0xE0 ) {
return( 3 );
} else if( (c & 0xF8) == 0xF0) {
return( 4 );
#if (OLMB_LEN_MAX > 4)
} else if ((c & 0xFC) == 0xF8) {
return( 5 );
#endif
#if (OLMB_LEN_MAX > 5)
} else if ((c & 0xFE) == 0xFC) {
return( 6 );
#endif
}
return( 0 );
}
#endif
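As a usage sketch, the lead-byte length can drive a loop that counts characters in a UTF-8 string. The names utf8_lead_len and utf8_count_chars are hypothetical. The first uses the same mask tests as the branch variant above, capped at 4 bytes, and the loop also checks that every following byte has the form 10xxxxxx, which the length lookup alone does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a UTF-8 lead byte, or 0 if the byte cannot start
 * a sequence (continuation byte or 5/6-byte legacy form). */
static size_t utf8_lead_len(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Counts the characters in a NUL-terminated UTF-8 string.
 * Returns (size_t)-1 on a bad lead byte or a truncated sequence. */
size_t utf8_count_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != 0) {
        size_t len = utf8_lead_len(*p);
        if (len == 0)
            return (size_t)-1;
        for (size_t i = 1; i < len; i++) {
            /* A NUL fails this test too, so we never read past the end. */
            if ((p[i] & 0xC0) != 0x80)
                return (size_t)-1;
        }
        p += len;
        count++;
    }
    return count;
}
```

This sketch does not reject overlong encodings or surrogate code points. A decoder meant for a compiler would need those checks as well.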
-----Original Message-----
From: Tinycc-devel [mailto:address@hidden] On Behalf Of 张博洋
Sent: Saturday, 2 September 2017 19:12
To: address@hidden
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Hello,
Here is the new patch, which fixes the UTF-16 truncation problem on Windows.
Zhang Boyang
On 2017-09-01 19:50, Christian JULLIEN wrote:
> Given the platforms tcc supports, I think you can assume wchar_t uses 2 bytes
> on Windows and 4 bytes on all other platforms (I'm not totally sure, but I
> think you can force wchar_t to be 2 bytes on macOS).
> I've never heard of any other implementation of wchar_t (I don't recall how
> z/OS encodes wchar_t, but I doubt someone will port tcc to this system, which
> still uses EBCDIC natively).
>
>
> On: 1 September 2017 at 11:02 (GMT +02:00)
> From: "张博洋" <address@hidden>
> To: "address@hidden" <address@hidden>
> Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
>
>
> Hello,
>
> Thanks for your reply.
>
> My assumptions apply only to wide string literals. For plain string
> literals, both the original tcc and my patched tcc "copy the bytes of the
> plain string as is". For wide strings, the original tcc "reads each char
> and casts it to wchar_t", whereas my patched tcc "decodes them as UTF-8
> sequences".
>
> After some consideration, I found that the assumption I made was "wide
> string literals are written in UTF-8, and wchar_t is always UTF-32".
> That leads to two problems. First, a wide string literal in a source file
> is necessarily in the same encoding as the source file, which might not be
> UTF-8. This will cause the problems you mentioned. Second, wchar_t is
> not always UTF-32. It is UTF-16 on Microsoft Windows. Some characters, such
> as emojis, will get corrupted because their values are truncated.
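The truncation arises because a code point above U+FFFF does not fit in one 16-bit unit and has to be split into a surrogate pair. The following is a minimal sketch of that split. The function name utf16_encode is hypothetical and this is not the code from the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a Unicode scalar value (above U+10FFFF or a surrogate). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

For example, U+1F600 becomes the pair D83D DE00, whereas a plain cast to a 16-bit wchar_t would keep only F600.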
>
> Although there are problems, once the second problem is fixed (which is
> easy), my patched tcc will always perform better than the original tcc. If
> something breaks, it also breaks on the original tcc. I provided a table
> in the attachments describing every situation and the corresponding behavior.
>
> The ideal solution is to provide charset options, as you mentioned.
> After some searching on the internet, I found that there are 3 command
> line options that control character encoding:
> -fexec-charset=charset
> -fwide-exec-charset=charset
> -finput-charset=charset
> For these features to work correctly, tcc must do two conversions:
> (1) convert all plain string literals from input-charset to
> exec-charset
> (2) convert all wide string literals from input-charset to
> wide-exec-charset
> However, providing these features requires external libraries like iconv,
> and doing this might make the Tiny C Compiler not tiny.
>
> My questions are:
> (1) Is wchar_t either UTF-32 or UTF-16 on all platforms?
> (2) Should we provide full charset support using external libraries?
>
>
> Thanks
> Zhang Boyang
>
>
>
> On 2017-09-01 11:54, Christian Jullien wrote:
> > Hello,
> >
> > I'm not sure you can assume that a character with code >= 0x80 is
> > part of UTF-8. Beyond what is called the "basic character set", which is
> > roughly 7-bit ASCII, there is the "extended character set", which is
> > implementation-defined.
> >
> > For example, the euro sign EUR may be part of 8859-15 and
> > perfectly well encoded on 8 bits as 0xA4, see
> > https://en.wikipedia.org/wiki/ISO/IEC_8859-15
> >
> > Microsoft VC++ has the following flags:
> >
> > /utf-8 set source and execution character set to UTF-8
> > /validate-charset[-] validate UTF-8 files for only legal characters
> >
> > These control how source code is encoded.
> >
> > gcc (more specifically cpp, the C preprocessor) processes source
> > files as UTF-8 but, like VC++, has a flag to control the input charset:
> >
> > -finput-charset=charset
> > Set the input character set, used for translation from the
> > character set of the input file to the source character set used by
> > GCC. If the locale does not specify, or GCC cannot get this
> > information from the locale, the default is UTF-8. This can be
> > overridden by either the locale or this command-line option.
> > Currently the command-line option takes precedence if there's a
> > conflict. charset can be any encoding supported by the system's
> > "iconv" library routine.
> >
> > Now, tcc should be compatible with both. I mean:
> >
> > - The native Windows tcc port should NOT assume characters are UTF-8
> > encoded, and the -utf-8 flag should change this behavior (+
> > -finput-charset=xxx for gcc compatibility)
> > - Other ports (I mean Linux & alt.) should assume characters are UTF-8
> > encoded, and the -finput-charset=xxx flag should change this behavior
> > (+ -utf-8 for VC++ compatibility)
> >
> > To summarize, we should add both -utf-8 and -finput-charset=xxx support
> > and set the default behavior based on the native port.
> >
> > Wdyt?
> >
> > Christian
> >
> >
> > -----Original Message-----
> > From: Tinycc-devel [mailto:address@hidden] On Behalf Of 张博洋
> > Sent: Wednesday, 30 August 2017 09:31
> > To: address@hidden
> > Subject: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
> >
> > Hello,
> >
> > I found that when TCC processes a wide string literal, it behaves as if
> > it directly cast each char in the source file to wchar_t and stored the
> > results in the wide string. This works for ASCII chars. However, it might
> > not work for real wide chars. For example:
> > The Euro sign (EUR, U+20AC) stored in UTF-8 is "E2 82 AC". In GCC,
> > this char stored in a wide string will be "000020AC". However, in TCC, this
> > char is stored as 3 wide chars "000000E2 00000082 000000AC".
> > I provided a patch, a test program and two screenshots that describe
> > this problem; they are in the attachments. I solve this problem by
> > assuming that the input charset is UTF-8. Although it's not a perfect
> > solution, it's still better than "directly casting char to wchar_t". I'm
> > wondering whether that is appropriate, so please review the code carefully.
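The Euro sign example can be reproduced with a small decoder. This is a sketch that assumes UTF-8 input of at most 4 bytes per character. The function name utf8_decode is hypothetical and it is not the attached patch. It shows how the payload bits of "E2 82 AC" combine into the single value 0x20AC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s, where n bytes are available.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated, overlong, a surrogate, or otherwise
 * ill-formed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len;
    uint32_t v;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    if      ((s[0] & 0x80) == 0x00) { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                      /* continuation or invalid lead byte */

    if (len > n)
        return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* must be 10xxxxxx */
            return 0;
        v = (v << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }
    if (v < min_cp[len] || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;
    *cp = v;
    return len;
}
```

Casting each byte separately, as described above, yields three values (0xE2, 0x82, 0xAC) where the decoder yields one.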
> >
> > Thanks
> > Zhang Boyang
> >
> >
> > _______________________________________________
> > Tinycc-devel mailing list
> > address@hidden
> > https://lists.nongnu.org/mailman/listinfo/tinycc-devel
> >
>
--
Zhang Boyang (张博洋) - Fudan University, Computer Science and Technology, 2014 cohort
My personal website: http://www.zbyzbyzby.com