bug-gnu-utils

Re: printf "%c" and numbers above 255


From: Aharon Robbins
Subject: Re: printf "%c" and numbers above 255
Date: Mon, 12 May 2008 01:03:10 +0300

Greetings. Re this:

> Date: Sat, 10 May 2008 00:30:44 +0200
> From: Hermann Peifer <address@hidden>
> Subject: printf "%c" and numbers above 255
> To: address@hidden
>
> According to the printf documentation in the Gawk manual:
>
> |%c  |This prints a number as an ASCII character; thus, `printf "%c", 
> 65' outputs the letter `A'.
>
> My observation is that this works fine for numbers 0..127 and even for 
> 128..255 (although the resulting characters in the latter range are 
> not ASCII characters).
>
> For numbers above 255, printf basically prints the character of 
> number%256, e.g.:
>
> $ awk 'BEGIN{printf "%c\n",65+256}'
> A
> $ awk 'BEGIN{printf "%c\n",65+256+256}'
> A
>
> I'm not quite sure if this is a bug or feature. For me it was at least 
> somewhat surprising.
>
> Hermann

It's what the gawk doc refers to as a "dark corner".  I just posted
a longer note about some of these issues in comp.lang.awk.  The other
industrial-strength awks (mawk, nawk) also behave this way.
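
As a rough illustration of the behavior reported above (a sketch in
Python, not awk; the function name is made up for this example, and it
only mimics the effect rather than reproducing any awk's actual code),
keeping just the low 8 bits of the value gives the same results:

```python
def percent_c_single_byte(n):
    """Mimic a single-byte "%c": keep only the low 8 bits of n."""
    return bytes([n % 256])

# The three values from the report all collapse to the same byte.
for n in (65, 65 + 256, 65 + 256 + 256):
    print(n, percent_c_single_byte(n))
```

All three values come out as the single byte for `A`.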

It is not clear what the "right" behavior is, since assuming Unicode
is wrong; not all locales are Unicode.
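
To make the locale problem concrete, here is a small Python sketch (the
helper name and the choice of encodings are assumptions made for this
illustration only). It treats 321 as a Unicode code point and tries to
encode it the way three different locale encodings would:

```python
def encode_or_none(code_point, encoding):
    """Encode a Unicode code point, or return None if the
    target encoding has no representation for it."""
    try:
        return chr(code_point).encode(encoding)
    except UnicodeEncodeError:
        return None

# 321 is U+0141 if (and only if) we assume the number is Unicode.
for enc in ("utf-8", "iso-8859-2", "iso-8859-1"):
    print(enc, encode_or_none(321, enc))
```

The same number becomes two bytes in UTF-8, one byte in ISO-8859-2, and
has no representation at all in ISO-8859-1, so there is no single
obviously correct output for a value above 255.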

There are other cases where the single-byte nature of the code
leaks through, such as

        sprintf("%c", 65+256+256) ~ /[[:alpha:]]/

since gawk does not use iswalpha or any other wide-character isXXX
routine.

> PS
> A nice enhancement would be if printf could print Unicode characters, 
> similar to /usr/bin/printf, which has the formats \uHHHH and \UHHHHHHHH:
>
> \uHHHH   Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
> \UHHHHHHHH  Unicode character with hex value HHHHHHHH (8 digits)

Gawk isn't C, nor is POSIX awk the same as C99, and while this looks like
a good idea at first blush, I don't think it is, since there are systems where
wide characters are only two bytes (Windows being the most notable one),
and again, not all systems are Unicode.
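
The two-byte limitation can be sketched as follows (Python again, with a
hypothetical helper name; UTF-16 is used here as a stand-in for a
two-byte wide-character type, which is an assumption of this example).
A value that needs the 8-digit \UHHHHHHHH form does not fit in one
16-bit unit:

```python
def utf16_code_units(code_point):
    """Return the 16-bit code units needed for one code point."""
    data = chr(code_point).encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# U+0041 fits in a single 16-bit unit; U+1F600 needs a surrogate pair.
for cp in (0x41, 0x1F600):
    print(hex(cp), [hex(u) for u in utf16_code_units(cp)])
```

A character above U+FFFF takes two units, so it cannot be held in a
single two-byte wide character.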

The combination of wide characters, multibyte characters, and locales
leads to One Big Mess, and I don't think I'm ready to jump into this
particular cesspool just yet. (I've spent enough time there already.)

I thank you for the bug report, and I'm sorry I don't have a more
immediately pleasing response.

Arnold



