bug-m4

Re: format bug


From: Eric Blake
Subject: Re: format bug
Date: Thu, 31 May 2007 21:05:04 -0600
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.10) Gecko/20070221 Thunderbird/1.5.0.10 Mnenhy/0.7.5.666

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Daniel Richard G. on 5/31/2007 7:06 PM:
> On Thu, 2007 May 31 13:23:28 -0600, Eric Blake wrote:
>>> Leery of making "%c" == "%.1s". This behavior doesn't seem terribly useful.
>> But it would match what /bin/printf does, and nobody complained about that
>> being not terribly useful when it was standardized in POSIX.
> 
> Given the amount of griping that goes on about POSIX in general, I find 
> that difficult to believe.

POSIX standardized existing practice.  printf(1) was invented years ago,
in the Ninth Edition system (before I was even using Unix).  It parsed the
argument for %b, %c, and %s as a string, and for all other specifiers as
an integer (the original printf(1) did not support %e, %f, or %g, although
most modern implementations of printf(1) support them as an extension).

> 
> Anyway, matching what printf(1) does is an advantageous property, but not 
> so much (IMHO) as to justify following every quirk. You could argue that 
> the longtime behavior of %c better matches the C semantic---which is 
> familiar to a lot more people, and is more authoritative on this than 
> printf(1).

That's the point of this thread - I'm offering to implement escape
sequences (also part of printf(1)) to make up for the fact that I am about
to change %c semantics to be consistent with other, more standardized,
utilities.

The problem is that with the shell (printf) and with m4 (format), all
arguments start out as strings, with no strong typing to tell whether the
string should be interpreted as a number or left as a string.  So seeing
the string "1" makes it impossible to tell whether the user meant the
character with value 1 (C's '\1') or the literal 1 (C's '1').  And I tend
to value consistency (it's easier to state that m4 is like the shell, than
it is to say that m4 is a special case and behaves differently), as long
as there is also a way to accomplish the alternate interpretation.  And it
is also why I will be implementing \nnn octal and \xnn (required by POSIX
in printf), but was not planning on \unnnn or \Unnnnnnnn at this point
(also not required by POSIX), because implementation-defined multibyte
characters don't make sense until m4 can do multibyte characters to begin
with.
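
As a minimal sketch of that ambiguity, here is how the shell's printf(1)
treats the two cases, assuming a POSIX-like sh with an ASCII-compatible
codeset.  The function names first_char and byte_65 are hypothetical and
exist only for this illustration.

```shell
# %c takes its argument as a string and prints only the first character,
# so the argument "65" yields "6", not the character with value 65.
first_char() {
  printf '%c' "$1"
}

# An octal escape in the format string names the byte by value instead.
# \101 is octal for 65, which is "A" in ASCII.
byte_65() {
  printf '\101'
}

first_char 65; echo   # prints 6
byte_65; echo         # prints A
```

The escape sequence is what provides the alternate interpretation once
%c consistently means "first character of the string".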

> 
> Okay, I forgot about eval()'s radix (and width) arguments. So that makes it 
> possible. But do you really think that e.g.
> 
>       define(`codepoint', `8995')dnl Unicode "SMILE" character
> 
>       format(`\u'eval(codepoint, `16', `4'))
> 
> is an improvement over
> 
>       format(`%lc', codepoint)

The fact that %lc ever worked for you is an undocumented happenstance and
a sign of non-portability; it did not exist in printf(3) when Rene' first
implemented m4 back in 1990, so it was not part of the original design of
GNU m4.  In reality, m4 has never been locale-aware, and more platforms
probably got %lc wrong than those that seemed to get it right.  I can
consider reenabling it, now that you have brought it up, but since m4
still operates on bytes and not multibyte characters, I'm not sure it is
the right thing to knowingly enable something that is likely broken.  In
general, using undocumented aspects of a program is subject to these sorts
of changes in behavior.
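
For reference, the eval(codepoint, `16', `4') call in your example is just
a zero-padded hexadecimal conversion.  A rough shell equivalent, assuming a
POSIX printf(1), is sketched below; the function name escape_for is
hypothetical, and the sketch only builds the escape text, it does not
interpret it.

```shell
# Build the text of a \u escape from a decimal code point.
# %04x pads the hexadecimal value to four digits, like eval's width argument.
escape_for() {
  printf '\\u%04x' "$1"
}

escape_for 8995; echo   # prints \u2323
```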

> (I've been using a chr() composite implemented with %lc, which I thought 
> was extremely cool on finding that m4 supported it. Now I'm a bit miffed 
> that the latest CVS code no longer recognizes this...)

Well, since you are the first to bring it up, maybe we can consider
documenting it, and making it something we support (and adding regression
tests, to make sure we don't inadvertently break it again in the future).

Meanwhile, you've made enough comments about how m4 behaves that you may
want to consider assigning copyright to the FSF so that you can
contribute patches to help move it along to better meet your needs.  I
tend to work on the things that bother me, and I have not yet been
bothered enough by the inability to use multibyte characters in m4 to
make the code changes necessary to support locales properly.

> 
> 
> On a separate note, this point you mentioned earlier caught my notice:
> 
>> - - no portable way to convert a character to an integer short of a
>> 255-element reverse-lookup table (you could use a forloop recursion
>> construct, but be sure your iterator and quote characters are
>> multi-character for the duration of the loop to avoid parse problems; hmm,
>> maybe I should code this up and add it to the examples directory)
> 
> I've come up against this issue as well.
> 
> May I suggest borrowing a page from Perl, and adding a builtin like ord()? 
> That would be much more efficient than a lookup table, and it would be able 
> to handle multi-byte characters.

No, we don't need another builtin.  For 1.4.9 and earlier it is too late:
we can't re-release those versions with a new builtin, so you are already
stuck with implementing it as a hairy composite.  And with 1.4.10 and 2.0,
format(`%d',`"a') is the quick and easy way to do this, so you can easily
write an ord() composite that wraps format.  My comment was that there is
no simple way to do this portably across both 1.4.9 and 1.4.10
simultaneously, so I hope to code up some sort of composite that can do
the trick in 1.4.9 and add it to the examples directory.
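
The leading-quote convention comes from printf(1), so the idea can be
sketched in the shell first.  This assumes a POSIX printf and a
single-byte ASCII character; the ord wrapper here is a hypothetical
shell function, not the m4 composite itself.

```shell
# A numeric conversion whose argument starts with a quote character
# yields the numeric value of the character that follows the quote.
ord() {
  printf '%d' "\"$1"
}

ord a; echo   # prints 97
```

An m4 composite wrapping format would pass the same quote-prefixed
argument.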

- --
Don't work too hard, make some time for fun as well!

Eric Blake             address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGX4zg84KuGfSFAYARAsf/AKC+fDz91L6QbBMfa0lwgbDh3X3j2gCePg3j
kIYLYzv2Blbmn4jXsNJT1zo=
=BB9J
-----END PGP SIGNATURE-----



