bug#17196: UTF-8 printf string formating problem
From: Pádraig Brady
Subject: bug#17196: UTF-8 printf string formating problem
Date: Tue, 08 Apr 2014 01:11:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow
>>>> leverage the existing multibyte character aware alignment/truncation
>>>> code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>> LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★★
>> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★
>> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>> A
>>
>> zsh seems to just count characters:
>> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>> printf '%3s' 'blah' # count cells
>> printf '%3Ls' 'blah' # count chars
>> LANG=C printf '%3Ls' 'blah' # count bytes
>> LANG=C printf '%3s' 'blah' # count bytes
>
> Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined. But I would LOVE to have a
> standardized spelling for counting characters instead of bytes. The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.
Note that ksh seems to count cells, rather than characters, with %Ls.
> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes. Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.
At the shell level I expect that the vast majority
of uses would prefer to be specifying cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).
But it's a fair point that there may be scripts
that don't consider the zsh behavior.
If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:
printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count cells
LANG=C printf '%3Ls' 'blah' # count bytes
This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
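
To make the three counting units concrete, here is a minimal sketch in
Python (not shell, since printf behavior differs between shells, as the
examples above show). It is illustrative only: the names cell_width and
truncate are hypothetical, and the width rule is a rough approximation
built on Python's unicodedata module (combining marks take zero cells,
East Asian wide/fullwidth characters take two, everything else takes one).
It does not claim to match what ksh, zsh, or mbsalign.c implement.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of terminal cells used by one code point."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn over the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters occupy two cells
    return 1


def truncate(text, limit, unit):
    """Truncate text to at most `limit` bytes, chars, or cells."""
    if unit == "bytes":
        # Returns bytes: the cut may land inside a multibyte character.
        return text.encode("utf-8")[:limit]
    if unit == "chars":
        return text[:limit]  # counts code points, not graphemes
    if unit == "cells":
        out, used = [], 0
        for ch in text:
            width = cell_width(ch)
            if used + width > limit:
                break
            out.append(ch)
            used += width
        return "".join(out)
    raise ValueError("unit must be 'bytes', 'chars', or 'cells'")


if __name__ == "__main__":
    sample = "a\u0301\u2605\u2605\u2605"  # same string as the ksh example
    for unit in ("bytes", "chars", "cells"):
        print(unit, repr(truncate(sample, 3, unit)))
```

With that sample string and a limit of 3, bytes keeps only the "a" plus
its combining accent, chars adds one star, and cells adds two stars. A
byte limit of 3 on 'A\u2605' cuts the star in half, which is the case
Eric describes as ending in the middle of a multibyte character.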
thanks,
Pádraig.