bug-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM


From: Eli Zaretskii
Subject: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Sun, 03 Jul 2022 16:26:54 +0300

> Cc: rgm@gnu.org, schwab@linux-m68k.org, 48324@debbugs.gnu.org
> Date: Sun, 03 Jul 2022 16:00:47 +0300
> From: Eli Zaretskii <eliz@gnu.org>
> 
> > From: Lars Ingebrigtsen <larsi@gnus.org>
> > Cc: rgm@gnu.org,  schwab@linux-m68k.org,  48324@debbugs.gnu.org
> > Date: Sun, 03 Jul 2022 14:07:43 +0200
> > 
> > Lars Ingebrigtsen <larsi@gnus.org> writes:
> > 
> > > Hm...  I guess the only reliable solution across all coding systems is
> > > (like your comment in the code says) to drop the encode-every-char and
> > > try encoding strings, and then see whether the result is short enough.
> > > That could be done somewhat efficiently using a binary search.  I'll
> > > have a go at it...
> > 
> > And while I was at it, I changed it to return complete glyphs, not just
> > complete code points.
> > 
> > There's a behavioural change, though.  This: 
> > 
> > (string-limit "foóá" 6 t 'utf-16)
> > 
> > Now returns a string with a BOM, whereas previously it didn't.
> 
> So you get 6 characters + the BOM?

I see that it's actually 6 bytes _including_ the BOM.  So I think this
is confusing: if we are going to return a string with the BOM, we
should not count the BOM as part of the LENGTH bytes.  Because if I
requested to get characters which fit into N bytes, I should get those
N bytes of payload.  Or maybe we should have an optional argument to
control whether LENGTH includes or excludes the BOM.

In any case, we should mention this aspect in the doc string, I think.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]