[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: mbcel module for Gnulib?, incomplete multibyte sequences
From: |
Bruno Haible |
Subject: |
Re: mbcel module for Gnulib?, incomplete multibyte sequences |
Date: |
Sat, 05 Aug 2023 01:05:52 +0200 |
Paul Eggert wrote in
<https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>:
> Here are the results:
>
> noop mbiterf mbuiterf mbcel mbucel
> a 1.546 2.038 1.931 1.639 1.968
> b 1.545 2.016 1.936 1.705 1.969
> c 2.153 8.160 8.255 6.160 5.550
> d 2,153 6,692 6,686 6,322 5,520
> e 2.234 6.866 6.817 6.214 5.781
> f 1.843 95.276 104.270 58.667 60.902
> g 1,844 76,126 83,899 65,541 66,983
> h 3.297 75.727 83.735 64.261 65.686
> i 1.941 31.064 34.887 26.262 27.072
> j 1.310 29.620 33.698 24.751 25.550
To me, the columns timings of mbiterf and mbuiterf are good enough.
The rows c and f will improve when glibc's C locale bug is fixed.
For the rest, I see the timings as close enough to the optimum.
> But in another sense these benchmarks are biased against mbcel, as this
> benchmark compiles in a special environment that defines NDEBUG, which
> most Gnulib-using code doesn't do
Yes, it's so tempting to take out all assert()s from production code :)
But I won't do it nevertheless, because I fear platform bugs in the mb* code.
> For applications that need MEE, it'd be easy for mbcel to supply it.
> Something like the following patch would do it, though it should be a
> different function (e.g., mbcel_scanmee) instead of being a change to
> mbcel_scan itself. However, diffutils doesn't need to bother with such a
> change as it can use SEE like most other programs do.
>
> --- a/lib/mbcel.h
> +++ b/lib/mbcel.h
> @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim)
> if (_GL_UNLIKELY ((size_t) -1 / 2 < len))
> - return (mbcel_t) { .err = *p, .len = 1 };
> + return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p : 1
> };
With a patch like this, i.e. with MEE, I would welcome mbcel and mbucel in
Gnulib.
(Although maybe you may want to align the module name to be similar
to mbiterf and mbuiterf : maybe mbitervf and mbuitervf for "very fast"?)
> SEE is simpler and is common practice elsewhere, notably Emacs.
Emacs is a complex beast. I can understand if the Emacs developers want
an implementationally-simple behaviour, rather than a simple-from-the-
user-perspective behaviour. But other GNU programs can and should do
it better, since modules 'mb*iter*f' already exist that hide the
implementational complexity.
Paul Eggert wrote in
<https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00149.html>:
> It would be better to change mbu?iterf? to use
> single-byte-per-encoding-error ("SEE") behavior
No. No way. This would be a regression.
> This is because mbu?iterf? doesn't implement MEE either.
>
> For MEE, mbiterf would need something like the attached untested patch,
> and mbiter, mbcel, etc. would all need similar patches.
Good point. Actually, with iconv() based converters one needs such a
loop, because iconv's calling conventions don't allow to return detailed
information. (See iconv_carefully and iconv_carefully_1 in lib/striconveh.c.)
mbrtowc / mbrtoc32 is a bit different, but since upon return it does not
give detailed information about the structure of the invalid byte sequence,
it may indeed need such a loop.
For comparison, <unistr.h> u8_mbtouc returns the number of bytes, but OTOH
does not allow to distinguish U+FFFD vs. incomplete/invalid byte sequence.
I'll think more about this one...
Bruno
- Re: mbcel module for Gnulib?, incomplete multibyte sequences,
Bruno Haible <=