multibyte processing - handling invalid sequences (long)
From: Assaf Gordon
Subject: multibyte processing - handling invalid sequences (long)
Date: Wed, 20 Jul 2016 02:11:18 -0400
Hello all,
I'd like to discuss a few aspects of multibyte processing in coreutils, in
preparation for future improvements.
To start with an "easy" topic: how to handle invalid input (i.e. input octets
that result in an invalid multibyte sequence).
A previous discussion suggested no internal conversion to wchar_t, so that
invalid sequences can be handled as in the C locale (
https://lists.gnu.org/archive/html/coreutils/2010-09/msg00051.html ).
Pádraig's i18n plan left the handling issue open ("How do we handle invalid
encodings; substitution, elision, leaving in place?",
http://www.pixelbeat.org/docs/coreutils_i18n/).
Is there an agreement on how to handle those?
Do we want to fall back to the C locale, and does that imply going back and
revisiting the invalid octets and re-processing them as single-byte characters?
If so, the implementation needs to keep the N octets (up to MB_CUR_MAX) and be
able to go back and process them. Alternatively, we can treat only the last
octet (the offending one that caused the sequence to be invalid) as a
single-byte character, thus possibly losing data.
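To make the two alternatives concrete, here is a rough Python sketch (not coreutils code). It assumes UTF-8 input and uses hypothetical names: 'split_units', and the policy labels "all-octets" and "last-octet". It splits a byte string into character units and applies one of the two fallbacks to an invalid run.

```python
def expected_len(lead: int) -> int:
    """Length implied by a UTF-8 lead byte; 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def split_units(data: bytes, policy: str) -> list:
    """Split data into character units (hypothetical helper).

    "all-octets": every octet of an invalid run becomes a single-byte unit.
    "last-octet": only the offending octet is kept; earlier ones are dropped.
    """
    if policy not in ("all-octets", "last-octet"):
        raise ValueError(policy)
    units, i = [], 0
    while i < len(data):
        n = expected_len(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n:
            try:
                chunk.decode("utf-8")  # strict check (overlongs, surrogates)
                units.append(chunk)
                i += n
                continue
            except UnicodeDecodeError:
                pass
        # Find the offending octet: the first byte that breaks the sequence,
        # or the last buffered byte if the run is truncated or ill-formed.
        if n == 0:
            last = i
        else:
            j = i + 1
            while j < min(i + n, len(data)) and 0x80 <= data[j] <= 0xBF:
                j += 1
            last = min(j, i + n - 1, len(data) - 1)
        bad = data[i:last + 1]
        if policy == "all-octets":
            units.extend(bytes([b]) for b in bad)
        else:
            units.append(bad[-1:])  # the earlier octets are lost here
        i = last + 1
    return units
```

With the invalid input b"\342\221\300", "all-octets" yields three single-byte units and "last-octet" yields only b"\300". The first policy needs the buffered octets; the second does not, but drops data.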
One possibility is to have all programs print an informative warning to stderr
upon the detection of the first invalid multibyte sequence, then resort to
'best-effort' (e.g. only the last octet, or something else that's easy to
implement).
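A warn-once policy could look roughly like the Python sketch below. The helper name 'make_warner' and the wording of the message are made up for illustration; nothing here reflects an existing coreutils diagnostic.

```python
import sys


def make_warner(program, stream=None):
    """Return a reporter that warns only on the first invalid sequence."""
    state = {"warned": False, "count": 0}

    def report(offset):
        # Count every invalid sequence, but print a single warning.
        state["count"] += 1
        if not state["warned"]:
            state["warned"] = True
            out = stream if stream is not None else sys.stderr
            print(f"{program}: warning: invalid multibyte sequence at byte "
                  f"offset {offset}; continuing on a best-effort basis",
                  file=out)
        return state["count"]

    return report
```

A program would call the reporter each time it hits an invalid sequence; only the first call writes to stderr, while the returned count stays available for an optional summary.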
My rationale is that for an input file with invalid sequences, there is no one
correct solution that would satisfy all cases: some users would think the
obvious correct solution is to output invalid sequences as-is, others would
think they should be silently ignored (i.e. a program should never generate
invalid output even on invalid input).
The best we could do is warn them, and document a way to fix invalid files
(along the lines of 'iconv --byte-subst="<0x%x>"'). Users could always fall back
to forcing the C locale, in which case all input bytes are processed.
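For illustration, the same kind of repair can be sketched in Python with a custom decoding error handler. The handler name "byte-subst" and the function 'repair' are hypothetical; the "<0x%x>" format is the one from the iconv example above, and the input is assumed to be UTF-8.

```python
import codecs


def byte_subst(err):
    """Decoding error handler: replace each undecodable octet with '<0xNN>'."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the octets that were substituted.
    return "".join(f"<0x{b:x}>" for b in bad), err.end


codecs.register_error("byte-subst", byte_subst)


def repair(data: bytes) -> str:
    """Decode UTF-8, substituting invalid octets instead of failing."""
    return data.decode("utf-8", errors="byte-subst")
```

For example, repair(b"\342\221\300") returns "<0xe2><0x91><0xc0>", while valid input passes through unchanged.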
To be more concrete, here are some examples:
The Unicode code point U+2460 is 'CIRCLED DIGIT ONE',
in UTF-8 octal: printf '\342\221\240'
I'll use the invalid sequence '\342\221\300' as input below.
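As a quick cross-check of these octets, a small Python sketch (the helper names are illustrative) prints the UTF-8 encoding of U+2460 as octal escapes and confirms that replacing the last octet with '\300' makes the sequence invalid:

```python
def octal_escapes(data: bytes) -> str:
    """Render bytes as printf-style octal escapes."""
    return "".join(f"\\{b:03o}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(octal_escapes("\u2460".encode("utf-8")))  # \342\221\240
print(is_valid_utf8(b"\342\221\240"))           # True
print(is_valid_utf8(b"\342\221\300"))           # False
```

'\300' (0xC0) is not a continuation byte, so it cannot complete the three-byte sequence started by '\342'.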
What should be the output in the following cases:
'cut': should it print '\300' or '\342'?
printf '\342\221\300' | LC_ALL=en_US.UTF-8 cut -c1
'wc': should it print 1 (counting only '\300'), 3 (counting all octets), or 0?
Currently it prints 0 because it doesn't count invalid multibyte characters.
printf '\342\221\300' | LC_ALL=en_US.UTF-8 wc -m
A similar issue, perhaps with different logic and rationale, arises with "wc -L".
'expand': should this be expanded to '\300' + 7 spaces + 'A',
or '\342\221\300' + 5 spaces + 'A', or something else?
printf '\342\221\300\tA\n' | LC_ALL=en_US.UTF-8 expand
'fold': should this print 'aa\342\n\221\300b\n' (treating them as
single bytes), 'aa\300\nb\n' (using only the last octet), or something else?
printf 'aa\342\221\300b\n' | LC_ALL=en_US.UTF-8 fold -w 3
'printf' - deals only with bytes. e.g. the following should be printed as-is:
env printf '%s\n' "$(env printf '\342\221\300')"
env printf "$(env printf '\342\221\300')"
'fmt' and 'pr': I assume they should print the invalid sequence as-is, as they
do not break mid-word.
'head', 'tail', 'split' - not relevant as they deal with bytes, not characters.
'csplit': only indirectly relevant, as I seem to remember that a standard regex
should never match an invalid multibyte sequence?
'shuf', 'paste' - not relevant as they deal with complete lines.
'yes' - prints input as-is, e.g. the following works:
yes "$(env printf '\342\221\300')"
'test' - operators '-n' and '-z' work correctly with invalid sequences.
'expr': regex operations should never match (IIUC).
for 'substr', should this return '\300' or '\342'?
LC_ALL=en_US.UTF-8 expr substr "$(printf '\342\221\300')" 1 1
for 'length', should this return 3 (treating as 3 single-bytes) or 1 (counting
the last offending octet)?
LC_ALL=en_US.UTF-8 expr length "$(printf '\342\221\300')"
for 'index', both STRING and CHAR might be invalid. Should an invalid CHAR
parameter be rejected outright?
'numfmt' - as long as it doesn't get confused with a digit character, invalid
sequences should be printed 'as-is'.
'seq' - doesn't take any input.
'date' - should print invalid characters in format string as-is.
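The candidate outputs listed above for 'wc -m', 'expand' and 'fold' can be reproduced with a toy Python model. It is only a sketch: each unit is assumed to occupy one column, the invalid sequence is pre-split by hand into the two interpretations, and 'expand' and 'fold' here are simplified stand-ins, not the real utilities.

```python
TAB, NL = b"\t", b"\n"

# The two interpretations of the invalid input '\342\221\300'.
ALL_OCTETS = [b"\xe2", b"\x91", b"\xc0"]   # each octet is a character
LAST_OCTET = [b"\xc0"]                     # only the offending octet is kept


def ascii_units(s: bytes) -> list:
    """Split an ASCII byte string into one-byte units."""
    return [bytes([b]) for b in s]


def expand(units, tabstop=8):
    """Replace tabs with spaces, counting every unit as one column."""
    out, col = [], 0
    for u in units:
        if u == TAB:
            pad = tabstop - col % tabstop
            out.append(b" " * pad)
            col += pad
        elif u == NL:
            out.append(u)
            col = 0
        else:
            out.append(u)
            col += 1
    return b"".join(out)


def fold(units, width):
    """Insert a newline before any unit that would exceed `width` columns."""
    out, col = [], 0
    for u in units:
        if u == NL:
            out.append(u)
            col = 0
            continue
        if col == width:
            out.append(NL)
            col = 0
        out.append(u)
        col += 1
    return b"".join(out)


for name, seq in (("all octets", ALL_OCTETS), ("last octet", LAST_OCTET)):
    print(name, "wc -m:", len(seq))
    print(name, "expand:", expand(seq + [TAB, b"A", NL]))
    print(name, "fold -w 3:",
          fold(ascii_units(b"aa") + seq + ascii_units(b"b\n"), 3))
```

Under the "all octets" reading the model gives 3 for 'wc -m', five spaces before 'A' for 'expand', and 'aa\342\n\221\300b\n' for 'fold'; under the "last octet" reading it gives 1, seven spaces, and 'aa\300\nb\n'.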
For now I'm going to side-step sort+join+uniq, as I think they present a more
complicated set of issues when it comes to multibyte processing.
Comments are very welcome,
- assaf