Re: mhfixmsg character set conversion
From: Steven Winikoff
Subject: Re: mhfixmsg character set conversion
Date: Wed, 09 Feb 2022 19:48:07 -0500
>> Really. I'm not making this up. :-/
>
>No, I don't think you are. I think that line in both files is correctly
>UTF-8 encoded.
And now that you've explained what's going on, it's clear that you're
right.
>vim isn't the vi(1) I grew up with, and probably you too.
Definitely. The first time I used vi was in 1984, on a 68000-based Cadmus
system.
>Try ‘:se fileencoding?’ when vim-ing good and again with bad.
Good point:
$ vim good
:set fileencoding
fileencoding=utf-8
$ vim bad
:set fileencoding
fileencoding=latin1
>I expect the bad file has something earlier on which fixes vim's idea of
>the encoding to ISO 8859-1
That does seem to be the case. Do you have any idea what kind of thing
that might be? (I know you can't diagnose a file you haven't seen, but in
general, what sorts of things should I look for?)
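If the culprit turns out to be a byte sequence that isn't valid UTF-8, a
rough Python sketch like the following could locate it. This is only a
sketch under that assumption: vim's actual choice depends on its
'fileencodings' setting, and the function name first_non_utf8_line and
the sample bytes are made up for illustration.

```python
def first_non_utf8_line(data: bytes):
    """Return (line_number, raw_line) for the first line that does not
    decode as UTF-8, or None if every line decodes cleanly."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            return number, line
    return None


# Hypothetical sample: line 2 is Latin-1 encoded, the others are UTF-8.
sample = (
    "ligne un\n".encode("utf-8")
    + "caf\xe9\n".encode("latin-1")
    + "été\n".encode("utf-8")
)
print(first_non_utf8_line(sample))
```

For a real file, the bytes would come from something like
open("bad", "rb").read() instead of the made-up sample.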
>> But wait. It gets worse:
>>
>> $ grep -n ^Veuillez good | cut -c1-68
>> 108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>>
>> $ grep -n ^Veuillez bad | cut -c1-68
>> 108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>
>The worse being it is the very same line 108 you're seeing in vim which
>grep is also showing?
Exactly, because...
>(The ‘�’ at the end is to be expected.)
...this is still more evidence that you know more about character sets and
conversions than I do. As if further evidence were needed at this point. :-/
Until now, I've only ever seen that glyph when a character doesn't exist in
the font being used -- but that can't be the case here because that same
character is shown correctly five times in the same line of output.
Why is it to be expected?
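One possible explanation, offered as an assumption rather than a
diagnosis: if cut counts -c positions in bytes rather than characters,
then byte 68 of that grep output is the first half of a two-byte UTF-8
"é", and the terminal shows the orphaned byte as the replacement glyph.
The Python sketch below mimics that with a byte slice; the line text is
reconstructed from the output above and the hex dump further down.

```python
# grep -n output for line 108, up to and including the word "généré".
line = "108:Veuillez ne pas répondre au présent courriel. Il a été généré"
raw = line.encode("utf-8")

# What a byte-oriented `cut -c1-68` would keep (assumption: -c means bytes).
cut = raw[:68]

# The last kept byte is 0xc3, the lead byte of a two-byte "é" (c3 a9).
print(cut[-1:].hex())

# Decoding the truncated bytes leaves a replacement character at the end.
print(cut.decode("utf-8", errors="replace"))
```

The five complete "é" characters before that point take two bytes each,
which is why they still display correctly.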
>> $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
>> [...]
>
>I don't understand that. The -p sets up a loop to read a line from
>good_snippet, do the substitution on it, and print the result, until
>EOF. The -l strips off the linefeed on input and puts it back on the
>output. The substitution in between changes all bytes, thanks to
>LC_ALL=C, which aren't space to tilde into a ‘<42>’ string representing
>their hex value.
Thank you for explaining that.
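For comparison, the same substitution can be sketched in Python, which
works on bytes directly and so doesn't depend on the locale. The function
name show_bytes and the sample text are illustrative only; this is a
rough equivalent of the one-liner, not a replacement for it.

```python
import re


def show_bytes(line: bytes) -> str:
    """Replace every byte outside the range space..tilde with <hh>,
    where hh is the byte's value as two hex digits."""
    return re.sub(
        rb"[^ -~]",
        lambda m: b"<%02x>" % m.group()[0],
        line,
    ).decode("ascii")


# Hypothetical sample; splitlines() drops the newline, much as -l does.
snippet = "Il a été généré\n".encode("utf-8")
for line in snippet.splitlines():
    print(show_bytes(line))
```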
Just for fun, I tried the following in tcsh:
$ setenv LC_ALL C
$ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a
<c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>
As expected, this returned pretty much instantly. Then I tried this:
$ sh
$ LC_ALL=C
$ echo $LC_ALL
C
$ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
...and that also hung. Which in a way is good, because at least it means
bash is behaving consistently. But also not good, because it's behaving
badly. :-/
On my system, /bin/sh is a symlink to /bin/bash, which is version 5.1.016-2
as packaged by Manjaro.
...but troubleshooting bash is far outside the scope of this discussion, so
I propose to forget this particular clupea harengus of the crimson variety
unless you find it interesting in and of itself.
>Nothing wrong with od(1). If you have hexdump(1) installed then it with
>-C gives quite nice output.
Yes, I see (or -C? :-). Thanks for that tip; I hadn't known that hexdump
existed.
>> ...and both snippets are identical!
>
>Well, those lines were identical to start with before snipping.
>You could confirm this with
>
> cmp <(sed -n 108p good) <(sed -n 108p bad)
As written, this also hangs in bash (and is invalid syntax in tcsh).
But it's effectively equivalent to
$ sed -n 108p good > good.sed
$ sed -n 108p bad > bad.sed
$ cmp good.sed bad.sed
$ echo $?
0
...which behaves as expected.
>> Strangely, both snippet files look fine in vim.
>
>Because you have chopped off the non-UTF-8 which occurs earlier in bad
>which fixes vim's idea of the file's encoding.
In retrospect this should have been obvious. :-/
>> ...but for the bad file, that becomes
>>
>> "bad" [converted] 336 lines, 49471 bytes 1,1 Top
>
>Ta-da!
Indeed. :-)
Thank you.
- Steven
--
___________________________________________________________________________
Steven Winikoff |
Montreal, QC, Canada | Eschew obfuscation.
smw@smwonline.ca |
http://smwonline.ca |