Re: Non-ASCII characters in @include search path
From: Gavin Smith
Subject: Re: Non-ASCII characters in @include search path
Date: Sat, 26 Feb 2022 18:57:34 +0000
On Sat, Feb 26, 2022 at 06:50:10PM +0100, Patrice Dumas wrote:
> For an example, in the following there are only ascii strings, except -o
> encodé/ which is not decoded, and the result is that the é in encodé
> ends up not being correctly output:
>
> $ cat test_smth.texi
> \input texinfo
>
> @setfilename test_smth.info
>
> @top top
> @node Top
>
> @bye
>
> $ ./texi2any.pl -o encodé/ test_smth.texi
>
> $ ls -d encod*
> encodé
I fixed that with the following:
diff --git a/tp/Texinfo/Convert/Converter.pm b/tp/Texinfo/Convert/Converter.pm
index 4ca8a64835..3225420010 100644
--- a/tp/Texinfo/Convert/Converter.pm
+++ b/tp/Texinfo/Convert/Converter.pm
@@ -554,6 +554,16 @@ sub determine_files_and_directory($;$)
        = $self->{'global_commands'}->{'setfilename'}->{'extra'}->{'text_arg'};
   }
 
+  if ($setfilename) {
+    my $document_encoding;
+    my $ignored;
+    $document_encoding = $self->{'parser_info'}->{'input_perl_encoding'}
+      if ($self->{'parser_info'}
+        and defined($self->{'parser_info'}->{'input_perl_encoding'}));
+    ($setfilename, $ignored) = Texinfo::Common::encode_file_name(
+                                   $self, $setfilename, $document_encoding);
+  }
+
   my $input_basename_for_outfile = $input_basename;
   my $setfilename_for_outfile = $setfilename;
   # PREFIX overrides both setfilename and the input file base name
The problem was that the $setfilename variable had the UTF-8 flag on, while
the directory name from the SUBDIR variable had the UTF-8 flag off.
Concatenating the two upgraded the whole string to Perl's internal UTF-8
representation, so the already UTF-8-encoded bytes from SUBDIR were each
treated as a character and encoded again, leaving the string "double UTF-8"
encoded internally.
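To make the mechanism concrete, here is a minimal standalone sketch. It does not use any texi2any code; the variable names only mirror the ones above, and utf8::upgrade is used as a stand-in for whatever turned the flag on in the real code path (an assumption for illustration).

```perl
use strict;
use warnings;

# Byte string as received from the command line (UTF-8 flag off):
# "encodé/" already encoded as UTF-8, so é is the two bytes C3 A9.
my $subdir = "encod\xC3\xA9/";

# Character string as a decoding parser might hand it over.  It is pure
# ASCII here; utf8::upgrade only forces the UTF-8 flag on (illustrative).
my $setfilename = "test_smth.info";
utf8::upgrade($setfilename);

# For the concatenation Perl upgrades $subdir, treating each of its
# bytes as one Latin-1 character.
my $path = $subdir . $setfilename;

# utf8::encode on a copy exposes the internal UTF-8 buffer, which is
# roughly what a system call such as mkdir would be handed.
my $raw = $path;
utf8::encode($raw);

# Helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

print "flag on \$subdir: ", (utf8::is_utf8($subdir) ? "on" : "off"), "\n";
print "flag on \$path:   ", (utf8::is_utf8($path)   ? "on" : "off"), "\n";
# The two bytes of é have become four: C3 83 C2 A9
print "bytes for é:     ", hex_bytes(substr($raw, 5, 4)), "\n";
```

The bytes C3 A9 from the command line come out as C3 83 C2 A9, which is the doubly encoded é seen in the directory name.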
I had already tested this patch to get @setfilename to work properly with
an ISO-8859-1 encoded file (attached), so it was a change I would have
likely made anyway. However, I doubt that supporting ISO-8859-1 filenames
in @setfilename is very important.
I've committed it but am happy for it to be reverted if we decide on a
different approach. Of course it's very likely there are other issues.
> It may be possible to fix this issue by looking at all the places where
> the strings associated with the SUBDIR or OUTPUT customization variables
> interact with other strings, encoding all the strings they interact with,
> and re-decoding them if needed for error messages or for inclusion in
> output documents. However, the other option, decoding everything and
> encoding only when we need to interact with the outside of the code, seems
> to me to be much simpler, to require much less time and thinking, and to be
> much less error prone.
>
> > > * many strings are used both in file names and in texts. For example
> > > the customization variable 'EXTENSION'. Even strings that are almost
> > > only used as bytes can appear in error messages, which means that we
> > > need to keep the information somewhere on how to decode them.
> >
> > It is no problem as long as the EXTENSION string is purely ASCII.
>
> I do not think so. I think that it needs to be encoded if mixed with
> non ascii strings. (Also, it could be set to something non ascii, as
> customization but this should be pretty rare).
Yes, you're right: if the EXTENSION string has the UTF-8 flag on and
it is concatenated with a string with the UTF-8 flag off but which is
encoded in UTF-8, then the same "double UTF-8" problem will occur.
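One way to avoid this for a single concatenation is to encode the flagged string to bytes before joining it with the byte string. The following is only a sketch of that idea using the core Encode module; the variable names, the "index." base name, and the choice of UTF-8 as the output encoding are assumptions for illustration, not what texi2any actually does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte string (UTF-8 flag off) holding "encodé/" already encoded as UTF-8.
my $subdir = "encod\xC3\xA9/";

# Character string with the UTF-8 flag on, standing in for EXTENSION.
# utf8::upgrade is used only to force the flag for this illustration.
my $extension = "html";
utf8::upgrade($extension);

# Encode the character string to bytes first, so that every operand of
# the concatenation is a byte string and nothing gets upgraded.
my $extension_bytes = encode('UTF-8', $extension);
my $path = $subdir . "index." . $extension_bytes;

print "flag on \$path: ", (utf8::is_utf8($path) ? "on" : "off"), "\n";
printf "bytes for é:   %02X %02X\n", map { ord } split //, substr($path, 5, 2);
```

Here the é stays as the original two bytes C3 A9. The cost is that $path is now a byte string and has to be decoded again before it is used in an error message or in output text.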
>
> > > * many strings can come from documents, as character strings or from
> > > command line, possibly kept encoded. For example document file name
> > > can come from @setfilename or the command line (or customization
> > > variable).
> >
> > This is a bigger problem as the filename could be non-ASCII, unlike
> > the extension.
> >
> > I will try to understand the code and run some tests after I install
> > a non-UTF-8 locale.
>
> You don't need a non-UTF-8 locale for the issue above, or for the issue
> that prompted me to try to look seriously at the issue, which is
> tests/formatting/list-of-tests non_ascii_test_epub. Having an accented
> letter in the document name makes it very hard to determine what should
> be encoded/decoded in init/epub3.pm and upstream code, in particular in
> Texinfo/Convert/Converter.pm determine_files_and_directory(), but
> although I thought previously that it could be solved in that function
> only, it is not so simple, strings come from everywhere in
> init/epub3.pm.
I'll look at it.
texinfoFvcBQSL1dh.texinfo
Description: TeXInfo document