Re: Non-ASCII characters in @include search path
From: Patrice Dumas
Subject: Re: Non-ASCII characters in @include search path
Date: Sun, 20 Feb 2022 13:10:16 +0100
On Sun, Feb 20, 2022 at 11:54:08AM +0000, Gavin Smith wrote:
> I found it was the last argument to File::Spec->catdir that led to the
> utf8 flag being on: $filename. This came from the argument to
> locate_include_file, which came from the Texinfo source file. The following
> also fixes it:
I do not think that the fact that it is utf8 is important. I believe
that is an internal design choice in Perl; what matters is that it is
in the internal Perl unicode encoding.
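To illustrate the point, here is a minimal sketch (the variable names are only for illustration, they are not from Texinfo). It shows that the utf8 flag describes how Perl stores a string internally, and that two strings holding the same characters compare equal whether the flag is on or off:

```perl
use strict;
use warnings;

# "caf" followed by U+00E9; all code points are below 256, so Perl
# stores the string in its native 8-bit form (utf8 flag off).
my $native = "caf\x{e9}";

# Same characters, but force the internal UTF-8 storage (utf8 flag on).
my $upgraded = $native;
utf8::upgrade($upgraded);

print utf8::is_utf8($native)   ? "on" : "off", "\n";   # off
print utf8::is_utf8($upgraded) ? "on" : "off", "\n";   # on

# The flag does not change the string as seen from Perl code.
print $native eq $upgraded ? "equal" : "different", "\n";   # equal
print length($upgraded), "\n";                               # 4
```

So the flag alone says nothing about whether the content is meant as characters or as bytes; that has to be tracked by the code.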
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..36be8c5b59 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
> my $text = shift;
> my $file;
>
> + utf8::downgrade($text);
> +
> my $ignore_include_directories = 0;
>
> my ($volume, $directories, $filename) = File::Spec->splitpath($text);
>
>
> This may be surprising as the non-ASCII characters were not in $text itself:
> $text was just "include.texi". The non-ASCII characters in the include path
> got to this function without the utf8 flag going on.
Again, I do not think that we should rely on the specific encoding of a
string. We should only track whether it is an internal Perl unicode string
or bytes.
> Strings coming from the Texinfo source file have to be assumed to represent
> characters, not bytes, as the Texinfo source is read with a certain encoding.
> File names, however, are a sequence of bytes (on GNU/Linux at least; on
> MS-Windows it may be different). I believe it's this conflict
> that is responsible.
I agree, that's also my interpretation. It is the same on MS-Windows.
> I propose the following fix, which doesn't touch Perl's internal string
> representation directly:
>
> diff --git a/tp/Texinfo/Common.pm b/tp/Texinfo/Common.pm
> index 29dbf3c8c3..7babba016c 100644
> --- a/tp/Texinfo/Common.pm
> +++ b/tp/Texinfo/Common.pm
> @@ -1507,6 +1507,8 @@ sub locate_include_file($$)
> my $text = shift;
> my $file;
>
> + utf8::encode($text);
> +
> my $ignore_include_directories = 0;
>
> my ($volume, $directories, $filename) = File::Spec->splitpath($text);
>
> This means that any non-ASCII characters in a filename in a Texinfo source
> file are sought in the filesystem as the corresponding UTF-8 sequences.
I think that the correct way to do that is to use
Encode::encode('utf-8', $text);
Also I think that it should be done as late as possible, so it would be
better on $possible_file.
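A rough sketch of what I mean by encoding late. The helper name path_bytes_for is hypothetical, and the sketch assumes that both the directory and the file name are already character strings (if the directory were raw bytes, encoding the joined path would encode those bytes a second time):

```perl
use strict;
use warnings;
use Encode ();
use File::Spec;

# Hypothetical helper, not part of Texinfo::Common.
# Both arguments are assumed to be character strings.
sub path_bytes_for {
    my ($dir, $name) = @_;

    # Build the candidate path while still working with characters.
    my $possible_file = File::Spec->catfile($dir, $name);

    # Encode only at the last moment, just before the string would be
    # handed to the file system (for example to -e or open).
    return Encode::encode('UTF-8', $possible_file);
}

my $bytes = path_bytes_for("inclus\x{e9}", "include.texi");

# The result is a byte string: U+00E9 has become the two bytes C3 A9.
print utf8::is_utf8($bytes) ? "characters" : "bytes", "\n";   # bytes
print $bytes =~ /\xc3\xa9/ ? "encoded" : "not encoded", "\n"; # encoded
```

Note that Encode::encode returns the encoded string, whereas utf8::encode modifies its argument in place.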
> A more thorough fix would obey @documentencoding and convert back to the
> original encoding, to retrieve the bytes that were present in the source
> file in case the file was not in UTF-8. I think it would be the most
> correct to always use the exact bytes that were in the source file as the
> name of the file (I assume that is what TeX would do).
I do not think so, at least not on Linux, as on Linux file names are
always encoded as UTF-8. So encoding in UTF-8 seems to always be
better. It also matches the XS parser, which converts to UTF-8.
This may be incorrect on other platforms, such as Windows or Mac,
however.
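To make the difference between the two options concrete, a small sketch (the file name is made up, and ISO-8859-1 only stands in for a possible @documentencoding value). The same file name gives different bytes depending on the encoding used, so the two approaches would look for different names in the file system:

```perl
use strict;
use warnings;
use Encode ();

# A file name as a character string, as read from the Texinfo source.
my $name = "r\x{e9}sum\x{e9}.texi";

# Option 1: always encode to UTF-8.
my $as_utf8 = Encode::encode('UTF-8', $name);

# Option 2: encode back to the document encoding (assumed here to be
# ISO-8859-1), which recovers the bytes present in the source file.
my $as_latin1 = Encode::encode('ISO-8859-1', $name);

print length($as_utf8), "\n";     # 13: each U+00E9 takes two bytes
print length($as_latin1), "\n";   # 11: each U+00E9 takes one byte
print $as_utf8 eq $as_latin1 ? "same" : "different", "\n";   # different
```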
--
Pat