bug-findutils

Re: Announcing the release of GNU findutils 4.7.0


From: James Youngman
Subject: Re: Announcing the release of GNU findutils 4.7.0
Date: Fri, 30 Aug 2019 11:42:51 +0100

On Fri, Aug 30, 2019 at 10:20 AM Jean Louis <bugs@gnu.support> wrote:
>
> Hello Bernhard,
>
> Thank you, just one question:
>
> * Bernhard Voelker <address@hidden> [2019-08-30 00:44]:
> > The updatedb script now operates in the C locale only.  This means
> > that character encoding issues are now not likely to cause sort to
> > fail.  It also honours the TMPDIR environment variable if that was
> > set, and no longer sorts file names case-insensitively.
>
> Does that still allow indexing Unicode file names?

In reality there is no such thing as a "Unicode file name", because
there exists no mechanism to record or specify the character encoding
of a path name (i.e. only the bytes are saved in the file system, not
the associated encoding(s)).  Nothing guarantees that the character
encoding in use at the time a path name is generated is the same as
the encoding in use by the user at the time the path name is later
used.   Indeed, the sub-directory names comprising a path name can
each be in a different, incompatible, encoding.
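
To illustrate the point above, here is a minimal Python sketch (the
names "café" and "naïve.txt" are made-up examples, not taken from any
real system).  It shows one byte sequence reading differently under
two encodings, and a path whose components were written under
different encodings failing to decode as UTF-8 at all.

```python
# The bytes a file system would store for a name typed by a UTF-8 user.
name_bytes = "café".encode("utf-8")  # b'caf\xc3\xa9'

# What two different readers see when they interpret those same bytes.
as_utf8 = name_bytes.decode("utf-8")
as_latin1 = name_bytes.decode("latin-1")

# A path whose directory was created under Latin-1 and whose file was
# created under UTF-8.  Both components are perfectly valid byte strings.
mixed = "café".encode("latin-1") + b"/" + "naïve.txt".encode("utf-8")

try:
    mixed.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # No single encoding describes the whole path.
    decodes_cleanly = False

print(repr(as_utf8), repr(as_latin1), decodes_cleanly)
```

Nothing in the stored bytes says which reading is the intended one.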

If this seems messy, then yes, well it is.   POSIX hasn't dealt with
this very well.   If there were a do-over, path names might be
specified as UTF-8 in all cases, but that's not how it actually
happened.

In practical terms, the content of the locate database generated by
updatedb treats path names as byte sequences, which is what in fact
they are.    It preserves all valid byte sequences (path names may not
contain the NUL character).   The locate utility prints these.
Whether or not that will result in an intelligible string displayed in
your UI depends on what character encoding(s) is/are used when the
path name is generated and when it is displayed.   If you have
consistent settings for all those things at both points, you will get
the behaviour you expect.   Otherwise, not.   But updatedb just
passes the bytes through; it doesn't change them.
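
As a rough sketch of what "passing the bytes through" means, the
Python below stores path names as raw bytes and reads them back
unchanged.  The store() and load() helpers and the NUL-separated
layout are hypothetical, chosen only for illustration; this is not the
actual locate database format.  Sorting bytes objects compares byte
values, which is assumed here to be comparable to what sort does in
the C locale.

```python
def store(paths):
    """Pack byte-string path names into one NUL-separated blob."""
    for p in paths:
        if b"\0" in p:
            raise ValueError("path names may not contain NUL")
    # sorted() on bytes compares raw byte values; no locale is involved.
    return b"".join(p + b"\0" for p in sorted(paths))


def load(blob):
    """Recover the byte-string path names from a blob made by store()."""
    return blob.split(b"\0")[:-1]


paths = [
    b"/tmp/caf\xe9",      # Latin-1 bytes
    b"/tmp/caf\xc3\xa9",  # UTF-8 bytes
    b"/tmp/Z",
    b"/tmp/a",
]
recovered = load(store(paths))
print(recovered)
```

No decoding step appears anywhere, so a name that is invalid in the
current locale's encoding is stored and returned like any other.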

You will see locale-dependent behaviour from locate, though, since
regular expressions need to understand what the characters mean (to
offer character classes for example, and even things like '.' which
needs to match a single character).
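
The effect can be imitated with Python's re module, used here only as
an analogy and not as a description of how locate implements its
matching.  A bytes pattern treats '.' as one byte, while a str pattern
treats it as one character, so the same name matches or fails
depending on how the bytes are interpreted.

```python
import re

name_bytes = "café".encode("utf-8")     # 5 bytes: the last letter takes 2
name_text = name_bytes.decode("utf-8")  # 4 characters

# Byte-oriented matching: '.' consumes exactly one byte.
byte_match = re.fullmatch(rb"caf.", name_bytes)
two_byte_match = re.fullmatch(rb"caf..", name_bytes)

# Character-oriented matching: '.' consumes the whole final character.
char_match = re.fullmatch(r"caf.", name_text)

print(byte_match is not None, two_byte_match is not None, char_match is not None)
```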

James.


