rdiff-backup-users


From: Eric L.
Subject: [rdiff-backup-users] Python 3 migration: considering non-UTF-8 conform filenames
Date: Sat, 3 Aug 2019 12:49:32 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0

Hi,

While I worked on migrating to Python 3, one of the trickiest aspects
was the change of the string types from str/unicode to bytes/str.

Without going into the technical details (Python-savvy readers will know
what I mean), this means among other things that the encoding of file
names becomes relevant and must be UTF-8. Files whose names aren't valid
UTF-8 are no longer backed up.

The warnings look something like:

Sat Aug  3 10:51:51 2019  Warning: unable to read ACL from 'very
complicated filename': 'utf-8' codec can't encode character '\udcb1' in
position 54: surrogates not allowed
Sat Aug  3 10:51:51 2019  Warning: ignoring file 'very complicated
filename' with wrong encoding: 'utf-8' codec can't encode character
'\udcb1' in position 54: surrogates not allowed
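
For the curious, the error in these warnings can be reproduced with a
minimal sketch along the following lines. The file name is made up for
illustration and this is not the rdiff-backup code itself; it assumes
the name was decoded with the "surrogateescape" error handler, which is
where a character like '\udcb1' would come from.

```python
# Hypothetical file name containing the byte 0xb1, which is not valid
# UTF-8 at this position.
raw_name = b"caf\xb1.txt"

# Decoding with "surrogateescape" maps the undecodable byte 0xb1 to the
# lone surrogate U+DCB1 instead of failing.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# A strict UTF-8 encode then refuses the lone surrogate.
try:
    decoded.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding with the same error handler restores the original bytes.
restored = decoded.encode("utf-8", errors="surrogateescape")

print(repr(decoded))
print(error_message)
print(restored == raw_name)
```

The strict encode is the step that fails with "surrogates not allowed";
the round trip through "surrogateescape" gives back the original bytes.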

I don't see many options, because the regexes are str patterns, and
those can only be matched against str (i.e. encoding-aware) values, not
against bytes (filenames could still be read as bytes).
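
As a small illustration of that constraint (a sketch only; the pattern
and the file name are hypothetical and not taken from rdiff-backup), a
pattern compiled from str rejects a bytes file name, while it accepts
the same name once decoded to str:

```python
import re

# Hypothetical str pattern, standing in for a file selection regex.
pattern = re.compile(r".*\.txt$")

# Hypothetical file name as raw bytes, not valid UTF-8.
raw_name = b"caf\xb1.txt"

# A str pattern cannot be applied to bytes: Python raises TypeError.
try:
    pattern.match(raw_name)
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# Once decoded to str (here with "surrogateescape", an assumption), the
# same pattern can be applied.
decoded = raw_name.decode("utf-8", errors="surrogateescape")
str_matched = pattern.match(decoded) is not None

print(bytes_rejected, str_matched)
```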

A few consequences:

1. Such files can't be backed up anymore.
2. Old backup repos which contain such files are seen as broken. As
long as such files appear only in increments and not in the latest
version, the repo remains usable though.

That said, file systems which aren't UTF-8 compatible have been uncommon
for many years, so the impact should be very limited (in my case, old
Windows files lying around since 2010).

I'm mostly concerned about Asia, because I've heard (but have no
experience whatsoever) that other rich encodings than Unicode might be
in use there. The original code was IMHO already not very clean in this
regard, and the migration to UTF-8 hasn't improved things: strings are
encoded/decoded sometimes with an explicit UTF-8 encoding, sometimes
without naming any encoding.

If the users on this list could comment on their experience and
expectations, it would be great. Testing my PR [1] against old backup
repos would be even better.

Don't expect miracles though: currently I don't see any viable
alternative to the decision I've taken. I mostly wanted to make sure
it's taken transparently.

Thanks, Eric

[1] https://github.com/sol1/rdiff-backup/pull/40


