Re: Conversion from invalid to valid utf8 strings

Without debugging, I suspect the problem is in utf8-start-byte->length, which reports an encoded length of 0 for the bytes #xFE and #xFF. A length of 0 means a scan never advances past such a byte, which would explain the hang. You can just replace those 0s with 1s and it will "skip over" these invalid bytes.
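
To illustrate the suspected failure mode, here is a minimal sketch in Python rather than the egg's Scheme. The names start_byte_length and count_chars are hypothetical, and the lengths for bytes other than #xFE and #xFF are assumed to follow the usual leading-bit pattern; they are not copied from the egg. A step limit stands in for the hang so the example terminates.

```python
def start_byte_length(b, invalid_length):
    """Length implied by a start byte (hypothetical model, not the egg's table)."""
    if b < 0xC0:
        return 1  # ASCII, and stray continuation bytes counted as 1
    if b < 0xE0:
        return 2
    if b < 0xF0:
        return 3
    if b < 0xF8:
        return 4
    if b < 0xFC:
        return 5
    if b < 0xFE:
        return 6
    return invalid_length  # 0xFE and 0xFF never start a sequence


def count_chars(data, invalid_length, max_steps=1000):
    """Count characters by hopping from start byte to start byte."""
    i = 0
    count = 0
    steps = 0
    while i < len(data):
        steps += 1
        if steps > max_steps:
            # With a length of 0 the index never moves; a real loop would hang.
            raise RuntimeError("no progress: scan would loop forever")
        i += start_byte_length(data[i], invalid_length)
        count += 1
    return count
```

With invalid_length set to 1, count_chars(b"a\xffb", 1) steps over the bad byte and finishes. With 0, the same call stops making progress at the #xFF byte and hits the step limit.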

Skipping rather than replacing is by design, for simplicity and convenience. If you want to provide the ability to replace these bytes with \uFFFD, it should probably be a separate utility.
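
As a rough idea of what such a separate utility does, here is a sketch in Python, not CHICKEN Scheme and not the helper from the attachment. The name sanitize_utf8 is hypothetical. It relies on Python's built-in decoder, whose "replace" error handler substitutes U+FFFD for malformed input; a Scheme version would have to do the scanning itself.

```python
def sanitize_utf8(data: bytes) -> bytes:
    """Return valid UTF-8, with malformed input replaced by U+FFFD."""
    # errors="replace" makes the decoder emit U+FFFD instead of raising.
    text = data.decode("utf-8", errors="replace")
    # Re-encoding a str always yields well-formed UTF-8.
    return text.encode("utf-8")
```

Valid input passes through unchanged, and the result can always be decoded strictly afterwards, so later string operations no longer see invalid bytes.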

Alex

On Sun, May 10, 2020 at 5:11 AM Vasilij Schneidermann <address@hidden> wrote:

Hello,

I'm currently writing a Git repository viewer and stumbled upon this wonderful
repository with challenging file names [1]. After writing some code to
correctly encode links and labels I've realized that encoding UTF-8 strings
will incorrectly escape bytes inside sequences that correspond to non-ASCII
characters. Therefore I've added an `(import utf8)` to my code and ran into a
hang when processing the filename consisting of all valid path characters. See
the attachment for a minified example reproducing the hang.

Judging from ticket #1182, `string-length` hanging on an invalid UTF-8 string is
not considered an error, and one is expected to call the (seemingly undocumented?)
`valid-string?` procedure first. An alternative way of dealing with such
strings is encoding every invalid byte sequence using a replacement character
like \uFFFD. I've added such a helper procedure to the attachment as well and
would be happy about feedback to further improve it and whether it's considered
useful enough for inclusion into the utf8 egg.

[1]: https://bitbucket.org/emg/tidbits/src/master/evilfiles/

From:	Alex Shinn
Subject:	Re: Conversion from invalid to valid utf8 strings
Date:	Sun, 10 May 2020 10:52:44 +0900