Conversion from invalid to valid utf8 strings

From: Vasilij Schneidermann
Subject: Conversion from invalid to valid utf8 strings
Date: Sat, 9 May 2020 22:10:53 +0200


I'm currently writing a Git repository viewer and stumbled upon this wonderful
repository with challenging file names [1].  After writing some code to
correctly encode links and labels, I realized that encoding UTF-8 strings
incorrectly escapes bytes inside sequences that correspond to non-ASCII
characters.  I therefore added an `(import utf8)` to my code and ran into a
hang when processing the file name consisting of all valid path characters.
See the attachment for a minified example reproducing the hang.

Judging from ticket #1182, `string-length` hanging on an invalid UTF-8 string
is not considered an error, and one is expected to call the (seemingly
undocumented?) `valid-string?` procedure first.  An alternative way of dealing
with such strings is to replace every invalid byte sequence with a replacement
character like \uFFFD.  I've added such a helper procedure to the attachment
as well and would be happy to get feedback on how to improve it further and on
whether it is considered useful enough for inclusion in the utf8 egg.
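
To illustrate the replacement strategy described above, here is a minimal
sketch in plain Scheme.  It is not the helper from the attachment (which is
not reproduced here) and it does not use the utf8 egg; all procedure names
(`utf8-bytes->code-points`, `decode-one`, and so on) are hypothetical.  It
assumes the input is a list of byte values (integers 0-255) and returns a list
of code points, emitting U+FFFD once for every byte that cannot be decoded and
then resuming at the next byte.

```scheme
;; Sketch only: all names below are made up for illustration.
(define replacement #xFFFD)

;; Continuation bytes have the form 10xxxxxx.
(define (continuation? b)
  (and (>= b #x80) (<= b #xBF)))

;; Number of continuation bytes expected after lead byte b,
;; or #f if b cannot start a UTF-8 sequence.
(define (trailing-count b)
  (cond ((< b #x80) 0)
        ((and (>= b #xC2) (<= b #xDF)) 1)
        ((and (>= b #xE0) (<= b #xEF)) 2)
        ((and (>= b #xF0) (<= b #xF4)) 3)
        (else #f)))

;; Payload bits carried by the lead byte of a sequence with n trailing bytes.
(define (lead-payload b n)
  (case n
    ((0) b)
    ((1) (- b #xC0))
    ((2) (- b #xE0))
    (else (- b #xF0))))

;; Smallest code point that legitimately needs n continuation bytes;
;; anything smaller is an overlong encoding.
(define (min-code-point n)
  (case n
    ((0) 0)
    ((1) #x80)
    ((2) #x800)
    (else #x10000)))

;; Reject overlong forms, surrogates, and values above U+10FFFF.
(define (valid-code-point? cp n)
  (and (>= cp (min-code-point n))
       (<= cp #x10FFFF)
       (not (and (>= cp #xD800) (<= cp #xDFFF)))))

;; Decode the sequence at the head of bytes, which must be non-empty.
;; Returns the code point, or #f if the sequence is truncated or invalid.
(define (decode-one bytes n)
  (let loop ((rest (cdr bytes))
             (left n)
             (cp (lead-payload (car bytes) n)))
    (cond ((= left 0)
           (and (valid-code-point? cp n) cp))
          ((and (pair? rest) (continuation? (car rest)))
           (loop (cdr rest)
                 (- left 1)
                 (+ (* cp 64) (- (car rest) #x80))))
          (else #f))))

;; Walk the byte list; on failure emit U+FFFD and skip a single byte.
(define (utf8-bytes->code-points bytes)
  (let loop ((bytes bytes) (acc '()))
    (if (null? bytes)
        (reverse acc)
        (let* ((n (trailing-count (car bytes)))
               (cp (and n (decode-one bytes n))))
          (if cp
              (loop (list-tail bytes (+ n 1)) (cons cp acc))
              (loop (cdr bytes) (cons replacement acc)))))))

;; "a", U+00E9 encoded as C3 A9, a stray FF byte, "b"
(utf8-bytes->code-points '(#x61 #xC3 #xA9 #xFF #x62))
;; => (97 233 65533 98)
```

Replacing byte by byte keeps the loop simple and guarantees progress on any
input, at the cost of producing several replacement characters for one long
malformed sequence.  Collapsing each maximal invalid run into a single U+FFFD
is an equally reasonable policy.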


Attachment: test.scm

Attachment: signature.asc
Description: PGP signature
