Conversion from invalid to valid utf8 strings
From: Vasilij Schneidermann
Subject: Conversion from invalid to valid utf8 strings
Date: Sat, 9 May 2020 22:10:53 +0200
Hello,

I'm currently writing a Git repository viewer and stumbled upon this wonderful
repository with challenging file names [1]. After writing some code to
correctly encode links and labels, I realized that encoding UTF-8 strings
incorrectly escapes bytes inside the multi-byte sequences that represent
non-ASCII characters. I therefore added `(import utf8)` to my code and ran into
a hang when processing the file name that consists of all valid path
characters. See the attachment for a minimal example that reproduces the hang.
Judging from ticket #1182, `string-length` hanging on an invalid UTF-8 string
is not considered an error, and one is expected to check the string with the
(seemingly undocumented?) `valid-string?` procedure first. An alternative way
of dealing with such strings is to replace every invalid byte sequence with a
replacement character such as \uFFFD. I've added such a helper procedure to the
attachment as well and would be happy to get feedback on how to improve it
further and on whether it's considered useful enough for inclusion in the utf8
egg.
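
To make the two strategies above concrete without relying on the attachment, here is a rough Python sketch of the idea. It is not the helper from test.scm and not how the utf8 egg works internally. The names `is_valid_utf8` and `sanitize_utf8` are made up for illustration, and the sketch assumes the file name is available as raw bytes. The first function plays the role of a validity check done up front, the second replaces each invalid byte sequence with U+FFFD.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 without errors (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize_utf8(raw: bytes) -> str:
    """Decode raw, substituting U+FFFD for each invalid byte sequence."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            # Everything from pos onwards is valid: keep it and stop.
            out.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice raw[pos:].
            out.append(raw[pos:pos + err.start].decode("utf-8"))
            out.append(REPLACEMENT)
            # err.end is always greater than err.start, so the loop advances.
            pos += err.end
    return "".join(out)
```

For example, `sanitize_utf8(b"caf\xe9.txt")` keeps `caf` and `.txt` and puts a single U+FFFD where the stray Latin-1 byte was. In Python the same result is available directly via `raw.decode("utf-8", errors="replace")`; the explicit loop only spells out where each replacement happens.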
[1]: https://bitbucket.org/emg/tidbits/src/master/evilfiles/
Attachment: test.scm
Attachment: signature.asc (PGP signature)