Re: [MIT-Scheme-devel] UTF-8 sequences
From: Taylor R Campbell
Subject: Re: [MIT-Scheme-devel] UTF-8 sequences
Date: Thu, 19 Feb 2015 14:19:43 +0000
User-agent: IMAIL/1.21; Edwin/3.116; MIT-Scheme/9.1.99
   Date: Thu, 19 Feb 2015 12:32:04 +0100
   From: <address@hidden>

   Is there a way to globally (or for a port) tell MIT/GNU Scheme to never
   slashify anything? Whatever I send in, I want out, in exactly the same
   bytes. No special handling of ISO-8859-1, UTF-8 or whatever.

DISPLAY and WRITE-STRING will do exactly that, on a binary port.
WRITE never will: WRITE will always escape `"' and `\', at the very
least, and usually many other octets that are not graphic, such as
control characters. In particular, it is designed to escape any octet
that does not represent a graphic character in ISO-8859-1.
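
To make the distinction concrete, here is a hedged illustration; DEMO
is just an illustrative name, and the string holds the two octets of
the UTF-8 encoding of eszett:

    (define (demo port)
      ;; PORT is assumed to be a binary output port.
      (let ((eszett-utf8 (list->string (map integer->char '(#xC3 #x9F)))))
        (write-string eszett-utf8 port) ; the two octets, verbatim
        (display eszett-utf8 port)      ; likewise
        (write eszett-utf8 port)))      ; double-quotes added, and any octet
                                        ; that is not a graphic ISO-8859-1
                                        ; character (#x9F here) escaped
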
I think that's a little silly -- it should be limited to US-ASCII, not
ISO-8859-1, by default. Currently the S-expression notation that MIT
Scheme uses is defined in terms of ISO-8859-1 sequences. If you
changed that to UTF-8 sequences, it would still work to limit the
octets written verbatim in strings to be the US-ASCII graphic ones.
But if you want a string containing the UTF-8 sequence for eszett to
be written as the UTF-8 sequence for double-quoted eszett, it's not as
simple as changing which octets are escaped and which are written
verbatim when a string is unparsed: what are called
`strings' in MIT Scheme are more accurately `octet vectors', and do
not necessarily contain only valid UTF-8 sequences. (Operations on
`utf8-strings' are operations on strings which are expected to contain
only valid UTF-8 sequences.)
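
For example, nothing stops a string from holding octets that are not
valid UTF-8 at all (NOT-UTF8 is just an illustrative name):

    ;; A lone continuation octet followed by a lead octet with no
    ;; continuation: a legal MIT Scheme string, but not a valid UTF-8
    ;; sequence, so a UTF-8-aware unparser would have to escape both
    ;; octets.
    (define not-utf8
      (list->string (map integer->char '(#x9F #xC3))))
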
I wouldn't object to changing the S-expression notation so that it is
defined in terms of UTF-8 sequences, although maybe it should be made
a configurable option to avoid breaking any existing ISO-8859-1
S-expressions. We already have a few such configurable options, such
as the keyword style, in the parser and unparser. You might
(a) add a new parser file attribute, coding;
(b) change the parser to do (port/set-coding port <coding>)[*];
(c) change HANDLER:STRING to do (port/set-coding port* <coding>);
(d) add a new unparser variable *UNPARSER-CODING*; and
(e) add logic to the string unparser to write verbatim all longest
substrings of the string that are valid octet sequences in the current
coding system (and don't contain `"', `\', or control characters), and
escape all other octets (a rough sketch follows below).
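
Here is that rough sketch of (e), assuming UTF-8 as the coding system.
None of these procedures exist in MIT Scheme today; the names, and the
escape syntax used by WRITE-ESCAPED-OCTET, are only placeholders:

    (define (char-needs-escape? char)
      ;; Octets that must always be escaped inside a string literal.
      (let ((code (char->integer char)))
        (or (char=? char #\") (char=? char #\\)
            (< code #x20) (= code #x7F))))

    (define (utf8-sequence-length string index end)
      ;; Length of the valid UTF-8 sequence starting at INDEX, or 0 if
      ;; the octets there are not a valid sequence.  (Overlong 3- and
      ;; 4-octet encodings are not rejected; this is only a sketch.)
      (define (continuation? i)
        (and (< i end)
             (let ((b (char->integer (string-ref string i))))
               (= (bitwise-and b #xC0) #x80))))
      (let ((b0 (char->integer (string-ref string index))))
        (cond ((< b0 #x80) 1)
              ((and (>= b0 #xC2) (< b0 #xE0)
                    (continuation? (+ index 1))) 2)
              ((and (>= b0 #xE0) (< b0 #xF0)
                    (continuation? (+ index 1))
                    (continuation? (+ index 2))) 3)
              ((and (>= b0 #xF0) (< b0 #xF5)
                    (continuation? (+ index 1))
                    (continuation? (+ index 2))
                    (continuation? (+ index 3))) 4)
              (else 0))))

    (define (write-escaped-octet octet port)
      ;; Placeholder escape syntax; the real syntax is a separate question.
      (write-string "\\x" port)
      (write-string (number->string octet 16) port)
      (write-string ";" port))

    (define (write-string-contents string port)
      ;; Write the body of a string literal: valid UTF-8 sequences go out
      ;; verbatim (adjacent valid sequences thus form the longest verbatim
      ;; runs), everything else goes through the escape path.
      (let ((end (string-length string)))
        (let loop ((i 0))
          (if (< i end)
              (let ((char (string-ref string i)))
                (if (char-needs-escape? char)
                    (begin
                      (write-escaped-octet (char->integer char) port)
                      (loop (+ i 1)))
                    (let ((n (utf8-sequence-length string i end)))
                      (if (> n 0)
                          (begin
                            (write-string (substring string i (+ i n)) port)
                            (loop (+ i n)))
                          (begin
                            (write-escaped-octet (char->integer char) port)
                            (loop (+ i 1)))))))))))
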
Similar considerations would have to apply to character literals and
symbols. If you want to limit the allowed coding systems for the
parser and unparser to be US-ASCII, ISO-8859-1, and UTF-8, that's OK
too -- I don't think anyone actually cares about writing Scheme code
in UTF-32BE.
I know this isn't easy, and I know it's frustrating for anyone who
wants to work with languages other than English. But anything less
than this is going to cause even more problems for everybody.
[*] As an aside, our scheme for binary I/O and coding systems is not
very sensible. There should really be one concept of binary I/O
sources/sinks, and a separate concept of decoding/encoding text in
particular coding systems. But for now, maybe we should have an
operation PORT/WITH-CODING that dynamically binds the coding system,
and the parser should use that instead of modifying the port it is
given. If you pass the parser a port in a non-binary coding system,
you shouldn't expect anything good to come of it.
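
Roughly what I have in mind, as a minimal sketch (assuming PORT/CODING
returns the port's current coding system; PORT/WITH-CODING itself does
not exist yet):

    (define (port/with-coding port coding thunk)
      ;; Dynamically bind PORT's coding system to CODING around THUNK.
      (let ((outer #f))
        (dynamic-wind
         (lambda ()
           (set! outer (port/coding port))
           (port/set-coding port coding))
         thunk
         (lambda ()
           (port/set-coding port outer)))))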