Re: [Monotone-devel] rfc: small simplification to paths.cc/constants.cc
From: Nathaniel Smith
Subject: Re: [Monotone-devel] rfc: small simplification to paths.cc/constants.cc
Date: Sun, 16 Jul 2006 22:52:46 -0700
User-agent: Mutt/1.5.11+cvs20060403
On Sun, Jul 16, 2006 at 01:49:14PM -0700, Zack Weinberg wrote:
> On 7/14/06, Nathaniel Smith <address@hidden> wrote:
> >> +// ??? Ensure use of UTF8 encoding internally, validate encoding here.
> >
> >^^ Hmm?
>
> I have gotten lost in the conversions and the wrappers, and cannot
> tell what encoding (if any) can be relied upon at this point in the
> code. The exclusion of characters 00-1f and 7f, but none in the 80-ff
> range, makes me think it's supposed to be utf8 (it's clearly not a
> fixed-width 16- or 32-bit encoding; if it were any single-byte 8859.n
> encoding, we should also exclude 80-9f; any other variable-width
> encoding that I know of requires rather more smarts to find bad
> characters in...)
file_paths are always utf8 internally.
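
Given that guarantee, a byte-level scan for 00-1f and 7f is sound: in UTF-8 every byte of a multi-byte sequence is 80 or above, so a byte below 80 can only be the ASCII character itself. A rough sketch of such a scan follows; the function name is hypothetical and this is not the actual code in paths.cc.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only (not monotone's code).
// Returns true if the string contains any byte in 00-1f or the byte 7f.
// Assumes the input is UTF-8, where bytes below 0x80 never appear
// inside a multi-byte sequence.
bool has_control_byte(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c <= 0x1f || c == 0x7f)
            return true;
    }
    return false;
}
```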
> But if it _is_ guaranteed to be utf8 at this point, there are a number
> of invalid byte sequences that we ought to be weeding out: notably ED
> A0 xx .. ED BF xx and overlength encodings like E0 9F 80; unless we
> have a guarantee from elsewhere that we're not going to get them. I
> have code (from libcpp) that I can adapt to do this.
See utf8_validate, and the call to it at the top of the file_path
constructor. (utf8_validate is itself stolen from glib.)
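
For readers following along, here is a minimal sketch of what rejecting those sequences involves. It is not the code of utf8_validate, glib, or libcpp; the name is_strict_utf8 is made up for this example, and the checks shown (overlength forms, the surrogate range ED A0 xx .. ED BF xx, values above U+10FFFF, truncated sequences) are an assumption about what a strict validator would cover.

```cpp
#include <cstddef>
#include <string>

// Illustrative strict UTF-8 check (hypothetical; not monotone's utf8_validate).
bool is_strict_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;       // total bytes in this sequence
        unsigned long cp;      // code point decoded so far
        unsigned long min_cp;  // smallest code point that needs `len` bytes
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence runs past end of string

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min_cp) return false;                    // overlength, e.g. E0 9F 80
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogates, ED A0 xx .. ED BF xx
        if (cp > 0x10FFFF) return false;                  // beyond the Unicode range
        i += len;
    }
    return true;
}
```

Decoding each sequence and comparing against the minimum value for its length catches every overlength form with one test, rather than special-casing individual lead bytes.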
-- Nathaniel
--
Sentience can be such a burden.