monotone-devel

Re: [Monotone-devel] Non-ascii filename under win32


From: Jesper Ribbe
Subject: Re: [Monotone-devel] Non-ascii filename under win32
Date: Wed, 27 Sep 2006 20:05:52 +0200
User-agent: Thunderbird 1.5.0.7 (Windows/20060909)

Nathaniel,

sorry for sending this directly to you, I meant to send it to the list.

Nathaniel Smith wrote:
> On Mon, Sep 18, 2006 at 09:49:47PM +0200, Jesper Ribbe wrote:
>> Hello,
>>
>> I'm a new user to Monotone and am currently evaluating it under win32.
>> I've tried to add a filename which contain the swedish character "Å", but fail to get this to work.
>>
>> The error message I get is:
>> mtn: fatal: std::logic_error: paths.cc:255: invariant 'I(utf8_validate(path))' violated
>
> Right -- the theory is that monotone uses your local filesystem
> charset when talking to your local filesystem, uses utf8 internally
> (so as to have a canonical format that everyone can use, even if they
> have different local charsets), and converts as necessary.
>
> I assume your local character set is something non-unicode, like an
> ISO-8859 variant or similar?  (If you had included the dump file
> monotone makes when it crashes like that, it includes some information
> on what locale settings monotone thinks you are using.)
>
> The problem is probably that it isn't converting something when it
> should.  There are some known bugs in this stuff, and no-one's gotten
> around to doing a systematic audit/fixup yet.  If you're curious,
> some are marked "BUG" in the source; and the main code involved is
> paths.hh/.cc.  We're pretty good about this stuff when it comes time
> to look for known files, or write files out; the bug you're most
> likely running into is, when we go out and ask the filesystem what
> files exist (the way "mtn add <directory>" does, for instance), we
> don't convert from filesystem charset->utf8.
>
> If that's the issue, a possible workaround is to pass the names of the
> offending files you want to add explicitly on the command line,
> instead of letting monotone find them by searching the filesystem --
> command line arguments are converted correctly.



Thanks for the explanation. I verified that this is indeed the case: setting CHARSET=CP1252 makes "mtn add åäö.txt" work. I've also looked a little at the code, and I think I've located the place where the missing conversion belongs.
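
To illustrate the failure, here is a minimal Python sketch, assuming the filesystem reports the name as CP1252 bytes. The helpers is_valid_utf8 and fs_to_utf8 are hypothetical names, not monotone functions; they only mimic what the utf8_validate invariant checks and what the missing filesystem charset->utf8 step would do.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def fs_to_utf8(raw: bytes, fs_charset: str = "cp1252") -> bytes:
    """Re-encode a filename from the filesystem charset to UTF-8.

    The default charset is an assumption matching CHARSET=CP1252.
    """
    return raw.decode(fs_charset).encode("utf-8")


# "åäö.txt" as a CP1252 filesystem would hand it back.
name_on_disk = "\u00e5\u00e4\u00f6.txt".encode("cp1252")

# The raw bytes are not valid UTF-8, so a UTF-8 invariant would trip.
print(is_valid_utf8(name_on_disk))

# After the conversion step the same name passes the check.
print(is_valid_utf8(fs_to_utf8(name_on_disk)))
```

The first check fails because 0xE5 starts a three-byte UTF-8 sequence but is followed by 0xE4, which is not a continuation byte.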

However, there is one complicating issue with win32:
Two character sets are in effect at the same time. Text output to the DOS shell is usually encoded in CP437 (or some other old DOS codepage), while filenames are usually encoded in Windows-1252 (for Western Europe; as with ISO-8859, there are several variants).
This makes the model of a single local character set somewhat problematic.
An even worse restriction, in my opinion, is that every filename that cannot be represented in the current 1-byte encoding is inaccessible from monotone, even though modern Windows stores filenames internally as UTF-16.
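
A small Python sketch of both points, using Python's built-in cp437 and cp1252 codecs. The helper encodable and the Cyrillic sample filename are only illustrative assumptions.

```python
def encodable(name: str, charset: str) -> bool:
    """Return True if name can be represented in charset (illustrative helper)."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


a_ring = "\u00c5"  # the character "Å"

# The same character has a different byte value in each codepage.
console_byte = a_ring.encode("cp437")    # DOS console codepage
filename_byte = a_ring.encode("cp1252")  # Windows "ANSI" codepage
print(console_byte, filename_byte)

# A filename byte printed unconverted to a CP437 console shows up as a
# different character (here U+253C, a box-drawing cross).
misread = filename_byte.decode("cp437")
print(ascii(misread))

# A name outside the 1-byte codepage cannot be expressed in it at all,
# although UTF-16 represents it without trouble.
cyrillic_name = "\u0444\u0430\u0439\u043b.txt"
print(encodable(cyrillic_name, "cp1252"))
print(encodable(cyrillic_name, "utf-16-le"))
```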

Since Windows has a parallel Unicode-aware API for both filesystem access and console I/O, I've looked into the possibility of changing the source code to take advantage of it. That would solve the above problems and also remove the need to specify a CHARSET. However, the Boost library, for instance, seems to use the ANSI versions, which makes this a quite intrusive change. If I look into this, would it be of interest, or is the change deemed too large for too little gain?
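
As a rough sketch of why this would remove the need for CHARSET: if names arrive as UTF-16 from the wide-character API, the conversion to monotone's internal utf8 is fixed and lossless. The Python below only models the byte-level conversion; wide_to_internal and internal_to_wide are hypothetical names, and no actual Windows call is made.

```python
def wide_to_internal(raw_utf16le: bytes) -> bytes:
    """Convert a UTF-16LE name (as a wide API would return) to UTF-8."""
    return raw_utf16le.decode("utf-16-le").encode("utf-8")


def internal_to_wide(utf8_name: bytes) -> bytes:
    """Convert an internal UTF-8 name back to UTF-16LE for a wide API."""
    return utf8_name.decode("utf-8").encode("utf-16-le")


# Sample names: Latin-1 range, Cyrillic, and one outside the BMP
# (which UTF-16 stores as a surrogate pair).
names = ["\u00c5.txt", "\u0444\u0430\u0439\u043b.txt", "\U0001d11e.txt"]

round_trips = []
for name in names:
    wide = name.encode("utf-16-le")
    internal = wide_to_internal(wide)
    # No codepage is involved, so the round trip loses nothing.
    round_trips.append(internal_to_wide(internal) == wide)

print(round_trips)
```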

/Jesper





