Re: Issue 2159 in lilypond: Patch: lexer.ll: Warn about non-UTF-8 charac
From: David Kastrup
Subject: Re: Issue 2159 in lilypond: Patch: lexer.ll: Warn about non-UTF-8 characters
Date: Mon, 02 Jan 2012 10:32:31 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux)

Hans Aberg <address@hidden> writes:
> On 1 Jan 2012, at 21:06, David Kastrup wrote:
>
>>>> Updates:
>>>> Labels: Patch-new
>>>>
>>>> Comment #2 on issue 2159 by address@hidden: Patch: lexer.ll: Warn
>>>> about non-UTF-8 characters
>>>> http://code.google.com/p/lilypond/issues/detail?id=2159#c2
>>>>
>>>> lexer.ll: Warn about non-UTF-8 characters
>>>>
>>>> Making the warnings point to the exact bad byte rather than the
>>>> enclosing construct would be nice.
>>>
>>> One way to implement this might be to use the Haskell program for Flex
>>> like UTF-8 regular expressions I made:
>>> http://xcybercloud.blogspot.com/2009/04/unicode-support-in-flex.html
>>>
>>> First make rules for the Unicode characters you want admit, followed
>>> by a '.' rule which picks up single excluded bytes.
>>
>> The "unicode characters we want admit" are not single characters, but
>> part of things like identifiers, strings and other stuff. Cf.
>> <URL:http://codereview.appspot.com/5505090#msg5>
>> for a reasoning about the current approach for this patch.
>
> I translate Unicode character classes into Flex UTF-8 regular
> expressions, so you can apply the other Flex regex operators to get
> that stuff.

What makes you think I did not get that? Did you actually _read_ the
reasoning I linked to above? You don't get a single error path in that
case, and doing a catch-all with '.' requires _backing_ _up_ in the lexer
for every non-UTF-8 byte sequence that does not already start with an
invalid byte.

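To illustrate the backing-up problem, here is a minimal Python sketch (not flex, and not the LilyPond lexer) that simulates a scanner with only two rules: one well-formed UTF-8 character, and '.' for a single byte. The names `lead_length` and `scan_with_catchall` are hypothetical. The model is simplified: it only checks that continuation bytes lie in 0x80..0xBF and ignores the tighter second-byte ranges that exclude overlong forms and surrogates.

```python
def lead_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, C0/C1, or F5..FF


def scan_with_catchall(data: bytes, pos: int) -> tuple[int, int]:
    """Return (match_length, bytes_backed_up) for the two-rule scanner.

    The scanner keeps reading while the input could still become a
    well-formed character. If the sequence breaks, only the '.' rule
    has matched (after the first byte), so every continuation byte
    read since then has to be given back.
    """
    need = lead_length(data[pos])
    consumed = 1  # the first byte is always read
    while consumed < need:
        nxt = pos + consumed
        if nxt >= len(data) or not 0x80 <= data[nxt] <= 0xBF:
            return 1, consumed - 1  # fall back to '.', rewind the rest
        consumed += 1
    return max(need, 1), 0  # full character, or a lone invalid byte via '.'


if __name__ == "__main__":
    # Truncated three-byte sequence followed by ASCII 'A'.
    print(scan_with_catchall(b"\xe2\x82A", 0))
    # Well-formed three-byte sequence (U+20AC).
    print(scan_with_catchall(b"\xe2\x82\xac", 0))
```

In this model the truncated sequence matches only one byte and rewinds over the continuation byte it had already read, which is the situation a scanner without backup states has to avoid.
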
We use uncompressed tables in the lexer and make it a point to have _no_
expressions backing up. So you need to provide expressions matching any
_bad_ UTF-8 sequence even if its first bytes are identical to those of a
good UTF-8 sequence.

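The requirement above can be sketched in Python as a forward-only classifier: every prefix of a sequence that goes wrong is itself reported as a "bad" token, so nothing already read is ever given back. This is an illustration only, not the patch under discussion. The names `expected_length` and `next_token` are hypothetical, and the model again ignores the tighter second-byte ranges for overlong forms and surrogates.

```python
def expected_length(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, C0/C1, or F5..FF


def next_token(data: bytes, pos: int) -> tuple[str, int]:
    """Return ("good" or "bad", length), reading strictly forward from pos.

    A bad token covers the lead byte plus every valid continuation byte
    read before the sequence broke. The offending byte is only peeked
    at, never consumed, so the next token starts right there.
    """
    need = expected_length(data[pos])
    if need == 0:
        return "bad", 1  # a byte that cannot start any sequence
    length = 1
    while length < need:
        nxt = pos + length
        if nxt >= len(data) or not 0x80 <= data[nxt] <= 0xBF:
            return "bad", length  # bad sequence sharing a good prefix
        length += 1
    return "good", need


def tokenize(data: bytes) -> list[tuple[str, bytes]]:
    """Split data into good and bad tokens without ever rewinding."""
    out, pos = [], 0
    while pos < len(data):
        kind, length = next_token(data, pos)
        out.append((kind, data[pos:pos + length]))
        pos += length
    return out


if __name__ == "__main__":
    for kind, chunk in tokenize(b"a\xe2\x82b\xe2\x82\xac\xff"):
        print(kind, chunk)
```

The design choice is that the truncated `\xe2\x82` becomes one two-byte bad token instead of falling back to a one-byte match, which is what lets the scan proceed without rewinding.
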
Please try to understand this problem before suggesting a non-fitting
solution again. I have spent days doing analysis and trying
alternative approaches, and it is somewhat aggravating if somebody just
goes on assuming that I don't know what I am talking about and that
showing me a simplistic solution often enough will make me realize my
stupidity.

Please run

    lex -b

on a flex file of yours that checks for UTF-8 identifiers and check
whether you get any backup states in the resulting lex.backup file. I
should be quite surprised if you didn't.

--
David Kastrup