Bug reported regarding Unicode handling in email address

nmh-workers

From:	Ken Hornstein
Subject:	Bug reported regarding Unicode handling in email address
Date:	Wed, 02 Jun 2021 00:13:51 -0400

So this bug was reported yesterday:

        https://savannah.nongnu.org/bugs/?60713

And I kind of thought we got this mostly right!  So I dug into it a bit.

It turns out the problem is WAY down in the address parser.  Specifically,
it is here, in sbr/mf.c:my_lex():

        if (iscntrl ((unsigned char) c) || isspace ((unsigned char) c))
            break;

This LOOKS ok.  But ... if you look at the test message, it contains the
character 'с', which is U+0441 "Cyrillic Small Letter ES".  And the UTF-8
encoding of that is 0xd1 0x81.  So we end up calling iscntrl() on
0xd1 (which is false) AND then we end up calling iscntrl() on 0x81 ...
which returns true (because U+0081 is a Unicode "control" character).  Note
this only happens IF you are in a UTF-8 locale AND you call
setlocale() at the beginning of your program (the latter drove me nuts
because my original test program didn't call it, so it didn't show the
problem).
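
To make that easier to try out, here is a minimal sketch of a test helper.
The name count_stop_bytes() is made up for illustration and is not part of
nmh; it just applies the same unguarded test quoted above to every byte of
a string and counts how many bytes would stop the scan.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from nmh): count the bytes of s that the
 * unguarded iscntrl()/isspace() test would stop on under whatever
 * locale is currently in effect. */
size_t count_stop_bytes(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Same cast as the original code, so the argument is always
         * a valid unsigned char value. */
        unsigned char c = (unsigned char) *s;

        if (iscntrl(c) || isspace(c))
            n++;
    }
    return n;
}
```

To reproduce the report, one would call setlocale(LC_ALL, "") from
<locale.h> while in a UTF-8 locale and then pass "\xd1\x81" (the UTF-8
bytes of U+0441).  Going by the description above, a nonzero count would be
expected on Mac OS X; other C libraries may well classify 0x81 differently.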

So, it seems like the behavior of iscntrl() and isspace() if the value
is > 127 depends on the locale.  If you're in a UTF-8 locale, Mac OS X treats
that value as a Unicode codepoint.  But we are NOT treating it like that in
this case; we're processing the UTF-8 encoding on a byte-by-byte basis.

I am wondering if the simplest solution is to put in isascii() in front
of those tests in that function.  We only really care about those tests
returning "true" for ASCII characters.  Thoughts?
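
A rough sketch of what that guard could look like, assuming a small wrapper
function rather than an inline change.  The name is_ascii_cntrl_or_space()
is hypothetical, and the ASCII check is written out as c <= 0x7f (which is
what isascii() tests) only to keep the sketch within ISO C; this is not the
actual change to sbr/mf.c.

```c
#include <ctype.h>

/* Hypothetical wrapper (not from nmh): true only for ASCII control or
 * whitespace characters.  Bytes above 0x7f, such as the pieces of a
 * UTF-8 sequence, are never flagged, whatever the locale says. */
int is_ascii_cntrl_or_space(int c)
{
    unsigned char uc = (unsigned char) c;

    /* uc <= 0x7f is the test isascii() performs. */
    return uc <= 0x7f && (iscntrl(uc) || isspace(uc));
}
```

With a guard like this, both 0xd1 and 0x81 fall through as ordinary bytes,
while space, tab and the ASCII control characters still end the token.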

--Ken
