lynx-dev

Re: [Lynx-dev] Lynx bug report: mangled UTF-8


From: Tom Christiansen
Subject: Re: [Lynx-dev] Lynx bug report: mangled UTF-8
Date: Wed, 06 Oct 2010 08:01:10 -0600

On Tuesday, 5 October 2010, Thomas Dickey wrote at 5:00pm EDT:

>> I've verified this bug using the following version of Lynx, whose
>> release is notably dated just yesterday:
>>
>>    $ ./lynx -version
>>    Lynx Version 2.8.8dev.6 (04 Oct 2010)
>>    libwww-FM 2.14, ncurses 5.7.20081102
>>    Built on darwin10.4.0 Oct  5 2010 10:23:40

> Your bug might still be present, but right away I notice that it's not
> built with wide-character library of ncurses (and is not likely to work 
> as well):

>   Lynx Version 2.8.8dev.4 (21 Jun 2010)
>   libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 0.9.8o, ncurses 5.7.20101002(wide)
>   Built on linux-gnu Jun 21 2010 17:17:20

Thanks for the suggestion.  I've rebuilt with ncursesw, and my version 
now reads:

    Lynx Version 2.8.8dev.6 (04 Oct 2010)
    libwww-FM 2.14, ncurses 5.7.20081102(wide)
    Built on darwin10.4.0 Oct  5 2010 17:49:29

For more detailed library info, Darwin doesn't have ldd(1), so one
instead uses:

    $ otool -L ./lynx
    ./lynx:
    /opt/local/lib/libidn.11.dylib (compatibility version 17.0.0, current version 17.45.0)
    /opt/local/lib/libncursesw.5.dylib (compatibility version 5.0.0, current version 5.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 125.2.0)

But the problem continued to occur--at least on Darwin; haven't
tried on OpenBSD.  

My test case is to run

    $ ~/lynx2-8-8/lynx -width=80 -display_charset=utf-8 \
        -assume_local_charset=utf-8 -dump test.html > test.utf8

Looking at test.utf8 in less(1) with LESSCHARSET set to UTF-8 reveals
this paragraph:

  Values in parentheses are for the high resolution bin: 2.71–2.59 Å for
  the SeMet GMP-bound, 2.03–2.00 Å for the native GMP-bound, 2.38–2.30 <C3>
  for the native apo form.

where less places <C3> in reverse-video to indicate a character that's
either non-printable or out of repertoire; here, invalid UTF-8.

What's happening is that the UTF-8 encoding of code point U+00C5, LATIN
CAPITAL LETTER A WITH RING ABOVE, is the two-byte sequence 0xC3 0x85.

But in ISO 8859-1, an isolated 0x85 is NEXT LINE (NEL), which is treated
as white space.  The 0x85 byte is erroneously replaced by a newline,
leaving a lone 0xC3 byte dangling, which is illegal as UTF-8.
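
To make the byte-level picture concrete, here is a minimal sketch in Python
(chosen only for illustration).  It is not Lynx's code: the mangle() helper
is a hypothetical stand-in that merely imitates the symptom described above
by turning every 0x85 byte into a newline.

```python
# Hypothetical illustration of the symptom; this is not Lynx's code.
A_RING = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# The UTF-8 encoding of U+00C5 is the two-byte sequence 0xC3 0x85.
encoded = A_RING.encode("utf-8")
assert encoded == b"\xc3\x85"

# Taken alone, byte 0x85 decodes in ISO 8859-1 to U+0085 (NEL),
# which Python also classifies as white space.
nel = b"\x85".decode("latin-1")
assert nel == "\u0085" and nel.isspace()


def mangle(data: bytes) -> bytes:
    """Imitate the suspected fault: replace each 0x85 byte with a newline."""
    return data.replace(b"\x85", b"\n")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A fragment of the affected paragraph: en dash, then A-ring.
sample = "2.38\u20132.30 \u00c5 for".encode("utf-8")
broken = mangle(sample)
```

Here is_valid_utf8(sample) holds while is_valid_utf8(broken) does not,
because the 0xC3 lead byte is now followed by a newline instead of a
continuation byte.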

A more programmatic approach to locating the problem can be had via perl(1),
feeding the input via stdin:

    $ perl -CS  -Mwarnings=FATAL,all -lne 'print if /nonesuch/' < test.utf8
    utf8 "\xC3" does not map to Unicode at -e line 1, <> line 633.
    Exit 255

or this way with an explicitly named file:

    $ perl -CSD -Mwarnings=FATAL,all -lne 'print if /nonesuch/'   test.utf8
    utf8 "\xC3" does not map to Unicode at -e line 1, <> line 633.
    Exit 25

The "Exit 255" and "Exit 25" lines are printed by tcsh because I have
printexitvalue set.

One can also set up an explicit "warning handler" if one wishes more
control over the error message and behavior, perhaps for generating a
unit-test used in regression testing.
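
As one possible shape for such a regression check, here is a minimal sketch
in Python rather than Perl.  It is an analogue of the idea, not the Perl
warning handler itself, and the function name find_invalid_utf8 is made up
for this example.  It reads the file as raw bytes, line by line, and records
the first undecodable byte on each line.

```python
def find_invalid_utf8(path):
    """Return a list of (line_number, byte_offset_in_line, offending_byte).

    Only the first invalid byte on each line is reported.  Lines are
    split on the newline byte (0x0A) only, so a stray 0x85 does not
    affect the line count.
    """
    problems = []
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the index of the first byte that failed.
                problems.append((lineno, err.start, raw[err.start]))
    return problems
```

A test could then assert that find_invalid_utf8("test.utf8") returns an
empty list, and print the line number and hex value of each offending byte
when it does not.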

Thank you all for all your help and suggestions.

--tom


