[Lynx-dev] Lynx bug report: mangled UTF-8

From: Tom Christiansen
Subject: [Lynx-dev] Lynx bug report: mangled UTF-8
Date: Tue, 05 Oct 2010 11:22:45 -0600

I've verified this bug using the following version of Lynx, whose
release is dated just yesterday:

    $ ./lynx -version
    Lynx Version 2.8.8dev.6 (04 Oct 2010)
    libwww-FM 2.14, ncurses 5.7.20081102
    Built on darwin10.4.0 Oct  5 2010 10:23:40

This bug also occurs in all prior versions of Lynx I was able to test.


    When considering line wrapping, Lynx misconstrues all text as ISO 
    8859-1, even when producing UTF-8.  All code points whose multibyte 
    UTF-8 encoding includes bytes which are white space in 8859-1 
    [see attached program] erroneously become candidates for line wrapping.

    Multibyte encodings containing either of the bytes 0x85 or 0xA0 may
    have that byte replaced by \n, a substitution which not only irrecoverably
    mangles the intended text but also generates illegal UTF-8 sequences.
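
To make the failure concrete, here is a minimal sketch of the substitution
described above.  It is an illustration only, not Lynx code: the helpers
wrap_as_latin1() and is_valid_utf8() are hypothetical names, and the two
sample code points (U+00E0 and U+2005) were picked simply because their
UTF-8 encodings contain 0xA0 and 0x85 respectively.

```perl
#!/usr/bin/env perl
# Illustrative sketch only: mimics the described substitution; NOT Lynx code.
use strict;
use warnings;
use Encode qw[encode decode];

# hypothetical helper: treat every byte as ISO 8859-1 and turn the
# 8859-1 "spaces" 0x85 (NEL) and 0xA0 (NBSP) into newlines
sub wrap_as_latin1 {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/[\x85\xA0]/\n/g;
    return $out;
}

# hypothetical helper: true if the byte string is well-formed UTF-8
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when checking
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# U+00E0 encodes as C3 A0; U+2005 encodes as E2 80 85
for my $cp (0x00E0, 0x2005) {
    my $utf8    = encode("UTF-8", chr($cp));
    my $mangled = wrap_as_latin1($utf8);
    printf("U+%04X: %v02X becomes %v02X (%s)\n", $cp, $utf8, $mangled,
        is_valid_utf8($mangled) ? "still valid UTF-8" : "illegal UTF-8");
}
```

Both samples should be reported as illegal UTF-8 after the substitution,
since a lead byte ends up followed by 0x0A where a continuation byte
(0x80 through 0xBF) is required.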


I was unable to locate any mention of this bug, whether in the CHANGES or
PROBLEMS file, or via Googling.  I am also unaware of any bugs database for
Lynx, or I would have submitted this there.  I trust my simple description
should suffice to locate the offending code, but if not, sample input
file(s) manifesting the problem are available upon request.

Hope this helps.  Send mail if you have any advice or need more details.



    As of Unicode 5.2, 1,776 named code points are vulnerable to this
    Lynx bug.  These can be enumerated by running the following program.

#!/usr/bin/env perl
# spacenc - find code points with multibyte UTF-8 encodings containing
#           bytes that would be spaces if misunderstood to be ISO 8859-1
# Tom Christiansen <address@hidden>
# Tue Oct  5 10:51:18 MDT 2010
# NB: works best with Unicode version >= 5.2, hence Perl version >= 5.12

use strict;
use warnings  FATAL => qw[all];
use diagnostics;
use charnames qw[ ];
use Encode    qw[encode decode];

# omit code points < 128, as those don't multibyte-encode
for my $cp (0x00_0080 .. 0x10_FFFF) {

    # gaggy UTF-16 surrogates are illegal UTF-8 code points
    next if $cp >= 0x00_D800 && $cp <= 0x00_DFFF;

    # see "Unicode non-character %s is illegal for interchange" in perldiag(1)
    $_ = do { no warnings "utf8"; chr($cp) };

    # won't find string names for any of these, so don't bother printing
    next if m{ \p{Unassigned}           }x;
    next if m{ \p{PrivateUse}           }x;
    next if m{ \p{Han}                  }x;
    next if m{ \p{InHangulSyllables}    }x;

    # cast individual utf8 bytes into latin1 code points
    my $as_latin = decode("latin1", encode("utf8", $_));

    # any byte that 8859-1 would call white space makes this code point vulnerable
    if ($as_latin =~ m{ \s }x) {
        printf("U+%05X in UTF8 is %v02X", $cp, $as_latin);
        printf(" %s\n", charnames::viacode($cp) || "<unnamed code point>");
    }
}

close(STDOUT) || die "can't close stdout: $!";
