bug-wget

Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illeg


From: Ángel González
Subject: Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegail utf-8 sequence
Date: Sun, 10 Jun 2012 00:32:20 +0200
User-agent: Thunderbird

On 09/06/12 21:03, Micah Cowan wrote:
> Could you attach an example of the broken file contents? ...the full
> file itself is perhaps a bit large to attach in a mailing list (~85k?),
> but perhaps you could use a pastebin, or otherwise throw it up on a
> server, or just post a snippet that illustrates exactly what sort of
> corruption is taking place in your setup.
>
> Good luck,
> -mjc

That Wikipedia page hasn't been edited since April 7th, so we are all
probably working with the same content.

These are the md5sums of the files I worked with:
6d887f5796a00a24e8fb284d6f78791c Without-k
341611e10271ffa117f873a56a467960 With-k

Hitoshi, if the md5 of the corrupted file is 3416... then I missed the
corruption. A simple wget should be 6d88... though.
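
For anyone who wants to repeat the comparison, here is a minimal sketch in Python. The two digests are the ones quoted above; the helper names (md5_hex, md5_of_file, classify) are made up for this illustration.

```python
import hashlib

# Digests quoted in this message, mapped to the fetch that produced them.
KNOWN = {
    "6d887f5796a00a24e8fb284d6f78791c": "Without-k",
    "341611e10271ffa117f873a56a467960": "With-k",
}


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path: str) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(digest: str) -> str:
    """Say which of the two files from this message a digest matches."""
    return KNOWN.get(digest.lower(), "unknown")
```

Calling classify(md5_of_file(path)) on a downloaded copy would then report "With-k", "Without-k", or "unknown" if the content differs from both.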

A fragment of the relevant bytes (e.g. from hexdump -C) from both the
original and the transformed (broken) file could be enough to find the cause.
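
If picking out the relevant bytes by hand is awkward, something like the following Python sketch could locate them. It relies only on the standard UTF-8 decoder; the function names and the 32-byte radius are arbitrary choices for this illustration, and the output only roughly imitates hexdump -C.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that fails UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def hexdump(data: bytes, start: int = 0, width: int = 16) -> str:
    """Format bytes as offset, hex values, and printable ASCII."""
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start + i:08x}  {hexpart:<{width * 3}} |{text}|")
    return "\n".join(lines)


def context_around_error(data: bytes, radius: int = 32):
    """Return a hexdump of the bytes around the first invalid sequence, or None."""
    pos = first_invalid_utf8(data)
    if pos is None:
        return None
    lo = max(0, pos - radius)
    return hexdump(data[lo:pos + radius], start=lo)
```

Reading the broken file in binary mode and printing context_around_error(data) should give a fragment small enough to paste into a mail.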


The latest big change to convert.c was the CSS wonder-patch of 2008,
which is already in 1.12, so the conversion shouldn't differ between
1.12 and the latest version.
Still, I built and tried with ftp://ftp.gnu.org/gnu/wget/wget-1.12.tar.bz2

I did find an interesting issue:

Where the file converted with current wget shows:
                <!-- logo -->
                    <div id="p-logo"><a style="background-image:
url(http://upload.wikimedia.org/...
                <!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
    <h5>...
    <div class="body">
        <ul>
            <li id="n-mainpage">...
            <li id="n-portal">...
            <li id="n-currentevents">...
            <li id="n-newpages">...
            <li id="n-recentchanges">...

The one converted with 1.12 shows:
        <!-- panel -->
            <div id="mw-panel" class="noprint">
                <!-- logo -->
                    <div id="p-logo"><a style="background-image:
url(//upload.wikimedia.org/....
                <!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
    <h5>...
    <div class="body">
        <ul>
            <li id="n-mainpage">...
            <li id="n-portal">...
            <li id="n-currentevents"><a
href="/wiki/Portal:http://upload.wikimedia...
                <!-- /logo -->
<!-- navigation -->
<div class="portal" id='p-navigation'>
    <h5>...
    <div class="body">
        <ul>
            <li id="n-mainpage"><a href="http://ja.wikipedia.org/wiki/...
            <li id="n-portal"><a href="http://ja.wikipedia.org/wiki/....
            <li id="n-currentevents"><a
href="http://ja.wikipedia.org/wiki/Portal:%E6%9C%80%E8...
            <li id="n-newpages"><a href="http://ja.wikipedia.org/...
            <li id="n-recentchanges"><a href="http://ja.wikipedia.org/...

In summary, the protocol-relative link is not converted inside the
inline CSS (not a big bug), then the following 9 lines of the
unconverted file are copied, and then comes the rest of the converted
file, including those 9 lines again.
On a different fetch, I get a slightly differently corrupted file along
the same lines. It is likely that, depending on how the pieces
happened to be copied, the UTF-8 byte sequences became invalid.

So there was indeed a bug in 1.12's link conversion, which seems to have
been fixed in the meantime.



