[Bug-wget] [PATCH] new option --convert-specified-links


From: Thaddeus H. Black
Subject: [Bug-wget] [PATCH] new option --convert-specified-links
Date: Wed, 14 Dec 2016 23:24:37 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

Summary: a patch is attached to let the user choose which links
to convert.

I. PATCH

Please find attached an experimental but functional patch to
implement a new option, --convert-specified-links.  The patch
addresses an old, heretofore unresolved bug report [1].  It
adds about 230 lines of code.

    1: https://lists.gnu.org/archive/html/bug-wget/2010-05/msg00052.html

II. TWO, SEPARATE STAGES

Actually, the patch is only half the solution.  I am still
working on the other half.  As GCC lets the user

    1. compile and
    2. link

in two, separate stages, so too Wget would let the user

    1. download files and
    2. convert links

in two, separate stages.  The attached patch experimentally
implements stage 2.

Stage 1 remains to be implemented.  Stage 1 is to output
telemetry (in the form of two files, *.urls and *.links) for
stage 2 to consume.  Until stage 1 has been implemented, one
must prepare the telemetry by hand.

III. HOW TO TRY THE PATCH

Besides the patch itself, a pair of sample telemetry files
(foo.urls and foo.links) is also attached so that you can try
the patch for yourself.  Here's how:

    1. Apply the patch to wget-1.18.
    2. Build.
    3. Make a temporary working directory and cd into it.
    4. Copy (or link) foo.urls and foo.links into the temporary
       working directory.
    5. Using the patched Wget, issue these two commands:

         wget -Nx http://www.thblackelectric.com/ACD/{index,page2}.html
         wget -NxkK --convert-specified-links=foo \
           http://www.thblackelectric.com/ACD/page3.html

Observe that the patched Wget converts links in files
currently *and earlier* downloaded.

IV. BUGS IN THE PATCH

Unknown.

The patch is experimental and has not yet been much tested.
There probably are bugs -- maybe even some obvious bugs.  At
least the above test works, though.

If someone who already knows Wget's test harness volunteered to
supply tests, I'd gratefully accept the help.  Otherwise, I'll
probably puzzle out the harness myself, eventually.

V. XML OR JSON

Not used.

I would be happy to use either XML or JSON for the control
telemetry (like foo.urls and foo.links) if you wish; but, as
far as I can tell, Wget doesn't employ such markups.

VI. RFC 3986

Perhaps should be used, but isn't yet.

RFC 3986 is the generic URI syntax standard, which among other
things defines percent-encoding.  (Example from RFC 3986: "%2A"
represents '*'.)  I gather that Wget already uses this RFC in
some manner, but I have not yet investigated.
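
For illustration only, here is a minimal C sketch of what
percent-decoding involves.  It is taken neither from the patch nor
from Wget's sources; the names percent_decode and hex_value are
hypothetical, and real code would presumably reuse whatever Wget
already provides.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical sketch: decode %XX triplets from src into dst, whose
   capacity dst_size includes the terminating NUL.  Malformed triplets
   are copied through unchanged.  "%00" is not treated specially, so it
   would yield an embedded NUL; a real implementation should decide how
   to handle that.  Returns 0 on success, -1 if dst is too small. */
int percent_decode(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    while (*src) {
        int hi = -1, lo = -1;
        char out = *src;
        size_t step = 1;

        if (src[0] == '%'
            && (hi = hex_value((unsigned char)src[1])) >= 0
            && (lo = hex_value((unsigned char)src[2])) >= 0) {
            out = (char)(hi * 16 + lo);
            step = 3;
        }
        if (n + 1 >= dst_size)
            return -1;          /* no room for this byte plus the NUL */
        dst[n++] = out;
        src += step;
    }
    if (dst_size == 0)
        return -1;
    dst[n] = '\0';
    return 0;
}
```

Under these assumptions, decoding "a%2Ab" gives "a*b", while a
stray "%" that is not followed by two hex digits is left as it is.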

VII. SECURITY

Because the patch constructs file paths for the control
telemetry, it risks exposing a new security attack surface.  I
have tried to make the relevant code segment as humanly obvious
as possible so that one can easily audit it.
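
To show the kind of code in question, here is a minimal sketch,
assuming only that the argument of --convert-specified-links (such
as foo) is used as a base name to which .urls and .links are
appended.  It is not the code in the patch; the function name
telemetry_path and its signature are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch: write "<base><suffix>" into out, whose capacity
   is out_size.  Returns 0 on success, or -1 if base is missing or
   empty, or if the result would not fit (so a truncated path is never
   used). */
int telemetry_path(const char *base, const char *suffix,
                   char *out, size_t out_size)
{
    int len;

    if (base == NULL || base[0] == '\0')
        return -1;
    len = snprintf(out, out_size, "%s%s", base, suffix);
    if (len < 0 || (size_t)len >= out_size)
        return -1;              /* encoding error or truncation */
    return 0;
}
```

With base "foo", the suffixes ".urls" and ".links" yield "foo.urls"
and "foo.links".  Keeping the construction to one short function
with explicit failure cases is one way to make it easy to audit.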

VIII. DEBIAN

I have earlier reported this bug to Debian. [2]  That's just
for reference.  The Debian bug log contains nothing that would
interest you except Noël Köthe's advice that I elevate the bug
here to Bug-wget.

    2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=847216

IX. CHANGELOG, INFO/MAN PAGE, ETC.

Not yet updated.

X. REQUEST FOR ADVICE

The patch tries to adhere to your style, practices, and code
organization, but probably does not do all these quite right.
If you think of advice, please do tell; I would be glad to
hear it.

As noted above, stage 1 remains to be implemented.

Attachment: wget-1.18.0.1thb.diff
Description: Text Data

Attachment: foo.urls
Description: Text document

Attachment: foo.links
Description: Text document

Attachment: signature.asc
Description: Digital signature

