From: Thaddeus H. Black
Subject: [Bug-wget] [PATCH] new option --convert-specified-links
Date: Wed, 14 Dec 2016 23:24:37 +0000
User-agent: Mutt/1.5.23 (2014-03-12)
Summary: a patch is attached to let the user choose which links to convert.

I. PATCH

Please find attached an experimental but functional patch to implement a new option, --convert-specified-links. The patch addresses an old, heretofore unresolved bug report [1]. It adds about 230 lines of code.

1: https://lists.gnu.org/archive/html/bug-wget/2010-05/msg00052.html

II. TWO, SEPARATE STAGES

Actually, the patch is only half the solution. I am still working on the other half.

As GCC lets the user 1. compile and 2. link in two, separate stages, so too Wget would let the user 1. download files and 2. convert links in two, separate stages. The attached patch experimentally implements stage 2. Stage 1 remains to be implemented.

Stage 1 is to output telemetry (in the form of two files, *.urls and *.links) for stage 2 to consume. Until stage 1 has been implemented, one must prepare the telemetry by hand.

III. HOW TO TRY THE PATCH

Besides the patch itself, also attached are a pair of sample telemetry files (foo.urls and foo.links) to let you try the patch for yourself. Here's how:

1. Apply the patch to wget-1.18.
2. Build.
3. Make a temporary working directory and cd into it.
4. Copy (or link) foo.urls and foo.links into the temporary working directory.
5. Using the patched Wget, issue these two commands:

    wget -Nx http://www.thblackelectric.com/ACD/{index,page2}.html
    wget -NxkK --convert-specified-links=foo http://www.thblackelectric.com/ACD/page3.html

Observe that the patched Wget converts links in files currently *and earlier* downloaded.

IV. BUGS IN THE PATCH

Unknown. The patch is experimental and has not yet been much tested. There probably are bugs, maybe even some obvious ones. At least the above test works, though.

If someone who already knows Wget's test harness volunteered to supply tests, I'd gratefully accept the help. Otherwise, I'll probably puzzle out the harness myself, eventually.

V. XML OR JSON

Not used. I would be happy to use either XML or JSON for the control telemetry (like foo.urls and foo.links) if you wish; but, as far as I can tell, Wget doesn't employ such markups.

VI. RFC 3986

Perhaps should be used, but isn't yet. RFC 3986 is the URI syntax standard, which defines percent-encoding. (Example: "%2A" represents '*'.) I gather that Wget already uses this RFC in some manner, but I have not yet investigated.

VII. SECURITY

Because the patch constructs file paths for the control telemetry, it may expose a new attack surface. I have tried to make the relevant code segment as plain and readable as possible so that one can easily audit it.

VIII. DEBIAN

I have earlier reported this bug to Debian. [2] That's just for reference. The Debian bug log contains nothing that would interest you except Noël Köthe's advice that I elevate the bug here to Bug-wget.

2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=847216

IX. CHANGELOG, INFO/MAN PAGE, ETC.

Not yet updated.

X. REQUEST FOR ADVICE

The patch tries to adhere to your style, practices, and code organization, but probably does not do all these quite right. If you think of advice, please do tell; I would be glad to hear it.

As noted in section II, stage 1 remains to be implemented.
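
For readers unfamiliar with the percent-encoding mentioned under RFC 3986 above, here is a minimal sketch in C of what decoding such escapes involves. It is not taken from the attached patch or from Wget's source. The names hex_value and percent_decode are hypothetical, chosen only for illustration, and the decision to copy malformed escapes through unchanged is an assumption rather than anything Wget is known to do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the value of one hexadecimal digit, or -1 if c is not one. */
int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper, not part of Wget or of the patch.
   Decode percent-escapes ("%" followed by two hex digits) from src
   into dst.  dst must hold at least strlen(src) + 1 bytes; decoding
   never lengthens the string.  A '%' that is not followed by two hex
   digits is copied through unchanged (an assumption of this sketch).
   Returns the number of bytes written, not counting the terminator;
   a decoded "%00" therefore shows up as a length mismatch against
   strlen(dst). */
size_t percent_decode(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    while (*src) {
        int hi, lo;

        /* Short-circuit evaluation keeps us from reading past the
           terminating NUL when '%' is the last or next-to-last byte. */
        if (src[0] == '%'
            && (hi = hex_value((unsigned char) src[1])) >= 0
            && (lo = hex_value((unsigned char) src[2])) >= 0) {
            dst[n++] = (char) (hi * 16 + lo);
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called with the input "%2A" and a sufficiently large buffer, percent_decode writes the one-character string "*". Code that builds file paths from decoded text would additionally need to decide how to treat a decoded '/' or NUL byte, which this sketch leaves to the caller.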
wget-1.18.0.1thb.diff
Description: Text Data
foo.urls
Description: Text document
foo.links
Description: Text document
signature.asc
Description: Digital signature