bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-wget] [PATCH] Bug 40426 follow-up


From: Darshit Shah
Subject: Re: [Bug-wget] [PATCH] Bug 40426 follow-up
Date: Sun, 15 Mar 2015 01:30:59 +0530
User-agent: Mutt/1.5.23 (2014-03-12)

Hi Ander,


This is my first attempt to fix 40426. I'm expecting to continue to work on it until it's fixed, unless someone has objections.

Source of the problem:

When -r and -O - are specified together, Wget hangs. Well, it really doesn't 
hang, it's actually waiting for input. When Wget downloads a file it saves it 
to the disk first and then reads it again from there to parse it and get more 
URLs, if any. But when '-O -' was specified, Wget reads from stdin. This is 
done in wget_read_file(). As a funny game, when this happens, type some HTML, 
and then hit Ctrl-D or any other key sequence equivalent to EOF, and you'll 
effectively trick Wget into inserting arbitrary HTML.

That's an interesting observation about getting Wget to insert arbitrary HTML. I would wary about this because it could even cause Wget to download an arbitrary file.

Solution:

A patch was already proposed that simply avoided '-r' and '-O -' to be set together. But that would 
void a clause in the documentation that states that "wget -O file http://foo"; is intended to 
work like "wget -O - http://foo > file". So, my approach is to maintain that.

Initially I thought of redirecting stdout to stdin, but that would be an ugly 
hack that would probably make things difficult in the future. What's more, 
there are options that rely on stdin, such as '--input-file=-'.

So, what I did is to write to a regular file in /tmp/.wget.stdout instead of in 
stdout when '-O -' is passed, and dump everything in the end. This, apart from 
fixing the bug, maintains the documented behavior.

I think this approach opens a can of worms. Multiple Wget processes, support with other existing options and worst of all, people exercising this option in ways I can't currently think of.

In such a scenario, I don't mind editing the documentation to add an exception. But the solution presented above would cause too much pain in the long run to maintain.

Although this is a good workaround for the short term, IMO the best approach is 
to keep everything in memory and write stuff out in the end. Something similar 
was already reported at 20714.

If someone is willing to work on a streaming model of the HTML Parser, this approach might actually make some sense.

I don't expect this patch to be definite. This is just a follow-up to my work. 
I'm looking forward to your feedback as I sure haven't taken into account all 
the consequences.

So far, the following are still to be done:
 - /tmp/.wget.stdout won't work on Windows, so port it.
 - FOPEN_OPT_ARGS is not taken into account.
 - I only took into account HTTP. The same workaround should be applied to FTP 
too.

As you've already realized, the approach you propose creates a whole bunch of corner cases that need to covered. At this moment, it looks like implementing this approach for -r -O- would be large enough for an entire GSoC Project.

Instead I think it is better to stick to the basics and simply reject such combinations of options

--
Thanking You,
Darshit Shah

Attachment: pgpaOh1DHrHVD.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]