bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-wget] [PATCH] Bug 40426 follow-up


From: Tim Ruehsen
Subject: Re: [Bug-wget] [PATCH] Bug 40426 follow-up
Date: Fri, 13 Mar 2015 12:03:15 +0100
User-agent: KMail/4.14.2 (Linux/3.16.0-4-amd64; KDE/4.14.2; x86_64; ; )

Hi Ander,

> A patch was already proposed that simply avoided '-r' and '-O -' to be set
> together. But that would void a clause in the documentation that states
> that "wget -O file http://foo"; is intended to work like "wget -O -
> http://foo > file". So, my approach is to maintain that.

wget -O - http://foo > file
is not using -r, so strictly the doc would be still correct when prohibiting 
the combination of -r and -O -.

I am not sure how your patch works with several HTML files to be downloaded 
when parsed from a (huge) single file. (A test case would show if it really 
works as expected).

How does your patch work together with -k ?

AFAICS, for each HTML/CSS file downloaded you need to keep the start and end 
position of the data appended to the temp. file. This has to be respected when 
loading the data for parsing (and in case of -k for changing the URLs). It is 
possible, but seems to be very bloated...

Keeping each downloaded file in memory is good when not using -k. But with -r 
-k you simply can't keep everything in memory and thus need separate files (as 
wget now works).

One approach would be to download everything 'as normal', keeping a list of 
all downloaded files in download order. And when ready and all downloading / 
conversion has been done, go through the list and copy all the file data to 
stdout (+ removing files / cleaning up).

What do you think ?

Tim

On Thursday 12 March 2015 12:23:15 Ander Juaristi wrote:
> This is my first attempt to fix 40426. I'm expecting to continue to work on
> it until it's fixed, unless someone has objections.
> 
> Source of the problem:
> 
> When -r and -O - are specified together, Wget hangs. Well, it really doesn't
> hang, it's actually waiting for input. When Wget downloads a file it saves
> it to the disk first and then reads it again from there to parse it and get
> more URLs, if any. But when '-O -' was specified, Wget reads from stdin.
> This is done in wget_read_file(). As a funny game, when this happens, type
> some HTML, and then hit Ctrl-D or any other key sequence equivalent to EOF,
> and you'll effectively trick Wget into inserting arbitrary HTML.
> 
> Solution:
> 
> A patch was already proposed that simply avoided '-r' and '-O -' to be set
> together. But that would void a clause in the documentation that states
> that "wget -O file http://foo"; is intended to work like "wget -O -
> http://foo > file". So, my approach is to maintain that.
> 
> Initially I thought of redirecting stdout to stdin, but that would be an
> ugly hack that would probably make things difficult in the future. What's
> more, there are options that rely on stdin, such as '--input-file=-'.
> 
> So, what I did is to write to a regular file in /tmp/.wget.stdout instead of
> in stdout when '-O -' is passed, and dump everything in the end. This,
> apart from fixing the bug, maintains the documented behavior.
> 
> Although this is a good workaround for the short term, IMO the best approach
> is to keep everything in memory and write stuff out in the end. Something
> similar was already reported at 20714.
> 
> I don't expect this patch to be definite. This is just a follow-up to my
> work. I'm looking forward to your feedback as I sure haven't taken into
> account all the consequences.
> 
> So far, the following are still to be done:
>    - /tmp/.wget.stdout won't work on Windows, so port it.
>    - FOPEN_OPT_ARGS is not taken into account.
>    - I only took into account HTTP. The same workaround should be applied to
> FTP too.
> 
> Here goes:
> 
> diff --git a/src/http.c b/src/http.c
> index b7020ef..bc7c1e9 100644
> --- a/src/http.c
> +++ b/src/http.c
> @@ -3080,7 +3080,7 @@ http_loop (struct url *u, struct url *original_url,
> char **newloc,
> 
>     /* Set LOCAL_FILE parameter. */
>     if (local_file && opt.output_document)
> -    *local_file = HYPHENP (opt.output_document) ? NULL : xstrdup
> (opt.output_document); +    *local_file = HYPHENP (opt.output_document) ?
> xstrdup (TMP_OUTFILE) : xstrdup (opt.output_document);
> 
>     /* Reset NEWLOC parameter. */
>     *newloc = NULL;
> @@ -3101,7 +3101,7 @@ http_loop (struct url *u, struct url *original_url,
> char **newloc,
> 
>     if (opt.output_document)
>       {
> -      hstat.local_file = xstrdup (opt.output_document);
> +      hstat.local_file = HYPHENP (opt.output_document) ? xstrdup
> (TMP_OUTFILE) : xstrdup (opt.output_document); got_name = true;
>       }
>     else if (!opt.content_disposition)
> diff --git a/src/main.c b/src/main.c
> index b23967b..1bf7565 100644
> --- a/src/main.c
> +++ b/src/main.c
> @@ -1582,7 +1582,17 @@ for details.\n\n"));
>   #ifdef WINDOWS
>             _setmode (_fileno (stdout), _O_BINARY);
>   #endif
> -          output_stream = stdout;
> +          // TODO We should take care of FOPEN_OPT_ARGS
> +          output_stream = fopen (TMP_OUTFILE, "wb+");
> +          if (output_stream == NULL)
> +            {
> +              perror (TMP_OUTFILE);
> +              exit (WGET_EXIT_GENERIC_ERROR);
> +            }
> +          /*
> +           * We know it's a regular file. No need to check.
> +           */
> +          output_stream_regular = true;
>           }
>         else
>           {
> @@ -1762,8 +1772,24 @@ outputting to a regular file.\n"));
>     if (opt.convert_links && !opt.delete_after)
>       convert_all_links ();
> 
> +  /* If output file is stdout (`-O -' was specified), obey.
> +     Print stuff out.  */
> +  if (HYPHENP (opt.output_document) && output_stream &&
> (total_downloaded_bytes > 0)) +    {
> +      struct file_memory *fm = wget_read_file (TMP_OUTFILE);
> +      if (fm)
> +        {
> +          write (fileno (stdout), fm->content, fm->length);
> +          wget_read_file_free (fm);
> +        }
> +    }
> +
>     cleanup ();
> 
> +  /* Delete the temporary output file  */
> +  if (HYPHENP (opt.output_document) && unlink (TMP_OUTFILE))
> +    logprintf (LOG_NOTQUIET, "Temporary output file was not deleted. Should
> be done by hand. %s\n", strerror (errno)); +
>     exit (get_exit_status ());
>   }
> 
> diff --git a/src/wget.h b/src/wget.h
> index 8d2b0f1..9d40255 100644
> --- a/src/wget.h
> +++ b/src/wget.h
> @@ -126,6 +126,12 @@ as that of the covered work.  */
> 
>   #define DEBUGP(args) do { IF_DEBUG { debug_logprintf args; } } while (0)
> 
> +/* If we're outputting to stdout ('-O -' was specified), we'd rather
> +   write to a temporary file and output everything in the end, since
> +   Wget still needs the downloaded files to be regular in order to
> +   parse them.  */
> +#define TMP_OUTFILE "/tmp/.wget.stdout"
> +
>   /* Pick an integer type large enough for file sizes, content lengths,
>      and such.  Because today's files can be very large, it should be a
>      signed integer at least 64 bits wide.  This can't be typedeffed to

Attachment: signature.asc
Description: This is a digitally signed message part.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]