bug-wget

[Bug-wget] Combining --output-document with --recursive


From: Gijs van Tulder
Subject: [Bug-wget] Combining --output-document with --recursive
Date: Thu, 24 May 2012 23:45:20 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

Hi,

There's a problem when you combine --output-document with --recursive or --page-requisites: the --output-document option breaks the recursion.

First you get a warning:

  WARNING: combining -O with -r or -p will mean that all downloaded
  content will be placed in the single file you specified.

That is what you'd expect, no problem there.

However, there is a problem with the recursion. Because Wget *appends* all downloaded content to the same file, the HTML and CSS parsers get confused. After every download the whole file is parsed again from the start, so the same content is parsed over and over, each time with a different URL context.

Example:
1. You run wget -O out.tmp -r http://example.com/index.html
2. http://example.com/index.html is written to out.tmp.
   URLs are extracted from out.tmp relative to
   http://example.com/index.html. Suppose that there is a link to a
   subdirectory test/index.html, which is added to the download queue
   as http://example.com/test/index.html (correct).
3. http://example.com/test/index.html is appended to out.tmp.
   Now, again, Wget extracts URLs from out.tmp. It parses the whole
   file, so it first finds the contents of /index.html, with the link
   to test/index.html. Because Wget thinks it is now parsing
   http://example.com/test/index.html, it will enqueue this as
   http://example.com/test/test/index.html (wrong).
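
To make step 3 concrete, here is a minimal sketch in C of how the same relative link resolves differently depending on the base URL. The function resolve_relative is a hypothetical helper written for this illustration, not Wget's own URL-merging code, and it assumes a plain relative path (no "../", no leading "/", no scheme).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT Wget's real URL merger.  Resolves a plain
   relative path against a base URL by replacing everything after the
   last '/' in the base.  Returns 0 on success, -1 if the result does
   not fit into OUT (OUTSZ bytes, including the terminating NUL).  */
static int
resolve_relative (const char *base, const char *rel, char *out, size_t outsz)
{
  assert (base != NULL && rel != NULL && out != NULL);

  /* Keep the base up to and including its last slash.  */
  const char *slash = strrchr (base, '/');
  size_t dirlen = slash ? (size_t) (slash - base) + 1 : 0;

  if (dirlen + strlen (rel) + 1 > outsz)
    return -1;

  memcpy (out, base, dirlen);
  strcpy (out + dirlen, rel);
  return 0;
}
```

Calling resolve_relative with the link test/index.html and the base http://example.com/index.html yields the correct URL from step 2. Calling it with the same link and the base http://example.com/test/index.html yields the doubled test/test/ path from step 3, which is what happens when old content is re-parsed under a newer URL context.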

One obvious solution, which I've included in this email as a patch, is to clear the output document before downloading the next file. This breaks the current behaviour, because the output file would then hold only the most recently downloaded document, so maybe it's not a good idea. Is there a better solution?

Regards,

Gijs

--

index 8d4edba..502b68f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2888,7 +2888,18 @@ read_header:
         }
     }
   else
-    fp = output_stream;
+    {
+      fp = output_stream;
+      rewind (fp);
+      if (ftruncate (fileno (fp), 0) == -1)
+        {
+          logprintf (LOG_NOTQUIET, "Could not truncate output file: %s\n", strerror (errno));
+          CLOSE_INVALIDATE (sock);
+          xfree (head);
+          xfree_null (type);
+          return FOPENERR;
+        }
+    }

   /* Print fetch message, if opt.verbose.  */
   if (opt.verbose)
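
The rewind-and-truncate idea from the patch can also be tried outside Wget. The following is a minimal standalone sketch, assuming a POSIX system; reset_output is a hypothetical name chosen for this illustration and does not exist in the Wget sources.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and ftruncate() */

#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: discard everything written to FP so far, so
   that the next document starts at offset 0.  Returns 0 on success
   and -1 on failure.  */
static int
reset_output (FILE *fp)
{
  assert (fp != NULL);

  /* Push buffered data to the file first, so nothing written earlier
     can reappear after the truncation.  */
  if (fflush (fp) != 0)
    return -1;

  rewind (fp);                        /* seek back to the start */
  return ftruncate (fileno (fp), 0);  /* drop the old content */
}
```

If one document is written to a stream, reset_output is called, and a second document is written, the file ends up containing only the second document. A parser reading the file then sees a single document that matches the current URL context.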


