
Re: [Bug-wget] New wget (1.19.2): Unexpected download behaviour for gzip


From: Jens Schleusener
Subject: Re: [Bug-wget] New wget (1.19.2): Unexpected download behaviour for gzip-compressed tarballs (HTTP-header dependent)
Date: Fri, 3 Nov 2017 20:10:22 +0100 (CET)
User-agent: Alpine 2.20 (LSU 67 2015-01-07)

On Fri, 3 Nov 2017, Tim Rühsen wrote:

On 11/03/2017 06:37 AM, James Cloos wrote:
"TR" == Tim Rühsen <address@hidden> writes:

TR> I downloaded/tested thousands of web pages and they behave as if
TR> 'Content-Encoding: gzip' is a compression for the transport. Uncompressing
TR> it 'on-the-fly' and saving that uncompressed data was the correct behavior.

Lots of servers have that misconfiguration; it was recommended in the
past, and Apache defaulted to doing that when grabbing things like tar.gz.

The gui browsers had to learn to work around that misconfig.  wget also
has to.

In short, do not uncompress if the destination name has a compression
suffix.

Or, in that case, test whether the uncompressed data starts with the gzip
magic bytes: if so, keep that single decompression; if not, do none.

And the same for the other compression formats.
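
A minimal Python sketch of the heuristic described above, for illustration
only. The function names, the suffix list and the in-memory handling of the
body are assumptions made for this example; this is not how wget implements
it.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"                # first two bytes of every gzip stream
COMPRESSED_SUFFIXES = (".gz", ".tgz")   # assumed list, extend for other formats


def has_compression_suffix(name):
    """True if the destination name suggests an already-compressed file."""
    return name.lower().endswith(COMPRESSED_SUFFIXES)


def payload_to_save(body, dest_name, content_encoding):
    """Return the bytes to write to disk for a fully received response body."""
    if content_encoding.strip().lower() not in ("gzip", "x-gzip"):
        return body                      # nothing to undo
    if not has_compression_suffix(dest_name):
        return gzip.decompress(body)     # ordinary transport compression
    # Destination looks like foo.tar.gz: decompress once and inspect the result.
    inner = gzip.decompress(body)
    if inner.startswith(GZIP_MAGIC):
        return inner                     # server really compressed a second time
    return body                          # header was bogus, keep the tarball as sent
```

With this rule a foo.tar.gz sent with a spurious 'Content-Encoding: gzip' is
saved unchanged, while a tarball that was genuinely compressed a second time
loses exactly one layer.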

Thanks for this insight!

Looking at the Mozilla/Gecko sources shows that gzip Content-Encoding is
just cleared for Content-Types application/x-gzip, application/gzip and
application/x-gunzip. That makes it straightforward to go that way.
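
The Content-Type based rule could be sketched in Python roughly as follows.
This is an illustration under assumptions, not the Gecko or wget code: the
function name is hypothetical, and the headers are assumed to arrive as a
plain dict keyed by the canonical header names.

```python
# Media types for which a gzip Content-Encoding is ignored, as listed above.
GZIP_MEDIA_TYPES = {
    "application/x-gzip",
    "application/gzip",
    "application/x-gunzip",
}


def effective_content_encoding(headers):
    """Return the Content-Encoding a client should act on ('' means none)."""
    encoding = headers.get("Content-Encoding", "").strip().lower()
    # Drop parameters such as '; charset=...' before comparing the media type.
    media_type = headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
    if encoding in ("gzip", "x-gzip") and media_type in GZIP_MEDIA_TYPES:
        return ""   # the payload is already a gzip file, do not decompress
    return encoding
```

A response with 'Content-Type: application/x-gzip' and 'Content-Encoding:
gzip' would then be saved as received, while 'text/html' with the same
encoding would still be decompressed.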

At least for the gzip types, that seems to be a client-side correction of server behaviour that is incorrect according to RFC 7231, "Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content", https://tools.ietf.org/html/rfc7231#section-3.1.2.2:

   If the media type includes an inherent encoding, such as a data
   format that is always compressed, then that encoding would not be
   restated in Content-Encoding even if it happens to be the same
   algorithm as one of the content codings.  Such a content coding would
   only be listed if, for some bizarre reason, it is applied a second
   time to form the representation.  Likewise, an origin server might
   choose to publish the same data as multiple representations that
   differ only in whether the coding is defined as part of Content-Type
   or Content-Encoding, since some user agents will behave differently
   in their handling of each response (e.g., open a "Save as ..." dialog
   instead of automatic decompression and rendering of content).

Regards

Jens

