From: Tim Rühsen
Subject: Re: [Bug-wget] [PATCH] Implement --keep-badhash to keep Metalink's files with a bad hash
Date: Fri, 19 Aug 2016 12:16:15 +0200
User-agent: KMail/5.2.3 (Linux/4.6.0-1-amd64; KDE/5.23.0; x86_64; ; )

On Friday, 19 August 2016 07:34:42 CEST Matthew White wrote:
> On Thu, 18 Aug 2016 12:50:48 +0200 Tim Rühsen <address@hidden> wrote:
> > On Wednesday, 17 August 2016 17:55:47 CEST Matthew White wrote:
> > > On Thu, 04 Aug 2016 17:08:20 +0200 Tim Ruehsen <address@hidden> wrote:
> > > > On Thursday, August 4, 2016 12:40:49 PM CEST Matthew White wrote:
> > > > > On Thu, 04 Aug 2016 12:08:50 +0200 Tim Ruehsen <address@hidden> wrote:
> > > > > > On Wednesday, August 3, 2016 2:40:14 PM CEST Matthew White wrote:
> > > > > > > On Wed, 03 Aug 2016 14:05:06 +0200 Tim Ruehsen <address@hidden> wrote:
> > > > > > > > On Tuesday, August 2, 2016 11:27:28 AM CEST Matthew White wrote:
> > > > > > > > > On Mon, 01 Aug 2016 16:32:30 +0200 Tim Ruehsen <address@hidden> wrote:
> > > > > > > > > > On Saturday, July 30, 2016 9:41:48 PM CEST Matthew White wrote:
> > > > > > > > > > > Hello!
> > > > > > > > > > > 
> > > > > > > > > > > I think that sometimes it could help to keep downloaded Metalink's files which have a bad hash.
> > > > > > > > > > > 
> > > > > > > > > > > The default wget behaviour is to delete such files.
> > > > > > > > > > > 
> > > > > > > > > > > This patch provides a way to keep files which have a bad hash through the option --keep-badhash. It appends the suffix .badhash to the file name, except without overwriting existing files. In the latter case, a unique suffix is appended after .badhash.
> > > > > > > > > > > 
> > > > > > > > > > > I made this patch working on the following branch:
> > > > > > > > > > > master (latest 20cac2c5ab3d63aacfba35fb10878a2d490e2377)
> > > > > > > > > > > git://git.savannah.gnu.org/wget.git
> > > > > > > > > > > 
> > > > > > > > > > > What do you think?
> > > > > > > > > > 
> > > > > > > > > > Hi Matthew,
> > > > > > > > > > 
> > > > > > > > > > good work !
> > > > > > > > > > 
> > > > > > > > > > While your FSF assignment is underway (PM), we can continue polishing.
> > > > > > > > > > 
> > > > > > > > > > Didn't test your code yet, but from what I see, there are just these peanuts:
> > > > > > > > > > 
> > > > > > > > > > 1.
> > > > > > > > > > +  bhash = malloc (strlen (name) + strlen (".badhash") + 1);
> > > > > > > > > > +  strcat (strcpy (bhash, name), ".badhash");
> > > > > > > > > > 
> > > > > > > > > > Please consider using concat_strings() from utils.h. And please leave an empty line between var declaration and code - just for readability.
> > > > > > > > > > 
> > > > > > > > > > 2.
> > > > > > > > > > Please add 'link' and 'unlink' to bootstrap.conf. Gnulib will emulate these on platforms where they are not available (or have a slightly different behavior). I guess we simply forgot 'unlink' when we switched to gnulib.
> > > > > > > > > > 
> > > > > > > > > > 3.
> > > > > > > > > > The (GNU) commit messages should ideally look like:
> > > > > > > > > > 
> > > > > > > > > > one line of short description
> > > > > > > > > > <empty line>
> > > > > > > > > > file changes
> > > > > > > > > > <empty line>
> > > > > > > > > > long description
> > > > > > > > > > 
> > > > > > > > > > For example:
> > > > > > > > > > Add new option --keep-badhash
> > > > > > > > > > 
> > > > > > > > > > * src/init.c: Add keepbadhash
> > > > > > > > > > * src/main.c: Add keep-badhash
> > > > > > > > > > * src/options.h: Add keep_badhash
> > > > > > > > > > * doc/wget.texi: Add docs for --keep-badhash
> > > > > > > > > > * src/metalink.h: Add prototypes badhash_suffix(), badhash_or_remove()
> > > > > > > > > > * src/metalink.c: New functions badhash_suffix(), badhash_or_remove().
> > > > > > > > > >   (retrieve_from_metalink): Call badhash_or_remove()
> > > > > > > > > > 
> > > > > > > > > > With --keep-badhash, append .badhash to Metalink's files with checksum mismatch, except without overwriting existing files.
> > > > > > > > > > 
> > > > > > > > > > Without --keep-badhash, remove downloaded files with checksum mismatch (this conforms to the old behaviour).
> > > > > > > > > > 
> > > > > > > > > > [This also applies to your other patches]
> > > > > > > > > > 
> > > > > > > > > > 4.
> > > > > > > > > > Please not only write 'keepbadhash', but also what you did (e.g. remove, rename, add, ...), see my example above.
> > > > > > > > > > 
> > > > > > > > > > Those ChangeLog entries should allow finding changes / faults / regressions etc. even when a versioning system is not at hand, e.g. when unpacking the sources from a tarball. (Not debatable that consulting commit messages directly is much more powerful.)
> > > > > > > > > > 
> > > > > > > > > > 5.
> > > > > > > > > > You write "except without overwriting existing files", maybe you should mention appending a counter instead of overwriting existent files !?
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Regards, Tim
> > > > > > > > > 
> > > > > > > > > Thanks Tim!
> > > > > > > > > 
> > > > > > > > > I really needed your guidance. I sent the modified patches to address@hidden
> > > > > > > > > 
> > > > > > > > > I believe there are more things to fix.
> > > > > > > > > 
> > > > > > > > > Consider the following after applying the attached patch:
> > > > > > > > > * src/metalink.c (retrieve_from_metalink): line 124: If the download is interrupted (output_stream isn't NULL), and there are more urls for the same file (we are still in the download loop), switching to the next url should resume the download (instead of starting it over).
> > > > > > > > > 
> > > > > > > > > In this patch I added a fix to rename/remove the fully downloaded file (output_stream is NULL), but there was an error (probably checksum) and we are still in the download loop (more urls for the same file). But as said before, switching to the next url should continue the download:
> > > > > > > > > * src/metalink.c (retrieve_from_metalink): line 131: Rename/remove fully downloaded file on error
> > > > > > > > > 
> > > > > > > > > I still have to investigate the problem...
> > > > > > > > 
> > > > > > > > So, I wait with 0001-... !?
> > > > > > > > 
> > > > > > > > 0002-... has been pushed (extended with 'symlink').
> > > > > > > > 
> > > > > > > > Thanks for your contribution.
> > > > > > > > 
> > > > > > > > Tim
> > > > > > > 
> > > > > > > Ok for 0002. Thanks.
> > > > > > > 
> > > > > > > There's no problem applying the 0001.
> > > > > > 
> > > > > > Applied and pushed !
> > > > > > 
> > > > > > > The "if the stream got interrupted, then restart the download with the next url" (output_stream isn't NULL) was already there before the patch 0001.
> > > > > > > 
> > > > > > > [debug is required to know when output_stream isn't NULL]
> > > > > > > 
> > > > > > > commit 7fad76db4cdb7a0fe7e5aa0dd88f5faaf8f4cdc8
> > > > > > > * src/metalink.c (retrieve_from_metalink): line 124: 'if (output_stream)' remove file and restart download with next url
> > > > > > > 
> > > > > > > With the 0001, if --keep-badhash is used the file is renamed instead of removed, even when "output_stream is NULL".
> > > > > > > 
> > > > > > > I have to look through the "stream interrupted" situation.
> > > > > > > 
> > > > > > > I'm guessing that the download is not resumed.
> > > > > > > 
> > > > > > > What do you suggest?
> > > > > > 
> > > > > > We need a document where we define wanted behavior, create tests and amend the code.
> > > > > > 
> > > > > > Regards, Tim
> > > > > 
> > > > > Ok, I just documented a test. See the attached tarball.
> > > > > 
> > > > > Also there is a patch attached (there's a copy in the tarball too), providing the suggested bugfix.
> > > > 
> > > > I made that test without your patch => crash due to badhash_or_remove(NULL). With your patch, valgrind says:
> > > > 
> > > > ==15068== Syscall param open(filename) points to unaddressable byte(s)
> > > > ==15068==    at 0x6848F40: __open_nocancel (syscall-template.S:84)
> > > > ==15068==    by 0x67DFF7D: _IO_file_open (fileops.c:221)
> > > > ==15068==    by 0x67E009F: _IO_file_fopen@@GLIBC_2.2.5 (fileops.c:328)
> > > > ==15068==    by 0x67D4F03: __fopen_internal (iofopen.c:86)
> > > > ==15068==    by 0x434FAE: retrieve_from_metalink (metalink.c:184)
> > > > ==15068==    by 0x406C63: main (main.c:2025)
> > > > ==15068==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> > > > ==15068==
> > > 
> > > I think I nailed the above bug and the SIGSEGV mentioned here:
> > > http://lists.gnu.org/archive/html/bug-wget/2016-08/msg00052.html
> > > http://lists.gnu.org/archive/html/bug-wget/2016-08/msg00056.html
> > > 
> > > I just pushed to https://github.com/mehw/wget/tree/metalink, and further explanations are in the commit fb3cdf5b0fd9e08426682b8d7c84be1f7e343805.
> > > 
> > > For the sake of clarity, below there's an excerpt of the notes added to the commit message.
> > > 
> > > * src/metalink.c (retrieve_from_metalink):
> > >   commit a0ff0952d6fbd01a509797eb6ba31923ba6752af
> > >   [latest presenting the SIGSEGV problem]
> > >   
> > >   commit e3fb4c3859d2705fb2de80801b0bba2b64bea1a1
> > >   [discussed in the mailing list about the SIGSEGV]
> > >   
> > >   If unique_create() cannot create/open the destination file, filename
> > >   and output_stream remain NULL. If fopen() is used instead, filename
> > >   always remains NULL. Both functions cannot create "path/file" trees.
> > >   
> > >   Setting filename to the right value is sufficient to prevent SIGSEGV
> > >   generating from testing a NULL value. This also allows retrieve_url()
> > >   to create a "path/file" tree through opt.output_document.
> > >   
> > >   Reading NULL as output_stream, when it shall not be, leads to wrong
> > >   results. For instance, a non-NULL output_stream tells when a stream
> > >   was interrupted, reading NULL instead means to assume the contrary.
> > 
> > Maybe you could fix these warnings:
> > 
> > metalink.c: In function 'retrieve_from_metalink':
> > metalink.c:121:18: warning: implicit declaration of function 'metalink_check_safe_path' [-Wimplicit-function-declaration]
> > 
> >        safename = metalink_check_safe_path (filename) ? filename : basename;
> >                   ^~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > metalink.c: In function 'get_metalink_basename':
> > metalink.c:683:3: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
> > 
> >    while (name = strstr (basename, "/"))
> >    ^~~~~
> > 
> > And set the X flags for testenv/Test-metalink-xml-homepath.py
> 
> Thanks. Warnings fixed and 'x' flags set. Should I leave them as separate commits, or merge them with the commits requiring a fix?
> 
> > I think it is debatable if Test-metalink-xml-abspath.py and Test-metalink-xml-homepath.py really should do nothing. You said that libmetalink returns NULL
> 
> Test-metalink-xml-abspath.py, Test-metalink-xml-relpath.py, and Test-metalink-xml-homepath.py shall verify that only metalink:file elements with safe names are processed/downloaded/saved. If unsafe metalink:file elements are saved, such tests shall fail.
> 
> As I stated in doc/metalink.txt ('2. Notes'), saving a file to an absolute path poses a security problem. We must ensure that Wget's automated tests never modify the root.
> 
> Can you confirm the above?

ACK. As I said, we could strip the path and use the basename instead of 
ignoring the file completely. But if path components are forbidden anyway, we 
can choose to simply ignore such files (as you suggest).
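
For illustration, a minimal, untested sketch of that stripping in plain C (safe_basename() is a made-up name; wget's real get_metalink_basename() may differ):

#include <string.h>

/* Return the basename part of PATH, i.e. everything after the last
   '/', or PATH itself if it contains no slash.  Illustrative only. */
static const char *
safe_basename (const char *path)
{
  const char *slash = strrchr (path, '/');

  return slash ? slash + 1 : path;
}

Using strrchr() also avoids an assignment inside a loop condition, so the -Wparentheses warning quoted above would not come up in this variant.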

> > or an empty string as file name if the file name contains a path.
> > But shouldn't we (by default) use the file name from the URL ? And even with --trust-server-names we should take this name in case libmetalink doesn't give us a name ?
> > 
> > Regards, Tim
> 
> By design, the behaviour is the following.
> 
> 
> Metalink/XML
> ------------
> 
> bogus.meta4 is a metalink xml file (see attachment):
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <metalink xmlns="urn:ietf:params:xml:ns:metalink">
>   <file name="/dir/A/File1">
>     <size>1617</size>
>     <hash type="sha256">ecb3dff2648667513e31554b3ad054ccd89fce38e33367c9459ac3a285153742</hash>
>     <url>http://another.url/common_name</url>
>     <url>http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-001</url>
>   </file>
>   <file name="dir/B/File2">
>     <size>1594</size>
>     <hash type="sha256">eee7cd7062ab29a9e4f02924d9c367264dcb8b162703f74ff6eb8f175a91502b</hash>
>     <url>http://another.url/again/common_name</url>
>     <url>http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-002</url>
>   </file>
> </metalink>
> 
> $ wget --input-metalink=bogus.meta4

Should be --input-file=bogus.meta4 --force-metalink. --input-metalink is a 
'shortcut' not conforming to the --force options that wget has. Just saying 
this for completeness, but for this discussion it might not be relevant.

The main question is: where did this file come from ?
It looks like downloading a file via metalink requires two command lines - one 
to retrieve the .meta4 file and another to start downloading from it.
This seems basically wrong and is a very bad user experience.
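
Concretely, today that means something like (URL made up, just to show the two steps):

$ wget http://example.com/archive.gz.meta4
$ wget --input-metalink=archive.gz.meta4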

> Wget will eventually execute the following line:
> * src/main.c (main): 'meta_err = metalink_parse_file (opt.input_metalink, &metalink);'
> 
> [see bottom for type declarations]
> 
> By design, libmetalink rejects metalink:file elements with unsafe names, see https://github.com/metalink-dev/libmetalink:
> * lib/metalink_helper.c (metalink_check_safe_path)
> 
> At this point, the variable metalink contains data for "dir/B/File2" (i.e. *metalink->files[0]), but not for "/dir/A/File1" (no more *metalink->files).
> 
> There will never be *metalink->files with an empty or NULL name, therefore there could be no *metalink->files at all.
> 
> Even when all metalink:file elements are rejected, i.e. no *metalink->files, if metalink_parse_file() parsed the xml file without error, wget exits without error.
> 
> Wget is expected to save each metalink:file element using its unique "name" field as file name: https://tools.ietf.org/html/rfc5854#section-4.1.2.1
> 
> How do you suggest we implement --trust-server-names in the above scenario?

Starting from a local .meta4 file, we could use the original name without 
.meta4 extension. E.g. wget --input-metalink archive.gz.meta4 creates 
'archive.gz'. --trust-server-names would use the names from the metalink file.

Starting with a remote URL that returns 'application/metalink4+xml' Content-
type, that content should be parsed as metalink file and the described file(s) 
downloaded. In this case we should take the file name from the original URL - 
if two or more files are described, they should get the .1 .2 etc extensions. 
This is for safety, e.g. not to implicitly download executable files/scripts. 
In case --trust-server-names is given, the user says we should use the names 
from the metalink description, overriding the default security mechanism.
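
As a rough, untested sketch of that naming rule (xstrdup()/aprintf()/xfree() are wget's helpers from utils.c/xmalloc.c; default_output_name() itself is made up, and numbering only the second and later files is just one plausible reading):

#include <string.h>

/* Strip a trailing ".meta4" from the metalink file's name and append
   .1, .2, ... for the second and further files it describes. */
static char *
default_output_name (const char *meta_name, int file_index)
{
  char *base = xstrdup (meta_name);
  char *ext = strrchr (base, '.');

  if (ext && !strcmp (ext, ".meta4"))
    *ext = '\0';                /* archive.gz.meta4 -> archive.gz */

  if (file_index > 0)           /* second file onwards gets .1, .2, ... */
    {
      char *numbered = aprintf ("%s.%d", base, file_index);
      xfree (base);
      return numbered;
    }

  return base;
}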

> 
> Let's say that --trust-server-names saves the metalink:file elements using
> their url's file name.
> 
> What should happen when different metalink:file elements, pointing to
> different locations/contents, end up all having the same url's file name,
> i.e. common_name in bogus.meta4?

They should have extensions .1 .2 .3 etc.

> Also, which file name should be written when a metalink:file has urls (mirrors) with different file names, and one of those urls fails (and the next one is used)?

The URLs of the mirrors should not change the file name at all. They are just 
mirrors.

> How do you choose the destination file name when a metalink:file uses its
> url sources in parallel? https://tools.ietf.org/html/rfc5854#section-1
> 
> I see the risk of making --trust-server-names/--input-metalink operate in the same way as --input-file, ending up interpreting a metalink xml file as a list of urls, each one to be saved using its own url's file name.

Basically yes, but see above.

> 
> WDYT?
> 
> 
> Metalink/HTTP
> -------------
> 
> $ wget --metalink-over-http http://127.0.0.1/dir/sillyname.ext
> 
> [I'm using a local server]
> 
> Suppose the server answers with a header containing the following:
> Link: http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-001; rel=duplicate; pref; pri=2
> Link: http://another.url/common_name; rel=duplicate; pref; pri=1
> Digest: SHA-256=7LPf8mSGZ1E+MVVLOtBUzNifzjjjM2fJRZrDooUVN0I=
> 
> When opt.output_document and opt.content_disposition aren't specified, the following line is reached, and 'u' and 'original_url' are the same:
> * src/http.c (http_loop): 'url_file_name (opt.trustservernames ? u : original_url, NULL);'
> 
> The function url_file_name() combines the url's file name with any
> "Directory Options".
> 
> By design, the cli's url is used, i.e. the file name is 'sillyname.ext', and
> the "Directory Options" apply to http://127.0.0.1/dir/.

That seems correct (the user gets what she requested).

> How do you suggest we implement --trust-server-names in the above scenario?

It's just mirrors - the name should IMO stay sillyname.ext.

But this is not a real world example - a real metalink server would either 
redirect or answer with 200 + application/metalink4+xml + metalink body.

Example:
wget -d http://openoffice.mirrorbrain.org/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2

HTTP/1.1 302 Found
Date: Fri, 19 Aug 2016 09:26:29 GMT
Server: Apache/2.2.22 (Linux/SUSE) mod_ssl/2.2.22 OpenSSL/1.0.0k DAV/2 SVN/1.7.5 mod_wsgi/3.3 Python/2.7.2 mod_asn/1.6 mod_mirrorbrain/2.18.0 mod_fcgid/2.3.6
X-Prefix: 91.0.0.0/10
X-AS: 3320
X-MirrorBrain-Mirror: ftp5.gwdg.de
X-MirrorBrain-Realm: country
Link: <http://openoffice.mirrorbrain.org/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2.meta4>; rel=describedby; type="application/metalink4+xml"
Link: <http://openoffice.mirrorbrain.org/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2.torrent>; rel=describedby; type="application/x-bittorrent"
Link: <http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2>; rel=duplicate; pri=1; geo=de
Link: <http://ftp.uasw.edu/pub/openoffice.org/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2>; rel=duplicate; pri=2; geo=de
Link: <http://ftp-stud.hs-esslingen.de/pub/Mirrors/ftp.openoffice.org/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2>; rel=duplicate; pri=3; geo=de
Link: <http://sunsite.informatik.rwth-aachen.de/ftp/pub/mirror/OpenOffice/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2>; rel=duplicate; pri=4; geo=de
Link: <http://ftp.tu-chemnitz.de/pub/openoffice-extended/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2>; rel=duplicate; pri=5; geo=de
Digest: MD5=onY2YSLrIbXofSyX14J61Q==
Digest: SHA=Zkg18VkCplq7L+x+tImnMyET/tc=
Digest: SHA-256=ftMOLpU/hW43Xcb8LqoZgHmCnL0iGM8C/HIo6KXqgkY=
Location: http://ftp5.gwdg.de/pub/openoffice/stable/3.3.0/OOo_3.3.0_src_extensions.tar.bz2
Content-Length: 264
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Now it is the application's choice to download in one chunk (and use the 
Digest header to check for correctness) or to download the .meta4 description 
first - to take full advantage of chunked download. (That is how wget2 works - 
the single chunks are even downloaded from different mirrors in parallel to 
distribute the load.)
So, if possible, wget should use the HTTP answer just as 'trampoline' to fetch 
the metalink description. This allows for much better error control and 
automatic continuation of stopped downloads.
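
A minimal sketch of the Content-Type check such a decision would hinge on (self-contained, not wget's actual code):

#include <string.h>
#include <strings.h>   /* strncasecmp */
#include <stdbool.h>

/* Treat a response as a metalink description when its Content-Type is
   application/metalink4+xml, ignoring parameters such as charset. */
static bool
is_metalink_content_type (const char *ctype)
{
  static const char mtype[] = "application/metalink4+xml";

  return ctype && strncasecmp (ctype, mtype, sizeof mtype - 1) == 0;
}

The same kind of check would apply to a 'Link: <...>; rel=describedby; type="application/metalink4+xml"' header before following it as the trampoline.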

> 
> Let's say that --trust-server-names saves the file using one of its Link
> url's file name.
> 
> To which url should the "Directory Options" apply? To the original cli's url, or to the current mirror's url?
> 
> Also, if a Link fails and the download switches to the next mirror, should
> the file name be changed accordingly?
> 
> And what if a Link redirects to another url with a different file name?
> 
> How do you choose the destination file name when mirror Link header fields
> are used in parallel? https://tools.ietf.org/html/rfc6249#section-7
> 
> WDYT?

Tim

