bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-wget] [PATCH] Implement --keep-badhash to keep Metalink's files


From: Matthew White
Subject: Re: [Bug-wget] [PATCH] Implement --keep-badhash to keep Metalink's files with a bad hash
Date: Fri, 19 Aug 2016 07:34:42 +0200

On Thu, 18 Aug 2016 12:50:48 +0200
Tim Rühsen <address@hidden> wrote:

> On Mittwoch, 17. August 2016 17:55:47 CEST Matthew White wrote:
> > On Thu, 04 Aug 2016 17:08:20 +0200
> > 
> > Tim Ruehsen <address@hidden> wrote:
> > > On Thursday, August 4, 2016 12:40:49 PM CEST Matthew White wrote:
> > > > On Thu, 04 Aug 2016 12:08:50 +0200
> > > > 
> > > > Tim Ruehsen <address@hidden> wrote:
> > > > > On Wednesday, August 3, 2016 2:40:14 PM CEST Matthew White wrote:
> > > > > > On Wed, 03 Aug 2016 14:05:06 +0200
> > > > > > 
> > > > > > Tim Ruehsen <address@hidden> wrote:
> > > > > > > On Tuesday, August 2, 2016 11:27:28 AM CEST Matthew White wrote:
> > > > > > > > On Mon, 01 Aug 2016 16:32:30 +0200
> > > > > > > > 
> > > > > > > > Tim Ruehsen <address@hidden> wrote:
> > > > > > > > > On Saturday, July 30, 2016 9:41:48 PM CEST Matthew White 
> wrote:
> > > > > > > > > > Hello!
> > > > > > > > > > 
> > > > > > > > > > I think that sometimes it could help to keep downloaded
> > > > > > > > > > Metalink's
> > > > > > > > > > files
> > > > > > > > > > which have a bad hash.
> > > > > > > > > > 
> > > > > > > > > > The default wget behaviour is to delete such files.
> > > > > > > > > > 
> > > > > > > > > > This patch provides a way to keep files which have a bad
> > > > > > > > > > hash
> > > > > > > > > > through
> > > > > > > > > > the
> > > > > > > > > > option --keep-badhash. It appends the suffix .badhash to the
> > > > > > > > > > file
> > > > > > > > > > name,
> > > > > > > > > > except without overwriting existing files. In the latter
> > > > > > > > > > case,
> > > > > > > > > > an
> > > > > > > > > > unique
> > > > > > > > > > suffix is appended after .badhash.
> > > > > > > > > > 
> > > > > > > > > > I made this patch working on the following branch:
> > > > > > > > > > master (latest 20cac2c5ab3d63aacfba35fb10878a2d490e2377)
> > > > > > > > > > git://git.savannah.gnu.org/wget.git
> > > > > > > > > > 
> > > > > > > > > > What do you think?
> > > > > > > > > 
> > > > > > > > > Hi Matthew,
> > > > > > > > > 
> > > > > > > > > good work !
> > > > > > > > > 
> > > > > > > > > While your FSF assignment is underway (PM), we can continue
> > > > > > > > > polishing.
> > > > > > > > > 
> > > > > > > > > Didn't test your code yet,  but from what I see, there are
> > > > > > > > > just
> > > > > > > > > these
> > > > > > > > > peanuts:
> > > > > > > > > 
> > > > > > > > > 1.
> > > > > > > > > +  bhash = malloc (strlen (name) + strlen (".badhash") + 1);
> > > > > > > > > +  strcat (strcpy (bhash, name), ".badhash");
> > > > > > > > > 
> > > > > > > > > Please consider using concat_strings() from util.h.
> > > > > > > > > And please leave an empty line between var declaration and
> > > > > > > > > code -
> > > > > > > > > just
> > > > > > > > > for
> > > > > > > > > readability.
> > > > > > > > > 
> > > > > > > > > 2.
> > > > > > > > > Please add 'link' and 'unlink' to bootstrap.conf. Gnulib will
> > > > > > > > > emulate
> > > > > > > > > these on platforms where they are not available (or have a
> > > > > > > > > slightly
> > > > > > > > > different behavior). I guess we simply forgot 'unlink' when we
> > > > > > > > > switched
> > > > > > > > > to gnulib.
> > > > > > > > > 
> > > > > > > > > 3.
> > > > > > > > > The (GNU) commit messages should ideally look like:
> > > > > > > > > 
> > > > > > > > > one line of short description
> > > > > > > > > <empty line>
> > > > > > > > > file changes
> > > > > > > > > <empty line>
> > > > > > > > > long description
> > > > > > > > > 
> > > > > > > > > For example:
> > > > > > > > > Add new option --keep-badhash
> > > > > > > > > 
> > > > > > > > > * src/init.c: Add keepbadhash
> > > > > > > > > * src/main.c: Add keep-badhash
> > > > > > > > > * src/options.h: Add keep_badhash
> > > > > > > > > * doc/wget.texi: Add docs for --keep-badhash
> > > > > > > > > * src/metalink.h: Add prototypes badhash_suffix(),
> > > > > > > > > badhash_or_remove()
> > > > > > > > > * src/metalink.c: New functions badhash_suffix(),
> > > > > > > > > badhash_or_remove().
> > > > > > > > > 
> > > > > > > > >   (retrieve_from_metalink): Call append .badhash()
> > > > > > > > > 
> > > > > > > > > With --keep-badhash, append .badhash to Metalink's files with
> > > > > > > > > checksum
> > > > > > > > > mismatch, except without overwriting existing files.
> > > > > > > > > 
> > > > > > > > > Without --keep-badhash, remove downloaded files with checksum
> > > > > > > > > mismatch
> > > > > > > > > (this conforms to the old behaviour).
> > > > > > > > > 
> > > > > > > > > [This also applies to to your other patches]
> > > > > > > > > 
> > > > > > > > > 4.
> > > > > > > > > Please not only write 'keepbadhash', but also what you did
> > > > > > > > > (e.g.
> > > > > > > > > remove,
> > > > > > > > > rename, add, ...), see my example above.
> > > > > > > > > 
> > > > > > > > > Those ChangeLog entries should allow finding changes / faults
> > > > > > > > > /
> > > > > > > > > regressions
> > > > > > > > > etc. even when a versioning system is not at hand, e.g. when
> > > > > > > > > unpacking
> > > > > > > > > the
> > > > > > > > > sources from a tarball. (Not debatable that consulting commit
> > > > > > > > > messages
> > > > > > > > > directly is much more powerful.)
> > > > > > > > > 
> > > > > > > > > 5.
> > > > > > > > > You write "except without overwriting existing files", maybe
> > > > > > > > > you
> > > > > > > > > should
> > > > > > > > > mention appending a counter instead of overwriting existent
> > > > > > > > > files
> > > > > > > > > !?
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Regards, Tim
> > > > > > > > 
> > > > > > > > Thanks Tim!
> > > > > > > > 
> > > > > > > > I really needed your guidance. I sent the modified patches to
> > > > > > > > address@hidden
> > > > > > > > 
> > > > > > > > I believe there are more things to fix.
> > > > > > > > 
> > > > > > > > Consider the following after applying the attached patch:
> > > > > > > > * src/metalink.c (retrieve_from_metalink): line 124: If the
> > > > > > > > download
> > > > > > > > is
> > > > > > > > interrupted (output_stream isn't NULL), and there are more urls
> > > > > > > > for
> > > > > > > > the
> > > > > > > > same file (we are still in the download loop), switching to the
> > > > > > > > next
> > > > > > > > url
> > > > > > > > should resume the download (instead than start it over).
> > > > > > > > 
> > > > > > > > In this patch I added a fix to rename/remove the fully
> > > > > > > > downloaded
> > > > > > > > file
> > > > > > > > (output_stream is NULL), but there was an error (probably
> > > > > > > > checksum)
> > > > > > > > and
> > > > > > > > we
> > > > > > > > are still in the download loop (more urls for the same file).
> > > > > > > > But as
> > > > > > > > said
> > > > > > > > before, switching to the next url should continue the download:
> > > > > > > > *
> > > > > > > > src/metalink.c (retrieve_from_metalink): line 131: Rename/remove
> > > > > > > > fully
> > > > > > > > downloaded file on error
> > > > > > > > 
> > > > > > > > I still have to investigate the problem...
> > > > > > > 
> > > > > > > So, I wait with 0001-... !?
> > > > > > > 
> > > > > > > 0002-... has been pushed (extended with 'symlink').
> > > > > > > 
> > > > > > > Thanks for your contribution.
> > > > > > > 
> > > > > > > Tim
> > > > > > 
> > > > > > Ok for 0002. Thanks.
> > > > > > 
> > > > > > There's no problem applying the 0001.
> > > > > 
> > > > > Applied and pushed !
> > > > > 
> > > > > > The "if the stream got interrupted, then restart the download with
> > > > > > the
> > > > > > next
> > > > > > url" (output_stream isn't NULL) was already there before the patch
> > > > > > 0001.
> > > > > > 
> > > > > > [debug is required to know when output_stream isn't NULL]
> > > > > > 
> > > > > > commit 7fad76db4cdb7a0fe7e5aa0dd88f5faaf8f4cdc8
> > > > > > * src/metalink.c (retrieve_from_metalink): line 124: 'if
> > > > > > (output_stream)'
> > > > > > remove file and restart download with next url
> > > > > > 
> > > > > > With the 0001, if --keep-badhash is used the file is renamed instead
> > > > > > than
> > > > > > removed, even when "output_stream is NULL".
> > > > > > 
> > > > > > I have to look through the "stream interrupted" situation.
> > > > > > 
> > > > > > I'm guessing that the download is not resumed.
> > > > > > 
> > > > > > What do you suggest?
> > > > > 
> > > > > We need a  document where we define wanted behavior, create tests and
> > > > > amend the code.
> > > > > 
> > > > > Regards, Tim
> > > > 
> > > > Ok, I just documented a test. See the attached tarball.
> > > > 
> > > > Also there is a patch attached (there's a copy in the tarball too),
> > > > providing the suggested bugfix.
> > > 
> > > I made that test without your patch => crash due to
> > > badhash_or_remove(NULL). With your patch, valgrind says:
> > > 
> > > ==15068== Syscall param open(filename) points to unaddressable byte(s)
> > > ==15068==    at 0x6848F40: __open_nocancel (syscall-template.S:84)
> > > ==15068==    by 0x67DFF7D: _IO_file_open (fileops.c:221)
> > > ==15068==    by 0x67E009F: _IO_file_fopen@@GLIBC_2.2.5 (fileops.c:328)
> > > ==15068==    by 0x67D4F03: __fopen_internal (iofopen.c:86)
> > > ==15068==    by 0x434FAE: retrieve_from_metalink (metalink.c:184)
> > > ==15068==    by 0x406C63: main (main.c:2025)
> > > ==15068==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> > > ==15068==
> > 
> > I think I nailed the above bug and the SIGSEGV mentioned here:
> > http://lists.gnu.org/archive/html/bug-wget/2016-08/msg00052.html
> > http://lists.gnu.org/archive/html/bug-wget/2016-08/msg00056.html
> > 
> > I just pushed to https://github.com/mehw/wget/tree/metalink, and further
> > explanations are in the commit fb3cdf5b0fd9e08426682b8d7c84be1f7e343805.
> > 
> > For the sake of clarity, below there's an extemp of the notes added to the
> > commit message.
> > 
> > * src/metalink.c (retrieve_from_metalink):
> > 
> >   commit a0ff0952d6fbd01a509797eb6ba31923ba6752af
> >   [latest presenting the SIGSEGV problem]
> > 
> >   commit e3fb4c3859d2705fb2de80801b0bba2b64bea1a1
> >   [discussed in the mailing list about the SIGSEGV]
> > 
> >   If unique_create() cannot create/open the destination file, filename
> >   and output_stream remain NULL. If fopen() is used instead, filename
> >   always remains NULL. Both functions cannot create "path/file" trees.
> > 
> >   Setting filename to the right value is sufficient to prevent SIGSEGV
> >   generating from testing a NULL value. This also allows retrieve_url()
> >   to create a "path/file" tree through opt.output_document.
> > 
> >   Reading NULL as output_stream, when it shall not be, leads to wrong
> >   results. For instance, a non-NULL output_stream tells when a stream
> >   was interrupted, reading NULL instead means to assume the contrary.
> 
> Maybe you could fix these warnings:
> 
> metalink.c: In function 'retrieve_from_metalink':
> metalink.c:121:18: warning: implicit declaration of function 
> 'metalink_check_safe_path' [-Wimplicit-function-declaration]
>        safename = metalink_check_safe_path (filename) ? filename : basename;
>                   ^~~~~~~~~~~~~~~~~~~~~~~~
> metalink.c: In function 'get_metalink_basename':
> metalink.c:683:3: warning: suggest parentheses around assignment used as 
> truth 
> value [-Wparentheses]
>    while (name = strstr (basename, "/"))
>    ^~~~~
> 
> And set the X flags for
> testenv/Test-metalink-xml-homepath.py

Thanks. Warnings fixed and 'x' flags set. Should I leave them as separate 
commits, or merge them with the commits requiring a fix?

> 
> I think it is debatable if Test-metalink-xml-abspath.py and Test-metalink-xml-
> homepath.py really should do nothing. You said that libmetalink returns NULL 

Test-metalink-xml-abspath.py, Test-metalink-xml-relpath.py, and 
Test-metalink-xml-homepath.py shall verify that only metalink:file elements 
with safe names are processed/downloaded/saved. If unsafe metalink:file 
elements are saved, such tests shall fail.

As I stated in doc/metalink.txt ('2. Notes'), saving a file to an absolute path 
poses a security problem. We must ensure that Wget's automated tests never 
modify the root.

Can you confirm the above?

> or an empty string as file name if the file name contains a path.
> But shouldn't we (by default) use the file name from the URL ? And even with 
> --
> trust-server-names we should take this name in case libmetalink doesn't give 
> us a name ?
> 
> Regards, Tim

By design, the behaviour is the following.


Metalink/XML
------------

bugus.meta4 is a metalink xml file (see attachment):

<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="/dir/A/File1">
    <size>1617</size>
    <hash 
type="sha256">ecb3dff2648667513e31554b3ad054ccd89fce38e33367c9459ac3a285153742</hash>
    <url>http://another.url/common_name</url>
    <url>http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-001</url>
  </file>
  <file name="dir/B/File2">
    <size>1594</size>
    <hash 
type="sha256">eee7cd7062ab29a9e4f02924d9c367264dcb8b162703f74ff6eb8f175a91502b</hash>
    <url>http://another.url/again/common_name</url>
    <url>http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-002</url>
  </file>
</metalink>

$ wget --input-metalink=bogus.meta4

Wget will eventually execute the following line:
* src/main.c (main): 'meta_err = metalink_parse_file (opt.input_metalink, 
&metalink);'

[see bottom for type declarations]

By design, libmetalink rejects metalink:file elements with unsafe names, see 
https://github.com/metalink-dev/libmetalink:
* lib/metalink_helper.c (metalink_check_safe_path)

At this point, the variable metalink contains data for "dir/B/File2" (i.e. 
*metalink->files[0]), but not for "/dir/A/File1" (no more *metalink->files).

There will never be *metalink->files with an empty or NULL name, thereof there 
could be no *metalink->files at all.

Even when all metalink:file elements are rejected, i.e. no *metalink->files, if 
metalink_parse_file() parsed the xml file without error, wget exits without 
error.

Wget is expected to save each metalink:file element using its unique "name" 
field as file name:
https://tools.ietf.org/html/rfc5854#section-4.1.2.1

How do you suggest to implement --trust-server-names in the above scenario?

Let's say that --trust-server-names saves the metalink:file elements using 
their url's file name.

What should happen when different metalink:file elements, pointing to different 
locations/contents, end up all having the same url's file name, i.e. 
common_name in bogus.meta4?

Also, which file name should be wrote when a metalink:file has urls (mirrors) 
with different file names, and one of those urls fail (and the next one is 
used)?

How do you choose the destination file name when a metalink:file uses its url 
sources in parallel?
https://tools.ietf.org/html/rfc5854#section-1

I see the risk of making --trust-server-names/--input-metalink to operate in 
the same way of --input-file, ending up interpreting a metalink xml file as a 
list of urls, each one to be saved using its own url's file name.

WDYT?


Metalink/HTTP
-------------

$ wget --metalink-over-http http://127.0.0.1/dir/sillyname.ext

[I'm using a local server]

Suppose the server answers with a header containing the following:
Link: http://ftpmirror.gnu.org/bash/bash-4.3-patches/bash43-001; rel=duplicate; 
pref; pri=2
Link: http://another.url/common_name; rel=duplicate; pref; pri=1
Digest: SHA-256=7LPf8mSGZ1E+MVVLOtBUzNifzjjjM2fJRZrDooUVN0I=

When opt.output_document and opt.content_disposition aren't specified, the 
following line is reached, and 'u' and 'original_url' are the same:
* src/http.c (http_loop): 'url_file_name (opt.trustservernames ? u : 
original_url, NULL);'

The function url_file_name() combines the url's file name with any "Directory 
Options".

By design, the cli's url is used, i.e. the file name is 'sillyname.ext', and 
the "Directory Options" apply to http://127.0.0.1/dir/.

How do you suggest to implement --trust-server-names in the above scenario?

Let's say that --trust-server-names saves the file using one of its Link url's 
file name.

To which url the "Directory Options" should apply? To the original cli's url, 
or to the currrent mirror's url?

Also, if a Link fails and the download switches to the next mirror, should the 
file name be changed accordingly?

And what if a Link redirects to another url with a different file name?

How do you choose the destination file name when mirror Link header fields are 
used in parallel?
https://tools.ietf.org/html/rfc6249#section-7

WDYT?


Type declarations
-----------------

See https://github.com/metalink-dev/libmetalink:
* lib/includes/metalink/metalink_types.h

typedef struct _metalink {
  /* list of metalink_file_t */
  metalink_file_t** files;
  ...
} metalink_t;

typedef struct _metalink_file {
  /* filename, null terminated string */
  char* name;
  /* list of metalink_resource_t */
  metalink_resource_t** resources;
  ...
} metalink_file_t;

typedef struct _metalink_resource {
  /* url, null terminated string */
  char* url;
  ...
} metalink_resource_t;

metalink_t *metalink;


Regards,
Matthew

-- 
Matthew White <address@hidden>

Attachment: bogus.meta4
Description: application/metalink4

Attachment: pgp8qlzdqfNng.pgp
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]