
From: Karl O. Pinc
Subject: [Bug-wget] wget 1.18-5+deb9u1 with --hsts -E -k fails
Date: Wed, 18 Apr 2018 17:01:43 -0500

Hello,

This is not a well-isolated bug report; sorry.  I figure
you'd rather hear _something_ than nothing.

I'm using wget 1.18-5+deb9u1, which is 1.18
on Debian Stretch (9.4).

I can't say I'm certain that there is even a bug,
although there is a functionality problem at some
level.

Three different problems, from simplest and most trivial
to most complex:

1) The wget Savannah page makes no mention of
version 1.19 in the news section.

2) The situation described below (with --adjust-extension)
produces a "doc/guide" directory and a "doc/guide.1.html"
file.  It would be nice if the file were instead named
"doc/guide.html", without the ".1".  (There is no such
file.)

3)  I am mirroring a site where URL paths ending
in "/" deliver pages, but there are additional, longer
URLs that extend those paths.  So --adjust-extension (-E)
is required so that wget can write an "index file",
ending in ".html", and create directories to hold the
additional content.

I am also using --convert-links (-k) so as to have relative
links in the downloaded material.

The problem is that when I use --hsts I get (sometimes,
but consistently for particular URLs)
a "foo/" directory, a "foo.1.html" file containing
some converted links, and a "foo.html" file without
converted links.  FYI, "foo" is reached by following links
"upwards" in the URL path from the URL targeted for
mirroring.  The downloaded, --convert-ed material contains
some links to "foo.1.html" and some to "foo.html".

When using --no-hsts I get 301 (Moved Permanently)
redirects from the mirrored site to https pages (and,
it seems, in this particular case to https pages on the
target top-level domain).  I then have no problems with
the --convert-ed data.

With --hsts I get some pages on other sub-domains
of the target domain, FYI.  This is not obviously
related to the problem.

Now, for the specifics.  Apologies that the
example is not clean and the site it hits may change
in ways that make the problem not reproducible.

The goal is to mirror the Yii 1.1 reference documentation
and user guide.  The command which "works" is:

wget --no-hsts --directory-prefix mirror --timestamping -F \
  --no-remove-listing --domains=www.yiiframework.com,yiiframework.com \
  --regex-type=pcre \
  --reject-regex='^https?://www\.yiiframework\.com/(?:(?:forum)|(?:wiki)|(?:user)|(?:extension)|(?:doc-2\.0)|(?:doc/(?:(?:(?:(?:guide)|(?:api))/(?:1|2)\.0)|(?:guide/1\.1/(?:(?:de)|(?:es)|(?:fr)|(?:he)|(?:id)|(?:it)|(?:ja)|(?:pl)|(?:pt)|(?:pt-br)|(?:ro)|(?:ru)|(?:sv)|(?:uk)|(?:zh-cn)))|(?:download/yii-.*-2\.0)|(?:blog)))|(?:news)|(?:blog)|(?:team)|(?:user)|(?:badge))' \
  --adjust-extension --recursive --level inf --convert-links \
  --page-requisites --span-hosts --no-clobber \
  https://www.yiiframework.com/doc/guide/1.1/en

Some notes:

I happen to know that the guide contains links to the API
docs, and all the API docs cross reference each other, 
so I mirrored the guide and picked up the API docs as well.

The above command downloads 521 files comprising 43MB. (!)
Sorry.

-F probably does nothing, but I included it because
that's what I ran with.

Leaving off the --no-hsts I get:

mirror/
  www.yiiframework.com/
    doc/
      api/
      api.1.html
      api.html
      guide/
        1.1/
          en/
          en.1.html
          en.html
      guide.1.html
      terms/

As noted, "en.html" and "api.html" contain unconverted links, and
some downloaded content links to these files.  I _think_ these get
created late in the download.

With --no-hsts (as in the command above) I get:

mirror/
  www.yiiframework.com/
    doc/
      api/
      api.html
      guide/
        1.1/
          en/
          en.html
      guide.1.html
      terms/

FYI: I first tried using multiple --reject-regex arguments, but
this did not seem to work.  The docs are not clear about whether
multiple --reject-regex arguments are allowed, so I wrote a single
regex.  A note in the documentation about this might be helpful.
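As an aside, one way to sanity-check a combined --reject-regex
locally, before committing to a long crawl, is to run candidate URLs
through grep with PCRE support (GNU grep's -P; I'm assuming it is
available).  The pattern below is a shortened fragment of the real
one, for illustration only:

```shell
# Shortened, illustrative fragment of the real --reject-regex (PCRE).
pattern='^https?://www\.yiiframework\.com/(?:forum|wiki|news)'

for url in \
    'https://www.yiiframework.com/forum' \
    'https://www.yiiframework.com/doc/guide/1.1/en'
do
    # grep -qP: quiet match using Perl-compatible regex (GNU grep)
    if printf '%s\n' "$url" | grep -qP "$pattern"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done
```

URLs printed as "reject" would be skipped by the crawl; "keep" URLs
would be fetched.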

I hope that the above is useful.

Regards,

Karl <address@hidden>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


