bug-wget

Re: [Bug-wget] How to prevent .1.html numbering of downloaded file?


From: B Wooster
Subject: Re: [Bug-wget] How to prevent .1.html numbering of downloaded file?
Date: Fri, 28 Nov 2014 16:38:43 -0500

OK, some more info after some debugging.

It looks like the problem is in the unique_name function. At that point wget
does not know about --adjust-extension, so it always checks for the name
without the extension. Depending on the order in which things are queued,
this gives correct or incorrect behavior. Does anyone know whether this is an
existing issue, and whether there is a known workaround? I can change wget
locally if necessary, and will likely do that once I figure it out.

So if things are queued like this, it is all fine:
article  (will save to article.html but calls unique_name with just
"article" which luckily does not exist)
article/post.html (will save to article/post.html, creating directory
article)

but this will mess it up:
article/post.html (will save to article/post.html)
article  (will save to article.html but calls unique_name with just
"article" which by now exists).
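
To make the ordering effect concrete, here is a small Python simulation of the behavior described above. It is only a sketch, not wget's actual C code: the set `existing` stands in for the target directory, `unique_name` models the "try name, then name.1, name.2, ..." behavior, and `simulate` is a hypothetical helper that checks the name first and appends the .html suffix afterwards.

```python
import posixpath


def unique_name(name, existing):
    """Simplified model: return name if unused, else name.1, name.2, ..."""
    if name not in existing:
        return name
    n = 1
    while f"{name}.{n}" in existing:
        n += 1
    return f"{name}.{n}"


def simulate(queue):
    """Save each queued path in order and return the local file names."""
    existing = set()  # stand-in for files and directories on disk
    saved = []
    for path in queue:
        # The uniqueness check sees the name WITHOUT the .html suffix.
        local = unique_name(path, existing)
        if not local.endswith(".html"):
            local += ".html"  # stand-in for --adjust-extension
        # Saving a file also creates its parent directories.
        parent = posixpath.dirname(local)
        while parent:
            existing.add(parent)
            parent = posixpath.dirname(parent)
        existing.add(local)
        saved.append(local)
    return saved


print(simulate(["article", "article/post.html"]))
print(simulate(["article/post.html", "article"]))
```

In this model the first order yields article.html and the second yields article.1.html, because the directory "article" already exists when the bare name is checked.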

Sorting the queue (but then it is no longer a queue!) or, better still,
calling unique_name after --adjust-extension has added its suffix would
fix this. Does anyone have any tips?
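
A rough Python sketch of the second idea, again as a simplified model rather than a patch against wget: `local_name_suffix_first` is a hypothetical helper that appends the suffix first and only then runs the uniqueness check, and `existing` is an assumed stand-in for what is already on disk.

```python
def unique_name(name, existing):
    """Simplified model: return name if unused, else name.1, name.2, ..."""
    if name not in existing:
        return name
    n = 1
    while f"{name}.{n}" in existing:
        n += 1
    return f"{name}.{n}"


def local_name_suffix_first(path, existing):
    """Append the .html suffix first, then check for uniqueness."""
    candidate = path if path.endswith(".html") else path + ".html"
    return unique_name(candidate, existing)


# State after article/post.html has been saved: the directory "article" exists.
existing = {"article", "article/post.html"}
print(local_name_suffix_first("article", existing))
```

With the check moved after the suffix, the existing "article" directory no longer collides with the file name in this model. A real change would still need to decide what to do when the suffixed name itself is already taken.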



On Fri, Nov 28, 2014 at 2:11 PM, B Wooster <address@hidden> wrote:

> This only happens to some of my downloads - presumably there was a
> conflict that caused it to name something .1.html? But I can't see any
> reason for it in the log file.
>
> Example of downloaded files:
> albums/
> albums.1.html
> article/
> article.html
> band/
> band.html
> blog/
> blog.1.html
> etc
>
> I don't see any mention of an albums.html in the log, just the albums.1.html
>
> This was done for a fresh wget download, nothing in target directory.
> wget --recursive --page-requisites --timestamping --level=9
> --exclude-directories=/cgi-bin,/files,/fonts --adjust-extension
> --execute=robots=off --convert-links -P tmp.wget
> '--reject-regex=(.*/email.html)' -o log1 http://www.example.com/
>
> I'm trying to make a local archive of a local Drupal site, and can deal
> with the .html suffix, but cannot handle a .1 or .2 etc suffix... for now,
> am just trying to understand why it added .1 to some files above and not
> all.
>
> It seems running it a bunch of times gets different files with the number
> - sometimes I do get blog.html instead of blog.1.html (but that may be due
> to other reasons, downloading a partial site.)
>
>

