Re: [Bug-wget] Miscellaneous thoughts & concerns


From: Darshit Shah
Subject: Re: [Bug-wget] Miscellaneous thoughts & concerns
Date: Sun, 8 Apr 2018 17:11:09 +0200
User-agent: NeoMutt/20180323

* Jeffrey Fetterman <address@hidden> [180408 04:53]:
> Yes! Multiplexing was indeed partially the culprit, I've changed it
> to --http2-request-window=5
> 
> However the download queue (AKA 'Todo') still gets enormous. It's why I was
> wanting to use non-verbose mode in the first place, screens and screens of
> 'Adding url:'. There should really be a limit on how many urls it adds!
> 
The URLs are added up front because of the way Wget traverses links: it only adds
them to the download queue, it doesn't start downloading them all at once. When
Wget parses a page and finds links on it, it will of course add them to the
queue. What else would you expect Wget to do?
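
If the sheer length of that listing is the real annoyance, you can cap the
number of concurrent HTTP/2 streams and filter the queue lines out of the output
yourself. A rough sketch only, assuming the log prefix is exactly the
'Adding url:' text you quoted and with example.com standing in for your site:

    # Keep multiplexing modest and hide the per-URL queue messages;
    # warnings and errors still come through.
    wget2 --mirror --http2-request-window=5 https://example.com 2>&1 \
        | grep -v 'Adding url:'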

> Darshit, as it stands it doesn't look like --force-progress does anything
> because --progress=bar forces the same non-verbose mode, and
> --force-progress is meant to be something used in non-verbose mode.
> 
> However, the progress bar is still really... not useful. See here:
> https://i.imgur.com/KvbGmKe.png
> 
> It's a single bar displaying a nonsense percentage, and it sounds like with
> multiplexing there's supposed to be, by default, 30 transfers going
> concurrently.
> 
Yes, I am aware of this. Sadly, Wget is developed entirely through volunteer
effort, and right now I don't have the time to fix the progress bar. The
behaviour is caused by HTTP/2 connection multiplexing; I will fix it when I find
some time for it.

> > Both reduce RTT by 1, but they can't be combined.
> 
> I was using TLS Resume because, well, for a 300+GB download it just seemed
> to make sense, so it wouldn't have to check over 100GB of files before
> getting back to where I left off.
> 
> > You use TLS Resume, but you don't explicitly need to specify a file. By
> > default it will use ~/.wget-session.
> 
> I figure a 300GB+ transfer should have its own session file just in case I
> do something smaller between resumes that might overwrite .wget-session,
> plus you've got to remember I'm on WSL and I'd rather have relevant files
> kept within my normal folders rather than my WSL filesystem.
> 
I'm not sure you've understood TLS Session Resume correctly. TLS Session Resume
does not resume your download session from where it left off. Due to the way
HTTP works, Wget still has to scan all your existing files and send a HEAD
request for each of them when resuming. This is a limitation of HTTP and there's
nothing anybody can do about it.
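
To make that concrete, this is roughly the per-file cost of resuming a mirror,
sketched with curl rather than Wget2's internals. It assumes the earlier run
saved everything under ./example.com, as your script would:

    # Every file already on disk still costs (at least) one round trip to the
    # server just to learn whether the remote copy has changed.
    find example.com -type f | while read -r f; do
        curl -s -I -o /dev/null -w '%{http_code}  %{url_effective}\n' "https://$f"
    done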

TLS Session Resume merely saves one RTT when starting a new TLS session. It
matters only for the TLS handshake and nothing else; it does not resume the Wget
session at all. Also, the ~/.wget-session file only stores the TLS session
information, so you can use it across multiple sessions. It is just a cache.
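
So a separate session file buys you nothing beyond what the default cache
already provides; it is fine to use one, it just won't make resuming any
smarter. If you do want it, the invocation is simply this (a sketch; the file
name is arbitrary and example.com is a placeholder):

    # --tls-resume saves one RTT on each new TLS handshake; the session file is
    # only a cache of TLS session data, not a record of download progress.
    wget2 --mirror --tls-resume --tls-session-file="$HOME/big-mirror.tls" \
        https://example.com
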
> On Sat, Apr 7, 2018 at 3:04 AM, Darshit Shah <address@hidden> wrote:
> 
> > Hi Jeffrey,
> >
> > Thanks a lot for your feedback. This is what helps us improve.
> >
> > * Tim Rühsen <address@hidden> [180407 00:01]:
> > >
> > > On 06.04.2018 23:30, Jeffrey Fetterman wrote:
> > > > Thanks to the fix that Tim posted on gitlab, I've got wget2 running
> > > > just fine in WSL. Unfortunately it means I don't have TCP Fast Open,
> > > > but given how fast it's downloading a ton of files at once, it seems
> > > > like it must've been only a small gain.
> > > >
> > TCP Fast Open will not save you a lot in your particular scenario. It simply
> > saves one round trip when opening a new connection. So, if you're using Wget2
> > to download a lot of files, you are probably only opening ~5 connections at
> > the beginning and reusing them all. It depends on your RTT to the server, but
> > 1 RTT when downloading several megabytes is already an insignificant amount
> > of time.
> >
> > > >
> > > > I've come across a few annoyances however.
> > > >
> > > > 1. There doesn't seem to be any way to control the size of the download
> > > > queue, which I dislike because I want to download a lot of large files
> > > > at once and I wish it'd just focus on a few at a time, rather than over
> > > > a dozen.
> > > The number of parallel downloads ? --max-threads=n
> >
> > I don't think he meant --max-threads. Given how he is using HTTP/2, there's
> > a chance that what he's seeing is HTTP Stream Multiplexing. There is also
> > `--http2-request-window`, which you can try.
> > >
> > > > 3. Doing a TLS resume will cause a 'Failed to write 305 bytes (32: Broken
> > > > pipe)' error to be thrown, seems to be related to how certificate
> > > > verification is handled upon resume, but I was worried at first that the
> > > > WSL problems were rearing their ugly head again.
> > > Likely the WSL issue is also affecting the TLS layer. TLS resume is
> > > considered 'insecure',
> > > thus we have it disabled by default. There still is TLS False Start
> > > enabled by default.
> > >
> > >
> > > > 3. --no-check-certificate causes significantly more errors about how the
> > > > certificate issuer isn't trusted to be thrown (even though it's not
> > > > supposed to be doing anything related to certificates).
> > > Maybe a bit too verbose - these should be warnings, not errors.
> >
> > @Tim: I think with `--no-check-certificate` these should be neither warnings
> > nor errors. The user explicitly stated that they don't care about the
> > validity of the certificate. Why add any information there at all? Maybe we
> > keep it only in debug mode.
> > >
> > > > 4. --force-progress doesn't seem to do anything despite being recognized
> > > > as a valid parameter, using it in conjunction with -nv is no longer
> > > > beneficial.
> > > You likely want to use --progress=bar. --force-progress is to enable the
> > > progress bar even when redirecting (e.g. to a log file).
> > > @Darshit, we should adjust the behavior to be the same as in Wget1.x.
> >
> > I think the progress bar options are sometimes a little off since we don't
> > have tests for those and I am the only one using them.
> >
> > When exactly did you try to use --force-progress? I will change the
> > documentation today to reflect its actual use case. --force-progress is
> > useful only in --quiet mode. Which, TBH, doesn't make much sense to me since
> > simply --progress=bar will essentially put you in the same mode. AFAIR, this
> > comes from trying to bring in option compatibility from Wget 1.x.
> >
> > @Tim: Adjusting behaviour to the same as Wget 1.x doesn't make a lot of
> > sense for the progress bar. In Wget 1.x, the default mode is: progress bar +
> > verbose. Whereas, in Wget2, progress-bar will effectively enable the
> > non-verbose mode where only warnings and errors are printed. I am noting
> > this down for now. When I have a little time, I will think about all the
> > progress and verbosity options in Wget 1.x and make sure that they do
> > something similar in Wget2. Though, they won't have the exact same
> > behaviour.
> > >
> > > > 5. The documentation is unclear as to how to disable things that are
> > > > enabled by default. Am I to assume that --robots=off is equivalent to
> > > > -e robots=off?
> > >
> > > -e robots=off should still work. We also allow --robots=off or
> > > --no-robots.
> > >
> > > > 6. The documentation doesn't document being able to use 'M' for
> > > > chunk-size, e.g. --chunk-size=2M
> > >
> > > The wget2 documentation has to be brushed up - one of the blockers for
> > > the first release.
> > >
> > > >
> > > > 7. The documentation's instructions regarding --progress are all wrong.
> > > I'll take a look in the next few days.
> >
> > Thanks for the heads up. Will look into it when I look at the rest of the
> > progress options.
> > >
> > > >
> > > > 8. The http/https proxy options return as unknown options despite being
> > > > in the documentation.
> > > Yeah, the docs... see above. Also, proxy support is currently limited.
> > >
> > >
> > > > Lastly I'd like someone to look at the command I've come up with and
> > > > offer me critiques (and perhaps help me address some of the remarks
> > > > above if possible).
> > >
> > > No need for --continue.
> > > Think about using TLS Session Resumption.
> > > --domains is not needed in your example.
> > >
> >
> > You use TLS Resume, but you don't explicitly need to specify a file. By
> > default it will use ~/.wget-session.
> >
> > > Did you build with http/2 and compression support ?
> > >
> > > Regards, Tim
> > > > #!/bin/bash
> > > >
> > > > wget2 \
> > > >       `#WSL compatibility` \
> > > >       --restrict-file-names=windows --no-tcp-fastopen \
> > > >       \
> > > >       `#No certificate checking` \
> > > >       --no-check-certificate \
> > > >       \
> > > >       `#Scrape the whole site` \
> > > >       --continue --mirror --adjust-extension \
> > > >       \
> > > >       `#Local viewing` \
> > > >       --convert-links --backup-converted \
> > > >       \
> > > >       `#Efficient resuming` \
> > > >       --tls-resume --tls-session-file=.\tls.session \
> > > >       \
> > > >       `#Chunk-based downloading` \
> > > >       --chunk-size=2M \
> > > >       \
> > > >       `#Swiper no swiping` \
> > > >       --robots=off --random-wait \
> > > >       \
> > > >       `#Target` \
> > > >       --domains=example.com example.com
> > > >
> > >
> > >
> > >
> >
> > --
> > Thanking You,
> > Darshit Shah
> > PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
> >

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
