bug-wget

Re: [Bug-wget] Miscellaneous thoughts & concerns


From: Jeffrey Fetterman
Subject: Re: [Bug-wget] Miscellaneous thoughts & concerns
Date: Mon, 9 Apr 2018 04:23:42 -0500

> You won't resume a download with TLS Resume. You refer to TLS Session
> Resumption... that means the client stores parts of the TLS handshake and
> uses it with the next connect to the same IP/Host to reduce RTT by 1. There
> are several reasons why this might not work. If TLS False Start works for
> you, leave --tls-resume away. And anyways, session resumption is only of
> help in certain conditions (e.g. you need many files from a HTTPS server
> that closes the connection after each one).

I understood you the first time. All I meant when I said 'resuming a
download with TLS Resume' was force-quitting out of a download and then
starting from the same session file as last time.

On Mon, Apr 9, 2018 at 3:36 AM, Tim Rühsen <address@hidden> wrote:

> On 04/09/2018 10:10 AM, Jeffrey Fetterman wrote:
> > I've tested wget2 with the following changes to ./libwget/ssl_gnutls.c
> >
> >         if (ret < 0) {
> > -           if (errno == EINPROGRESS) {
> > +           if (errno == EINPROGRESS || errno == 22 || errno == 32) {
> >                 errno = EAGAIN; // GnuTLS does not handle EINPROGRESS
> >             } else if (errno == EOPNOTSUPP) {
> >                 // fallback from fastopen, e.g. when fastopen is disabled in system
> >                 debug_printf("Fallback from TCP Fast Open... TFO is disabled at system level\n");
> >                 tcp->tcp_fastopen = 0;
> >                 ret = connect(tcp->sockfd, tcp->connect_addrinfo->ai_addr,
> >                               tcp->connect_addrinfo->ai_addrlen);
> > -               if (errno == ENOTCONN || errno == EINPROGRESS)
> > +               if (errno == ENOTCONN || errno == EINPROGRESS || errno == 22 || errno == 32)
> >                     errno = EAGAIN;
> >             }
> >         }
> >
>
> That's what I tested here as well with good results.
>
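[Editor's note: the magic numbers 22 and 32 in the quoted patch are EINVAL and
EPIPE on Linux/WSL. A minimal Python sketch of the same mapping, with the
numbers named (a hypothetical helper for illustration, not wget2's code):]

```python
# Mirror of the quoted patch's errno handling, with the literals
# 22 (EINVAL) and 32 (EPIPE) spelled out. WSL surfaces these spurious
# errno values on in-progress TCP Fast Open connects, so they are
# mapped to EAGAIN just like EINPROGRESS. Hypothetical helper.
import errno

def map_tfo_errno(err: int) -> int:
    """Return the errno a caller should see after a 'failed' TFO connect."""
    if err in (errno.EINPROGRESS, errno.EINVAL, errno.EPIPE):
        return errno.EAGAIN  # GnuTLS does not handle EINPROGRESS
    return err
```

[On Linux, errno.EINVAL == 22 and errno.EPIPE == 32, which is where the
literals in the diff come from.]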
> >
> > However, I still end up with multiple 'Failed to write 305 bytes (32:
> > Broken pipe)' errors when resuming a previous download with TLS Resume.
>
> You won't resume a download with TLS Resume. You refer to TLS Session
> Resumption... that means the client stores parts of the TLS handshake
> and uses it with the next connect to the same IP/Host to reduce RTT by
> 1. There are several reasons why this might not work.
> If TLS False Start works for you, leave --tls-resume away. And anyways,
> session resumption is only of help in certain conditions (e.g. you need
> many files from a HTTPS server that closes the connection after each one).
>
> Regards, Tim
>
> >
> > On Sun, Apr 8, 2018 at 4:38 PM, Jeffrey Fetterman <address@hidden> wrote:
> >
> >>>  The URLs are added first because of the way Wget will traverse the
> >> links. It just adds these URLs to the download queue, doesn't start
> >> downloading them instantly. If you traverse a web page and Wget finds
> >> links on it, it will obviously add them to the download queue. What else
> >> would you expect Wget to do?
> >>
> >> Not traverse the entire site at once, waiting until the queue gets low
> >> enough to continue traversing.
> >>
> >>> TLS Session Resume will simply reduce 1 RTT when starting a new TLS
> >> Session. It simply matters for the TLS handshake and nothing else. It
> >> doesn't resume the Wget session at all. Also, the ~/.wget-session file
> >> simply stores the TLS Session information for each TLS Session. So you
> >> can use it for multiple sessions. It is just a cache.
> >>
> >> Ah, I see, so I should switch over to using TLS False Start since
> >> there's no real difference performance-wise?
> >>
> >>
> >> On Sun, Apr 8, 2018 at 10:11 AM, Darshit Shah <address@hidden> wrote:
> >>
> >>> * Jeffrey Fetterman <address@hidden> [180408 04:53]:
> >>>> Yes! Multiplexing was indeed partially the culprit, I've changed it
> >>>> to --http2-request-window=5
> >>>>
> >>>> However the download queue (AKA 'Todo') still gets enormous. It's why
> >>>> I was wanting to use non-verbose mode in the first place, screens and
> >>>> screens of 'Adding url:'. There should really be a limit on how many
> >>>> urls it adds!
> >>>>
> >>> The URLs are added first because of the way Wget will traverse the
> >>> links. It just adds these URLs to the download queue, doesn't start
> >>> downloading them instantly. If you traverse a web page and Wget finds
> >>> links on it, it will obviously add them to the download queue. What else
> >>> would you expect Wget to do?
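[Editor's note: the bounded behaviour being asked for can be sketched with a
fixed-size queue: the traversal (producer) blocks whenever the queue is full
and only continues once workers drain it. An illustrative Python sketch, not
wget2's internals:]

```python
# Bounded download queue: the crawler stops adding URLs when the queue
# holds max_queue entries, resuming traversal as workers drain it.
# All names here are illustrative, not wget2 code.
import queue
import threading

def crawl(urls, max_queue=5, workers=2):
    todo = queue.Queue(maxsize=max_queue)  # put() blocks when full
    done = []

    def worker():
        while True:
            url = todo.get()
            if url is None:                # sentinel: no more work
                break
            done.append(url)               # stand-in for the real download
            todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for url in urls:                       # traversal pauses here when full
        todo.put(url)
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return done
```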
> >>>
> >>>> Darshit, as it stands it doesn't look like --force-progress does
> >>>> anything because --progress=bar forces the same non-verbose mode, and
> >>>> --force-progress is meant to be something used in non-verbose mode.
> >>>>
> >>>> However, the progress bar is still really... not useful. See here:
> >>>> https://i.imgur.com/KvbGmKe.png
> >>>>
> >>>> It's a single bar displaying a nonsense percentage, and it sounds like
> >>>> with multiplexing there's supposed to be, by default, 30 transfers
> >>>> going concurrently.
> >>>>
> >>> Yes, I am aware of this. Sadly, Wget is developed entirely on volunteer
> >>> effort. And currently, I don't have the time on my hands to fix the
> >>> progress bar. It's caused by HTTP/2 connection multiplexing. I will fix
> >>> it when I find some time for it.
> >>>
> >>>>> Both reduce RTT by 1, but they can't be combined.
> >>>>
> >>>> I was using TLS Resume because, well, for a 300+GB download it just
> >>>> seemed to make sense, so it wouldn't have to check over 100GB of files
> >>>> before getting back to where I left off.
> >>>>
> >>>>> You use TLS Resume, but you don't explicitly need to specify a file.
> >>>>> By default it will use ~/.wget-session.
> >>>>
> >>>> I figure a 300GB+ transfer should have its own session file just in
> >>>> case I do something smaller between resumes that might overwrite
> >>>> .wget-session, plus you've got to remember I'm on WSL and I'd rather
> >>>> have relevant files kept within my normal folders rather than my WSL
> >>>> filesystem.
> >>>>
> >>> I'm not sure if you've understood TLS Session Resume correctly. TLS
> >>> Session Resume is not going to resume your download session from where
> >>> it left off. Due to the way HTTP works, Wget will still have to scan all
> >>> your existing files and send HEAD requests for each of them when
> >>> resuming. This is just a limitation of HTTP and there's nothing anybody
> >>> can do about it.
> >>>
> >>> TLS Session Resume will simply reduce 1 RTT when starting a new TLS
> >>> Session. It simply matters for the TLS handshake and nothing else. It
> >>> doesn't resume the Wget session at all. Also, the ~/.wget-session file
> >>> simply stores the TLS Session information for each TLS Session. So you
> >>> can use it for multiple sessions. It is just a cache.
> >>>>
> >>>> On Sat, Apr 7, 2018 at 3:04 AM, Darshit Shah <address@hidden> wrote:
> >>>>
> >>>>> Hi Jeffrey,
> >>>>>
> >>>>> Thanks a lot for your feedback. This is what helps us improve.
> >>>>>
> >>>>> * Tim Rühsen <address@hidden> [180407 00:01]:
> >>>>>>
> >>>>>> On 06.04.2018 23:30, Jeffrey Fetterman wrote:
> >>>>>>> Thanks to the fix that Tim posted on gitlab, I've got wget2
> >>> running
> >>>>> just
> >>>>>>> fine in WSL. Unfortunately it means I don't have TCP Fast Open,
> >>> but
> >>>>> given
> >>>>>>> how fast it's downloading a ton of files at once, it seems like it
> >>>>> must've
> >>>>>>> been only a small gain.
> >>>>>>>
> >>>>> TCP Fast Open will not save you a lot in your particular scenario. It
> >>>>> simply saves one round trip when opening a new connection. So, if
> >>>>> you're using Wget2 to download a lot of files, you are probably only
> >>>>> opening ~5 connections at the beginning and reusing them all. It
> >>>>> depends on your RTT to the server, but 1 RTT when downloading several
> >>>>> megabytes is already an insignificant amount of time.
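[Editor's note: putting assumed example numbers on that claim, with a naive
transfer-time model (all figures illustrative, not measurements):]

```python
# Back-of-envelope check that one saved RTT is negligible next to a
# multi-megabyte transfer. Bandwidth, RTT and file size are assumed
# example values.
def transfer_time(size_bytes: float, bandwidth_bps: float,
                  round_trips: int, rtt_s: float) -> float:
    """Naive model: serialization time plus handshake round trips."""
    return size_bytes * 8 / bandwidth_bps + round_trips * rtt_s

# 10 MB file, 50 Mbit/s link, 50 ms RTT:
without_tfo = transfer_time(10e6, 50e6, 2, 0.05)  # TCP handshake + request
with_tfo = transfer_time(10e6, 50e6, 1, 0.05)     # TFO folds one RTT away
saving = without_tfo - with_tfo                   # 50 ms on a ~1.7 s transfer
```

[Under these assumptions the saving is about 3% of the total, shrinking
further as files grow or connections are reused.]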
> >>>>>
> >>>>>>>
> >>>>>>> I've come across a few annoyances however.
> >>>>>>>
> >>>>>> 1. There doesn't seem to be any way to control the size of the
> >>>>>> download queue, which I dislike because I want to download a lot of
> >>>>>> large files at once and I wish it'd just focus on a few at a time,
> >>>>>> rather than over a dozen.
> >>>>>> The number of parallel downloads? --max-threads=n
> >>>>>
> >>>>> I don't think he meant --max-threads. Given how he is using HTTP/2,
> >>>>> there's a chance what he's seeing is HTTP Stream Multiplexing. There
> >>>>> is also `--http2-request-window` which you can try.
> >>>>>>
> >>>>>> 2. Doing a TLS resume will cause a 'Failed to write 305 bytes (32:
> >>>>>> Broken pipe)' error to be thrown, seems to be related to how
> >>>>>> certificate verification is handled upon resume, but I was worried
> >>>>>> at first that the WSL problems were rearing their ugly head again.
> >>>>>> Likely the WSL issue is also affecting the TLS layer. TLS resume is
> >>>>>> considered 'insecure',
> >>>>>> thus we have it disabled by default. There still is TLS False Start
> >>>>>> enabled by default.
> >>>>>>
> >>>>>>
> >>>>>> 3. --no-check-certificate causes significantly more errors about how
> >>>>>> the certificate issuer isn't trusted to be thrown (even though it's
> >>>>>> not supposed to be doing anything related to certificates).
> >>>>>> Maybe a bit too verbose - these should be warnings, not errors.
> >>>>>
> >>>>> @Tim: I think with `--no-check-certificate` these should be neither
> >>>>> warnings nor errors. The user explicitly stated that they don't care
> >>>>> about the validity of the certificate. Why add any information there
> >>>>> at all? Maybe we keep it only in debug mode.
> >>>>>>
> >>>>>> 4. --force-progress doesn't seem to do anything despite being
> >>>>>> recognized as a valid parameter, using it in conjunction with -nv is
> >>>>>> no longer beneficial.
> >>>>>> You likely want to use --progress=bar. --force-progress is to enable
> >>>>>> the progress bar even when redirecting (e.g. to a log file).
> >>>>>> @Darshit, we should adjust the behavior to be the same as in Wget1.x.
> >>>>>
> >>>>> I think the progress bar options are sometimes a little off since we
> >>>>> don't have tests for those and I am the only one using them.
> >>>>>
> >>>>> When exactly did you try to use --force-progress? I will change the
> >>>>> documentation today to reflect its actual use case. --force-progress
> >>>>> is useful only in --quiet mode. Which, TBH, doesn't make much sense to
> >>>>> me since simply --progress=bar will essentially put you in the same
> >>>>> mode. AFAIR, this comes from trying to bring in option compatibility
> >>>>> from Wget 1.x.
> >>>>>
> >>>>> @Tim: Adjusting behaviour to the same as Wget 1.x doesn't make a lot
> >>>>> of sense for the progress bar. In Wget 1.x, the default mode is:
> >>>>> progress bar + verbose. Whereas, in Wget2, progress-bar will
> >>>>> effectively enable the non-verbose mode where only warnings and errors
> >>>>> are printed. I am noting this down for now. When I have a little time,
> >>>>> I will think about all the progress and verbosity options in Wget 1.x
> >>>>> and make sure that they do something similar in Wget2. Though, they
> >>>>> won't have the exact same behaviour.
> >>>>>>
> >>>>>> 5. The documentation is unclear as to how to disable things that are
> >>>>>> enabled by default. Am I to assume that --robots=off is equivalent
> >>>>>> to -e robots=off?
> >>>>>>
> >>>>>> -e robots=off should still work. We also allow --robots=off or
> >>>>>> --no-robots.
> >>>>>>
> >>>>>> 6. The documentation doesn't document being able to use 'M' for
> >>>>>> chunk-size, e.g. --chunk-size=2M
> >>>>>>
> >>>>>> The wget2 documentation has to be brushed up - one of the blockers
> >>>>>> for the first release.
> >>>>>>
> >>>>>>>
> >>>>>> 7. The documentation's instructions regarding --progress are all
> >>>>>> wrong.
> >>>>>> I'll take a look the next days.
> >>>>>
> >>>>> Thanks for the heads up. Will look into it when I look at the rest
> >>>>> of the progress options.
> >>>>>>
> >>>>>>>
> >>>>>> 8. The http/https proxy options return as unknown options despite
> >>>>>> being in the documentation.
> >>>>>> Yeah, the docs... see above. Also, proxy support is currently
> >>>>>> limited.
> >>>>>>
> >>>>>>
> >>>>>> Lastly I'd like someone to look at the command I've come up with and
> >>>>>> offer me critiques (and perhaps help me address some of the remarks
> >>>>>> above if possible).
> >>>>>>
> >>>>>> No need for --continue.
> >>>>>> Think about using TLS Session Resumption.
> >>>>>> --domains is not needed in your example.
> >>>>>>
> >>>>>
> >>>>> You use TLS Resume, but you don't explicitly need to specify a file.
> >>>>> By default it will use ~/.wget-session.
> >>>>>
> >>>>>> Did you build with http/2 and compression support?
> >>>>>>
> >>>>>> Regards, Tim
> >>>>>>> #!/bin/bash
> >>>>>>>
> >>>>>>> wget2 \
> >>>>>>>       `#WSL compatibility` \
> >>>>>>>       --restrict-file-names=windows --no-tcp-fastopen \
> >>>>>>>       \
> >>>>>>>       `#No certificate checking` \
> >>>>>>>       --no-check-certificate \
> >>>>>>>       \
> >>>>>>>       `#Scrape the whole site` \
> >>>>>>>       --continue --mirror --adjust-extension \
> >>>>>>>       \
> >>>>>>>       `#Local viewing` \
> >>>>>>>       --convert-links --backup-converted \
> >>>>>>>       \
> >>>>>>>       `#Efficient resuming` \
> >>>>>>>       --tls-resume --tls-session-file=./tls.session \
> >>>>>>>       \
> >>>>>>>       `#Chunk-based downloading` \
> >>>>>>>       --chunk-size=2M \
> >>>>>>>       \
> >>>>>>>       `#Swiper no swiping` \
> >>>>>>>       --robots=off --random-wait \
> >>>>>>>       \
> >>>>>>>       `#Target` \
> >>>>>>>       --domains=example.com example.com
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Thanking You,
> >>>>> Darshit Shah
> >>>>> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
> >>>>>
> >>>
> >>> --
> >>> Thanking You,
> >>> Darshit Shah
> >>> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
> >>>
> >>
> >>
> >
>
>

