Re: [Bug-wget] Miscellaneous thoughts & concerns


From: Jeffrey Fetterman
Subject: Re: [Bug-wget] Miscellaneous thoughts & concerns
Date: Mon, 9 Apr 2018 03:10:38 -0500

I've tested wget2 with the following changes to ./libwget/ssl_gnutls.c:

        if (ret < 0) {
-           if (errno == EINPROGRESS) {
+           if (errno == EINPROGRESS || errno == 22 || errno == 32) {
                errno = EAGAIN; // GnuTLS does not handle EINPROGRESS
            } else if (errno == EOPNOTSUPP) {
                // fallback from fastopen, e.g. when fastopen is disabled in system
                debug_printf("Fallback from TCP Fast Open... TFO is disabled at system level\n");
                tcp->tcp_fastopen = 0;
                ret = connect(tcp->sockfd, tcp->connect_addrinfo->ai_addr, tcp->connect_addrinfo->ai_addrlen);
-               if (errno == ENOTCONN || errno == EINPROGRESS)
+               if (errno == ENOTCONN || errno == EINPROGRESS || errno == 22 || errno == 32)
                    errno = EAGAIN;
            }
        }


However, I still end up with multiple 'Failed to write 305 bytes (32:
Broken pipe)' errors when resuming a previous download with TLS Resume.

On Sun, Apr 8, 2018 at 4:38 PM, Jeffrey Fetterman <address@hidden>
wrote:

> >  The URLs are added first because of the way Wget will traverse the
> links. It just adds these URLs to the download queue, doesn't start
> downloading them instantly. If you traverse a web page and Wget finds
> links on it, it will obviously add them to the download queue. What else
> would you expect Wget to do?
>
> Not to traverse the entire site at once; to wait until the queue gets
> low enough before continuing to traverse.
>
> > TLS Session Resume will simply reduce 1 RTT when starting a new TLS
> Session. It simply matters for the TLS handshake and nothing else. It
> doesn't resume the Wget session at all. Also, the ~/.wget-session file
> simply stores the TLS Session information for each TLS Session. So you can
> use it for multiple sessions. It is just a cache.
>
> Ah, I see, so I should switch over to using TLS False Start since there's
> no real difference performance-wise?
>
>
> On Sun, Apr 8, 2018 at 10:11 AM, Darshit Shah <address@hidden> wrote:
>
>> * Jeffrey Fetterman <address@hidden> [180408 04:53]:
>> > Yes! Multiplexing was indeed partially the culprit, I've changed it
>> > to --http2-request-window=5
>> >
>> > However, the download queue (AKA 'Todo') still gets enormous. It's why
>> > I was wanting to use non-verbose mode in the first place: screens and
>> > screens of 'Adding url:'. There should really be a limit on how many
>> > URLs it adds!
>> >
>> The URLs are added first because of the way Wget will traverse the
>> links. It just adds these URLs to the download queue; it doesn't start
>> downloading them instantly. If you traverse a web page and Wget finds
>> links on it, it will obviously add them to the download queue. What else
>> would you expect Wget to do?
>>
>> > Darshit, as it stands it doesn't look like --force-progress does
>> > anything because --progress=bar forces the same non-verbose mode, and
>> > --force-progress is meant to be something used in non-verbose mode.
>> >
>> > However, the progress bar is still really... not useful. See here:
>> > https://i.imgur.com/KvbGmKe.png
>> >
>> > It's a single bar displaying a nonsense percentage, and it sounds like
>> > with multiplexing there's supposed to be, by default, 30 transfers
>> > going concurrently.
>> >
>> >
>> Yes, I am aware of this. Sadly, Wget is developed entirely through
>> volunteer effort, and currently I don't have the time on my hands to
>> fix the progress bar. It's caused by HTTP/2 connection multiplexing; I
>> will fix it when I find some time for it.
>>
>> > > Both reduce RTT by 1, but they can't be combined.
>> >
>> > I was using TLS Resume because, well, for a 300+GB download it just
>> > seemed to make sense, so it wouldn't have to check over 100GB of
>> > files before getting back to where I left off.
>> >
>> > > You use TLS Resume, but you don't explicitly need to specify a file.
>> > > By default it will use ~/.wget-session.
>> >
>> > I figure a 300GB+ transfer should have its own session file, just in
>> > case I do something smaller between resumes that might overwrite
>> > .wget-session. Plus, you've got to remember I'm on WSL, and I'd rather
>> > keep relevant files within my normal folders than in my WSL
>> > filesystem.
>> >
>> I'm not sure if you've understood TLS Session Resume correctly. TLS
>> Session Resume is not going to resume your download session from where
>> it left off. Due to the way HTTP works, Wget will still have to scan
>> all your existing files and send HEAD requests for each of them when
>> resuming. This is just a limitation of HTTP and there's nothing anybody
>> can do about it.
>>
>> TLS Session Resume will simply reduce 1 RTT when starting a new TLS
>> Session. It simply matters for the TLS handshake and nothing else. It
>> doesn't resume the Wget session at all. Also, the ~/.wget-session file
>> simply stores the TLS Session information for each TLS Session, so you
>> can use it for multiple sessions. It is just a cache.
>>
>> > On Sat, Apr 7, 2018 at 3:04 AM, Darshit Shah <address@hidden> wrote:
>> >
>> > > Hi Jeffrey,
>> > >
>> > > Thanks a lot for your feedback. This is what helps us improve.
>> > >
>> > > * Tim Rühsen <address@hidden> [180407 00:01]:
>> > > >
>> > > > On 06.04.2018 23:30, Jeffrey Fetterman wrote:
>> > > > > Thanks to the fix that Tim posted on gitlab, I've got wget2
>> > > > > running just fine in WSL. Unfortunately it means I don't have
>> > > > > TCP Fast Open, but given how fast it's downloading a ton of
>> > > > > files at once, it seems like it must've been only a small gain.
>> > > > >
>> > > TCP Fast Open will not save you a lot in your particular scenario.
>> > > It simply saves one round trip when opening a new connection. So,
>> > > if you're using Wget2 to download a lot of files, you are probably
>> > > only opening ~5 connections at the beginning and reusing them all.
>> > > It depends on your RTT to the server, but 1 RTT when downloading
>> > > several megabytes is already an insignificant amount of time.
>> > >
>> > > > >
>> > > > > I've come across a few annoyances however.
>> > > > >
>> > > > > 1. There doesn't seem to be any way to control the size of the
>> > > > > download queue, which I dislike because I want to download a
>> > > > > lot of large files at once and I wish it'd just focus on a few
>> > > > > at a time, rather than over a dozen.
>> > > > The number of parallel downloads? --max-threads=n
>> > >
>> > > I don't think he meant --max-threads. Given how he is using
>> > > HTTP/2, there's a chance what he's seeing is HTTP Stream
>> > > Multiplexing. There is also `--http2-request-window`, which you can
>> > > try.
>> > > >
>> > > > > 2. Doing a TLS resume will cause a 'Failed to write 305 bytes
>> > > > > (32: Broken pipe)' error to be thrown. It seems to be related
>> > > > > to how certificate verification is handled upon resume, but I
>> > > > > was worried at first that the WSL problems were rearing their
>> > > > > ugly head again.
>> > > > Likely the WSL issue is also affecting the TLS layer. TLS resume
>> > > > is considered 'insecure', thus we have it disabled by default.
>> > > > There still is TLS False Start enabled by default.
>> > > >
>> > > >
>> > > > > 3. --no-check-certificate causes significantly more errors
>> > > > > about how the certificate issuer isn't trusted to be thrown
>> > > > > (even though it's not supposed to be doing anything related to
>> > > > > certificates).
>> > > > Maybe a bit too verbose - these should be warnings, not errors.
>> > >
>> > > @Tim: I think with `--no-check-certificate` these should be neither
>> > > warnings nor errors. The user explicitly stated that they don't
>> > > care about the validity of the certificate. Why add any information
>> > > there at all? Maybe we keep it only in debug mode.
>> > > >
>> > > > > 4. --force-progress doesn't seem to do anything despite being
>> > > > > recognized as a valid parameter; using it in conjunction with
>> > > > > -nv is no longer beneficial.
>> > > > You likely want to use --progress=bar. --force-progress is to
>> > > > enable the progress bar even when redirecting (e.g. to a log
>> > > > file). @Darshit, we should adjust the behavior to be the same as
>> > > > in Wget 1.x.
>> > >
>> > > I think the progress bar options are sometimes a little off, since
>> > > we don't have tests for those and I am the only one using them.
>> > >
>> > > When exactly did you try to use --force-progress? I will change the
>> > > documentation today to reflect its actual use case.
>> > > --force-progress is useful only in --quiet mode, which, TBH,
>> > > doesn't make much sense to me since simply --progress=bar will
>> > > essentially put you in the same mode. AFAIR, this comes from trying
>> > > to bring in option compatibility from Wget 1.x.
>> > >
>> > > @Tim: Adjusting behaviour to be the same as Wget 1.x doesn't make a
>> > > lot of sense for the progress bar. In Wget 1.x, the default mode is
>> > > progress bar + verbose, whereas in Wget2, progress-bar effectively
>> > > enables the non-verbose mode where only warnings and errors are
>> > > printed. I am noting this down for now. When I have a little time,
>> > > I will think about all the progress and verbosity options in Wget
>> > > 1.x and make sure that they do something similar in Wget2, though
>> > > they won't have the exact same behaviour.
>> > > >
>> > > > > 5. The documentation is unclear as to how to disable things
>> > > > > that are enabled by default. Am I to assume that --robots=off
>> > > > > is equivalent to -e robots=off?
>> > > >
>> > > > -e robots=off should still work. We also allow --robots=off or
>> > > > --no-robots.
>> > > >
>> > > > > 6. The documentation doesn't document being able to use 'M' for
>> > > > > chunk-size, e.g. --chunk-size=2M.
>> > > >
>> > > > The wget2 documentation has to be brushed up - one of the
>> > > > blockers for the first release.
>> > > >
>> > > > >
>> > > > > 7. The documentation's instructions regarding --progress are
>> > > > > all wrong.
>> > > > I'll take a look in the next few days.
>> > >
>> > > Thanks for the heads up. Will look into it when I look at the rest
>> > > of the progress options.
>> > > >
>> > > > >
>> > > > > 8. The http/https proxy options return as unknown options
>> > > > > despite being in the documentation.
>> > > > Yeah, the docs... see above. Also, proxy support is currently
>> > > > limited.
>> > > >
>> > > >
>> > > > > Lastly, I'd like someone to look at the command I've come up
>> > > > > with and offer me critiques (and perhaps help me address some
>> > > > > of the remarks above if possible).
>> > > >
>> > > > No need for --continue.
>> > > > Think about using TLS Session Resumption.
>> > > > --domains is not needed in your example.
>> > > >
>> > >
>> > > You use TLS Resume, but you don't explicitly need to specify a
>> > > file. By default it will use ~/.wget-session.
>> > >
>> > > > Did you build with http/2 and compression support ?
>> > > >
>> > > > Regards, Tim
>> > > > > #!/bin/bash
>> > > > >
>> > > > > wget2 \
>> > > > >       `#WSL compatibility` \
>> > > > >       --restrict-file-names=windows --no-tcp-fastopen \
>> > > > >       \
>> > > > >       `#No certificate checking` \
>> > > > >       --no-check-certificate \
>> > > > >       \
>> > > > >       `#Scrape the whole site` \
>> > > > >       --continue --mirror --adjust-extension \
>> > > > >       \
>> > > > >       `#Local viewing` \
>> > > > >       --convert-links --backup-converted \
>> > > > >       \
>> > > > >       `#Efficient resuming` \
>> > > > >       --tls-resume --tls-session-file=./tls.session \
>> > > > >       \
>> > > > >       `#Chunk-based downloading` \
>> > > > >       --chunk-size=2M \
>> > > > >       \
>> > > > >       `#Swiper no swiping` \
>> > > > >       --robots=off --random-wait \
>> > > > >       \
>> > > > >       `#Target` \
>> > > > >       --domains=example.com example.com
>> > > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > > --
>> > > Thanking You,
>> > > Darshit Shah
>> > > PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
>> > >
>>
>> --
>> Thanking You,
>> Darshit Shah
>> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
>>
>
>

