
Re: [Bug-wget] [RFC] Extend concurrency support


From: Tim Ruehsen
Subject: Re: [Bug-wget] [RFC] Extend concurrency support
Date: Wed, 21 May 2014 11:56:45 +0200
User-agent: KMail/4.12.4 (Linux/3.14-1-amd64; KDE/4.13.1; x86_64; ; )

On Tuesday 20 May 2014 20:14:56 Daniel Stenberg wrote:
> On Tue, 20 May 2014, Tim Ruehsen wrote:
> >> Not sure what other people think about it, but I think wget2, whatever it
> >> will be, should be based on libcurl and focus the wget development on
> >> what wget does better, e.g. recursive downloads.
> > 
> > Libcurl is one option (and not the worst). At least it would replace the
> > HTTP and FTP send and receive (plus the underlying TCP network handling -
> > what about DNS caching?). This is just a small amount of Wget's code to
> > replace.
> 
> I'll start my response by clearly spelling out that I am the libcurl maintainer
> and primary developer. I'm biased as hell.
> 
> FTP, HTTP, TLS and sockets seem to be about 25% of wget code (very roughly
> counted). Replacing the network layers with libcurl would not remove those
> 25% completely, as I assume there would have to be adaptations and stuff
> written anyway. Also, not everything would be provided exactly in the way
> wget would prefer, since the designs are quite different.

I also estimate it at somewhere between 20% and 25%. Sorry for calling this a 
'small amount'...

> The benefit would not just be less code in wget. libcurl offers
> substantially more functionality in the network layer than what wget has.
> And yes, libcurl has a DNS cache.

As I understood Giuseppe, he wants to concentrate on FTP(S) and HTTP(S). 
Additional functionality like POP3, IMAP, ... is not going to be used. As long 
as this functionality is not a burden (memory usage, library load impact, ...) 
it is fine.
You are using linked lists to store cache entries!? Does this scale with 
Wget corner cases like 'download the internet'? I know it sounds a bit 
nit-picky... but I'd rather mention it sooner than too late.
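
To make the concern concrete (a hypothetical sketch, not libcurl's actual 
code): a list-based cache pays O(n) per lookup, which adds up once millions 
of hosts are cached, where a hash table would stay near O(1).

#include <string.h>

struct dns_entry {
    const char *host;
    const char *addr;
    struct dns_entry *next;
};

/* every lookup walks the whole list: n cached hosts cost O(n)
   per resolution */
static const char *cache_lookup(const struct dns_entry *head,
                                const char *host)
{
    for (const struct dns_entry *e = head; e; e = e->next)
        if (strcmp(e->host, host) == 0)
            return e->addr;
    return NULL;
}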

> In the past I've been told that the licensing of libcurl and the fact that
> libcurl is not a GNU project are blocking reasons against using it in wget.
> I don't know if there are any rules or guidelines that actually dictate
> this, but those libcurl facts won't change.

AFAIK, it is more of a 'would be nice'. Giuseppe, as the maintainer, should 
know, or at least is in the right position to ask someone from the GNU 
'organization'.

> In my eyes, the biggest drawback with switching to libcurl is that it'll
> require quite a big ripping-out-and-replacing-the-carpet-beneath-us
> operation, and one that'll require one or more dedicated contributors to do
> some heavy lifting to get everything on track.

I am not sure how we would find enough people-power for this task. On the 
other hand, that's what I've done in the Mget project. I guess a merge of Mget 
and Wget would be less work. Mget already implements most of Wget's options 
plus a bunch more.

> > - Cookie logic (incl. public suffix handling)
> 
> libcurl also provides cookie support.

Yes, to set cookie headers in requests and to take cookies from responses. 
This would not replace the relevant code in Wget's cookie.c.
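
For illustration: enabling libcurl's cookie engine is done per easy handle, 
and it covers exactly this wire-level part; policy code such as the public 
suffix handling would stay on Wget's side. A minimal sketch:

#include <curl/curl.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (h) {
        curl_easy_setopt(h, CURLOPT_URL, "https://example.com/");
        /* an empty filename enables the cookie engine without
           loading any cookies */
        curl_easy_setopt(h, CURLOPT_COOKIEFILE, "");
        /* received cookies are written out at curl_easy_cleanup() */
        curl_easy_setopt(h, CURLOPT_COOKIEJAR, "cookies.txt");
        curl_easy_perform(h);
        curl_easy_cleanup(h);
    }
    curl_global_cleanup();
    return 0;
}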

> > - threading abstraction API
> 
> libcurl offers parallel transfers without the use of threads, so wget
> actually wouldn't need threads if based on libcurl. At least not for that
> reason.

Think of Metalink downloads. We have the situation that chunks of files are 
downloaded in parallel AND, after downloading each chunk, the checksum has to 
be calculated and verified. With a non-threaded approach, you would serialize 
this task onto a single CPU core. While checksumming, the parallel download is 
paused. Not so in a threaded model.

Well, you could wait until the file is complete and then do checksumming over 
the whole file. But Wget is able to download many files at once... the same 
problem arises there: checksumming and downloading would block each other in a 
non-threaded model.
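
A minimal sketch of what I mean (hypothetical names, plain pthreads; the 
checksum call is a placeholder for a real hash such as SHA-256): the download 
path enqueues each finished chunk, and a worker thread verifies chunks while 
further chunks keep arriving. Error handling is omitted for brevity.

#include <pthread.h>
#include <stdlib.h>

struct chunk {
    unsigned char *data;
    size_t len;
    struct chunk *next;
};

static struct chunk *queue_head;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int downloads_done;

/* placeholder: hash the chunk and compare the digest against the
   checksum given in the Metalink file */
static void verify_chunk(const unsigned char *data, size_t len)
{
    (void)data;
    (void)len;
}

/* called from the download path whenever a chunk is complete */
static void enqueue_chunk(unsigned char *data, size_t len)
{
    struct chunk *c = malloc(sizeof *c);
    c->data = data;
    c->len = len;
    pthread_mutex_lock(&lock);
    c->next = queue_head;
    queue_head = c;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* called once by the download loop when all transfers have finished */
static void finish_downloads(void)
{
    pthread_mutex_lock(&lock);
    downloads_done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* runs on its own core; checksumming never pauses the downloads */
static void *checksum_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!queue_head && !downloads_done)
            pthread_cond_wait(&cond, &lock);
        struct chunk *c = queue_head;
        if (!c) {               /* queue drained and downloads finished */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        queue_head = c->next;
        pthread_mutex_unlock(&lock);

        verify_chunk(c->data, c->len);
        free(c->data);
        free(c);
    }
}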

The same goes for DNS resolving, as long as the resolving does not work 
asynchronously. Not sure if and how this works with libcurl, but I guess that 
it will make the client's code more complex. Not so in a threaded model.

And the same goes for HTML/CSS/whatever parsing to retrieve URLs. In a 
threaded model these tasks run in parallel with further downloading.


One question that came to my mind while I was looking at the libcurl API: 
what about type safety (thinking of e.g. curl_easy_setopt())? Are there 'not 
so easy' functions, or should a developer write a wrapper function for each 
CURLoption value?
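
What I mean, as a hypothetical sketch: curl_easy_setopt() is variadic, so the 
compiler cannot check that the value matches the option; per-option wrappers 
would restore that checking, but someone has to write one per CURLoption.

#include <curl/curl.h>

/* hypothetical typed wrappers; with the variadic curl_easy_setopt(),
   passing e.g. an int where a long or a char * is expected compiles
   silently, while these wrappers let the compiler catch it */
static CURLcode set_url(CURL *h, const char *url)
{
    return curl_easy_setopt(h, CURLOPT_URL, url);
}

static CURLcode set_timeout(CURL *h, long seconds)
{
    return curl_easy_setopt(h, CURLOPT_TIMEOUT, seconds);
}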

Regards, Tim



