
Re: [Bug-wget] Concurrency and wget


From: Tim Ruehsen
Subject: Re: [Bug-wget] Concurrency and wget
Date: Tue, 3 Apr 2012 11:17:56 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-2-amd64; KDE/4.7.4; x86_64; ; )

Hi Giuseppe, hi Micah,

While I couldn't sleep last night, I thought about wget and concurrency...

I had the idea of using a top-down approach to outline what wget is doing,
just to get an overview without struggling with the details of the
implementation. As a side effect, one would have a (textual? graphical?)
starting point for contributors to jump into the project, and a chance to
have a clear and well-documented design.

Since maintaining a flowchart is time-consuming and requires some extra
skills and tools, plain text in the form of a "programming language"
(pseudocode) seems to fit best.

Here is just a beginning, let's say a basis for discussion.
If you don't mind, I would like to take part in the ongoing development.

Basic wget functionality (download a given URI/IRI):

main (URI) {
        put <URI> into <queue>

        while <queue> is not empty {
                download_and_analyse(next <queue> entry)
        }
}

download_and_analyse (URI) {
        download URI to FILE
        add URI to <downloaded>
        remove URI from <queue>
        scan FILE and add URIs to <queue> if not already in <downloaded>
}
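
To make the sketch concrete, here is a minimal runnable version of this
loop in Python (only a discussion aid, not wget code); urllib and a crude
href regex stand in for wget's real HTTP and HTML machinery:

import re
import urllib.request
from collections import deque

def download_and_analyse(uri, queue, downloaded):
    # "download URI to FILE" - here the body just stays in memory
    body = urllib.request.urlopen(uri).read().decode("utf-8", "replace")
    downloaded.add(uri)
    # "scan FILE and add URIs to <queue> if not already in <downloaded>"
    for link in re.findall(r'href="(https?://[^"]+)"', body):
        if link not in downloaded:
            queue.append(link)

def main(start_uri):
    queue, downloaded = deque([start_uri]), set()
    while queue:                        # "while <queue> is not empty"
        uri = queue.popleft()
        if uri not in downloaded:       # an entry may have been queued twice
            download_and_analyse(uri, queue, downloaded)

if __name__ == "__main__":
    main("https://example.com/")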


Extended for simple multitasking (threaded, multi-process or even
distributed). This is just one possible design for concurrent downloads;
maybe you have a more elegant idea.

main (URI) {
        create <N> downloaders
        put <URI> into <queue>

        wait for status message from downloader {
                print status
                if <queue> is empty {
                        stop downloaders
                        we are done
                }
        }
}

downloader {
        wait for and allocate entry in <queue> {
                download_and_analyse(entry)
        }
}

download_and_analyse (URI) {
        download URI to FILE
        add URI to <downloaded>
        remove URI from <queue>
        scan FILE and add URIs to <queue> if not already in <downloaded>
}
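
As one possible threaded instance of this design, here is a Python sketch
(again only an illustration, assuming threads are the chosen mechanism);
fetch() and extract_links() are simple stand-ins defined inline, and
queue.Queue plays the role of <queue>:

import queue
import re
import threading
import urllib.request

N = 4  # the <N> downloaders from the sketch above

def fetch(uri):
    # stand-in for wget's downloader: returns the body as text
    return urllib.request.urlopen(uri).read().decode("utf-8", "replace")

def extract_links(body):
    # crude stand-in for wget's HTML scanner (absolute http(s) links only)
    return re.findall(r'href="(https?://[^"]+)"', body)

def downloader(work, seen, lock):
    # "wait for and allocate entry in <queue>"
    while True:
        uri = work.get()
        if uri is None:                 # sentinel: main says "stop downloaders"
            work.task_done()
            return
        try:
            for link in extract_links(fetch(uri)):
                with lock:              # <downloaded> bookkeeping
                    if link in seen:
                        continue
                    seen.add(link)
                work.put(link)
            print("done:", uri)         # the "status message" to main
        except OSError as err:
            print("failed:", uri, err)
        finally:
            work.task_done()

def main(start_uri):
    work, seen, lock = queue.Queue(), {start_uri}, threading.Lock()
    threads = [threading.Thread(target=downloader, args=(work, seen, lock))
               for _ in range(N)]
    for t in threads:
        t.start()
    work.put(start_uri)
    work.join()         # blocks until <queue> is empty AND all entries processed
    for _ in threads:
        work.put(None)  # "stop downloaders"
    for t in threads:
        t.join()

if __name__ == "__main__":
    main("https://example.com/")

Note that "<queue> is empty" alone is not a safe stop condition: a busy
downloader may still add entries after main has looked. The
join()/task_done() accounting above waits until the queue is empty *and*
every allocated entry has been fully processed.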


Extended to download a URI from several sources in parallel.
main() and downloader() stay the same; only download_and_analyse() is extended.

download_and_analyse (URI) {
        /* download URI to FILE */
        put <X> chunk entries into <chunk_queue>
        create <X> chunk_loaders
        wait for status message from chunk_loader {
                send modified status message to main
                if <chunk_queue> is empty {
                        stop chunk_loaders
                        end loop
                }
        }

        add URI to <downloaded>
        remove URI from <queue>
        scan FILE and add URIs to <queue> if not already in <downloaded>
}

chunk_loader {
        wait for and allocate entry in <chunk_queue> {
                download(entry)
                remove entry from <chunk_queue>
        }
}
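
As an illustration of the chunk idea against a single source (several
mirrors would work the same way, one source per chunk), here is a Python
sketch using HTTP Range requests; it assumes the server honors Range
(206 Partial Content) and reports a Content-Length:

import concurrent.futures
import urllib.request

X = 4  # the <X> chunk_loaders from the sketch above

def chunk_loader(uri, start, end):
    # one <chunk_queue> entry: bytes [start, end] via an HTTP Range request
    req = urllib.request.Request(uri, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def download_parallel(uri):
    # probe the size first, then "put <X> chunk entries into <chunk_queue>"
    with urllib.request.urlopen(urllib.request.Request(uri, method="HEAD")) as r:
        size = int(r.headers["Content-Length"])
    chunk = -(-size // X)  # ceiling division: at most <X> chunk entries
    ranges = [(off, min(off + chunk, size) - 1) for off in range(0, size, chunk)]
    data = bytearray(size)
    with concurrent.futures.ThreadPoolExecutor(max_workers=X) as pool:
        futures = [pool.submit(chunk_loader, uri, s, e) for s, e in ranges]
        for fut in concurrent.futures.as_completed(futures):
            start, blob = fut.result()   # a chunk_loader "status message"
            data[start:start + len(blob)] = blob
    return bytes(data)

if __name__ == "__main__":
    print(len(download_parallel("https://example.com/")), "bytes")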

After some iterations we should come to a point where we can make further
decisions:
- how to implement concurrency (threads, processes, distributed processes,
  (cloud))
- how to implement communication between tasks
- is a wget rewrite reasonable?
- which existing code can be recycled?
- whether to create libraries from existing code (e.g. libwget) or to use
  external libraries (e.g. for network stuff, parsing and creating
  URI/IRIs, etc.)
- creating a list of test code, especially for the library code
- etc.


    Tim


