From: Giuseppe Scrivano
Subject: Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
Date: Tue, 10 Sep 2013 16:48:21 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

Tim Ruehsen <address@hidden> writes:

> And SIGBUS could also occur for any other reason (e.g. real bugs in Wget).
> As was already said, replacing mmap by read would not crash (wget_read_file() 
> reads as many bytes as there are without prior checking the length of the 
> file). But without additional logic, it might read random data (many processes
> writing into the file at the same time, not necessarily the same data). Wget
> would try to parse / change (-k) it, the result would be broken, but no error 
> would be printed. So, replacing mmap is not a solution, but maybe a part of a 
> solution.
> Now to the possible solutions that come into my mind:
> 1. While downloading / writing data, Wget could build a checksum of the file.
> It allows checking later when re-reading the file. In this case we could 
> really tell the user: hey, someone trashed our file while we are working...
> To get this working, we must remove the mmap code.
> 2. Using tempfiles / tempdirs only and move them to the right place. That 
> would bring in some kind of atomicity, though there are still conflicts to 
> solve (e.g. a second Wget instance is faster - should we overwrite existing 
> files / directories).
> 3. Keeping html/css files in memory after downloading. These are the ones we 
> later re-read to parse them for links/URLs. Writing them to disk after parsing
> using a tempfile and a move/rename to have atomicity.
> 4. Using (advisory) file-locks just helps against other Wget instances (is 
> that enough?). And with -k you have to keep the descriptor open for each file
> until Wget is done with downloading everything. This is not practical, since
> there could be (10-, 100-)thousands of files to be downloaded.
> If someone would like to work on a patch, here is my opinion: I would
> implement 1. as the least complex to code (but it needs more CPU). Point 4
> would not work in any case.
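
For reference, option 1 in the quoted list (checksum while writing, verify
on re-read) could be sketched roughly as follows. This is only an
illustration in Python, not Wget code: the names ChecksummingWriter and
file_unchanged are made up for the example, and SHA-256 is an arbitrary
choice of hash.

```python
import hashlib


class ChecksummingWriter:
    """Hypothetical helper: hashes every chunk as it is written to disk."""

    def __init__(self, path):
        self._file = open(path, "wb")
        self._hash = hashlib.sha256()  # assumption: any stable hash would do

    def write(self, chunk):
        self._file.write(chunk)
        self._hash.update(chunk)

    def close(self):
        """Close the file and return the digest of what we wrote."""
        self._file.close()
        return self._hash.hexdigest()


def file_unchanged(path, expected_digest):
    """Re-read the file and report whether it still matches our digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_digest
```

A mismatch on re-read would be the point where the user can be told that
another process modified the file. Note that this detects the problem but
does not prevent it.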

I don't think we should aim at supporting multiple instances of wget
running on the same tree, but we can aim at having at least atomic
operations per file.

That said, I think that using temp files is more than enough and we
shouldn't really care about possible conflicts: another instance of wget
is just a separate entity that we should not consider.
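
To make the temp-file idea concrete, here is a minimal sketch in Python
(an illustration only, not Wget's C code). The function name atomic_write
and the ".tmp-" prefix are assumptions for the example. The point is that
the temp file lives in the destination directory, so the final rename
stays on one filesystem, where POSIX makes it atomic.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace overwrites an existing destination: the last writer wins.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If two instances write the same file, whichever renames last simply
replaces the other's copy, which matches the "don't care about conflicts"
position above: each file is complete, even if it is not ours.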

File locks can be implemented as an additional level of security,
requiring an explicit argument to enable them, but I still don't see the
point since we don't support multiple processes running simultaneously
on the same data.
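
For illustration, an opt-in advisory lock could look like the sketch
below, again in Python rather than Wget's C. The function rewrite_file
and its use_lock parameter are hypothetical stand-ins for an explicit
command-line argument. fcntl.flock is Unix-only, and being advisory it
only affects other processes that take the same lock.

```python
import fcntl


def rewrite_file(path, transform, use_lock=False):
    """Read path, apply transform to its bytes, and write the result back."""
    with open(path, "r+b") as f:
        if use_lock:
            # Advisory: blocks only other processes that also call flock.
            fcntl.flock(f, fcntl.LOCK_EX)
        try:
            data = transform(f.read())
            f.seek(0)
            f.write(data)
            f.truncate()
        finally:
            if use_lock:
                fcntl.flock(f, fcntl.LOCK_UN)
```

The lock is held only for the duration of one rewrite, so it does not
need a descriptor kept open per file for the whole run.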

Hopefully we will merge the parallel-wget branch soon, so we will have
threads instead of processes :-)

