From: Ángel González
Subject: Re: [Bug-wget] [PATCH] New option: --rename-output: modify output filename with perl
Date: Mon, 05 Aug 2013 00:33:15 +0200
User-agent: Thunderbird

On 04/08/13 20:47, Micah Cowan wrote:
> On Sat, Aug 03, 2013 at 11:50:48PM +0200, Ángel González wrote:
>> If stdbuf(1) were installed, wget could use it to disable the stdio
>> buffering. Adding yet more variation between systems...
> Thanks for pointing that out; I'd completely forgotten about stdbuf...
>
> IMO, though, stdbuf is a hack; very convenient when you need such a
> thing, but ultimately pretty unreliable: undefined behavior for some
> uses (unrelated to our needs), and it won't always have the effect we
> want (if the wrapped program explicitly adjusts its buffers, as tee
> does).
>
> As far as variation between systems is concerned, though, a possible
> choice would be to disable the --name-filter-program option unless
> stdbuf exists. Of course, it could always exist at configure time and
> be absent at runtime... and it would probably limit the number of OSes
> able to handle this feature unacceptably.
I mentioned variation as something to take into account. It's not
necessarily “bad”; you could just as well have different sed flavours.
I don't think --name-filter-program should depend on having stdbuf installed.

A possible approach is to set an alarm for a couple of seconds, launch the
filter (through stdbuf if available), and read/select on the pipe. If we
receive the alarm, mark the filter as buffered and close its stdin. If we
get the filename back, mark it as unbuffered and keep using the same
filter instance.

If the program is marked as buffered, wget would launch a new instance
per filename, immediately closing the input pipe.
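
For illustration, a rough sketch of that probe in C (untested; probe_filter
and the two-second timeout are made up for the example, and a select()
timeout plays the role of the alarm):

#include <sys/select.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the filter answered within the timeout (keep and reuse it),
   0 if it appears to buffer (spawn one instance per filename instead),
   -1 on error.  */
static int
probe_filter (const char *filter_cmd, int *to_filter, int *from_filter)
{
  int in_pipe[2], out_pipe[2];
  pid_t pid;

  if (pipe (in_pipe) < 0 || pipe (out_pipe) < 0)
    return -1;

  pid = fork ();
  if (pid < 0)
    return -1;
  if (pid == 0)
    {
      /* Child: wire the pipes to stdin/stdout and run the filter through
         the shell (a real implementation might prepend "stdbuf -oL" here
         when stdbuf is available).  */
      dup2 (in_pipe[0], STDIN_FILENO);
      dup2 (out_pipe[1], STDOUT_FILENO);
      close (in_pipe[1]);
      close (out_pipe[0]);
      execl ("/bin/sh", "sh", "-c", filter_cmd, (char *) NULL);
      _exit (127);
    }
  close (in_pipe[0]);
  close (out_pipe[1]);

  /* Feed one test filename, then wait a couple of seconds for a reply.  */
  if (write (in_pipe[1], "probe.txt\n", 10) < 0)
    return -1;

  fd_set rfds;
  FD_ZERO (&rfds);
  FD_SET (out_pipe[0], &rfds);
  struct timeval tv = { 2, 0 };

  if (select (out_pipe[0] + 1, &rfds, NULL, NULL, &tv) > 0)
    {
      /* Data is ready: the caller can read the transformed probe name
         and keep reusing this filter instance.  */
      *to_filter = in_pipe[1];
      *from_filter = out_pipe[0];
      return 1;
    }

  /* The "alarm" case: mark as buffered, close its stdin so it flushes
     and exits, and fall back to one short-lived process per filename.  */
  close (in_pipe[1]);
  close (out_pipe[0]);
  waitpid (pid, NULL, 0);
  return 0;
}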

For sed, we could try launching with --unbuffered and detecting whether
it returns non-zero or not.

Sadly, it is not specified by POSIX:
http://pubs.opengroup.org/onlinepubs/009695399/utilities/sed.html

Both BSD (FreeBSD) and Solaris sed return non-zero on the unknown
argument "--unbuffered" (return codes 1 and 2, respectively), but it's
the same return code used when regex compilation fails.
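
Something like this could do the runtime check (sketch only; as said,
a failing status is ambiguous with a bad regex, so it only tells us we
cannot rely on the flag):

#include <stdlib.h>

/* Heuristic: GNU sed accepts --unbuffered with an empty script and
   exits 0; BSD and Solaris sed reject the unknown option with status
   1 or 2 (which, unfortunately, is also what a bad regex returns).  */
static int
sed_supports_unbuffered (void)
{
  return system ("sed --unbuffered -e '' </dev/null >/dev/null 2>&1") == 0;
}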


>> Although when continuing a recursive download where most files are
>> already downloaded, it will need to rewrite a lot of filenames in
>> rapid succession, so I wonder if it could trigger some forking rate
>> limit (intended to prevent fork bombs, presumably).
> I dunno. Can't a loop over sed in a shell script produce the same
> problem? I haven't seen that before, myself. There is a "maximum number
> of processes" limit on my GNU/Linux OS, which makes better sense to
> me, since that prevents fork bombs without limiting typical shell usage.

I have seen «Id "x" respawning too fast: disabled for 5 minutes» and
«WARNING: App 'x' respawning too quickly» messages, but I think that
rate limiting was done by init or by the session manager, not by the kernel.


> But the recursive download situation, and possibly a "download from
> localhost" situation, are among the exceptions where such frequent
> spawns would likely become noticeably inefficient.
>
> Although, in such a case, if the files meant to be transformed already
> exist, wouldn't they also already be transformed? In which case they'd
> be redownloaded, in the absence of some sort of database that can map
> original URLs to current files.
wget would have to filter the filenames to see if the files are already
downloaded.
(It would obviously fail if your filter embedded a date or counter in the
filename, but that's your problem for using such a filter.)
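
Roughly (filter_filename here is hypothetical, standing for whatever
runs the user's filter over a single name):

#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical: runs the --name-filter-program over one name and
   returns the transformed name (malloc'ed), or NULL on failure.  */
extern char *filter_filename (const char *name);

/* Transform the name wget would have used and see whether that file
   is already on disk.  A filter embedding a date or counter defeats
   this, as noted above.  */
static int
already_downloaded (const char *candidate)
{
  char *transformed = filter_filename (candidate);
  struct stat st;
  int exists = transformed && stat (transformed, &st) == 0;
  free (transformed);
  return exists;
}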


>>> ...I don't know anything about PCRE, but I'm hoping it has its own
>>> parser for the common "s///" idiom, so Wget wouldn't have to
>>> write/debug our own.
>> I don't think we should allow letters as the separator character,
>> which should fix the issue (inspired by PHP's behavior in the preg_*
>> functions: “Delimiter must not be alphanumeric or backslash”).
> Yes; although if PCRE has its own s/// parser, as I'd hope, this choice
> may be unavailable to us. It'd simply be impossible for them to use s
> as a separator, without also prefixing it with s (if they really
> wanted, they could do ss...s...s). But this is silly. No one here's
> going to spend time on support for a user that's doing that. :)

I don't think it has such a function (programs pass the regex and the
replacement as separate arguments), but it wouldn't be hard to code.
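
Coding it could look roughly like this (untested sketch, names made up):
split on the first unescaped delimiters and pass the pieces to PCRE
separately.

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Find the next unescaped occurrence of DELIM in P, or NULL.  */
static char *
find_delim (char *p, char delim)
{
  for (; *p; p++)
    {
      if (*p == '\\' && p[1])
        p++;                       /* skip the escaped character */
      else if (*p == delim)
        return p;
    }
  return NULL;
}

/* Parse "s<d>pattern<d>replacement<d>flags".  The delimiter <d> is the
   character after the leading 's' and must not be alphanumeric or a
   backslash (the PHP preg_* rule quoted above).  On success, *PATTERN
   owns the buffer; *REPLACEMENT and *FLAGS point into it.  */
static int
parse_subst (const char *expr, char **pattern, char **replacement,
             char **flags)
{
  char delim, *buf, *mid, *end;

  if (expr[0] != 's' || !expr[1])
    return -1;
  delim = expr[1];
  if (isalnum ((unsigned char) delim) || delim == '\\')
    return -1;

  buf = strdup (expr + 2);
  if (!buf)
    return -1;
  mid = find_delim (buf, delim);
  end = mid ? find_delim (mid + 1, delim) : NULL;
  if (!end)
    {
      free (buf);
      return -1;                   /* both closing delimiters required */
    }
  *mid = *end = '\0';
  *pattern = buf;
  *replacement = mid + 1;
  *flags = end + 1;
  return 0;
}

So parse_subst ("s,foo,bar,g", ...) yields "foo", "bar" and "g", which
can then go to pcre_compile() and a small hand-rolled replacer.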


>> The wget-1.10.2.tar.gz example isn't the worst versioned-program
>> transformation. If you had program-2.0.tgz, it would become
>> program-2.1.0.tgz :(
> Yeah, excellent point. That's even less acceptable.
>
> ...Trying to think of a way to still use this model, but avoid that
> problem. Could stop at the first fully numeric component, but then that
> doesn't work for program-2.0c.tgz. Could stop at any component
> containing a number, but that doesn't work for "bz2". Or components
> prefixed with numbers, but I imagine there are file extensions like
> that too.
>
> Didn't want to force the uniquer to have to recognize filetypes, since
> that's a maintenance problem, though in practice it's probably only
> necessary to recognize compression-format extensions, which reduces the
> maintenance issue to some degree.
>
> But I also didn't want Niwt to use Wget's idiom, as it can be
> impractical for downloading things and then viewing them with a web
> browser or what not.
>
> Obviously, the whole point of making the uniquer a separate program is
> that users can work around such issues themselves; but I'd want to
> avoid forcing them to do that wherever feasible.
>
> -mjc
I like the uniquer used by Chromium: it appends the number, in
parentheses, before the extension (but I would start the counter at 2!).

Example:
* attachment-0001.pdf
* attachment-0001 (1).pdf
* attachment-0001 (2).pdf
* attachment-0001 (3).pdf

A list of compression filetypes seems the best way. There are only a few
formats applied as filters (.Z .gz .bz2 .7z .xz), and wgetrc could allow
more to be added (though wget itself would probably gain support for a
new format before most users ran into it).
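
A rough sketch of such a uniquer, starting the counter at 2 and treating
the listed compression suffixes as part of the extension (unique_name is
made up for the example; the list would be the wgetrc-extensible one):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static const char *compress_exts[] = { ".Z", ".gz", ".bz2", ".7z", ".xz" };

/* Where to insert " (N)": before the last dot, or one dot earlier when
   the name ends in a known compression suffix, so that "a.tar.gz"
   splits as "a" + ".tar.gz" rather than "a.tar" + ".gz".  */
static size_t
split_point (const char *name)
{
  const char *dot = strrchr (name, '.');
  size_t i;

  if (!dot)
    return strlen (name);
  for (i = 0; i < sizeof compress_exts / sizeof *compress_exts; i++)
    if (strcmp (dot, compress_exts[i]) == 0)
      {
        const char *p = dot;
        while (p > name && *--p != '.')
          ;
        if (*p == '.')
          dot = p;
        break;
      }
  return (size_t) (dot - name);
}

/* Return NAME if it is free, otherwise "name (N).ext" for the first
   N >= 2 that does not exist yet.  Caller frees the result.  */
static char *
unique_name (const char *name)
{
  size_t split = split_point (name);
  int n;

  if (access (name, F_OK) != 0)
    return strdup (name);

  for (n = 2; ; n++)
    {
      char *candidate = malloc (strlen (name) + 16);
      if (!candidate)
        return NULL;
      sprintf (candidate, "%.*s (%d)%s", (int) split, name, n,
               name + split);
      if (access (candidate, F_OK) != 0)
        return candidate;          /* first free slot wins */
      free (candidate);
    }
}

With this, a clashing program-2.0.tgz becomes program-2.0 (2).tgz rather
than program-2.1.0.tgz, and a.tar.gz becomes a (2).tar.gz.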

