

From: Micah Cowan
Subject: Re: [Bug-wget] [PATCH] New option: --rename-output: modify output filename with perl
Date: Fri, 2 Aug 2013 11:41:20 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

On Fri, Aug 02, 2013 at 11:53:24AM +0200, Tim Ruehsen wrote:
> Hi Dagobert,
> 
> > All this added complexity seems highly overengineered for a feature
> > that is not in the core functionality of the tool and that only a
> > fraction of the users use. Keep in mind: a good tool is one that does
> > a single job right.
> 
> Andrew already answered your mail with a bunch of arguments.

> I very much agree with his writing; your posting puzzled me as well.
> 
> Your above sentence makes me want to say:
> - The new option is not complex.

But it is additional complexity relative to just using sh (which is a
common unix practice, to my mind for good reason), and it seems to me
likely to get more so.  What's stopping us from adding a new specialized
tag for every pet transformation language? We have Perl; will we add
Ruby and Python and (...) when people request it? And if not, why not?
It's (mild) bloat that lends itself towards further increased bloat.
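For concreteness, here is a minimal sketch of what the "just use sh" alternative amounts to (the `--rename-cmd` option name is hypothetical, invented here for illustration; it is not part of the proposed patch):

```shell
#!/bin/sh
# Hypothetical usage: wget would hand each final filename to a user-supplied
# shell command on stdin and read the transformed name back, e.g.:
#   wget --rename-cmd 'sed s/\.php$/.html/' http://example.com/
# The equivalent transformation, done by hand:
name='index.php'
renamed=$(printf '%s\n' "$name" | sed 's/\.php$/.html/')
printf '%s\n' "$renamed"
```

Any transformation language the user prefers (perl, awk, a ruby one-liner) fits behind the same trivial protocol, which is the point: Wget itself never has to know which one.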

I view GNU screen's decision to provide language bindings versus
tmux's decision to be highly sh-scriptable in a similar vein... lots of
extra things to maintain that all do the same thing in different ways
(though in their case, the situation is significantly more work than
here).

> - It is straight forward, and not 'over engineered' or even 'highly over 
> engineered'.

Straightforward doesn't mean "not over-engineered". I wouldn't claim
"highly", but it's still "over-engineered", as I complained in my last
mail. Over-engineered, to me, means expending more effort on working
around or solving a problem than the problem itself warrants. In this
case, yes, the implementation is simple. But it is still more complex
than it needs to be, and invites still further complexity.

> - The added code doesn't interfere with the existing code in a way that you 
> would experience side-effects, if you do not use it.
> - There is no indication that Wget is doing its job worse than before.
> - The new option adds value to Wget by making a core functionality tunable.

I don't know why these points are being made, no one's arguing them. In
particular, the middle point isn't really useful, because ANY new
feature you add to Wget, no matter how you implement it, or even no
matter how buggy it is, is always an improvement over previous versions
of Wget which lacked the new feature in any form whatever. And all these
statements apply to the proposal of "just use sh" at least as easily.

No one's talking about this feature versus not this feature. The
discussion so far is this feature versus a simpler (trivial, and also
trivial-to-maintain) version of it, and one much more common to the Unix
idiom at that.

I'll respond to a couple of points Andrew made else-thread:

(Andrew wrote):
> Different systems have different shells.  When you have to try to escape
> for the system shell, you run into portability problems, and general
> confusion regarding double-escaping.  If you sit in freenode #openssh
> for a while, you can see these problems routinely, resulting from the
> fact that ssh remote commands are executed through the remote system
> shell.

Different UNIX systems have different system shells, all of which are
sh-compatible.  The quoting/escaping rules do not change, unless of
course you are using an extended syntax such as bash's $'...', in which
case you know what you're doing. The only system shell we might have to
deal with that has a truly different quoting syntax would be the Windows
command shell, in the event that we port this feature to the Windows
version (I imagine the implementation for piping to shell processes
would necessarily be different, so it wouldn't be immediately supported,
if ever).

(And as Dagobert pointed out, unlike openssh, you're always using your
local system shell, with which you are presumably familiar.)
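The quoting rule in question is the same on every sh-compatible shell: wrap the string in single quotes and replace each embedded single quote with `'\''`. A small sketch (the `shquote` helper is illustrative, not something Wget provides):

```shell
#!/bin/sh
# Illustrative helper: quote an arbitrary string so it survives one
# round of sh word-splitting unchanged. Works on any POSIX shell.
shquote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

quoted=$(shquote "file with 'quotes' and spaces.html")
# Round-trip the quoted string through the shell to verify it:
eval "printf '%s\n' $quoted"
```

Because the rule is fixed by POSIX, a user who knows it once knows it everywhere; the double-escaping confusion ssh users hit comes from *remote* shells they didn't choose, which doesn't apply here.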

> > > With sed, you still need -u, or else there is a deadlock.  This
> > > knowledge should be embedded into wget because most people don't
> > > have it.
> >
> > You are talking about GNU sed, please keep in mind that wget is
> > portable to systems without or just a subset of the GNU userland.
> 
> Yes, I know.  But those other sed implementations will probably not
> work.  They will just deadlock.

To me, all of this is a strong argument that the default for any sh or
sed protocol, should be to fork a new process for each name (regardless
of which solution we go with). I'd far rather that, than exclude those
with non-GNUish seds, or require embedding unportable constructs on the
part of either Wget or Wget users. The tiny efficiency bonus you might
see by streaming constantly to a process pales in comparison to the
potential issues in supporting such a protocol.
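The fork-per-name variant is trivial to express, and sidesteps buffering entirely: the pipe's write end closes as soon as the single name has been written, so even a fully-buffered, non-GNU sed flushes at EOF. A sketch (the rewrite rule here is just an example):

```shell
#!/bin/sh
# One short-lived sed per filename: no -u flag, no long-lived coprocess,
# no deadlock risk on seds that lack unbuffered mode.
rename_one() {
  printf '%s\n' "$1" | sed 's/[?&=]/_/g'
}

rename_one 'page.php?id=42&lang=en'
```

This is what Wget would do internally per downloaded name if we defaulted to a fresh process per transform.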

And it's worth noting that AFAICT, there IS no efficiency difference in
the average case. If you fork and pipe and write and read with the
process while the download is in operation (after you've obtained the
final network name from redirections, of course), the average page
download is going to take much longer than that whole operation, which
can happen mostly in parallel while waiting for more network data to
arrive.

In solving the buffering problem, an alternative to forking/execing on
every name, but one I personally like less, is to allocate a pty around
the program to force it to use line buffering even if it doesn't have an
explicit option to do so. And such an approach is obviously not
implementable on Windows, if we do port this option there.

Yet another alternative, sort of a compromise between the streamed and
single-line-per-process approaches, would be to batch several names
after we've collected them, send them all through and close the write
end, and then collect the transformations from the program.
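The batching compromise also works with any POSIX sed, for the same reason as fork-per-name: closing the write end delivers EOF, which flushes the child's buffered output. Sketched in shell (where the pipeline closes the write end implicitly once all names are written):

```shell
#!/bin/sh
# Batch several collected names through one sed process, then read all
# the transformed names back after the write end is closed.
names='a.php
b.php
c.php'
printf '%s\n' "$names" | sed 's/\.php$/.html/'
```

In Wget itself this would mean buffering N names, writing them, calling close() on the pipe's write end, and only then reading the N results, one process per batch rather than per name.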

It's worth pointing out that if the alternative approach currently
being discussed - the CGIish content-type-aware method - were adopted,
it would be necessary in all cases to fork/exec a new process for every
transform, since each transformation would take place within a unique
environment.

-mjc


