Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?


From: Dale R. Worley
Subject: Re: [Bug-wget] Shouldn't wget strip leading spaces from a URL?
Date: Wed, 14 Jun 2017 20:49:51 -0400

L A Walsh <address@hidden> writes:
> Dale R. Worley wrote:
>>  But of course, no [RFC3986-conforming] URL
>>  contains an embedded space because that's what it
>>  says in RFC 3986, which is "what *defines* what a
>>  URL *is*"[sic; should read "is one definition of
> a URL.
> ---
>     Right, just like speed limit signs define
> what the maximum speed is.
>
> There is the "model" and there is reality.  To believe that
> the model replaces and/or dictates reality is not
> realistic and bordering on some mental pathology.
>
> I understand what you are saying Dale.  My dad was a lawyer,
> and life would be so much easier if specs, RFCs or other
> models of reality were the only thing we had to pay attention
> to.  But... to do so generally creates various levels of
> discomfort and/or headaches.

There's a reason why the Internet has advanced on the back of thousands
of anal-retentive standards documents.

There really are situations where DWIM (Do What I Mean) design makes
life worse.  It's plausible that in a web browser it's reasonable to
allow users to type in purported URLs that are invalid, and for the
browser to make its best guess as to what the user meant.  This is
because getting the guess wrong rarely causes troubles beyond showing
the user a page that they aren't interested in; the user can just retype
the right URL and get what they wanted.

But every such slackness introduces uncertainty.  If the user types
"http://www.example.com/ " (that is, with a trailing space), should it
be handled as "http://www.example.com/%20" (assuming the user wanted to
access a file whose name is a single space, and providing the URL that
does that) or as "http://www.example.com/" (assuming that the space is a
cut-and-paste error and should be ignored)?
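
To make the two readings concrete, here is a minimal Python sketch
(illustration only, not wget's actual code):

    from urllib.parse import quote

    raw = "http://www.example.com/ "   # trailing space, as typed or pasted

    # Reading 1: the space is part of the path, so percent-encode it.
    as_typed = quote(raw, safe=":/")   # 'http://www.example.com/%20'

    # Reading 2: the space is cut-and-paste debris, so drop it.
    as_meant = raw.strip()             # 'http://www.example.com/'

Neither answer is wrong in the abstract; the point is that a strict tool
never has to choose between them.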

As long as the guessing is monitored directly by the user, this works
reasonably well.  But once the DWIM program starts being used as a
*part* of a system, things get hazardous.  People start building other
parts of the system on the assumption that the DWIM program won't hold
them to the rules.  And since the DWIM program's behavior in those
outside-the-box cases isn't clearly defined, there's no protection when
its guesses change while the rest of the system depends on the
*particular* guesses it used to make.

In the particular case of wget, consider that portions of the URL that
the user enters are extracted and used in the HTTP request.  Again,
there's a strict specification of what constitutes a valid HTTP request.
If the user includes an invalid character in the URL, should wget simply
pass it through into the HTTP request, assuming that a well-built web
server will Do What the User (probably) Meant?
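
To see why pass-through is risky, note that the request line of an HTTP
request uses raw spaces as its own field delimiters, so an unencoded
space in the request-target makes the line ambiguous.  A sketch (the
path is made up, and the request line is hand-rolled purely for
illustration):

    from urllib.parse import quote

    path = "/some file.html"              # hypothetical path containing a space

    bad = "GET %s HTTP/1.1\r\n" % path
    # 'GET /some file.html HTTP/1.1' -- the server cannot tell where the
    # request-target ends and the protocol version begins.

    good = "GET %s HTTP/1.1\r\n" % quote(path)
    # 'GET /some%20file.html HTTP/1.1' -- unambiguous.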

And it should be remembered that there's a design principle of Unix
that's rarely mentioned:  People write a lot of shell scripts for Unix,
and the external interface of Unix commands is optimized for use within
shell scripts, not for being directly executed by users.  That's why
most of them provide no output whatever if their execution is
successful, and why most of them that do generate output provide no
"headers" -- that would get in the way of handing the output to another
program as input.  I've even seen an exercise in a Unix training book
asking the student to explain why the single header line in the output
of the "ps" command is undesirable.

Within that context, the point of wget is to fetch the contents of a URL
that is provided by something else that *should* know what a properly
formed URL is.

Dale


