lynx-dev Query string handling bug vs bad html


From: Doug Kaufman
Subject: lynx-dev Query string handling bug vs bad html
Date: Mon, 27 Apr 1998 07:46:44 -0700 (PDT)

A user of my copy of lynx2.8.1dev6 on rahul.net has complained that
there is a bug in handling ampersands used as delimiters in query
strings in URL's. I replied that the ampersands needed to be escaped
as "&amp;", but he replied that the need for escaping is outside of
the specifications. I would appreciate comments on whether this is a
lynx bug or invalid html. The question is outside of the area of my
expertise. If I gave bad advice, I would like to make a retraction.
Please see the test URL at the end of this message.

Excerpts from the discussions follow:

I wrote:

DK>I updated the version of Lynx available from my directory via the
DK>userbin to version 2.8.1dev6 today. Let me know if anyone has any
DK>problems with it.

He wrote:

AM>Yes, I encountered a SERIOUS problem with it.  I have verified that my
AM>problem is exhibited only by the new version of Lynx.  Old versions
AM>work, as do other browsers.
AM>
AM>The problem is that this new Lynx fouls up a query string contained in a
AM>URL for a CGI invocation.  For example, I have the following link in one
AM>of my web pages:
AM>
AM><a href="http://cgi.unicorn.us.com/cgi-bin/unicorn/mgadmin?getpswd=33536&lg=1">Miami (University of)</a>
AM>
AM>Here's how the new Lynx interprets the URL above:
AM>
AM>http://cgi.unicorn.us.com/cgi-bin/unicorn/mgadmin?getpswd=xc100b~^I6=1
AM>
AM>Note how it mangled the query string.  Needless to say, my CGI
AM>goes off and does something it's not supposed to do.  You can
AM>try it yourself, you won't hurt anything.  Any link on the page
AM>http://unicorn.us.com/guide/programs/a_rg4.html will do this -- you're
AM>supposed to get a log-in prompt (and you do with other browsers) but the
AM>new Lynx will send the CGI off into a "change password" function.
AM> ...
AM>This happens only if you have a page displaying that URL and you step on
AM>it with Lynx.  If you invoke Lynx from the commandline with that URL as
AM>the argument for Lynx (in single quotes so the ampersand won't get
AM>interpreted as a shell background directive), it works fine.  The URL
AM>will get misinterpreted only if it's already displayed on a web page.

I replied:

DK>The problem is that lynx is interpreting the URL according to the html
DK>specification. You have invalid html on your web page, as you can see
DK>with one of the html validators. The ampersand needs to be escaped.
DK>Change the "&" to "&amp;" and all should work well. Let me know if this
DK>doesn't fix the problem. Lynx is not trying to be a validator, but
DK>trying to interpret html accurately means that some invalid html is not
DK>rendered as the author intended.
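
The effect of writing "&amp;" inside an attribute value can be checked with
any HTML parser. The sketch below uses Python's standard html.parser module;
the HrefCollector class and extract_hrefs function are names made up for
this illustration. Python's parser follows later, more lenient HTML rules,
so it leaves the unescaped "&lg=" alone and both forms yield the same URL.
This says nothing about how the SGML-based parser in lynx2.8.1dev6 treats
"&lg"; it only shows that the escaped form decodes to the URL the author
intended.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the decoded href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs holds (name, value) pairs; character references in the
        # value have already been decoded by the parser at this point.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(markup):
    """Return the list of href values found in a piece of markup."""
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


BASE = "http://cgi.unicorn.us.com/cgi-bin/unicorn/mgadmin"

# The link as it appears on the page, and the same link with "&" escaped.
raw = '<a href="' + BASE + '?getpswd=33536&lg=1">Miami (University of)</a>'
escaped = '<a href="' + BASE + '?getpswd=33536&amp;lg=1">Miami (University of)</a>'

raw_hrefs = extract_hrefs(raw)
escaped_hrefs = extract_hrefs(escaped)

print(raw_hrefs)
print(escaped_hrefs)
```

A parser that recognises "&lg" as the start of an entity reference would
decode the first form differently, while the second form is unambiguous
for every parser.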

He replied:

AM>No, my HTML is valid.  This has nothing to do with HTML specs, this is
AM>about URI specs.  WebLint in pedantic mode found nothing wrong with my
AM>URIs.  Neither the W3C validator nor NetMechanic found anything wrong
AM>with my URIs either.
AM>
AM>DK>The ampersand needs to be escaped.  Change the "&" to "&amp;" and all
AM>DK>should work well.
AM>
AM>This ampersand is an argument delimiter in a query string, not a
AM>displayable character.  The URI specification says pretty clearly that
AM>the ampersand is reserved as a delimiter when used within a query string.
AM>
AM>Allow me to quote from "Uniform Resource Identifiers
AM>(URI): Generic Syntax and Semantics", T. Berners-Lee,
AM>R. Fielding, L. Masinter, 18 November 1997, at
AM>http://www.ics.uci.edu/pub/ietf/uri/draft-fielding-uri-syntax-01.txt.
AM>
AM>2.2. Reserved Characters
AM>
AM>   Many URIs include components consisting of or DELIMITED BY, certain
AM>   special characters.  These characters are called "reserved", since
AM>   their usage within the URI component is limited to their reserved
AM>   purpose.  If the data for a URI component would conflict with the
AM>   reserved purpose, then the conflicting data must be escaped before
AM>   forming the URI.
AM>
AM>      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
AM>
AM>   ...Characters in the "reserved" set are not reserved in all contexts.
AM>   The set of characters actually reserved within any given URI
AM>   component is defined by that component. In general, a character is
AM>   reserved if the semantics of the URI changes if the character is
AM>   replaced with its escaped US-ASCII encoding.
AM>
AM>The ampersand is reserved as a delimiter in a query string.  If the data
AM>contained within the query string needs an ampersand because the CGI
AM>expects it, then that ampersand is not being used as a delimiter and
AM>must be escaped.  Here's the description of the query component of the
AM>URI:
AM>
AM>4.3.3. Query Component
AM>
AM>   The query component is a string of information to be interpreted by
AM>   the resource.
AM>
AM>      query         = *uric
AM>
AM>   Within a query component, the characters "/", "&", "=", and "+" are
AM>   reserved.
AM>
AM>Lynx should transmit the query string "as is" without interpreting it.
AM>I know of no spec that says a query string inside a URI is supposed
AM>to be urldecoded by the browser before transmitting it!  Decoding the
AM>query string is the CGI's job.  I maintain that this is a bug in Lynx.
AM>A browser has no business decoding a urlencoded string inside a URI.
AM>
AM>DK>Let me know if this doesn't fix the problem. Lynx is not trying to be
AM>DK>a validator, but trying to interpret html accurately means that some
AM>DK>invalid html is not rendered as the author intended.
AM>
AM>The HTML is valid.  Changing my ampersands to &amp; did cause Lynx to
AM>use the URL correctly.  However, I am suspicious that this fix works
AM>only for Lynx, and not for other browsers.
AM>
AM>In any case, it's pretty clear that this is a bug in Lynx.  It is
AM>interpreting the ampersand the same way throughout the page regardless
AM>of the context.  My web pages conform to the specs, far as I can tell.
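
The two positions above concern two different layers of escaping, which
can be kept apart in code. The following is a minimal sketch using only
Python's standard library; the "name" parameter and its value "A&B" are
invented for the example. Percent-encoding applies to data inside the
query string (the URI layer), while "&amp;" applies to the finished URL
when it is written into an HTML attribute (the markup layer).

```python
from html import escape, unescape
from urllib.parse import parse_qs, urlencode

BASE = "http://cgi.unicorn.us.com/cgi-bin/unicorn/mgadmin"

# URI layer: "&" separates parameters and is left as a delimiter.
query = urlencode({"getpswd": "33536", "lg": "1"})
url = BASE + "?" + query

# URI layer: an ampersand that is part of the data is percent-encoded,
# so it cannot be mistaken for a delimiter. ("name" is hypothetical.)
data_query = urlencode({"name": "A&B"})

# Markup layer: the complete URL is escaped once more for use inside
# an HTML attribute value, turning each "&" into "&amp;".
href = escape(url, quote=True)

# A client reverses the markup layer first, then the server-side
# program splits the query string on the delimiters.
decoded = unescape(href)
params = parse_qs(decoded.split("?", 1)[1])

print(query)
print(data_query)
print(href)
print(params)
```

Under this reading, undoing "&amp;" is part of reading the HTML document,
not a urldecode of the query string: the percent-encoded "%26" passes
through the markup layer untouched.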

I replied:

DK>I am certainly not a html expert. The best official reference that I
DK>could find was:
DK>"http://www.w3.org/MarkUp/html-spec/html-spec_toc.html#SEC3.2.1"
DK>
DK>If you still think that this is a bug in lynx's handling of html, I
DK>would like your permission to post some of this correspondence,
DK>including the URL's in question to the lynx-dev list, so that others
DK>more expert can comment. The question of recognition as delimiter seems
DK>to be in section 9.6 of the SGML ISO specification, which does not
DK>appear to be available on the web.

He replied:

AM>My understanding is that URLs (or more generally URIs) are a separate spec
AM>from HTML.  HTML is a markup language for laying out a document; the
AM>part of HTML having to do with URLs is the <A HREF="..."> tag.  What
AM>goes inside the quotation marks, and how the contents are, is determined
AM>by the URI spec, as far as I know.
AM>
AM>DK> If you still think that this is a bug in lynx's handling of html, I
AM>DK> would like your permission to post some of this correspondence,
AM>
AM>Feel free.  The article I posted in a2i.general I also posted in
AM>comp.infosystems.www.browsers.misc, because some Lynx developers are
AM>known to hang out there.
AM>
AM>DK> including the URL's in question to the lynx-dev list, so that others
AM>
AM>I can do better than that.  I have created a page,
AM>http://unicorn.us.com/testlynx.html to demonstrate the problem.  Just
AM>refer people there.  The link executes a CGI program that displays all
AM>the CGI environment variables, showing that the query string doesn't
AM>pass properly.
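
For readers who want to reproduce such a check locally, here is a rough
sketch of a CGI-style script that reports the query string it received.
It is not the program behind the test page above; the describe_query
function is a made-up helper, and the only assumption about the server is
that it sets the standard QUERY_STRING environment variable.

```python
import os
from urllib.parse import parse_qsl


def describe_query(environ):
    """Return a plain-text report of the raw and parsed query string."""
    raw = environ.get("QUERY_STRING", "")
    lines = ["QUERY_STRING=" + raw]
    # Split on the "&" and "=" delimiters, keeping empty values visible.
    for key, value in parse_qsl(raw, keep_blank_values=True):
        lines.append("  " + key + " = " + value)
    return "\n".join(lines)


if __name__ == "__main__":
    # Minimal CGI response: header block, blank line, then the body.
    print("Content-Type: text/plain")
    print()
    print(describe_query(os.environ))
```

If the browser sends the query string intact, the report lists both the
getpswd and lg parameters; a mangled query string shows up as a single
parameter with an unexpected value.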

                                  Doug

__
Doug Kaufman
Internet: address@hidden (preferred)
          address@hidden
