Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
From: Leonid Pauzner
Subject: Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
Date: Mon, 22 Mar 1999 17:24:36 +0300 (MSK)
21-Mar-99 12:38 Klaus Weide wrote:
> On Sun, 21 Mar 1999, Leonid Pauzner wrote:
>> One certain "problem" I personally ran into is UTF-8 URL encoding:
>> when HREF= has *open 8-bit text*, the remote server (script)
>> may (1) expect such bytes %xx-encoded,
>> but lynx now (2) translates URLs from the document charset to UTF-8
>> and then sends each byte %xx-encoded (an obvious check:
>> the number of %xx-encoded bytes increases).
> But URLs should never *have* unencoded 8-bit chars - and lynx
Right.
> never generates such URLs as a result of form submission (I hope).
Right (we generate %xx-encoded bytes as in (1), including for local file names).
HTML4.0 on syntax of anchor names:
http://www.w3.org/TR/PR-html40/struct/links.html#h-12.2.1
says:
Anchor names should be restricted to ASCII characters. Please consult
the section on representing non-ASCII characters in URLs for more
information.
and that section is under
http://www.w3.org/TR/PR-html40/appendix/notes.html#urls
(below)
So both (1) and (2) should be considered recovery from a broken document.
We usually bypass the problem when Lynx processes both a broken #fragment link
and a broken NAME= target (they get resolved in a consistent way),
but the problem occurs when we deal with one end only
(say, a link to a CGI script).
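
To make the difference between (1) and (2) concrete, here is a minimal
Python sketch (not Lynx code). It assumes a KOI8-R document and a made-up
Cyrillic word in the HREF; the function names are hypothetical.

```python
from urllib.parse import quote

def escape_raw_bytes(text, doc_charset):
    """Interpretation (1): %xx-escape the bytes as they are in the document charset."""
    return quote(text.encode(doc_charset), safe="")

def escape_via_utf8(text):
    """Interpretation (2): transcode to UTF-8 first, then %xx-escape each byte."""
    return quote(text.encode("utf-8"), safe="")

# Hypothetical 8-bit text from an HREF in a KOI8-R page (four Cyrillic letters).
word = "\u0442\u0435\u0441\u0442"
raw = escape_raw_bytes(word, "koi8-r")
utf = escape_via_utf8(word)
print(raw, raw.count("%"))   # one escape per character
print(utf, utf.count("%"))   # two escapes per character
```

A script that expects (1) sees a different, longer byte sequence under (2),
which is the "number of %xx encoded bytes increased" symptom.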
22-Mar-99 12:42 I wrote:
> 21-Mar-99 20:37 Klaus Weide wrote:
>> This means that the user can usually toggle between the two interpretations
>> with -raw / '@'. It's not completely logical that the interpretation
>> of URLs should depend on this. OTOH there's the ease of switching, and
>> it's more likely that encoding the raw value is the right thing (or even
>> possible) when the user's environment is consistent with the server's.
> Completely wrong to overload -raw mode here (to ask user
> to get the document unreadable in order to follow a link),
> it may be switchable like "dsoft-quotes" instead.
Now I think we may overload "dsoft-quotes" to toggle between the
two interpretations; the original purpose of this key is to work around
a bug in HTML anchors that is very close to the problem discussed.
(One should decide which "interpretation" is "standard"
and which is a workaround.)
I haven't come up with a patch yet, but here are some references FYI:
HTML4.0, Lynx/2.7.2 CHANGES and Lynx/2.8 CHANGES.
***** HTML 4.0
The following notes are informative, not normative.
B.1 Representing non-ASCII characters in URLs
We recommend the following convention for representing non-ASCII
characters in URLs: each character is represented in UTF-8 (see
[RFC2044]) as one or more bytes and these bytes are then escaped with
the URL escaping mechanism (converting each byte to %HH, where HH is
the hexadecimal notation of the byte value).
This procedure results in the same syntactically legal URL according
to [RFC1738] or [RFC2141] and independent of the character encoding to
which the HTML document carrying the URL may have been transcoded.
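
The convention above can be sketched in a few lines of Python. This is only
an illustration under the assumption that the URL is already held as a
Unicode string; the function name and the sample URL are hypothetical.

```python
def escape_non_ascii(url):
    """Replace each non-ASCII character by the %HH escapes of its UTF-8 bytes.

    ASCII characters, including reserved ones and existing %HH escapes,
    are left untouched.
    """
    out = []
    for ch in url:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)

href = "http://host/caf\u00e9?x=1"   # hypothetical URL containing U+00E9
print(escape_non_ascii(href))        # http://host/caf%C3%A9?x=1
```

Because the input is a decoded string, the result does not depend on the
charset the document was transmitted in.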
Note. The procedure above doesn't guarantee that UTF-8 can be used in
all schemes or on all resources of a scheme. The producer of a URL
(usually the HTML author) is responsible for ensuring that this works
for the URL in question, or using another notation (with %HH escapes
not corresponding to UTF-8 if necessary) to address the resource in
question.
Note. Some older user agents trivially process URLs in HTML using the
bytes of the character encoding in which the document was received.
Some older HTML documents rely on this (illegal) practice and break
when transcoded. User agents that want to handle these older documents
should, on receiving a URL containing characters outside the legal
set, first use the conversion based on UTF-8. Only if the resulting
URL does not resolve should they try constructing a URL based on the
bytes of the character encoding in which the document was received.
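
The two-step recovery in this note might look roughly like the following
Python sketch. The `resolves` predicate is a hypothetical stand-in for an
actual fetch attempt, and the helper names are made up for illustration.

```python
def escape_with(url, charset):
    """%HH-escape every non-ASCII character using its bytes in `charset`."""
    return "".join(
        ch if ord(ch) < 128
        else "".join("%%%02X" % b for b in ch.encode(charset))
        for ch in url
    )

def resolve_with_fallback(url, doc_charset, resolves):
    """Try the UTF-8 based URL first, then the document-charset based one.

    `resolves` is a caller-supplied predicate reporting whether a candidate
    URL can be fetched. Returns the first candidate that resolves, or None.
    """
    for charset in ("utf-8", doc_charset):
        try:
            candidate = escape_with(url, charset)
        except UnicodeEncodeError:
            continue  # character not representable in this charset
        if resolves(candidate):
            return candidate
    return None

# Stand-in for a legacy server that only knows the Latin-1 form of the path.
known = {"http://host/caf%E9"}
print(resolve_with_fallback("http://host/caf\u00e9", "iso-8859-1", known.__contains__))
```

The cost of this scheme is an extra failed request for every legacy link,
which is part of why a user-switchable toggle is being discussed instead.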
Note. The same conversion based on UTF-8 should be applied to anchor
names as appearing in the name attribute of the A element.
Note. The URL that is constructed when a form is submitted may be used
as an anchor-style link (e.g., the href attribute for the A element).
Unfortunately, the use of the "&" character to separate form fields
interacts with its use in SGML attribute values to delimit character
entity references. For example, to use the URL "http://host/?x=1&y=2"
as a linking URL, it must be written <A
href="http://host/?x=1&#38;y=2"> or <A
href="http://host/?x=1&amp;y=2">. HTTP server implementors, and in
particular, CGI implementors are encouraged to support the use of ";"
in place of "&" to save authors the trouble of escaping "&" characters
in this manner.
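
As a small illustration of the "&" escaping this note asks for, Python's
standard `html` module can produce and undo the attribute form. The URL is
the one from the note; the variable names are arbitrary.

```python
import html

url = "http://host/?x=1&y=2"

# Escape "&" (and quotes) before placing the URL in an attribute value.
attr = html.escape(url, quote=True)
print('<A href="%s">' % attr)

# A parser reading the attribute gets the original URL back,
# whether the author used the named or the numeric reference.
print(html.unescape(attr) == url)
print(html.unescape("http://host/?x=1&#38;y=2") == url)
```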
***** Lynx/2.8
1997-09-27
...
* Non-ASCII characters in URLs and similar strings encountered in the HTML.c
processing (previously handled by LYUnEscapeToLatinOne) are now generally
URL-encoded, instead of doing this just for 8-bit characters which are
the result of entity expansion. There is no clear standard definition what
non-ASCII characters in URLs in HTML attributes (HREF etc.) actually mean,
especially if the transmission character encoding is something else than
iso-8859-1. Leaving them as the raw byte values as received runs against
the HTML i18n view that the transmission encoding is distinct from the
document character set and has to be (conceptually at least) decoded before
SGML parsing. It also won't work in general for entities that expand to
Unicode characters which cannot be expressed at all in the currently
effective (or assumed) charset, and would lead to problems with displaying
URLs on the statusline or representing them in auxiliary screens or bookmark
files. So now we try to first transform to the document charset "as usual"
(undo the transmission encoding), then translate the Unicode value into a
sequence of (one or more) byte values which are then URL-encoded. Since
character values > 255 cannot be expressed in a byte, always use UTF-8
for them. It may not be what the author intended, but should be at least
consistent between internal (fragment) HREFs and NAME (or ID) attributes
in the same document or set of documents. Since this is dealing with
bytes currently disallowed in URLs, it falls under error recovery. But
the handling should be roughly in line with current Internet Drafts
(draft-masinter-url-i18n-00.txt, draft-duerst-query-i18n-00.txt,
draft-ietf-ftpext-intl-ftp-02.txt).
For character values < 256 (but > 127) this isn't currently consistently
done, we may still be URL-escaping the byte value without UTF-8 encoding.
- KW
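
The inconsistency mentioned in the last sentence can be pictured with a
short Python sketch. This is not Lynx code; it assumes the character has
already been decoded to a Unicode code point, and the `utf8_below_256` flag
is an invention of this sketch, not a Lynx option.

```python
def escape_code_point(cp, utf8_below_256=False):
    """%HH-escape one code point roughly as the entry above describes.

    ASCII is returned as-is. Values above 255 are always written as UTF-8.
    Values 128..255 are written as a single byte unless `utf8_below_256`
    is set.
    """
    if cp < 128:
        return chr(cp)
    if cp < 256 and not utf8_below_256:
        data = bytes([cp])
    else:
        data = chr(cp).encode("utf-8")
    return "".join("%%%02X" % b for b in data)

print(escape_code_point(0xE9))                       # %E9 (byte value)
print(escape_code_point(0xE9, utf8_below_256=True))  # %C3%A9 (UTF-8)
print(escape_code_point(0x0442))                     # %D1%82 (always UTF-8)
```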
***** Lynx/2.7.2
1997-10-06
...
* Made LYExpandString(), LYUnEscapeEntities() and LYUnEscapeToLatinOne()
simpler, added better comments, and modified LYUnEscapeToLatinOne() so
that it uses hex escaped UTF-8 multibytes for characters outside the
ASCII range (may need mods when standards for internationalization of
URLs and MIME headers are finalized). These functions still expect
strings in the charset of the input stream, with only invalid control
characters removed, and still parallel the conversions done in SGML.c
and HTPlain.c, within the context of the HTML parser's (Utterly Tag and
Attribute Soup :) settings and the display character set options. They
do not URL encode any ASCII characters, except for ESC in CJK escape
sequences when the flag to do that is set, to avoid possible double
encoding. - FM