lynx-dev
Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)


From: Klaus Weide
Subject: Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
Date: Sat, 27 Mar 1999 02:53:24 -0600 (CST)

On Sat, 27 Mar 1999, Leonid Pauzner wrote:
> 25-Mar-99 09:17 Klaus Weide wrote:
> > On Thu, 25 Mar 1999, Leonid Pauzner wrote:
> >> Something like this (will try later, probably we should restrict
> >> expanding of &entities; or so).  Think about renaming
> >> TRANSLATE_AND_UNESCAPE_TO_STD() to TRANSLATE_HREF_ATTRIBUTE()
> 
> > But entities have to be expanded - all those that mean allowed 7-bit
> > characters are perfectly valid.  And those that mean other characters
> > should also be expanded, for consistency.
> For consistency with what? 

Consistency between having those characters in entity form or NCR form
on one hand and having them in "raw" form on the other hand.  Where
"raw" means not REALLY raw (there is no such thing - what is
transmitted and received is octets, not characters), but encoded as
specified by a charset (either explicit or assumed).  And since
"utf-8" is one of the supported character encodings (charsets), we
already have one charset that can contain all the characters that can
be expressed by entities anyway.

> "Other" entities will be expanded to utf-8
> and we get a mixture of 8bit characters and utf-8 multibytes
> in a single word - completely unrecoverable mess.

There are two kinds of "other" entities: those that can be translated
to the "target" charset and those that cannot.  By target charset I
mean the representation that subsequently gets hex-encoded - presumably
either utf-8 or the charset of the document where the link is found.

If target charset = utf-8, there is no ambiguity.  (This also means that
if document charset = utf-8, then there never is ambiguity.)  If the "other"
entity can be translated to the (non-utf-8) target charset, then it should
be.  We have a problem only if the "other" entity cannot be translated to
the target charset.  This case is going to be extremely rare and may never
occur in practice.  In this case we might just as well generate utf-8, or
leave the entity as is, or whatever - especially since all this stuff is
Doing Questionable Transformations On Invalid URLs For Compatibility With
Broken Sites anyway.
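
To make the three cases above concrete, here is a rough sketch in Python
(not Lynx code).  The function name escape_href and its parameters are
made up for illustration, and falling back to utf-8 in the untranslatable
case is only one of the options mentioned above.  It also encodes one
character at a time, which would not be right for stateful charsets.

```python
import html


def escape_href(href: str, target_charset: str) -> str:
    """Expand entities/NCRs, then hex-encode every non-ASCII character.

    Hypothetical helper: a character is encoded in target_charset when
    that is possible and in utf-8 otherwise (the rare, ambiguous case).
    ASCII characters are passed through untouched.
    """
    # "&eacute;", "&#233;" and a raw e-acute all become the same character.
    expanded = html.unescape(href)
    out = []
    for ch in expanded:
        if ord(ch) < 128:
            out.append(ch)
            continue
        try:
            octets = ch.encode(target_charset)
        except UnicodeEncodeError:
            # Not representable in the target charset: fall back to utf-8.
            octets = ch.encode("utf-8")
        out.append("".join("%%%02X" % b for b in octets))
    return "".join(out)


if __name__ == "__main__":
    print(escape_href("caf&eacute;", "iso-8859-1"))
    print(escape_href("caf\u00e9", "iso-8859-1"))
    print(escape_href("caf&#233;", "utf-8"))
```

With target charset = utf-8 the except branch is never taken, which is
the "no ambiguity" case.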

But we should treat non-ASCII characters in entities in HREFs (etc.) the
same way as "raw" non-ASCII characters whenever we can.  Think of a simple
useful tool (text filter) that does the following:
 - Given an input text and its charset, converts all non-ASCII characters
   to NCRs [at least in all places where an HTML spec says that they are
   equivalent].
Lynx should treat the text resulting from this transformation the same
way it would handle the untransformed text.  Even w.r.t. Doing Questionable
Transformations On Invalid URLs, IMO.
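
A minimal sketch of such a filter in Python, just to show how little is
involved.  The name to_ncrs is made up, and this version is cruder than
the tool described above: it converts non-ASCII characters everywhere in
the input, without checking whether an HTML spec treats the NCR as
equivalent at that place.

```python
def to_ncrs(raw: bytes, charset: str) -> str:
    """Decode raw octets using charset, then replace every non-ASCII
    character with a decimal numeric character reference (NCR)."""
    text = raw.decode(charset)
    # The "xmlcharrefreplace" error handler emits &#NNN; for each
    # character that the ascii codec cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ncrs(b'<a href="caf\xe9.html">', "iso-8859-1"))
```

The same link given as utf-8 octets produces the same output, which is
the point: after the filter, the original charset no longer matters.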

    Klaus
