lynx-dev

Re: lynx-dev HTML4.0 and default charset


From: Klaus Weide
Subject: Re: lynx-dev HTML4.0 and default charset
Date: Mon, 1 Mar 1999 01:54:15 -0600 (CST)

Please excuse the length of this diatribe...

On Sat, 27 Feb 1999, Alan J. Flavell wrote:

> In HTML4.0 section 5.2.2 there's a rather odd sentence; I'd better
> quote the complete paragraph so as not to take it out-of-context:

I find all three sentences rather "odd" at first look, to put it mildly.

>  The HTTP protocol ([RFC2068], section 3.7.1) mentions ISO-8859-1
>  as a default character encoding when the "charset" parameter is
>  absent from the "Content-Type" header field. In practice, this
>  recommendation has proved useless because some servers don't allow a
>  "charset" parameter to be sent, and others may not be configured to
>  send the parameter. Therefore, user agents must not assume any
>  default value for the "charset" parameter.

You honour that text by calling it (part of) a "spec" below - but it is
so diffuse that it hardly deserves that honour.  Nor does it extend that
same honour to the HTTP 1.1 RFC.

It begins by stating that RFC2068 "mentions" something, and then
later calls that a (implied: mere) "recommendation".  In fact,
section 3.7.1 of RFC2068 has something more than a recommendation.
It *specifies*:
                                               When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP.

I would like to be able to say that it *clearly* specifies; unfortunately
it uses the somewhat diffuse term "_default_ charset value".  Any
specification of a "default", however definitive it may be, isn't really
enough for a clear determination of what's finally in effect, unless
it's also specified under which conditions this default applies (and
what may override it).

At any rate, the requirement for information *providers* is clear in the
next sentence:
                      Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value.
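
A minimal sketch, in Python, of the literal reading of these two
RFC2068 sentences.  The name effective_charset is made up for
illustration, and the standard library's email header parser is used
only as a convenient way to extract the charset parameter; nothing in
RFC2068 prescribes this code.

```python
from email.message import Message

# Default named by RFC2068 section 3.7.1 for "text" subtypes received via HTTP.
HTTP_TEXT_DEFAULT = "iso-8859-1"


def effective_charset(content_type):
    """Return the charset implied by an HTTP Content-Type value.

    Hypothetical helper: an explicit charset parameter wins; otherwise
    "text/*" falls back to the RFC2068 default; other types get None.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lowercased, or None if absent
    if explicit is not None:
        return explicit
    if msg.get_content_maintype() == "text":
        return HTTP_TEXT_DEFAULT
    return None


print(effective_charset("text/html; charset=KOI8-R"))
print(effective_charset("text/html"))
print(effective_charset("image/png"))
```

Whether a client may substitute something else in the second case is
exactly the question the rest of the section leaves open.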

The final two paragraphs go into more detail and incorporate some real
world experience.  We may hope to find some clarification here about
what "default charset value" means.  First there is

   Some HTTP/1.0 software has interpreted a Content-Type header without
   charset parameter incorrectly to mean "recipient should guess."
   Senders wishing to defeat this behavior MAY include a charset
   parameter even when the charset is ISO-8859-1 and SHOULD do so when
   it is known that it will not confuse the recipient.

This contains the only part of the section that could be called a mere
"recommendation".  But it only recommends *server* (information provider)
behaviour, it doesn't talk about client behaviour except by mentioning
"some" software's behaviour as motivation for the recommendation.

There is more about older clients' "unfortunate" behaviour in the final
paragraph, then requirements (for the first time, really) for HTTP/1.1
clients:

   Unfortunately, some older HTTP/1.0 clients did not deal properly with
   an explicit charset parameter. HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document.

The client requirement is clear for the case where there is an explicit
charset value in the Content-Type header.  (One could quibble about
the exact meaning of "initial[ly] displaying" though.)  There is no
clear prescription for the case of a missing charset value, but the last
sentence implies that user agents (at least some class of them) are
allowed to override the default value "ISO-8859-1" defined above.
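
To make the two cases concrete, here is a sketch in Python of one
possible reading.  The function and parameter names are hypothetical.
Only the first branch is what the RFC text requires; the other
branches follow the interpretation given above (a guess may override
the default when no label is present), and the treatment of a labeled
but unsupported charset is an assumption, since the quoted text says
nothing about it.

```python
HTTP_TEXT_DEFAULT = "iso-8859-1"


def initial_display_charset(labeled, supported, guess=None):
    """Pick the charset for the initial display of a text document.

    labeled   -- charset from the Content-Type header, or None
    supported -- charsets this (hypothetical) client can handle
    guess     -- the client's own guess or preference, or None
    """
    supported = {name.lower() for name in supported}
    # Required by RFC2068: a supported label beats the recipient's preference.
    if labeled and labeled.lower() in supported:
        return labeled.lower()
    # Not prescribed: no usable label, so fall back to a guess (assumption).
    if guess and guess.lower() in supported:
        return guess.lower()
    # Otherwise the default that section 3.7.1 defines.
    return HTTP_TEXT_DEFAULT


print(initial_display_charset("KOI8-R", {"koi8-r", "iso-8859-1"}, guess="iso-8859-2"))
print(initial_display_charset(None, {"iso-8859-1", "iso-8859-2"}, guess="ISO-8859-2"))
```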

So it remains unclear just what the default value of "ISO-8859-1" means,
and under which circumstances it applies.  One could speculate that, by
omission, it is definitive for all user agents without a 'provision to
"guess"' - but it remains ultimately unspecified.  Oddly the sentence
only talks about "guessing" and not other mechanisms like configuration
by the user, but maybe one is to regard that as a form of "guessing",
too.  [ Since RFC2068 is the reference explicitly mentioned, I am not
comparing against the latest HTTP/1.1 draft; it may have done something
to clarify - or obfuscate - matters more. ]

The most commonsensical interpretation of this ambiguity seems to me to
be: a client that wishes to allow something other than "ISO-8859-1" as
assumed charset is free to do so, employing "guessing" or some other
mechanism that could be construed as guessing in the widest sense;
a client that can't be bothered about implementing more appropriate
"assume charset" mechanisms should assume "ISO-8859-1".  Effectively,
as far as prescribed behaviour goes, "do what you want".

Maybe my characterization above of HTML4.0's use of RFC2068 was a bit
too harsh then.  3.7.1 *can* be interpreted as a recommendation to
assume an "ISO-8859-1" charset value - it just seems to me it's either
less or more than just that.  But, looking at the next sentence from
HTML4.0 (to quote again):
>                                               In practice, this
>  recommendation has proved useless because some servers don't allow a
>  "charset" parameter to be sent, and others may not be configured to
>  send the parameter.

I completely fail to follow.  Neither of the two reasons given is
adequate to prove a potential "ISO-8859-1" charset assumption useless.
There's a lot of things one or another specific server or browser
cannot do - that has never been a reason to declare some protocol
or protocol feature or HTML "feature" useless, AFAIK.  Those who
are able to use it can, those who aren't able have to come up with
something else - hasn't it always been like that?  Things might be
different if an overwhelming majority of servers were unable to send
charset parameters, but that's definitely not the case.

Well, there are the qualifying words "In practice".  So the alleged
uselessness apparently doesn't follow directly from the two reasons
given, but from practical experience which in turn is explained in
terms of those two reasons.  Taken this way, I would concede that the
"ISO-8859-1" charset assumption indeed does not work (i.e. gives
wrong results) for some people / in some situations / with some web
sites.  The predominant reasons for that are more likely laziness,
ignorance and disregard (whether innocent or wilful) of previous
specs, "but it works here" shortsightedness, and other "social"
reasons, rather than a genuine lack of server support.  All of
these reasons lead to violations of the *server* requirement of the
HTTP specs - which then, in turn, make the "ISO-8859-1" assumption
wrong.

But in a large number of situations, whether or not that is currently
still the majority of web sites, the "ISO-8859-1" assumption still
*is* the right one.  That alone contradicts the claim that it is
"useless".

On to the conclusion HTML4.0 draws:
>                      Therefore, user agents must not assume any
>  default value for the "charset" parameter.

> Now, that last sentence, on the face of it, forbids a browser to
> assume a default value for the "charset" parameter.  In order to
> solve the problem it has just described, however, the browser clearly
> must have a user-configurable "charset default", to be used when the
> incoming document lacks one.
> 
> Paradox?

On the face of it, that sentence makes no sense - at least for some
understanding of the words "assume" and "default".

Of course a user agent *has to* assume something for the charset if
that isn't specified (unless we want to return to the dark ages of
"just put it on the screen, maybe it makes sense to the user" - and
even that implies the assumption that charset matches display font).
"Applying a default" is just another way of saying "assume a charset",
in this situation, unless my understanding of basic terms is completely
wrong.  Therefore forbidding a default value would be nonsensical.

But there are defaults, there are defaults for defaults, and maybe
defaults for defaults for defaults - and so on for a while.
I think all the confusion comes from not specifying the "level of
defaulting" to which the sentence applies.

The first level of default is the charset value assumed by the
browser in an actual situation, say when parsing the headers of a
specific HTTP response without charset, and actually used in
rendering the document.  This first level default value may be
specified or not - if not specified, the value defaults to the
second level default.  For a simplified example, -assume_charset 
allows the lynx user to specify the 1st level default; if
-assume_charset isn't given, the value defaults to ASSUME_CHARSET
in lynx.cfg.  If that is not given, it defaults to assuming
"ISO-8859-1".  (Actually it's still more complicated - toggling
"raw mode" may be viewed as introducing yet another level.)

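The simplified chain just described could be sketched like this in
Python.  This is not Lynx's actual code: resolve_charset and its
parameters are invented for illustration, raw mode is ignored, and an
explicit charset from the HTTP header is assumed to take precedence
over every level of default.

```python
# Last-resort value, as in the simplified description above.
FACTORY_DEFAULT = "iso-8859-1"


def resolve_charset(header_charset=None, cli_assume=None, cfg_assume=None):
    """Return (charset, level) for the first level that supplies a value.

    header_charset -- charset parameter from the response, if any
    cli_assume     -- value of the -assume_charset option, if given
    cfg_assume     -- value of ASSUME_CHARSET in lynx.cfg, if set
    """
    levels = [
        ("HTTP header", header_charset),
        ("-assume_charset", cli_assume),
        ("lynx.cfg ASSUME_CHARSET", cfg_assume),
        ("built-in default", FACTORY_DEFAULT),
    ]
    for level, value in levels:
        if value:
            return value.lower(), level


print(resolve_charset())
print(resolve_charset(cfg_assume="iso-8859-2"))
print(resolve_charset(cli_assume="koi8-r", cfg_assume="iso-8859-2"))
```

Each level is a "default" only relative to the one before it, which is
why a blanket ban on "assuming a default" needs to say which level it
means.
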
Now, is "assuming a default" the same as "having a default", or is it
rather the same as "having a default for a default" or "defaulting
the assumption"?

Which of these levels of defaulting is HTML4.0 trying to forbid?
Does the requirement apply to the "vendor" (There shall be no factory
defaults), to the programmer (Don't allow users to save a default), or
to the installer (Don't supply users with a default)?

> Of course, Lynx has such a configurable default (as do other WWW
> browsers, for sure); but it seems to me that every one of them has
> some initial selection of this default, and therefore must be rated
> non-compliant with HTML4.0 on the grounds that it 'assumes a default
> value for the "charset" parameter', the very thing which the spec
> forbids.

You seem to have interpreted the HTML4.0 sentence in a specific way.
As I understand it, your interpretation amounts to answering my
questions of the previous paragraph with "It is trying to forbid the
lynx.cfg level of defaulting", and "The requirement applies to the
installer" (since that is who sets the configurable default) (or it
applies to the programmer, for allowing the installer to do so).

> Perhaps the spec was worded infelicitously - maybe it meant to forbid
> a client to have a fixed default setting, but must make it user
> configurable.  I don't know, but, for now, it says what it says.

I don't know what it is really supposed to mean.  There are too many
possible interpretations.  We can only guess, or try to eliminate
most interpretations by some continuing process of reductio ad absurdum.

Whatever it *is* trying to say, it doesn't say it clearly.

> Aside from the fact that I'm probably being pedantic (nothing new in
> that),

Have I beat you? :)

> is there some logical way of making a browser that both
> conforms to the last sentence _and_ allows the reader to solve the
> indicated problem?

Assuming we knew at what level of defaulting the sentence is
to apply, and assuming it is not the first level - what *could* a
browser do to conform, if it needs a default value but there is none?

 - Ask the user what to assume for each individual document.
 - Ask the user to supply a default (for a level of defaulting that
   *is* allowed - which could mean, for the session only).
 - Ask the user to ask the system administrator to supply a default
   (if that's what is allowed but a hardwired factory default isn't).
 - Refuse to handle the document; for example, crash.
 - Refuse to handle the document *as text*, or in any way that requires
   assuming a charset; for example, go into binary download mode.
 - Prevent the situation from occurring in the first place.  For example
   refuse to install successfully or to start up unless a default at
   an allowed level has been supplied.

These range from the ridiculous to the merely heavy-handed, but I can't
come up with anything better.  For now my guess is that nobody has made
a general-purpose browser that conforms to that sentence in *any*
reasonable interpretation.  It may be possible logically, but who cares
if the result would be unacceptable to most users or installers?

    Klaus
