lynx-dev

Re: LYNX-DEV cp1252 (shudder)


From: Alan J. Flavell
Subject: Re: LYNX-DEV cp1252 (shudder)
Date: Wed, 19 Nov 1997 08:39:11 +0000 (GMT)

On Tue, 18 Nov 1997, David Woolley wrote:

> > A question was raised as to whether these horrible MS-created quasi HTML
> > documents that use cp1252 (whether overtly or not), that one finds so
> 
> The overt ones are technically legal, but may require an HTML 4.0 doctype,

As 8-bit characters, their meaning is defined by the HTTP protocol; the
HTML DTD comes too late to address such matters.  As Numeric Character
References, &#number;, they are technically invalid, period.  But we
know what MS are trying to achieve, don't we?

> > often on the web, could get their curly quotes presented as ordinary
> > quotes.  It naturally occurred to me to try this in a development version
> > of Lynx, to see if this is a supported combination.
> 
> Most of these curly quotes from FrontPage are undefined entities.
> The HTML decode rules are that the HTTP character set is resolved
> to canonical form (originally 8859, but Unicode for HTML 4) before
> the entities are processed.  

I'm sure this is all common knowledge to those doing the character tables.
My point was this: Lynx already uses "approximations" for representing
characters that it understands but that are not in the repertoire of the
selected output (terminal Charset) encoding.  Do not confuse the
document's charset with the terminal Charset; they are very different
things.  And NCRefs are different again, in theory.  (Maybe it is
correct for Lynx to point those up as illegal; I was merely making a
practical suggestion.  People rarely accuse me of lack of pedantry!)
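
To illustrate the distinction between the document's charset and the
terminal Charset, here is a minimal sketch in Python.  It is not Lynx
code: the APPROXIMATIONS table and both function names are hypothetical,
and the only library behaviour assumed is that of Python's standard
codecs.  The document's charset is used once, to decode the incoming
bytes; the terminal Charset only decides which characters need an
approximation on output.

```python
# Hypothetical approximation table (an assumption for illustration,
# not Lynx's actual character tables).
APPROXIMATIONS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def decode_document(raw: bytes, document_charset: str) -> str:
    """Interpret the 8-bit data according to the document's charset."""
    return raw.decode(document_charset)


def render_for_terminal(text: str, terminal_charset: str) -> str:
    """Keep characters the terminal can show; approximate the rest."""
    out = []
    for ch in text:
        try:
            ch.encode(terminal_charset)
            out.append(ch)
        except UnicodeEncodeError:
            # Not in the terminal's repertoire: fall back to an
            # approximation, or "?" when the table has none.
            out.append(APPROXIMATIONS.get(ch, "?"))
    return "".join(out)
```

For example, render_for_terminal(decode_document(b"\x93hi\x94", "cp1252"),
"iso-8859-1") yields the text with ordinary double quotes, while the same
call with a "utf-8" terminal leaves the curly quotes untouched.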

> Numeric entities are then interpreted as
> code points in the canonical character set.  145 and 146, etc. are not
> defined code points in Unicode.  

Of course.  Nevertheless, MS software creates these illegal and
meaningless representations, and it would be obtuse to claim that we don't
know what they intend by it.  I wasn't asking for an explanation of what
they do or don't mean in HTML, but making a suggestion for a practical
way of dealing with them, either as 8-bit characters - which would be
perfectly legal if charset=cp1252, in fact; or as NCRefs - which is, we
agree, invalid HTML, but we still may discuss how to deal with it, may
we? 
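
One possible shape for the NCRef half of that suggestion, again as a
hedged Python sketch rather than anything Lynx does: numeric references
in the range 128 to 159, which are meaningless in HTML, are leniently
read as cp1252 code points, and everything else is treated as an
ordinary code point.  The names NCR_PATTERN, _resolve and lenient_ncrs
are made up for this example, and only decimal references are handled.

```python
import re

# Decimal numeric character references such as &#145;
NCR_PATTERN = re.compile(r"&#([0-9]+);")


def _resolve(match):
    code = int(match.group(1))
    if 0x80 <= code <= 0x9F:
        # Invalid in HTML, but the intent is usually cp1252.
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # cp1252 leaves a few positions (e.g. 129) undefined;
            # keep the reference text as it was.
            return match.group(0)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Number is outside the Unicode range; leave it alone.
        return match.group(0)


def lenient_ncrs(markup):
    """Replace numeric references, reading 128-159 as cp1252."""
    return NCR_PATTERN.sub(_resolve, markup)
```

With this, lenient_ncrs("&#145;x&#146;") gives the letter x between left
and right single curly quotes, which an output stage could then
approximate with ordinary quotes for terminals that lack them.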

best regards