lynx-dev

Re: LYNX-DEV minor display problem (?character 0xA2?)


From: Klaus Weide
Subject: Re: LYNX-DEV minor display problem (?character 0xA2?)
Date: Fri, 2 May 1997 15:55:43 -0500 (CDT)

On Fri, 2 May 1997, Foteos Macrides wrote:
> Hynek Med <address@hidden> wrote:
> >On Fri, 2 May 1997, Klaus Weide wrote:
> >> On Fri, 2 May 1997, Bela Lubkin wrote:
> >> > An hour later: it's because character 0xA2 is eventually being
> >> > translated to 0x9b on output.  The SCO ANSI console takes 0x9b as CSI,
> >>                                      ^^^^^^^^^^^^^^^^
> >> The IBM PC character set (cp437) contains a visible character at that
> >> position.  [...]
> >
> >Or perhaps we need another option [...]
> >:-) 
> 
>       The bank of 8-bit control characters always is illegal for
> text/html.  ALWAYS, ALWAYS, ALWAYS.  

They (code points in the range 128-159) are illegal for
"text/html;charset=iso-8859-N" where N=1..10.
They are also illegal in the "document character set" in the SGML sense,
for all known HTML versions, and that is why numeric character references
like "&#153;" are illegal.
But they may or may not be legal in HTML documents AS TRANSMITTED OR
STORED if they use a different "charset", because that is supposed to
be an ENCODING of the real thing.
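
A small sketch of that distinction, in Python rather than anything from
the Lynx sources (the helper name is_legal_ncr is made up for
illustration). It assumes the cp437 charset mentioned earlier in the
thread: the byte 0x9B falls in 128-159 as transmitted, yet it encodes a
character whose document code point (0xA2) is outside that range.

```python
# Hypothetical illustration; not taken from Lynx.
C1_RANGE = range(128, 160)  # code points 128-159


def is_legal_ncr(code_point):
    """Return False for a numeric character reference into 128-159."""
    return code_point not in C1_RANGE


# As transmitted under charset=cp437, the byte 0x9B is a visible
# character. Decoding it yields the document character U+00A2.
decoded = b"\x9b".decode("cp437")

print(is_legal_ncr(153))   # the "&#153;" case from the text above
print(hex(ord(decoded)))   # document code point of the decoded byte
```

So the byte value as transmitted and the code point in the document
character set are separate questions.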

> [...] Lynx still does a single
> pass through the stream, and thus uses a state machine to juggle the
> "de-encoding" and "de-encoded parsing" at the same time.  

Still true for the chartrans code in the devel Lynx.  That's why attribute
values, for example ALT= text, don't get treated the same way as normal
PCTEXT.

According to the HTML Internationalization RFC 2070, the reference
processing model is
 
   [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                          [manager]  [parser]
 
but "An actual implementation may choose, or not, to translate the
   document into some encoding of the document character set as
   described above; the behaviour described by this reference processing
   model can be achieved otherwise."

Lynx doesn't have a separate "decoder" or "entity manager", so those
functions are either also handled in the "parser" in SGML.c or deferred
to later processing.
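
For contrast, here is a rough sketch of what keeping the first stages
separate might look like. It is Python, not Lynx's C code; the function
names are invented for illustration, and the second stage only resolves
decimal character references as a stand-in for the stages after the
decoder in the reference model.

```python
import re


def decoder(raw, charset):
    """Stage 1: turn transmitted bytes into document characters."""
    return raw.decode(charset)


def resolve_char_refs(text):
    """Stand-in for later stages: expand decimal references like &#162;."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


def reference_model(raw, charset):
    """Run the stages one after another, as in the reference model."""
    return resolve_char_refs(decoder(raw, charset))


# The raw byte 0x9B (cp437) and the reference &#162; both end up as
# the same document character, U+00A2.
result = reference_model(b"\x9b and &#162;", "cp437")
```

A single-pass design has to get the same result while interleaving both
steps in one state machine.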

> But for CJK
> charsets, those aren't 8-bit control characters.  They're half of a
> multibyte pair.

The same goes for the UTF-8 encoding of Unicode, where bytes in that range
are more or less guaranteed to appear.
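
A quick way to see this, sketched in Python (the helper name
bytes_in_c1_range is made up for illustration): list the bytes of a
UTF-8 encoded string that fall in 0x80-0x9F.

```python
def bytes_in_c1_range(text):
    """Return the UTF-8 bytes of text that fall in 0x80-0x9F (128-159)."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


# U+20AC (euro sign) is encoded as E2 82 AC; the middle byte is in range.
euro_hits = bytes_in_c1_range("\u20ac")

# U+00C0 (A with grave) is encoded as C3 80; the second byte is in range.
agrave_hits = bytes_in_c1_range("\u00c0")

# Pure ASCII text never produces such bytes.
ascii_hits = bytes_in_c1_range("plain text")
```

None of those bytes is a control character; each is part of a multibyte
sequence.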

  Klaus
