lynx-dev
Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)


From: Klaus Weide
Subject: Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)
Date: Sat, 20 Mar 1999 11:07:00 -0600 (CST)

On Sat, 20 Mar 1999, Leonid Pauzner wrote:
> 19-Mar-99 13:46 Klaus Weide wrote:
> 
> > Do you mean (a) for the special case Display character set = UTF-8
> > or (b) in general?
> 
> I mean general case, especially this:
> 
>         "-",    /* dash the width of emsp - emdash */
>         "\002", /* emsp, em space - not collapsed NEVER CHANGE THIS - emsp */
>         "-",    /* dash the width of ensp - endash */
>         "\002", /* ensp, en space - not collapsed NEVER CHANGE THIS - ensp */
>         "\360", /* small eth, Icelandic (ð) - eth */
>         "\353", /* small e, dieresis or umlaut mark (ë) - euml */
> 
> do we already have emsp/ensp as not collapsed spaces
> so " \002 " string will result with three spaces, not within <pre> ?
> If so, emsp may be of any number of spaces we want.
> 
> (In fact, modified file test/spaces.html shows up that
> if we remove <pre> we got 1 spaces for " &ensp; " in attributes
> and 3 spaces in normal text, so I assume attributes are broken).

I reproduced this (with code without your changes for HT_E{M,N}_SPACE).  I
also tried it with a 2.7.1ac-0.91 I keep for reference; there it was
"broken" (if that's what it is) for named entities, independent of whether
in ALT or not, but not-"broken" for numeric ones in ALT.

I say '"broken"' instead of 'broken' because to my knowledge it isn't
specified anywhere whether these special spaces should be collapsed
or not.
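
To make the question concrete, here is a minimal sketch of one possible
policy, not the actual lynx code: ordinary blanks collapse to one space,
while a sentinel like the "\002" in the quoted entity table always yields
a space of its own.  The names collapse_spaces and NONCOLLAPSING_SPACE are
made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sentinel, mirroring the "\002" entries quoted above. */
#define NONCOLLAPSING_SPACE '\002'

/*
 * Copy src into dst (capacity dst_size, including the terminating NUL).
 * Runs of ordinary blanks become a single space; each sentinel becomes
 * one space that is never merged with its neighbours.
 * Returns the length of the result.
 */
size_t collapse_spaces(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    int prev_blank = 0;

    assert(src != NULL && dst != NULL && dst_size > 0);
    for (; *src != '\0' && n + 1 < dst_size; src++) {
        if (*src == NONCOLLAPSING_SPACE) {
            dst[n++] = ' ';
            prev_blank = 0;     /* a following ordinary blank still counts */
        } else if (*src == ' ' || *src == '\t' || *src == '\n') {
            if (!prev_blank)
                dst[n++] = ' ';
            prev_blank = 1;
        } else {
            dst[n++] = *src;
            prev_blank = 0;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Under this policy " \002 " comes out as three spaces, which matches what
Leonid saw in normal text; the attribute path apparently behaves as if the
sentinel were an ordinary blank.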


In a way the whole complicated model of treating attributes differently
from the rest of the text w.r.t. character set translation is broken.  I'm
sure there are other little inconsistencies if you look closely enough.
Currently we have: [*A*]

 text/html stream       text translated
   -----------> SGML.c  -----------------------> HTML.c (as is) ---------->
                 |
                 |  attr possibly untranslated
                 \ ----------------------------> HTML.c (translates)[3] -->

A cleaner approach would be [*B*]

 text/html     unicode[1]       text as unicode[1]
   -----> ???.c -----> SGML.c  ------------------> HTML.c (translates)[2] --->
           ^              |
   translate to unicode   | attr as unicode[1]
                          \ ---------------------> HTML.c (translates)[2,3] ->

[1] Whether 'unicode' is in UTF-8 representation or in some wide character
    form is an implementation detail.
[2] unicode to display character set (for displayable strings) could be
    further deferred to GridText.c (HText_* functions).
[3] Different kinds of attributes have their values translated in different
    ways: ALT is different from HREF.
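
A rough sketch of what [*B*] could look like, under heavy assumptions: the
input is iso-8859-1, the display character set is 7-bit ASCII, and an HREF
value gets its non-ASCII characters percent-escaped as UTF-8 while text and
ALT get a display fallback.  None of these function or type names exist in
lynx; they only illustrate "translate to unicode first, decide per kind of
value last" ([1] and [3] above).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { KIND_TEXT, KIND_ALT, KIND_HREF } value_kind; /* hypothetical */

/* Stage 1 ("???.c"): input charset -> Unicode.  Only iso-8859-1 is
 * handled here, where each byte value equals its code point. */
size_t latin1_to_unicode(const char *in, unsigned long *out, size_t max)
{
    size_t n = 0;

    while (*in != '\0' && n < max)
        out[n++] = (unsigned char)*in++;
    return n;
}

/* Percent-escape one code point as UTF-8 (assumes 0x80 <= c < 0x10000). */
static void percent_utf8(unsigned long c, char *out)
{
    if (c < 0x800)
        sprintf(out, "%%%02lX%%%02lX", 0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    else
        sprintf(out, "%%%02lX%%%02lX%%%02lX", 0xE0 | (c >> 12),
                0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
}

/* Stage 2 ("HTML.c (translates)"): Unicode -> output string, chosen per
 * kind of value.  Returns the length written; dst is NUL-terminated. */
size_t unicode_to_display(const unsigned long *in, size_t n,
                          value_kind kind, char *dst, size_t cap)
{
    size_t len = 0, i;
    char tmp[16];

    if (cap == 0)
        return 0;
    dst[0] = '\0';
    for (i = 0; i < n; i++) {
        unsigned long c = in[i];

        if (c < 0x80) {
            tmp[0] = (char)c;
            tmp[1] = '\0';
        } else if (kind == KIND_HREF && c < 0x10000) {
            percent_utf8(c, tmp);       /* URLs: keep the information */
        } else if (c == 0x2014) {
            strcpy(tmp, "--");          /* em dash fallback */
        } else if (c == 0x2013) {
            strcpy(tmp, "-");           /* en dash fallback */
        } else if (c == 0x2002 || c == 0x2003) {
            strcpy(tmp, " ");           /* en space, em space */
        } else {
            strcpy(tmp, "?");           /* no fallback known */
        }
        if (len + strlen(tmp) < cap) {  /* append only if it fits */
            strcpy(dst + len, tmp);
            len += strlen(tmp);
        }
    }
    return len;
}
```

The point of the sketch is only that SGML.c would never need to know the
display character set, and that ALT and HREF diverge at a single place.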

In my view, the reasons why we have [*A*] instead of [*B*] are
(a) It's the result of incrementally changing things from the way they
    were in the days when lynx could only handle iso-8859-1 and
    'Raw' (incl. CJK).
(b) It allows fallbacks, CJK handling, and a transparent mode without having
    to invent a consistent private unicode representation for untouchable
    characters.  ('fallbacks' may include the possibility of somewhat
    dealing with underspecified charsets (that don't have a .tbl).)
(c) It allows deferring translation of attribute values containing URLs and
    suchlike until it is known what kind of attribute we have ([3] above).
(d) anything else I forgot?


I would nearly say we can live with small inconsistencies unless / until
the whole mechanism is revised (or deal with problems on a case by case
basis if they really show up in real-world documents).

Or we could regard the fact that currently attribute values are sometimes
translated differently from the main text as an opportunity instead of
a problem - in case it sometimes makes sense to have things done
differently (e.g.) for ALT text.  (Is that the case anywhere?)


   Klaus
