Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)
From: Leonid Pauzner
Subject: Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)
Date: Sun, 21 Mar 1999 23:33:17 +0300 (MSK)

20-Mar-99 11:07 Klaus Weide wrote:
> In a way the whole complicated model of treating attributes differently
> from the rest of text w.r.t. character set translation is broken. I'm
> sure there are other little inconsistencies if you look close enough.
> Currently we have: [*A*]
>
>  text/plain,
>  text/html stream        text translated
>  -----------> SGML.c -----------------------> HTML.c (as is) ---------->
>                 |
>                 |    attr possibly untranslated
>                 \ ----------------------------> HTML.c (translates)[3] -->
>
> A cleaner approach would be [*B*]
>
>  text/html    unicode[1]       text as unicode[1]
>  -----> ???.c -----> SGML.c ------------------> HTML.c (translates)[2] --->
>           ^                |
>  translate to unicode      |    attr as unicode[1]
>                            \ ---------------------> HTML.c (translates)[2,3] ->
>
> [1] Whether 'unicode' is in UTF-8 representation or in some wide character
> form is an implementation detail.
> [2] unicode to display character set (for displayable strings) could be
> further deferred to GridText.c (HText_* functions).
> [3] Different kinds of attributes have their values translated in different
> ways: ALT is different from HREF.
> In my view, the reasons why we have [*A*] instead of [*B*] are
> (a) It's the result of incrementally changing things from the way they
> were in the days when lynx could only handle iso-8859-1 and
> 'Raw' (incl. CJK).
> (b) It allows fallbacks, CJK handling, and a transparent mode without having
> to invent a consistent private unicode representation for untouchable
> characters. ('fallbacks' may include possibility to somewhat deal
> with underspecified charsets (that don't have .tbl))
> (c) It allows deferring translation of attribute values containing URLs and
> suchlike until it is known what kind of attribute we have ([3] above).
> (d) anything else I forgot?
We could run into a real problem with CJK handling in [*B*]
when we find unicode numeric/character references in CJK text:
a mixture of two multibyte encodings in SGML.c/HTML.c
is not the easiest thing to imagine (or to debug).
There is no such problem when we translate text in one stage, as in [*A*].
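
To illustrate where the second multibyte encoding would come from, here is a
minimal sketch in C of expanding a numeric character reference into UTF-8.
The names parse_ncr and ucs_to_utf8 are hypothetical and are not Lynx
functions; the sketch only assumes that [*B*] would use UTF-8 internally,
which is one of the options in [1]. If the surrounding bytes are still in a
CJK encoding, the output of ucs_to_utf8 would sit next to them in the same
buffer, which is the mixture described above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Lynx: encode one Unicode code point as
 * UTF-8. 'out' must hold at least 5 bytes. Returns the number of bytes
 * written (excluding the terminating NUL), or 0 for an invalid code point. */
static size_t ucs_to_utf8(unsigned long cp, char *out)
{
    size_t n;

    if (cp < 0x80UL) {
        out[0] = (char) cp;
        n = 1;
    } else if (cp < 0x800UL) {
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;               /* surrogates are not characters */
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFFUL) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper, not part of Lynx: parse "&#NNNN;" or "&#xHHHH;" at
 * the start of 's'. On success stores the code point in *cp and returns
 * the number of input bytes consumed; returns 0 if 's' does not start
 * with a well-formed numeric character reference. */
static size_t parse_ncr(const char *s, unsigned long *cp)
{
    const char *p;
    unsigned long v = 0;
    unsigned long base = 10;
    int digits = 0;

    if (s[0] != '&' || s[1] != '#')
        return 0;
    p = s + 2;
    if (*p == 'x' || *p == 'X') {
        base = 16;
        p++;
    }
    for (;; p++) {
        unsigned long d;

        if (*p >= '0' && *p <= '9')
            d = (unsigned long) (*p - '0');
        else if (base == 16 && *p >= 'a' && *p <= 'f')
            d = (unsigned long) (*p - 'a') + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F')
            d = (unsigned long) (*p - 'A') + 10;
        else
            break;
        v = v * base + d;
        if (v > 0x10FFFFUL)
            return 0;               /* beyond the Unicode range */
        digits++;
    }
    if (digits == 0 || *p != ';')
        return 0;
    *cp = v;
    return (size_t) (p - s) + 1;    /* include the ';' */
}
```

For example, parse_ncr("&#x2003;", &cp) consumes 8 bytes and yields 0x2003,
which ucs_to_utf8 turns into a three-byte sequence.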
Yes, attributes may be translated differently:
visible text (value=, alt=, etc.?) should be translated
the same way as the rest of the text, but other attributes cannot tolerate
any "approximation" (fortunately, only HREF= and NAME= are essential for lynx;
the others should be 7-bit (I guess), so no problem, though in theory we can
imagine someone using 8-bit letters for JavaScript functions...)
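
The distinction above could be expressed as a small classification step
before translation. This is only a sketch in C: the enum, the function names
and the exact attribute lists are assumptions made for illustration, not how
HTML.c is organized.

```c
#include <ctype.h>

/* Hypothetical categories for this sketch only. */
typedef enum {
    ATTR_DISPLAY_TEXT,  /* shown to the user: translate like body text     */
    ATTR_EXACT,         /* HREF=, NAME=: must survive without approximation */
    ATTR_OPAQUE         /* everything else: assumed 7-bit, passed through   */
} attr_kind;

/* Case-insensitive comparison of two NUL-terminated ASCII names. */
static int equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Decide how an attribute value should be treated, by attribute name. */
static attr_kind classify_attr(const char *name)
{
    if (equal_nocase(name, "ALT") || equal_nocase(name, "VALUE"))
        return ATTR_DISPLAY_TEXT;
    if (equal_nocase(name, "HREF") || equal_nocase(name, "NAME"))
        return ATTR_EXACT;
    return ATTR_OPAQUE;
}

/* Only displayable text may be replaced by an approximation such as
 * "--" for an em dash; URLs and anchor names may not. */
static int approximation_allowed(attr_kind kind)
{
    return kind == ATTR_DISPLAY_TEXT;
}
```

With this, approximation_allowed(classify_attr("alt")) is true while
approximation_allowed(classify_attr("href")) is false.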
> I would nearly say we can live with small inconsistencies unless / until
> the whole mechanism is revised (or deal with problems on a case by case
> basis if they really show up in real-world documents).
yes.
> Or we could regard the fact that currently attribute values are sometimes
> translated differently from the main text as an opportunity instead of
> a problem - in case it sometimes makes sense to have things done
> differently (e.g.) for ALT text. (Is that the case anywhere?)
Probably makes no sense (see above).
> Klaus
> On Sat, 20 Mar 1999, Leonid Pauzner wrote:
>> 19-Mar-99 13:46 Klaus Weide wrote:
>>
>> > Do you mean (a) for the special case Display character set = UTF-8
>> > or (b) in general?
>>
>> I mean general case, especially this:
>>
>> "-", /* dash the width of emsp - emdash */
>> "\002", /* emsp, em space - not collapsed NEVER CHANGE THIS - emsp */
>> "-", /* dash the width of ensp - endash */
>> "\002", /* ensp, en space - not collapsed NEVER CHANGE THIS - ensp */
>> "\360", /* small eth, Icelandic (ð) - eth */
>> "\353", /* small e, dieresis or umlaut mark (ë) - euml */
>>
>> Do we already treat emsp/ensp as non-collapsing spaces,
>> so that a " \002 " string results in three spaces outside <pre>?
>> If so, emsp may be any number of spaces we want.
>>
>> (In fact, the modified file test/spaces.html shows that
>> if we remove <pre> we get 1 space for "   " in attributes
>> and 3 spaces in normal text, so I assume attributes are broken).
> I reproduced this (with code without your changes for HT_E{M,N}_SPACE). I
> also tried it with a 2.7.1ac-0.91 I keep for reference; there it was
> "broken" (if that's what it is) for named entities, independent of whether
> ALT or not, but not-"broken" for numeric ones in ALT.
> I say '"broken"' instead of 'broken' because to my knowledge it isn't
> specified anywhere whether these special spaces should be collapsed
> or not.
Yes, but it should be consistent at least:
UNICODE  NCR    alt-NCR  named  alt-named

lynx/2.6+
0x2002   [ ]    [ ]      [ ]    [ ]        # EN SPACE
0x2003   [   ]  [   ]    [ ]    [ ]        # EM SPACE

lynx/2.7.1ac-0.98
0x2002   [ ]    [ ]      [ ]    [ ]        # EN SPACE
0x2003   [ ]    [ ]      [ ]    [ ]        # EM SPACE

lynx/2.8.1(and newer)
0x2002   [ ]    [ ]      [ ]    [ ]        # EN SPACE
0x2003   [ ]    [ ]      [ ]    [ ]        # EM SPACE
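
For reference, the consistent behaviour asked for above (a "\002"
placeholder that is never collapsed, in body text and in attribute values
alike) could look roughly like the following C sketch. SKETCH_FIXED_SPACE
and collapse_spaces are hypothetical names chosen for this example; the
sketch does not reproduce the actual Lynx code paths.

```c
#include <stddef.h>

/* Hypothetical name for the "\002" placeholder used for emsp/ensp in the
 * entity table quoted above. */
#define SKETCH_FIXED_SPACE '\002'

/* Copy 'in' to 'out' (capacity 'outsz', including the NUL), collapsing each
 * run of ordinary whitespace to one space. Every placeholder becomes one
 * space that is never merged with its neighbours. Returns the length of the
 * result. */
static size_t collapse_spaces(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_run = 0;     /* nonzero while inside a run of ordinary spaces */

    if (outsz == 0)
        return 0;
    for (; *in != '\0' && n + 1 < outsz; in++) {
        if (*in == ' ' || *in == '\t' || *in == '\n' || *in == '\r') {
            if (!in_run)
                out[n++] = ' ';
            in_run = 1;
        } else if (*in == SKETCH_FIXED_SPACE) {
            out[n++] = ' ';     /* always emitted */
            in_run = 0;         /* and it ends any run of ordinary spaces */
        } else {
            out[n++] = *in;
            in_run = 0;
        }
    }
    out[n] = '\0';
    return n;
}
```

Under these assumptions " \002 " yields three spaces, while three ordinary
spaces between two words yield one. Applying the same function to ALT text
and to body text would make the NCR, alt-NCR, named and alt-named columns
agree.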