Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)

From:	Leonid Pauzner
Subject:	Re: lynx-dev lynx2.8.2dev.19 patch #6 (em dash = --)
Date:	Sun, 21 Mar 1999 23:33:17 +0300 (MSK)

20-Mar-99 11:07 Klaus Weide wrote:
> In a way the whole complicated model of treating attributes differently
> from the rest of text w.r.t. character set translation is broken.  I'm
> sure there are other little inconsistencies if you look close enough.
> Currently we have: [*A*]

>  text/plain,
>  text/html stream       text translated
>    -----------> SGML.c  -----------------------> HTML.c (as is) ---------->
>                  |
>                  |  attr possibly untranslated
>                  \ ----------------------------> HTML.c (translates)[3] -->

> A cleaner approach would be [*B*]

>  text/html     unicode[1]       text as unicode[1]
>    -----> ???.c -----> SGML.c  ------------------> HTML.c (translates)[2] --->
>            ^              |
>    translate to unicode   | attr as unicode[1]
>                           \ ---------------------> HTML.c (translates)[2,3] ->

> [1] Whether 'unicode' is in UTF-8 representation or in some wide character
>     form is an implementation detail.
> [2] unicode to display character set (for displayable strings) could be
>     further deferred to GridText.c (HText_* functions).
> [3] Different kinds of attributes have their values translated in different
>     ways: ALT is different from HREF.
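
The "translate to unicode" stage of [*B*] can be sketched as below. This is only an illustration under stated assumptions: the input charset is taken to be ISO-8859-1, UTF-8 is chosen as the internal representation (footnote [1] leaves that open), and latin1_to_utf8 is a hypothetical name, not a function in the lynx sources.

```c
#include <stddef.h>

/* Hypothetical helper (not lynx code): convert an ISO-8859-1 string to
 * UTF-8.  Returns the number of bytes written, not counting the
 * terminating NUL, or (size_t)-1 if the output buffer is too small
 * (the buffer contents are then unspecified). */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *) in;
    size_t n = 0;

    if (outsize == 0)
        return (size_t) -1;
    for (; *p != '\0'; p++) {
        if (*p < 0x80) {
            /* 7-bit characters are identical in both encodings */
            if (n + 1 >= outsize)
                return (size_t) -1;
            out[n++] = (char) *p;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes */
            if (n + 2 >= outsize)
                return (size_t) -1;
            out[n++] = (char) (0xC0 | (*p >> 6));
            out[n++] = (char) (0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

With this in front of SGML.c, both text and attribute values would arrive in one representation, and only the final step would need to know about the display character set.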

> In my view, the reasons why we have [*A*] instead of [*B*] are
> (a) It's the result of incrementally changing things from the way they
>     were in the days when lynx could only handle iso-8859-1 and
>     'Raw' (incl. CJK).
> (b) It allows fallbacks, CJK handling, and a transparent mode without having
>     to invent a consistent private unicode representation for untouchable
>     characters.  ('fallbacks' may include possibility to somewhat deal
>     with underspecified charsets (that don't have .tbl))
> (c) It allows deferring translation of attribute values containing URLs and
>     suchlike until it is known what kind of attribute we have ([3] above).
> (d) anything else I forgot?

We could have a real problem with CJK handling in [*B*]
when we find unicode numeric/character references in CJK text:
a mixture of two multibyte encodings in SGML.c/HTML.c
is not the easiest thing to imagine (or to debug).
There is no such problem when we translate text in one stage, as in [*A*].

Yes, attributes may be translated differently:
visible text (value=, alt=, etc.?) should be translated
the same way as the rest of the text, but other attributes cannot accept
any "approximation". Fortunately, only HREF= and NAME= are essential for lynx;
the others should be 7-bit (I guess), so no problem there, though in theory
we can imagine someone using 8-bit letters for JavaScript functions...
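
The distinction above could be expressed as a small classification step. This is a sketch only: the enum, the function names, and the exact list of "visible text" attributes (just ALT and VALUE here, the two named above) are assumptions for illustration, not what HTML.c does.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical classification, not lynx code. */
enum attr_kind {
    ATTR_DISPLAY_TEXT,  /* shown to the user; may be approximated like body text */
    ATTR_EXACT          /* HREF, NAME, ...: must not be approximated */
};

/* Case-insensitive comparison of two ASCII attribute names. */
static int names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

enum attr_kind classify_attr(const char *name)
{
    /* Assumed list of attributes whose values are displayed as text. */
    static const char *const display_attrs[] = { "alt", "value" };
    size_t i;

    for (i = 0; i < sizeof display_attrs / sizeof display_attrs[0]; i++) {
        if (names_equal(name, display_attrs[i]))
            return ATTR_DISPLAY_TEXT;
    }
    /* Everything else is treated conservatively: no approximation. */
    return ATTR_EXACT;
}
```

Defaulting unknown attributes to ATTR_EXACT is the cautious choice: an over-translated URL breaks a link, while an untranslated label only looks odd.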

> I would nearly say we can live with small inconsistencies unless / until
> the whole mechanism is revised (or deal with problems on a case by case
> basis if they really show up in real-world documents).
yes.

> Or we could regard the fact that currently attribute values are sometimes
> translated differently from the main text as an opportunity instead of
> a problem - in case it sometimes makes sense to have things done
> differently (e.g.) for ALT text.  (Is that the case anywhere?)
Probably no sense (see above).

>    Klaus

> On Sat, 20 Mar 1999, Leonid Pauzner wrote:
>> 19-Mar-99 13:46 Klaus Weide wrote:
>>
>> > Do you mean (a) for the special case Display character set = UTF-8
>> > or (b) in general?
>>
>> I mean general case, especially this:
>>
>>         "-",    /* dash the width of emsp - emdash */
>>         "\002", /* emsp, em space - not collapsed NEVER CHANGE THIS - emsp */
>>         "-",    /* dash the width of ensp - endash */
>>         "\002", /* ensp, en space - not collapsed NEVER CHANGE THIS - ensp */
>>         "\360", /* small eth, Icelandic (&#240;) - eth */
>>         "\353", /* small e, dieresis or umlaut mark (&#235;) - euml */
>>
>> Do we already have emsp/ensp as non-collapsed spaces,
>> so that the string " \002 " results in three spaces outside <pre>?
>> If so, emsp may be any number of spaces we want.
>>
>> (In fact, a modified test/spaces.html shows that
>> if we remove <pre> we get 1 space for " &ensp; " in attributes
>> and 3 spaces in normal text, so I assume attributes are broken).
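
The expected "three spaces" behaviour can be illustrated with a small sketch. It assumes that "\002" acts as a sentinel for a space that must survive whitespace collapsing, as the table comments quoted above suggest; collapse_spaces and NONCOLLAPSED_SPACE are hypothetical names, not lynx code.

```c
#include <stddef.h>

/* Sentinel for a space that must not be collapsed (assumed meaning of
 * the "\002" entries in the quoted entity table). */
#define NONCOLLAPSED_SPACE '\002'

/* Hypothetical sketch: collapse runs of ordinary spaces to one space,
 * but always emit one space per sentinel.  Output is truncated to fit
 * outsize.  Returns the number of bytes written, not counting the NUL. */
size_t collapse_spaces(const char *in, char *out, size_t outsize)
{
    size_t n = 0;
    int prev_space = 0;     /* was the last input char an ordinary space? */

    if (outsize == 0)
        return 0;
    for (; *in != '\0' && n + 1 < outsize; in++) {
        if (*in == NONCOLLAPSED_SPACE) {
            out[n++] = ' ';
            prev_space = 0; /* a following ordinary space is kept */
        } else if (*in == ' ') {
            if (!prev_space)
                out[n++] = ' ';
            prev_space = 1;
        } else {
            out[n++] = *in;
            prev_space = 0;
        }
    }
    out[n] = '\0';
    return n;
}
```

Under this sketch " \002 " yields three spaces, while the same input with the sentinel already replaced by a plain space yields one. That would be consistent with the attribute result reported above, though whether this is the actual cause is not established here.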

> I reproduced this (with code without your changes for HT_E{M,N}_SPACE).  I
> also tried it with a 2.7.1ac-0.91 I keep for reference; there it was
> "broken" (if that's what it is) for named entities, independent of whether
> ALT or not, but not-"broken" for numeric ones in ALT.

> I say '"broken"' instead of 'broken' because to my knowledge it isn't
> specified anywhere whether these special spaces should be collapsed
> or not.
Yes, but it should be consistent at least:

   UNICODE NCR alt-NCR named alt-named

lynx/2.6+
   0x2002 [&#x2002;] [&#x2002;] [ ] [ ] # EN SPACE
   0x2003 [ &#x2003; ] [ &#x2003; ] [   ] [   ] # EM SPACE
lynx/2.7.1ac-0.98
   0x2002 [ ] [ ] [ ] [ ] # EN SPACE
   0x2003 [ ] [ ] [   ] [ ] # EM SPACE
lynx/2.8.1(and newer)
   0x2002 [ ] [ ] [ ] [ ] # EN SPACE
   0x2003 [   ] [ ] [   ] [ ] # EM SPACE
