Re: lynx-dev Lynx character entity references fix


From: Leonid Pauzner
Subject: Re: lynx-dev Lynx character entity references fix
Date: Fri, 12 Mar 1999 22:55:27 +0300 (MSK)

12-Mar-99 10:27 Klaus Weide wrote:
> On Fri, 12 Mar 1999, Leonid Pauzner wrote:
>>
>> >> OK, changing the "assume charset" for an unlabelled document gives the
>> >> following (grep UC_MapGN from the trace log):
>> >>
>> >> UC_MapGN: Using 1 <- 26 (windows-1251)
>> >> UC_MapGN: Using 1 <- 1 (iso-8859-15)
>> >> UC_MapGN: Using 2 <- 2 (cp850)
>> >> UC_MapGN: Using 1 <- 3 (windows-1252)
>> >> UC_MapGN: Using 2 <- 4 (cp437)
>> >> UC_MapGN: Using 1 <- 5 (dec-mcs)
>> >> UC_MapGN: Using 2 <- 6 (macintosh)
>> >> UC_MapGN: Using 1 <- 7 (next)
>> >> UC_MapGN: Using 2 <- 8 (hp-roman8)
>>
>> There are two more charsets not shown above: iso-8859-1 and us-ascii
>> (before iso-8859-15) - apparently constant slot #0.

> I'm not sure why us-ascii doesn't show up in the TRACE - possibly
> because 8-bit characters get rejected already in SGML.c, so it never comes
> this far.  (speculation...)

No. This is a bug in SGML.c: a us-ascii document is treated as an iso-8859-1
document, so 8-bit chars are not filtered out but "translated from iso-8859-1"!

Try raw8bit.html under the test/ directory: set "assumed charset" to us-ascii
and "display charset" to something other than latin1, say us-ascii,
and toggle "\" several times - there is no such problem with HTPlain.c.
(I saw this bug a year ago when I was preparing this test file
but was too lazy to fix SGML_character() - it should be merged
against Fote's 2.7.2, which looked more consistent when I last saw it;
I did the same merge for HTPlain.c earlier.)
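
For illustration only, a minimal C sketch of the filtering I would expect
for an assumed us-ascii input. The names (sketch_charset, sketch_filter_8bit)
and the '?' replacement are made up for this example; this is not the actual
SGML_character() code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical charset tags for this sketch; not Lynx's real charset handles. */
enum sketch_charset { SKETCH_US_ASCII, SKETCH_ISO_8859_1 };

/*
 * Copy `len` bytes from `in` to `out`.
 * If the assumed input charset is us-ascii, bytes >= 0x80 are invalid and
 * are replaced by '?' (an assumption of this sketch); for iso-8859-1 they
 * are passed through unchanged.
 * Returns the number of 8-bit bytes that were rejected.
 */
size_t sketch_filter_8bit(enum sketch_charset assumed,
                          const unsigned char *in, size_t len,
                          unsigned char *out)
{
    size_t rejected = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        if (in[i] >= 0x80 && assumed == SKETCH_US_ASCII) {
            out[i] = '?';       /* filtered out, not "translated" */
            rejected++;
        } else {
            out[i] = in[i];
        }
    }
    return rejected;
}
```

The point is only that the decision depends on the assumed charset itself,
not on a silent fallback to iso-8859-1.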

>> I'm thinking of undoing some UCCanTranslate* changes to support UChndl >= 0
>> again; this handler is a couple of bytes and can be removed at the last stage.

> Agreed.  If at some point all "old style" tables are gone, then UChndl == -1
> cannot occur any more. (It may still make sense to have some [other?] flag to
> say "we can translate *to* this, but not *from* this".)

>> > In general your changes seem to aim at simplifying things (with the
>> > final goal to get completely rid of "old" stuff?) and at making
>> > things clearer.  I think using UChndl = -1 to mean something else than
>> > it used to doesn't make things clearer though.
>>
>> > I leave it to you to find the best way (and reserve right to complain...)
>>
>> The real simplification may be #ifdef'ing some heavy code
>> that deals with "old" style usage (in SGML.c, HTPlain.c, LYCharUtils.c (Uh!),
>> and at the last stage - from HTMLDTD.c, LYCharSets.c, ...)
>> It is a "bloating binary" item and also a problem of maintaining
>> such an inhomogeneous piece of code in general.

> Yes, there is quite some duplication there.

> I think LYCharUtils.c is not so bad, although you find it "somehow ugly". :)
I find its LYUCFullyTranslateString_1() state machine
completely unmaintainable: CJK, non_ascii_text, entities, URL-escaping,
hidden space, lots of flags with unknown specifications - all in a single mess.
I wonder why we have three functions for character translation
instead of a single one.

> There may be less baggage there than in SGML.c, HTPlain.c.

> [...]
>> No problem - it may be left #ifdef'ed in the code
>> (but since it will not be used, it will not be actively tested/maintained,
>> so there is a greater chance of it being broken by future lynx changes, yes).

> Same as with other "dead code" removal by #ifdef'ing.

>> p.s. The real problem I see is the limited space for lynx special
>> characters like HT_NON_BREAK_SPACE, HT_EM_SPACE, etc. (see GridText.h),
>> which are mapped to the < 32 area: we cannot add more, say HT_EN_SPACE
>> (and we probably have the Vietnamese implementation already broken,
>> though nobody seems interested). Indirect usage of "old" entities translation
>> may effectively solve the problem, but I am not sure.

A less costly solution may be to introduce an "internal multibyte scheme"
for special chars, say 0x02 as an escape character for the next special byte.
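
A rough C sketch of what I mean, under stated assumptions: the escape byte
0x02 is from the suggestion above, but the SKETCH_* codes and both helper
functions are hypothetical and do not exist in GridText.h or anywhere in lynx:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ESC 0x02  /* escape byte, as suggested in the text */

/* Hypothetical special-character codes; the numbering is made up here.
 * Zero is avoided so the buffer can still be a C string. */
enum sketch_special {
    SKETCH_NON_BREAK_SPACE = 1,
    SKETCH_EM_SPACE        = 2,
    SKETCH_EN_SPACE        = 3
};

/* Append the pair ESC + code to buf at position pos; returns the new position.
 * The caller must guarantee room for two more bytes. */
size_t sketch_put_special(unsigned char *buf, size_t pos,
                          enum sketch_special code)
{
    assert(buf != NULL);
    buf[pos++] = SKETCH_ESC;
    buf[pos++] = (unsigned char) code;
    return pos;
}

/*
 * Read one unit from buf starting at *pos and advance *pos past it.
 * Returns 1 and stores the special code in *code if an escape pair was read.
 * Returns 0 and stores the plain byte in *code otherwise; a lone ESC at the
 * very end of the buffer is returned as a plain byte.
 */
int sketch_get(const unsigned char *buf, size_t len, size_t *pos, int *code)
{
    assert(buf != NULL && pos != NULL && code != NULL && *pos < len);
    if (buf[*pos] == SKETCH_ESC && *pos + 1 < len) {
        *code = buf[*pos + 1];
        *pos += 2;
        return 1;
    }
    *code = buf[(*pos)++];
    return 0;
}
```

This costs one extra byte per special character but leaves room for up to
255 codes behind a single reserved byte of the < 32 area.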

> The best (most general) solution to that would be to feed Unicode values
> (instead of chars) to GridText, then there is nearly unlimited space
> for private regions.  I.e. translate Unicode -> display character set
> as late as possible, but that would mean that also HTML.c has to deal
> with text as Unicode values instead of chars.  Eventually that would be
> cleaner, but not trivial to change (especially not breaking CJK and
> "Transparent").
The special chars can be shifted to the "private use area" by a mask.
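
As a sketch only (assuming the Unicode BMP private use area starting at
U+E000, and 8-bit internal codes; the sketch_* names are hypothetical, not
existing lynx functions), the mask could look like this:

```c
#include <assert.h>

/* Start of the Unicode BMP private use area (U+E000..U+F8FF). */
#define SKETCH_PUA_BASE 0xE000UL

/* Map a small internal special-character code (0..255, an assumption of
 * this sketch) into the private use area by OR-ing in the base. */
unsigned long sketch_to_pua(unsigned code)
{
    assert(code <= 0xFF);
    return SKETCH_PUA_BASE | code;
}

/* True if u lies in the 256-value block this sketch reserves at U+E000. */
int sketch_is_pua_special(unsigned long u)
{
    return (u & ~0xFFUL) == SKETCH_PUA_BASE;
}

/* Recover the internal code by masking off the base. */
unsigned sketch_from_pua(unsigned long u)
{
    return (unsigned) (u & 0xFFUL);
}
```

The special codes then never collide with real characters, and the
Unicode -> display step can recognize them with a single mask test.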

> One advantage of doing  Unicode -> display  is that a larger glyph
> repertoire could be used if the terminal supports some way for it -
> at least dec graphics characters that are in addition to the normally
> printable output chars, (by switching to curses alternate-character-set)
> or possibly VGA fonts of 512 characters (some are available for linux
> console).


>    Klaus



