Re: lynx-dev Lynx character entity references fix
From: Leonid Pauzner
Subject: Re: lynx-dev Lynx character entity references fix
Date: Fri, 12 Mar 1999 22:55:27 +0300 (MSK)
12-Mar-99 10:27 Klaus Weide wrote:
> On Fri, 12 Mar 1999, Leonid Pauzner wrote:
>>
>> >> OK, changing "assume charset" for an unlabelled document gives the
>> >> following
>> >> (grep UC_MapGN from trace log):
>> >>
>> >> UC_MapGN: Using 1 <- 26 (windows-1251)
>> >> UC_MapGN: Using 1 <- 1 (iso-8859-15)
>> >> UC_MapGN: Using 2 <- 2 (cp850)
>> >> UC_MapGN: Using 1 <- 3 (windows-1252)
>> >> UC_MapGN: Using 2 <- 4 (cp437)
>> >> UC_MapGN: Using 1 <- 5 (dec-mcs)
>> >> UC_MapGN: Using 2 <- 6 (macintosh)
>> >> UC_MapGN: Using 1 <- 7 (next)
>> >> UC_MapGN: Using 2 <- 8 (hp-roman8)
>>
>> There are two more charsets not shown above: iso-8859-1 and us-ascii
>> (before iso-8859-15) - apparently constant slot #0.
> I'm not sure why us-ascii doesn't show up in the TRACE - possibly
> because 8-bit characters get rejected already in SGML.c, so it never comes
> this far. (speculation...)
No. This is a bug in SGML.c: a us-ascii document is treated as an iso-8859-1
document, so 8-bit chars are not filtered out but "translated from iso-8859-1"!
Try raw8bit.html under the test/ directory: set "assumed charset" to us-ascii
and "display charset" to something that is not latin1, say us-ascii,
and toggle "\" several times - there is no problem with HTPlain.c.
(I saw this bug a year ago when I was preparing this test file
but was too lazy to fix SGML_character() - it should be merged
against Fote's 2.7.2, which looked more consistent when I last saw it;
I did the same merge for HTPlain.c earlier.)
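
A minimal sketch of the filtering expected here, assuming a simplified
charset tag instead of Lynx's real charset handles. The names in_charset
and accept_input_byte() are hypothetical and are not taken from SGML.c:

```c
/* Hypothetical, simplified tag for the assumed charset of a document. */
enum in_charset { CS_US_ASCII, CS_ISO_8859_1, CS_OTHER };

/*
 * Returns 1 if the byte may be passed on to charset translation,
 * 0 if it should be filtered out.  For a us-ascii document every
 * 8-bit byte is rejected instead of being treated as iso-8859-1.
 */
static int accept_input_byte(enum in_charset cs, unsigned char c)
{
    if (cs == CS_US_ASCII)
        return c < 0x80;
    return 1;
}
```

With this check a byte such as 0xE9 is dropped for us-ascii input but still
reaches the translation step for iso-8859-1 input.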
>> I'm thinking of undoing some UCCanTranslate* changes to support UChndl >= 0
>> again;
>> this handler is a couple of bytes and can be removed at the last stage.
> Agreed. If at some point all "old style" tables are gone, then UChndl == -1
> cannot occur any more (It may still make sense to have some [other?] flag to
> say "we can translate *to* this, but not *from* this").
>> > In general your changes seem to aim at simplifying things (with the
>> > final goal to get completely rid of "old" stuff?) and at making
>> > things clearer. I think using UChndl = -1 to mean something else than
>> > it used to doesn't make things clearer though.
>>
>> > I leave it to you to find the best way (and reserve right to complain...)
>>
>> The real simplification may be #ifdef'ing some heavy code
>> that deals with "old" style usage (in SGML.c, HTPlain.c, LYCharUtils.c (Uh!),
>> and at the last stage - from HTMLDTD.c, LYCharSets.c, ...)
>> It is a "bloating binary" item and also a problem of maintaining
>> such an inhomogeneous piece of code in general.
> Yes, there is quite some duplication there.
> I think LYCharUtils.c is not so bad, although you find it "somehow ugly". :)
I find its LYUCFullyTranslateString_1() state machine
completely unmaintainable: CJK, non_ascii_text, entities, URL-escaping,
hidden space, lots of flags with unknown specifications - all in a single mess.
I wonder why we currently have three functions for character translation
instead of a single one.
> There may be less baggage there than in SGML.c, HTPlain.c.
> [...]
>> No problem - it may be left #ifdef'ed in the code
>> (but since it will not be used it will not be actively tested/maintained,
>> so it is more likely to be broken in future by occasional lynx changes, yes).
> Same as with other "dead code" removal by #ifdef'ing.
>> p.s. The real problem I see is the limited space for lynx special
>> characters like HT_NON_BREAK_SPACE, HT_EM_SPACE, etc. (see GridText.h),
>> which are mapped to the < 32 area: we cannot add more, say HT_EN_SPACE
>> (and we probably have the Vietnamese implementation already broken,
>> though nobody seems interested). Indirect usage of "old" entities translation
>> may effectively solve the problem, but I am not sure.
A less costly solution may be to introduce an "internal multibyte scheme"
for special chars, say 0x02 is an escape character for the next special byte.
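
A rough sketch of such an escape scheme, only to illustrate the idea. The
names SPECIAL_ESC, special_code, put_special() and get_item() are
hypothetical and do not exist in GridText.h; the code values are made up:

```c
#include <assert.h>
#include <stddef.h>

#define SPECIAL_ESC 0x02   /* assumed escape byte, as suggested above */

/* Hypothetical codes carried in the byte that follows the escape. */
enum special_code { SP_NON_BREAK_SPACE = 1, SP_EM_SPACE, SP_EN_SPACE };

/* Write an escaped special char; returns bytes written, 0 if no room. */
static size_t put_special(unsigned char *buf, size_t room,
                          enum special_code code)
{
    assert(buf != NULL);
    if (room < 2)
        return 0;
    buf[0] = SPECIAL_ESC;
    buf[1] = (unsigned char) code;
    return 2;
}

/*
 * Decode one item from p (len bytes available).  Sets *is_special and
 * *value and returns the number of bytes consumed (0 if len is 0).
 * A lone escape byte at the very end is returned as an ordinary byte.
 */
static size_t get_item(const unsigned char *p, size_t len,
                       int *is_special, int *value)
{
    assert(p != NULL && is_special != NULL && value != NULL);
    if (len == 0)
        return 0;
    if (p[0] == SPECIAL_ESC && len >= 2) {
        *is_special = 1;
        *value = p[1];
        return 2;
    }
    *is_special = 0;
    *value = p[0];
    return 1;
}
```

The second byte gives room for up to 256 special codes at the cost of one
reserved value in the < 32 area, but every consumer of the text has to know
that a special char now occupies two bytes.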
> The best (most general) solution to that would be to feed Unicode values
> (instead of chars) to GridText, then there is nearly unlimited space
> for private regions. I.e. translate Unicode -> display character set
> as late as possible, but that would mean that also HTML.c has to deal
> with text as Unicode values instead of chars. Eventually that would be
> cleaner, but not trivial to change (especially not breaking CJK and
> "Transparent").
The special chars can be shifted to the "private use area" by a mask.
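
A sketch of what such a mask could look like, assuming the special codes fit
in one byte and are placed at the start of the Unicode private use area
(U+E000 to U+F8FF). The names PUA_BASE, SPECIAL_MASK and the three helpers
are hypothetical, not existing Lynx identifiers:

```c
#include <assert.h>

#define PUA_BASE     0xE000UL  /* first code point of the private use area */
#define SPECIAL_MASK 0x00FFUL  /* assumed: special codes fit in 8 bits */

/* Map a small special code to a private-use Unicode value. */
static unsigned long special_to_unicode(unsigned code)
{
    assert(code <= SPECIAL_MASK);
    return PUA_BASE | code;
}

/* True if u lies in the 256-value block reserved for special codes. */
static int is_special_unicode(unsigned long u)
{
    return (u & ~SPECIAL_MASK) == PUA_BASE;
}

/* Recover the special code; only meaningful if is_special_unicode(u). */
static unsigned unicode_to_special(unsigned long u)
{
    return (unsigned) (u & SPECIAL_MASK);
}
```

Because the low 8 bits of PUA_BASE are zero, OR-ing and masking are exact
inverses here, so the round trip needs no table.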
> One advantage of doing Unicode -> display is that a larger glyph
> repertoire could be used if the terminal supports some way for it -
> at least dec graphics characters that are in addition to the normally
> printable output chars, (by switching to curses alternate-character-set)
> or possibly VGA fonts of 512 characters (some are available for linux
> console).
> Klaus