Re: lynx-dev Lynx character entity references fix


From: Leonid Pauzner
Subject: Re: lynx-dev Lynx character entity references fix
Date: Fri, 12 Mar 1999 15:22:11 +0300 (MSK)

12-Mar-99 00:54 Klaus Weide wrote:
> On Thu, 11 Mar 1999, Leonid Pauzner wrote:

>> OK, changing "assume charset" for an unlabelled document gives the following
>> (grep UC_MapGN from trace log):
>>
>> UC_MapGN: Using 1 <- 26 (windows-1251)
>> UC_MapGN: Using 1 <- 1 (iso-8859-15)
>> UC_MapGN: Using 2 <- 2 (cp850)
>> UC_MapGN: Using 1 <- 3 (windows-1252)
>> UC_MapGN: Using 2 <- 4 (cp437)
>> UC_MapGN: Using 1 <- 5 (dec-mcs)
>> UC_MapGN: Using 2 <- 6 (macintosh)
>> UC_MapGN: Using 1 <- 7 (next)
>> UC_MapGN: Using 2 <- 8 (hp-roman8)

> Ok, you have probably just now done more runtime testing of what switching
> occurs than I ever did.  :)   Note that those TRACE lines only occur when
> one of the four slots is changed (not when it is re-used).

There are two more charsets not shown above: iso-8859-1 and us-ascii
(before iso-8859-15), apparently in constant slot #0.

>> It is for "forward" translation and apparently slots #3 and #4 are not used.

> Make that #0 and #3.

> But #0 is used for iso-8859-1.  The four slots are initially set to fixed
> tables, whose correspondence to charsets can be seen in UCdomap.h:

>     CONST char *UC_GNsetMIMEnames[4] =
>         {"iso-8859-1", "x-dec-graphics", "cp437", "x-transparent"};

OK, thanks. It was not clear which tables are constant/used/unused/vary.

> The first and last one are then never changed.  This code in UC_MapGN()
> flips between using (and changing) the two middle ones:

>         if (UC_lastautoGN == GRAF_MAP) {
>             Gn = IBMPC_MAP;
>         } else {
>             Gn = GRAF_MAP;
>         }

> So it's a primitive caching scheme.  As long as one switches between
> a set of documents with two different charsets, or three different
> charsets of which one is iso-8859-1, no re-initializing of the
> tables that depend on charset->Unicode mapping is necessary.
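The flip scheme quoted above can be sketched roughly as follows. This is only an illustration, not the code in UCdomap.c: the names GRAF_MAP and IBMPC_MAP come from the quoted snippet, while LAT1_MAP, USER_MAP, CS_LATIN1, slot_charset, reload_count and map_slot_for() are hypothetical names introduced here, and the lookup for an already loaded charset is an assumption about how slot re-use works.

```c
/* Slot layout follows UC_GNsetMIMEnames; LAT1_MAP and USER_MAP are
   assumed names for the two slots that are never changed. */
enum { LAT1_MAP = 0, GRAF_MAP = 1, IBMPC_MAP = 2, USER_MAP = 3 };

#define NSLOTS 4
#define CS_LATIN1 0   /* hypothetical charset handle for iso-8859-1 */

/* Hypothetical bookkeeping: which charset handle each slot holds (-1 = none). */
int slot_charset[NSLOTS] = { CS_LATIN1, -1, -1, -1 };
int last_auto_slot = IBMPC_MAP;   /* so that the first load goes to GRAF_MAP */
int reload_count = 0;             /* how many times a table had to be rebuilt */

/* Return the slot holding the table for `charset`, rebuilding one of the
   two middle slots only when the charset is not already loaded. */
int map_slot_for(int charset)
{
    int i, slot;

    for (i = 0; i < NSLOTS; i++) {
        if (slot_charset[i] == charset)
            return i;             /* re-use: no table rebuild needed */
    }

    /* Same alternation as in the quoted UC_MapGN() fragment. */
    if (last_auto_slot == GRAF_MAP)
        slot = IBMPC_MAP;
    else
        slot = GRAF_MAP;

    slot_charset[slot] = charset; /* the real code would rebuild the table here */
    last_auto_slot = slot;
    reload_count++;
    return slot;
}
```

With this sketch, alternating between two charsets plus iso-8859-1 never rebuilds a table after the first two loads; a third non-Latin-1 charset evicts whichever middle slot was not filled last.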

> The "x-transparent" slot may never actually get used, and I am not sure
> whether that ever was the case.

>> > Not invented by me, taken from the original linux code.

> And so are the initial contents of the four slots - they are given by the
> four hardwired tables in UCdomap.c.

> That means that there is some very minimal support for doing some forward
> translations before any reading of the .tbl data, but this may never be
> used (since chartrans initialization occurs early), and changes in other
> functions may be necessary.

>> >> So I just "add" num_n256 so things work without index overrun
>> >> (and hopefully with a proper result) and postpone more UCdomap.c changes
>> >> for dev.Next - a patch from your side is really welcome :-)
>>
>> > Are changes necessary, and for what purpose?
>> Removing the num_n256 stuff gives a core dump at startup.

> How about just testing for (UCInfo[UC_charset_in_hndl].unicount == NULL)
> in the places where you use UCInfo[UC_charset_in_hndl].num_n256?
> That should make num_n256 unnecessary, and shows more directly what
> you are trying to avoid - access of unicount and unitable tables that are
> not there.
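A minimal sketch of the NULL test suggested above, assuming a much reduced stand-in for a UCInfo[] entry. Only the field names unicount and unitable come from the thread; struct charset_info, first_unicode_for() and the assumed table layout (256 per-byte counts followed by a flat array of Unicode values) are hypothetical and serve only to show the guard.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for one UCInfo[] entry. */
struct charset_info {
    const unsigned char  *unicount;  /* assumed: 256 per-byte counts, or NULL */
    const unsigned short *unitable;  /* assumed: flat list of Unicode values, or NULL */
};

/* Return the first Unicode value mapped from byte `ch`, or -1 when the
   charset has no tables or the byte has no mapping. */
long first_unicode_for(const struct charset_info *cs, unsigned char ch)
{
    size_t offset = 0;
    unsigned i;

    /* The guard itself: never touch tables that are not there. */
    if (cs->unicount == NULL || cs->unitable == NULL)
        return -1;

    if (cs->unicount[ch] == 0)
        return -1;                   /* no Unicode value for this byte */

    /* Skip the entries that belong to the lower byte values. */
    for (i = 0; i < ch; i++)
        offset += cs->unicount[i];

    return cs->unitable[offset];
}
```

Testing the pointer directly shows what is being avoided, so no separate counter such as num_n256 has to be kept in sync with the tables.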

Already done in another pending patch: the num_n256 invention was only
temporary, to make things explicit instead of hidden.

>> Another way may be to set UChndl = -1 in LYRegister_with_LYCharSets()
>> to simulate "old" style behaviour (but not for utf-8).
>> All UCTrans* functions are guarded by a UChndl >= 0 check.

> UChndl = -1 used to have a useful meaning: that a character set
> is known to "old method", but not known to "new method".  This was
> used in the UCCan* functions (therefore also in UCSetTransParams())
> [and maybe other places?] With your changes (as of dev.19) UChndl != -1
> has become invariantly true (except in the case of some internal error).
> So you had to make some changes in UCCan* / UCNeedNot*.  [Hey I'm sure
> you know all this; I'm kind of recapitulating for myself.]

I'm thinking of undoing some UCCanTranslate* changes to bring back support for
the UChndl >= 0 check; this handle is a couple of bytes and can be removed at
the last stage.

> In general your changes seem to aim at simplifying things (with the
> final goal to get completely rid of "old" stuff?) and at making
> things clearer.  I think using UChndl = -1 to mean something else than
> it used to doesn't make things clearer though.

> I leave it to you to find the best way (and reserve right to complain...)

The real simplification may be #ifdef'ing out some heavy code
that deals with "old" style usage (in SGML.c, HTPlain.c, LYCharUtils.c (Uh!),
and at the last stage in HTMLDTD.c, LYCharSets.c, ...).
It is a "binary bloat" item and also a general problem of maintaining
such an inhomogeneous piece of code.

> There are just too many combinations of settings/flags that can occur,
> N document charsets X N display charsets X where (plain text,html text,
> ALT text,HREF) X raw flag X <???>, it's near impossible to systematically
> test all these cases when some internal changes are made.  Well that's my

The chartrans implementation currently works without problems (only minor fixes
have been made since late 2.7.1.ac); it consists of the UCdomap.c engine
and the chartrans calls throughout lynx.  I have no intention of changing it;
the obvious aim is to eliminate the numerous "special cases",
which should not be too hard (the difficulty is not to break the CJK
and UTF-8 implementations, which I cannot easily test).

> excuse for not wanting to change too much (if it ain't broke don't fix
> it).  [The other excuse is to keep things as flexible as possible (leaving
> some stuff in that is "currently" unused or underused - for "one day" using
> it), but otoh getting rid of some redundancy is a worthwhile goal...]

No problem - it may be left #ifdef'ed in the code
(but since it will not be used, it will not be actively tested or maintained,
so it is more likely to get broken in future by unrelated lynx changes, yes).

>       Klaus

p.s. The real problem I see is the limited space for lynx special
characters like HT_NON_BREAK_SPACE, HT_EM_SPACE, etc. (see GridText.h),
which are mapped to the < 32 area: we cannot add more, say HT_EN_SPACE
(and we have probably already broken the Vietnamese implementation,
though nobody seems interested). Indirect use of the "old" entities translation
may effectively solve the problem, but I am not sure.



