


From: Klaus Weide
Subject: lynx-dev allowed chars in entity names (was Re: character entity references fix)
Date: Mon, 8 Mar 1999 18:06:15 -0600 (CST)

On Sun, 7 Mar 1999, Leonid Pauzner wrote:
> 7-Mar-99 08:22 Klaus Weide wrote:

> > It seems none of the b.something entities can work, because the dot
> > terminates entity parsing.  Are these even *meant* to be used in HTML
> > (of any version)?  Does Lynx use the wrong syntax for recognizing
> > character entities, *are* dots allowed in their names?
> Dots are currently not allowed within lynx entities,
> but it seems there is no restriction in SGML and they are registered with ISO.

Somehow I had always assumed that only letters and digits are allowed
in character entity names (all the existing HTML entities follow
that pattern, as does the lynx code).  Now I have checked.

If I understand the meaning of an SGML declaration right, the allowed
characters in entity names are governed by

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO

in <http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html>.
(Can anyone confirm or explain this further?)

That would mean that '.', '-', '_', and ':' are allowed as (non-initial)
characters in entity names (as well as element names etc.), in addition to
(ASCII) letters and digits.

The HTML 2.0 SGML declaration has only
                          LCNMCHAR ".-"
                          UCNMCHAR ".-"

That means that the parsing of entity names in lynx in various places
(at least LYCharUtils.c, SGML.c) is too strict.  Basically, tests like

         (unsigned char)*cp < 127 &&
               isalnum((unsigned char)*cp)

should be replaced with

         (unsigned char)*cp < 127 &&
              (isalnum((unsigned char)*cp) ||
               *cp=='.' || *cp=='-' || *cp=='_' || *cp==':')

(modulo checking whether this is OK for EBCDIC).

As long as Lynx doesn't actually have any character entities with such names
in the set of recognized ones, it shouldn't make any difference though.
But for "b.something" it would be necessary.
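
To make the difference concrete, here is a minimal sketch of the two tests
side by side.  The function names (name_char_strict, name_char_html40,
entity_name_len) are hypothetical and are not taken from LYCharUtils.c or
SGML.c; the sketch also assumes an ASCII character set, so the `< 127` test
would need rethinking for EBCDIC.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not an existing Lynx function): the strict test,
 * accepting only ASCII letters and digits. */
static int name_char_strict(int c)
{
    unsigned char uc = (unsigned char)c;
    return uc < 127 && isalnum(uc);
}

/* Hypothetical helper: the relaxed test, which also accepts the
 * LCNMCHAR/UCNMCHAR characters ".-_:" from the HTML 4.0 SGML declaration.
 * Assumes ASCII; not checked against EBCDIC. */
static int name_char_html40(int c)
{
    unsigned char uc = (unsigned char)c;
    return uc < 127 &&
           (isalnum(uc) || uc == '.' || uc == '-' || uc == '_' || uc == ':');
}

/* Return the length of the entity name starting at s, where s points just
 * past the '&'.  The first character must be an ASCII letter, because the
 * extra characters are only allowed in non-initial positions.  Returns 0
 * if no valid name starts at s. */
static size_t entity_name_len(const char *s, int (*is_name_char)(int))
{
    size_t n = 0;

    assert(s != NULL);
    if (!((unsigned char)s[0] < 127 && isalpha((unsigned char)s[0])))
        return 0;
    while (s[n] != '\0' && is_name_char((unsigned char)s[n]))
        n++;
    return n;
}
```

With the strict test, entity_name_len("b.something;", name_char_strict)
stops at the dot and returns 1, whereas the relaxed test consumes the whole
name and returns 11.  For a name like "amp;" both tests return 3.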

> >> We should probably decide whether we want lynx act strictly as HTML 4.0
> >> and reject everything else or keep as much as possible. Any vote?
> 
> Done. The second table in entities.h is strict HTML4.0 entities list
> (252 entries mapped to unicode 1:1), it is #ifdef'ed
> with ENTITIES_HTML40_ONLY (a better name?) and may be used
> _instead_ of the current table (~995 entries without reverse mapping).
> The smaller table is useful for page validation, while the larger one may
> be safer for future standards - who knows?

Sounds like a good plan.

   Klaus
