[W3-dev] Bogus treatment of numeric HTML entities


From: Hrvoje Niksic
Subject: [W3-dev] Bogus treatment of numeric HTML entities
Date: Fri, 28 Nov 2003 23:55:39 +0100
User-agent: Gnus/5.1002 (Gnus v5.10.2) XEmacs/21.4 (Rational FORTRAN, linux)

Take this chunk of HTML:

<pre>
;; Emacs &#12398; &#12496;&#12483;&#12501;&#12449;&#12391;&#12398; face 
&#24773;&#22577;&#12434;&#35501;&#12415;&#21462;&#12387;&#12390;&#12289;&#12381;&#12398;&#23383;&#20307;(&#22826;&#23383;&#12539;
;; &#26012;&#20307;)&#12539;&#33394;&#12434;&#21453;&#26144;&#12375;&#12383; 
HTML 
&#12434;&#20986;&#21147;&#12377;&#12427;&#12503;&#12525;&#12464;&#12521;&#12512;&#12391;&#12377;&#12290;
</pre>

Mozilla displays it correctly, provided it has the necessary fonts at
its disposal.  On the other hand, the W3 currently shipped with XEmacs
chokes with "Wrong type argument: char-or-string-p, 12398".  A little
investigation of how W3 treats entities shows the cause of the
problem:

      (let ((repl (cdr-safe (assq w3-p-s-num
                                  w3-invalid-sgml-char-replacement))))
        (insert (or repl (mule-make-iso-character w3-p-s-num)))))

In other words, W3 takes the numeric entity and expects
`mule-make-iso-character' to convert it to a character.  I don't know
what `mule-make-iso-character' is supposed to do because it's
undocumented.  But under XEmacs, it simply returns CHAR (a number!)
unchanged.  Since 12398 does not correspond to the code of any Mule
character, an error is signaled.

The above code snippet is preceded by a comment:

      ;; char-to-string will hopefully do something useful with characters
      ;; larger than 255.  I think in MULE it does.  Is this true?
      ;; Bill wants to call w3-resolve-numeric-entity here, but I think
      ;; that functionality belongs in char-to-string.
      ;; The largest valid character in the I18N version of HTML is 65533.
      ;; ftp://ds.internic.net/internet-drafts/draft-ietf-html-i18n-01.txt
      ;; wrongo!  Apparently, mule doesn't do sane things with char-to-string
      ;; -wmp 7/9/96

This comment is wrong on several levels:

1. char-to-string should not be expected to do anything sane with
   integers; it accepts a single character and converts it to a
   one-character string.  (It's true that it also handles integers,
   but under XEmacs that's just for convenience and backward
   compatibility.)

2. To convert a character's integer representation in Mule to an
   actual character, you can use int-char.  This applies only to
   XEmacs; GNU Emacs doesn't have a character data type that I know
   of.

3. int-char is worthless for HTML entities because, as stated above,
   it converts a *Mule* integer representation to char.  A numeric
   entity is, on the other hand, the code point of a UCS character!

   This is why the above fails: (int-char 12398) returns nil, and
   (insert 12398) fails, as does (char-to-string 12398).

To summarize: numeric HTML entities use UCS code points, which means
that they need to be converted to Emacs chars by code that understands
Unicode.  char-to-string cannot do that because decoding external
representations is not its job, and int-char cannot do that because it
works on Mule's internal representation of characters, which needn't
be (and currently isn't) UCS.

Fortunately, there is a function present in both XEmacs and GNU Emacs
that groks some Unicode code points.  It is `decode-char', preloaded
with GNU Emacs 21+ and available as part of the Mule-UCS package under
XEmacs.  (Mule-UCS is obsolete, but it works well enough for
decode-char.  Use (progn (require 'unicode) (require 'un-define)) to
load it up.)  You can call it like this:

    ;; Under XEmacs (with Mule-UCS loaded):
    (decode-char 'ucs 12398)
      => ?<japanese squiggle>

    ;; Under GNU Emacs:
    (decode-char 'ucs 12398)
      => <Mule integer repr of the squiggle>
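
To show how a raw entity string could be fed to `decode-char', here is
a minimal sketch.  The function name `my-decode-numeric-entity' is
made up for illustration and is not part of W3; the sketch handles
only decimal entities of the form "&#NNNN;" and returns nil for
anything else, or when `decode-char' is unavailable.

```elisp
;; Sketch only: `my-decode-numeric-entity' is a hypothetical name,
;; not an existing W3 function.
(defun my-decode-numeric-entity (entity)
  "Return the character for ENTITY, a string like \"&#12398;\", or nil.
Only decimal numeric entities are recognized."
  (when (string-match "\\`&#\\([0-9]+\\);\\'" entity)
    ;; The digits between \"&#\" and \";\" are a UCS code point.
    (let ((code (string-to-number (match-string 1 entity))))
      (and (fboundp 'decode-char)
           ;; decode-char returns nil for code points it cannot map.
           (decode-char 'ucs code)))))
```

The result is a character (or, under GNU Emacs, its integer
representation), so a caller would still pass it through
char-to-string or insert it directly.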

In other words, there *should* be a `w3-resolve-numeric-entity'
(sorry, William), and it could look like this:

(defun w3-resolve-numeric-entity (code)
  "Return a string for the numeric HTML entity with UCS code point CODE.
If CODE cannot be decoded, return a \"[UCS-N]\" placeholder instead."
  (let ((char (and (>= code 256)
                   (fboundp 'decode-char)
                   ;; Must check because CODE could be bogus, and because
                   ;; decode-char doesn't work for all valid UCS code
                   ;; points in all versions of Mule.
                   (decode-char 'ucs code))))
    (cond ((< code 256)
           ;; Mule and UCS should agree about [0, 256) range.
           (char-to-string code))
          (char
           (char-to-string char))
          (t
           ;; Can't decode this char.
           (format "[UCS-%d]" code)))))

With that function in place, the W3 snippet quoted earlier should look
like this:

      (let ((repl (cdr-safe (assq w3-p-s-num
                                  w3-invalid-sgml-char-replacement))))
        (insert (or repl (w3-resolve-numeric-entity w3-p-s-num)))))

You might want W3 to attempt loading `unicode' and `un-define' under
XEmacs.  But they take an awfully long time to load, so you might not
want to do that after all.  Since you have to check for a missing
decode-char anyway, it probably doesn't make a difference in practice:
the users who care about Unicode entities can load them up themselves.
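
If W3 did want to try loading them, one possible shape is sketched
below.  The helper name `my-ensure-decode-char' is hypothetical, and
the sketch assumes that a failed `require' signals an ordinary error
that condition-case can catch.  I haven't tried this under XEmacs
either.

```elisp
;; Sketch only: `my-ensure-decode-char' is a hypothetical helper,
;; not part of W3 or Mule-UCS.
(defun my-ensure-decode-char ()
  "Return non-nil if `decode-char' is usable, loading Mule-UCS if needed."
  (or (fboundp 'decode-char)
      ;; decode-char is missing, so try the Mule-UCS features.  Any
      ;; load failure is swallowed so the caller can fall back to a
      ;; placeholder instead of signaling an error.
      (condition-case nil
          (progn
            (require 'unicode)
            (require 'un-define)
            (fboundp 'decode-char))
        (error nil))))
```

A caller would invoke this once, the first time a numeric entity above
255 is seen, so that users who never hit such entities don't pay the
load time.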

I don't have a patch for this, and I haven't tried this code, but I
believe this information should be sufficient for a maintainer to
easily correct the problem.  If more info is needed, please let me
know!

If you want me to see the replies, please keep me in the Cc, as I'm
not subscribed to this list.




