
lynx-dev changes for Japanese (was: dev.16 patch)


From: Klaus Weide
Subject: lynx-dev changes for Japanese (was: dev.16 patch)
Date: Sun, 12 Dec 1999 18:13:58 -0600 (CST)

On Mon, 13 Dec 1999, Hataguchi Takeshi wrote:

> I checked the behavior with half width katakana 
> and wrote a patch for dev.16.

It's great to see development of this on lynx-dev.

Some comments below, but I may not really know what I am talking about.
There are certainly a lot of details I don't understand.

> On Mon, 6 Dec 1999, Klaus Weide wrote:
> 
> > running on Windows or something else.  Yet it seems a lot of the more
> > recently added code for Japanese is Windows-specific.  It seems I don't
> > even understand the problem, so no surprise that I don't understand the
> > solutions.
> 
> Really? 
> 
> I believe almost all of Hiroyuki's code for Japanese is ifdef'd by
> CJK_EX and isn't Windows-specific. I haven't looked at all the code
> ifdef'd by CJK_EX yet, but it shouldn't be Windows-specific.

I shouldn't have generalized.  Since you have looked at this closer,
and understand the different character encodings better, I accept that
you are right in general.

What I was specifically thinking of was the discovery of
SUPPORT_MULTIBYTE_EDIT.  (I was recently looking at LYStrings.c to
get an idea how hard it would be to add better support for UTF-8 to
the line-editor.  I hadn't paid much attention to those #ifdef'd sections
there before - there's so much of it, I tend to tune it out.)
SUPPORT_MULTIBYTE_EDIT seems to be defined only in two of the
Windows-specific makefiles, makefile.msc and makefile.bcb, and
nowhere else.  It is not mentioned in INSTALLATION or README.defines,
not explained in the makefiles, not ifdef'd with any *_EX, and I
couldn't find it mentioned in CHANGES, so I wonder where it's coming
from.

Anyway, it uses a hardwired IS_KANA macro that seems to be completely
Shift_JIS specific.  I think it should test the current display
character set instead.  Something like 

#define IS_KANA(c) (HTCJK==JAPANESE && current_char_set == SHIFT_JIS && \
                    0xa1 <= (c) && (c) <= 0xdf)

with SHIFT_JIS perhaps defined in UCdomap.c, equivalent to LATIN1, US_ASCII,
UTF8, TRANSPARENT.
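
For concreteness, here is a compilable sketch of that test.  The enum
stand-ins for Lynx's globals are mine, invented for illustration; a
real patch would use the existing HTCJK and current_char_set variables
(and only bytes 0xA1..0xDF are halfwidth katakana in Shift_JIS - 0xA0
is unassigned):

```c
#include <assert.h>

/* Hypothetical stand-ins for Lynx's globals; illustration only. */
enum { NOCJK, JAPANESE } HTCJK = JAPANESE;
enum { LATIN1, US_ASCII, UTF8, TRANSPARENT, SHIFT_JIS } current_char_set = SHIFT_JIS;

/* Halfwidth katakana occupy the single bytes 0xA1..0xDF in Shift_JIS,
 * and the test only makes sense when the d.c.s. actually is Shift_JIS. */
#define IS_KANA(c) (HTCJK == JAPANESE && current_char_set == SHIFT_JIS && \
                    0xA1 <= (unsigned char)(c) && (unsigned char)(c) <= 0xDF)
```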

Is it true that, once Japanese text is in the HText structure, it is
always converted to the right d.c.s., i.e., either EUC-JP or Shift_JIS?
I hope so, otherwise having the distinction wouldn't make much sense...

But I don't understand who needs the SUPPORT_MULTIBYTE_EDIT code.  It
seems to me every CJK charset user should need it, is that not the case?
If it's true, then there shouldn't even be a special macro
SUPPORT_MULTIBYTE_EDIT.  And it should of course not be Windows-specific.

This leaves me wondering about the most basic functioning of line-editing
with CJK display character sets.  What happens when you delete, for
example with the backspace key, one half of a multibyte character, without
that special code?  Does it work at all?
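
To illustrate what I mean, a minimal sketch (not Lynx code; the helper
names are made up) of what a Shift_JIS-aware backspace has to figure
out.  Since trail bytes overlap the ASCII and katakana ranges, you
can't classify the last byte in isolation; you have to scan from the
start and remember where the last character began:

```c
#include <assert.h>
#include <stddef.h>

/* Shift_JIS lead bytes of two-byte characters. */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* How many bytes backspace must remove from the end of buf:
 * 1 for ASCII or halfwidth katakana, 2 for a kanji byte pair. */
static size_t sjis_backspace_bytes(const char *buf, size_t len)
{
    size_t i = 0, last = 0;
    while (i < len) {
        last = i;
        i += sjis_is_lead((unsigned char)buf[i]) ? 2 : 1;
    }
    return len - last;
}
```

Without something of this sort, deleting the trailing byte of a pair
would leave a stray lead byte in the buffer.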

> > > [...]
> > > 0xA1    0xFF61  # HALFWIDTH IDEOGRAPHIC FULL STOP
> > [...]
> > > 0xDF    0xFF9F  # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
> > 
> > Thank you for the explanation.
> > 
> > The existence of those 1-byte codes is something I totally neglected
> > in my recent changes (for WHEREIS search highlighting glitches, I think
> > you know what I mean).  That means that the code should be correct for
> > EUC-JP, but still not for Shift-JIS.  (Since WHEREIS operates on the
> > end result of Lynx's formatting and conversions, I suppose it should be
> > correct for Display Character Set == "Japanese (EUC-JP)" and incorrect
> > for D.C.S. == "Japanese (Shift_JIS)", independent of the original charset
> > of the document as transmitted, as long as Lynx's conversion was otherwise
> > correct.)
> 
> There is a code for half width katakana, but we don't always have
> fonts for it. So I think it's better if Lynx converts half width katakana
> into full width for display. If CJK_EX is defined,
> Lynx actually does so in almost all cases.
> # Lynx didn't convert in source mode, but my patch will improve it.

Does that also apply to text/plain files?

You may also want to check the source mode variants, -preparsed and
-prettysrc.

> It seems (WHEREIS search) highlighting works well if CJK_EX is defined, 
> but it doesn't if it's not defined, because half width katakana can be
> on the screen. I think Lynx should always convert half width katakana 
> into full width. Are there any side effects?

Maybe with alignment of preformatted text, including text/plain?

Should this be a separate (configuration?) option, rather than everything
being covered by CJK_EX?

Is it too hard to deal with half width katakana in all the necessary
places, rather than "forbidding" it?  I assume it would be a lot of
work.

But it may also be a lot of work to find all the places where conversion
would need to take place.  For example, a user may enter those characters
in a form (or paste them in from the clipboard).

> There may be other wrong effects in Lynx when a document includes
> half width katakana. For example, I found that Lynx fails to parse 
> tags in the case below.
> 
>     X<p> (Assume X is half width katakana)
> 
> # Precisely speaking, half width katakana is one byte in Shift_JIS and
> # two bytes in EUC-JP. Lynx fails only when it's written in Shift_JIS.

But if there are half width characters in EUC-JP that are encoded as two
bytes, the WHEREIS highlighting code should also fail.  I don't understand
how it can work; functions in LYStrings.c at least seem to assume that
if (HTCJK != NOCJK) then an eightbit character is always the first of a
pair (for example in LYmbcsstrlen and LYno_attr_mbcs_strstr).
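
A reduced sketch of that assumption (modeled loosely on LYmbcsstrlen;
the name is mine), showing how single-byte halfwidth katakana in
Shift_JIS would be miscounted:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the pairing assumption: any eightbit byte is taken as the
 * first of a two-byte pair.  Correct for EUC-JP kanji; wrong for
 * Shift_JIS halfwidth katakana, which are single bytes 0xA1..0xDF. */
static size_t cjk_strlen_paired(const char *s)
{
    size_t n = 0;
    while (*s) {
        if ((unsigned char)*s >= 0x80 && s[1] != '\0')
            s++;            /* skip the assumed second byte of a pair */
        s++;
        n++;
    }
    return n;
}
```

Two Shift_JIS halfwidth katakana bytes in a row get counted as one
character instead of two, so all positions after them are off.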


> I'll attach the example file, you can try it with setting 
> Display Character Set as Japanese (Shift_JIS or EUC-JP)
> without Japanese font.

I admit I haven't tried it yet.

> With my patch applied, Lynx can parse it 
> as expected (I believe).
> 
>  
> +#if 0 /* This doesn't seem to be valid code.
> +       * ref: http://www.isi.edu/in-notes/iana/assignments/character-sets
> +       */
>  #define IS_EUC_LOS(lo)       ((0x21<=lo)&&(lo<=0x7E))        /* standard */
> +#endif

Could it be necessary for some of the other EUC (not -JP) codes?
Or could it be an attempt to support (from the IANA list)

   Name: Extended_UNIX_Code_Fixed_Width_for_Japanese
   ...
                code set 3: JIS X0212-1990 (a double 7-bit byte set)
                            restricted to A0-FF in
                            the first byte
                            and 21-7E in the second byte
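
For reference, the byte ranges involved, as I understand them (the
macro names here are mine, not from UCdomap.c): in packed EUC-JP both
bytes of a JIS X0208 character are in A1-FE, halfwidth katakana is
SS2 (8E) followed by A1-DF, and the 21-7E range of the removed
IS_EUC_LOS test matches the 7-bit second byte of the fixed-width
form's code set 3 quoted above, not packed EUC-JP:

```c
#include <assert.h>

#define EUC_SS2            0x8E                       /* introduces halfwidth kana */
#define IS_EUC_BYTE(b)     (0xA1 <= (b) && (b) <= 0xFE)   /* either byte, code set 1 */
#define IS_EUC_HWKANA2(b)  (0xA1 <= (b) && (b) <= 0xDF)   /* byte after SS2 */
#define IS_7BIT_LO(b)      (0x21 <= (b) && (b) <= 0x7E)   /* fixed-width code set 3 */
```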


>  
> -#ifdef CJK_EX        /* 1998/11/24 (Tue) 17:02:31 */
> +#if 0 /* This should be a business of GridText */

That's part of what I don't understand at all. :)
(what should be whose business.)

> +    if ((HTCJK==JAPANESE) && (context->state==S_in_kanji) &&
> +     !IS_JAPANESE_2BYTE(kanji_buf,(unsigned char)c)) {
> +#ifdef CJK_EX
> +     if (IS_SJIS_HWKANA(kanji_buf) && (last_kcode == SJIS)) {
> +         JISx0201TO0208_SJIS(kanji_buf, &sjis_hi, &sjis_lo);
> +         PUTC(sjis_hi);
> +         PUTC(sjis_lo);
> +     }
> +     else
> +         PUTC('=');
> +#else
> +     PUTC('=');
> +#endif
> +     context->state = S_text;
> +    }

(This seems to be the place where the failure with
>     X<p> (Assume X is half width katakana)
comes in, right?)

But now that problem shouldn't be too hard to solve, after kanji_buf
has been introduced.

> @@ -1744,6 +1761,7 @@
>       **  (see below). - FM

The comment (of which this is the last line) should also be changed;
it says
        **  We could try to deal
        **  with it by holding each first byte and then checking
        **  byte pairs, but that doesn't seem worth the overhead

so it doesn't apply any more...

>       */
>       context->state = S_text;
> +     PUTC(kanji_buf);
>       PUTC(c);

You probably should also flush out the new kanji_buf in SGML_free
(if it is a valid character).  It could be the last character of
a file.  Of course that's rare, but it could even be valid HTML
(</BODY></HTML> tags are not required).
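
Something like the following, sketched with a simplified context
struct (field and function names are mine; the real SGML context and
PUTC mechanism differ):

```c
#include <assert.h>

/* Hypothetical sketch of the end-of-input flush suggested for
 * SGML_free: if the stream ends while a lead byte is still held in
 * kanji_buf, hand it back so the file's last character isn't lost. */
enum sgml_state { S_text, S_in_kanji };

struct sgml_ctx {
    enum sgml_state state;
    unsigned char kanji_buf;      /* pending first byte of a pair */
};

/* Returns the byte the caller should PUTC at EOF, or -1 if none. */
static int flush_kanji_buf(struct sgml_ctx *ctx)
{
    if (ctx->state == S_in_kanji) {
        ctx->state = S_text;
        return ctx->kanji_buf;
    }
    return -1;
}
```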

>       break;
>  
> @@ -1772,7 +1790,7 @@
>           **  to having raw mode off with CJK. - FM
>           */
>           context->state = S_in_kanji;
> -         PUTC(c);
> +         kanji_buf = c;
>           break;

This S_in_kanji handling is only done from S_text state.
Would it need to be repeated for S_value, S_quoted, and S_dquoted
to get attribute values right (for example, ALT text)?  And maybe
more of the states?  Would it have to be duplicated in HTPlain.c
for plain text and source view?

All that doesn't sound like fun.

   ---

Hmm, so where does an explicit "charset" come in, if there is one?
I.e., "text/html;charset=euc-jp" vs. "text/html;charset=shift_jis".
It seems some things in SGML.c should depend on it, but I don't see
it being considered at all.
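
Just to make the question concrete, this is the kind of information I
mean, in an illustrative (not Lynx's actual) form - pulling the
charset parameter out of a Content-Type value.  I believe Lynx's real
header parsing happens elsewhere (HTMIME.c); this sketch assumes a
lowercase "charset=" with no surrounding whitespace:

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: locate the charset parameter in a Content-Type
 * value, the information that SGML.c doesn't seem to consult. */
static const char *media_charset(const char *content_type)
{
    const char *p = strstr(content_type, "charset=");
    return p ? p + 8 : NULL;    /* 8 == strlen("charset=") */
}
```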

   ---

Excuse my ramblings.  A lot of it may really be based on false
assumptions; every sentence should have an "as far as I understand".

   Klaus

