Re: lynx-dev Re: lynx should respect LANG


From: Klaus Weide
Subject: Re: lynx-dev Re: lynx should respect LANG
Date: Thu, 1 Jun 2000 13:05:23 -0500 (CDT)

On Thu, 1 Jun 2000, Henry Nelson wrote:

> To answer your question, not very well.  Lynx also seems to take the
> first line of the directory listing as a file.  But worse than that is
> that it assumes the file or directory name is ALL the user/group/size/
> date information prefixed to the file name (see the References section).
> Weird; never seen this on any server before.  Here's an example:
> 
> % cd TEST
> % /bin/ls -al  <<== only with /bin/ls; /usr/ucb/ls uses English "total 26"
> 合計 26

Yes, recognition of some strings that normally appear in FTP LIST
responses is hardwired, including the "total "...  Basically that is
necessary because the LIST format isn't standardized at all.  Dan
Bernstein's EPLF (which lynx understands) hasn't widely caught on...
So lynx uses heuristics that work most of the time.
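
To illustrate why the localized summary line slips through, here is a minimal Python sketch of a hardwired "total " check. It is not the code in HTFTP.c; the function names is_total_line and entry_lines are made up for this example, and real LIST parsing involves many more heuristics.

```python
def is_total_line(line):
    """True for the summary line an English-locale `ls -l` prints first."""
    return line.startswith("total ")


def entry_lines(listing):
    """Return the lines of a LIST response that would be treated as entries."""
    return [line for line in listing
            if line.strip() and not is_total_line(line)]


entry = "-rw-r--r--   1 henry    user     259 Jun  1 10:54 wglog1422"
english = ["total 26", entry]
japanese = ["合計 26", entry]  # same listing from a Japanese-locale ls

print(len(entry_lines(english)))   # summary line is skipped
print(len(entry_lines(japanese)))  # summary line is mistaken for an entry
```

With a check like this, the "合計 26" line is not recognized as a summary and ends up in the listing as if it were a file.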

>    2. ftp://address@hidden/home/henry/TEST/%20henry%20%20%20%20user%20%20%20%20%20259%20%206%B7%EE%20%201%C6%FC%20%2010%3A54%20wglog1422
> [...]

I'm not sure why it switches to this interpretation.  Probably some
other factor is involved - the 'unsure_type' flag in HTFTP.c -
which would not be set if the server had responded to SYST with
something that lynx recognizes as a "UNIX" server.

> Anyway, I tend to agree with ipswitch that Solaris 2.6 is broken.  Their
> recommendation was to upgrade to Solaris 2.8.  (Linking /bin/ls to
> /usr/ucb/ls was the poor-man's solution for me.)  Just a curiosity.

Someone could further fine-tune lynx's heuristics to cover that case.
But if FTP server admins universally disable that behavior, because it
doesn't work with other clients anyway, there's not much of a point
(or even a test case).

> Okay, here's the gory details (all 8 combinations):
> Emulator   LANG    Lynx DCS   result
>   EUC    japanese   EUC-JP     okay
>   EUC    japanese  Shift-JIS   glyphs wrong/blank, positioning wrong, keys work
>   EUC       ja      EUC-JP     okay
> * EUC       ja     Shift-JIS   total lockup of emulation (even ^C, ^D dead)
>  SJIS    japanese   EUC-JP     glyphs wrong, positioning okay, keys work
>  SJIS    japanese  Shift-JIS   okay
>  SJIS       ja      EUC-JP     glyphs wrong, positioning okay, keys work
>  SJIS       ja     Shift-JIS   okay

Thank you for the table.  Now I understand better.

> The asterisk marks the combination I have been calling "fatal."  I suppose if
> done from a console, there would be no way to recover other than a hardware
> reboot.  Why take the fun out of Un*x, right? :)  Although matching DCS to
> LANG would prevent the worst-case scenario, it would not guarantee a correct
> rendering.

I assume other applications also need a correct setup (matching between
'Emulator' and 'LANG' columns) in order to display correctly.  Including
for simple things like error messages from a locale-aware 'ls' (although
you seem to have disabled that, see above :) ).  So if the environment
has been set up correctly for other commands, matching DCS to LANG should
do the right thing.
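
As a rough idea of what "matching DCS to LANG" could look like, the following Python sketch guesses a charset name from the codeset part of a LANG value. Everything here is an assumption for illustration: the function name charset_from_lang, the alias table, and the use of MIME charset names as the result. It is not how lynx chooses its display character set.

```python
# Hypothetical alias table: normalized locale codeset -> MIME charset name.
CODESET_TO_CHARSET = {
    "eucjp": "euc-jp",
    "ujis": "euc-jp",
    "sjis": "shift_jis",
    "shiftjis": "shift_jis",
    "pck": "shift_jis",
}


def charset_from_lang(lang):
    """Guess a display charset from a LANG value, or return None."""
    # LANG has the form language[_territory][.codeset][@modifier].
    base = lang.split("@", 1)[0]
    if "." not in base:
        return None  # e.g. "ja" or "japanese": no codeset to go on
    codeset = base.split(".", 1)[1]
    key = codeset.lower().replace("-", "").replace("_", "")
    return CODESET_TO_CHARSET.get(key)


print(charset_from_lang("ja_JP.eucJP"))
print(charset_from_lang("ja"))
```

Bare names such as "ja" and "japanese" from the table above carry no codeset, so a real implementation would need a system-specific lookup for them; the sketch simply returns None in that case.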

> > My nkf apparently cannot deal with JIS X 0212 characters (which can be
> > validly encoded in EUC-JP and in ISO-2022-JP-2), while my iconv can.
> > I guess those characters aren't much used.  (Current lynx doesn't
> > treat them right, either.  I have made some changes in my copy to
> > add support.)
> 
> Don't even know what 0212 are :(.  I've never been able to figure out
> how I'm supposed to use ISO-2022-JP-2 :( :(.  (If it's easy and you have
> the time, kindly write me off list.  Thanks.)

I prefer to respond here, so somebody who knows better can correct me.

EUC-JP, as described in the IANA charset registry at
<http://www.isi.edu/in-notes/iana/assignments/character-sets>, is a
character encoding scheme (or, in MIME lingo, "charset") that contains
four different "code sets":

               code set 0: US-ASCII (a single 7-bit byte set)
               code set 1: JIS X0208-1990 (a double 8-bit byte set)
                           restricted to A0-FF in both bytes
               code set 2: Half Width Katakana (a single 7-bit byte set)
                           requiring SS2 as the character prefix
               code set 3: JIS X0212-1990 (a double 7-bit byte set)
                           restricted to A0-FF in both bytes
                           requiring SS3 as the character prefix

Code set 0 is what lets you read English text intermixed
with Japanese EUC text.  Code set 1 characters are what you normally
use for Japanese.  Code set 2 - you know what it is.  Code set 3
contains additional "fullwidth" characters, defined in a different
standard.

With X, and some fonts in jisx0212 encoding installed, I can view the
available code set 3 glyphs with the command
    xfd -fn "*jisx0212*"
and the kterm terminal emulator, once properly configured and in EUC mode,
recognizes 'code set 3' characters (by the SS3 == 0x8F character prefix)
and shows them correctly.
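
The prefix rule can be sketched in a few lines of Python: the first byte of each character decides which code set it belongs to. This is only an illustration, not code from lynx or kterm. The function name euc_jp_code_sets is made up, and the sketch assumes lead bytes A1-FE for code set 1 and does not validate trailing bytes.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2 (half-width katakana)
SS3 = 0x8F  # single shift 3: prefix for code set 3 (JIS X 0212)


def euc_jp_code_sets(data):
    """Return the code set number (0-3) of each character in EUC-JP bytes."""
    sets = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width, code_set = 1, 0      # US-ASCII, one byte
        elif lead == SS2:
            width, code_set = 2, 2      # SS2 + one byte
        elif lead == SS3:
            width, code_set = 3, 3      # SS3 + two bytes
        elif 0xA1 <= lead <= 0xFE:
            width, code_set = 2, 1      # JIS X 0208, two bytes
        else:
            raise ValueError("unexpected lead byte 0x%02X" % lead)
        if i + width > len(data):
            raise ValueError("truncated character at offset %d" % i)
        sets.append(code_set)
        i += width
    return sets


# One character from each code set: ASCII, JIS X 0208, SS2 form, SS3 form.
sample = b"A" + b"\xa4\xa2" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
print(euc_jp_code_sets(sample))
```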

ISO-2022-JP is another character encoding scheme, one that has provisions
for (using the above labels) code set 0, 1, and 2 characters.
ISO-2022-JP-2 is an extension of ISO-2022-JP with provisions for code
sets 0, 1, 2, and 3, as well as several other (non-Japanese) code
sets.

What you guys in Japan normally call just "JIS", or "JIS 7 bit" or
similar, is ISO-2022-JP, as far as I understand.

Shift_JIS covers only code sets 0, 1, and 2.  So it's not possible
to express JIS X0212 characters in Shift_JIS at all.  One cannot
convert EUC-JP (or ISO-2022-JP-2) to Shift_JIS without loss, if
such characters are present.  So, with Shift_JIS's (i.e., Microsoft's)
popularity such as it is, I assume JIS X0212 characters just aren't
being much used on the Web.  Nor needed by you guys, apparently.
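
The loss is easy to demonstrate with Python's standard codecs, assuming its "euc_jp" codec handles the SS3 (JIS X 0212) form and its "shift_jis" codec covers only code sets 0, 1, and 2. The helper name shift_jis_or_none is made up for this sketch.

```python
def shift_jis_or_none(euc_bytes):
    """Convert EUC-JP bytes to Shift_JIS, or return None if that is lossy."""
    text = euc_bytes.decode("euc_jp")
    try:
        return text.encode("shift_jis")
    except UnicodeEncodeError:
        # At least one character has no Shift_JIS representation.
        return None


plain = "日本語".encode("euc_jp")  # code set 1 characters only
with_0212 = b"\x8f\xb0\xa1"        # SS3 prefix + one JIS X 0212 character

print(shift_jis_or_none(plain) is not None)
print(shift_jis_or_none(with_0212))
```

A converter that must not fail would have to substitute or drop such characters, which is exactly the loss described above.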

All this is probably much simplified, without paying attention to
different standards versions, Microsoft private extensions and
deviations, etc.

> > the display character set from the command line, but you have to
> > compile with -DMISC_EXP, and it's undocumented.  Consider this
> > a plug for (helping test) MISC_EXP.
> >      lynx -dump -convert_to="text/plain;charset=euc-jp" ...
> 
> This could be a real time-saver.  Do I have to use -dump?  Is it possible/
> meaningful-to-do-so in an interactive session?  Might just add it to my
> alias to lynx.

No, yes.  The -convert_to is just a small feature that gives direct
access from the command line to some mechanisms that are already present.
It doesn't add new code (except for parsing the option itself).
Quoting CHANGES, for 1999-10-21 (2.8.3dev.13):
* experimental command line option -convert_to, only compiled in if new
  MISC_EXP symbol is defined.  This takes a string in the form of a MIME type,
  which can also be combined with an appended ";charset=" parameter.  (This
  needs shell quoting of course).  The charset value can be used to set the
  display character set from the command line.  The MIME type can be one of the
  non-official types used internally, for some interesting effects (crashing
  lynx not excluded).  Try www/download, www/source, www/dump, or some
  unrecognized string.

   Klaus


