lynx-dev

Flynx-dev Re: msg00798.html (was: 0x2276 handling)


From: Foteos Macrides
Subject: Flynx-dev Re: msg00798.html (was: 0x2276 handling)
Date: Thu, 7 May 1998 02:44:03 -0400

"Leonid Pauzner" <address@hidden> wrote:
>>         To illustrate this further, in that FAQ, for each line which has CJK
>> dibyte characters the author has placed a line with homologous ASCII strings
>> below it.  For non-CJK document charsets, which Lynx can handle via
>> Unicode-based character conversions, that wouldn't (and ideally, should not
>> ever) be necessary (we get the Cyrillic or Greek equivalents of "Kung Fu"
>> strings automatically :).  Leonid in effect was asking what Lynx should
>> do in such cases, but no matter what it does within the constraints of
>> the current CJK implementation, the screen display is not going to be
>> interpretable in such cases by people who might otherwise "sound out"
>> CJK strings converted to ASCII strings.
>
>Correct.
>
>But I was interested mainly not in phonetic equivalents which may be useful
>only for short things like name or address, but more technical and simple 
>issue:
>how the `transparent' layout may look like if the character map <256,
>e.g. a few 8bit characters are restricted like x80-x9F for iso-8859-x.
>There should be a `unified' representation form like numeric UA8UBA...
>which I saw under some circumstance.

        Note that the so-called "transparent" Display Character Set is
different from the RAW mode setting.  When RAW mode is on, it acts as an
instruction to treat the document charset as being the same as the
current Display Character Set.  Any 8-bit characters that are appropriate
for the current DCS should be used directly (rather than going through a
conversion to Unicode and then back to what they were in the first
place), but inappropriate characters should be filtered (ignored;
but see below).  You should toggle on RAW mode when you know the
document charset is the same as your DCS, but no information about the
document charset was obtained, and your assumed charset is not the same
as your DCS (e.g., you have Lynx set to assume iso-8859-1, and your DCS
is koi8-r, and you know or suspect that the document is koi8-r, but the
server didn't indicate that via a charset parameter in the Content-Type
header, and the author didn't indicate it via a META; or, you're accessing
a local text/plain file that you know is koi8-r, and it doesn't have a
suffix that you've mapped to "text/plain;charset=koi8-r").
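
        As a rough illustration of the RAW mode logic described above,
here is a minimal Python sketch (not Lynx code).  The function and
parameter names are hypothetical, and letting an explicit charset
declaration take precedence over RAW mode is an assumption made only
for the sake of the example.

```python
def effective_doc_charset(declared, assumed, display, raw_mode):
    """Pick the charset used to interpret the document's bytes.

    declared: charset from a Content-Type header or META, or None
    assumed:  the user's assumed charset (e.g. "iso-8859-1")
    display:  the current Display Character Set (e.g. "koi8-r")
    raw_mode: True when the RAW toggle is on
    """
    if declared:
        # Assumption: explicit charset information wins over RAW mode.
        return declared
    if raw_mode:
        # RAW on: treat the document as already being in the DCS.
        return display
    return assumed


def needs_conversion(doc_charset, display):
    """No Unicode round trip is needed when the charsets already match."""
    return doc_charset.lower() != display.lower()
```

With assumed iso-8859-1, a koi8-r DCS, and no charset information,
effective_doc_charset(None, "iso-8859-1", "koi8-r", True) yields
"koi8-r", so needs_conversion() is false and the 8-bit characters can
be used directly.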

        The "transparent" DCS is one designed by Klaus, which, as he
described it, is "more raw than RAW mode".  I did include it in v2.7.2,
but with apprehension, because it's basically a diagnostic for programmers.
One for the most part can use it like the RAW mode toggle, but it's more
of a pain to go to the 'o'ptions menu and change the DCS than to simply
hit the '@' key at any time, if needed.

        The Uhh representation of characters should be used in the case
where the character is valid for the document charset, but the Display
Character Set has no representation for it, and there is no 7-bit
approximation for it in def7_uni.tbl.  That would apply to the CJK
di-bytes when your DCS does not correspond to the document charset,
but then your screen would be filled with Uhh characters (and they're
not in fact Unicode; also see below).  Otherwise, it should be relatively
rare, now, as the Unicode tables have become rather extensive.
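
        The fallback order just described can be sketched as follows.
This is a hedged Python illustration, not the actual Lynx code: the two
dictionaries are hypothetical stand-ins for the DCS table and for the
7-bit approximations in def7_uni.tbl, and the exact "U" plus hex-digits
format is assumed from the UA8UBA... output Leonid mentioned.

```python
def render_char(code, display_repr, seven_bit_approx):
    """Return what goes on the screen for one character code.

    display_repr:     hypothetical map, code -> string in the DCS
    seven_bit_approx: hypothetical map, code -> 7-bit approximation
    """
    if code in display_repr:
        return display_repr[code]        # the DCS can show it directly
    if code in seven_bit_approx:
        return seven_bit_approx[code]    # fall back to a 7-bit approximation
    return "U%02X" % code                # last resort: Uhh notation


def render_bytes(data, display_repr, seven_bit_approx):
    """Apply render_char to each byte of a bytes object."""
    return "".join(render_char(b, display_repr, seven_bit_approx) for b in data)
```

For example, render_bytes(b"\xa8\xba", {}, {}) gives "UA8UBA", which is
why a CJK document viewed with a non-matching DCS would fill the screen
with such strings.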

        For any raw x80-x9F characters when the document charset is
iso-8859-x, Lynx just ignores them.  It traditionally ignored numeric
character references in that range (e.g., "&#145;"), as well, but in
v2.7.2 I changed that to the error recovery of assuming they're
MS Windows characters due to FrontPage's misuse of numeric character
references.  But that's *only* for numeric character references, not
raw 8-bit characters.
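
        To make the distinction concrete, here is a small Python sketch
(again, not the Lynx source).  The function names are hypothetical, and
it uses Python's cp1252 codec as a stand-in for the MS Windows
interpretation; dropping values that cp1252 leaves unassigned is an
assumption of the sketch.

```python
def resolve_numeric_reference(n):
    """Return the character for a numeric character reference &#n;.

    Values 128..159 get the error recovery of being read as
    MS Windows (cp1252) characters.
    """
    if 0x80 <= n <= 0x9F:
        try:
            return bytes([n]).decode("cp1252")
        except UnicodeDecodeError:
            return ""  # assumption: drop values cp1252 leaves unassigned
    return chr(n)


def filter_raw_bytes(data):
    """Ignore raw 0x80-0x9F bytes in an iso-8859-x document."""
    return bytes(b for b in data if not 0x80 <= b <= 0x9F)
```

So "&#145;" resolves to U+2018 (a left single quotation mark), whereas
a raw 0x91 byte in the same iso-8859-x document is simply dropped.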


>> >The truth is, however, I am not having a lot of luck using Win32 Lynx
>> >2.7.1ac-0.81, if there is a meta tag describing the character set, e.g.,
>> ><META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=x-euc-jp">.
>> >Without a meta tag, a document is rendered okay.  Guess it means I need
>> >to upgrade to 2.8.
>>
>>         That has nothing to do with CJK support, per se, beyond that you've
>> set up euc-jp as your assumed charset, and therefore get what you want when
>> Lynx has no idea what is the actual charset.  Klaus had bad logic and an
>> incomplete list for charset synonyms (which blew it on "x-euc-jp" and
>> "x-shift_jis").  He changed the logic to more like what I had in v2.7.2
>> and supplemented the synonyms lists equivalently, shortly before he took
>> off.
>>
>>                                         Fote
> 
>Actually, 2.8+ ignore META charset if one is CJK (in most cases)
>but display is not, I don't know why.
>I checked the code yesterday but haven't solved the problem.
>2.7.2 *seems* do the same job in this area (according to the code).
>Assume_charset have only effect here. So today I have fixed ^A in Options,
>which was not working `sometimes' until you reload by Ctrl-R.

        If it is a CJK charset, and the Display Character Set is not, then
Lynx has no good way to deal with the situation (nothing it can do to make
the document even phonetically readable), so all that matters is that it
does not do something which could result in detrimental consequences (e.g.,
processing characters in a way which might be treated as X-OFF).  The
choices are to 1) force a download offer, 2) fill the screen with Uhh
characters (with overhead to do those conversions, that don't yield any
readability), or 3) ignore the charset information and set an assumed
charset which should be "safe", e.g., iso-8859-1, or whatever the user
has set as assumed in conjunction with the current Display Character Set.
The code has done each of those, at various points in its development,
and presently does the third, but it's still a case of choosing among
no good alternatives, and so no choice is good.  I like forcing a
download offer when thinking about it tonight, but would rather not
think about it any more; to borrow Wayne's phrase it gives me a headache.
It's easy to force a download offer in HTMIME.c when getting a
Content-Type header with charset parameter from a server, but not in
LYCharUtils.c when already committed to rendering, and then getting a
META element which makes that a bad commitment.  However, Henry was
referring to a failure when he *has* selected a CJK Display Character
Set, and it should not fail in that case (and didn't, as far as I know,
in v2.7.2, nor would I expect it to in v2.8).  I won't even try to
address the gibberish followups he posted, though.  Sorry :).
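
        The three choices above, and the header-versus-META asymmetry,
can be sketched in Python as follows.  Everything here is illustrative:
the names, the short CJK charset list, and the rule that a download
offer is only possible for a header-supplied charset are assumptions of
the sketch, not a description of HTMIME.c or LYCharUtils.c.

```python
from enum import Enum


class Fallback(Enum):
    DOWNLOAD = 1      # 1) force a download offer
    UHH = 2           # 2) fill the screen with Uhh characters
    ASSUME_SAFE = 3   # 3) ignore the charset, use a "safe" assumed one


# Illustrative list only, not the set Lynx actually recognizes.
CJK_CHARSETS = {"euc-jp", "shift_jis", "euc-kr", "euc-cn", "big5"}


def is_cjk(charset):
    return charset.lower() in CJK_CHARSETS


def handle_charset(doc, display, assumed="iso-8859-1",
                   policy=Fallback.ASSUME_SAFE, source="header"):
    """Return (action, charset) for a document charset and a DCS.

    source is "header" (Content-Type from the server) or "meta"
    (a META element seen after rendering has already started).
    """
    if not is_cjk(doc) or is_cjk(display):
        return ("render", doc)            # nothing special to do
    if policy is Fallback.DOWNLOAD and source == "header":
        return ("download", doc)          # still early enough to bail out
    if policy is Fallback.UHH:
        return ("render_uhh", doc)
    # Either the policy is ASSUME_SAFE, or a download was wanted but the
    # charset came from a META and rendering is already committed.
    return ("render", assumed)
```

With the default policy, handle_charset("euc-jp", "iso-8859-1") returns
("render", "iso-8859-1"), i.e. the declared charset is ignored, which
matches the behavior Leonid observed.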

                                Fote
-- 
Foteos Macrides
