
Re: LYNX-DEV charsets


From: Klaus Weide
Subject: Re: LYNX-DEV charsets
Date: Tue, 15 Jul 1997 19:57:21 -0500 (CDT)

On Tue, 15 Jul 1997, Ricardas Cepas wrote:

>                 Hello,
> 
>         I have found in  various README files that current
> Lynx can work on UTF-8 console. 

(I assume you are using linux.  That is all it has ever been
tested with, and then not much.)

> How do exactly enable this
> ?  If I  set display  charset  to Unicode  UTF-8 and  META
> tag  has  `charset=UTF-8'  or unicode-1-1-utf-8  it  works
> to  some extent.   

That is the only situation where utf-8 display is currently
supposed to work:
       net -> utf-8 -> lynx -> utf-8 -> display.

(If there are unicode characters in the utf-8 text which the
display or console driver cannot really display, it is up to
them to deal with it.  They may display a blank or something
else.  Lynx doesn't know.)

The following also works:
       net -> utf-8 -> lynx -> (appropriate 8-bit display
                               character set) -> display

(But unicode characters in the utf-8 text which cannot be shown
in the display character set will have some ASCII replacement or
Unnn shown.)

> But  for other  charsets everything  is
> translated to  ASCII and  U??? numbers. 

... but this doesn't work:
      net -> (an 8-bit charset) -> lynx -> utf-8 -> display.

In other words, lynx doesn't *generate* UTF-8 from other character
sets/encodings.  Not so much because it would be difficult or
impossible, but for lack of interest so far.  Also because in general
the UTF-8 display doesn't work very well (see below).
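
To illustrate why generating UTF-8 is not hard in principle, here is a
minimal sketch for one simple case.  It is not Lynx code: the function
name latin1_to_utf8 is hypothetical, and ISO-8859-1 is assumed as the
input charset only because its byte values coincide with Unicode code
points.  Any other 8-bit charset would need a mapping table first.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert a NUL-terminated ISO-8859-1 string to UTF-8.
 * 'out' must have room for 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL.
 * (Hypothetical helper, for illustration only.)
 */
static size_t latin1_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        if (*in < 0x80) {
            /* 7-bit ASCII is encoded as itself. */
            out[n++] = *in;
        } else {
            /* U+0080..U+00FF become a two-byte sequence 110xxxxx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

For the single input byte E9 (e with acute accent) this writes the two
bytes C3 A9.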

[ Note: everything below applies only to UNICODE UTF-8 as "Display
character set", not to utf-8 displayed with some other Display
character set. ]

> And where  is this SLANG_????_HACK ?

You have to add a -DSLANG_MBCS_HACK, by hand, to the compilation
options.  Simplest would be to add it in the top-level makefile,
on the SITE_DEFS = ... line.

Lynx doesn't write its output directly to the terminal; it hands
the characters to either slang or (n)curses for display.
Now, neither slang nor curses knows anything about multibyte
characters (like UTF-8).  So when they get (hex) C4 8D they assume
that that's *two* characters taking up two screen positions,
when in effect (if the console is in utf-8 mode) it is really
just *one* character.  Slang and ncurses both count characters
written to a (virtual) screen line, to keep track of the current
position, and this counting gets confused by multibyte characters.
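
The mismatch can be sketched in a few lines of C.  This is only an
illustration, not code from Lynx, slang or ncurses; the function names
count_bytes and count_utf8_chars are made up for the example, and the
second one assumes well-formed UTF-8 in which every character occupies
one screen cell.

```c
#include <assert.h>
#include <stddef.h>

/* Count positions the way a byte-oriented screen library would. */
static size_t count_bytes(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/*
 * Count UTF-8 characters: every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new character.
 */
static size_t count_utf8_chars(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the byte string C4 8D 'a' 'b', count_bytes returns 4 while
count_utf8_chars returns 3, so the library's idea of the cursor
position is already off by one column.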

This leads to (roughly) three kinds of problems:

1. Slang or ncurses may think prematurely that a line is "full"
   (the end of the displayable length is reached).  I found that
   in this case slang truncated the line, while (some version of)
   ncurses wrapped to the next line (which creates more of a mess).

2. Optimization for movement within a line gets messed up.

3. Optimization between screen updates gets messed up.

The solution for 3. is to force a full screen redraw after changes
(going to a different page etc.), with clearok() before a refresh()
or equivalent.

For 2., I found that slang apparently tries less hard to be clever
than ncurses.  Only when it had to output a number of space (0x20)
characters in a row would it send a cursor-repositioning sequence
to the screen (which is a problem, since slang has the wrong idea
of what the position should be).  This isn't solved, but it becomes
visible only for lines with text after a number of spaces (for
example, preformatted tabular data with blanks in some columns).
Ncurses seemed to try more optimization, so it would put text in
the "wrong" place within a line more often.

Finally, for problem 1. (line truncation) with slang, the
SLANG_MBCS_HACK is the workaround I found.  It tricks slang into
believing that it has, instead of a line length of (e.g.) 80
characters, 6 * 80 = 480 positions available.  (The factor six is
for the extreme case where all characters are utf-8 encoded as
six bytes.)  So slang won't cut off any line prematurely; and
since lynx also keeps track of line length, and does know about
utf-8 characters and about the real max. length (e.g. 80), things
hopefully work out.
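
The arithmetic behind the workaround can be sketched as follows.  This
is an illustration under assumptions, not the actual SLANG_MBCS_HACK
code: inflated_cols and bytes_for_chars are hypothetical names, and
the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Longest UTF-8 sequence considered here, in bytes. */
#define MAX_UTF8_BYTES 6

/* Width reported to the byte-counting screen library. */
static size_t inflated_cols(size_t real_cols)
{
    return MAX_UTF8_BYTES * real_cols;
}

/*
 * Number of leading bytes of 's' that make up at most 'max_chars'
 * complete UTF-8 characters.  This stands in for the application
 * tracking the real line length itself.
 */
static size_t bytes_for_chars(const unsigned char *s, size_t max_chars)
{
    size_t bytes = 0, chars = 0;

    assert(s != NULL);
    while (s[bytes] != '\0') {
        if ((s[bytes] & 0xC0) != 0x80) {
            /* Lead byte: a new character starts here. */
            if (chars == max_chars)
                break;
            chars++;
        }
        bytes++;
    }
    return bytes;
}
```

With a real width of 80, a line of two-byte characters is cut by
bytes_for_chars after 160 bytes, which is well within the 480
positions returned by inflated_cols(80).  A byte-counting limit of 80
would instead have stopped after 40 characters.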

I hope this explains things somewhat. I haven't tested the utf-8
display in a while.  If you find something that seems to contradict
my description, please let me know.

    Klaus

