
lynx-dev UTF-8 display questions (was: Superscripts)


From: Klaus Weide
Subject: lynx-dev UTF-8 display questions (was: Superscripts)
Date: Wed, 7 Jun 2000 13:40:26 -0500 (CDT)

[ Ignoring invalid encoding of raw non-ASCII characters in Sergei's
message... ]

On 7 Jun 2000, Sergei Pokrovsky wrote:
> >>>>> "Sergei" == Sergei Pokrovsky <address@hidden> wrote:
> ...
>   >>> Also linebreaks often occur in strange places, like in the last
>   >>> paragraph of my example.

> Here is such an example, see
> 
> http://www.esperanto.mv.ru/KompLeks/UTF8/AL.html#ANTAUxTRAKTADO
> 
> Namely, the piece
> 
> =========
> <p><i>Angle:</i> preprocessing <br>
> <i>Ruse:</i> 
> &#1087;&#1088;&#1077;&#1076;&#1074;&#1072;&#1088;&#1080;&#1090;&#1077;&#1083;&#1100;&#1085;&#1072;&#1103;
>  
> &#1086;&#1073;&#1088;&#1072;&#1073;&#1086;&#1090;&#1082;&#1072;, 
> &#1087;&#1088;&#1077;&#1087;&#1088;&#1086;&#1094;&#1077;&#1089;&#1089;&#1080;&#1088;&#1086;&#1074;&#1072;&#1085;&#1080;&#1077;
> 
> <p>foo ... 
> =========
> 
> is rendered as
[...]
> This is not due to excessively long lines, because an equivalent text
> in UTF-8 produces exactly same output: [...]

It *is* due to "excessively" long lines in a sense, but not in the
input stream.

You are running into one of the fundamental problems of displaying
UTF-8 with a display library that is not UTF-8-aware.  The display
library (ncurses in your case, iirc) assumes that one byte == one
character (position).  So it wraps a line (or possibly truncates it)
after 80 bytes in an 80x25 window.

That means that, for lines full of UTF-8 characters in a range where
each character is encoded as two bytes (which includes Cyrillic),
only about half of the available horizontal display width is usable.
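
To make the arithmetic concrete, here is a small Python sketch (not
anything lynx or ncurses actually runs). It decodes the second Russian
word from the quoted sample and compares the number of screen columns
it needs with the number of bytes a byte-counting library would
charge for it. The variable names are made up for illustration, and
the sketch assumes each Cyrillic letter occupies one column.

```python
import html

# Numeric character references copied from the quoted HTML sample.
word = html.unescape(
    "&#1086;&#1073;&#1088;&#1072;&#1073;&#1086;&#1090;&#1082;&#1072;"
)

# Columns a UTF-8-aware terminal needs (assumption: one per letter).
columns = len(word)

# Positions a library assuming one byte == one character counts.
byte_len = len(word.encode("utf-8"))

# In an 80-column window, a byte-counting library runs out of room
# after this many two-byte characters.
usable_chars = 80 // (byte_len // columns)

print(columns, byte_len, usable_chars)
```

Under those assumptions the word needs 9 columns but is counted as 18
positions, so a line of such text fills up at about 40 characters.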

Lynx 2.8.3 is actually improved in this respect: now lynx takes this
into account and breaks the line in an appropriate place.  That's why
you see the line broken between words, and not broken or truncated
in the middle of the third word, which would be the case in previous
versions.
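
The idea of breaking between words while budgeting by bytes can be
sketched roughly as follows. This is only an illustration of the
principle in Python, not lynx's actual code (which is C and works
differently in detail); the function name wrap_by_bytes and its
parameters are hypothetical.

```python
def wrap_by_bytes(text, width=80):
    """Break text between words so that no line exceeds `width`
    bytes when encoded as UTF-8 (a single overlong word is left
    on a line of its own rather than split)."""
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or len(candidate.encode("utf-8")) <= width:
            current = candidate
        else:
            # Adding this word would exceed the byte budget:
            # close the line at the preceding word boundary.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


# Three 15-letter Cyrillic "words", 30 bytes each in UTF-8.
sample = " ".join(["\u044f" * 15] * 3)
wrapped = wrap_by_bytes(sample, 80)
```

With an 80-byte budget the first two words fit (61 bytes, 31 columns)
and the third moves to the next line, so the break falls between words
even though well under half of the visible width has been used.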

There is one existing workaround, but only if you compile lynx with
slang instead of (n)curses:  compile with SLANG_MBCS_HACK defined.
For example (this is how I pass additional flags to the compile
process):

   ./configure --with-screen=slang [...]
   make SITE_DEFS="-DSLANG_MBCS_HACK"

It works well for me in most $TERMinal types (but not all - although
those aren't UTF-8 capable anyway).

With the advent of a UTF-8-aware ncurses, this problem may be solved
soon for those who have it, once the necessary code changes are made
in lynx.

   Klaus

