[Groff] UTF-8 out-of-the box experience
From: Markus Kuhn
Subject: [Groff] UTF-8 out-of-the box experience
Date: Thu, 03 May 2001 09:47:53 +0100
Our department upgraded machines yesterday to the brand new Red Hat 7.1
release. Here are a few impressions I collected while I demonstrated the
UTF-8 capabilities to my colleagues. UTF-8 locales are available now and
% LANG=en_GB.UTF-8 xterm &
is all that is needed to enter the Unicode world.
I had to unset LESSCHARSET in some people's environment. It is obsolete
now, and if it is set, it just prevents less from autodetecting that
UTF-8 should be activated. In the BUGS section of "man man" I found the
tip "If you see blinking \255 or <AD> instead of hyphens, put
`LESSCHARSET=latin1' in your environment." This tip is now obsolete and
harmful, and should definitely be removed.
I ran into a few embarrassing bugs that still haven't been fixed, though
I think they have been mentioned here several times before.
The combination of "man" (version 1.5h) and "groff" (GNU troff version
1.16.1) is seriously broken in a UTF-8 locale. Even for ASCII-only man
pages, groff inserts Latin-1 SHY bytes, which result in ugly
malformed UTF-8 sequences. It is very disappointing that this doesn't
work correctly out-of-the-box, because the underlying groff mechanism
for UTF-8 output is already in place and seems to work correctly:
zcat /usr/share/man/man7/groff_char.7.gz | groff -mandoc -Tutf8 - | less
produces the desired results, whereas
man groff_char
does not.
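To illustrate why a stray Latin-1 SHY byte is a problem, here is a small
sketch in Python (chosen only for illustration; it is not part of man or
groff). It checks that the single byte 0xAD is not valid UTF-8 on its
own, whereas U+00AD SOFT HYPHEN properly encoded in UTF-8 is the two
bytes C2 AD. The sample word is made up.

```python
# Latin-1 encodes SOFT HYPHEN (U+00AD) as the single byte 0xAD.
latin1_shy = "\u00ad".encode("latin-1")
# UTF-8 encodes the same character as the two bytes 0xC2 0xAD.
utf8_shy = "\u00ad".encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# An otherwise ASCII-only line with a Latin-1 SHY at a hyphenation point
# (hypothetical sample text, not taken from a real man page).
broken_line = b"hyphen" + latin1_shy + b"ation"
fixed_line = b"hyphen" + utf8_shy + b"ation"

print(latin1_shy.hex(), utf8_shy.hex())
print(is_valid_utf8(broken_line), is_valid_utf8(fixed_line))
```

A UTF-8 terminal or pager that receives the first form has to treat 0xAD
as a malformed sequence, which matches the symptom described above.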
The required fix here is that groff should get a new output device
-Tplaintext which specifies plaintext encoded according to the current
locale (just query nl_langinfo(CODESET) and see whether it says "UTF-8"
or "ISO-8859-*" or something like that). Then in /etc/man.config, we
could simply replace
NROFF /usr/bin/groff -Tlatin1 -mandoc
with
NROFF /usr/bin/groff -Tplaintext -mandoc
and man would automatically work properly in both ISO-8859 and UTF-8
locales.
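The locale lookup behind the proposed -Tplaintext device could look
roughly like the following Python sketch. It is only an illustration of
the idea, not groff code: the function names groff_device_for and
current_device are hypothetical, and the mapping onto the existing
devices utf8, latin1 and ascii (with ascii as a fallback) is an
assumption. locale.nl_langinfo is available on Unix-like systems only.

```python
import locale


def groff_device_for(codeset: str) -> str:
    """Map a locale codeset name to an existing groff output device.

    The mapping is an illustrative assumption, not groff behaviour.
    """
    name = codeset.upper().replace("_", "-")
    if name in ("UTF-8", "UTF8"):
        return "utf8"
    if name in ("ISO-8859-1", "ISO8859-1"):
        return "latin1"
    # Conservative fallback for every other codeset.
    return "ascii"


def current_device() -> str:
    """Pick a device from the current locale via nl_langinfo(CODESET)."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep the default "C" locale in that case.
        pass
    return groff_device_for(locale.nl_langinfo(locale.CODESET))
```

Calling current_device() under LANG=en_GB.UTF-8 (if that locale is
installed) should give "utf8", so a wrapper could pass the result to
groff as its -T argument.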
"less" (less 358+iso247) is also still broken: in UTF-8 mode it
completely messes up the handling of the backspace boldification used
by nroff. This still distorts the output of any man page. Test case:
perl -e 'use utf8; print "a\ba_\bb\n"' | less
correctly shows a bold "a" and an underlined "b", but
perl -e 'use utf8; print "\x{20ac}\b\x{20ac}_\b\x{2203}\n"' | less
fails to show either a bold euro sign or an underlined there-exists sign.
(Perl 5.6 or newer required here)
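What a pager needs to do here can be sketched as follows, in Python and
purely as an illustration (this is not how less is implemented, and
parse_overstrike is a hypothetical name). The point is that the
"X, backspace, X" and "_, backspace, X" patterns have to be matched on
decoded characters rather than on single bytes, so that a multi-byte
character such as U+20AC counts as one unit. The sketch handles only
these two patterns.

```python
def parse_overstrike(text: str):
    """Turn nroff-style backspace sequences into (char, style) pairs.

    "X\\bX" marks X as bold and "_\\bX" marks X as underlined. The input
    is a decoded string, so each element is a full character.
    """
    out = []
    i = 0
    while i < len(text):
        # Look for the pattern: char, backspace, char.
        if i + 2 < len(text) and text[i + 1] == "\b":
            first, second = text[i], text[i + 2]
            if first == second:
                out.append((second, "bold"))
                i += 3
                continue
            if first == "_":
                out.append((second, "underline"))
                i += 3
                continue
        out.append((text[i], "plain"))
        i += 1
    return out


# The two test cases from above, as the byte streams a pager would read.
ascii_case = b"a\x08a_\x08b\n"
utf8_case = "\u20ac\b\u20ac_\b\u2203\n".encode("utf-8")

print(parse_overstrike(ascii_case.decode("utf-8")))
print(parse_overstrike(utf8_case.decode("utf-8")))
```

Both inputs yield one bold character followed by one underlined
character, because the matching is done after UTF-8 decoding.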
UTF-8 locale support under X11 (XFree86 4.0.3) also still seems *very*
broken. For example, I would have hoped that
perl -e 'use utf8; print "\x{20ac}"' | xmessage -file -
(all under LANG=en_GB.UTF-8) shows me a window with the euro sign, but
what I get instead is "â\202¬". :-(
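That output is what one would expect if the three UTF-8 bytes of the
euro sign were each interpreted as a Latin-1 character. The following
Python sketch reproduces the string under that assumption; it says
nothing about where exactly in X11 or xmessage the misinterpretation
happens.

```python
# U+20AC EURO SIGN is three bytes in UTF-8: E2 82 AC.
euro_utf8 = "\u20ac".encode("utf-8")

# Misread each byte as one Latin-1 character.
misread = euro_utf8.decode("latin-1")

# Show non-printable characters as three-digit octal escapes.
shown = "".join(
    c if c.isprintable() else "\\%03o" % ord(c) for c in misread
)
print(euro_utf8.hex(), shown)
```

Byte E2 is "â" in Latin-1, byte 82 is a C1 control character (octal
202), and byte AC is "¬".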
I also quickly tried vi (VIM 6.0z ALPHA) with LANG=en_GB.UTF-8, but when
I ran "vi UTF-8-demo.txt", I just got garbled text on the screen. "man
vi" did not contain the search string "uni" or "utf". I couldn't figure
out whether the vim 6.0z that comes with RH 7.1 has any UTF-8 support.
It certainly didn't work out-of-the-box.
Summary: Red Hat 7.1 is not even suitable for a 5-minute demonstration
of its UTF-8 locale support without serious embarrassment. xterm is
pretty much the only UTF-8 application that works at the moment.
Required action:
- fix less backspace bug
- fix groff to support locale-dependent selection of output encoding
  (-Tplaintext or so)
- fix man.config to use groff -Tplaintext instead of -Tlatin1
- fix xman to use ISO10646-1 fontset when in UTF-8 locale such that
  groff_char man page is shown with all characters.
- make sure that LESSCHARSET is not set anywhere
- fix vi to activate UTF-8 mode in UTF-8 locale
- test the SUSE 7.2 beta to avoid the same problems there
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>