Why does Groff decompose Unicode glyphs in intermediate output?
From: Robin Haberkorn
Subject: Why does Groff decompose Unicode glyphs in intermediate output?
Date: Sun, 10 Nov 2024 06:13:32 +0300 (MSK)
User-agent: Alpine 2.26 (BSF 649 2022-06-02)
Dear groffers,
can anybody explain why in Groff 1.23.0:
# echo -n 'й' | preconv -eutf-8
.lf 1 -
\[u0439]
But:
# echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8
x T utf8
x res 240 24 40
x init
x F -
p1
x font 1 R
f1
s10
V40
H0
md
DFd
Cu0438_0306
H24
n40 0
x trailer
V2640
x stop
In other words, while preconv gives the expected U+0439, Groff decomposes
it into the base character U+0438 followed by the combining breve U+0306
(the glyph name u0438_0306). grotty then converts this back into U+0439:
# echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8 | grotty | hexdump -C
00000000 d0 b9 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
00000010 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
*
00000040 0a 0a 0a 0a |....|
00000044
I am writing my own Groff postprocessor [1] and this gives me headaches.
Is there any algorithm to convert the combining characters back to single
codepoints or am I supposed to use large translation tables for that?
Somehow grotty is obviously doing it, but I haven't yet read the source
code.
There appears to be a Unicode composition algorithm in iconv(). GLib
offers g_unichar_compose() for composing a pair of code points.
It appears I would have to wrap this in my programming language (SciTECO)
as well if I'd like to support all of the glyphs with diacritics in my
postprocessor.
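For illustration, here is a minimal sketch of the recomposition step in
Python. It assumes that Unicode canonical composition (NFC) is what is
wanted; whether grotty works this way has not been checked. The function
name compose_groff_glyph is hypothetical, and only unicodedata.normalize()
from the standard library does the actual work.

```python
import re
import unicodedata

# Glyph names of the form uXXXX or uXXXX_YYYY_... (hex code points),
# as seen after the "C" command in the intermediate output above.
_UNICODE_GLYPH = re.compile(r"u[0-9A-Fa-f]{4,6}(_[0-9A-Fa-f]{4,6})*")


def compose_groff_glyph(name):
    """Return the NFC-composed text for a uXXXX[_YYYY...] glyph name.

    Returns None for names that do not follow that pattern (for example
    "ul"), so the caller can fall back to its own glyph table.
    """
    if not _UNICODE_GLYPH.fullmatch(name):
        return None
    # Strip the leading "u" and turn each hex field into a character.
    decomposed = "".join(chr(int(part, 16)) for part in name[1:].split("_"))
    # Canonical composition merges base + combining marks where possible.
    return unicodedata.normalize("NFC", decomposed)


if __name__ == "__main__":
    composed = compose_groff_glyph("u0438_0306")
    print(" ".join(f"U+{ord(ch):04X}" for ch in composed))
```

No hand-written translation table is needed in this sketch, since the
composition data comes from the Unicode character database that ships with
the language runtime. Sequences that have no precomposed form stay as a
base character plus combining marks.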
IMHO groff shouldn't decompose characters that haven't been decomposed in
its input.
Best regards,
Robin
[1]: https://github.com/rhaberkorn/sciteco/blob/master/doc/grosciteco.tes