groff

Why does Groff decompose Unicode glyphs in intermediate output?


From: Robin Haberkorn
Subject: Why does Groff decompose Unicode glyphs in intermediate output?
Date: Sun, 10 Nov 2024 06:13:32 +0300 (MSK)
User-agent: Alpine 2.26 (BSF 649 2022-06-02)

Dear groffers,

can anybody explain why, in Groff 1.23.0, preconv alone gives:

# echo -n 'й' | preconv -eutf-8
.lf 1 -
\[u0439]

But:

# echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8
x T utf8
x res 240 24 40
x init
x F -
p1
x font 1 R
f1
s10
V40
H0
md
DFd
Cu0438_0306
H24
n40 0
x trailer
V2640
x stop

In other words, while preconv gave the expected U+0439, Groff decomposes it into the base character U+0438 followed by the combining breve U+0306 (the glyph name u0438_0306). grotty then converts this back into U+0439:

# echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8 | grotty | hexdump -C
00000000  d0 b9 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a   |................|
00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a   |................|
*
00000040  0a 0a 0a 0a                                        |....|
00000044

I am writing my own Groff postprocessor [1] and this gives me headaches. Is there an algorithm to convert the decomposed sequences back to single code points, or am I supposed to use large translation tables for that? grotty is obviously doing it somehow, but I haven't read its source code yet. There appears to be a Unicode composition algorithm in iconv(), which glib wraps as g_unichar_compose(). It seems I would have to wrap this in my programming language (SciTECO) as well if I want my postprocessor to support all glyphs with diacritics.
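
To illustrate the kind of recomposition I mean, here is a minimal sketch in Python rather than SciTECO. It relies on canonical composition (NFC) from the standard unicodedata module instead of translation tables. The function name compose_glyph_name is hypothetical, and the pattern assumes that Groff's Unicode glyph names consist of "u" followed by groups of 4 to 6 uppercase hex digits joined by underscores. I don't know whether grotty does it this way.

```python
import re
import unicodedata

# Assumption: Unicode glyph names in the intermediate output look like
# uXXXX or uXXXX_YYYY..., with 4 to 6 uppercase hex digits per group.
# Requiring at least four digits keeps ordinary glyph names such as
# "ua" or "ul" from being misread as code points.
_UNICODE_GLYPH = re.compile(r"u[0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*")

def compose_glyph_name(name):
    """Return the NFC-composed text for a glyph name like 'u0438_0306'.

    Returns None if the name is not a Unicode glyph name.
    (Hypothetical helper, not part of Groff.)
    """
    if not _UNICODE_GLYPH.fullmatch(name):
        return None
    try:
        decomposed = "".join(chr(int(part, 16)) for part in name[1:].split("_"))
    except ValueError:
        return None  # a group was beyond U+10FFFF
    # Canonical composition merges a base character with its combining marks
    # wherever Unicode defines a precomposed character.
    return unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    composed = compose_glyph_name("u0438_0306")
    print(["U+%04X" % ord(ch) for ch in composed])
```

Sequences without a precomposed form simply stay as base character plus combining marks, so a postprocessor would still have to cope with multi-code-point results.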

IMHO groff shouldn't decompose characters that haven't been decomposed in its input.

Best regards,
Robin

[1]: https://github.com/rhaberkorn/sciteco/blob/master/doc/grosciteco.tes

