Hi Oliver,
At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
> In the meantime, I had a look at that Russian hyphenation file, and to
> my relief, the structure of the groff hyphenation pattern files is
> that of TeX hyphenation pattern files, which I have worked on before.
Yup. They were born that way.
> But... the hyphenation file hyphen.ru in the aforementioned source is
> not usable in the current set-up because the Russian syllable
> fragments are encoded in KOI-8, an 8 bit encoding based on a GOST
> Standard of the USSR.
> So, it does not match the internal code representation of Unicode code
> points.
No, it doesn't. But some of the other hyphenation pattern files don't,
either; if you look you will see that they're encoded variously in ISO
646, ISO 8859-1, ISO 8859-2, and ISO 8859-15.
This is because groff's hyphenation pattern file parser doesn't
understand UTF-8.
That would be a nice thing to have.
hyphen.ru does a very sneaky thing that I did not think was possible
before Nikita Ivanov dropped it on our doorstep and I took a closer look
at the KOI8-R encoding.
You might know that code points in the "C1 Controls" block of Unicode
(U+0080..U+009F) are invalid input characters to groff. groff uses them
for internal, bespoke purposes.[1] This is a barrier to making groff
support UTF-8 input directly, as noted in our documentation.[2][3]
But an interesting property of KOI8-R is that none of the glyphs it
heaps up in the C1 region are alphabetic.
Therefore they don't require hyphenation.
Therefore the Russian hyphenation patterns, using KOI8-R, can masquerade
effectively as an ISO 8859 encoding.
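That property is easy to check for yourself. Here is a minimal Python sketch, assuming Python's standard `koi8_r` codec matches the encoding hyphen.ru uses; the helper name is mine, not anything in groff.

```python
import unicodedata

# Byte values that collide with groff's internal codes (the "C1 Controls"
# positions, 0x80..0x9F) when a file is read as an 8-bit encoding.
C1_RANGE = range(0x80, 0xA0)

def koi8r_c1_glyphs():
    """Return (byte value, character, Unicode name) for KOI8-R bytes 0x80..0x9F."""
    rows = []
    for value in C1_RANGE:
        ch = bytes([value]).decode("koi8_r")
        rows.append((value, ch, unicodedata.name(ch, "<unnamed>")))
    return rows

glyphs = koi8r_c1_glyphs()
# Keep only the entries whose character Unicode classifies as a letter.
letters_in_c1 = [row for row in glyphs if row[1].isalpha()]

for value, ch, name in glyphs:
    print(f"0x{value:02X}  {name}")
print("alphabetic glyphs in 0x80..0x9F:", len(letters_in_c1))
```

The listing shows box-drawing and mathematical symbols only, which is why nothing in that range ever needs to appear in a hyphenation pattern.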
This is the same deal that lets us support ISO 8859-{2,15} in our
hyphenation patterns. GNU troff doesn't actually care what these code
points "are", it only needs to know their values to make hyphenation
decisions. The intelligibility of the hyphenation patterns to a human
reader is determined by the character encoding, but within the range
U+00A0..U+00FF (actually more than that: U+0021..U+007F as well), groff
has no dog in the semantic interpretation fight.
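To illustrate the point about values versus meanings: the sketch below takes one made-up byte string (not a real entry from hyphen.ru) and decodes it under two encodings. The bytes never change; only the human-readable rendering does.

```python
# A made-up three-byte fragment; not taken from any real pattern file.
fragment = bytes([0xD0, 0xD2, 0xC9])

as_koi8r = fragment.decode("koi8_r")      # how a Russian reader sees it
as_latin1 = fragment.decode("iso8859_1")  # how a Latin-1 reader sees it

# Both readings map back to exactly the same byte values, and all of them
# sit in 0xA0..0xFF, clear of the 0x80..0x9F range groff reserves.
same_bytes = as_koi8r.encode("koi8_r") == as_latin1.encode("iso8859_1") == fragment
in_safe_range = all(0xA0 <= b <= 0xFF for b in fragment)

print([f"0x{b:02X}" for b in fragment], same_bytes, in_safe_range)
```

A program that compares byte values only, as described above, would treat both readings identically.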
> Since groff internally seems to work with Unicode code positions, the
> question is: in which format should the hyphenation patterns be
> presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
> That does not seem to work either, according to my last experiment.
For now, neither; the KOI8-R cheat seems to work fine, as far as I can
tell or understand. Admittedly, I'm not a Russian speaker. But I
believe the contributor is.
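If you prefer to edit patterns as UTF-8 text, one possible workflow (a sketch, not something groff provides) is to transcode to KOI8-R before installing the file. The function name and the sample patterns are hypothetical; the check against 0x80..0x9F reflects the input restriction described above.

```python
def utf8_patterns_to_koi8r(utf8_data: bytes) -> bytes:
    """Transcode UTF-8 pattern text to KOI8-R, rejecting bytes in 0x80..0x9F."""
    text = utf8_data.decode("utf-8")
    # Raises UnicodeEncodeError if a character has no KOI8-R equivalent.
    koi8 = text.encode("koi8_r")
    bad = sorted({b for b in koi8 if 0x80 <= b <= 0x9F})
    if bad:
        raise ValueError(
            "bytes in 0x80..0x9F: " + ", ".join(f"0x{b:02X}" for b in bad)
        )
    return koi8

# Made-up TeX-style patterns, for illustration only.
sample = "а1б\n1ва\n".encode("utf-8")
converted = utf8_patterns_to_koi8r(sample)
print(converted)
```

Writing `converted` to a file in binary mode would give a pattern file in the same encoding hyphen.ru uses; whether the patterns themselves are linguistically sound is a separate question.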
Eventually, we will need a way for our hyphenation pattern file reader
function[6] to interpret UTF-8 input. The cleanest thing to do would be
to have it use the same facility as regular GNU troff input stream
reading support for UTF-8. But that has to be written first.
Regards,
Branden
[1] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h
[2] https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0
Open groff.YYYY-MM-DD.pdf, where the date changes from time to time;
see pages 73 and 84 (as of this writing).
[3] It can be done; it's just harder than migrating from ASCII to UTF-8.
My idea is to relocate all these bespoke groff symbols to the
Unicode Private Use Area. But for that we need to change groff's
string class[4] to build upon either (1) wide characters or (2)
multibyte characters. My preference is to go straight to char32_t.
[4] groff, having been first written in about 1989, does not use the
Standard C++ library string class. This has proven unproblematic;
it's implemented well and I'm not aware of any defect _ever_ being
exposed in it. (This illustrates that James Clark was a better C++
programmer than most.) But if I change it, someone's going to ask
me why I don't just migrate to Standard C++ library facilities for
it and I need a good answer. I'm working on that. When defending
my engineering decisions, I prefer to be equipped with stone tablets
strong enough to smash over the head of my interlocutor. I'm not quite
there yet with groff strings: The Next Generation.
While I'm pontificating I'll opine that I'm not a huge fan of C++ as
a language, but I have found with groff that, given discipline, and
by maintaining a clear view of its roots in C (_also_ not my
favorite language--but one alienating, enemy-making rant at a time),
and not picking up every f***ing new feature that gets shoved into
the language as soon as (or before) it's standardized, it _can_ be
managed. But I also think that the C++ templating facility was, in
implementation, one of the worst features ever developed for any
programming language.
I've decided to try to keep groff's C++ codebase ISO C++98
compatible for the foreseeable future, even though there are _some_
aspects of later C++ standards that I like quite a bit. (Simple
things, like proper damn data types and constants for null
pointers.) Clark wrote groff before name spaces, templates, and
exceptions were added to the language, so you don't see them in its
sources--it's pretty much written in "Annotated Reference Manual" C++, but
if you look carefully you _will_ find some use of vec<>, added by
later contributors. And I have seen the pre-template,
preprocessor-based implementation of "ITABLES" and "PTABLES", and
no, I don't think it's prettier than templates. The interesting
thing is, 30+ years after adding these generic programming
facilities, nothing in groff _ever_ specialized them beyond the
base types they were initially used with. I find that suggestive.
If you want to see generics done right, look at Ada.[5] <mic drop>
[5] Yes, the background of C++ templates' authorship is a tragedy.
[6] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp#n3790