groff


From: Oliver Corff
Subject: Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian
Date: Wed, 26 Apr 2023 09:19:41 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0

Hi Branden,

thank you very much for sharing your insight into groff internals.

I tried your demonstration, replacing the text file with my own file (UTF-8-encoded Cyrillic), but I could not reproduce your results.

I copied all Russian-related macro files (ru.tmac, hyphen.ru and koi8-ru.tmac) into my ../current/tmac directory (my production system is still at 1.22.4), but running groff produces unusable output.

The heading "Abstract" gets translated into Russian, but it is not displayed as UTF-8. All UTF-8 text from my own file is fine. If I omit the -k option, the UTF-8-encoded text becomes unusable as well, but this is no surprise.

Am I missing something from post-1.23.0 that enables the described magic? Or is there a flaw in my own approach and process?

Best regards,

Oliver.


On 26/04/2023 06:42, G. Branden Robinson wrote:
Hi Oliver,

At 2023-04-25T20:02:00+0200, Oliver Corff wrote:
Yes, KOI8-R has the Cyrillic uppercase in 0xE0..0xFF, lowercase in
0xC0..0xDF; in the control code area, there are no letters in the
human sense of the word. I had a look at the current groff
documentation referenced by your footnote, and I imagine that
KOI8-R-encoded Cyrillic text will be processed seamlessly (that was
the basic assumption behind my recent and only temporary suggestion to
process Greek in ISO encoding), yet my input is \[u04xx]-style Unicode
Cyrillic.
Right.  I don't think we can support that at present.
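
The byte layout described above can be checked outside groff. The following is only a verification sketch that uses Python's built-in koi8_r codec (an assumption of this example, not something groff itself relies on):

```python
# Decode the two KOI8-R letter ranges with Python's built-in codec.
lower = bytes(range(0xC0, 0xE0)).decode("koi8_r")   # 0xC0..0xDF
upper = bytes(range(0xE0, 0x100)).decode("koi8_r")  # 0xE0..0xFF

print(lower)  # lowercase letters in KOI8-R order (not alphabetical)
print(upper)  # the matching uppercase letters

# Each uppercase letter sits exactly 0x20 above its lowercase form.
assert upper == lower.upper()

# The letter pair yo (U+0451, U+0401) lives outside these ranges.
yo = bytes([0xA3, 0xB3]).decode("koi8_r")
print(yo)
```

The ordering inside each range follows the Latin transliteration rather than the Cyrillic alphabet, which is why the decoded strings do not come out in alphabetical order.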

Somehow Cyrillic input in utf8, made readable by preconv(1), should
match the letter code positions in KOI8-R, otherwise pattern matching
for hyphenation would fail.
For Unicode-encoded Cyrillic input, I think you're going to need to
convert the input to KOI8-R first with iconv.
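
As a rough illustration of that conversion step, here is a minimal Python sketch that does the same job as running iconv from UTF-8 to KOI8-R. The function name utf8_to_koi8r is made up for this example; the strict error handling is a design choice of the sketch, so that characters without a KOI8-R code are reported instead of being dropped silently.

```python
def utf8_to_koi8r(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to KOI8-R.

    Raises UnicodeEncodeError if a character has no KOI8-R code,
    which is the case for many non-Russian Cyrillic letters.
    """
    return data.decode("utf-8").encode("koi8_r")


sample = "Все люди\n".encode("utf-8")
converted = utf8_to_koi8r(sample)

# Cyrillic letters take two bytes in UTF-8 but one byte in KOI8-R.
print(len(sample), len(converted))
print(converted[:3].hex())  # f7d3c5
```

The converted bytes can then be written to a file and passed to groff without the -k option, since no preconv(1) step is wanted for 8-bit input of this kind.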

How is Unicode Cyrillic text in groff internally represented? When
dumping gtroff output to the console, I see u04xx codepoints. In my
naive understanding I assume it would be the same internally.

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
Since groff internally seems to work with Unicode code positions, the
question is: in which format should the hyphenation patterns be
presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
That does not seem to work either, according to my last experiment.
I didn't squarely address this question of yours earlier, which might
have helped.  Sorry about that.

There are a couple of answers to that depending on what stage of
processing we're talking about, but the earlier one is of more interest.

groff internally represents characters as bytes.  8-bit bytes.  That's
all we have.

We support Unicode code points the same way we represent everything else
that isn't ASCII--with "special characters".  \(hy, \[coproduct],
\[u0400] and so on.
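
To make the special-character form concrete, here is a small Python sketch that rewrites non-ASCII characters as \[uXXXX] escapes. It only approximates what preconv(1) emits; the function name to_special_chars is hypothetical, and details of the real tool (such as input encoding detection) are deliberately left out.

```python
def to_special_chars(text: str) -> str:
    """Rewrite every non-ASCII character as a groff \\[uXXXX] escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # ASCII passes through unchanged
        else:
            out.append(f"\\[u{cp:04X}]")  # at least four uppercase hex digits
    return "".join(out)


print(to_special_chars("Аб x"))  # \[u0410]\[u0431] x
```

Once input has this shape, each Cyrillic letter reaches the formatter as a named special character rather than as a single byte.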

I tried the KOI8-R-encoded hyphenation file in my little russ.ms
document, but no hyphenation was introduced. I set the .hy register
etc., but nothing happened: no hyphenation. That's also why I put
these monster words with 30-odd characters into the file and forced
everything to be in two-column mode, in order to make the
line-breaking as challenging as possible.
Hmm.  Did you load the Russian localization file, as suggested by the
documentation?

Here's an exhibit I've prepared.

$ file ATTIC/udhr-ru-koi8r.ms
ATTIC/udhr-ru-koi8r.ms: troff or preprocessor input, ISO-8859 text
$ iconv -f koi8-r -t utf8 ATTIC/udhr-ru-koi8r.ms
.nr LL 28n
.LP
Все люди рождаются свободными и равными в своем достоинстве и правах.
Они наделены разумом и совестью и должны поступать в отношении друг
друга в духе братства.
.LP
Каждый человек должен обладать всеми правами и всеми свободами,
провозглашенными настоящей Декларацией, без какого бы то ни было
различия, как-то в отношении расы, цвета кожи, пола, языка, религии,
политических или иных убеждений, национального или социального
происхождения, имущественного, сословного или иного положения.
.LP
Кроме того, не должно проводиться никакого различия на основе
политического, правового или международного статуса страны или
территории, к которой человек принадлежит, независимо от того, является
ли эта территория независимой, подопечной, несамоуправляющейся или
как-либо иначе ограниченной в своем суверенитете.
.LP
Каждый человек имеет право на жизнь, на свободу и на личную
неприкосновенность.
.LP
Никто не должен содержаться в рабстве или в подневольном состоянии;
рабство и работорговля запрещаются во всех их видах.
.LP
Никто не должен подвергаться пыткам или жестоким, бесчеловечным или
унижающим его достоинство обращению и наказанию.
.LP
Каждый человек, где бы он ни находился, имеет право на признание его
$ ./build/test-groff -ms -mru -Tutf8 ATTIC/udhr-ru-koi8r.ms




Все люди рождаются свободны‐
ми  и равными в своем досто‐
инстве и правах. Они наделе‐
ны  разумом  и  совестью   и
должны поступать в отношении
друг друга в духе братства.

Каждый  человек должен обла‐
дать всеми правами  и  всеми
свободами,  провозглашенными
настоящей  Декларацией,  без
какого  бы то ни было разли‐
чия,  как‐то   в   отношении
расы,   цвета   кожи,  пола,
языка, религии, политических
или иных  убеждений,  нацио‐
нального   или   социального
происхождения, имущественно‐
го,  сословного  или   иного
положения.

Кроме того, не должно прово‐
диться  никакого различия на
основе политического, право‐
вого или международного ста‐
туса страны или  территории,
к  которой человек принадле‐
жит,  независимо  от   того,
является  ли  эта территория
независимой,     подопечной,
несамоуправляющейся      или
как‐либо иначе  ограниченной
в своем суверенитете.

Каждый  человек  имеет право
на жизнь, на  свободу  и  на
личную неприкосновенность.

Никто  не должен содержаться
в рабстве или в подневольном
состоянии; рабство  и  рабо‐
торговля запрещаются во всех
их видах.

Никто не должен подвергаться
пыткам  или жестоким, бесче‐
ловечным или  унижающим  его
достоинство    обращению   и
наказанию.

Каждый человек, где бы он ни
находился,  имеет  право  на
признание     его     право‐
субъектности.






That's what I get, 6 blank lines of vertical margin at the top and
bottom and everything.

There is another strong argument against any KOI8-R hack. It does not
have the full Cyrillic alphabet. Even Russian typesetting is defective
(modern Russian has 33 letters, if you include pre-modern Russian, the
character set grows even more), let alone other languages written in
Cyrillic (like Ukrainian, Mongolian and Kazakh). These languages have
a larger vowel set than Russian and in the case of Mongolian and
Kazakh use vowel symbols which are best matched by umlauts in the
Latin alphabet: compare уг and үг, толь and төлөө. So, a Mongolian
word like төлөвлөгөө or төлөөлөгчдийн would never be writable, let
alone be hyphenatable in KOI8-R. Kazakh and Bashkir alphabets, for
instance, comprise about 42 letters.
I was aware of some of these issues (particularly the imperfect coverage
of Ukrainian in KOI8-R, a question with ramifications beyond typesetting
these days).  A big advantage to Nikita's approach is that it works with
what we have.
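
The coverage gap discussed above is easy to demonstrate. This is a minimal Python sketch, assuming Python's built-in koi8_r codec as the reference mapping; the helper name unencodable is made up for the example.

```python
def unencodable(text: str, encoding: str = "koi8_r") -> set:
    """Return the characters of text that have no code in encoding."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# Mongolian examples from the message above.
for word in ["толь", "төлөө", "үг", "төлөвлөгөө"]:
    print(word, sorted(unencodable(word)))
```

Words that use only letters shared with Russian yield an empty list, while the Mongolian vowel letters are reported as missing.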

So, for me there are sound reasons not to try to make KOI8-R work
*somehow*, as it would not solve the fundamental problems just
mentioned.
We're not having to put hacks into any part of groff to accommodate
Nikita's contribution.  Under those conditions, and as long as we
acknowledge its limitations (only "Great" Russian in KOI8-R encoding is
supported) it seems hard to say no.

With a little help, we can support KOI8-U; the alphabetic characters it
adds remain in the Latin-1 extension code block, replacing box-drawing
symbols that we don't predefine special characters for anyway.  (If you
want those, a groff document in any encoding can access them by loading
the rfc1345.tmac package new to groff 1.23.0.[1])  All we need is for
someone to contribute support just as Nikita has.
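
For anyone considering that contribution, the code positions where KOI8-U departs from KOI8-R can be listed mechanically. The sketch below assumes that Python's built-in koi8_r and koi8_u codecs reflect the two encodings accurately.

```python
# Compare the upper half of both code pages byte by byte.
diff = {}
for b in range(0x80, 0x100):
    r = bytes([b]).decode("koi8_r")
    u = bytes([b]).decode("koi8_u")
    if r != u:
        diff[b] = (r, u)

for b, (r, u) in sorted(diff.items()):
    # r is the KOI8-R character (box drawing), u the KOI8-U letter.
    print(f"0x{b:02X}: {r} -> {u} (U+{ord(u):04X})")
```

Each line of output pairs a box-drawing symbol from KOI8-R with the Ukrainian letter that takes its place in KOI8-U.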

The hyphenation file parser you referred to looks innocent enough to
the untrained eye. Do you think expanding the current ^^xx notation to
^^^^xxxx notation would derail the input processor?
No, because groff's hyphenation codes correspond to character code
points, and those are only one byte wide in groff anyway.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/charinfo.h#n28
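
Since hyphenation codes are one byte wide, a pattern written in UTF-8 has to be mapped onto a single 8-bit encoding before it can be expressed in ^^xx form. Here is a minimal Python sketch of that mapping, assuming KOI8-R as the target; the function name and the sample pattern are made up for illustration and are not taken from hyphen.ru.

```python
def to_caret_notation(pattern: str, encoding: str = "koi8_r") -> str:
    """Write a hyphenation pattern with 8-bit bytes in TeX-style ^^xx form."""
    out = []
    for byte in pattern.encode(encoding):
        if byte < 0x80:
            out.append(chr(byte))         # digits and ASCII stay readable
        else:
            out.append(f"^^{byte:02x}")   # two lowercase hex digits
    return "".join(out)


print(to_caret_notation("1на"))  # 1^^ce^^c1
```

A character with no code in the target encoding makes encode() raise UnicodeEncodeError, which is the same limitation that rules out Mongolian or Kazakh patterns under KOI8-R.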

A hyphenation file would then not be human-readable, but this is a
minor problem; hyphenation patterns look highly unintelligible anyway.
I think it would be a win if we could consume TeX hyphenation files
exactly as they ship them.  groff's mailing list is not, as far as I can
tell, thick with hyphenation specialists.  For that matter, the TeX
community may not be, either.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/contrib/rfc1345

--
Dr. Oliver Corff
Wittelsbacherstr. 5A
10707 Berlin
G E R M A N Y
Tel.: +49-30-85727260
Mail: oliver.corff@email.de

Attachment: russ.ms
Description: Text Data

Attachment: russ.pdf
Description: Adobe PDF document

