Re: [Groff] .hcode request with german umlauts inside utf8 input file
From: Ralph Corderoy
Subject: Re: [Groff] .hcode request with german umlauts inside utf8 input file
Date: Mon, 28 Jul 2014 14:26:24 +0100
Hi Carsten,
> Now the error message is "hyphenation code must be ordinary
> character". So I understand that the only correct file encoding for
> .hcode with umlauts is latin1 (ISO 8859-1)? Or is there any chance to
> use 7-bit input like \[uXXXX]?
>
> $ printf ".hcode ä ä"|preconv -e utf-8|troff
>
> Prints error "hyphenation code must be ordinary character"
No, it looks like you're right. `info groff' says
 -- Request: .hcode c1 code1 [c2 code2 ...]
     Set the hyphenation code of character C1 to CODE1, that of C2 to
     CODE2, etc. A hyphenation code must be a single input character
     (not a special character) other than a digit or a space.

     To make hyphenation work, hyphenation codes must be set up. At
     start-up, groff only assigns hyphenation codes to the letters
     `a'-`z' (mapped to themselves) and to the letters `A'-`Z' (mapped to
     `a'-`z'); all other hyphenation codes are set to zero. Normally,
     hyphenation patterns contain only lowercase letters which should be
     applied regardless of case. In other words, the words `FOO' and
     `Foo' should be hyphenated exactly the same way as the word `foo' is
     hyphenated, and this is what `hcode' is good for. Words which
     contain other letters won't be hyphenated properly if the
     corresponding hyphenation patterns actually do contain them. For
     example, the following `hcode' requests are necessary to assign
     hyphenation codes to the letters `ÄäÖöÜüß' (this is needed for
     German):

          .hcode ä ä Ä ä
          .hcode ö ö Ö ö
          .hcode ü ü Ü ü
          .hcode ß ß

     Without those assignments, groff treats German words like
     `Kindergärten' (the plural form of `kindergarten') as two substrings
     `kinderg' and `rten' because the hyphenation code of the umlaut a is
     zero by default. There is a German hyphenation pattern which covers
     `kinder', so groff finds the hyphenation `kin-der'. The other two
     hyphenation points (`kin-der-gär-ten') are missed.

     This request is ignored if it has no parameter.
So it isn't happy with the \[] that preconv is producing.
$ echo .hcode ä ä | preconv -e utf-8
.lf 1 -
.hcode \[u00E4] \[u00E4]
$
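To see which requests in a larger file would trip over this, one could
look for .hcode lines that still carry \[uXXXX] escapes after preconv.
A minimal sketch, assuming the escapes use four uppercase hex digits as
in the output above and that the request starts in column one;
find_escaped_hcode is a made-up name, and the sample input is fed with
printf so the snippet does not depend on preconv being installed.

```shell
# Hypothetical helper: print, with line numbers, every .hcode request
# that still contains a \[uXXXX] special-character escape.
# Assumes four uppercase hex digits and the request in column one.
find_escaped_hcode() {
    grep -n '^\.hcode.*\\\[u[0-9A-F]\{4\}]'
}

# Sample input copied from the preconv output shown above.
printf '%s\n' '.lf 1 -' '.hcode \[u00E4] \[u00E4]' | find_escaped_hcode
# prints: 2:.hcode \[u00E4] \[u00E4]
```

In real use the input would be the output of `preconv -e utf-8 file`
rather than printf.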
Werner, is it a preconv bug that it doesn't produce ISO-8859-1 (latin1)
output where possible, e.g. ä rather than \[u00E4], given that's groff's
default input encoding? As it is, preconv's output can't be used with .hcode.
One could post-process preconv's output, provided \[u00..] never occurs
other than to mean a byte of that value.
$ echo .hcode ä ä |
> preconv -e utf-8 |
> perl -pe 's/\\\[u00([\dABCDEF]{2})]/chr hex $1/ge' |
> recode iso-8859-1..dump
UCS2  Mne  Description
002E  .    full stop
006C  l    latin small letter l
0066  f    latin small letter f
0020  SP   space
0031  1    digit one
0020  SP   space
002D  -    hyphen-minus
000A  LF   line feed (lf)
002E  .    full stop
0068  h    latin small letter h
0063  c    latin small letter c
006F  o    latin small letter o
0064  d    latin small letter d
0065  e    latin small letter e
0020  SP   space
00E4  a:   latin small letter a with diaeresis
0020  SP   space
00E4  a:   latin small letter a with diaeresis
000A  LF   line feed (lf)
$
Cheers, Ralph.