Re: [Groff] handling of composing and combined Unicode characters
From: Werner LEMBERG
Subject: Re: [Groff] handling of composing and combined Unicode characters
Date: Tue, 10 Jan 2006 07:58:52 +0100 (CET)
> Find attached two files find.vi.after-preconv and
> find.vi.after-preconv-decomposed, containing two test cases:
> 1) \[u1EBF] = 'e' \u0302 \u0301
> 2) 'x' \u0302 \u0301 (no precomposed character in Unicode)
>
> I call
> $ troff -mandoc -Tutf8 < find.vi.after-preconv[-decomposed]
>
> The expected behaviour of troff should be that it emits a composed glyph,
> created by src/roff/troff/input.cpp:composite_glyph_name(), right?
> In both cases the behaviour is different:
> 1) find.vi.1:6: warning: can't find special character `u0065_0302_0301'
> 2) find.vi.1:6: warning: can't find special character `u0302'
> find.vi.1:6: warning: can't find special character `u0301'
>
> Is my assumption right?
groff sees \[u1EBF] and automatically decomposes it to a generic
*entity name*, namely `u0065_0302_0301'. It doesn't care whether
\[u0302] and \[u0301] actually exist. This process is documented in
groff.info (Using Symbols):
   * A glyph representing more than a single input character will be
     named

          `u' COMPONENT1 `_' COMPONENT2 `_' COMPONENT3 ...

     Example: `u0045_0302_0301'.

     For simplicity, all Unicode characters which are composites must
     be decomposed maximally (this is normalization form D in the
     Unicode standard); for example, `u00CA_0301' is not a valid glyph
     name since U+00CA (LATIN CAPITAL LETTER E WITH CIRCUMFLEX) can be
     further decomposed into U+0045 (LATIN CAPITAL LETTER E) and
     U+0302 (COMBINING CIRCUMFLEX ACCENT). `u0045_0302_0301' is thus
     the glyph name for U+1EBE, LATIN CAPITAL LETTER E WITH CIRCUMFLEX
     AND ACUTE.

   * groff maintains a table to decompose all algorithmically derived
     glyph names which are composites themselves. For example, `u0100'
     (LATIN CAPITAL LETTER A WITH MACRON) will be automatically
     decomposed into `u0041_0304'. Additionally, a glyph name of the
     GGL is preferred to an algorithmically derived glyph name; groff
     also automatically does the mapping. Example: The glyph
     `u0045_0302' will be mapped to `^E'.

   * glyph names of the GGL can't be used in composite glyph names;
     for example, `^E_u0301' is invalid.
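To see which entity name a given input ends up with, the naming rule
quoted above can be sketched outside of groff. The following is only an
illustration, not groff's own code: it assumes that Python's
unicodedata NFD normalization agrees with groff's internal
decomposition table for the characters in question, it does not perform
the GGL mapping (so `u0045_0302' is not turned into `^E'), and the
helper name groff_glyph_name is made up for this example.

```python
import unicodedata


def groff_glyph_name(text: str) -> str:
    """Build a groff-style composite glyph name (hypothetical helper).

    The input is one character, or a base character followed by
    combining marks.  It is decomposed maximally (NFD), and the code
    points are joined as `u' COMPONENT1 `_' COMPONENT2 ...
    Assumption: Python's NFD data matches groff's decomposition table.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Upper-case hex, at least four digits per component.
    return "u" + "_".join(f"{ord(ch):04X}" for ch in decomposed)


if __name__ == "__main__":
    print(groff_glyph_name("\u1EBF"))         # u0065_0302_0301
    print(groff_glyph_name("\u00CA\u0301"))   # u0045_0302_0301
    print(groff_glyph_name("\u0100"))         # u0041_0304
    print(groff_glyph_name("x\u0302\u0301"))  # u0078_0302_0301
```

The first line of output is the name from the warning in test case 1,
so that is the glyph name a definition would have to be provided for.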
You can either register `u0045_0302_0301' with .char directly in your
document (or in a suitable macro file, say, `vi.tmac'), or add it to
the devutf8 font description files. I prefer the latter.
Currently, I don't have time to add complete Vietnamese support
myself, but doing so should be straightforward. Patches welcome :-)
Werner