help-gnu-emacs

Re: Composed Sequences


From: Richard Wordingham
Subject: Re: Composed Sequences
Date: Sat, 26 Feb 2022 19:46:16 +0000

On Sat, 26 Feb 2022 17:35:22 +0200
Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Sat, 26 Feb 2022 15:11:44 +0000
> > From: Richard Wordingham <richard.wordingham@ntlworld.com>
> >   
> > > > Different renderers give different clusters, and thus, by
> > > > default, different cursor motion!    
> >   
> > > Not "different renderers", but "different fonts".  
> > 
> > I experimented with the Tai Tham composition-function-table entry
> > 
> > (list (vector "[\u1a20-\u1aad]+" 0 'font-shape-gstring))
> > 
> > In GNU Emacs 23.4.1 (i386-mingw-nt6.2.9200), which uses Uniscribe,
> > the glyph string for the word ᨠᩣ᩠ᨿ <1A20 HIGH KA, 1A63 AA, 1A60
> > SAKOT, 1A3F LOW YA> in Version 0.8 of my font Da Lekh is divided
> > into two clusters, as identified by the 'glyph' values [0 1
> > 6688...] [0 1 6688...] [2 3 6752...] and confirmed by ordinary
> > cursor motion.
> > While this division into <1A20, 1A63> and <1A60, 1A3F> is not the
> > Unicode division into grapheme clusters, it accords with what are
> > natively namable clusters.
> > 
> > For GNU Emacs 27.1 (build1 i686-w64-mingw32) of 2020-08-21, which
> > uses HarfBuzz, the same word is one indivisible cluster (at least
> > with Version 0.13 of the same font).  I think this is a change in
> > the behaviour of HarfBuzz.  
> 
> If you must have the last word in this.  (It's quite clear that in
> gray areas, such as Tai Tham, and where a shaping engine has a bug or
> a misfeature, the results will also depend on the shaping engine.  But
> that is not the main lesson to be taken home from the original issue,
> which btw was with Arabic, not Tai Tham.)

The original query was how the cursor could wind up being displayed
inside a cluster as defined by the composition rules.  The answer is
that it is always allowed at the boundary of graphemes, as defined
below.

It does, unfortunately, seem that the Uniscribe behaviour results from
oppressive coding, rather than any desire to support default grapheme
clusters (Unicode) or the like.

> > > Emacs
> > > obeys the decisions of the font designers.  

> > Unless they recorded the positions of the boundaries between the
> > parts of a ligature!  

> I don't understand what you mean by that.

The GDEF table of an OpenType font records the boundary between the
components of a ligature glyph, via the 'ligature caret list' table
therein. These data, if they exist, are amongst the 'decisions of the
font designers'.
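
As a rough illustration of how such caret data could drive hit
detection inside a ligature, here is a minimal sketch. It assumes
format 1 caret values (plain x-coordinates in font design units, one
fewer than the number of components) and a left-to-right horizontal
ligature. The function names and the numbers are hypothetical, and
nothing here reads a real font.

```python
from bisect import bisect_right

def component_at(x, carets):
    """Return the 0-based ligature component containing offset x.

    carets: sorted caret x-coordinates in design units, as a font
    designer might record them in a ligature caret list.
    """
    return bisect_right(carets, x)

def caret_offsets(carets, advance):
    """All positions where a cursor could sit: both edges plus carets."""
    return [0, *carets, advance]

# Hypothetical three-component ligature with an advance of 1500 units.
carets = [500, 1000]
print(component_at(250, carets))     # click in the first component
print(component_at(1200, carets))    # click in the third component
print(caret_offsets(carets, 1500))
```

Whether a rendering engine consults these data at all is a separate
matter.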

Annoyingly, the font designers may be overridden by the rendering
engine designers.  A font designer can merge 'graphemes', but seemingly
not split 'graphemes'.

Glossary:

cluster  - sequence of coded characters presented to the shaping engine
           to be shaped.

grapheme - A sequence of coded characters which the shaping engine
           treats as a unit for the purpose of 'hit detection'.

(Perhaps this glossary has been published somewhere.)

In principle, a glyph may be shared between two graphemes, but I doubt
that Emacs has a mechanism to support that.
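
To make the glossary concrete, here is a minimal sketch of how
graphemes, in the sense above, might be read off a shaped glyph string.
It assumes each glyph carries the first and last character index it
covers, like the leading two values in the [0 1 6688...] entries quoted
earlier, and that glyphs with the same range belong to one grapheme.
The single-cluster glyph values are my guess at the shape of the
HarfBuzz result, not recorded output, and the function names are
hypothetical.

```python
from itertools import groupby

def grapheme_spans(glyphs):
    """Group consecutive glyphs that share a (from, to) character range."""
    return [span for span, _ in groupby((g[0], g[1]) for g in glyphs)]

def cursor_stops(glyphs):
    """Character offsets at which the cursor may rest."""
    spans = grapheme_spans(glyphs)
    if not spans:
        return []
    return [spans[0][0]] + [to + 1 for _, to in spans]

# (from, to, character code) as reported for the Uniscribe case.
uniscribe = [(0, 1, 6688), (0, 1, 6688), (2, 3, 6752)]
# Assumed shape of a one-grapheme result: every glyph spans the word.
single = [(0, 3, 6688), (0, 3, 6688), (0, 3, 6752)]

print(cursor_stops(uniscribe))
print(cursor_stops(single))
```

This model gives every glyph exactly one range, so it cannot express a
glyph shared between two graphemes.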

> Emacs behaves according to what the shaping engine tells us about the
> number of graphemes in the cluster.  Each grapheme is (by default) a
> single unit for the purposes of cursor motion: Emacs will not let you
> "enter" the grapheme, even if it is made out of several glyphs.  But
> there's nothing in particular that Emacs expects from the number and
> order of the graphemes in a cluster, we just use what the shaping
> engine hands back to us.  And the cursor motion in Emacs is by default
> in logical order, i.e. in the increasing order of buffer positions of
> the original codepoints.

I hope you mean "several characters", not "several glyphs".  The
exception to Emacs keeping point out of a grapheme is related to
disable-point-adjustment and its relatives, and I think also to
undisplayed buffers.

Richard.


