emacs-devel
Re: strip accents and sorting [was: BibTeX issues]


From: Eli Zaretskii
Subject: Re: strip accents and sorting [was: BibTeX issues]
Date: Thu, 29 Aug 2019 10:10:37 +0300

> Date: Wed, 28 Aug 2019 22:26:38 -0500
> From: "Roland Winkler" <address@hidden>
> Cc: address@hidden
> 
> On Wed Aug 28 2019 Eli Zaretskii wrote:
> > > From: Roland Winkler <address@hidden>
> > > If there was a generic function strip-accents, then BibTeX mode could
> > > certainly use it within its bibtex-generate-autokey machinery.
> > 
> > I don't think we have such a function, but it shouldn't be hard to
> > write one, using the facilities in ucs-normalize.el.
> 
> Interesting! What are the intended use cases for ucs-normalize.el
> and the algorithms that it implements?

To implement the functionalities described in UAX#15 Unicode
Normalization Forms (http://www.unicode.org/reports/tr15/).  We
already use some of that in implementing the utf8-hfs file-name
encoding (used by macOS).

> I had never much thought about this.  But there is obviously a
> problem when one tries to sort a database where the keys may contain
> more fancy utf characters. (This problem must be well-known in the
> utf world).  Naively one might hope that the following lines are
> properly sorted according to string-lessp

As Martin points out, you should use string-collate-lessp instead for
these use cases.

> Of course, this is due to the fact that a German umlaut can be
> represented with its own character or with a combining diaeresis.
> These two ways of presenting an umlaut look the same, but they are
> not the same for string-lessp.

The Unicode Standard mandates that they be handled identically,
including in searching and sorting.  We don't yet implement that 100%,
but see char-fold.el for a partial (and not very efficient)
implementation during search.
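
To make the two representations concrete, here is a minimal sketch in
Python rather than Emacs Lisp.  Python's standard unicodedata module is
used as a stand-in for ucs-normalize.el, and the variable names are only
illustrative.  It shows that the precomposed and the combining forms
differ under a plain code-point comparison, but become identical once
both are normalized to the same form:

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
combining = "a\u0308"    # "a" followed by COMBINING DIAERESIS

# A plain code-point comparison treats them as different strings.
plain_equal = precomposed == combining

# After normalization to a common form (NFC or NFD) they are identical.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", combining))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", combining))
```

Here plain_equal is false while nfc_equal and nfd_equal are both true,
which is the difference between a raw comparison and one that respects
canonical equivalence.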

> Now, one solution would be to simply strip off the combining
> characters by decomposing the characters.  Or is there a possibility
> to teach a sorting algorithm that the first letter of ä-combine is
> "the same" as the first letter of ä-umlaut and all this should
> appear near a-plain instead of past o-plain?

Both should be possible.  To entirely strip the combining accents, you
can decompose the string with ucs-normalize (e.g. to NFD), and then
filter out all characters whose canonical combining class is non-zero.
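
The decompose-then-filter recipe above can be sketched as follows, again
in Python with unicodedata standing in for ucs-normalize.el.  The helper
name strip_accents and the sample words are hypothetical, chosen only
for illustration; this is one possible reading of the recipe, not an
existing Emacs function.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # Canonical decomposition splits e.g. U+00E4 into "a" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is zero.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

# Sample keys: the same word spelled with a precomposed umlaut and
# with a combining diaeresis, plus two unaccented words.
names = ["ober", "\u00e4rger", "a\u0308rger", "apfel"]

# Raw code-point order puts the precomposed umlaut past "o".
by_code_point = sorted(names)

# Using the stripped form as the sort key places both spellings near "a".
by_stripped = sorted(names, key=strip_accents)
```

With the stripped key, both spellings of the umlaut word sort between
"apfel" and "ober".  Note that this only removes combining marks:
letters with no canonical decomposition (such as "ø") pass through
unchanged, and the result is not locale-aware collation of the kind
string-collate-lessp provides.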


