help-gnu-emacs

Re: Is there a way to "asciify" a string?


From: Richard Wordingham
Subject: Re: Is there a way to "asciify" a string?
Date: Thu, 31 May 2018 23:52:07 +0100

On Thu, 31 May 2018 17:08:47 +0200 (CEST)
"S. Champailler" <schampaillerspam@skynet.be> wrote:

> I second that: removing accents and other "nationalities" is much
> trickier than one might expect (you can look at the Java example; the
> Java Unicode support is quite complete), especially for languages far
> away from English, such as Russian. By "tricky" I mean there are
> *hundreds* of edge cases. Nevertheless, there are ways to sort of do
> what you want by playing with things such as "non-spacing combining
> characters", "normalized strings", etc. If you have the opportunity,
> just try to do it; the great lesson you'll get out of that is that human
> languages are super complex (and thus super interesting).

Make sure you transliterate the string first.  Remember that stripping
out Indic vowels (many of which are gc=Mn) is no more reasonable than
stripping out ASCII vowels.
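
A minimal Python sketch of the pitfall, assuming the common "decompose,
then drop every gc=Mn character" approach to asciifying. The helper name
strip_nonspacing_marks is made up for this illustration; only the
standard unicodedata module is used. It works tolerably on an accented
Latin word, but on a Devanagari word it silently deletes the vowel signs:

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Decompose to NFD, then drop every character with gc=Mn.

    Hypothetical helper, shown only to illustrate the naive approach.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Latin: U+00E9 decomposes to "e" + U+0301 (gc=Mn), so only the accent goes.
latin = "caf\u00e9"

# Devanagari "guru": GA, VOWEL SIGN U, RA, VOWEL SIGN U.
# VOWEL SIGN U (U+0941) is gc=Mn, so both vowels are dropped.
hindi = "\u0917\u0941\u0930\u0941"

print(strip_nonspacing_marks(latin))  # the accent is removed
print(strip_nonspacing_marks(hindi))  # only the two consonants remain
```

Transliterating to Latin script first would keep the vowels as ordinary
letters before any marks are stripped.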

> Today, everyone should use Unicode, it's much simpler. Many file
> systems support Unicode.

But be warned that some very different strings may compare equal.  The
Unicode Collation Algorithm is highly likely *not* to be the default.
Windows XP used to compare strings of Canadian Aboriginal Syllabics of
the same length as equal.  I remember using sort -u to remove duplicates
from a list of words on a Linux distribution, and finding that I only
had one left. I now play safe and do that sort of trick in the C locale.
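
For comparison, a small Python sketch of locale-independent
de-duplication. It relies on the fact that Python compares str values by
code point, which for UTF-8 input gives the same ordering as a byte-wise
sort in the C locale. The function name and the sample words are
invented for illustration:

```python
def unique_by_code_point(words):
    """Remove exact duplicates and sort by code point.

    Equality and ordering here never consult the locale, so two distinct
    strings can never be collapsed into one.
    """
    return sorted(set(words))


# Three entries, two distinct words, all of the same length, written in
# Canadian Aboriginal Syllabics (arbitrary sample strings).
words = ["\u1403\u1405", "\u140a\u1403", "\u1403\u1405"]

for word in unique_by_code_point(words):
    print(word)
```

Only the exact duplicate is removed; the two distinct words both survive.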

Richard.


