Re: [aspell-devel] A few dictionary-related issues


From: Lars Aronsson
Subject: Re: [aspell-devel] A few dictionary-related issues
Date: Tue, 18 Mar 2003 00:46:25 +0100 (CET)

Pauliuc George wrote:
> a few huge documents.  And I was quite impressed by the
> results of aspell.  If aspell knew the misspelled word then
> that word is in most cases the first option.  Never seen
> such results with other spell checkers.  Probably is because

I agree that aspell gives the best suggestions.  But ispell/MySpell is
better in some other respects.  I wish Aspell's "suggestion engine"
could be integrated into ispell/MySpell as a component.

> Anyway, here's one issue: in Romanian we use the dash (-)
> about the same way English uses the apostrophe (').

Swedish has a similar issue with colon (:) and dash (-), and in some
places with digits.  For example, English "3rd" = Swedish "3:e", while
"3:c" would be bad spelling.  Also, 743:e is good but 7A3:e is bad, so
it would be nice to have regular expressions as part of the
dictionary, so that every word matching "[0-9]*3:e" is a good word.
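
As a minimal sketch of what such a pattern check would do, here is the
same expression anchored at both ends and run through grep.  The
helper name is_ordinal is hypothetical and not part of aspell or
ispell; it only illustrates the regular expression itself:

```shell
# Hypothetical helper (not an aspell or ispell feature):
# succeeds when the whole word is optional digits, then "3:e".
is_ordinal() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]*3:e$'
}

# Try the four example words from above.
for w in 3:e 743:e 7A3:e 3:c; do
  if is_ordinal "$w"; then
    echo "$w good"
  else
    echo "$w bad"
  fi
done
```

The anchors ^ and $ matter here: without them "7A3:e" would be
accepted, because its tail "3:e" matches on its own.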

> The final issue (even more twisted than the ones above ;-):
> because of badly implemented Romanian char support many
> documents are made without the diacritics.  So, instead of î
> we have i and so on.  This is a very particular case for a
> spell checker (I don't know any other language with such an
> issue) - to add the diacritics.

For Swedish, leaving out the diacritics was common in the 1980s,
but has gone away.  People don't accept it, and always make fun of
those who leave out diacritics, so it is OK for a spelling program to
report these cases as errors.

For German, it is different.  There is a long tradition of rewriting
German ä as ae, ö as oe, and ü as ue.  People still use this on many
German mailing lists, even though their keyboards and mailing software
should support the diacritics.  The rewriting is not reversible:
turning "Poesie" into "Pösie" would be wrong.  A "generous" dictionary would
have to contain Götter, Goetter, and Poesie, but not Pösie.  This
could be generated automatically by software from a "strict"
dictionary that only contains "Götter" and "Poesie", like this:

  ( cat strict ;
    sed 's/ö/oe/g;s/ä/ae/g;s/ü/ue/g' strict ) | sort -u > generous
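
To try this on the two example words, here is a small sketch.  The
scratch directory from mktemp and the LC_ALL=C setting are my own
assumptions, added so that sort compares bytes and never treats "ö"
and "oe" as equal under some locale's collation rules:

```shell
# Scratch directory so no existing files are overwritten.
dir=$(mktemp -d)

# A "strict" dictionary holding only the correctly spelled forms.
printf 'Götter\nPoesie\n' > "$dir/strict"

# Same pipeline as above: the original words plus the rewritten ones.
( cat "$dir/strict" ;
  sed 's/ö/oe/g;s/ä/ae/g;s/ü/ue/g' "$dir/strict" ) |
  LC_ALL=C sort -u > "$dir/generous"

cat "$dir/generous"
```

The result holds Goetter, Götter and Poesie, but no Pösie, because
the substitution only ever goes from the diacritic to the two-letter
form.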


-- 
  Lars Aronsson (address@hidden)
  Aronsson Datateknik - http://aronsson.se/