
Re: [aspell-devel] Thoughts on using aspell for Indian language ing


From: Kevin Atkinson
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 06:03:08 -0700 (MST)

On Mon, 13 Nov 2006, address@hidden wrote:

On 11:42:07 am 11/13/06 Kevin Atkinson <address@hidden> wrote:
[...]
The explanation below did nothing to explain why you want to be able
to store the "kra" conjunct, especially since the conjunct doesn't
exist in Unicode.
[...]

Sorry, maybe I am assuming too much of a familiarity with Indian languages.
A conjunct in Hindi, such as "kra", should be treated as a new entity,
on par with base consonants that exist in Unicode. Let us take the word
"chakra", चक्र, for example. I am not sure if you can see the proper
rendering, but there should be two glyphs.

I can, with a little trouble (save it to a text file, then open it in gedit on a machine with the right font installed).

Linguistically, this consists
of the consonant "ca" (U091A), and the conjunct "kra", क्र (U0915 + U094D +
U0930), and the UTF-8 storage would be U091A U0915 U094D U0930.

So how many "letters"?  Is that 3 or 4?  Is U094D considered a "letter"?

Now, any
calculations of edit distance, such as swap, etc., should use the
consonant "ca" and the conjunct "kra", not the individual Unicode
characters. If for example, we operated on the individual characters, a
swap might move the "halant" (U094D) ahead of the "ka" (U0915), making the
character sequence U091A U094D U0915 U0930. As the "halant" is what is
used to construct conjuncts, this makes a new conjunct, "chka", च्क (U091A
+ U094D + U0915), followed by the consonant "ra", र (U0930). This is not
desirable, as a confusion of spelling would never arise between "chka"
and "kra".
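
To make the example above concrete, here is a minimal Python sketch. The grouping rule in split_syllables is a simplifying assumption for illustration (combining marks attach to the current unit, and a character after a halant attaches too); it is not a complete Indic syllable segmenter and is not something Aspell provides. Both function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (the "halant")

def split_syllables(word):
    """Group code points into rough syllable units.

    Simplified rule (an assumption): a combining mark joins the current
    unit, and any character that follows a halant joins it as well, so
    consonant + halant + consonant stays together as one conjunct.
    """
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if units and (is_mark or units[-1].endswith(VIRAMA)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def swap_adjacent(seq, i):
    """Return a copy of seq as a list with items i and i+1 exchanged."""
    items = list(seq)
    items[i], items[i + 1] = items[i + 1], items[i]
    return items

chakra = "\u091A\u0915\u094D\u0930"  # U091A U0915 U094D U0930

# Swapping at the code point level moves the halant ahead of "ka".
by_code_point = "".join(swap_adjacent(chakra, 1))

# Swapping at the syllable level keeps the conjunct "kra" intact.
by_syllable = "".join(swap_adjacent(split_syllables(chakra), 0))

print(split_syllables(chakra))         # "ca" and the conjunct "kra"
print(split_syllables(by_code_point))  # the conjunct "chka", then "ra"
print(split_syllables(by_syllable))    # the conjunct "kra", then "ca"
```

The code point swap yields U091A U094D U0915 U0930, which regroups as "chka" + "ra", while the syllable swap only reorders whole units.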

So it is never the case you might want to substitute a letter in the conjunct with another letter? I assume you would. I would also assume that you would want to consider two conjuncts which are the same except for one letter as closer than two completely different conjuncts?

Also how likely is it that the user will swap two glyphs?

Also, if you ever want to implement any sort of true soundslike, I would think you would want to work with letters, not syllables.

 Hope this makes more sense. I will come up with a more detailed
write-up including a description of conjuncts, and why one should
use syllables, rather than characters, as the basic units for Indian
language spellchecking. Some of these issues, maybe most of them, can
be made up for by appropriate soundslike rules. I really should try out
some quantitative tests first.

Possible, but you really need a "looks like" rather than a "soundslike".
I agree that if you want to uniquely represent each syllable you may run out of symbols to use.

However, it may be better to just use a syllable-aware edit distance.
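
As a rough illustration of what a syllable-aware edit distance could look like, here is a minimal Python sketch. It is not how Aspell computes distances; the segmentation rule, the function names, and the cost scheme (substituting one syllable for another costs the fraction of its code points that differ) are all assumptions chosen for the example.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (the "halant")

def split_syllables(word):
    """Rough syllable grouping (simplified assumption, not a full segmenter)."""
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if units and (is_mark or units[-1].endswith(VIRAMA)):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def levenshtein(a, b, sub_cost=lambda x, y: 0.0 if x == y else 1.0):
    """Plain Levenshtein distance over two sequences with a pluggable
    substitution cost; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,               # delete x
                           cur[j - 1] + 1.0,            # insert y
                           prev[j - 1] + sub_cost(x, y)))  # substitute
        prev = cur
    return prev[-1]

def syllable_sub_cost(x, y):
    """Cost in [0, 1]: similar syllables are cheaper to substitute."""
    if x == y:
        return 0.0
    return levenshtein(x, y) / max(len(x), len(y))

def syllable_distance(w1, w2):
    """Edit distance whose units are syllables rather than code points."""
    return levenshtein(split_syllables(w1), split_syllables(w2),
                       syllable_sub_cost)

chakra = "\u091A\u0915\u094D\u0930"   # "ca" + conjunct "kra"
near = "\u091A\u0924\u094D\u0930"     # "ca" + conjunct "tra" (one letter differs)
far = "\u091A\u092E"                  # "ca" + plain "ma"

print(syllable_distance(chakra, near))  # smaller: the conjuncts share two of three code points
print(syllable_distance(chakra, far))   # larger: the second syllable is entirely different
```

With this scheme, two conjuncts that differ in one letter count as closer than two unrelated syllables, while insertions, deletions, and any swaps added later would still operate on whole syllables.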

...

I now understand the issue. However, I think that the fact that Aspell is 8-bit internally is a very small factor. Converting Aspell to be 16-bit internally will not magically fix this issue. I don't even think it will make it significantly easier to solve.

I do believe that to truly handle this situation well, some modifications will need to be made to Aspell. I suggest you start studying readonly_ws.cpp and suggest.cpp. A while ago I wrote some docs on how Aspell works:
  http://lists.gnu.org/archive/html/aspell-devel/2005-09/msg00007.html
  http://lists.gnu.org/archive/html/aspell-devel/2005-10/msg00000.html
which may be helpful.

I will get back to you later with some ideas on how to approach this issue. If you have already thought of some, please share them.