Re: [aspell-devel] Thoughts on using aspell for Indian language ing


From: gora
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 12:49:13 +0100

On 11:42:07 am 11/13/06 Kevin Atkinson <address@hidden> wrote:
[...]
> The explanation below did nothing to explain why you want to be able
> to store the "kra" conjunct, especially since the conjunct doesn't
> exist in Unicode.
[...]

Sorry, maybe I am assuming too much familiarity with Indian languages.
A conjunct in Hindi, such as "kra", should be treated as a new entity,
on par with base consonants that exist in Unicode. Let us take the word
"chakra", चक्र, for example. I am not sure if you can see the proper
rendering, but there should be two glyphs. Linguistically, this consists
of the consonant "ca" (U091A), and the conjunct "kra", क्र (U0915 + U094D +
U0930), and the stored sequence of code points (in UTF-8 or any other
encoding) would be U091A U0915 U094D U0930. Now, any
calculations of edit distance, such as swap, etc., should use the
consonant "ca" and the conjunct "kra", not the individual Unicode
characters. If, for example, we operated on the individual characters, a
swap might move the "halant" (U094D) ahead of the "ka" (U0915), making the
character sequence U091A U094D U0915 U0930. As the "halant" is what is
used to construct conjuncts, this makes a new conjunct, "chka", च्क (U091A +
U094D + U0915), followed by the consonant "ra", र (U0930). This is not
desirable, as a confusion of spelling would never arise between "chka"
and "kra".
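
As a rough illustration of the problem, here is a small Python sketch of
a swap applied to individual code points. The helper names
(swap_adjacent, as_codepoints) are made up for this example and have
nothing to do with aspell's actual code.

```python
# Code points of "chakra": ca, ka, halant, ra
CHAKRA = "\u091A\u0915\u094D\u0930"


def swap_adjacent(units, i):
    """Return a copy of `units` with positions i and i+1 exchanged."""
    out = list(units)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def as_codepoints(units):
    """Format a sequence of strings as space-separated UXXXX code points."""
    return " ".join(f"U{ord(ch):04X}" for unit in units for ch in unit)


# Swapping "ka" (index 1) with the halant (index 2) moves the halant
# forward, so it now joins "ca" and "ka" instead of "ka" and "ra".
swapped = swap_adjacent(CHAKRA, 1)
print(as_codepoints(swapped))  # prints: U091A U094D U0915 U0930
```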
  If, instead, one operated on conjuncts (actually, the operations need
to be on syllables), a swap would end up looking like the conjunct "kra"
followed by the consonant "ca", with the storage sequence being U0915
U094D U0930 U091A.
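
A minimal Python sketch of the same swap done on conjuncts follows. The
grouping rule in split_clusters is a deliberate simplification and an
assumption on my part: combining marks (which include the halant) attach
to the current unit, and whatever follows a halant is pulled into the
same unit. Real syllable segmentation needs more rules than this, and
all of the names here are hypothetical.

```python
import unicodedata

HALANT = "\u094D"


def split_clusters(word):
    """Group code points so that halant-joined consonants stay together."""
    clusters = []
    join_next = False  # True when the previous code point was a halant
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = (ch == HALANT)
    return clusters


def as_codepoints(units):
    """Format a sequence of strings as space-separated UXXXX code points."""
    return " ".join(f"U{ord(ch):04X}" for unit in units for ch in unit)


clusters = split_clusters("\u091A\u0915\u094D\u0930")  # "chakra"
# Two units: the consonant "ca" and the conjunct "kra".
clusters[0], clusters[1] = clusters[1], clusters[0]
print(as_codepoints(clusters))  # prints: U0915 U094D U0930 U091A
```

The halant never gets separated from the consonants it joins, so the
swap cannot create a conjunct that was not in the original word.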
  Hope this makes more sense. I will come up with a more detailed
write-up including a description of conjuncts, and why one should
use syllables, rather than characters, as the basic units for Indian
language spellchecking. Some of these issues, maybe most of them, can
be made up for by appropriate soundslike rules. I really should try out
some quantitative tests first.

Regards,
Gora

P.S. I was also toying with the idea of writing an aspell UNO component
     to enable usage from OpenOffice. I see that there has been some
     discussion on this earlier. Do you think that such a component
     would still be useful, or has the integration of Hunspell into
     OpenOffice made something like this unnecessary?




