Re: [aspell-devel] Thoughts on using aspell for Indian language checking

On 11/12/06, address@hidden <address@hidden> wrote:

On 2:38:54 pm 11/12/06 Kevin Atkinson <address@hidden> wrote:
[...]
> On Sat, 11 Nov 2006, address@hidden wrote:
[...]
> >  1. I would like to volunteer to work on writing a proper C++
> >    interface to aspell.
[...]
>
> I assume you are already aware that Aspell has a C interface.  I do
> not have a C++ one for one very good reason: binary compatibility.  If
> anything I would like to improve the existing C interface to support
> the extended functionality, not create a new one.

OK, makes sense. What I have done at present is to expose internals
such as aspell's edit-distance costs and scores through several levels
of get/set functions in the C++ classes, and finally to include them in
the C interface. The code is a little ugly in places, and is a patch
against 0.60.3.
If you think that this would be useful, I will make a design to export
scores, present it for approval, and then patch 0.60.4, or a later
development version. As you recommend, I will continue to use the C
interface. The reason that I was proposing to work on a C++ interface
was that somewhere in the aspell documentation there was a note that
such an interface would be useful.

> >  2. I have done some more work on making bindings to aspell
> >    available in other programming languages, and, at present,
> >    Python, Perl and C# bindings are available, through SWIG. What I
> >    would like to do is first build a C++ class-based interface, and
> >    use that as a basis for a consistent interface across all
> >    languages.
[...]

> This is better done via the C interface.

That is what I am doing at present, and there is no real problem with
it. The only advantage to a C++ interface would be that it is then
easier to ensure consistency of classes across different object-oriented
programming languages. I guess that it would still be OK to wrap a C++
class around the C interface, and then use that for the SWIG bindings.

> >  3. I see some major stumbling blocks in making aspell work
> >    properly with Indian languages. Perhaps the most significant one
> >    is that in Indian languages it makes sense to deal with
> >    syllables (a clump of consonants, possibly with vowel
> >    modifiers), rather than with individual characters. Thus, for
> >    example, edit distance operations should work on syllables. This
> >    is a little difficult, though not impossible, to do with the
> >    present, non-Unicode, internal functioning of aspell. One way
> >    would be to have a function inside score_list() that reconverts to
> >    Unicode, and works on syllables. However, it seems silly to do
> >    this, rather than having Unicode throughout. I am aware of
> >    Kevin's arguments for retaining the 128-character space used by
                                           ^^^ Make that 256.
                                               Oops, sorry.
> >    aspell, but do not see a clean mechanism for handling complex
> >    scripts within such a framework. Comments on this would be
> >    appreciated.
>
> Please explain the issue to me in detail or give me some good links.

I have been talking to several people about this, and I think that
it would be best for us to prepare a proper system requirements
specification. There has been much talk about an Indian language
spellchecker, but I am not aware of any (certainly, there are no
open-source ones) that performs with anywhere near the accuracy of
aspell in English.
  The problem might be that the examples will be in Hindi. I could
provide English transliterations of the example Hindi words, and
there are free OpenType, Unicode Hindi fonts available that will at
least let you see the words. I will do my best to get this written up
soon.
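
As a rough illustration of the syllable point in item 3, the sketch
below splits a Devanagari word into syllables and compares edit
distances counted over code points and over syllables. Everything in
it is an assumption made for illustration: the segmentation rule is a
simplified one (consonants joined by virama, then an optional vowel
sign and modifiers), the function names are hypothetical, the
distance is plain Levenshtein with configurable costs rather than
aspell's own scoring, and the misspelling is invented.

```python
import re

# Simplified Devanagari character classes (an assumption, not aspell's tables).
CONS = "[\u0915-\u0939]\u093C?"   # consonant, optionally followed by nukta
VIRAMA = "\u094D"                 # joins consonants into a cluster
MATRA = "[\u093E-\u094C]"         # dependent vowel signs
MOD = "[\u0901-\u0903]"           # candrabindu, anusvara, visarga
INDEP = "[\u0905-\u0914]"         # independent vowels

# One syllable: a consonant cluster with optional vowel sign and modifiers,
# or an independent vowel with modifiers, or any other single character.
SYLLABLE = re.compile(
    f"(?:{CONS}{VIRAMA})*{CONS}(?:{MATRA}|{VIRAMA})?{MOD}*"
    f"|{INDEP}{MOD}*"
    "|.",
    re.DOTALL,
)


def syllables(word):
    """Split a word into syllable strings using the simplified rule above."""
    return SYLLABLE.findall(word)


def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Levenshtein distance over any two sequences (strings or lists)."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * indel_cost]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + indel_cost,                       # delete x
                cur[j - 1] + indel_cost,                    # insert y
                prev[j - 1] + (0 if x == y else sub_cost),  # substitute
            ))
        prev = cur
    return prev[-1]


HINDI = "\u0939\u093F\u0928\u094D\u0926\u0940"  # the word "Hindi"
TYPO = "\u0939\u093F\u0924\u0940"               # invented misspelling

by_code_point = edit_distance(HINDI, TYPO)
by_syllable = edit_distance(syllables(HINDI), syllables(TYPO))
```

Here the second syllable is a consonant cluster plus a vowel sign, so
replacing it counts as several code-point edits but as a single
syllable substitution.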

> Offhand, without knowing the full details of the problem, the answer
> would be to store them internally using syllables, rather than full
> characters. If necessary, make use of the Unicode normalization code
> to convert input from the full characters to syllables and back again.

The problem there would be that both the base characters and the
syllables are needed, and the total number of these might be more than
256 in many Indian languages.
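
A back-of-the-envelope count for Devanagari shows how quickly the
256 limit is passed. The sketch assumes the basic consonant range
U+0915 to U+0939 and the vowel signs U+093E to U+094C, and it counts
every possible combination, not the syllables actually attested in a
word list (that number would be smaller and is not measured here).
The function name is hypothetical.

```python
# Basic Devanagari consonants and dependent vowel signs (assumed ranges).
CONSONANTS = [chr(c) for c in range(0x0915, 0x0939 + 1)]
VOWEL_SIGNS = [chr(c) for c in range(0x093E, 0x094C + 1)]


def possible_syllables(max_cluster):
    """Count consonant clusters of up to max_cluster consonants,
    each with either no vowel sign or exactly one."""
    vowel_choices = len(VOWEL_SIGNS) + 1  # +1 for the inherent vowel
    return sum(
        len(CONSONANTS) ** k * vowel_choices
        for k in range(1, max_cluster + 1)
    )


single = possible_syllables(1)    # single consonant plus vowel choice
clusters = possible_syllables(2)  # also allow two-consonant conjuncts
```

Even with single consonants only, the count of combinations is
already above 256, before the base characters themselves are added.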
  Thank you for your comments. They are much appreciated, and a working
Hindi spellchecker would be a great advertisement for open-source in
India.

Regards,
Gora


From:	Ethan Bradford
Subject:	Re: [aspell-devel] Thoughts on using aspell for Indian language checking
Date:	Sun, 12 Nov 2006 20:12:39 -0800