Re: [aspell-devel] Thoughts on using aspell for Indian language checking


From: Kevin Atkinson
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language checking
Date: Sun, 12 Nov 2006 22:32:36 -0700 (MST)

On Sun, 12 Nov 2006, address@hidden wrote:

On 2:38:54 pm 11/12/06 Kevin Atkinson <address@hidden> wrote:
[...]
On Sat, 11 Nov 2006, address@hidden wrote:
[...]
 1. I would like to volunteer to work on writing a proper C++
   interface to aspell.
[...]

I assume you are already aware that Aspell has a C interface.  I do
not have a C++ one for one very good reason: binary compatibility.  If
anything, I would like to improve the existing C interface to support
the extended functionality, not create a new one.

OK, makes sense. What I have done at present is to expose internals
like aspell edit distance costs, and scores through several levels
of get/set functions in the C++ classes, and finally include them in
the C interface. The code is a little ugly in places, and applies to 0.60.3.
If you think that this would be useful, I will make a design to export
scores, present it for approval, and then patch 0.60.4, or a later
development version. As you recommend, I will continue to use the C
interface.

OK let me know when you have a patch.

The reason that I was proposing to work on a C++ interface
was that somewhere in the aspell documentation there was a note that
such an interface would be useful.

The idea was to build a C++ interface on top of the C one.  Ideally it
should be defined completely in header files to avoid the nasty issues
of binary compatibility.

 2. I have done some more work on making bindings to aspell
   available in other programming languages, and, at present,
   Python, Perl and C# bindings are available, through SWIG. What I
   would like to do is first build a C++ class-based interface, and
   use that as a basis for a consistent interface across all
   languages.
[...]

This is better done via the C interface.

That is what I am doing at present, and there is no real problem with
it. The only advantage to a C++ interface would be that it is then
easier to ensure consistency of classes across different object-oriented
programming languages. I guess that it would still be OK to wrap a C++
class around the C interface, and then use that for the SWIG bindings.

I really don't see the point in adding this extra layer of indirection
just to "ensure consistency of classes across different
object-oriented programming languages".

 3. I see some major stumbling blocks in making aspell work
   properly with Indian languages. Perhaps the most significant one
   is that in Indian languages it makes sense to deal with
   syllables (a clump of consonants, possibly with vowel
   modifiers), rather than with individual characters. Thus, for
   example, edit distance operations should work on syllables. This
   is a little difficult, though not impossible, to do with the
   present, non-Unicode, internal functioning of aspell. One way
   would be to have a function inside score_list() that reconverts to
   Unicode, and works on syllables. However, it seems silly to do
   this, rather than having Unicode throughout. I am aware of
   Kevin's arguments for retaining the 128-character space used by
                                       ^^^ that's 256.
                                       Oops, sorry.
   aspell, but do not see a clean mechanism for handling complex
   scripts within such a framework. Comments on this would be
   appreciated.
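
To make the syllable-versus-character point concrete, here is a minimal Python sketch. It is not Aspell code and does not use score_list() or Aspell's weighted costs: it uses plain Levenshtein distance, and the function names (split_aksharas, edit_distance) are made up for illustration. The segmentation rule is a deliberate simplification that assumes Devanagari input: a character joins the current cluster if it is a Unicode combining mark or if it follows a virama (U+094D).

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA, joins consonants into a conjunct


def split_aksharas(word):
    """Split a Devanagari word into syllable-like clusters (simplified rule)."""
    clusters = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (is_mark or clusters[-1].endswith(VIRAMA)):
            clusters[-1] += ch      # vowel sign, nasal mark, or conjunct member
        else:
            clusters.append(ch)     # a new base character starts a new cluster
    return clusters


def edit_distance(a, b):
    """Unweighted Levenshtein distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


correct = "\u0939\u093f\u0928\u094d\u0926\u0940"   # "hindii", two clusters: hi + ndii
misspelt = "\u0939\u093f\u0926\u0940"              # "hidii", the n + virama is missing

print(edit_distance(correct, misspelt))                                   # 2
print(edit_distance(split_aksharas(correct), split_aksharas(misspelt)))   # 1
```

Dropping the half-consonant costs two code-point edits but only one syllable substitution, which is the kind of difference the proposal is about. A real implementation would need a proper cluster definition for each script.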

Please explain the issue to me in detail or give me some good links.

I have been talking to several people about this, and I think that
it would be best for us to prepare a proper system requirements
specification. There has been much talk about an Indian language
spellchecker, but I am not aware of any (certainly, there are no
open-source ones) that perform with anywhere near the accuracy of
aspell in English.

The problem might be that the examples will be in Hindi. I could
provide English transliterations of the example Hindi words, and
there are free OpenType, Unicode Hindi fonts available that will at
least let you see the words. I will do my best to get this written up
soon.

Yes please let me know.

Offhand, without knowing the full details of the problem, the answer
would be to store them internally as syllables, rather than full
characters. If necessary, make use of the Unicode normalization code
to convert input from the full characters to syllables and back again.
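
As a rough illustration of that idea (not how Aspell's character-set data or normalization tables actually work), the Python sketch below assigns each distinct syllable a one-byte code and converts in both directions. The names build_table, encode and decode are hypothetical, the input is assumed to be already split into syllables, and a real table would also have to reserve some byte values. The limit check marks the point where the syllable inventory no longer fits in 8 bits.

```python
def build_table(segmented_words, limit=256):
    """Assign each distinct syllable the next free code below `limit`."""
    table = {}
    for units in segmented_words:
        for unit in units:
            if unit not in table:
                if len(table) >= limit:
                    raise ValueError("more than %d distinct syllables" % limit)
                table[unit] = len(table)
    return table


def encode(units, table):
    """Convert a list of syllables into the internal one-byte-per-syllable form."""
    return bytes(table[unit] for unit in units)


def decode(data, table):
    """Convert the internal form back into a Unicode string."""
    inverse = {code: unit for unit, code in table.items()}
    return "".join(inverse[code] for code in data)


# Two pre-segmented Hindi words sharing their first syllable (assumed input).
words = [
    ["\u0939\u093f", "\u0928\u094d\u0926\u0940"],   # hi + ndii
    ["\u0939\u093f", "\u0926\u0940"],               # hi + dii
]
table = build_table(words)
internal = encode(words[0], table)
print(len(table), internal, decode(internal, table) == "".join(words[0]))
```

Whether this is workable depends on how many distinct syllables, plus any base characters that must also be representable, a given language really needs.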

The problem there would be that both the base characters, and the
syllables are needed, and the total number of these might be more than
256 in many Indian languages.

Thank you for your comments. They are much appreciated, and a working
Hindi spellchecker would be a great advertisement for open-source in
India.

I have done a systematic survey of all languages, and the conclusion is
that any written language not based on hanzi (Chinese, Japanese,
Korean) will fit in an 8-bit character set.  See
http://aspell.net/man-html/Supported.html.  If I am missing something,
let me know.

BTW: I assume you know that there is a very basic Hindi dictionary
available for Aspell.




