aspell-devel

Re: [aspell-devel] Thoughts on using aspell for Indian language ing


From: Jose Da Silva
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 16:19:14 -0800
User-agent: KMail/1.7.2

For other readers following the Unicode discussion in these threads, the
Devanagari code chart is at: http://www.unicode.org/charts/PDF/U0900.pdf


On November 13, 2006 05:03 am, Kevin Atkinson wrote:
> On Mon, 13 Nov 2006, address@hidden wrote:
> > On 11:42:07 am 11/13/06 Kevin Atkinson <address@hidden> wrote:
> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell.  I suggest you start studying
> readonly_ws.cpp and suggest.cpp.  A while ago I wrote some docs on
> how Aspell works:
> http://lists.gnu.org/archive/html/aspell-devel/2005-09/msg00007.html
> http://lists.gnu.org/archive/html/aspell-devel/2005-10/msg00000.html
> which may be helpful.
>
> I will get back to you later with some ideas on how to approach this
> issue.  If you already thought of some please share them.

Just an idea, but maybe this could be made more modular with plug-ins.
So, when spellchecking English you use English plug-ins, for Hindi you
use Hindi plug-ins, for French the French plug-ins, etc. Other languages
could then use scoring methods that are not modelled on English.

Each language probably has its own preferred methods for suggesting
words, whether based on soundslike, lookslike, swapping, consonants,
syllables, accents, halants, gender, etc.
For example, Aspell may like to strip accents and swap characters for
English but, as Gora indicates, Hindi would probably benefit from a
plug-in that treats a group of characters as one unit and then swaps
that unit with another likely group of characters.
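
To make the plug-in idea a bit more concrete, here is a rough C++ sketch
of what a per-language suggestion interface could look like. All of the
names (SuggestPlugin, AdjacentSwapPlugin, PluginRegistry) are made up
for illustration; nothing like this exists in Aspell today, and the
"dictionary" is just a vector of strings.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface, not part of Aspell: one suggestion strategy
// per language.
struct SuggestPlugin {
    virtual ~SuggestPlugin() = default;
    // Return candidate corrections for `word` taken from `dictionary`.
    virtual std::vector<std::string> suggest(
        const std::string& word,
        const std::vector<std::string>& dictionary) const = 0;
};

// Toy English-like strategy: offer dictionary words that differ from
// the input by one swap of two adjacent characters.
struct AdjacentSwapPlugin : SuggestPlugin {
    std::vector<std::string> suggest(
        const std::string& word,
        const std::vector<std::string>& dictionary) const override {
        std::vector<std::string> out;
        for (std::size_t i = 0; i + 1 < word.size(); ++i) {
            std::string candidate = word;
            std::swap(candidate[i], candidate[i + 1]);
            if (candidate == word) continue;  // swapping equal characters
            bool known = std::find(dictionary.begin(), dictionary.end(),
                                   candidate) != dictionary.end();
            bool seen = std::find(out.begin(), out.end(),
                                  candidate) != out.end();
            if (known && !seen) out.push_back(candidate);
        }
        return out;
    }
};

// Registry that picks a strategy by language code ("en", "hi", ...).
class PluginRegistry {
public:
    void add(const std::string& lang, std::unique_ptr<SuggestPlugin> plugin) {
        plugins_[lang] = std::move(plugin);
    }
    // Returns nullptr when no plug-in is registered for `lang`.
    const SuggestPlugin* find(const std::string& lang) const {
        auto it = plugins_.find(lang);
        return it == plugins_.end() ? nullptr : it->second.get();
    }
private:
    std::map<std::string, std::unique_ptr<SuggestPlugin>> plugins_;
};
```

A Hindi plug-in would implement the same suggest() call but operate on
groups of characters instead of single characters, and the caller would
only need to ask the registry for "hi" instead of "en".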

--------------

Other ideas: from reading the list and viewing other points (for
example, Kevin mentioning Ethan's scoring being considered for maybe
0.61, and some changes I've made being in 0.61 and not in 0.60), I'm
guessing Kevin is trying to close off 0.60.x as a stable version, with
no big changes beyond obvious bug fixes.
Perhaps the big changes, like Ethan's scoring and Gora's fixes, that are
incorporated (and working) in the 0.60.3 version could be diffed, so
that the modifications/fixes can be applied towards the 0.61 version.

-------------------

Comments about the initial questions Gora asked in this thread:

> 1. I would like to volunteer to work on writing a proper C++
> interface to aspell. This would include a public interface that
> exposes only the normal spellchecking facilities in a class, as well
> as a testing interface that provides access to internals like the
> scores, weights, and even costs for computing edit distance. I
> already have something that makes the testing part available, but it
> is rather hacked up. If we can discuss what might be an interface
> that can get accepted into aspell, I would be glad to work on it.
> 2. I have done some more work on making bindings to aspell available
> in other programming languages, and, at present, Python, Perl and C#
> bindings are available, through SWIG. What I would like to do is
> first build a C++ class-based interface, and use that as a basis for
> a consistent interface across all languages. Besides the bindings,
> this would include example programs for using them, as well as GUI
> implementations in at least one language that provide a front-end to
> spellchecking, as well as to the testing framework.

C++ may be easier for some users, but if you want a binary-compatible
library, you should really try to keep the public interface in one
language, like C, so that it remains somewhat binary compatible with
existing programs.
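
One common way to get both is to write the internals in C++ and expose
only a thin C layer with an opaque handle. The sketch below shows the
pattern. The demo_checker_* names and the Checker class are invented
for this example and are not Aspell functions.

```cpp
#include <set>
#include <string>

// Internal C++ implementation. It can change between releases without
// affecting callers, because they never see its layout.
class Checker {
public:
    void add_word(const std::string& w) { words_.insert(w); }
    bool check(const std::string& w) const { return words_.count(w) != 0; }
private:
    std::set<std::string> words_;
};

// The handle type. A public header would only contain the typedef and
// the prototypes below, so C callers treat it as opaque.
struct DemoChecker { Checker impl; };

extern "C" {
    typedef struct DemoChecker DemoChecker;

    // Returns a new checker, or a null pointer on allocation failure.
    DemoChecker* demo_checker_new(void) {
        try { return new DemoChecker(); }
        catch (...) { return nullptr; }  // no exceptions across the C boundary
    }

    // Returns 1 on success, 0 on failure.
    int demo_checker_add(DemoChecker* c, const char* word) {
        if (!c || !word) return 0;
        try { c->impl.add_word(word); return 1; }
        catch (...) { return 0; }
    }

    // Returns 1 if the word is known, 0 otherwise.
    int demo_checker_check(const DemoChecker* c, const char* word) {
        if (!c || !word) return 0;
        try { return c->impl.check(word) ? 1 : 0; }
        catch (...) { return 0; }
    }

    void demo_checker_delete(DemoChecker* c) { delete c; }
}
```

Bindings for other languages (SWIG or otherwise) can then target the C
functions, while a C++ class-based wrapper can be layered on top for
those who prefer it.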

> 3. I see some major stumbling blocks in making aspell work properly
> with Indian languages. Perhaps the most significant one is that in
> Indian languages it makes sense to deal with syllables (a clump of
> consonants, possibly with vowel modifiers), rather than with
> individual characters. Thus, for example, edit distance operations
> should work on syllables. This is a little difficult, though not
> impossible, to do with the present, non-Unicode, internal functioning
> of aspell. One way would be to have a function inside score_list()
> that reconverts to Unicode, and works on syllables. However, it seems
> silly to do this, rather than having Unicode throughout. I am aware
> of Kevin's arguments for retaining the 128-character space used by
> aspell, but do not see a clean mechanism for handling complex scripts
> within such a framework. Comments on this would be appreciated.

As mentioned earlier, it's about 254 characters (+ linefeed + 0x00), but
on the surface it can look like Unicode U+0900...U+09FF.
In this case, it may make more sense to create special routines in
readonly_ws.cpp and suggest.cpp that treat syllable groups as one unit.
If this is done in some sort of plug-in format, it may open the door to
other languages and their special needs.
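
To illustrate the difference between character-level and syllable-level
edit distance, here is a small standalone C++ sketch. It is not Aspell
code: it works on Unicode code points directly, and the syllable rule is
a deliberately rough assumption (a base character, plus any following
Devanagari signs, plus a further consonant whenever the cluster ends in
a halant). Real segmentation would need more care.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

const char32_t HALANT = 0x094D;  // Devanagari virama

// Rough approximation: candrabindu/anusvara/visarga, nukta, and the
// dependent vowel signs up to and including the halant.
bool is_sign(char32_t c) {
    return (c >= 0x0900 && c <= 0x0903) || c == 0x093C ||
           (c >= 0x093E && c <= 0x094D);
}

bool is_consonant(char32_t c) {
    return (c >= 0x0915 && c <= 0x0939) || (c >= 0x0958 && c <= 0x095F);
}

// Split a string of code points into syllable-like clusters.
std::vector<std::u32string> split_syllables(const std::u32string& s) {
    std::vector<std::u32string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::u32string cluster(1, s[i++]);
        for (;;) {
            while (i < s.size() && is_sign(s[i])) cluster.push_back(s[i++]);
            // A halant followed by a consonant continues the conjunct.
            if (cluster.back() == HALANT && i < s.size() && is_consonant(s[i]))
                cluster.push_back(s[i++]);
            else
                break;
        }
        out.push_back(cluster);
    }
    return out;
}

// Plain Levenshtein distance over any indexable sequence, so the same
// routine works on code points or on syllable clusters.
template <class Seq>
std::size_t edit_distance(const Seq& a, const Seq& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For example, U"\u0928\u092E\u0938\u094D\u0924\u0947" (na, ma, s+halant+te)
and U"\u0928\u092E\u0924\u0947" (na, ma, te) are 2 edits apart when
compared code point by code point, but only 1 edit apart when both are
first passed through split_syllables(), since only the last cluster
differs.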

> 4. There are other niceties that would improve spellchecking in
> Indian languages, such as the use of a morphological analyser to
> identify the type of the word, and also its gender if it is a noun.
> This can, however, probably be handled by pre or post filters to
> normal aspell checking.

Please go ahead with the idea. Someone has to be the first to try and
build something, so if you have the energy and determination to do it,
go ahead, and the other languages will follow in time. I would suggest
trying to keep your ideas universal so that they can be modified or
adapted for other languages.

Some of us do run out of steam, so hearing someone volunteer to do
something goes a long way. I suggest it is better to take Aspell as far
as you can, and possibly improve it, than to create yet another fork
(you can see various versions and types of spell checkers out there in
various states of maintenance or disrepair; some have good ideas, while
others ran out of steam while being built).

Cheers!



