Re: [aspell-devel] Thoughts on using aspell for Indian language ing

On 11/13/06, address@hidden <address@hidden> wrote:

On 2:03:08 pm 11/13/06 Kevin Atkinson <address@hidden> wrote:
> On Mon, 13 Nov 2006, address@hidden wrote:
[...]
> >  Linguistically, this consists
> >  of the consonant "ca" (U091A), and the conjunct "kra", क्र
> >  (U0915 + U094D +
> >  U0930), and the UTF-8 storage would be U091A U0915 U094D U0930.
>
> So how many "letters"?  Is that 3 or 4?  Is U094D considered a
> "letter"?

That is 4 letters, including the initial consonant that is separate.
The conjunct itself is three.
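
For concreteness, here is a small Python sketch (Python is only my choice
for illustration; it is not part of Aspell) that lists the stored code
points of the word discussed above. The variable names are made up for
the example.

```python
import unicodedata

# The word from the example: "ca", then the conjunct "kra" (ka + halant + ra).
word = "\u091A\u0915\u094D\u0930"

# One line per stored code point, with its official Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

code_point_count = len(word)      # what a character-based edit distance sees
conjunct = word[1:]               # ka + halant + ra
```

Unicode names the halant "DEVANAGARI SIGN VIRAMA", so it is a sign rather
than a letter in Unicode's own terms, but it still occupies one position in
the stored sequence.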
>
> >  Now, any
> >  calculations of edit distance, such as swap, etc., should use the
> >  consonant "ca" and the conjunct "kra", not the individual Unicode
> >  characters. If for example, we operated on the individual
> >  characters, a swap might move the "halant" (U094D) ahead of the
> >  "ka" (U0915), making the character sequence U091A U094D U0915
> >  U0930. As the "halant" is what is used to construct conjuncts,
> >  this makes a new conjunct, "chka", च्क (U091A
> >  + U094D + U0915), followed by the consonant "ra", र (U0930).
> >  This is not desirable, as a confusion of spelling would never
> >  arise between "chka" and "kra".
>
> So it is never the case you might want to substitute a letter in the
> conjunct with another letter?  I assume you would.  I would also
> assume that you would want to consider two conjuncts which are the
> same except for one letter as closer than two completely different
> conjuncts?

Yes, it is desirable to substitute a letter in the conjunct with
another letter, but the above example, where moving the halant changes
the structure of the word, is unlikely to be a common mistake. I have
to think this through further, but maybe an edit distance mechanism
that keeps the position of the halant immutable might be the way to go.
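
A rough Python sketch of what I mean, under heavy assumptions: the
clusters() grouping below only knows about the halant (it ignores vowel
signs, nukta, and so on), and adjacent_swaps() is a hypothetical helper,
not anything that exists in Aspell.

```python
HALANT = "\u094D"  # Devanagari virama

def clusters(word):
    """Group code points into units: a halant attaches to the current
    unit and also glues the next code point onto it."""
    units = []
    join_next = False
    for ch in word:
        if units and (ch == HALANT or join_next):
            units[-1] += ch
        else:
            units.append(ch)
        join_next = (ch == HALANT)
    return units

def adjacent_swaps(word, keep_halant_fixed=True):
    """Return all words obtained by swapping two neighbouring code points.
    With keep_halant_fixed, any swap that would move a halant is skipped."""
    results = []
    for i in range(len(word) - 1):
        a, b = word[i], word[i + 1]
        if a == b:
            continue
        if keep_halant_fixed and HALANT in (a, b):
            continue
        results.append(word[:i] + b + a + word[i + 2:])
    return results

word = "\u091A\u0915\u094D\u0930"           # ca + ka + halant + ra
moved = "\u091A\u094D\u0915\u0930"          # halant moved ahead of ka

original_units = clusters(word)             # "ca" and the conjunct "kra"
moved_units = clusters(moved)               # conjunct "chka", then "ra"
free_swaps = adjacent_swaps(word, keep_halant_fixed=False)
fixed_swaps = adjacent_swaps(word)
```

With the halant free to move, the undesirable "chka" + "ra" sequence shows
up among the candidates; with it held in place, only the swap of "ca" and
"ka" remains, which still changes a letter of the conjunct.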

> Also how likely is it that the user will swap two glyphs?

Not very likely as a typing error. However, it is quite likely that one
syllable might be substituted mentally for another while thinking about
what to write.

> Also if you ever want to implement any sort of true soundslike I
> would think you would want to work with letters not syllables.

I will need more advice from you on this, but I would have thought
that syllables are better to work with, especially as most Indian
languages are spelt phonetically.

> >   Hope this makes more sense. I will come up with a more detailed
> >  write-up including a description of conjuncts, and why one should
> >  use syllables, rather than characters, as the basic units for
> >  Indian language spellchecking. Some of these issues, maybe most of
> >  them, can be made up for by appropriate soundslike rules. I really
> >  should try out some quantitative tests first.
>
> Possible, but you really need a "looks like" rather than a
> "soundslike". I agree if you want to uniquely represent each syllable
> you may run out of symbols to use.
>
> However, it may be better to just use a syllable aware edit distance.

That is a very good suggestion, and I have to try it out.
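
As a first experiment, something like the following Python sketch could
serve. It is only one possible reading of "syllable aware": the units are
the halant-joined clusters (vowel signs and other marks are ignored), and
replacing one cluster by another costs the fraction of its code points
that differ, so that two conjuncts differing in one letter count as closer
than two unrelated ones. All function names here are hypothetical; none of
this is Aspell code.

```python
HALANT = "\u094D"  # Devanagari virama

def clusters(word):
    """Split a word into units, joining code points linked by a halant."""
    units = []
    join_next = False
    for ch in word:
        if units and (ch == HALANT or join_next):
            units[-1] += ch
        else:
            units.append(ch)
        join_next = (ch == HALANT)
    return units

def levenshtein(a, b, sub_cost=lambda x, y: 0.0 if x == y else 1.0):
    """Plain edit distance over two sequences; insert/delete cost 1,
    substitution cost supplied by sub_cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,               # delete x
                           cur[j - 1] + 1.0,            # insert y
                           prev[j - 1] + sub_cost(x, y)))
        prev = cur
    return prev[-1]

def cluster_sub_cost(x, y):
    """Cost in [0, 1]: code-point edit distance between two clusters,
    scaled by the length of the longer cluster."""
    if x == y:
        return 0.0
    return levenshtein(x, y) / max(len(x), len(y))

def syllable_distance(w1, w2):
    """Edit distance computed over clusters instead of code points."""
    return levenshtein(clusters(w1), clusters(w2), cluster_sub_cost)

cakra = "\u091A\u0915\u094D\u0930"   # ca + (ka halant ra)
cakla = "\u091A\u0915\u094D\u0932"   # ca + (ka halant la): one letter differs
cama = "\u091A\u092E"                # ca + ma: unrelated second unit

near = syllable_distance(cakra, cakla)
far = syllable_distance(cakra, cama)
```

Here the word whose conjunct differs in a single letter comes out closer
than the word whose second unit is unrelated. Whether this weighting
matches real spelling mistakes is exactly what the quantitative tests
would have to show.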

> I now understand the issue.  However, I think that the fact that
> Aspell is 8-bit internally is a very small factor.  Converting Aspell
> to be 16-bit internally will not magically fix this issue.  I don't
> even think it will make it significantly easier to solve.

Yes, the 8-bit size is not so much the issue. It is more that if the
internal representation were Unicode, it would be easier to use
existing libraries to parse syllables. However, a workaround is
probably not too difficult.

> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell.  I suggest you start studying
> readonly_ws.cpp and suggest.cpp.  A while ago I wrote some docs on
> how Aspell works:
>     http://lists.gnu.org/archive/html/aspell-devel/2005-09/msg00007.html
>     http://lists.gnu.org/archive/html/aspell-devel/2005-10/msg00000.html
> which may be helpful.

Thanks. These look useful.

> I will get back to you later with some ideas on how to approach this
> issue.  If you already thought of some please share them.

I am realising that linguistically I am probably in over my depth with
Hindi. However, we are meeting this Sat., along with some literary
Hindi folk, and I am talking to experts in other Indian languages, to
plan out an approach. I will certainly make these available, probably
on a Wiki page.
  Thanks for all the interest that you have shown in this.

Regards,
Gora


From:	Ethan Bradford
Subject:	Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date:	Mon, 13 Nov 2006 09:37:15 -0800