
[Aspell-user] Re: Feedback on our approach to Arabic


From: Ethan Bradford
Subject: [Aspell-user] Re: Feedback on our approach to Arabic
Date: Sat, 11 Mar 2006 19:32:17 -0800

I don't see having archaic words as a particular problem.  It only reduces quality when a user misspells another word into one.  Besides, some people might use them, and if they know how to spell them, we don't want to bother them with spelling suggestions!

I don't see that we have a lot of options besides the Buckwalter data at this moment.  I think Arabic is too inflected to build a spell-checker from a straight word list.

Speaking of testing, does anybody on this list have good advice on testing a new dictionary?  Just the obvious?

On 3/10/06, Mohammed Sameer <address@hidden> wrote:
Sorry for the delay,

I've tried generating a sample wordlist and it was fine. I don't really know why we
assumed that Aspell wouldn't work with Arabic, M. Elzubeir started the Duali project,
and I forked Duali and coded Baghdad.

Well, I have a working spell checker implementation that uses the Duali data set,
which was originally the Buckwalter data set.

I can say that the set is not really accurate. It was identifying some misspelled words
as correct and it was failing to identify some correct words. While I can accept it
failing to identify all the correct words, I can't accept it saying that some
misspelled words are correct.
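
The two kinds of error described above can be measured separately. What follows is
only a minimal sketch, not the Baghdad or Duali code: the function name evaluate, the
toy dictionary, and the transliterated test words are all made up for illustration,
and is_correct stands in for whatever lookup the real checker provides.

```python
def evaluate(is_correct, valid_words, misspellings):
    """Count both error types for a checker.

    is_correct   -- callable taking a word and returning True if the checker accepts it
    valid_words  -- words known to be spelled correctly
    misspellings -- words known to be misspelled
    """
    # Correct words the checker flags as wrong (annoying, but tolerable).
    false_rejects = [w for w in valid_words if not is_correct(w)]
    # Misspelled words the checker accepts (the error that matters most here).
    false_accepts = [w for w in misspellings if is_correct(w)]
    return {
        "false_rejects": false_rejects,
        "false_accepts": false_accepts,
        "false_reject_rate": len(false_rejects) / len(valid_words) if valid_words else 0.0,
        "false_accept_rate": len(false_accepts) / len(misspellings) if misspellings else 0.0,
    }


# Hypothetical toy data: "qlam" is a deliberately bad dictionary entry.
toy_dictionary = {"kitAb", "qalam", "qlam"}
report = evaluate(
    lambda w: w in toy_dictionary,
    valid_words=["kitAb", "qalam", "bayt"],
    misspellings=["qlam", "ktAb"],
)
```

Keeping the two lists apart makes it possible to tolerate a few false rejects while
treating every false accept as a defect in the data set.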

We have a lot of old words in Arabic that are not really used, and being a native
Arabic speaker, I don't think that it's a good idea to list them in our wordlist.
If you use such words then you don't need a spell checker, because your language
background is definitely solid enough :-)
I don't like the Buckwalter data set because it contains some incorrect words (of course
it might be a problem in my implementation, but it might also be a problem with the
data set itself) and because no one has really looked through it and removed old words.

My idea was to generate a reasonably authentic data set, but I don't have enough *modern*
Arabic text, and even if I did, who is going to check it for errors? I'm a coder, not a
linguist, and although I'm a native Arabic speaker, my language is not really that good
and I don't really have much time. All the people out there complaining about the lack
of an Arabic spell checker didn't help with that part, and I can say that I'm stuck.

I'm willing to maintain the list of course, but I'm really unable to generate
the initial one.

I can't tell you not to use the Buckwalter data set, as I don't have a replacement for
you even though I don't like it, and I know that I should either do something or STFU.

Best regards,

On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
>
>    Hi, Mohammed et al.  Gokalp Yapici and I are also working on getting Arabic
>    for Aspell.  I thought we could share our plans to see if anybody wants to
>    offer us helpful feedback.
>    For character-set data, we started with the Farsi implementation in Aspell,
>    which uses utf-8 as the word-list encoding and Windows Arabic as the
>    internal encoding.
>    For a word list, our plan is to use the data from Buckwalter's Arabic
>    morphological analyzer -- the same data used in the Duali attempt at Arabic
>    spell checking.  This data has a complex specification of the structure of
>    an Arabic word, which we'll need to translate into the simpler format
>    required by Aspell.
>    In Buckwalter's format, each stem, prefix, or suffix is a member of a stem,
>    prefix, or suffix class.  Three auxiliary files specify which prefix
>    classes can connect to which stem classes; which stem classes can connect to
>    which suffix classes; and which prefix classes are compatible with which
>    suffix classes.
>    If it weren't for that last file, this would be an easy problem: it would
>    just be a matter of translating code names.  Instead, we'll write Perl
>    scripts to recognize the easy translations (when no prefix/suffix
>    combination is allowed, or all combinations are allowed), and do the easy
>    thing.  For the harder combinations (where some of the prefixes go to some
>    of the suffixes) we'll expand out the prefixes or the suffixes (whichever
>    there are fewer of), combining them with the stems as new "stem" entries.
>    There are a total of 170 affix (suffix and prefix) classes to start with.
>    We'll probably run out of Aspell class codes (they're limited to 255)
>    with the new classes we're creating.  If the overflow is severe, I'll see
>    if we can get Aspell updated to allow more suffix classes.  Otherwise,
>    we'll just explicitly expand out the combinations which lead to the fewest
>    new entries in the stem list.
>    What are some of the issues we haven't thought of?  Any feedback is welcome!
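
The plan quoted above can be illustrated with a small sketch. It is only an assumption
about how the expansion might look, written in Python rather than the Perl mentioned in
the message: the class names (P_null, S_At, ...), the transliterated strings, and the
functions classify and expand_partial are all hypothetical, and the three sets ab, bc,
and ac merely stand in for the three Buckwalter compatibility files.

```python
from itertools import product


def allowed_pairs(stem_class, ab, bc, ac):
    """Return (prefix classes, suffix classes, compatible pairs) for one stem class."""
    pre = {p for (p, s) in ab if s == stem_class}
    suf = {x for (s, x) in bc if s == stem_class}
    pairs = {(p, x) for p, x in product(pre, suf) if (p, x) in ac}
    return pre, suf, pairs


def classify(stem_class, ab, bc, ac):
    """'none' and 'all' are the easy cases; 'partial' needs expansion."""
    pre, suf, pairs = allowed_pairs(stem_class, ab, bc, ac)
    if not pairs:
        return "none"
    if len(pairs) == len(pre) * len(suf):
        return "all"
    return "partial"


def expand_partial(stem, stem_class, prefixes, suffixes, ab, bc, ac):
    """Fold the smaller affix side into the stem.

    Returns a list of (new_stem, remaining_affix_classes) entries.
    """
    pre, suf, pairs = allowed_pairs(stem_class, ab, bc, ac)
    n_pre = sum(len(prefixes[p]) for p in pre)
    n_suf = sum(len(suffixes[x]) for x in suf)
    out = []
    if n_pre <= n_suf:
        # Fewer prefixes: attach each one to the stem, keep its legal suffix classes.
        for p in sorted(pre):
            keep = sorted(x for x in suf if (p, x) in pairs)
            if keep:  # skip prefixes with no legal suffix class at all
                out.extend((text + stem, keep) for text in prefixes[p])
    else:
        # Fewer suffixes: attach each one to the stem, keep its legal prefix classes.
        for x in sorted(suf):
            keep = sorted(p for p in pre if (p, x) in pairs)
            if keep:
                out.extend((stem + text, keep) for text in suffixes[x])
    return out


# Hypothetical toy data; real Buckwalter tables are far larger.
prefixes = {"P_null": [""], "P_wa": ["wa"]}
suffixes = {"S_null": [""], "S_At": ["At"], "S_hu": ["hu"]}
ab = {("P_null", "N"), ("P_wa", "N")}
bc = {("N", "S_null"), ("N", "S_At"), ("N", "S_hu")}
# Every combination is legal except P_wa together with S_hu.
ac = {("P_null", "S_null"), ("P_null", "S_At"), ("P_null", "S_hu"),
      ("P_wa", "S_null"), ("P_wa", "S_At")}

kind = classify("N", ab, bc, ac)
entries = expand_partial("kitAb", "N", prefixes, suffixes, ab, bc, ac)
```

In this toy case the stem class is "partial", so the two prefixes are folded into the
stem and each new "stem" entry keeps only the suffix classes that remain legal for it.
Every distinct set of remaining classes may need its own Aspell class code, which is
where the 255-code limit mentioned above would start to bite.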

--
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature





