[Aspell-user] Re: Feedback on our approach to Arabic


From: Mohammed Sameer
Subject: [Aspell-user] Re: Feedback on our approach to Arabic
Date: Fri, 10 Mar 2006 10:56:41 +0200
User-agent: Mutt/1.5.11+cvs20060126

Sorry for the delay.

I tried generating a sample wordlist and it worked fine. I don't really know
why we assumed that aspell wouldn't work with Arabic, with M. Elzubeir starting
the Duali project and me forking Duali to code Baghdad.

Well, I have a working spell checker implementation that uses the Duali data
set, which is originally the Buckwalter data set.

I can say that the set is not really accurate: it identified some misspelled
words as correct and failed to identify some correct words. While I can accept
it missing some correct words, I can't accept it saying that misspelled words
are correct.

We have a lot of old words in Arabic that are not really used, and as a native
Arabic speaker I don't think it's a good idea to list them in our wordlist.
If you use such words then you don't need a spell checker, because your
language background is definitely solid enough :-)
I don't like the Buckwalter data set because it contains some incorrect words
(of course it might be a problem in my implementation, but it might also be a
problem with the data set itself) and because no one has really gone through
it and removed the old words.

My idea was to generate a reasonably authentic data set, but I don't have
enough *modern* Arabic text, and even if I did, who is going to check it for
errors? I'm a coder, not a linguist, and although I'm a native Arabic speaker,
my command of the language is not that good and I don't really have much time.
All the people out there complaining about the lack of an Arabic spell checker
didn't help with that part, so I can say that I'm stuck.

I'm willing to maintain the list, of course, but I'm really unable to generate
the initial one.
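
To make the idea of a corpus-derived list concrete, here is a minimal sketch
of how candidate words could be pulled out of modern Arabic text. It is only
an illustration: the function name `candidate_wordlist` and the `min_count`
threshold are hypothetical, and the output would still need checking by
someone with a solid command of the language, which is exactly the part that
is missing.

```python
import re
from collections import Counter

# Short vowels and other marks (U+064B..U+0652) plus tatweel (U+0640).
MARKS = re.compile(r"[\u064B-\u0652\u0640]")
# Runs of basic Arabic letters (U+0621..U+064A).
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def candidate_wordlist(text, min_count=3):
    """Return words seen at least min_count times, sorted.

    Rare forms are dropped on the assumption that they are more likely
    to be typos or archaic words; this is a heuristic, not a guarantee.
    """
    stripped = MARKS.sub("", text)          # compare words without diacritics
    counts = Counter(ARABIC_WORD.findall(stripped))
    return sorted(w for w, c in counts.items() if c >= min_count)
```

The threshold only filters noise; it cannot tell a frequent misspelling from
a correct word, so a manual pass over the result is still required.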

I can't tell you not to use the Buckwalter data set, as I don't have a
replacement for you even if I don't like it, and I know that I should either do
something or STFU.

Best regards,

On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
> 
>    Hi, Mohammed et al.  Gokalp Yapici and I are also working on getting Arabic
>    for Aspell.  I thought we could share our plans to see if anybody wants to
>    offer us helpful feedback.
>    For character-set data, we started with the Farsi implementation in Aspell,
>    which uses utf-8 as the word-list encoding and Windows Arabic as the
>    internal encoding.
>    For a word list, our plan is to use the data from Buckwalter's Arabic
>    morphological analyzer -- the same data used in the Duali attempt at Arabic
>    spell checking.  This data has a complex specification of the structure of
>    an Arabic word, which we'll need to translate into the simpler format
>    required by Aspell.
>    In Buckwalter's format, each stem, prefix, or suffix is a member of a stem,
>    prefix, or suffix class.  Three auxiliary files specify which prefix
>    classes can connect to which stem classes; which stem classes can connect
>    to which suffix classes; and which prefix classes are compatible with which
>    suffix classes.
>    If it weren't for that last file, this would be an easy problem: it would
>    just be a matter of translating code names.  Instead, we'll write perl
>    scripts to recognize the easy translations (when no prefix/suffix
>    combination is allowed, or all combinations are allowed), and do the easy
>    thing.  For the harder combinations (where some of the prefixes go to some
>    of the suffixes) we'll expand out the prefixes or the suffixes (whichever
>    there are fewer of), combining them with the stems as new "stem" entries.
>    There are a total of 170 affix (suffix and prefix) classes to start with.
>    We'll probably run out of Aspell class codes (they're limited to
>    255) with the new classes we're creating.  If that's very severe, I'll see
>    if we can't get Aspell updated to allow more suffix classes.  Otherwise,
>    we'll just explicitly expand out the combinations which lead to the fewest
>    new entries in the stem list.
>    What are some of the issues we haven't thought of?  Any feedback is
>    welcome!
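
The translation plan quoted above can be sketched roughly as follows. This is
an illustration in Python rather than the perl scripts Ethan mentions, and it
rests on assumptions: the class names (`P0`, `S1`, ...), the toy stem, and the
function names are hypothetical, the prefix/suffix compatibility file is
modelled as a set of (prefix class, suffix class) pairs, and a "null" affix is
assumed to be an ordinary class whose only form is the empty string.

```python
from itertools import product


def classify(prefix_classes, suffix_classes, ac_pairs):
    """Classify one stem class as 'none', 'all', or 'partial'.

    prefix_classes / suffix_classes: classes that can attach to the stem.
    ac_pairs: set of (prefix_class, suffix_class) pairs that are compatible.
    """
    pairs = set(product(prefix_classes, suffix_classes))
    allowed = pairs & ac_pairs
    if not allowed:
        return "none"      # easy case: no prefix/suffix combination allowed
    if allowed == pairs:
        return "all"       # easy case: every combination allowed
    return "partial"       # hard case: needs expansion


def expand_stem(stem, prefixes, suffixes, ac_pairs):
    """Fold the smaller affix side into the stem for a 'partial' case.

    prefixes / suffixes: dict mapping class name -> list of affix strings.
    Returns (new_stem, remaining_side, compatible_classes) tuples.
    """
    n_pre = sum(len(forms) for forms in prefixes.values())
    n_suf = sum(len(forms) for forms in suffixes.values())
    entries = []
    if n_pre <= n_suf:
        # Fewer prefixes: glue each prefix onto the stem, keep suffix classes.
        for p_class, forms in prefixes.items():
            keep = sorted(s for s in suffixes if (p_class, s) in ac_pairs)
            if keep:
                entries += [(form + stem, "suffix", keep) for form in forms]
    else:
        # Fewer suffixes: glue each suffix onto the stem, keep prefix classes.
        for s_class, forms in suffixes.items():
            keep = sorted(p for p in prefixes if (p, s_class) in ac_pairs)
            if keep:
                entries += [(stem + form, "prefix", keep) for form in forms]
    return entries


# Toy data in transliteration; not taken from the real Buckwalter files.
prefixes = {"P0": [""], "P1": ["wa"]}
suffixes = {"S0": [""], "S1": ["At"], "S2": ["uwna"]}
ac_pairs = {("P0", "S0"), ("P0", "S1"), ("P0", "S2"), ("P1", "S0")}

kind = classify(prefixes, suffixes, ac_pairs)
new_stems = expand_stem("ktb", prefixes, suffixes, ac_pairs)
```

Each distinct set of surviving classes would need its own Aspell class code,
which is where the 255-code limit mentioned above starts to matter.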

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature

Attachment: signature.asc
Description: Digital signature
