Re: [Aramorph-users] A contribution for AraMorph


From: Ahmed El-dawy
Subject: Re: [Aramorph-users] A contribution for AraMorph
Date: Mon, 13 Jun 2005 11:51:38 +0300

This is the patch for the Arabic tokenizer.
It uses a range set (a new class at the end of the file) to
check for Arabic letters. Note that the RangeSet is not yet complete:
I only implemented the features we are using now, and I will try to
complete it ASAP so that it is reusable.
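
A minimal sketch of the range-set idea, for readers without the
attachment. This is not the code from the patch: the class name
CharRangeSet, the arabicLetters() factory, and the exact code point
ranges are assumptions chosen for illustration. It keeps sorted,
non-overlapping inclusive ranges and answers membership with a binary
search instead of a long chain of if statements.

```java
import java.util.Arrays;

/** Hypothetical sketch: an immutable set of inclusive code point ranges. */
final class CharRangeSet {
    private final int[] starts; // range starts, in ascending order
    private final int[] ends;   // matching inclusive range ends

    /** Each element is {start, end}; ranges must be sorted and disjoint. */
    CharRangeSet(int[][] ranges) {
        starts = new int[ranges.length];
        ends = new int[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            int start = ranges[i][0];
            int end = ranges[i][1];
            if (start > end || (i > 0 && start <= ends[i - 1])) {
                throw new IllegalArgumentException("ranges must be sorted and disjoint");
            }
            starts[i] = start;
            ends[i] = end;
        }
    }

    /** Returns true if c falls inside one of the ranges. */
    boolean contains(int c) {
        int idx = Arrays.binarySearch(starts, c);
        if (idx >= 0) {
            return true; // c is exactly the start of a range
        }
        int insertion = -idx - 1; // index of the first start greater than c
        return insertion > 0 && c <= ends[insertion - 1];
    }

    /** Assumed ranges: base letters and short vowel marks, tatweel excluded. */
    static CharRangeSet arabicLetters() {
        return new CharRangeSet(new int[][] {
            {0x0621, 0x063A}, // hamza .. ghain
            {0x0641, 0x064A}, // feh .. yeh
            {0x064B, 0x0652}, // fathatan .. sukun
        });
    }
}
```

A tokenizer could then test each character with
CharRangeSet.arabicLetters().contains(ch) and extend the set by adding
ranges rather than new branches.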

As for changing the format of the dictionaries, I think it is better to
convert them to XML first. Then we can translate them into Arabic
UTF-8. Have you decided on a structure for the XML files? If not, I
can write a DTD file and send it to you for checking. If you have
already made a DTD file, I can write a small patch for translating the
current dictionaries to XML, and another one to translate them into
Arabic. For the latter, I will use AraMorph to change transliterated
words into Arabic words.
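
To make the last step concrete, here is a standalone sketch of
converting a transliterated entry into Arabic script. It assumes the
dictionaries use the standard Buckwalter transliteration table; it does
not call AraMorph's own conversion code, and the class and method names
are hypothetical. Characters outside the table are passed through
unchanged.

```java
/** Hypothetical sketch: Buckwalter transliteration to Arabic script. */
final class BuckwalterToArabic {
    // Latin symbols for U+0621..U+063A, in code point order.
    private static final String BLOCK1 = "'|>&<}AbptvjHxd*rzs$SDTZEg";
    // Latin symbols for U+0640..U+0652, in code point order.
    private static final String BLOCK2 = "_fqklmnhwYyFNKaui~o";

    /** Converts one transliterated string; unknown characters are kept. */
    static String convert(String translit) {
        StringBuilder out = new StringBuilder(translit.length());
        for (int i = 0; i < translit.length(); i++) {
            char c = translit.charAt(i);
            int idx;
            if ((idx = BLOCK1.indexOf(c)) >= 0) {
                out.append((char) (0x0621 + idx));
            } else if ((idx = BLOCK2.indexOf(c)) >= 0) {
                out.append((char) (0x0640 + idx));
            } else if (c == '`') {
                out.append('\u0670'); // superscript alef
            } else if (c == '{') {
                out.append('\u0671'); // alef wasla
            } else {
                out.append(c); // not part of the table (digits, spaces, ...)
            }
        }
        return out.toString();
    }
}
```

A dictionary conversion tool would apply convert() only to the
transliterated fields of each entry and leave glosses and category
labels alone.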

On 6/13/05, Pierrick Brihaye <address@hidden> wrote:
> Hi,
> 
> Ahmed El-dawy wrote:
> 
> > Yes, I got your idea. You can work with the previous changes that
> > makes the startup just fine.
> 
> Why not write the interface and the class yourself and... submit a patch ?
> 
> > For the SolutionsHandler, I don't think it is very useful to enhance
> > it unless it is really a bottleneck.
> 
> It shouldn't be, especially with a fast dictionary backend. See below.
> 
> > Expect an incoming patch for the ArabicTokenizer that will use a range
> > set (I will make it soon) for recognizing Arabic letters instead of a
> > long list of if statements.
> 
> OK. But see below...
> 
> > My next step is to translate the dictionaries into Arabic instead of
> > the transliterated format. I think I saw this in the TODO list.
> 
> Yes. But there are other things. IMHO, the best would be to have a
> UTF-8 XML format. This obvious (in 2005) solution is now the one
> retained by Tim Buckwalter. See :
> http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02
> and especially http://www.ldc.upenn.edu/Catalog/docs/LDC2004L02/readme.txt.
> 
> Unfortunately, I'm afraid that we'll have to imagine our own format
> because the new one is undocumented and *not* open source. See, however,
> an arabized version of the dictionaries :
> http://cvs.arabeyes.org/viewcvs/projects/duali/data/
> 
> I don't know whether they are in sync with ours though...
> 
> Since I absolutely want to keep the original dictionaries, any
> transformation code (Arabic text or XML) should be integrated into the
> project (this is not the case in Duali AFAIK).
> 
> Feel free to write one once we agree on the format.
> 
> > If I
> > succeed in this, we will not have to romanize the words before
> > running the Tim Buckwalter algorithm.
> 
> This is indeed a bottleneck. IMHO, it would be better to solve this
> problem rather than trying to make a more or less efficient
> transliterator. Furthermore, with a native arabic dictionary, we could
> use the native Lucene tokenizers. Well... I plan to provide one to the
> Lucene project.
> 
> > After that, I will try to make a JDBC dictionary handler. For this, I
> > will use IBM CloudScape database. It is open source and also very
> > simple.
> 
> I plan to develop a Lucene backend first : it's fast and pretty
> straightforward (just a directory). Regarding an SQL backend, I would
> lean towards HSQL.
> 
> Cheers,
> 
> --
> Pierrick Brihaye, informaticien
> Service régional de l'Inventaire
> DRAC Bretagne
> mailto:address@hidden
> +33 (0)2 99 29 67 78
> 


-- 
Regards,
Ahmed Saad

Attachment: ArabicTokenizer.txt
Description: Text document

