
Re: [Aramorph-users] A contribution for AraMorph


From: Pierrick Brihaye
Subject: Re: [Aramorph-users] A contribution for AraMorph
Date: Mon, 13 Jun 2005 09:51:13 +0200
User-agent: Mozilla/5.0 (Windows; U; Win98; fr-FR; rv:1.7.8) Gecko/20050511

Hi,

Ahmed El-dawy wrote:

> Yes, I got your idea. You can work with the previous changes that
> make the startup just fine.

Why not write the interface and the class yourself and... submit a patch?

> For the SolutionsHandler, I don't think it is very useful to enhance
> it unless it is really a bottleneck.

It shouldn't be, especially with a fast dictionary backend. See below.

> Expect an incoming patch for the ArabicTokenizer that will use a range
> set (I will make it soon) for recognizing Arabic letters instead of a
> long list of if statements.

OK. But see below...
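
To make the range-set idea concrete, here is a minimal sketch. It is not the actual ArabicTokenizer code nor the promised patch: the class name ArabicLetterRanges, the choice of ranges, and the simple run-splitting tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of AraMorph): range-based Arabic letter test. */
final class ArabicLetterRanges {

    // Inclusive [start, end] ranges taken from the Unicode Arabic block.
    // Which characters a real tokenizer should accept is an assumption here.
    private static final char[][] RANGES = {
        {'\u0621', '\u063A'}, // hamza .. ghain
        {'\u0640', '\u0652'}, // tatweel, feh .. yeh, tanween, short vowels, shadda, sukun
        {'\u0670', '\u0671'}, // superscript alef, alef wasla
    };

    /** Returns true if c falls inside one of the ranges above. */
    static boolean isArabicLetter(char c) {
        for (char[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Splits text into maximal runs of characters accepted by isArabicLetter. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isArabicLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Adding or removing an accepted character then means editing one table entry rather than another branch in a chain of if statements.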

> My next step is to translate the dictionaries into Arabic instead of
> the transliterated format. I think I saw this in the TODO list.

Yes. But there are other things. IMHO, the best would be to have a UTF-8 XML format. This obvious (in 2005) solution is now the one retained by Tim Buckwalter. See: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02 and especially http://www.ldc.upenn.edu/Catalog/docs/LDC2004L02/readme.txt.

Unfortunately, I'm afraid that we'll have to devise our own format because the new one is undocumented and *not* open source. See, however, an arabized version of the dictionaries: http://cvs.arabeyes.org/viewcvs/projects/duali/data/

I don't know whether they are in sync with ours though...

Since I absolutely want to keep the original dictionaries, any transformation code (Arabic text or XML) should be integrated into the project (this is not the case in Duali AFAIK).

Feel free to write one once we agree on the format.
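
As a starting point for such transformation code, here is a minimal sketch of the transliteration-to-Arabic step. It assumes the standard Buckwalter transliteration table; the class name BuckwalterToArabic is hypothetical, and passing unmapped characters through unchanged is an assumption, not a decision we have made.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical converter (not part of AraMorph): Buckwalter transliteration to Arabic. */
final class BuckwalterToArabic {

    private static final Map<Character, Character> TABLE = new HashMap<>();

    static {
        // Each string lists Buckwalter symbols for consecutive Unicode code points,
        // starting at the given character.
        addRun("'|>&<}AbptvjHxd*rzs$SDTZEg", '\u0621'); // hamza .. ghain
        addRun("_fqklmnhwYy", '\u0640');                // tatweel, feh .. yeh
        addRun("FNKaui~o", '\u064B');                   // tanween, short vowels, shadda, sukun
        addRun("`{", '\u0670');                         // superscript alef, alef wasla
    }

    private static void addRun(String symbols, char first) {
        for (int i = 0; i < symbols.length(); i++) {
            TABLE.put(symbols.charAt(i), (char) (first + i));
        }
    }

    /** Converts a transliterated string; unmapped characters are copied as-is (assumption). */
    static String toArabic(String transliterated) {
        StringBuilder out = new StringBuilder(transliterated.length());
        for (int i = 0; i < transliterated.length(); i++) {
            char c = transliterated.charAt(i);
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

A real converter would apply this only to the transliterated fields of each dictionary entry, leaving glosses and comments untouched, so that the original files remain the reference.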

> If I succeed in this, we will not have to romanize the words before
> running the Tim Buckwalter algorithm.

This is indeed a bottleneck. IMHO, it would be better to solve this problem than to try to make a more or less efficient transliterator. Furthermore, with a native Arabic dictionary, we could use the native Lucene tokenizers. Well... I plan to provide one to the Lucene project.

> After that, I will try to make a JDBC dictionary handler. For this, I
> will use the IBM Cloudscape database. It is open source and also very
> simple.

I plan to develop a Lucene backend first: it's fast and pretty straightforward (just a directory). Regarding an SQL backend, I would lean towards HSQL.
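
Whichever backend comes first, both could sit behind one small lookup interface. The sketch below is only an illustration of that idea: DictionaryBackend and InMemoryBackend are hypothetical names, not existing AraMorph classes, and the in-memory map merely stands in for a Lucene directory or an SQL table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction over where dictionary entries are stored. */
interface DictionaryBackend {
    /** Returns all entries recorded for the given form, or an empty list. */
    List<String> lookup(String form);
}

/** Stand-in implementation; a Lucene or JDBC backend would implement the same interface. */
final class InMemoryBackend implements DictionaryBackend {

    private final Map<String, List<String>> entries = new HashMap<>();

    /** Records one entry under the given form; a form may have several entries. */
    void add(String form, String entry) {
        entries.computeIfAbsent(form, key -> new ArrayList<>()).add(entry);
    }

    @Override
    public List<String> lookup(String form) {
        List<String> found = entries.get(form);
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}
```

With something like this in place, the analyzer would not need to know whether it is talking to Lucene, HSQL, or plain files.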

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:address@hidden
+33 (0)2 99 29 67 78



