Re: [Aramorph-users] A contribution for AraMorph
From: Pierrick Brihaye
Subject: Re: [Aramorph-users] A contribution for AraMorph
Date: Mon, 13 Jun 2005 09:51:13 +0200
User-agent: Mozilla/5.0 (Windows; U; Win98; fr-FR; rv:1.7.8) Gecko/20050511
Hi,
Ahmed El-dawy wrote:
> Yes, I got your idea. You can work with the previous changes that
> make the startup just fine.
Why not write the interface and the class yourself and... submit a patch?
> For the SolutionsHandler, I don't think it is very useful to enhance
> it unless it is really a bottleneck.
It shouldn't be, especially with a fast dictionary backend. See below.
> Expect an incoming patch for the ArabicTokenizer that will use a range
> set (I will make it soon) for recognizing Arabic letters instead of a
> long list of if statements.
OK. But see below...
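For readers of the archive, the range-set idea mentioned above could look roughly like the sketch below. The class name, the method name and the exact ranges are assumptions chosen for illustration; they are not taken from the ArabicTokenizer source, and a real patch would have to cover whatever characters the tokenizer currently accepts.

```java
// Hypothetical sketch: not the actual ArabicTokenizer code.
final class ArabicLetterRanges {
    // Inclusive code point ranges, kept in ascending order.
    // Illustrative subset of the Unicode Arabic block:
    // U+0621..U+063A (hamza to ghain) and U+0641..U+064A (feh to yeh).
    private static final int[][] LETTER_RANGES = {
        {0x0621, 0x063A},
        {0x0641, 0x064A},
    };

    private ArabicLetterRanges() {}

    // Returns true if the code point falls inside any of the ranges.
    static boolean isArabicLetter(int codePoint) {
        for (int[] range : LETTER_RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Adding or removing a range then means editing one table row instead of touching a chain of if statements.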
> My next step is to translate the dictionaries into Arabic instead of
> the transliterated format. I think I saw this in the TODO list.
Yes. But there are other things. IMHO, the best would be to have a
UTF-8 XML format. This solution, obvious in 2005, is now the one
adopted by Tim Buckwalter. See:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02
and especially http://www.ldc.upenn.edu/Catalog/docs/LDC2004L02/readme.txt.
Unfortunately, I'm afraid that we'll have to devise our own format
because the new one is undocumented and *not* open source. See, however,
an arabized version of the dictionaries:
http://cvs.arabeyes.org/viewcvs/projects/duali/data/
I don't know whether they are in sync with ours though...
Since I absolutely want to keep the original dictionaries, any
transformation code (Arabic text or XML) should be integrated into the
project (this is not the case in Duali AFAIK).
Feel free to write one once we agree on the format.
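As a starting point for such transformation code, here is a minimal sketch that converts a Buckwalter-transliterated string to Arabic script. It assumes the commonly published Buckwalter character table; the class and method names are hypothetical, and the mapping should be checked against the conventions actually used in the dictionary files before relying on it.

```java
// Hypothetical sketch: not part of AraMorph.
final class BuckwalterToArabic {
    // Buckwalter symbols for the consecutive code points U+0621..U+063A.
    private static final String RUN_FROM_0621 = "'|>&<}AbptvjHxd*rzs$SDTZEg";
    // Buckwalter symbols for the consecutive code points U+0640..U+0652.
    private static final String RUN_FROM_0640 = "_fqklmnhwYyFNKaui~o";

    private BuckwalterToArabic() {}

    // Maps one Buckwalter symbol to its Arabic character;
    // anything not in the table is returned unchanged.
    static char toArabic(char c) {
        int index = RUN_FROM_0621.indexOf(c);
        if (index >= 0) {
            return (char) (0x0621 + index);
        }
        index = RUN_FROM_0640.indexOf(c);
        if (index >= 0) {
            return (char) (0x0640 + index);
        }
        if (c == '`') {
            return '\u0670'; // superscript alef
        }
        if (c == '{') {
            return '\u0671'; // alef wasla
        }
        return c;
    }

    // Converts a whole transliterated string, character by character.
    static String toArabic(String transliterated) {
        StringBuilder out = new StringBuilder(transliterated.length());
        for (int i = 0; i < transliterated.length(); i++) {
            out.append(toArabic(transliterated.charAt(i)));
        }
        return out.toString();
    }
}
```

Running the conversion once at build time and keeping the original files untouched would preserve the original dictionaries while producing the Arabic-script version.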
> If I succeed in this, we will not have to romanize the words before
> running the Tim Buckwalter algorithm.
This is indeed a bottleneck. IMHO, it would be better to solve this
problem rather than trying to make a more or less efficient
transliterator. Furthermore, with a native Arabic dictionary, we could
use the native Lucene tokenizers. Well... I plan to provide one to the
Lucene project.
> After that, I will try to make a JDBC dictionary handler. For this, I
> will use the IBM Cloudscape database. It is open source and also very
> simple.
I plan to develop a Lucene backend first: it's fast and pretty
straightforward (just a directory). Regarding an SQL backend, I would
lean towards HSQL.
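To make the idea of interchangeable backends concrete, here is a rough sketch of what a backend abstraction might look like. Every name in it is hypothetical and it does not reflect AraMorph's existing classes; the in-memory implementation only stands in for a Lucene or JDBC implementation, which would sit behind the same interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: a Lucene or JDBC backend would implement it too.
interface DictionaryBackend {
    // Returns all entries stored under the given form, or an empty list.
    List<String> lookup(String form);
}

// Simple in-memory stand-in, useful as a reference implementation in tests.
final class InMemoryBackend implements DictionaryBackend {
    private final Map<String, List<String>> entries = new HashMap<>();

    // Registers one entry under a form; a form may carry several entries.
    void add(String form, String entry) {
        entries.computeIfAbsent(form, key -> new ArrayList<>()).add(entry);
    }

    @Override
    public List<String> lookup(String form) {
        return entries.getOrDefault(form, Collections.emptyList());
    }
}
```

With something like this in place, the analyzer would depend only on the interface, so the storage choice could be changed without touching the analysis code.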
Cheers,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:address@hidden
+33 (0)2 99 29 67 78