Re: [Aramorph-users] A contribution for AraMorph

aramorph-users

From:	Pierrick Brihaye
Subject:	Re: [Aramorph-users] A contribution for AraMorph
Date:	Mon, 13 Jun 2005 09:51:13 +0200
User-agent:	Mozilla/5.0 (Windows; U; Win98; fr-FR; rv:1.7.8) Gecko/20050511

Hi,

Ahmed El-dawy wrote:

Yes, I got your idea. You can work with the previous changes that
make the startup just fine.


Why not write the interface and the class yourself and... submit a patch?

For the SolutionsHandler, I don't think it is very useful to enhance
it unless it is really a bottleneck.


It shouldn't be, especially with a fast dictionary backend. See below.

Expect an incoming patch for the ArabicTokenizer that will use a range
set (I will make it soon) for recognizing Arabic letters instead of a
long list of if statements.


OK. But see below...
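
To make the range-set idea concrete, here is a minimal sketch of a lookup over sorted code point ranges. The class name and the exact ranges are assumptions chosen for illustration; which characters AraMorph's ArabicTokenizer actually accepts should be checked against its source.

```java
// Hypothetical helper, not part of AraMorph: tests a code point against
// a small sorted table of inclusive ranges instead of a chain of ifs.
final class ArabicLetterRanges {

    // Inclusive [first, last] ranges, sorted and non-overlapping.
    // The selection below is an assumption made for this sketch.
    private static final int[][] RANGES = {
        {0x0621, 0x063A}, // HAMZA .. GHAIN
        {0x0641, 0x064A}, // FEH .. YEH (skips TATWEEL, U+0640)
        {0x064B, 0x0652}, // FATHATAN .. SUKUN (short vowels and marks)
        {0x0670, 0x0671}, // SUPERSCRIPT ALEF, ALEF WASLA
    };

    private ArabicLetterRanges() {
    }

    // Binary search over the range table.
    static boolean isArabicLetter(int codePoint) {
        int lo = 0;
        int hi = RANGES.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < RANGES[mid][0]) {
                hi = mid - 1;
            } else if (codePoint > RANGES[mid][1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```

With only a handful of ranges a linear scan would do just as well; the binary search only matters if the table grows.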

My next step is to translate the dictionaries into Arabic instead of
the transliterated format. I think I saw this in the TODO list.

Yes. But there are other things. IMHO, the best would be to have a UTF-8 XML format. This obvious (in 2005) solution is now the one retained by Tim Buckwalter. See: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02 and especially http://www.ldc.upenn.edu/Catalog/docs/LDC2004L02/readme.txt.

Unfortunately, I'm afraid that we'll have to design our own format because the new one is undocumented and *not* open source. See, however, an arabized version of the dictionaries: http://cvs.arabeyes.org/viewcvs/projects/duali/data/


I don't know whether they are in sync with ours though...

Since I absolutely want to keep the original dictionaries, any transformation code (Arabic text or XML) should be integrated into the project (this is not the case in Duali AFAIK).


Feel free to write one once we agree on the format.
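
As a starting point for such transformation code, the sketch below converts a string in Buckwalter transliteration to Arabic script by table lookup. The class and method names are hypothetical, and the table follows the commonly published Buckwalter scheme; it should be checked against the tables shipped with the dictionaries before being relied on.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical converter, not part of AraMorph.
final class BuckwalterToArabic {

    private static final Map<Character, Character> TABLE = new HashMap<>();

    // Maps each character of `latin` to consecutive code points
    // starting at `firstCodePoint`.
    private static void addRun(String latin, int firstCodePoint) {
        for (int i = 0; i < latin.length(); i++) {
            TABLE.put(latin.charAt(i), (char) (firstCodePoint + i));
        }
    }

    static {
        addRun("'|>&<}AbptvjHxd*rzs$SDTZEg", 0x0621); // HAMZA .. GHAIN
        addRun("_fqklmnhwYyFNKaui~o", 0x0640);        // TATWEEL .. SUKUN
        addRun("`{", 0x0670);                         // SUPERSCRIPT ALEF, ALEF WASLA
    }

    private BuckwalterToArabic() {
    }

    // Characters without a mapping (digits, spaces, ...) pass through unchanged.
    static String toArabic(String translit) {
        StringBuilder out = new StringBuilder(translit.length());
        for (int i = 0; i < translit.length(); i++) {
            char c = translit.charAt(i);
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

Applied once to each transliterated field of the original dictionary files, something like this would keep the originals untouched and regenerate the Arabic version on demand.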

If I
succeed in this, we will not have to romanize the words before
running the Tim Buckwalter algorithm.

This is indeed a bottleneck. IMHO, it would be better to solve this problem rather than trying to make a more or less efficient transliterator. Furthermore, with a native Arabic dictionary, we could use the native Lucene tokenizers. Well... I plan to provide one to the Lucene project.

After that, I will try to make a JDBC dictionary handler. For this, I
will use IBM CloudScape database. It is open source and also very
simple.

I plan to develop a Lucene backend first: it's fast and pretty straightforward (just a directory). Regarding an SQL backend, I would lean towards HSQL.
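
Whichever store comes first, the backends could sit behind one small lookup interface. The sketch below is only an illustration: the interface, its method, and the in-memory class are all hypothetical names, not AraMorph's actual API, and a Lucene or JDBC/HSQL class would replace the map-based stand-in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for the sketch; nothing here exists in AraMorph.
final class DictionaryBackends {

    private DictionaryBackends() {
    }

    // Assumed contract: return every gloss stored for a given form.
    interface DictionaryBackend {
        List<String> lookup(String form);
    }

    // In-memory stand-in; a Lucene or SQL backend would implement
    // the same interface on top of its own storage.
    static final class InMemoryBackend implements DictionaryBackend {
        private final Map<String, List<String>> entries = new HashMap<>();

        void add(String form, String gloss) {
            entries.computeIfAbsent(form, k -> new ArrayList<>()).add(gloss);
        }

        @Override
        public List<String> lookup(String form) {
            List<String> found = entries.get(form);
            if (found == null) {
                return Collections.emptyList();
            }
            return Collections.unmodifiableList(found);
        }
    }
}
```

The analyzer would then depend only on the interface, so switching from one store to another would not touch the lookup code.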


Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:address@hidden
+33 (0)2 99 29 67 78
