Re: [Aramorph-users] XML tables


From: Ahmed El-dawy
Subject: Re: [Aramorph-users] XML tables
Date: Mon, 15 Aug 2005 11:41:47 +0300



On 8/11/05, Pierrick Brihaye <address@hidden> wrote:
> So what about the hierarchy of the stems dictionary? Please give me more information for this point.

See: http://www.nongnu.org/aramorph/english/dictionaries.html#Stems.
The root "ktb" (";--- ktb" in the file) has *many* lemmas. However,
keeping a trace of it may help in writing a root analyzer (useful for
linguists ;-).
 
This means that the stems dictionary would be stored as a flat list of entries, like the prefix dictionary, with every entry carrying attributes (or something similar) for lemma-id and root. I think it would be better to use <root> tags, with <lemma> tags inside them, and then <entry> tags inside each <lemma> tag. That way the root and the lemma are stated once instead of being repeated on every entry, so there is no redundancy.

>     <!ATTLIST lemma lemma-id CDATA #REQUIRED>
>
>     an id attribute should be enough.
>
> That's an easy one.

... and a good practice ;-)
 
??? See the previous point :-|

>      > I have also made an
>      > XMLDictionaryHandler which parses XML tables, using digester from
>      > Jakarta commons, and loads them into memory.
>
>     What does Digester add to a standard XML parser?
>
> Digester is event based. This is faster and requires less memory when
> the XML file is read only once. The dictStems.xml file is about 32
> MB!!! It would certainly cause an OutOfMemoryError if it were all
> loaded into memory.

Eeeer... the Java *standard* SAX parser does it, doesn't it? A SAX
parser is really the thing we need here: big file, poor structure.
 
I looked at the SAX parser and I could use it. However, Digester is much easier to use, and I have already implemented the handler with it. Is there a problem with using libraries from the Jakarta project? You already use MultiHashMap from Jakarta Commons (even though it is not strictly necessary).
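
For comparison, here is a minimal sketch of what the plain SAX route could look like against the nested <root>/<lemma>/<entry> layout discussed above. It is not the XMLDictionaryHandler mentioned earlier and does not use Digester. The <stems> wrapper element, the "id" and "stem" attribute names, and the sample values are all assumptions made up for illustration. The handler only keeps the current root and lemma in memory while the events stream by, which is the point for a file of this size.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StemsSaxSketch {

    // Hypothetical sample data: element and attribute names are assumptions,
    // not the real dictStems.xml schema.
    static final String SAMPLE =
        "<stems>"
      + "<root id=\"ktb\">"
      + "<lemma id=\"katab-u_1\">"
      + "<entry stem=\"ktb\"/>"
      + "<entry stem=\"kotub\"/>"
      + "</lemma>"
      + "<lemma id=\"kitAb_1\">"
      + "<entry stem=\"ktAb\"/>"
      + "</lemma>"
      + "</root>"
      + "</stems>";

    // Streams the document once and counts the entries under each lemma.
    // Keys have the form "rootId/lemmaId".
    static Map<String, Integer> countEntriesPerLemma(String xml) throws Exception {
        final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();

        DefaultHandler handler = new DefaultHandler() {
            private String currentRoot;   // id of the enclosing <root>
            private String currentLemma;  // key of the enclosing <lemma>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("root".equals(qName)) {
                    currentRoot = atts.getValue("id");
                } else if ("lemma".equals(qName)) {
                    currentLemma = currentRoot + "/" + atts.getValue("id");
                    counts.put(currentLemma, 0);
                } else if ("entry".equals(qName)) {
                    // A real loader would build the dictionary entry here.
                    counts.merge(currentLemma, 1, Integer::sum);
                }
            }
        };

        // The default factory is not namespace-aware, so qName is the tag name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countEntriesPerLemma(SAMPLE));
    }
}
```

With this layout the root and lemma ids are read once per group rather than once per entry, which is the redundancy argument from above. For the real file, the StringReader would be replaced by a file input stream.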

BTW, still as a quick answer: I think that the 3 compatibility tables could be merged into a single file.
 
What would we gain from this? I don't think it would make startup any faster or the files any smaller!

--
regards,
Ahmed Saad
