aramorph-users

Re: [Aramorph-users] XML tables


From: Pierrick Brihaye
Subject: Re: [Aramorph-users] XML tables
Date: Wed, 15 Jun 2005 11:43:55 +0200
User-agent: Mozilla/5.0 (Windows; U; Win98; fr-FR; rv:1.7.8) Gecko/20050511

Hi again,

Ahmed El-dawy wrote:

Actually, I made this part like this because it is so similar to that
of the sample at LDC.

The problem is that people from the LDC know what they are talking about :-)

Also it keeps the xml files small if we consider
this matter.

You're right, but since we will provide the files in a jar file, the redundancy (and thus the compression rate) will be roughly the same.

Anyway, it is changed now to be more readable. Actually,
I don't know the meaning of the word (pos) till now :)

POS = "part of speech".

May be changed to :
<glosses>
 <gloss>and</gloss>
 <gloss>by/with</gloss>
</glosses>

Yes, you are right at this. I've changed it in the new version.

Thank you. On this point, my implementation differs from the original one because I try to "shift" prefixes and suffixes from the stem definition (and may even generate words with a NO_STEM type, e.g. bihi, fyha...). Splitting the values in the XML files makes this process more obvious IMHO.

However, we may also go further by introducing <prefix-pos/gloss>, <stem-pos/gloss> and <suffix-pos/gloss>, which are *very* accurate in the stems dictionary. What do you think about this?

And, of course, the Arabic words should be encoded... in Arabic.

I will do it after making the XML files, maybe in the same program that
translates the current dictionary to XML files.

It would be a good idea.

By the way, there's a
problem if we transformed to xml using the transliteration. One symbol
used is (>) which is already used for closing tag names in XML. We
will have to transform this into &gt;.

Correct. This shouldn't be a problem if the files are generated through a SAX parser that will handle this escaping automatically.
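
A minimal sketch of that point, written in Python rather than the project's Java only to keep it short: the standard library's SAX-style XMLGenerator escapes character data as it writes, so a transliterated value containing > comes out as &gt; with no manual replacement. The element name "form" and the sample value are made up for illustration and are not taken from the dictionaries.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator


def write_entry(text):
    """Serialize one element through SAX events and return the XML string."""
    buffer = io.StringIO()
    writer = XMLGenerator(buffer, encoding="utf-8")
    writer.startDocument()
    writer.startElement("form", {})   # hypothetical element name
    writer.characters(text)           # &, < and > are escaped here
    writer.endElement("form")
    writer.endDocument()
    return buffer.getvalue()


# Transliteration symbols that collide with XML syntax (sample value only).
xml_text = write_entry(">kl & <lY")
print(xml_text)

# Parsing the output gives back the original, unescaped string.
recovered = ET.fromstring(xml_text.encode("utf-8")).text
print(recovered)
```

The same holds in the other direction: any conforming parser hands the application the unescaped text, so the code that reads the dictionaries never sees &gt;.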

Regarding the stems dictionary, the format has to be slightly different
because we have additional information (see
http://www.nongnu.org/aramorph/english/dictionaries.html):

<root>ktb</root>
<lemmaID>katab-u_1</lemmaID>

and, maybe, a "normalised" lemma
<lemma>katab</lemma>

I know that the lemma is the one at a line starting with two
semicolons (;;), but what is this root?

;--- ktb

We first have to check if this formalism is consistent throughout the stems dictionary...
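
One possible way to run that check is sketched below, in Python for brevity. It assumes, and this is only an assumption, that a root is always announced by a comment line starting with ";--- " and a lemma by a line starting with ";; ". The function name and the sample lines are hypothetical; only "ktb" and "katab-u_1" come from this thread.

```python
def check_roots(lines):
    """Group lemma IDs under the most recent ';--- root' line.

    Returns (roots, orphans): a dict mapping each root to its lemma IDs,
    and the list of lemma IDs seen before any root line.
    """
    current_root = None
    roots = {}
    orphans = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(";--- "):      # assumed root marker
            current_root = line[5:].strip()
        elif line.startswith(";; "):      # assumed lemma marker
            lemma_id = line[3:].strip()
            if current_root is None:
                orphans.append(lemma_id)
            else:
                roots.setdefault(current_root, []).append(lemma_id)
    return roots, orphans


# Made-up excerpt: one lemma with no root line before it, one with a root.
SAMPLE = [
    ";; wa_1",
    ";--- ktb",
    ";; katab-u_1",
    "ktb\tkatab\tPV\twrite",
]

roots, orphans = check_roots(SAMPLE)
print(roots)
print(orphans)
```

Running something like this over the real file would show how many lemmas have no root line before them, which is the first thing to know before committing to a <root> element.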

Regarding the compatibility tables, something like this would be nice :


See the current version (attached) and tell me

Since the DTDs are very short, you should embed them in the XML files. See http://www.thescarms.com/XML/DTDTutorial.asp.
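
For illustration, here is what an embedded DTD (an internal subset) could look like for the <glosses> fragment quoted earlier, wrapped in a short Python sketch that reads it back. The content model shown is my guess at a reasonable one, not the DTD from the attachment. Note that Python's standard parser accepts the embedded DTD but does not validate against it; a validating parser would be needed for that.

```python
import xml.etree.ElementTree as ET

# XML document carrying its own DTD between "[" and "]" in the DOCTYPE.
# The two ELEMENT declarations are an assumed content model.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE glosses [
  <!ELEMENT glosses (gloss+)>
  <!ELEMENT gloss (#PCDATA)>
]>
<glosses>
  <gloss>and</gloss>
  <gloss>by/with</gloss>
</glosses>
"""

# The internal subset is parsed along with the document (no validation).
root = ET.fromstring(DOCUMENT)
glosses = [gloss.text for gloss in root.findall("gloss")]
print(glosses)
```

Keeping the DTD inside each file means there is no separate .dtd resource to locate inside the jar.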

Regarding :

<grammatical-categories>
  <grammatical-category>wa/CONJ</grammatical-category>
</grammatical-categories>

we may consider a :

grammatical-categories|grammatical-category

content-model. I don't know if it's accurate though since such a less verbose format may introduce unnecessary processing difficulties.

Well, using an XML file format would greatly help us in providing a Web interface that could allow adding new words in the dictionaries.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:address@hidden
+33 (0)2 99 29 67 78



