
[Freecats-Dev] Bilingual File Format (again)


From: Henri Chorand
Subject: [Freecats-Dev] Bilingual File Format (again)
Date: Tue, 01 Apr 2003 21:43:31 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

Hi Marc,

> Firstly, thank you to all members of Free CATS for your confidence in
> the future of the OmegaT project.

Three is a crowd, as Buster Keaton (I think) once said ;-)

> Keith will no doubt begin making contributions of his own to the list
> in due course, but at the moment both he and I are under time pressure
> owing to other activities. Please be patient!

No problem - we are all very busy people.

Now, concerning the Famous Bilingual file format:

When I was thinking about starting from scratch, I started wanting to make up a very simple, yet extensible enough, design:

1) Being ignorant about XML, apart from its most basic principles, I still wanted it to be XML-based, that is, tag-based - XML is the future, or so I heard everybody knowledgeable say.

2) I knew it had to include formatting info. I first looked for formatting tags in the XML specs and asked a few techies around, only to find out that the XML specs themselves say nothing about formatting. I then supposed we could use HTML's set of formatting tags, at least for a start (I assumed that for the more complicated formatting tags found in some DTP & word-processing formats, we could simply find a way to keep them unchanged and end up with a pseudo-WYSIWYG approach that would be good enough for us translators).

3) I also knew that we had to parse any XML source file sequentially, in a "dumb" way (only caring about its text & formatting contents and leaving its structure unchanged, even and especially if it was supposedly weird or malformed, and all the more since most existing HTML files found today ARE badly formatted from XML's point of view).

4) I spoke to Thierry about it, and it emerged that we could envision a bilingual file format made up of our own custom tags (beginning of TU (including ancillary TU info), middle of TU, end of TU). We would have kept the internal tags (Trados' tw4winInternal style) within the TUs' source & target segments, and proudly left unchanged all "structure" tags (Trados' tw4winExternal style).
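
To make points 3 and 4 concrete, here is a minimal Python sketch of a "dumb" sequential pass over such a file. The delimiters {tu ...}, {=} and {/tu} are purely hypothetical (they are not Trados' or OmegaT's markers), and the function names are made up for illustration. Everything outside a TU is passed through untouched, even if it is malformed from XML's point of view.

```python
import re

# Hypothetical delimiters, for illustration only:
#   {tu info}  = beginning of TU, with ancillary info
#   {=}        = middle of TU (source / target boundary)
#   {/tu}      = end of TU
TU_PATTERN = re.compile(
    r"\{tu(?P<info>[^}]*)\}(?P<source>.*?)\{=\}(?P<target>.*?)\{/tu\}",
    re.DOTALL,
)

SAMPLE = "<p>{tu match=100}Hello <b>world</b>{=}Bonjour <b>monde</b>{/tu}</p><br>"


def split_bilingual(text):
    """Walk the text sequentially and return ('raw', str) and ('tu', dict) items."""
    items = []
    pos = 0
    for m in TU_PATTERN.finditer(text):
        if m.start() > pos:
            # Structure tags and anything else between TUs: kept verbatim.
            items.append(("raw", text[pos:m.start()]))
        items.append(("tu", {
            "info": m.group("info").strip(),
            "source": m.group("source"),   # internal tags stay inside the segment
            "target": m.group("target"),
        }))
        pos = m.end()
    if pos < len(text):
        items.append(("raw", text[pos:]))
    return items


def clean(text):
    """Build the target-language file: raw parts unchanged, each TU replaced by its target."""
    return "".join(
        value if kind == "raw" else value["target"]
        for kind, value in split_bilingual(text)
    )


print(clean(SAMPLE))
```

Note that the stray <br> in the sample is never parsed as XML; it is simply copied through, which is the whole point of the sequential approach.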


> An industry-standard tagged bilingual file format would be a major
> breakthrough. I am currently in the position of arguing vehemently
> that TMX, and not Trados' native translation memory format, should
> be regarded as the industry-standard translation memory format.

We all agree about TMX. The question that remains is about the bilingual file format.

> Trados though, with its "uncleaned file" format, has a format for
> which there is no industry-standard equivalent, and so the Trados
> format can effectively claim this status by default.  :-(

Trados' format is either:
- nicely based on character styles for MS Word / RTF files (but we don't want to work within an MS Word framework, do we?)
- tagged (proprietary) for HTML / XML files, with a deliberately blatant incompatibility between Trados 3 (.BIF) & Trados 5 and later.

Wordfast cleverly clones Trados (Word version) on this, but has no tagged format for HTML/XML files as it preps them in order to allow them to be also translated with MS Word. Yves will correct me if I'm wrong.


> However, I find it difficult to conceive of an industry-standard
> tagged bilingual file format in the absence of an industry-standard
> tagged (monolingual) word processing file format.

Of course we must end up designing our own.

For me, the question is, how to nicely & efficiently design something, starting from what is readily available.

> If, for the sake of argument, OpenOffice.org's file format (which is
> at least open, documented, extensible, and has been submitted to the
> W3C for formal recognition as a standard) is accepted as the standard
> for a *monolingual* word processing file format, the step to a tagged
> bilingual file format is trivial.

Exactly. I believe this is what we need, for the following obvious reasons:

- It's an open, XML-based, tagged document format - certainly the best one available today.

- Keith's OmegaT already understands it pretty well.

> It may well be possible to add such functionality with no alteration
> to the OOo code, purely by modification of the XML mechanisms (DTD
> etc.).

I request a vote from the project team, as I believe we could all agree on this.


(...)
> Why wait for the appearance of a bilingual file format? There are
> lots of conversion filters which would be advantageous in their own
> right. .po to TMX, for example, and vice-versa, would be beneficial
> to OmegaT - I think that benefit is independent of a bilingual file
> format. Even TMX2 to TMX1 would be an advantage. It may well be
> that such filters already exist.

I'm not sure I fully understood you.
The way I understand it, any conversion filter should work between a given native file format and our own bilingual format, and the CAT software should only care about properly translating bilingual format files. If we don't go that way, how do we proceed?
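
As a rough illustration of that split, here is a minimal Python sketch in which filters convert between a native format and a neutral list of segments, and the "CAT" step only ever sees segments. All names (FILTERS, register, pretranslate, round_trip) are hypothetical, and the in-memory list of (source, target) pairs merely stands in for a real bilingual file.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, str]  # (source, target); an empty target means "not translated yet"

# Hypothetical registry: file extension -> (import filter, export filter)
FILTERS: Dict[str, Tuple[Callable[[str], List[Segment]],
                         Callable[[List[Segment]], str]]] = {}


def register(ext, to_bilingual, from_bilingual):
    """Declare a pair of filters for one native format."""
    FILTERS[ext] = (to_bilingual, from_bilingual)


def txt_import(native: str) -> List[Segment]:
    """Toy filter: one segment per line of plain text."""
    return [(line, "") for line in native.splitlines()]


def txt_export(segments: List[Segment]) -> str:
    """Write targets back; fall back to the source when a segment is untranslated."""
    return "\n".join(tgt or src for src, tgt in segments)


register(".txt", txt_import, txt_export)


def pretranslate(segments: List[Segment], memory: Dict[str, str]) -> List[Segment]:
    """The 'CAT' step: it knows nothing about native formats, only segments."""
    return [(src, memory.get(src, tgt)) for src, tgt in segments]


def round_trip(ext: str, native: str, memory: Dict[str, str]) -> str:
    to_bilingual, from_bilingual = FILTERS[ext]
    return from_bilingual(pretranslate(to_bilingual(native), memory))


print(round_trip(".txt", "Hello\nUnknown", {"Hello": "Bonjour"}))
```

Supporting a new native format then means registering one more pair of filters; the translation step stays untouched.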

Of course, the above comes from a restricted mind (mine), in that Open Office already provides a nice bunch of conversion filters.

Therefore, the obvious goal seems to be to introduce our own TU-level and segment-level (within TU) tags within the existing OO Writer file format in a way that will, as much as possible, avoid disturbing OO.

The way we implement these filters is another problem and may somewhat depend on which tool we want to end up with. Performance is not much of an issue; portability is, as well as ease of use, meaning integration within our translation tool, whether it's OmegaT in its present form or directly within OO Writer.


FYI, I'm presently trying to contact people at Sun in order to raise interest about our efforts and to obtain some help.

> On the subject of conversion filters, some initial work was done
> within the Semerkent project, see:

http://sourceforge.net/projects/semerkent/

It looks like they have now merged with http://www.gtranslator.org/.

So - thanks to you, Marc - I just discovered Yet Another Free CAT software project. Am I wrong in assuming some of their efforts might partially overlap with ours? Have you contacted their team yet?


> - though I agree with Simos that scripts are a much more practical
> solution as I believe learning Perl or tcl/tk in order to manipulate
> plain text formats such as XML is within the realms of many
> translators' abilities.
> Learning C for this purpose is a different proposition.

Well, in this case, you probably mean C++, as C does not even have a string type, if I remember correctly.
Another solution for quickly building up filters would be Python.
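
To give an idea of how short such a filter can be, here is a minimal Python sketch of the .po to TMX conversion Marc mentioned. It is an illustration under loud assumptions, not a real filter: it only handles single-line msgid/msgstr entries, does not unescape .po sequences, and the language codes and header values (for instance the "po2tmx-sketch" tool name) are made up.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplification: only single-line entries are recognized.
ENTRY = re.compile(r'^msgid "(?P<src>.*)"\nmsgstr "(?P<tgt>.*)"$', re.MULTILINE)

PO_SAMPLE = r'''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Open file"
msgstr "Ouvrir le fichier"

msgid "Quit"
msgstr ""
'''


def po_pairs(po_text):
    """Yield (source, target) pairs, skipping the header and untranslated entries."""
    for m in ENTRY.finditer(po_text):
        if m.group("src") and m.group("tgt"):
            yield m.group("src"), m.group("tgt")


def po_to_tmx(po_text, src_lang="en", tgt_lang="fr"):
    """Return a TMX document (as a string) with one <tu> per translated entry."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "po2tmx-sketch",      # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": src_lang,
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in po_pairs(po_text):
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


print(po_to_tmx(PO_SAMPLE))
```

A real filter would also need to cope with multi-line strings, plural forms and comments, but the overall shape would stay the same.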


Cheers,

Henri

