Re: [Freecats-Dev] Bilingual File Format (again)

From:	Marc Prior
Subject:	Re: [Freecats-Dev] Bilingual File Format (again)
Date:	Wed, 2 Apr 2003 07:07:55 +0200
----------  Forwarded Message  ----------
Subject: Re: [Freecats-Dev] Bilingual File Format (again)
Date: Wed, 2 Apr 2003 07:04:55 +0200
From: Marc Prior <address@hidden>
To: Henri Chorand <address@hidden>


Hi Henri,

I think the essential problem here is that you are presuming too much about
OmegaT. I understand your reasons for wanting to draw up a bilingual file
format and then use it as the basis for a translation memory application.
But OmegaT simply does not work that way. It has no bilingual file format.

You can think of it like this:

When you create a project, OmegaT reads the source files and creates an
"empty" interim translation memory, i.e. a translation memory consisting of
source strings only (project_save.tmx). During the translation process,
OmegaT reads the original source texts, but writes to the interim translation
memory, progressively adding target strings to it. The user "sees" what
appears to be a translated text, but that text does not exist at that stage
in the form of a single file. When the translation is complete and the user
executes the "Compile" function, OmegaT makes a copy of the source file and
replaces all the source strings with the target strings. It also creates an
"export"
translation memory in TMX format. (Despite its extension, project_save.tmx is
not strictly speaking in TMX format.)

At least, that's how I understand it - Keith will correct me if I'm wrong. If
I'm not mistaken, Deja Vu works in the same way.
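
As a rough illustration of the workflow described above, here is a minimal
Python sketch. It is not OmegaT's actual code: the function names, the use of
a plain dictionary as the interim memory, and the fallback to the source
string for untranslated segments are all assumptions made for the example.

```python
# Hypothetical sketch of the workflow described above; not OmegaT's code.

def create_project(source_segments):
    """Build the "empty" interim memory: every source string, no targets."""
    return {segment: None for segment in source_segments}


def translate(memory, source, target):
    """Record a target string. The source text itself is never modified."""
    if source not in memory:
        raise KeyError(f"unknown source segment: {source!r}")
    memory[source] = target


def compile_target(source_segments, memory):
    """Copy the source and swap in target strings where they exist.

    Leaving untranslated segments in the source language is an assumption.
    """
    return [memory.get(segment) or segment for segment in source_segments]


source = ["Hello", "Goodbye"]
memory = create_project(source)
translate(memory, "Hello", "Bonjour")
print(compile_target(source, memory))
```

The point of the sketch is that only the memory is ever written to; the
"translated document" is produced on demand from the source plus the memory.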

Now, with regard to bilingual file formats, you need to consider the reason
for them. There are at least three possible reasons.

The first is your reason: to present a text in a form which enables a
translation memory application to work on it. As you see, not all translation
memory applications follow this model.

The second is to make documents portable during the translation process. As
those of us who are translators know, having portable translation memories
(TMX) is often not sufficient. It is often extremely useful to be able to
pass a document that has been, or is being, translated with a TM to another
party before it is finished, so that the other party can revise it before it
is cleaned up (Trados terminology) or compiled (OmegaT terminology).

The third reason is to be able to produce documents which are inherently
bilingual (or even multilingual). As an illustration of the potential offered
by XML, such a document could contain, for example, the same text in both
English and French. The French reader's word processor would only display the
French text, the American reader's word processor would only display the
English text, and so XML helps to break down cultural imperialism and a war
between America and France is averted. :-))  The O'Reilly book on XML
actually gives this case as an example (without the bit about the war):

Reserved XML attribute xml:lang

xml:lang="iso_639 identifier"

And then, in the text itself:

<para xml:lang="en">Hello</para>
<para xml:lang="fr">Bonjour</para>
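
A reader-side filter for such a document could be sketched as follows, using
Python's standard xml.etree.ElementTree module. The <doc> wrapper element and
the function name are assumptions made for the example; only the <para>
elements and the xml:lang attribute come from the snippet above.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is bound to this namespace by the XML specification,
# so ElementTree exposes xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def paragraphs_for(xml_text, lang):
    """Return the text of every <para> whose xml:lang equals `lang`."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iter("para") if p.get(XML_LANG) == lang]


# Hypothetical wrapper element around the two paragraphs shown above.
doc = """<doc>
<para xml:lang="en">Hello</para>
<para xml:lang="fr">Bonjour</para>
</doc>"""

print(paragraphs_for(doc, "fr"))
```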

This is an extremely neat mechanism with fantastic new possibilities. The
danger, though, is that by concentrating on the possibilities, by envisaging
a bright new world in which everything is neat and tidy, we forget the actual
task in hand. The three reasons I gave for a bilingual file format are quite
different. From OmegaT's perspective, the first and third reasons are
completely irrelevant. The second reason, i.e. to enable OmegaT to deliver
"uncleaned" files, is, I suggest, very important but not absolutely crucial.
Also, although an XML solution such as that described in the O'Reilly book
is actually far neater and more forward-thinking, it remains impractical as
long as most customers and colleagues are using MS Word. It would almost
certainly be more practical first to support the RTF file format in OmegaT,
and then to implement a solution similar to Wordfast's.

> I'm not sure I fully got you.
> The way I understand it, any conversion filter should work between a
> given native file format and our own bilingual format, and the CAT
> software should only care about properly translating bilingual format
> files. If we don't go that way, how do we do?

You are correct - you didn't understand me :-). What I am saying is that a
conversion filter between .po and TMX would enable mainstream translators
(people like you and me) to work on open-source localization projects whilst
still using tools more familiar to us. No disrespect to Stanislav, but KBabel
and other open-source localization tools are extremely unsuitable for
mainstream translators, for a number of reasons. I appreciate their
advantages for GUI localization, but expecting mainstream translators to use
KBabel to translate, for example, documents in HTML is like expecting an
open-source hacker to use MS Word to translate his source code.

As a move to bring the two communities closer together, for the benefit of
both, I am holding to the principle that translators should be allowed and
encouraged to use whatever tools they like - the open-source community should
only be interested in the result. So, if a tool such as OmegaT is capable of
reading and delivering data in all the necessary formats, whilst at the same
time being user-friendly from a translator's perspective, there should be no
objection to a translator using it. In order for that to be the case with
OmegaT, it needs to be able to handle .po files.
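
To make the idea of a .po-to-TMX conversion filter concrete, here is a
minimal Python sketch. It is an illustration only, not an existing tool: the
function names and the "po2tmx-sketch" tool name are invented, the header
attributes follow my reading of TMX 1.4, and the parser deliberately ignores
plural forms, msgctxt and escape sequences.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ENTRY_LINE = re.compile(r'^(msgid|msgstr)\s+"(.*)"$')


def parse_po(text):
    """Return (msgid, msgstr) pairs from simple .po text.

    Assumption: no plurals, no msgctxt, and escapes are left as written.
    """
    entries, current, field = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        match = ENTRY_LINE.match(line)
        if match:
            field = match.group(1)
            if field == "msgid":  # a msgid starts a new entry
                current = {"msgid": "", "msgstr": ""}
                entries.append(current)
            current[field] = match.group(2)
        elif field and len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            current[field] += line[1:-1]  # continuation line
        else:
            field = None  # comment, blank line or unsupported keyword
    # The entry with an empty msgid is the .po header; skip it.
    return [(e["msgid"], e["msgstr"]) for e in entries if e["msgid"]]


def po_to_tmx(po_text, source_lang, target_lang):
    """Build a TMX document (as a string) from .po text."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "po2tmx-sketch",  # hypothetical tool name
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "po",
        "adminlang": "en",
        "srclang": source_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for msgid, msgstr in parse_po(po_text):
        tu = ET.SubElement(body, "tu")
        source = ET.SubElement(tu, "tuv", {XML_LANG: source_lang})
        ET.SubElement(source, "seg").text = msgid
        if msgstr:  # untranslated entries get a source segment only
            target = ET.SubElement(tu, "tuv", {XML_LANG: target_lang})
            ET.SubElement(target, "seg").text = msgstr
    return ET.tostring(tmx, encoding="unicode")
```

A call such as po_to_tmx(open("messages.po", encoding="utf-8").read(), "en",
"fr") would then yield a memory that a TMX-aware tool could load; the file
name here is only an example. The reverse direction (writing target strings
back into the .po file) would be needed as well and is not shown.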

> Therefore, the obvious goal seems to be able to introduce our own
> TU-level and segment-level (within TU) tags within the existing OO
> Writer file format in a way that will, as much as possible, avoid
> disturbing OO.

Well, this goal has already been reached. I have completed several projects
with OmegaT and have not noticed any corruption whatsoever of OOo's markup
tags. Whilst I can understand the argument that there may be a better
implementation, I would certainly not be in favour of re-designing it at this
stage when there are much more important issues to be addressed. For example,
we are currently concentrating on getting OmegaT to support the widest
possible range of alphabets, and we actually have users (Japanese and
Macedonian, for example) who have been unable to use OmegaT for some time for
this reason.

> So - thanks to you, Marc - I just discovered Yet Another Free CAT
> software project. Am I wrong in assuming some of their efforts might
> partially overlap our ones? Did you contact their team yet?

I had some contact with the Semerkent project maintainer over two years ago.
Look at

www.marcprior.de/linux/tm.html

- you will find links there to some other open-source TM projects.

> Well, in this case, you probably mean C++, as C does not even have a
> string type, if I remember well.

I think the Semerkent site says C, but as far as I'm concerned they are both
"horriblement difficile" or "difficilement horrible", if you see what I mean.

Regards,
Marc
