bug-gnu-utils

Re: xgettext and Windows newlines (CRLF) in multi-line source


From: Bruno Haible
Subject: Re: xgettext and Windows newlines (CRLF) in multi-line source
Date: Sun, 15 Aug 2010 02:28:52 +0200
User-agent: KMail/1.9.9

Hi,

Adrien Morel wrote:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>

Discussions on mailing lists take place in plain-text mail. Please avoid
sending HTML-formatted mails to mailing lists. There is surely an option
for this in Thunderbird.

>  In a PHP file in Windows format (CR+LF line ending), some long texts are
>  split into lines, but still made of only one string (no quote at end of
>  lines). here is an example:  
 
>  $text = _("This is a too
>  long text so I wrote
>  it on several lines.");
 
>  Not shown here, the newlines are CR+LF, as it is a Windows file. When
>  xgettext parses the source code, it reports the following: 
 
>  msgid ""
>  "This is a too\n"
>  "long text so I wrote\n"
>  "it on several lines."
 
>  Which isn't correct. And indeed, it doesn't work when gettext look for
>  the string in .mo files, it's not found, and the original string is
>  printed instead of the translation. The content of the catalog should be:  
 
>  msgid ""
>  "This is a too\r\n"
>  "long text so I wrote\r\n"
>  "it on several lines."
 
>  I tried it that way (I msgfmt'ed this content and restarted the server),
>  and it worked!

Thanks for the report. So, what you are saying is that:

    When a source file in PHP syntax has Windows line endings, then
    newlines in string literals are encoded as CR LF, but when the source
    file has Unix line endings, then newlines in string literals are
    encoded as LF.
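
To make that statement concrete, here is a minimal sketch that one could run
to check it. It writes the same one-statement PHP source twice, once with LF
and once with CR LF line endings, includes each file, and prints the bytes of
the resulting string in hex. The helper name literal_from_source is made up
for this illustration, and I am assuming that PHP keeps the raw bytes of the
source file inside string literals, which is what the report describes.

```php
<?php
// The same PHP source, differing only in its line endings.
// The newline inside the quoted string is a literal line break in the file.
$unix    = "<?php return \"line one\nline two\";\n";
$windows = str_replace("\n", "\r\n", $unix);

// Hypothetical helper: write the source to a temporary file, include it,
// and return the value of the string literal it contains.
function literal_from_source(string $source): string
{
    $path = tempnam(sys_get_temp_dir(), 'nl');
    file_put_contents($path, $source);
    $value = include $path;
    unlink($path);
    return $value;
}

// Print both literals as hex so that the CR (0d) and LF (0a) bytes are visible.
echo bin2hex(literal_from_source($unix)), "\n";
echo bin2hex(literal_from_source($windows)), "\n";
```

If the two hex dumps differ only by an extra 0d before each 0a, the two
literals are different msgids as far as a catalog lookup is concerned.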

I consider this a flaw in the design of PHP, because
  1) For more than 10 years, the Unicode Consortium has recommended that on
     input, CR LF and LF should be treated the same. See
     <http://www.unicode.org/reports/tr13/tr13-9.html>
  2) PHP is used mainly for web programming, and it makes no sense for a
     web application to behave differently depending on whether the
     programmer wrote his programs on a Windows or on a Unix machine, or
     whether the server is running on a Windows or on a Unix machine.

Because of this guideline to treat CR LF and LF the same, strings in POT files
usually contain \n as newline marker. "Usually" here means: when the source
file uses Unix newlines and xgettext is running on a Unix machine, or when the
source file uses Windows newlines and xgettext is running on a Windows machine.

Currently, however, for a file with Windows newlines and xgettext running on a
Unix machine, the resulting POT file will contain \r\n as newline marker
inside strings. This may be considered a bug, but before I fix it, it would
be good to have an official statement about this issue from the PHP people.
I cannot find anything on this topic in
<http://www.php.net/manual/en/langref.php>. In this situation, xgettext also
emits warnings:
  warning: internationalized messages should not contain the `\r' escape sequence
The reason is that translator tools are supposed to work with \n and not with
\r\n.

I see two possible solutions for your problem:
  a) PHP should be fixed so that newlines in string literals are '\n',
     independent of the platform.
  b) The PHP gettext function family
     <http://de2.php.net/manual/en/book.gettext.php>
     should be changed to convert CR LF into LF in the argument string
     before looking up the translation.
For each of these solutions, you should report a bug at <http://bugs.php.net/>.

Other than that, there are two workarounds:
  c) You change the line terminator convention of your source files to
     Unix newlines (LF). Many text editors on Windows nowadays support this.
  d) You convert CR LF to LF in all strings before you call gettext, using
     string operations <http://www.php.net/manual/en/book.strings.php>.
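
Workaround d) could look roughly like the following sketch. The function
names normalize_newlines and gettext_lf are hypothetical, not part of PHP or
of the gettext extension. The optional $lookup parameter is also an invention
of this sketch: it defaults to PHP's gettext() and exists only so that the
wrapper can be tried without an installed catalog.

```php
<?php
// Hypothetical helper: replace every CR LF pair with a single LF.
function normalize_newlines(string $s): string
{
    return str_replace("\r\n", "\n", $s);
}

// Hypothetical wrapper: normalize the msgid, then look up the translation.
// $lookup defaults to PHP's gettext(), which requires the gettext extension.
function gettext_lf(string $msgid, ?callable $lookup = null): string
{
    $lookup = $lookup ?? 'gettext';
    return $lookup(normalize_newlines($msgid));
}
```

You would then call gettext_lf() instead of _() around the multi-line
string. Note that xgettext only extracts strings from function names it knows
as keywords, so the wrapper has to be announced to it, for example with
--keyword=gettext_lf. The catalog must also contain the msgid with \n rather
than \r\n for the lookup to match.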

Bruno


