bug-gnu-utils

Re: xgettext and Windows newlines (CRLF) in multi-line source


From: Adrien Morel
Subject: Re: xgettext and Windows newlines (CRLF) in multi-line source
Date: Sun, 22 Aug 2010 17:17:15 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.9.2.8) Gecko/20100802 Lightning/1.0b2pre Thunderbird/3.1.2

Hi Bruno!

On 15/08/2010 02:28, Bruno Haible wrote:

Discussions on mailing lists take place with plain-text mail. Please avoid
sending HTML formatted mails to mailing lists. There is surely an option
for this in Thunderbird.

Sorry about the HTML. Thunderbird unfortunately handles this rather oddly (you have to Shift-click on "Write" or "Reply", which I didn't know).

Thanks for the report. So, what you are saying is that:

     When a source file in PHP syntax has Windows line endings, then
     newlines in string literals are encoded as CR LF, but when the source
     file has Unix line endings, then newlines in string literals are
     encoded as LF.

As far as PHP's internal processing is concerned, yes. PHP does not internally convert CRLF newlines into LF. But I think this is intended. You probably know that better than I do.

I consider this a flaw in the design of PHP, because
   1) For more than 10 years, the Unicode consortium recommends that on
      input, CR LF and LF should be treated the same. See
      <http://www.unicode.org/reports/tr13/tr13-9.html>

But to me that means something else. As I understand it, PHP should not convert CRLF to LF, but should treat them the same way. The same applies to the gettext extension's code: it should find the "Hello you.\r\nWelcome!" string in the catalog even though the entry contains "Hello you.\nWelcome!".

   2) PHP is used mainly for web programming, and it makes no sense for a
      web application to behave differently whether the programmer wrote
      his programs on a Windows or on a Unix machine, or whether the server
      is running on a Windows or on a Unix machine.

Absolutely.

Because of this guideline, to treat CR LF and LF the same, strings in POT files
usually contain \n as newline marker. Usually - when the source file is using
Unix newlines, and xgettext is running on a Unix machine, or when the source
file is using Windows newlines, and xgettext is running on a Windows machine.

Strings in a POT file could contain LF, CR, or CRLF; that should not change anything, because they should all be treated as the same entity, if I understand correctly.

Currently, however, for a file with Windows newlines and xgettext running on a
Unix machine, the resulting POT file will contain \r\n as newline marker
inside strings. This may be considered a bug, but before I fix it, it would
be good to have an official statement about this issue from the PHP people.
I cannot find anything on this topic in
<http://www.php.net/manual/en/langref.php>. In this situation, xgettext also
emits warnings:
   warning: internationalized messages should not contain the `\r' escape sequence
The reason is that translator tools are supposed to work with \n and not with
\r\n.

I see two possible solutions for your problem:
   a) PHP should be fixed so that newlines in string literals are '\n',
      independent of the platform.

I guess that could cause many PHP programs to stop working, since many developers are not aware of this and rely on the presence of CRLF markers to detect newlines. I know it's a bad habit, but it's a fact.

   b) The PHP gettext function family
      <http://de2.php.net/manual/en/book.gettext.php>
      gets changed to preprocess CR LF into LF in the argument string
      before looking up the translation.

That is, I'm convinced, the best solution. It means treating all newlines the same way, and developers would no longer have to worry about having LF newlines on every platform.

For each of these solutions, you should report a bug at <http://bugs.php.net/>.

I'll do it, at least for the second one.

Other than that, there are two workarounds:
   c) You change the line terminator conventions of your source files.
      Many text editors on Windows nowadays support this.

Sure, Notepad++, which I use, does that without any problem, but I am only the one receiving the files; I am not in a position to ask for this change.
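For what it's worth, workaround (c) could also be applied in bulk to the files as they are received. Below is a minimal sketch in Python (not PHP, purely for illustration); the helper names crlf_to_lf and convert_tree and the "*.php" pattern are hypothetical choices of mine. Note that the converted files would have to be the ones actually deployed, otherwise the strings seen at run time would still contain CR LF.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CR LF pair with a single LF; lone CR bytes are left alone."""
    return data.replace(b"\r\n", b"\n")


def convert_tree(root, pattern="*.php"):
    """Rewrite in place every file under root that matches pattern and
    contains CR LF. Returns the list of paths that were modified."""
    changed = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        original = path.read_bytes()  # bytes, so no encoding is assumed
        converted = crlf_to_lf(original)
        if converted != original:
            path.write_bytes(converted)
            changed.append(path)
    return changed
```

Working on bytes avoids guessing the source encoding, and running the function a second time changes nothing.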

   d) You convert CR LF to LF in all strings before you call gettext, using
      string operations <http://www.php.net/manual/en/book.strings.php>.

There are about 20,000 strings in the code, so that would mean changing every _("...") call into a double call. I could simply define a new function, __() for example, which does this replacement and then calls _().
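The idea of such a wrapper can be sketched as follows. This is Python rather than PHP, only to show the logic; the dictionary catalog, the translate function and the name make_lookup are stand-ins I made up for the real catalog and the real _() call. Only CR LF is replaced, as in workaround (d).

```python
def normalize_newlines(msgid: str) -> str:
    """Turn CR LF into LF so the lookup key no longer depends on the
    line endings of the source file."""
    return msgid.replace("\r\n", "\n")


def make_lookup(translate):
    """Wrap a translation function so that it always receives LF-only keys."""
    def lookup(msgid: str) -> str:
        return translate(normalize_newlines(msgid))
    return lookup


# Stand-in catalog: the entry uses \n, as a POT/PO file normally would.
catalog = {"Hello you.\nWelcome!": "Bonjour.\nBienvenue !"}


def translate(msgid: str) -> str:
    """Stand-in for _(): exact-match lookup, falling back to the msgid."""
    return catalog.get(msgid, msgid)


# Plays the role of the proposed __() function.
wrapped = make_lookup(translate)
```

With this, translate("Hello you.\r\nWelcome!") misses the catalog entry and returns its argument unchanged, while wrapped("Hello you.\r\nWelcome!") finds the translation.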

Well, thank you for your time. I'll report any further news here.

Adrien


