Re: [gNewSense-users] Cyrillic presentation in gNS wiki

From: Sam Geeraerts
Subject: Re: [gNewSense-users] Cyrillic presentation in gNS wiki
Date: Thu, 19 Feb 2009 21:38:04 +0100
User-agent: Thunderbird (X11/20090105)

Karl Goetz wrote:
On Sun, 15 Feb 2009 11:41:36 +0100
Sam Geeraerts <address@hidden> wrote:

Sam Geeraerts wrote:
Sam Geeraerts wrote:
Dmitri Gabinski wrote:

Hi both.

When trying to edit Russian wiki pages (via Firefox 3.1 beta2, if that matters), I encounter the following problem: Cyrillic
characters are replaced with HTML numeric character references, turning into
chains such as &#1044;&#1083;&#1103; &#1091;&#1076;&#1072;&#1083;&#1077;&#1085;&#1080;&#1103; &#1080;. Editing is far too labor-intensive and you cannot, for
example, use spell check.

Look, it’s the XXI century, why not use Unicode?

PmWiki is configured to use ISO-8859-1, because that's its default
configuration (and I suspect people with funny encodings weren't on
Brian's mind when he set it up :P)

The problem is that the wiki is served with a charset of
ISO-8859-1 in the HTTP headers. So all the content up until now
has been entered in that encoding. If the server configuration
were changed to UTF-8, all the content would have to be
converted to UTF-8 as well.
I did some research: apparently the conversion can be done with
recode [1].

Thanks for looking into this.

There's also a PmWiki recipe to convert input on the fly [2], but I think it's only useful if the content is already in UTF-8. It seems intended to catch input from a browser that is forced to another encoding (or one that can't handle UTF-8).


We seem to have two options with PmWiki when it comes to which charset to use.
Here's a snippet from our config:

$WikiTitle = 'PmWiki';
$Charset = 'ISO-8859-1';
$HTTPHeaders = array(
  "Expires: Tue, 01 Jan 2002 00:00:00 GMT",
  "Cache-Control: no-store, no-cache, must-revalidate",
  "Content-type: text/html; charset=ISO-8859-1;");
$CacheActions = array('browse','diff','print');

I can change either or both of these, but I'm not sure what the
consequences would be ...

Grmbl, Charset is not documented (yet) [1]. I would have added a placeholder as suggested, but I'm not sure if I'm supposed to do that in [1] or in [2].

Anyway, I grepped through the code and it looks like $Charset is the encoding used in the meta element (or XML declaration). So both $Charset and $HTTPHeaders should be changed after a conversion. I don't know much about PHP, but it seems more sensible to reuse $Charset in $HTTPHeaders. If that is valid, then a bug report is in order.
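
To make the reuse idea concrete, here is a minimal sketch of what the config snippet quoted above might look like if the charset were defined once and interpolated into the header. This is untested against our install. The 'UTF-8' value is only an example of what would be set after the pages have been converted, and the sketch assumes nothing else in PmWiki overrides the Content-type header later on.

```php
<?php
// Hypothetical config fragment, not the current gNewSense wiki config.
// 'UTF-8' is an example value for use only after the page content has
// been converted from ISO-8859-1.
$Charset = 'UTF-8';
$HTTPHeaders = array(
  "Expires: Tue, 01 Jan 2002 00:00:00 GMT",
  "Cache-Control: no-store, no-cache, must-revalidate",
  // PHP expands $Charset inside a double-quoted string, so the HTTP
  // header and the meta element cannot drift apart.
  "Content-type: text/html; charset=$Charset");
```

Note that $Charset has to be assigned before $HTTPHeaders is built, because the string is expanded at the point where the array is created.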

I also stumbled upon some interesting comments to consider before using UTF-8 (in scripts/xlpage-utf-8.php):

    This script configures PmWiki to use utf-8 in page content and
    pagenames.  There are some unfortunate side effects about PHP's
    utf-8 implementation, however.  First, since PHP doesn't have a
    way to do pattern matching on upper/lowercase UTF-8 characters,
    WikiWords are limited to the ASCII-7 set, and all links to page
    names with UTF-8 characters have to be in double brackets.
    Second, we have to assume that all non-ASCII characters are valid
    in pagenames, since there's no way to determine which UTF-8
    characters are "letters" and which are punctuation.

