fastcgipp-users

Re: [Fastcgipp-users] UTF-8 POST value


From: Eddie Carle
Subject: Re: [Fastcgipp-users] UTF-8 POST value
Date: Tue, 16 Mar 2010 23:31:19 -0600

On Tue, 2010-03-16 at 13:10 -0600, Axel von Bertoldi wrote:
> When using narrow characters (i.e. <char>) I can replicate Alexey's
> problem regardless of the request method (GET or POST), the content
> type (application/x-www-form-urlencoded or multipart/form-data), or
> the method of retrieving the variable (requestVarGet or directly
> accessing the contents of Environment::Posts). In all these cases the
> correct string is returned, but its width is 2 instead of 1 as
> expected.
> 
> Not sure where exactly the problem is here, but I guess it's because Ñ
> can't be represented using one narrow character. Eddie, is this
> correct?

This is actually standard behaviour. The C++ standard knows nothing of
utf8 at this point. std::string::size() returns the number of char
elements (bytes) in the string, not the number of characters. When
dealing with utf8 it is often better to use wchar_t wide characters, as
the data is code converted once to fixed-size characters instead of
variable-size ones. There are other string classes
built for using utf8 internally like in Qt, but I'm not a fan of the
overhead associated with code converting every time you want to check
the size of a string or index it. Better to do it once at the beginning.
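
To make the size-versus-characters distinction concrete, here is a minimal sketch. The helper utf8Length is hypothetical (it is not part of fastcgi++ or the standard library) and it assumes the input is already valid utf8. It counts characters by skipping continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count utf8 code points by counting every byte that is
// NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes valid utf8 input; no validation is performed.
std::size_t utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
    {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the utf8 encoding of Ñ, which is the two bytes 0xC3 0x91, std::string("\xC3\x91").size() is 2 while utf8Length("\xC3\x91") is 1. Note that this walks the whole string on every call, which is the per-call overhead mentioned above.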

> When using wide characters (<wchar_t>), it's a different story: In all
> but one of the above described combinations, the correct value and
> length are returned (on my computer at least). The exception is when
> the request method is POST and the content type is
> application/x-www-form-urlencoded, in this case garbage is returned
> (when retrieving the data in either way).

The wide character template of fastcgi++ only really works properly if
the input data is utf8. The default for web transmission is actually
iso8859. iso8859 is a fixed 8-bit character set, so there is no need to
use wide characters with it. The Ñ character and the other accented
Latin characters each fit in a single byte, so no variable-size
characters are needed.
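
As a small illustration of the fixed-width point, the sketch below assumes the data is specifically ISO-8859-1 (Latin-1), where each byte value equals the Unicode code point of the character. The function latin1ToWide is a hypothetical helper, not something fastcgi++ provides.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen ISO-8859-1 bytes to wchar_t.
// Assumption: the input really is ISO-8859-1, so each byte value is already
// the Unicode code point and one byte always yields one wide character.
std::wstring latin1ToWide(const std::string& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out.push_back(static_cast<wchar_t>(c));
    return out;
}
```

In ISO-8859-1 the Ñ character is the single byte 0xD1, so both the narrow string "\xD1" and the result of latin1ToWide("\xD1") have a size of 1.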

I haven't looked at any code so I am just taking a guess but I bet the
urlencoded data is actually iso8859 encoded. There is no problem
converting that to wide character unicode until you start using
non-ascii iso8859 characters. This is of course because utf8 is ascii
compatible. Ascii can be utf8 but iso8859 can't. Well, the ascii part of
iso8859 can, but not the special characters. The charToString function
would get messed up if it ran into non-utf8 characters when called with
wchar_t, because it code converts from utf8 to some sort of wide
character unicode like utf32 or utf16.
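
The failure mode can be sketched with a simple strict decoder. This is not fastcgi++'s charToString; decodeUtf8 is a hypothetical stand-in that converts utf8 to utf32 and throws on structurally malformed input (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a utf8 -> utf32 code converter.
// Throws std::runtime_error on a bad lead byte, a truncated sequence,
// or a missing continuation byte.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid utf8 lead byte");

        if (i + extra >= in.size())
            throw std::runtime_error("truncated utf8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(in[i + k]);
            if ((next & 0xC0) != 0x80)
                throw std::runtime_error("invalid utf8 continuation byte");
            cp = (cp << 6) | (next & 0x3F);
        }

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Given the utf8 bytes 0xC3 0x91 this yields the single code point U+00D1 (Ñ). Given the iso8859 bytes "a\xD1o", the byte 0xD1 looks like the start of a two-byte utf8 sequence, the following 'o' is not a continuation byte, and the decoder throws. A converter that does not check for this would produce garbage instead.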

> The problem in this case is occurring in fillPostsUrlEncoded, but may
> point to somewhere else. fillPostsUrlEncoded is short and basically
> copies the post data into a string as follows:
> 
>         std::basic_string<charT> queryString;
>         boost::scoped_array<char> buffer(new char[size]);
>         memcpy (buffer.get(), data, size);
>         charToString (buffer.get(), size, queryString);
>         doFillPostsUrlEncoded(queryString);
> 
> I think the problem might be in charToString (or my use of it) as
> that's where the data is corrupted. Eddie, any thoughts WRT this? Will
> do further testing.

As I said above, this is very likely due to charToString trying to code
convert non-utf8 data.

> Alexey, I'm not sure what to suggest other than to make sure that,
> when you define your Request class, the template parameter is
> wchar_t. Like so

One thing to add: If you are going to use the wchar_t template parameter
as Axel suggests, be sure all data coming in is utf8. You can't mix and
match character encodings and expect the code converter to figure it
out. If you are going to mix and match character sets, leave it as char
as that will make the library completely ignore the encoding of the
textual data. You can use koi8-r, iso8859, ascii, utf8, whatever. Just be
aware that it will stay as that encoding and the string objects will not
care if it is utf8 or how many bytes a "character" is when it tells you
how big it is or you try to index it.
--
        Eddie Carle
        
        This message has been signed with an RFC4880 signature. It is
        guaranteed to have originated from Eddie Carle and its contents
        can be validated against its signature.
