bug-gnu-libiconv

Re: [bug-gnu-libiconv] Support question: libiconv on system with glibc?


From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Support question: libiconv on system with glibc?
Date: Thu, 5 Feb 2009 01:09:55 +0100
User-agent: KMail/1.9.9

Hello,

Russell McOrmond wrote:
>    I have an environment where I would like to separate off as much of our 
> application into a chroot() environment as possible.  We figured that 
> using the separate libiconv would help, so that we didn't need to bring 
> into the chroot() environment all of glibc (IE: /usr/lib/gconv , etc).
> 
>    I have been having a problem getting libiconv to work in this 
> environment.  This is a RedHat Enterprise 4 machine (glibc 2.3.4)

Note that glibc uses the /usr/lib/gconv/ directory not only for iconv()
but also for locales that have the specified encoding. If your program,
at some point (for example in order to sort strings for a Japanese user)
sets the locale to ja_JP.EUC-JP, glibc will need access to
/usr/lib/gconv/EUC-JP.so.

But if the only locales that your program uses are the "C" locale and some
UTF-8 locales, then your approach to use libiconv instead of glibc's
iconv is workable.

> trying to compile libiconv 1.12.

Remember that building and installing libiconv on a glibc system is an
unusual situation. (It works, and is supported, but is not the normal
case.)

>    We have data that is encoded in UTF-16 which we are outputting in UTF-8 
> (very simple transcode), inserted into an HTML template.
> 
> The relevant part should be output in UTF-8 as:
> 
> <td>Chernozémique</td>
> 
> (Note the accented e)
> 
> Here is the test using 'od' to show the UTF-8 encoding when using the 
> glibc version of the iconv functions.
> 
> -bash-3.00$ sh ~/test-mapserv.sh | od -c -j247 -N23
> 0000367   <   t   d   >   C   h   e   r   n   o   z 303 251   m   i   q
> 0000407   u   e   <   /   t   d   >
> 0000416
> 
> And here is what happens when I use the libiconv version.
> 
> -bash-3.00$ export LD_PRELOAD=/server/downloads/src/libiconv-1.12/lib/preloadable_libiconv.so
> -bash-3.00$ sh ~/test-mapserv.sh | od -c -j247 -N48
> 0000367   <   t   d   > 344 214 200 346 240 200 346 224 200 347 210 200

$ printf '\344\214\200\346\240\200\346\224\200\347\210\200' \
    | iconv -f UTF-8 -t UCS-4LE \
    | hexdump -e '"%06.6_ax  " 4/4 "%08X "' -e '"\n"'
000000  00004300 00006800 00006500 00007200

So the characters that are being output are U+4300, U+6800, etc. instead of
U+0043, U+0068 etc.
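
The same check can be done without iconv and hexdump. Here is a minimal
Python sketch (my own illustration, using only the bytes from the od dump
above) that decodes them as UTF-8 and then swaps the two bytes of each
16-bit value:

```python
# Bytes copied from the od dump above (octal 344 214 200 ...).
dumped = bytes([0o344, 0o214, 0o200, 0o346, 0o240, 0o200,
                0o346, 0o224, 0o200, 0o347, 0o210, 0o200])

# Decode as UTF-8 and list the resulting code points.
code_points = [ord(ch) for ch in dumped.decode("utf-8")]
print(["U+%04X" % cp for cp in code_points])
# prints ['U+4300', 'U+6800', 'U+6500', 'U+7200']

# Swap the high and low byte of each 16-bit value.
swapped = [((cp & 0xFF) << 8) | (cp >> 8) for cp in code_points]
recovered = "".join(chr(cp) for cp in swapped)
print(recovered)
# prints Cher
```

After the swap the text reads "Cher", the beginning of the expected
"Chernozémique", which suggests that the 16-bit units were read with the
wrong byte order.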

>    In case anyone is curious how iconv is being called, the relevant code 
> is here: 
> http://trac.osgeo.org/mapserver/browser/trunk/mapserver/mapstring.c#L1504
> 
>    The variable 'encoding' on input is set to "UTF-16" , so this is a 
> simple conversion from UTF-16 to UTF-8.

"UTF-16" is ambiguous. It is better to use UTF-16LE or UTF-16BE, depending on
the endianness of your machine.
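
To illustrate why the byte order has to be named explicitly, here is a small
Python sketch. It uses Python's own codecs ("utf-16-le", "utf-16-be",
"utf-16"), not iconv, so it only shows the principle; how a particular iconv
implementation resolves a plain "UTF-16" is not shown here.

```python
text = "Cher"

# The same characters, serialized with the two possible byte orders.
le = text.encode("utf-16-le")   # 43 00 68 00 65 00 72 00
be = text.encode("utf-16-be")   # 00 43 00 68 00 65 00 72
print(le.hex(), be.hex())

# Reading little-endian data as big-endian yields different characters.
misread = le.decode("utf-16-be")
print(["U+%04X" % ord(ch) for ch in misread])
# prints ['U+4300', 'U+6800', 'U+6500', 'U+7200']

# With an explicit byte order on both sides, the round trip is exact.
assert le.decode("utf-16-le") == text

# Python's plain "utf-16" encoder prepends a byte order mark so that a
# reader can tell the two cases apart.
with_bom = text.encode("utf-16")
print(with_bom[:2].hex())
```

The misread code points are exactly the ones seen in the libiconv output
above.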

But actually in your code the input is not encoded in UTF-16, it is a sequence
of wchar_t values. wchar_t values are not necessarily Unicode at all; on
Solaris or FreeBSD, for example, they aren't. To convert from/to wchar_t using
libiconv or glibc, use the encoding name "wchar_t".
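
If you want to see what a wchar_t buffer looks like in memory on a given
machine, a quick way is Python's ctypes module. This is only a sketch for
inspecting the layout on the current platform; it says nothing about
platforms where wchar_t is not Unicode.

```python
import ctypes

text = "Ch"

# An array of wchar_t holding the text plus a terminating NUL.
buf = ctypes.create_unicode_buffer(text)
width = ctypes.sizeof(ctypes.c_wchar)
raw = bytes(buf)

print("sizeof(wchar_t) =", width)
print("wchar_t buffer  =", raw.hex())
print("UTF-16LE bytes  =", text.encode("utf-16-le").hex())
```

On a glibc system wchar_t is 4 bytes wide, so the buffer is not UTF-16 data
and should not be passed to iconv under that name.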

Btw (off-topic), in
 http://trac.osgeo.org/mapserver/browser/trunk/mapserver/mapstring.c#L1152
you have a very bad hash function: Strings which differ in 2 characters will
often lead to the same hash code. For example, the strings
  "A000000000000000X"
  "B000000000000000W"
  "C000000000000000V"
  "D000000000000000U"
  "E000000000000000T"
  "F000000000000000S"
  "G000000000000000R"
  "H000000000000000Q"
  "I000000000000000P"
  "J000000000000000O"
  "K000000000000000N"
  "L000000000000000M"
  "M000000000000000L"
will all yield the same hash code. This can ruin the performance of an
application, see <http://www.haible.de/bruno/hashfunc.html>. Remember that
a hash table is no longer O(1) for each access if the elements are not
approximately equidistributed across the hash buckets.
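
To make the effect concrete, here is a Python sketch with a deliberately weak
hash function. Note that additive_hash and the bucket count of 64 are my own
hypothetical stand-ins, not the function from mapstring.c; a plain sum of
character codes is just the simplest hash that collides on strings of the
shape listed above.

```python
from collections import Counter

def additive_hash(s, buckets=64):
    """Hypothetical weak hash: sum of character codes modulo bucket count."""
    return sum(ord(c) for c in s) % buckets

# Strings of the same shape as above: the first letter goes up by one
# while the last letter goes down by one.
keys = [chr(ord("A") + i) + "0" * 15 + chr(ord("X") - i) for i in range(13)]

codes = {additive_hash(k) for k in keys}
chain_lengths = Counter(additive_hash(k) for k in keys)

print(len(keys), "keys,", len(codes), "distinct hash code(s)")
print("longest chain:", max(chain_lengths.values()))
```

All 13 keys land in one bucket, so looking up any of them means walking a
chain of 13 entries instead of about one.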

Bruno



