bug-gnu-libiconv

Re: [bug-gnu-libiconv] iconv Bug report


From: 刘军民
Subject: Re: [bug-gnu-libiconv] iconv Bug report
Date: Tue, 19 Jun 2007 10:39:44 +0800

Dear Bruno,
        Thank you for your reply.
1. The byte sequence 0xa3 0xa0 is indeed not valid GBK, and there are many
similar byte sequences, for example 0xa140, 0xfe65, etc.
I wrote a PHP script that prints every byte sequence in the GBK double-byte
range that iconv is unable to convert:
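<?php
// Print every byte pair in the GBK double-byte range that iconv cannot convert.
for ($a = 0x81; $a <= 0xfe; $a++)            // lead byte
{
        for ($b = 0x40; $b <= 0xfe; $b++)    // trail byte
        {
                if ($b == 0x7f) continue;    // 0x7f is never a trail byte
                $cnchar = chr($a).chr($b);
                // iconv() returns false (and raises a notice) on failure
                $r = @iconv('gbk', 'utf-8', $cnchar);
                if ($r === false || $r === '')
                {
                        printf("%02x%02x\n", $a, $b);
                }
        }
}
?>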

2. Although 0xa3 0xa0 and the other sequences are not valid GBK, they often
appear on Chinese systems. Many Chinese systems treat an undefined GBK byte
sequence as a double-byte space (QuanJiao, i.e. full-width, space).

3. If a text contains an undefined GBK byte sequence, converting it with iconv
produces wrong text. The reason is that GBK is a double-byte charset, but iconv
skips only the first byte of an undefined double-byte sequence. The second byte
and the byte that follows it are then converted together as one double-byte GBK
character.
        The attached file test.txt is converted incorrectly.
Both "iconv -f gbk -t utf-8 test.txt" and "iconv -f gb18030 -t utf-8 test.txt"
show the problem.

4. My idea is to convert undefined GBK byte sequences to the double-byte space
(U+3000). The attached file iconv-gbk.patch is a simple patch I wrote.
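
The difference between skipping one byte and skipping the whole pair can be
illustrated with a small sketch. This is not libiconv's code and not the
attached patch: decode_gbk and its skip/replacement parameters are hypothetical
names, and Python's built-in "gbk" codec stands in for the conversion table, so
its coverage may differ from libiconv's.

```python
def decode_gbk(data: bytes, skip: int, replacement: str = "") -> str:
    """Decode GBK two bytes at a time (hypothetical helper).

    On an undefined or truncated double-byte sequence, emit `replacement`
    and advance by `skip` bytes before trying again.
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:                      # single-byte ASCII
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        try:
            if len(pair) < 2:
                raise ValueError("truncated double-byte sequence")
            out.append(pair.decode("gbk"))      # strict: raises if undefined
            i += 2
        except ValueError:                      # UnicodeDecodeError is a ValueError
            out.append(replacement)
            i += skip
    return "".join(out)


sample = bytes([0xA3, 0xA0, 0xB0, 0xA1])        # byte sequence from the report

# Skip only the first byte: 0xA0 0xB0 is then read as one character,
# and the trailing 0xA1 is left without a partner.
misaligned = decode_gbk(sample, skip=1)

# Skip the whole pair and substitute U+3000, as point 4 proposes.
resynced = decode_gbk(sample, skip=2, replacement="\u3000")

print(repr(misaligned))
print(repr(resynced))
```

With skip=2 the decoder stays aligned, so 0xB0 0xA1 is decoded as its own
character after the substituted space. A real patch would presumably also have
to check that the second byte is a legal trail byte before consuming it, which
this sketch does not do.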

Liu Junmin 

-----Original Message-----
From: Bruno Haible [mailto:address@hidden 
Sent: June 19, 2007 6:35
To: 刘军民; address@hidden
Subject: Re: [bug-gnu-libiconv] iconv Bug report

Dear 刘,

>          I found a bug in libiconv: converting GBK to UTF-8 or UCS-2 with
> libiconv is likely to produce wrong text.
> 
>          Example: the GBK-encoded text “0xa3 0xa0 0xb0 0xa1”

The byte sequence 0xa3 0xa0 is not valid GBK.

To find out the encoding of this byte sequence, you can unpack a libiconv
distribution, and in the tests/ directory you find the conversion tables
for most supported character sets. When I do

   $ cd tests
   $ grep ^0xA3A0 *.TXT

I obtain the result:

   CP949.TXT:0xA3A0        0xC9DB
   GB18030-BMP.TXT:0xA3A0  0xE5E5

This means that 0xa3 0xa0 is valid in CP949 - but this is Korean, hence
not your case - and valid GB18030. So, if you specify "GB18030" instead of
"GBK", it should work.

For more details about Chinese character sets, see
     http://www.haible.de/bruno/charsets/conversion-tables/Chinese.html
For advice regarding labelling of text, see
     http://www.haible.de/bruno/charsets/advice.html

>          I’m from China and my English is poor, so I can’t write in detail.

You write understandable English, no problem. Maybe a dictionary, or a
translation tool like
     http://babelfish.altavista.com/
     http://www.google.de/language_tools
can help you be more expressive in English.

Bruno

Attachment: test.txt
Description: Text document

Attachment: iconv-gbk.patch
Description: Binary data

