[bug-classpath] [bug #13532] The charset decoder for UnicodeLittle gives
From: Ito Kazumitsu
Subject: [bug-classpath] [bug #13532] The charset decoder for UnicodeLittle gives wrong results
Date: Fri, 24 Jun 2005 07:23:36 +0000
User-agent: w3m/0.5.1
URL:
<http://savannah.gnu.org/bugs/?func=detailitem&item_id=13532>
Summary: The charset decoder for UnicodeLittle gives wrong
results
Project: classpath
Submitted by: itokaz
Submitted on: Fri 06/24/05 at 07:23
Category: classpath
Severity: 3 - Normal
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Platform Version: None
_______________________________________________________
Details:
While running Java Excel API (http://www.andykhan.com/jexcelapi/),
which extracts character strings from an Excel worksheet using the
charset UnicodeLittle, I found a case where the charset decoder
returned broken strings. I found two causes of this problem.
1. Which byte order to use.
UnicodeLittle is little endian. But if the data to be decoded has no
byte order mark, UTF_16Decoder assumes that it is big endian.
Although Sun's documentation says that UnicodeLittle comes with a byte order
mark, the default byte order of UnicodeLittle should be little endian.
UnicodeLittle data without a byte order mark seems to be common in practice.
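2. Sign extension bug in UTF_16Decoder
UTF_16Decoder.java has something like (char) ((b1 << 8) | b2).
Java bytes are signed, so b1 and b2 are sign-extended to int before the
shift and the OR. Let b1 be (byte) 0xA1 and b2 be (byte) 0xB1. Then,
keeping only the low 16 bits,
(b1 << 8) | b2 = 0xA100 | 0xFFB1 = 0xFFB1
This is not the expected result, 0xA1B1.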
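The arithmetic above can be reproduced in isolation. The following is only a minimal sketch: the class SignExtensionDemo and its two methods are hypothetical names made up for illustration and are not part of GNU Classpath. It compares the unmasked expression with a masked one.

```java
// Illustration only: SignExtensionDemo and its methods are hypothetical names,
// not code from GNU Classpath.
final class SignExtensionDemo {
    // Combines two bytes without masking, mirroring (char) ((b1 << 8) | b2).
    // Both bytes are sign-extended to int first, so a negative b2 sets all
    // high bits of the result.
    static char combineUnmasked(byte b1, byte b2) {
        return (char) ((b1 << 8) | b2);
    }

    // Masks each byte to its low 8 bits before combining.
    static char combineMasked(byte b1, byte b2) {
        return (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
    }

    public static void main(String[] args) {
        byte b1 = (byte) 0xA1;
        byte b2 = (byte) 0xB1;
        // prints "unmasked: 0xFFB1"
        System.out.printf("unmasked: 0x%04X%n", (int) combineUnmasked(b1, b2));
        // prints "masked:   0xA1B1"
        System.out.printf("masked:   0x%04X%n", (int) combineMasked(b1, b2));
    }
}
```

The problem only shows up when the low-order byte is 0x80 or above, which is probably why plain ASCII text decodes correctly.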
My patch follows.
--- gnu/java/nio/charset/UnicodeLittle.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java Fri Jun 24 14:36:33 2005
@@ -64,7 +64,7 @@
public CharsetDecoder newDecoder ()
{
- return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+ return new UTF_16Decoder (this, UTF_16Decoder.LITTLE_ENDIAN);
}
public CharsetEncoder newEncoder ()
--- gnu/java/nio/charset/UTF_16Decoder.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java Fri Jun 24 15:44:04 2005
@@ -83,7 +83,7 @@
// handle byte order mark
if (byteOrder == UNKNOWN_ENDIAN)
{
- char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+ char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
if (c == BYTE_ORDER_MARK)
{
byteOrder = BIG_ENDIAN;
@@ -105,8 +105,9 @@
}
// FIXME: Change so you only do a single comparison here.
- char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
- : (char) ((b2 << 8) | b1);
+ char c = byteOrder == BIG_ENDIAN ?
+ (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+ (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
if (0xD800 <= c && c <= 0xDFFF)
{
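The effect of the byte order default (cause 1) can be illustrated with the standard java.nio charsets. This is a rough sketch under assumptions: ByteOrderDemo is a hypothetical class name, and the standard UTF-16 and UTF-16LE charsets stand in for a decoder that defaults to big endian and one that defaults to little endian. It does not exercise the UnicodeLittle charset itself.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: ByteOrderDemo is a hypothetical name. The standard UTF-16
// and UTF-16LE charsets are used as stand-ins, not UnicodeLittle itself.
final class ByteOrderDemo {
    // "AB" encoded as little-endian UTF-16 without a byte order mark.
    static final byte[] LITTLE_ENDIAN_NO_BOM = { 0x41, 0x00, 0x42, 0x00 };

    // UTF-16 falls back to big endian when no byte order mark is present.
    static String decodeAssumingBigEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    // UTF-16LE always reads the low-order byte first.
    static String decodeAsLittleEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String wrong = decodeAssumingBigEndian(LITTLE_ENDIAN_NO_BOM);
        String right = decodeAsLittleEndian(LITTLE_ENDIAN_NO_BOM);
        // The big-endian guess yields U+4100 U+4200 instead of "AB".
        System.out.printf("big endian guess: U+%04X U+%04X%n",
                (int) wrong.charAt(0), (int) wrong.charAt(1));
        System.out.println("little endian:    " + right);
    }
}
```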
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?func=detailitem&item_id=13532>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/