From: cvs-commit at developer dot classpath dot org
Subject: [Bug classpath/23008] New: The charset decoder for UnicodeLittle gives wrong results
Date: 12 Aug 2005 00:02:22 -0000
While running the Java Excel API (http://www.andykhan.com/jexcelapi/),
which extracts character strings from an Excel worksheet using the
charset UnicodeLittle, I found a case where the charset decoder
returned broken strings. The problem has two causes.
1. Which byte order to use.
UnicodeLittle is little endian. But if the data to be decoded does not
have a byte order mark, UTF_16Decoder assumes that it is big endian.
Although Sun's documentation says that UnicodeLittle carries a byte order
mark, the default byte order of UnicodeLittle should be little endian.
UnicodeLittle data without a byte order mark seems to be common in practice.
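
To illustrate how much the assumed byte order matters when no byte order
mark is present, here is a minimal sketch. It uses the JDK's standard
"UTF-16LE" and "UTF-16" charsets rather than GNU Classpath's UTF_16Decoder,
and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // "AB" encoded as UTF-16 little endian, with no byte order mark.
    static final byte[] DATA = { 0x41, 0x00, 0x42, 0x00 };

    // Decode with an explicit little-endian charset.
    public static String decodeAsLittleEndian() {
        return new String(DATA, StandardCharsets.UTF_16LE);
    }

    // Decode with "UTF-16", which falls back to big endian
    // when the input does not start with a byte order mark.
    public static String decodeAsUtf16() {
        return new String(DATA, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsLittleEndian());
        // Print the first char as hex to show the swapped bytes.
        System.out.printf("%04X%n", (int) decodeAsUtf16().charAt(0));
    }
}
```

With the little-endian charset the bytes decode to "AB"; with the
big-endian fallback the first character becomes U+4100 instead of U+0041.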
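2. A bug in UTF_16Decoder
UTF_16Decoder.java has something like (char) ((b1 << 8) | b2), where b1
and b2 are of type byte. Java bytes are signed and are sign-extended when
promoted to int. Let b1 be 0xA1 and b2 be 0xB1. Then, in the low 16 bits,
(b1 << 8) | b2 = 0xA100 | 0xFFB1 = 0xFFB1
This is not the expected result, 0xA1B1.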
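
The arithmetic above can be reproduced with a small standalone program.
This is only a sketch of the expression in question, not the decoder
itself; the class and method names are hypothetical.

```java
public class SignExtensionDemo {
    // Combines two bytes without masking. Each byte is sign-extended
    // to int first, so a negative b2 sets all the high bits.
    public static char combineUnmasked(byte b1, byte b2) {
        return (char) ((b1 << 8) | b2);
    }

    // Masks each byte to its low 8 bits before combining.
    public static char combineMasked(byte b1, byte b2) {
        return (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
    }

    public static void main(String[] args) {
        byte b1 = (byte) 0xA1;
        byte b2 = (byte) 0xB1;
        System.out.printf("unmasked: %04X%n", (int) combineUnmasked(b1, b2));
        System.out.printf("masked:   %04X%n", (int) combineMasked(b1, b2));
    }
}
```

For these inputs the unmasked form yields 0xFFB1 and the masked form
yields 0xA1B1. The two forms agree whenever both bytes are below 0x80,
which is why the bug does not show up with plain ASCII text.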
My patch follows.
--- gnu/java/nio/charset/UnicodeLittle.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java Fri Jun 24 14:36:33 2005
@@ -64,7 +64,7 @@
public CharsetDecoder newDecoder ()
{
- return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+ return new UTF_16Decoder (this, UTF_16Decoder.LITTLE_ENDIAN);
}
public CharsetEncoder newEncoder ()
--- gnu/java/nio/charset/UTF_16Decoder.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java Fri Jun 24 15:44:04 2005
@@ -83,7 +83,7 @@
// handle byte order mark
if (byteOrder == UNKNOWN_ENDIAN)
{
- char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+ char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
if (c == BYTE_ORDER_MARK)
{
byteOrder = BIG_ENDIAN;
@@ -105,8 +105,9 @@
}
// FIXME: Change so you only do a single comparison here.
- char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
- : (char) ((b2 << 8) | b1);
+ char c = byteOrder == BIG_ENDIAN ?
+ (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+ (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
if (0xD800 <= c && c <= 0xDFFF)
{
------- Additional Comments From from-classpath at savannah dot gnu dot org
2005-06-27 06:34 -------
> Although Sun's document says that UnicodeLittle is with byte-order mark,
> the default byte order of UnicodeLittle should be little endian.
> UnicodeLittle without byte order mark seems to be a common practice.
Judging from the behavior of Sun's JDK, UnicodeLittle input should be
treated as follows, with or without a byte order mark:
UnicodeLittle with the correct (little-endian) byte order mark:
    Skip the byte order mark and continue, assuming the byte order
    to be little endian.
UnicodeLittle with an incorrect (big-endian) byte order mark:
    The byte sequence is malformed.
UnicodeLittle without a byte order mark:
    Continue, assuming the byte order to be little endian.
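
The three cases can be written down as a small classification of the first
two input bytes. This is a sketch of the policy only, under the assumption
that it is applied once at the start of the input; the class, enum, and
method names are hypothetical and do not appear in GNU Classpath.

```java
public class LittleEndianBomPolicy {
    public enum Outcome { SKIP_BOM, MALFORMED, DECODE_AS_LITTLE_ENDIAN }

    // Classifies the first two bytes of input declared as UnicodeLittle.
    public static Outcome classify(byte b1, byte b2) {
        // Mask to avoid sign extension when comparing against 0xFF/0xFE.
        int first = b1 & 0xFF;
        int second = b2 & 0xFF;
        if (first == 0xFF && second == 0xFE) {
            // U+FEFF in little-endian order: the correct mark, skip it.
            return Outcome.SKIP_BOM;
        }
        if (first == 0xFE && second == 0xFF) {
            // U+FEFF in big-endian order: contradicts the charset.
            return Outcome.MALFORMED;
        }
        // No mark at all: treat the bytes as little-endian data.
        return Outcome.DECODE_AS_LITTLE_ENDIAN;
    }

    public static void main(String[] args) {
        System.out.println(classify((byte) 0xFF, (byte) 0xFE));
        System.out.println(classify((byte) 0xFE, (byte) 0xFF));
        System.out.println(classify((byte) 0x41, (byte) 0x00));
    }
}
```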
Then the patch will be like this:
--- gnu/java/nio/charset/UnicodeLittle.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java Mon Jun 27 14:44:27 2005
@@ -64,7 +64,7 @@
public CharsetDecoder newDecoder ()
{
- return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+ return new UTF_16Decoder (this, UTF_16Decoder.MAYBE_LITTLE_ENDIAN);
}
public CharsetEncoder newEncoder ()
--- gnu/java/nio/charset/UTF_16Decoder.java.orig Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java Mon Jun 27 14:55:04 2005
@@ -54,6 +54,8 @@
static final int BIG_ENDIAN = 0;
static final int LITTLE_ENDIAN = 1;
static final int UNKNOWN_ENDIAN = 2;
+ static final int MAYBE_BIG_ENDIAN = 3;
+ static final int MAYBE_LITTLE_ENDIAN = 4;
private static final char BYTE_ORDER_MARK = 0xFEFF;
private static final char REVERSED_BYTE_ORDER_MARK = 0xFFFE;
@@ -81,32 +83,44 @@
byte b2 = in.get ();
// handle byte order mark
- if (byteOrder == UNKNOWN_ENDIAN)
+ if (byteOrder == UNKNOWN_ENDIAN ||
+ byteOrder == MAYBE_BIG_ENDIAN ||
+ byteOrder == MAYBE_LITTLE_ENDIAN)
{
- char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+ char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
if (c == BYTE_ORDER_MARK)
{
+ if (byteOrder == MAYBE_LITTLE_ENDIAN)
+ {
+ return CoderResult.malformedForLength (2);
+ }
byteOrder = BIG_ENDIAN;
inPos += 2;
continue;
}
else if (c == REVERSED_BYTE_ORDER_MARK)
{
+ if (byteOrder == MAYBE_BIG_ENDIAN)
+ {
+ return CoderResult.malformedForLength (2);
+ }
byteOrder = LITTLE_ENDIAN;
inPos += 2;
continue;
}
else
{
- // assume big endian, do not consume bytes,
+ // assume big or little endian, do not consume bytes,
// continue with normal processing
- byteOrder = BIG_ENDIAN;
+ byteOrder = (byteOrder == MAYBE_LITTLE_ENDIAN ?
+ LITTLE_ENDIAN : BIG_ENDIAN);
}
}
// FIXME: Change so you only do a single comparison here.
- char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
- : (char) ((b2 << 8) | b1);
+ char c = byteOrder == BIG_ENDIAN ?
+ (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+ (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
if (0xD800 <= c && c <= 0xDFFF)
{
------- Additional Comments From cvs-commit at developer dot classpath dot org
2005-08-12 00:02 -------
Subject: Bug 23008
CVSROOT: /cvsroot/classpath
Module name: classpath
Branch:
Changes by: Tom Tromey <address@hidden> 05/08/11 23:51:30
Modified files:
. : ChangeLog
gnu/java/nio/charset: UTF_16Decoder.java
Log message:
For PR classpath/23008:
* gnu/java/nio/charset/UTF_16Decoder.java (decodeLoop): Correctly
mask bytes when constructing characters.
CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.4392&tr2=1.4393&r1=text&r2=text
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/gnu/java/nio/charset/UTF_16Decoder.java.diff?tr1=1.5&tr2=1.6&r1=text&r2=text
--
Summary: The charset decoder for UnicodeLittle gives wrong
results
Product: classpath
Version: unspecified
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: classpath
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: from-classpath at savannah dot gnu dot org
CC: bug-classpath at gnu dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23008