bug-classpath
[bug-classpath] [bug #13532] The charset decoder for UnicodeLittle gives wrong results

From: Ito Kazumitsu
Subject: [bug-classpath] [bug #13532] The charset decoder for UnicodeLittle gives wrong results
Date: Fri, 24 Jun 2005 07:23:36 +0000
User-agent: w3m/0.5.1

URL:
  <http://savannah.gnu.org/bugs/?func=detailitem&item_id=13532>

                 Summary: The charset decoder for UnicodeLittle gives wrong results
                 Project: classpath
            Submitted by: itokaz
            Submitted on: Fri 06/24/05 at 07:23
                Category: classpath
                Severity: 3 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
        Platform Version: None

    _______________________________________________________

Details:

While running Java Excel API (http://www.andykhan.com/jexcelapi/),
which extracts character strings from an Excel worksheet using the
charset UnicodeLittle, I found a case where the charset decoder
returned broken strings.  I found two causes of this problem.

1. Which byte order to use

UnicodeLittle is little endian.  But if the data to be decoded does not
have a byte order mark, UTF_16Decoder assumes that it is big endian.
Although Sun's documentation says that UnicodeLittle comes with a
byte order mark, the default byte order of UnicodeLittle should still be
little endian, because UnicodeLittle data without a byte order mark
seems to be common in practice.
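
To illustrate why the default byte order matters, here is a minimal
sketch using the standard JDK charsets rather than the Classpath
classes.  The class and method names (ByteOrderDemo, decodeLittleEndian,
decodeWithBomDetection) are made up for this example.  The input is
"AB" in little-endian UTF-16 without a byte order mark.

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    // Fixed little-endian byte order; no byte order mark is required.
    static String decodeLittleEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    // BOM detection; the JDK's UTF-16 charset falls back to big endian
    // when the input does not start with a byte order mark.
    static String decodeWithBomDetection(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "AB" as little-endian UTF-16, no byte order mark.
        byte[] data = { 0x41, 0x00, 0x42, 0x00 };

        System.out.println(decodeLittleEndian(data));

        // Print code points so the result does not depend on the console encoding.
        for (char c : decodeWithBomDetection(data).toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

With big endian as the fallback, the same four bytes come out as
U+4100 U+4200 instead of "AB", which is the kind of broken string
described above.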

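2. Sign extension bug in UTF_16Decoder

UTF_16Decoder.java combines two bytes with an expression like
(char) ((b1 << 8) | b2).  Java bytes are signed, so a byte of 0x80 or
above is sign-extended when it is promoted to int.  Let b1 be 0xA1 and
b2 be 0xB1.  Then, looking at the low 16 bits that survive the cast,

   (b1 << 8) | b2 = 0xA100 | 0xFFB1 = 0xFFB1

This is not the expected result, 0xA1B1.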
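
The arithmetic above can be reproduced with a small standalone
program.  This is only a sketch; SignExtensionDemo and its two helper
methods are names invented for the example and are not part of
Classpath.

```java
final class SignExtensionDemo {
    // Mirrors the unpatched expression: both bytes are promoted to int
    // with sign extension before they are combined.
    static char combineUnmasked(byte hi, byte lo) {
        return (char) ((hi << 8) | lo);
    }

    // Masks each byte to its low 8 bits first, as the patch does.
    static char combineMasked(byte hi, byte lo) {
        return (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
    }

    public static void main(String[] args) {
        byte b1 = (byte) 0xA1;
        byte b2 = (byte) 0xB1;
        System.out.printf("unmasked: 0x%04X%n", (int) combineUnmasked(b1, b2)); // 0xFFB1
        System.out.printf("masked:   0x%04X%n", (int) combineMasked(b1, b2));   // 0xA1B1
    }
}
```

Only a low-order byte of 0x80 or above triggers the wrong result.  The
extra one bits from a sign-extended high-order byte are shifted out of
the low 16 bits and dropped by the cast to char.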

My patch follows.

--- gnu/java/nio/charset/UnicodeLittle.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java     Fri Jun 24 14:36:33 2005
@@ -64,7 +64,7 @@
 
   public CharsetDecoder newDecoder ()
   {
-    return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+    return new UTF_16Decoder (this, UTF_16Decoder.LITTLE_ENDIAN);
   }
 
   public CharsetEncoder newEncoder ()

--- gnu/java/nio/charset/UTF_16Decoder.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java     Fri Jun 24 15:44:04 2005
@@ -83,7 +83,7 @@
             // handle byte order mark
             if (byteOrder == UNKNOWN_ENDIAN)
               {
-                char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+                char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
                 if (c == BYTE_ORDER_MARK)
                   {
                     byteOrder = BIG_ENDIAN;
@@ -105,8 +105,9 @@
               }
 
            // FIXME: Change so you only do a single comparison here.
-            char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
-                                             : (char) ((b2 << 8) | b1);
+            char c = byteOrder == BIG_ENDIAN ?
+                 (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+                 (char) (((b2 & 0x00FF) << 8) | (b1 & 0x00FF));
 
             if (0xD800 <= c && c <= 0xDFFF)
               {
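
For readers who want to try the combined effect of both changes outside
Classpath, the following is a simplified sketch and not the actual
UTF_16Decoder.  The class Utf16Sketch, the decode method and its
defaultLittleEndian parameter are hypothetical.  It skips surrogate
validation and assumes the whole input is available in one array.

```java
final class Utf16Sketch {
    static final char BYTE_ORDER_MARK = '\uFEFF';
    static final char REVERSED_BYTE_ORDER_MARK = '\uFFFE';

    // Decodes UTF-16 bytes. A leading byte order mark decides the byte
    // order; otherwise the caller-supplied default is used.
    static String decode(byte[] data, boolean defaultLittleEndian) {
        if (data.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes");
        }
        boolean littleEndian = defaultLittleEndian;
        int start = 0;
        if (data.length >= 2) {
            // Read the first unit as big endian, masking to avoid sign extension.
            char first = (char) (((data[0] & 0xFF) << 8) | (data[1] & 0xFF));
            if (first == BYTE_ORDER_MARK) {
                littleEndian = false;
                start = 2;
            } else if (first == REVERSED_BYTE_ORDER_MARK) {
                littleEndian = true;
                start = 2;
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = start; i < data.length; i += 2) {
            int b1 = data[i] & 0xFF;     // masked: always in 0..255
            int b2 = data[i + 1] & 0xFF;
            out.append((char) (littleEndian ? (b2 << 8) | b1 : (b1 << 8) | b2));
        }
        return out.toString();
    }
}
```

Calling decode(new byte[] { (byte) 0xB1, (byte) 0xA1 }, true) yields the
single character U+A1B1, the value expected in the example above.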
