From: cvs-commit at developer dot classpath dot org
Subject: [Bug classpath/23008] New: The charset decoder for UnicodeLittle gives wrong results
Date: 12 Aug 2005 00:02:22 -0000

While running the Java Excel API (http://www.andykhan.com/jexcelapi/),
which extracts character strings from an Excel worksheet using the
charset UnicodeLittle, I found a case where the charset decoder
returned broken strings.  I found two causes of this problem.

1. Which byte order to use.

UnicodeLittle is little endian.  But if the data to be decoded does not
have a byte order mark, UTF_16Decoder assumes that it is big endian.
Although Sun's documentation says that UnicodeLittle comes with a byte
order mark, the default byte order of UnicodeLittle should be little
endian.  UnicodeLittle data without a byte order mark seems to be
common in practice.

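2. A bug in UTF_16Decoder

UTF_16Decoder.java has something like (char) ((b1 << 8) | b2).
Let b1 be 0xA1 and b2 be 0xB1.  Java bytes are signed, so both values
are sign-extended to int before the shift and the OR.  Then, looking
only at the low 16 bits that survive the cast to char,

   (b1 << 8) | b2 = 0xA100 | 0xFFB1 = 0xFFB1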



This is not our expected result: 0xA1B1.
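
To make the sign-extension problem easy to reproduce outside the decoder,
here is a minimal sketch.  ByteCombineDemo and its two methods are
hypothetical names used only for illustration; they are not part of GNU
Classpath.  The first method mirrors the unmasked expression quoted above,
the second masks each byte before combining.

```java
// Hypothetical helper class for illustration; not part of GNU Classpath.
final class ByteCombineDemo {

    // Unmasked form: b1 and b2 are promoted to int with sign extension,
    // so a byte with the high bit set fills the upper bits with ones.
    static char combineUnmasked(byte b1, byte b2) {
        return (char) ((b1 << 8) | b2);
    }

    // Masked form: "& 0xFF" keeps only the low 8 bits of each promoted
    // value, so each byte contributes exactly 8 bits to the result.
    static char combineMasked(byte b1, byte b2) {
        return (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
    }
}
```

With b1 = (byte) 0xA1 and b2 = (byte) 0xB1, combineUnmasked yields 0xFFB1
and combineMasked yields 0xA1B1.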



My patch follows.



--- gnu/java/nio/charset/UnicodeLittle.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java     Fri Jun 24 14:36:33 2005
@@ -64,7 +64,7 @@
 
   public CharsetDecoder newDecoder ()
   {
-    return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+    return new UTF_16Decoder (this, UTF_16Decoder.LITTLE_ENDIAN);
   }
 
   public CharsetEncoder newEncoder ()

--- gnu/java/nio/charset/UTF_16Decoder.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java     Fri Jun 24 15:44:04 2005
@@ -83,7 +83,7 @@
             // handle byte order mark
             if (byteOrder == UNKNOWN_ENDIAN)
               {
-                char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+                char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
                 if (c == BYTE_ORDER_MARK)
                   {
                     byteOrder = BIG_ENDIAN;
@@ -105,8 +105,9 @@
               }
 
            // FIXME: Change so you only do a single comparison here.
-            char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
-                                             : (char) ((b2 << 8) | b1);
+            char c = byteOrder == BIG_ENDIAN ?
+                 (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+                 (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
 
             if (0xD800 <= c && c <= 0xDFFF)
               {




------- Additional Comments From from-classpath at savannah dot gnu dot org  2005-06-27 06:34 -------
> Although Sun's document says that UnicodeLittle is with byte-order mark,
> the default byte order of UnicodeLittle should be little endian.
> UnicodeLittle without byte order mark seems to be a common practice.

Judging from the behavior of Sun's JDK, UnicodeLittle input with or
without a byte order mark should be treated as follows:

  UnicodeLittle with correct byte order mark:
    Ignore the byte order mark and continue assuming the byte order
    to be little endian.

  UnicodeLittle with incorrect byte order mark:
    The byte sequence is malformed.

  UnicodeLittle without byte order mark:
    Continue assuming the byte order to be little endian.
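
The three cases above can be sketched as a small standalone decoder.  This
is only an illustration of the rules, not the Classpath code:
UnicodeLittleSketch and decode are hypothetical names, an
IllegalArgumentException stands in for the decoder's malformed-input
result, and a trailing odd byte is silently ignored to keep the sketch
short.

```java
// Hypothetical standalone sketch of the three rules; not the Classpath decoder.
final class UnicodeLittleSketch {

    static String decode(byte[] in) {
        int pos = 0;
        if (in.length >= 2) {
            // Read the first two bytes in big-endian order, masking each byte.
            int first = ((in[0] & 0xFF) << 8) | (in[1] & 0xFF);
            if (first == 0xFFFE) {
                // Bytes FF FE: the little-endian byte order mark. Skip it.
                pos = 2;
            } else if (first == 0xFEFF) {
                // Bytes FE FF: a big-endian mark, which is wrong for UnicodeLittle.
                // Assumption: an exception stands in for a malformed-input result.
                throw new IllegalArgumentException("malformed: big-endian byte order mark");
            }
            // Anything else: no mark, fall through and decode as little endian.
        }
        StringBuilder sb = new StringBuilder();
        for (; pos + 1 < in.length; pos += 2) {
            // Little endian: the second byte of each pair is the high byte.
            sb.append((char) (((in[pos + 1] & 0xFF) << 8) | (in[pos] & 0xFF)));
        }
        return sb.toString();
    }
}
```

For example, both the bytes 41 00 and the bytes FF FE 41 00 decode to "A",
while FE FF 00 41 is rejected.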



The patch then becomes:



--- gnu/java/nio/charset/UnicodeLittle.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UnicodeLittle.java     Mon Jun 27 14:44:27 2005
@@ -64,7 +64,7 @@
 
   public CharsetDecoder newDecoder ()
   {
-    return new UTF_16Decoder (this, UTF_16Decoder.UNKNOWN_ENDIAN);
+    return new UTF_16Decoder (this, UTF_16Decoder.MAYBE_LITTLE_ENDIAN);
   }
 
   public CharsetEncoder newEncoder ()

--- gnu/java/nio/charset/UTF_16Decoder.java.orig        Tue Apr 19 19:12:23 2005
+++ gnu/java/nio/charset/UTF_16Decoder.java     Mon Jun 27 14:55:04 2005
@@ -54,6 +54,8 @@
   static final int BIG_ENDIAN = 0;
   static final int LITTLE_ENDIAN = 1;
   static final int UNKNOWN_ENDIAN = 2;
+  static final int MAYBE_BIG_ENDIAN = 3;
+  static final int MAYBE_LITTLE_ENDIAN = 4;
 
   private static final char BYTE_ORDER_MARK = 0xFEFF;
   private static final char REVERSED_BYTE_ORDER_MARK = 0xFFFE;
@@ -81,32 +83,44 @@
             byte b2 = in.get ();
 
             // handle byte order mark
-            if (byteOrder == UNKNOWN_ENDIAN)
+            if (byteOrder == UNKNOWN_ENDIAN ||
+                byteOrder == MAYBE_BIG_ENDIAN ||
+                byteOrder == MAYBE_LITTLE_ENDIAN)
               {
-                char c = (char) (((b1 & 0xFF) << 8) | (b2 & 0xFF));
+                char c = (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF));
                 if (c == BYTE_ORDER_MARK)
                   {
+                    if (byteOrder == MAYBE_LITTLE_ENDIAN)
+                      {
+                        return CoderResult.malformedForLength (2);
+                      }
                     byteOrder = BIG_ENDIAN;
                     inPos += 2;
                     continue;
                   }
                 else if (c == REVERSED_BYTE_ORDER_MARK)
                   {
+                    if (byteOrder == MAYBE_BIG_ENDIAN)
+                      {
+                        return CoderResult.malformedForLength (2);
+                      }
                     byteOrder = LITTLE_ENDIAN;
                     inPos += 2;
                     continue;
                   }
                 else
                   {
-                    // assume big endian, do not consume bytes,
+                    // assume big or little endian, do not consume bytes,
                     // continue with normal processing
-                    byteOrder = BIG_ENDIAN;
+                    byteOrder = (byteOrder == MAYBE_LITTLE_ENDIAN ?
+                                 LITTLE_ENDIAN : BIG_ENDIAN);
                   }
               }
 
            // FIXME: Change so you only do a single comparison here.
-            char c = byteOrder == BIG_ENDIAN ? (char) ((b1 << 8) | b2)
-                                             : (char) ((b2 << 8) | b1);
+            char c = byteOrder == BIG_ENDIAN ?
+                 (char) (((b1 & 0x00FF) << 8) | (b2 & 0x00FF)) :
+                 (char) (((b2 & 0x00FF) << 8) | (b1 & 0X00FF));
 
             if (0xD800 <= c && c <= 0xDFFF)
               {


------- Additional Comments From cvs-commit at developer dot classpath dot org  2005-08-12 00:02 -------
Subject: Bug 23008

CVSROOT:        /cvsroot/classpath
Module name:    classpath
Branch:         
Changes by:     Tom Tromey <address@hidden>     05/08/11 23:51:30

Modified files:
        .              : ChangeLog 
        gnu/java/nio/charset: UTF_16Decoder.java 

Log message:
        For PR classpath/23008:
        * gnu/java/nio/charset/UTF_16Decoder.java (decodeLoop): Correctly
        mask bytes when constructing characters.

CVSWeb URLs:
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/ChangeLog.diff?tr1=1.4392&tr2=1.4393&r1=text&r2=text
http://savannah.gnu.org/cgi-bin/viewcvs/classpath/classpath/gnu/java/nio/charset/UTF_16Decoder.java.diff?tr1=1.5&tr2=1.6&r1=text&r2=text






-- 
           Summary: The charset decoder for UnicodeLittle gives wrong
                    results
           Product: classpath
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: classpath
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: from-classpath at savannah dot gnu dot org
                CC: bug-classpath at gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=23008



