From: Paolo Bonzini
Subject: Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Help-smalltalk] [Q] Unicode String?)
Date: Sun, 09 Jul 2006 16:23:58 +0200
User-agent: Thunderbird 1.5.0.4 (Macintosh/20060530)


I'm working on it in my spare time; I attach my current prototype patch.
I have almost completed this. It's only about 400 lines of new code, mostly in i18n/Sets.st. I have defined a new UnicodeString class, and modified Character to support characters whose Unicode code point is > 255. For ease of testing and usage, I've also defined a syntax $<279> that lets you refer to a Character by its code point (not just an ASCII value). It's equivalent to "279 asCharacter" -- I could have instead inlined this at compile time, but I prefer to also have a more compact syntax.

The changes are mostly backwards compatible, but characters should *not* be compared with ==, but with = unless you're sure the code point is <= 255. Similarly, they should *not* be printed with nextPut:, but with display:, unless you're sure the code point is <= 127.
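
One way to see where thresholds like 255 and 127 can come from: a single byte holds values up to 255, and in UTF-8 only code points up to 127 occupy exactly one byte. The sketch below is Python rather than Smalltalk, purely as an illustration of those two facts; it makes no claim about how the patch is implemented, and the helper names are invented for the example.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


def fits_in_one_byte(code_point: int) -> bool:
    """True if the code point's value can be stored in a single byte."""
    return 0 <= code_point <= 255


# Print: code point, whether it fits in a byte, and its UTF-8 length.
for cp in (65, 127, 128, 255, 279):
    print(cp, fits_in_one_byte(cp), utf8_length(cp))
```

Code point 279 fails both checks: it does not fit in one byte, and its UTF-8 form is two bytes long.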


What follows are some use cases. This is in a UTF-8 locale, but (subject to the capabilities of your system's iconv function) it works just as well in any other locale.

I am no expert on the *needs* of people using Unicode, so can you please confirm that it is (close to) what you need? In particular, I'd like feedback on what to do when transcoding is not enabled, because right now the behavior is inconsistent: see the notes preceded by ***.

Without the I18N package, the behavior is incomplete: you can store Unicode characters, but not print them correctly:

Printing a Unicode character:
st> $<279> printNl!
$<16r0117>

Converting a Unicode character to String:
*** maybe should consider returning '?'
st> $<279> asString printNl!
error: Invalid argument <16r0117>: argument must be between $<0> and $<16r00FF>

Converting a Unicode character to a UTF-32 String:
st> ($<279> asUnicodeString) printNl!
'<16r0117>'

Converting a UTF-32 String with a Unicode character to a byte-encoded String:
*** maybe should give an error instead
st> $<279> asUnicodeString asString printNl!
'?'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1
st> $<279> asUnicodeString asString numberOfCharacters printNl!
error: should not be implemented in this class

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
error: should not be implemented in this class
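
The two *** notes above come down to a choice between substituting '?' and raising an error when a character has no representation in the target encoding. As a point of comparison only, Python (not Smalltalk, and unrelated to the patch) exposes both policies through the `errors` argument of `str.encode`. The helper name `to_bytes` is made up for this sketch.

```python
def to_bytes(text: str, encoding: str, lossy: bool) -> bytes:
    """Encode text; substitute '?' if lossy, otherwise raise on failure."""
    return text.encode(encoding, errors="replace" if lossy else "strict")


e_dot = chr(279)  # U+0117, not representable in Latin-1

# Substitution policy: the unencodable character becomes '?'.
print(to_bytes(e_dot, "latin-1", lossy=True))

# Error policy: the same conversion raises instead of losing data.
try:
    to_bytes(e_dot, "latin-1", lossy=False)
except UnicodeEncodeError as exc:
    print("error:", exc.reason)
```

Whichever policy is chosen, applying the same one to both the Character and the UnicodeString conversions would remove the inconsistency noted above.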

-----


After loading the I18N package, everything is much better:

Printing a Unicode character:
st> $<279> printNl!
$ė

Converting a Unicode character to String:
st> $<279> asString printNl!
'ė'

Converting a Unicode character to a UTF-32 String, and then back just by printing it:
st> ($<279> asUnicodeString) printNl!
'ė'

Converting a UTF-32 String with a Unicode character to a byte-encoded String:
st> $<279> asUnicodeString asString printNl!
'ė'

Asking the resulting Strings for their number of characters:
st> $<279> asUnicodeString numberOfCharacters printNl!
1

st> $<279> asUnicodeString asString numberOfCharacters printNl!
1

Converting ByteArrays or Strings to UnicodeStrings:
st> #[196 151] asUnicodeString first printNl!
$ė

st> #[196 151] asUnicodeString size printNl!
1

st> #[196 151] asUnicodeString numberOfCharacters printNl!
1
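
The last three results follow from the UTF-8 encoding itself: bytes 196 and 151 (16rC4 16r97) are the two-byte UTF-8 form of code point 279 (16r0117). Here is a small Python check of that arithmetic, offered as an illustration rather than as anything the patch does internally; `decode_two_byte_utf8` is a name invented for the example.

```python
def decode_two_byte_utf8(lead: int, trail: int) -> int:
    """Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand.

    Only the bit layout is checked; overlong forms are not rejected.
    """
    assert lead & 0b11100000 == 0b11000000, "not a two-byte lead byte"
    assert trail & 0b11000000 == 0b10000000, "not a continuation byte"
    # Five payload bits from the lead byte, six from the trail byte.
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


code_point = decode_two_byte_utf8(196, 151)
print(code_point)

# Cross-check against the standard decoder: two bytes, one character.
text = bytes([196, 151]).decode("utf-8")
print(len(text), ord(text))
```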

Paolo




