From: Paolo Bonzini
Subject: Re: {Spam?} Why string should be collection of single byte characters? (WAS: Re: [Help-smalltalk] [Q] Unicode String?)
Date: Fri, 07 Jul 2006 17:59:06 +0200
User-agent: Thunderbird 1.5.0.4 (Macintosh/20060530)
> I DO think that strlen is not for Unicode (actually, multi-byte encoded) strings and is bad design: it is limited to single-byte encodings.

I think it's different from this. strlen counts bytes; to count characters you step through the string with mbrlen. In Smalltalk, #size returns allocation units: only if we stored everything in UTF-32 (no, UTF-16 would not suffice) would this mean characters.
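To make the difference between allocation units and characters concrete, here is a minimal sketch in Python (not Smalltalk, and not anything from GNU Smalltalk itself). The helper name unit_counts and the sample string are assumptions chosen for illustration. It counts the units the same text occupies in UTF-8, UTF-16 and UTF-32, and compares them with the number of code points.

```python
def unit_counts(text: str) -> dict:
    """Count code points and allocation units per encoding (hypothetical helper)."""
    return {
        "code_points": len(text),                           # Python str is indexed by code point
        "utf8_bytes": len(text.encode("utf-8")),            # 1 to 4 bytes per code point
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 1 or 2 16-bit units per code point
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # always 1 32-bit unit per code point
    }


# Sample: 'a', e-acute, euro sign, and U+1D11E (outside the Basic Multilingual Plane)
sample = "a\u00e9\u20ac\U0001d11e"
print(unit_counts(sample))
```

Only the UTF-32 count matches the number of code points here. The last character needs a surrogate pair in UTF-16, so a size measured in 16-bit units would overcount it.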
> I DO think that modern languages should consider Unicode strings. I DO think Smalltalk is MODERN :-)

I do think that modern languages should support Unicode, and you're right that GNU Smalltalk (mostly) does not. I don't think they should dismiss byte-based character encodings like UTF-8. These should remain the primary representation in my opinion, especially if, as in UTF-8, you have no problem finding the first byte of a character (unlike JIS-0212 or GB-2312) and no need for escape sequences (unlike ISO-2022).
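The point about finding the first byte of a character can be sketched as follows, again in Python and purely as an illustration. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so from any byte offset you can step back to the start of the character without scanning from the beginning. The function names are hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Step back from index to the first byte of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def count_chars(data: bytes) -> int:
    """Count characters by counting the bytes that start one."""
    return sum(1 for b in data if not is_continuation(b))


encoded = "a\u20acb".encode("utf-8")  # bytes: 61 E2 82 AC 62
print(char_start(encoded, 2))         # offset 2 is inside the euro sign, which starts at 1
print(count_chars(encoded))           # 3 characters in 5 bytes
```

Both operations only look at individual bytes, with no decoder state and no escape sequences to track.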
Paolo