Re: Wide and UTF-8 international characters
From: D. Stimits
Subject: Re: Wide and UTF-8 international characters
Date: Sat, 17 May 2003 16:25:21 -0600
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021018

I'm still trying to think ahead on my project, so I'm going to ask based
on what I've read, but not tested (at least not with ncurses).
...
>If I am using just a console or xterm, without ncurses, I can output
>the full 8 bit characters as described in html 8-bit entities, echoed
>directly to a console (not with ncurses or any lib), such as "©",
>and get the copyright symbol that is like a 'c' inside of a circle (it
>happens that to echo this I echo an uninterpreted 169 decimal, typecast
>to char). So current terminals, whether console or X11, use the full 8
generally true. But in BSD curses the 8th bit was stripped off each
character and used as a flag telling that implementation whether to use
standout mode to highlight it.
>bits to create their display. If the eighth bit is being used by curses,
>then the top 128 characters are lost to standout mode ability. On the
>other hand, if ncurses uses a separate byte (16 bits in all) to store
more than 8 bits, actually.
So it sounds like the 8th bit is no longer used as a flag...is that
correct? But also that 1 or more bytes are then added with each
character cell to provide attribute data...is that correct?
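
To make the two layouts I am asking about concrete, here is a minimal sketch in C. Every name in it (old_pack, struct cell8, ATTR_STANDOUT and so on) is hypothetical and made up for illustration; this is not ncurses' actual data layout, only the contrast between stealing the 8th bit and keeping attributes in separate storage.

```c
#include <stdint.h>

/* Hypothetical layouts for illustration only; not ncurses' real types. */

/* Old style: low 7 bits hold the character, bit 8 flags standout. */
#define OLD_STANDOUT 0x80u
#define OLD_CHARMASK 0x7Fu

uint8_t old_pack(unsigned char ch, int standout)
{
    /* Anything above 127 loses its top bit here. */
    return (uint8_t)((ch & OLD_CHARMASK) | (standout ? OLD_STANDOUT : 0u));
}

unsigned char old_char(uint8_t cell)  { return (unsigned char)(cell & OLD_CHARMASK); }
int           old_standout(uint8_t cell) { return (cell & OLD_STANDOUT) != 0; }

/* Separate attribute storage: all 8 character bits survive. */
#define ATTR_STANDOUT 0x01u

struct cell8 {
    unsigned char ch;    /* full 8-bit character code */
    uint8_t       attrs; /* attribute flags kept apart from the character */
};

struct cell8 cell8_make(unsigned char ch, uint8_t attrs)
{
    struct cell8 c = { ch, attrs };
    return c;
}
```

With the packed form, 169 cannot round-trip because its top bit is the flag; with the separate field it is stored unchanged alongside the standout attribute.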
>characteristics, while leaving the full 8 bits to display output, then
>ncurses can display the full 255 character entity set (html entity set)
>simply by sending the character straight to the terminal. I'm not
>positive, but this should include the full UTF-8 set, which is only
>single-byte. Is ncurses storing attribute in a separate byte already? Or
the problem with that, is that it doesn't mix well with treating the screen
as an array of characters. You _could_ store each row as a multibyte string
(with some pain achieved at the right margin), but it would require counting
or some index added to point to a character which starts at a given column.
Instead, the common approach stores multiple characters for each array
position - some storage is wasted, but it's accessed more rapidly.
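
As I understand the trade-off described above, it could be sketched roughly like this in C. The function and struct names are hypothetical, and the sketch assumes valid UTF-8 with exactly one screen column per code point (no double-width or combining characters), which a real implementation could not assume.

```c
#include <stddef.h>
#include <wchar.h>

/* Row kept as a UTF-8 string: locating a column means scanning from the
 * start of the row and counting characters.
 * Assumes valid UTF-8 and one column per code point. */
size_t utf8_offset_of_column(const char *row, size_t column)
{
    size_t offset = 0;
    while (column > 0 && row[offset] != '\0') {
        offset++;
        /* Skip continuation bytes, which look like 10xxxxxx. */
        while (((unsigned char)row[offset] & 0xC0) == 0x80)
            offset++;
        column--;
    }
    return offset; /* byte offset where that column's character starts */
}

/* Row kept as fixed-size cells: some storage is wasted on plain ASCII,
 * but a column is a direct array index. */
struct wcell {
    wchar_t  ch;    /* character, always stored wide */
    unsigned attrs; /* attribute flags, separate from the character */
};

wchar_t cell_char_at(const struct wcell *row, size_t column)
{
    return row[column].ch;
}
```

For a row holding 'a', the two-byte copyright sign, then 'b', the third column starts at byte offset 3 in the string form, while in the cell form it is simply element 2.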
I assume that the actual character then is always converted to a wide
character, even if it is just common text not requiring a wide character
(because it is easier to deal with uniform wide characters than
varying-width multibyte representations with escape sequences to mark
character set changes). How many bytes does the current ncurses use to
store non-attribute character data? I would guess two 8-bit bytes
internally per cell.
>is it the way of the old book description, with 7 bits for character,
>and the last bit for standout mode flagging? If a separate byte is used
>already, then it would seem that multibyte characters already have the
>"infrastructure" to be plugged into ncurses. [FYI, it would be rather
>useful to see an entity substitution ability, like "©" in html]
>
>Pardon my curiosity, lately I've been looking at some non-7-bit ascii
>clients, but the clients support only 8 bit, not multibyte characters. I
>created a lightweight XML style data tree storage mechanism that uses
>XML/html entities to represent characters that cannot be easily entered
>via a keyboard, and it turned out to be far more flexible/useful than I
>thought at first. I remember seeing some of the development ncurses
>branch as partial or initial support for the wide characters, and I
that was up til mid-2001 - I didn't quite know where to begin at rewriting,
but one of the contributors got it moving. ncurses 5.3 was good enough to
use - the current code probably has isolated bugs, but I don't see any
that are related to wide-characters. Not all functions are tested - so
I've been reviewing, adding test-programs for places that are noticeably
not covered.
Currently on Linux, I could display a copyright symbol ('c' inside of a
circle) by outputting 169 decimal cast as character (8 bits) to the
terminal. I'm looking at the man page for echochar, and it appears that
ncurses came up with its own version of something similar to html/xml
character entities, but the ncurses version is not as complete as
html/xml entities. If I were to use a printw function with a %c format,
feeding it 169 decimal (or anything from 128 through 255), will ncurses
ever represent the output appearance differently than had I fed that
decimal number (cast as 8 bit character) directly to a standard linux
console or xterm?
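
One part of this does not depend on ncurses at all: the single byte 169 is the copyright symbol only on a terminal using an 8-bit encoding such as ISO-8859-1. A terminal running in UTF-8 mode expects the two bytes 0xC2 0xA9 for the same symbol. The sketch below shows that encoding step in C; utf8_encode_small is a hypothetical helper name, and it deliberately handles only code points below U+0800.

```c
#include <stddef.h>

/* Encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written (1 or 2), or 0 if cp is out of the
 * range this small sketch handles. */
size_t utf8_encode_small(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                  /* plain 7-bit ASCII */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
        return 2;
    }
    return 0;
}
```

Feeding it 169 yields 0xC2 followed by 0xA9, so the raw byte and the UTF-8 form differ for everything in the 128 through 255 range.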
D. Stimits, stimits AT attbi DOT com