Re: utf8 char display in buffer

help-gnu-emacs

From:	Xah Lee
Subject:	Re: utf8 char display in buffer
Date:	Fri, 12 Jun 2009 17:35:11 -0700 (PDT)
User-agent:	G2/1.0

On Jun 12, 3:23 pm, ken <geb...@mousecar.com> wrote:
> On 06/12/2009 01:53 PM Xah Lee wrote:
>
> > On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote:
> >> B) It would be helpful if the code which does the decoding of a file and
> >> renders it into the buffer display, if that part of it would throw an
> >> error message when it encounters a character it doesn't know how to
> >> display, i.e., when a little box character is displayed. After all,
> >> isn't it an error when a little box is displayed in lieu of the correct
> >> character? Possible error messages would be something like: "decoding
> >> process can't find /path/to/charset.file" or "decoding process doesn't
> >> have requisite permission to read /path/to/charset.file" or "invalid
> >> character: [hex/decimal value]" or other.
>
> > some thought process in the above is not correct.
>
> Yet emacs puts a little box in the place of a character it cannot find
> (or, per your explanation) possibly confused about.  The fact remains
> that the little box is not a correct rendering of the code.  It is an
> error... at least it is for me, because that's not what I typed in.  So
> it is an error.  As an error, there should be a corresponding error
> message, hopefully one (or more) which would help diagnose the problem.
>  It seems obvious that, given the long thread on this issue with no
> resolution, we could use some help-- like an error message-- which would
> help in diagnosis.
>
> Thanks for the information and the links though.

i think displaying an error for each char that emacs cannot find a font
for is just not feasible. The app can't know whether it used the right
encoding. And even if the encoding used is correct, it can't do much
about fonts that lack some of the characters in the char set.

i don't have experience in this, but imagine an app gets a byte
stream together with a given charset/encoding. With that, it can split
the bytes into characters and map them to code points in the char set.
(e.g. utf-8 and utf-16 both don't have a fixed byte length for chars.)
After that's done, you get a sequence of code points (i.e. a sequence
of integers). At this point, given an integer, you need to map this
integer to a character in a font. There are many issues here... a font
i guess is a set of glyphs... ultimately a set of integers. I'm not
sure what sort of spec or standard specifies what each integer means
(i.e. suppose your app now has an integer that represents B. Now
suppose your app is set to use font Arial. Now, Arial is a set of
integers, but what standard says which integer is B?)... Part of this
step is what happens when Arial doesn't have that character. (i'm
guessing a font also has data about what character set it contains...)
But in any case, finally we'll have a B from font Arial. Then it goes
thru the whole display process...
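
The two steps described above (bytes to code points, then code points to glyphs) can be sketched in a few lines of Python. This is only an illustration, not how emacs does it: the `toy_font` table and the glyph names in it are made up for the example. In real TrueType/OpenType fonts the code-point-to-glyph lookup lives in the font's `cmap` table, and a renderer typically falls back to a "missing glyph" box when the lookup fails.

```python
# Step 1: decode a byte stream into code points, given a known encoding.
# The sample text is "a", "e with acute", a CJK character, and "B".
data = "a\u00e9\u4e2dB".encode("utf-8")
text = data.decode("utf-8")

code_points = [ord(ch) for ch in text]                    # one integer per character
byte_lengths = [len(ch.encode("utf-8")) for ch in text]   # utf-8 is variable-length

# Step 2: map each code point to a glyph.
# toy_font is a hypothetical stand-in for a font's character-to-glyph table;
# it only covers two characters, like a font that lacks most of Unicode.
toy_font = {0x61: "glyph_a", 0x42: "glyph_B"}
glyphs = [toy_font.get(cp, "missing_glyph_box") for cp in code_points]

for cp, n, g in zip(code_points, byte_lengths, glyphs):
    print(f"U+{cp:04X}  {n} byte(s)  ->  {g}")
```

Note that the decoding step succeeds for all four characters. The little boxes only show up in step 2, where nothing is wrong with the text itself; the chosen font simply has no entry for that code point.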

 overall i think the technology we have today that actually displays
fonts and unicode text etc is extremely complex, not to mention
vector based fonts and anti-aliasing and font-substitution etc techs.

some interesting reading here:

http://en.wikipedia.org/wiki/Computer_font
http://en.wikipedia.org/wiki/Anti-aliasing
http://en.wikipedia.org/wiki/Font_rasterization
http://en.wikipedia.org/wiki/Subpixel_rendering
http://en.wikipedia.org/wiki/Font-substitution

for most modern apps, like browsers, i think they all call the OS's
APIs to handle it. Some glimpses of the emacs dev list seem to suggest
that emacs implements its own display system... on one hand it's bad
because emacs misses out on all the modern tech developed over 2
decades by Apple or Adobe or Microsoft, or some Open Source work; on
the other hand it is admirable in that it does it on its own...

sorry am rambling a bit. You are right that the bottom line is that
some things just render as squares and that is a problem. Though, my
point was that it is unfeasible to issue an error for missing fonts or
misinterpretation of the encoding. Part of this is because
theoretically there's no way to know that the encoding chosen is
correct. Part is because in practice a missing font or a badly chosen
encoding is very common. If we all stick with ascii, everything is
pretty good. If we stick to western langs, things are still not too
bad. But once you have chinese, japanese, korean characters, or the
occasional use of the many math symbols and greek letters, or add
cyrillic/russian or arabic alphabets ... the chances of a missing font
or missing encoding info are very high.
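
A small Python sketch of why a decoder often cannot tell that it picked the wrong encoding. The sample word and the choice of Latin-1 as the "wrong guess" are just for illustration.

```python
# The same bytes, decoded under two different assumptions about the encoding.
raw = "caf\u00e9".encode("utf-8")        # 5 bytes: 63 61 66 c3 a9

as_utf8 = raw.decode("utf-8")            # the intended 4-character text
as_latin1 = raw.decode("latin-1")        # wrong guess, yet no error is raised,
                                         # because every byte value is valid Latin-1

# The reverse mistake happens to be detectable, because a lone 0xE9 byte
# is not a well-formed utf-8 sequence.
latin1_bytes = "caf\u00e9".encode("latin-1")   # 4 bytes: 63 61 66 e9
try:
    latin1_bytes.decode("utf-8")
    detected = False
except UnicodeDecodeError:
    detected = True

print(repr(as_utf8), repr(as_latin1), detected)
```

So an error is possible only in the lucky case where the bytes are malformed under the assumed encoding. When they happen to be well-formed, as with the Latin-1 guess here, the app just gets different (wrong) characters and has no way of knowing.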

i think a large part of the problem is that char set and encoding info
is not part of the file. Things have been getting better in the past
decade with mime types and the unicode standard. But given a byte
stream, even after being lucky enough to know it is text, there's
still little way to know how to interpret it. The char set and
encoding meta data often get lost, implementations are often not
robust, fonts for multiple langs usually are not there, and
font-substitution tech has just started. (according to Wikipedia, IE
before 7 does not even have font substitution (which means you really
need such a beast as a “unicode font”, namely a font that contains
some tens or hundreds of thousands of glyphs))

i think all these issues only started to get addressed in the past
decade, with the globalization partly due to the internet. Before,
English speakers just stuck with ascii and that was pretty sufficient.
Each western lang region stuck with its particular encoding for a few
special chars in its alphabet. Only when things started to mix did
they get more complex, and now there's Chinese & Japanese etc. With
unicode, the use of math symbols also becomes more common. Before
that, it was just ascii markup...

speaking of this, Emacs and FSF docs still stick with the 1980s `quote
hack', and arrows like this ->  => ... extremely stupid. Of course i
filed polite bug reports, and have argued here too, heatedly, but it
basically fell on deaf ears. Some things are just impossible to
progress in the FSF world.

  Xah
∑ http://xahlee.org/

☄
