help-gnu-emacs
Re: utf8 char display in buffer


From: Xah Lee
Subject: Re: utf8 char display in buffer
Date: Fri, 12 Jun 2009 10:53:39 -0700 (PDT)
User-agent: G2/1.0

On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote:
> B) It would be helpful if the code which does the decoding of a file and
> renders it into the buffer display, if that part of it would throw an
> error message when it encounters a character it doesn't know how to
> display, i.e., when a little box character is displayed. After all,
> isn't it an error when a little box is displayed in lieu of the correct
> character? Possible error messages would be something like: "decoding
> process can't find /path/to/charset.file" or "decoding process doesn't
> have requisite permission to read /path/to/charset.file" or "invalid
> character: [hex/decimal value]" or other.

Some of the reasoning above is not correct.

In general, a program just reads a text file as a byte stream and uses
an encoding scheme to interpret it. The program has little way to
determine whether that encoding is the right one. Theoretically, it
could check the decoded text against common phrases, but the software
we use daily generally doesn't do that. (Some programs do scan the
text to guess an encoding, but the guess is not always correct.)
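
A minimal Python sketch of that kind of guessing, assuming a
hypothetical helper named guess_encoding and a hand-picked candidate
list (neither is taken from emacs or any real detector). It only shows
that a decode which raises no error is not proof of a correct encoding:

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1")):
    """Return the first candidate codec that decodes data without error.

    Hypothetical helper for illustration; returns None if every
    candidate rejects the bytes.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: invalid bytes raise
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_bytes = "中文 text".encode("utf-8")
big5_bytes = "中文 text".encode("big5")

print(guess_encoding(utf8_bytes))  # utf-8
# latin-1 maps every byte to some char, so it never fails. The guess
# below "succeeds" but is wrong: the bytes are really big5.
print(guess_encoding(big5_bytes))  # latin-1
```

The second result is the interesting one: the bytes decode cleanly, no
error can be reported, and the text on screen is still gibberish.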

Here are some general technical issues and experiences with using
foreign chars:

• The software needs to know what encoding & char set is used in order
to interpret the binary stream. If you don't specifically set it,
typically it assumes ascii or some iso latin char set (for software in
the USA, anyway).

• Today's software generally doesn't contain any extra heuristics to
check whether the encoding used is actually correct. There is no
technical way to check that in general. A program can reject bytes
that are invalid in a given encoding, but beyond that it can only use
heuristics, i.e. guesses. E.g. browsers will often guess when reading
a page that doesn't have encoding info.

• Even when the encoding is correct, the software needs all the proper
fonts to display it. Or it relies on some font-replacement technology:
when it finds a char which the current font doesn't have, it uses
another font for that char. (In the case of Chinese, this often
results in ugly text of mixed char styles: some appear thin, some
thick, some square (like sans-serif), some calligraphic, some
bitmapped.) Windows and OS X both have font-replacement technology, as
do all the major browsers on both. This font replacement technology,
however, is not perfect. So sometimes you'll see squares or question
marks here or there, especially for chars that are not widely used
(e.g. math symbols in unicode, double right arrow, tech symbols such
as Apple's command key and option key, triple asterisk, etc.).

• When writing a file, the software needs to use an encoding to write
it. Just like reading, if you haven't explicitly set it, typically it
uses ascii or some iso latin char set, in most western-language
countries.

• When you open a text file in some software with the wrong encoding
info, the result is gibberish.
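
The writing and reading points above can be sketched in Python. This
is only an illustration under assumptions: the sample text, the
temporary file, and the variable names are made up here, and latin-1
stands in for "the wrong encoding info":

```python
import os
import tempfile

text = "café"

# Create a scratch file so the example leaves nothing behind.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Writing: state the encoding explicitly instead of relying on
    # whatever default the platform picks.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the same encoding gives the text back.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

    # Reading with the wrong encoding raises no error here; it just
    # yields different chars (2 bytes of "é" become 2 separate chars).
    with open(path, "r", encoding="latin-1") as f:
        wrong = f.read()
finally:
    os.remove(path)

print(ascii(right))  # 'caf\xe9'
print(ascii(wrong))  # 'caf\xc3\xa9'
```

ascii() is used for printing only so the output does not depend on
the terminal's own encoding or fonts.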

The above applies not just to emacs, but to all apps. Some of this
commentary is based on my experiences with browsers, web pages, word
processors, online forums, mailing lists, email apps, instant
messaging chat apps, etc., on both mac and windows.

Technically, the issues involved are char set, encoding, and font.
(The concepts of char set and encoding are independent but are often
mixed together in a spec, esp. earlier ones.)
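
To make the char set vs. encoding distinction concrete, here is a
small Python sketch. The sample char and the list of codecs are
arbitrary choices for illustration. The char set assigns the char one
number (its code point); each encoding turns that number into a
different byte sequence. Fonts are a third, separate matter and are
not touched here:

```python
ch = "中"

# Char set: one char, one number (the unicode code point).
code_point = ord(ch)
print(hex(code_point))  # 0x4e2d

# Encoding: the same char becomes different bytes under each scheme.
encodings = ["utf-8", "utf-16-be", "gb18030", "big5"]
encoded = {name: ch.encode(name) for name in encodings}

for name, raw in encoded.items():
    print(name, raw.hex())

# Decoding each byte sequence with its own codec gives back the same
# char, so the codecs agree on the char and differ only in the bytes.
for name, raw in encoded.items():
    assert raw.decode(name) == ch
```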

I often use mixed chinese & english in a single file, on both mac os
x and windows. It works well. On the mac, my emacs is version 22.x.
On win, it is emacs 23. My encoding in emacs is set to utf-8.

I've written a lot about these issues; the following docs might be
helpful.

• Emacs and Unicode Tips
  http://xahlee.org/emacs/emacs_n_unicode.html

• Unicode Characters Example
  http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• the Journey of a Foreign Character thru Internet
  http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Converting a File's Encoding with Python
  http://xahlee.org/perl-python/charset_encoding.html

• Character Sets and Encoding in HTML
  http://xahlee.org/js/html_chars.html

• The Complexity And Tedium of Software Engineering (parts about
unicode problem with unison and emacs)
  http://xahlee.org/UnixResource_dir/writ/programer_frustration.html

• Mac and Windows File Conversion (parts about unicode filename
issues)
  http://xahlee.org/mswin/mac_windows_file_conv.html

• Windows Font and Unicode
  http://xahlee.org/mswin/windows_font_unicode.html

The above articles contain tens of links to Wikipedia in appropriate
places. Wikipedia has massive info in digestible form about these
issues; one can spend a month on the above foreign char issues ...

For some examples of mixed chinese & english text I work with, see:

• Chinese Core Simplified Chars
  http://xahlee.org/lojban/simplified_chars.html

• Ethology, Ethnology, and Lyrics
  http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html

  Xah
∑ http://xahlee.org/

☄


On Jun 12, 9:48 am, "B. T. Raven" <ni...@nihilo.net> wrote:

> I wouldn't be surprised if the gaps and overlaps in the CJK ranges of
> glyphs weren't so complicated that many characters from the following
> encodings may not be included in utf-8, especially if they are not
> precomposed. Try some of these encodings to see if some of the empty
> boxes are resolved into characters:
>
>             chinese-big5
>             chinese-hz
>             chinese-iso-7bit
>             chinese-iso-8bit
>             chinese-iso-8bit-with-esc
>             cn-big5
>             cn-gb
>             cn-gb-2312
>             iso-2022-cjk
>             iso-2022-cn
>             iso-2022-cn-ext

The char sets of most chinese encodings are subsets of, or identical
to, unicode's char set.

In particular, the current official chinese national standard, GB
18030, covers every unicode code point. In effect it is just another
encoding of unicode.

see http://en.wikipedia.org/wiki/GB_18030

Note also that this means china's GB 18030 contains all of the
traditional chars in unicode too. (Though I don't know how big5
relates to unicode.)
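
One hedged way to poke at these coverage claims is to ask Python's
bundled codecs whether they can encode a given char. The helper name
can_encode and the three sample chars are assumptions made up for
this sketch, and the codec tables that ship with Python may differ in
detail from what emacs uses:

```python
def can_encode(text, codec):
    """Return True if codec can represent every char in text.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "simplified": "汉",
    "traditional": "漢",
    "symbol": "⌘",
}

# Print a small coverage table, one row per codec.
for codec in ("gb2312", "big5", "gb18030", "utf-8"):
    row = {label: can_encode(s, codec) for label, s in samples.items()}
    print(codec, row)
```

With the codecs I know of, gb18030 and utf-8 accept all three samples,
while the older gb2312 rejects the traditional char. The big5 row is
left for the reader to check.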

The list you gave above is from emacs? Emacs's list always seems
strange to me... I haven't really looked into it. Maybe emacs's list
really does encompass every encoding that has ever existed, but it
could also be just screwed up like many open source things. For
example, it invents its own names by mixing the char set encoding
with the EOL convention.

Btw, who actually coded the low-level char encoding support in emacs?
Especially unicode, since it came after the time when richard stallman
was still doing the bulk of emacs. That person deserves admiration.
lol.

  Xah
∑ http://xahlee.org/
