Re: utf8 char display in buffer

help-gnu-emacs

From:	Xah Lee
Subject:	Re: utf8 char display in buffer
Date:	Fri, 12 Jun 2009 10:27:02 -0700 (PDT)
User-agent:	G2/1.0

On Jun 12, 7:54 am, ken <geb...@mousecar.com> wrote:
> B) It would be helpful if the code which does the decoding of a file and
> renders it into the buffer display, if that part of it would throw an
> error message when it encounters a character it doesn't know how to
> display, i.e., when a little box character is displayed. After all,
> isn't it an error when a little box is displayed in lieu of the correct
> character? Possible error messages would be something like: "decoding
> process can't find /path/to/charset.file" or "decoding process doesn't
> have requisite permission to read /path/to/charset.file" or "invalid
> character: [hex/decimal value]" or other.

some of the reasoning in the above is not correct.

In general, a program just reads a text file as a byte stream and
uses an encoding scheme to interpret it. The program has little way
to determine whether the encoding is correct. Theoretically, it could
check the decoded text against common phrases, but that is generally
not done by the software we use daily. (some programs do scan the text
to guess an encoding, but the guess is not always correct)
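
As a rough illustration of the point above (a Python sketch, not how
emacs does it): the same bytes decode without any error under several
encodings, so the program gets no signal that it picked the wrong one.
The sample word and the variable names are made up for this example.

```python
# The same byte stream, interpreted with three different encodings.
raw = "caf\u00e9".encode("utf-8")      # the word "café" stored as UTF-8 bytes

as_utf8 = raw.decode("utf-8")          # the intended text
as_latin1 = raw.decode("latin-1")      # wrong text, but no exception is raised
as_cp1252 = raw.decode("cp1252")       # also wrong, also no exception

for name, text in [("utf-8", as_utf8), ("latin-1", as_latin1), ("cp1252", as_cp1252)]:
    print(name, repr(text))

# Only a decoder that rejects some byte sequences can fail at all.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("ascii decodes cleanly:", ascii_ok)
```

Note that latin-1 assigns a character to every byte value, so decoding
with it can never fail; that is why "no error" says nothing about
whether the encoding was right.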

here's some general technical issues and experiences about using
foreign chars:

• the software needs to know what encoding & char set is used in order
to interpret the byte stream. If you don't specifically set it,
it typically assumes ascii or some iso latin char set. (for software in
the USA anyway)

• today's software generally doesn't contain any extra heuristics to
check whether the encoding used is actually correct. There is no
technical way to check that in general. It can only be heuristics,
i.e. guesses. e.g. browsers will often guess when reading a page that
doesn't have encoding info.

• even when the encoding is correct, the software needs all the proper
fonts to display it. Or it relies on some font-replacement technology,
e.g. when it finds a char which the current font doesn't have, it uses
another font for that char. (in the case of Chinese, this often
results in ugly text of mixed char styles: some appear thin, some
thick, some squarish (like sans-serif), some calligraphic, some
bitmapped) Windows and OS X both have font-replacement technology,
as do all the major browsers for both os x and windows. This font
replacement technology, however, is not perfect. So, sometimes you'll
see squares or question marks here or there, especially for chars
that are not widely used (e.g. math symbols in unicode, double right
arrow, tech symbols such as Apple's command key and option key, triple
asterisk, etc.).

• when writing a file, the software needs to use an encoding to write
it. Just like reading, if you haven't explicitly set it, it typically
uses ascii or some iso latin char set, in most western lang countries.

• when you use software to open a text file with the wrong encoding
info, the result is gibberish.
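
A few of the points above can be tried in a short Python sketch: write
a file with an explicit encoding, read it back with the right and the
wrong one, and guess an encoding by trial decoding. The helper
guess_encoding, the candidate lists, and the file name are all invented
for this example; real detectors (in browsers etc.) use more elaborate
heuristics, and this is not what emacs does internally.

```python
import os
import tempfile

text = "\u4e2d\u6587 and English"   # "中文 and English"


def guess_encoding(data, candidates):
    """Return the first candidate that decodes data without error.

    Hypothetical helper: this is a guess, not a proof of correctness.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Writing: state the encoding explicitly instead of relying on a default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the same encoding gives the text back.
    with open(path, "r", encoding="utf-8") as f:
        right = f.read()

    # Reading with a wrong encoding raises no error here, it just gives gibberish.
    with open(path, "r", encoding="latin-1") as f:
        wrong = f.read()

    with open(path, "rb") as f:
        raw = f.read()

print(repr(right))
print(repr(wrong))

# Trial decoding happens to work for this file ...
good_guess = guess_encoding(raw, ("ascii", "utf-8", "latin-1"))

# ... but it is easily fooled: these bytes are GB18030, yet latin-1 accepts them first.
gb_bytes = "\u4e2d\u6587".encode("gb18030")
bad_guess = guess_encoding(gb_bytes, ("utf-8", "latin-1", "gb18030"))

print(good_guess, bad_guess)
```

The second guess comes out wrong only because of the order of the
candidates, which is the sense in which such checks are heuristics.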

the above applies not just to emacs, but to all apps. Some of this
commentary is based on my experiences with browsers, web pages, word
processors, online forums, mailing lists, email apps, instant messaging
chat apps, etc, on both mac and windows.

technically, the issues involved are char set, encoding, and font. (the
concepts of char set and encoding are independent but are often mixed
together in a spec, esp. earlier ones.)
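
To make the char set vs encoding distinction concrete, here is a small
Python sketch (the sample char is picked arbitrarily): the char set
assigns the char a number, and each encoding turns that one number into
a different byte sequence.

```python
ch = "\u4e2d"                       # the Chinese char "中"

# Char set: Unicode assigns the char a number (code point).
code_point = ord(ch)

# Encoding: rules for turning that number into bytes.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
utf32_bytes = ch.encode("utf-32-be")

print(hex(code_point))
for name, data in [("utf-8", utf8_bytes), ("utf-16-be", utf16_bytes), ("utf-32-be", utf32_bytes)]:
    print(name, data.hex(), len(data), "bytes")
```

Whether the char then shows up on screen is the third, separate issue:
the font has to contain a glyph for it.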

i often use mixed chinese & english in a single file, on both mac os
x and windows. It works well. On the mac, my emacs is version 22.x.
On win, it is emacs23. My encoding in emacs is set to utf-8.

I've written a lot about these issues; the following docs might be
helpful.

• Emacs and Unicode Tips
  http://xahlee.org/emacs/emacs_n_unicode.html

• Unicode Characters Example
  http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• the Journey of a Foreign Character thru Internet
  http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Converting a File's Encoding with Python
  http://xahlee.org/perl-python/charset_encoding.html

• Character Sets and Encoding in HTML
  http://xahlee.org/js/html_chars.html

• The Complexity And Tedium of Software Engineering (parts about
unicode problem with unison and emacs)
  http://xahlee.org/UnixResource_dir/writ/programer_frustration.html

• Mac and Windows File Conversion (parts about unicode filename
issues)
  http://xahlee.org/mswin/mac_windows_file_conv.html

• Windows Font and Unicode
  http://xahlee.org/mswin/windows_font_unicode.html

the above articles contain tens of links to Wikipedia in appropriate
places. Wikipedia has massive info in digestible form about these
issues; one can spend a month on the above foreign char issues ...

for some examples of mixed chinese & english text i work with, see:

• Chinese Core Simplified Chars
  http://xahlee.org/lojban/simplified_chars.html

• Ethology, Ethnology, and Lyrics
  http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html

  Xah
∑ http://xahlee.org/

☄
