help-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: How to determine encoding for file?


From: tomas
Subject: Re: How to determine encoding for file?
Date: Mon, 25 Jan 2010 06:57:43 +0100
User-agent: Mutt/1.5.15+20070412 (2007-04-11)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sun, Jan 24, 2010 at 10:59:46PM +0100, Pascal J. Bourguignon wrote:
> kj <no.email@please.post> writes:
> 
> > I've downloaded a large file that is supposed to contain a mixture
> > of Japanese and English (it's basically a learner's dictionary).
> > The English is displayed correctly, but not so for the Japanese.
> >
> > I've tried setting the buffer's coding system to utf-8,
> > japanese-shift-jis, japanese-shift-jis-mac, japanese-shift-jis-dos
> > (just guessing).  None worked.
> >
> > In fact, I'm not even sure that any of these changes of the coding
> > system achieved *anything*, since the buffer's appearance remained
> > unchanged throughout all this mucking around.  I used the command
> > set-buffer-file-coding-system to do this.

This won't do the trick (see below for what will do). This function just
says: "forget you loaded this file as shift-JIS. From now on it will be
UTF-8" (for example). So it doesn't change anything, but when you save
the file, it will be transformed to the new coding system (if possible).

> >                                            Should I need to do
> > anything besides re-setting the coding system to see a change in
> > how the file is displayed?

You'll have to use `revert-buffer-with-coding-system' (by default mapped
to the key seqence C-x RET r). This will reload the file under
assumption of the new coding system.

> > More importantly, is there a better way to determine a file's
> > correct coding system besides trial and error?

Pascal answered this part better than I could :-)

There will be always lots of byte sequences valid under several coding
systems (but meaning different things). The methods out there to get a
grip on the problem are heuristic, partly based on statistical
properties of the text. If you want to have some fun understanding the
kind of problems involved, have a look at [1]. For an implementation in
Emacs  Lisp, see Unicad [2]

- --------
[1] <http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>
[2] <http://www.emacswiki.org/emacs-en/Unicad>

Regards

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFLXTLXBcgs9XrR2kYRAnR5AJ9Jowgc9pPrCaW0lRe1Tv7xFGya+QCfRXJ8
mLTW2GBvke8OYbVdWiVcrcU=
=gJuQ
-----END PGP SIGNATURE-----




reply via email to

[Prev in Thread] Current Thread [Next in Thread]