Re: character encoding
From: Ivan Kanis
Subject: Re: character encoding
Date: 26 Oct 2002 19:34:35 +0200
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
Charles> Hugh wrote:
>>
>> Sometimes when I cut and paste "it's" from a web page into an
>> emacs buffer it transfers as "it?s". Ditto for other similar
>> events.
Charles> Most likely these are "curly" apostrophes that are
Charles> inserted when people publish HTML by first writing it in
Charles> a word processor like Word or Word Perfect, which use
I agree, it's a real pain. This program strips all of that nonsense. It
won't help with the copy/paste problem, but it works on a whole file of
text. It reads a windows-1252 file and writes it to standard output as
iso-8859-1: basically, the crap Microsoft put in the 0x80 to 0x9f range
gets replaced by plain ASCII stand-ins, which are standard.
I know it's in C. If someone cares to turn this into Lisp, that would be neat :)
Ivan
#include <stdio.h>
/* ASCII stand-ins for the windows-1252 bytes 0x80..0x9f, indexed by byte - 0x80 */
char *table [] = {
    "euro", /* 0x80 0x20AC #EURO SIGN */
    "", /* 0x81 #UNDEFINED */
    "'", /* 0x82 0x201A #SINGLE LOW-9 QUOTATION MARK */
    "f", /* 0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK */
    "\"", /* 0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK */
    "...", /* 0x85 0x2026 #HORIZONTAL ELLIPSIS */
    "*", /* 0x86 0x2020 #DAGGER */
    "*", /* 0x87 0x2021 #DOUBLE DAGGER */
    "^", /* 0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT */
    " 0/00", /* 0x89 0x2030 #PER MILLE SIGN */
    "S", /* 0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON */
    "<", /* 0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
    "OE", /* 0x8C 0x0152 #LATIN CAPITAL LIGATURE OE */
    "", /* 0x8D #UNDEFINED */
    "Z", /* 0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON */
    "", /* 0x8F #UNDEFINED */
    "", /* 0x90 #UNDEFINED */
    "'", /* 0x91 0x2018 #LEFT SINGLE QUOTATION MARK */
    "'", /* 0x92 0x2019 #RIGHT SINGLE QUOTATION MARK */
    "\"", /* 0x93 0x201C #LEFT DOUBLE QUOTATION MARK */
    "\"", /* 0x94 0x201D #RIGHT DOUBLE QUOTATION MARK */
    "*", /* 0x95 0x2022 #BULLET */
    "-", /* 0x96 0x2013 #EN DASH */
    "-", /* 0x97 0x2014 #EM DASH */
    "~", /* 0x98 0x02DC #SMALL TILDE */
    "(TM)", /* 0x99 0x2122 #TRADE MARK SIGN */
    "s", /* 0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON */
    ">", /* 0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
    "oe", /* 0x9C 0x0153 #LATIN SMALL LIGATURE OE */
    "", /* 0x9D #UNDEFINED */
    "z", /* 0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON */
    "Y" /* 0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};
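int main (int argc, char **argv) {
    FILE *fd;
    int in; /* int rather than char, so EOF stays distinct from byte 0xff */

    if (argc != 2) {
        fprintf (stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    /* "rb": read in binary mode so no platform translates the bytes */
    if (!(fd = fopen (argv[1], "rb"))) {
        perror (argv[1]);
        return 1;
    }
    while ((in = fgetc (fd)) != EOF) {
        if (in >= 0x80 && in < 0xa0) {
            /* windows-1252-only byte: print its ASCII stand-in */
            fputs (table[in - 0x80], stdout);
        } else {
            /* ASCII and iso-8859-1 bytes pass through unchanged */
            putchar (in);
        }
    }
    fclose (fd);
    return 0;
}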
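For the copy/paste case, where the text is already in memory rather than in a file, the same lookup could be applied to a string. What follows is only a rough sketch and not part of the program above: the names strip_cp1252 and cp1252_ascii are made up for illustration, and the table is retyped from the Unicode names in the comments, so 0x82, 0x9B and 0x9F map to ', > and Y here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII stand-ins for the windows-1252 bytes 0x80..0x9f (hypothetical
   copy of the table, indexed by byte - 0x80). */
static const char *const cp1252_ascii[32] = {
    "euro", "",      "'",  "f", "\"", "...", "*", "*",
    "^",    " 0/00", "S",  "<", "OE", "",    "Z", "",
    "",     "'",     "'",  "\"", "\"", "*",  "-", "-",
    "~",    "(TM)",  "s",  ">", "oe", "",    "z", "Y"
};

/* Copy the NUL-terminated string src into dst (capacity dst_size bytes),
   replacing each byte in 0x80..0x9f by its ASCII stand-in.
   Returns the length the full result needs, not counting the NUL, so a
   return value >= dst_size means the output was truncated. */
size_t strip_cp1252(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(src != NULL && (dst != NULL || dst_size == 0));
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        char single[2] = { *src, '\0' };   /* pass-through case */
        const char *rep =
            (c >= 0x80 && c < 0xa0) ? cp1252_ascii[c - 0x80] : single;
        size_t len = strlen(rep);
        size_t i;

        for (i = 0; i < len; i++, out++) {
            if (out + 1 < dst_size)        /* keep room for the NUL */
                dst[out] = rep[i];
        }
    }
    if (dst_size > 0)
        dst[out < dst_size ? out : dst_size - 1] = '\0';
    return out;
}
```

Calling strip_cp1252("it\x92s", buf, sizeof buf) with a large enough buf leaves it's in buf and returns 4. Bytes from 0xa0 up are ordinary iso-8859-1 and are copied as they are.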
--
/-----------------------------------------------------------------------------*
| "I shall never make a new friend in my life, | Ivan Kanis |
| though perhaps a few after I die." | ivank@juliva.com |
| (Oscar Wilde) | www.juliva.com |
*-----------------------------------------------------------------------------/