help-gnu-emacs

Re: character encoding


From: Ivan Kanis
Subject: Re: character encoding
Date: 26 Oct 2002 19:34:35 +0200
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

    Charles> Hugh wrote:
    >>
    >> Sometimes when I cut and paste "it's" from a web page into an
    >> emacs buffer it transfers as "it?s".  Ditto for other similar
    >> events.

    Charles> Most likely these are "curly" apostrophes that are
    Charles> inserted when people publish HTML by first writing it in
    Charles> a word processor like Word or Word Perfect, which use

I agree it's a real pain. This program strips all of that nonsense. It
won't help with the copy/paste problem, but it will work on a big chunk
of text saved to a file. It turns windows-1252 text into plain
iso-8859-1 by replacing each extra character with an ASCII stand-in.
Basically one has to convert the crap Microsoft put between 0x80 and
0x9f into something standard.

I know it's in C. If someone cares to turn this into lisp that'll be neat :)

Ivan


#include <stdio.h>

const char *table [] =  {
    "euro",  /* 0x80  0x20AC  #EURO SIGN */
    "",      /* 0x81          #UNDEFINED */
    "'",     /* 0x82  0x201A  #SINGLE LOW-9 QUOTATION MARK */
    "f",     /* 0x83  0x0192  #LATIN SMALL LETTER F WITH HOOK */
    "\"",    /* 0x84  0x201E  #DOUBLE LOW-9 QUOTATION MARK */
    "...",   /* 0x85  0x2026  #HORIZONTAL ELLIPSIS */
    "*",     /* 0x86  0x2020  #DAGGER */
    "*",     /* 0x87  0x2021  #DOUBLE DAGGER */
    "^",     /* 0x88  0x02C6  #MODIFIER LETTER CIRCUMFLEX ACCENT */
    " 0/00", /* 0x89  0x2030  #PER MILLE SIGN */
    "S",     /* 0x8A  0x0160  #LATIN CAPITAL LETTER S WITH CARON */
    "<",     /* 0x8B  0x2039  #SINGLE LEFT-POINTING ANGLE QUOTATION MARK */
    "OE",    /* 0x8C  0x0152  #LATIN CAPITAL LIGATURE OE */
    "",      /* 0x8D          #UNDEFINED */
    "Z",     /* 0x8E  0x017D  #LATIN CAPITAL LETTER Z WITH CARON */
    "",      /* 0x8F          #UNDEFINED */
    "",      /* 0x90          #UNDEFINED */
    "'",     /* 0x91  0x2018  #LEFT SINGLE QUOTATION MARK */
    "'",     /* 0x92  0x2019  #RIGHT SINGLE QUOTATION MARK */
    "\"",    /* 0x93  0x201C  #LEFT DOUBLE QUOTATION MARK */
    "\"",    /* 0x94  0x201D  #RIGHT DOUBLE QUOTATION MARK */
    "*",     /* 0x95  0x2022  #BULLET */
    "-",     /* 0x96  0x2013  #EN DASH */
    "-",     /* 0x97  0x2014  #EM DASH */
    "~",     /* 0x98  0x02DC  #SMALL TILDE */
    "(TM)",  /* 0x99  0x2122  #TRADE MARK SIGN */
    "s",     /* 0x9A  0x0161  #LATIN SMALL LETTER S WITH CARON */
    ">",     /* 0x9B  0x203A  #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK */
    "oe",    /* 0x9C  0x0153  #LATIN SMALL LIGATURE OE */
    "",      /* 0x9D          #UNDEFINED */
    "z",     /* 0x9E  0x017E  #LATIN SMALL LETTER Z WITH CARON */
    "Y"      /* 0x9F  0x0178  #LATIN CAPITAL LETTER Y WITH DIAERESIS */
};


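int main (int argc, char **argv) {
    FILE *fd;
    int in;

    if (argc != 2) {
        fprintf (stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    /* "rb" so that no platform translates bytes behind our back */
    if (!(fd = fopen (argv[1], "rb"))) {
        perror (argv[1]);
        return 1;
    }
    while ((in = fgetc (fd)) != EOF) {
        if (in >= 0x80 && in < 0xa0) {
            /* windows-1252 extra: print its plain stand-in */
            fputs (table[in - 0x80], stdout);
        } else {
            putchar (in);
        }
    }
    fclose (fd);
    return 0;
}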
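
For text that is already in memory rather than in a file, the same
table lookup could be wrapped in a function. What follows is only a
sketch: the names cp1252_table and cp1252_to_ascii and the buffer
arguments are made up for illustration, and the table holds the same
kind of ASCII stand-ins as the program above.

```c
#include <stddef.h>
#include <string.h>

/* ASCII stand-ins for windows-1252 bytes 0x80..0x9f,
   indexed by (byte - 0x80). Empty strings are undefined bytes. */
static const char *const cp1252_table[32] = {
    "euro", "",      "'",  "f",    "\"", "...", "*", "*",
    "^",    " 0/00", "S",  "<",    "OE", "",    "Z", "",
    "",     "'",     "'",  "\"",   "\"", "*",   "-", "-",
    "~",    "(TM)",  "s",  ">",    "oe", "",    "z", "Y"
};

/*
 * Hypothetical helper: copy src to dst, replacing bytes 0x80..0x9f
 * with their stand-ins. Writes at most dst_size - 1 bytes plus a
 * terminating NUL and returns the number of bytes written, not
 * counting the NUL. Conversion stops at the first piece that does
 * not fit.
 */
size_t cp1252_to_ascii (const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    if (dst_size == 0)
        return 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char) *src;
        char one[2] = { (char) c, '\0' };
        const char *rep =
            (c >= 0x80 && c < 0xa0) ? cp1252_table[c - 0x80] : one;
        size_t len = strlen (rep);

        if (out + len >= dst_size)
            break;              /* no room left, stop here */
        memcpy (dst + out, rep, len);
        out += len;
    }
    dst[out] = '\0';
    return out;
}
```

Called on "it\x92s" with a large enough buffer, this should give back
"it's" and return 4. Stopping at the first piece that does not fit is
an arbitrary choice made to keep the sketch short.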



-- 
/-----------------------------------------------------------------------------*
|    "I shall never make a new friend in my life,    |       Ivan Kanis       |
|    though perhaps a few after I die."              |    ivank@juliva.com    |
|    (Oscar Wilde)                                   |     www.juliva.com     |
*-----------------------------------------------------------------------------/

