[Beaver-devel] Charset detection
From: Leslie Polzer
Subject: [Beaver-devel] Charset detection
Date: Wed, 28 May 2003 11:06:42 +0200

We might rely on good old 'file' for character set detection. Manpage:
"If a file does not match any of the entries in the magic file, it is examined
to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit
extended-ASCII character sets (such as those used on Macintosh and IBM PC
systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character
sets can be distinguished by the different ranges and sequences of bytes that
constitute printable text in each set. If a file passes any of these tests,
its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII
files are identified as ``text'' because they will be mostly readable on
nearly any terminal; UTF-16 and EBCDIC are only ``character data'' because,
while they contain text, it is text that will require translation before it
can be read. In addition, file will attempt to determine other characteristics
of text-type files. If the lines of a file are terminated by CR, CRLF,
or NEL, instead of the Unix-standard LF, this will be reported. Files that
contain embedded escape sequences or overstriking will also be identified."
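
To make the "ranges and sequences of bytes" idea concrete, here is a rough sketch in Python. It is not the code from 'file'; the function name and the exact byte ranges are my own simplifying assumptions, and it ignores UTF-16 and EBCDIC entirely.

```python
def guess_charset(data: bytes) -> str:
    """Rough charset guess from byte ranges (illustrative, not file's code)."""
    # Control bytes that still count as text here:
    # BEL, BS, TAB, LF, VT, FF, CR, ESC (an assumed, simplified set).
    text_controls = {7, 8, 9, 10, 11, 12, 13, 27}
    if any((b < 0x20 and b not in text_controls) or b == 0x7F for b in data):
        return "binary"
    # Only 7-bit bytes: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # High bytes that form valid multi-byte sequences: UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # High bytes only in 0xA0-0xFF: consistent with ISO-8859-x.
    if all(b < 0x80 or b >= 0xA0 for b in data):
        return "ISO-8859"
    # Bytes in 0x80-0x9F: some non-ISO 8-bit set (Macintosh, IBM PC, ...).
    return "extended-ASCII"
```

For example, guess_charset(b"h\xe9llo") gives "ISO-8859", while the same word encoded as UTF-8 gives "UTF-8". Like 'file', this can only say which family the bytes fit, not which ISO-8859-x variant was meant.
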
I guess this charset detection is a fairly new feature (2000) which might not be
present in older or other versions of 'file', but
1) if someone has libxml2 there's quite a chance they have a modern 'file', too
2) it's better than nothing
3) we can just take their code and incorporate it in Beaver
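
For the "just call 'file'" route, a minimal sketch of what the wrapper could look like, again in Python for brevity. It assumes a 'file' that understands the -b and --mime-encoding options, which only newer releases do; the function names and the file_cmd parameter are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_file_encoding(output: str):
    """Return the charset name from `file -b --mime-encoding` output, or None."""
    name = output.strip().lower()
    # Anything that is not a single charset-like token (for example an
    # error message) is treated as "could not detect".
    if not re.fullmatch(r"[a-z0-9][a-z0-9._-]*", name):
        return None
    # `file` uses these two names when it cannot name a text charset.
    if name in ("binary", "unknown-8bit"):
        return None
    return name


def detect_charset(path: str, file_cmd: str = "file"):
    """Ask an external `file` command for the charset; None if unavailable."""
    if shutil.which(file_cmd) is None:
        return None  # no `file` installed: caller falls back to a default
    try:
        result = subprocess.run(
            [file_cmd, "-b", "--mime-encoding", "--", path],
            capture_output=True, text=True, check=True, timeout=5,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return parse_file_encoding(result.stdout)
```

Returning None whenever 'file' is missing or its answer is unusable keeps this in the "better than nothing" spirit: the editor would simply fall back to whatever default encoding it uses today.
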
Leslie
--
Current Main System: LFS Linux dreadnought 2.4.20 #18 Thu May 15 19:11:10 CEST
2003 i686
Random Religious Statement: I don't like Emacs - it's bloated and expects
disturbing control sequences.