
[Beaver-devel] Charset detection


From: Leslie Polzer
Subject: [Beaver-devel] Charset detection
Date: Wed, 28 May 2003 11:06:42 +0200

We could rely on good old 'file' for character set detection. From its man page:

"If a file does not match any of the entries in the magic file, it is examined
to see if it seems to be a text file.  ASCII, ISO-8859-x, non-ISO 8-bit
extended-ASCII character sets (such as those used on Macintosh and IBM PC
systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character
sets can be distinguished by the different ranges and sequences of bytes that
constitute printable text in each set.  If a file passes any of these tests,
its character set is reported.  ASCII, ISO-8859-x, UTF-8, and extended-ASCII
files are identified as ``text'' because they will be mostly readable on
nearly any terminal; UTF-16 and EBCDIC are only ``character data'' because,
while they contain text, it is text that will require translation before it
can be read.  In addition, file will attempt to determine other
characteristics of text-type files.  If the lines of a file are terminated by
CR, CRLF, or NEL, instead of the Unix-standard LF, this will be reported.
Files that contain embedded escape sequences or overstriking will also be
identified."
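
To make the "ranges and sequences of bytes" idea from the man page concrete,
here is a rough sketch in Python. It is not the code from 'file'; the function
names classify_text and line_terminators are made up for illustration, and the
rules are simplified assumptions (UTF-16 is only recognised by its byte order
mark, EBCDIC and NEL are ignored, and control characters are not checked).

```python
def classify_text(data: bytes) -> str:
    """Rough guess at the character set of a byte buffer (illustrative only)."""
    if not data:
        return "empty"
    # Assumption: UTF-16 text announces itself with a byte order mark.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "UTF-16"
    # Only 7-bit bytes present: plain ASCII.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # High bytes that form valid multi-byte sequences: UTF-8.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-x reserves 0x80-0x9F for control codes, so printable
    # high characters only appear at 0xA0 and above.
    if all(b < 0x80 or b >= 0xA0 for b in data):
        return "ISO-8859-x"
    # Anything else is lumped together as non-ISO extended ASCII.
    return "extended-ASCII"


def line_terminators(data: bytes) -> set:
    """Return which of CRLF, CR and LF occur as line terminators."""
    found = set()
    crlf = data.count(b"\r\n")
    if crlf:
        found.add("CRLF")
    # A CR or LF counts on its own only if it is not part of a CRLF pair.
    if data.count(b"\r") > crlf:
        found.add("CR")
    if data.count(b"\n") > crlf:
        found.add("LF")
    return found
```

For example, classify_text(b"hello") gives "ASCII", and a buffer containing
both "\r\n" and a bare "\n" yields {"CRLF", "LF"} from line_terminators.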

I guess this is a fairly recent feature (added around 2000) that may be missing
from older or alternative implementations of 'file', but:

1) if someone has libxml2, there's a good chance they have a recent 'file' too
2) it's better than nothing
3) we could simply take their code and incorporate it into Beaver
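
As a sketch of the "just call 'file'" route, something along these lines would
do. Assumptions: the helper name charset_from_file is hypothetical, and the
--mime-encoding option is the one found in current releases of 'file'; older
versions may not have it, in which case the function simply returns None and
the caller has to fall back to something else.

```python
import os
import shutil
import subprocess


def charset_from_file(path):
    """Ask the external 'file' utility for a file's encoding.

    Returns a string such as 'us-ascii' or 'utf-8', or None when 'file'
    is unavailable, the path is not a regular file, or the call fails.
    """
    exe = shutil.which("file")
    if exe is None or not os.path.isfile(path):
        return None
    try:
        # -b suppresses the filename prefix; "--" guards against paths
        # that start with a dash being parsed as options.
        result = subprocess.run(
            [exe, "-b", "--mime-encoding", "--", path],
            capture_output=True,
            text=True,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

Returning None on every failure keeps the caller simple: it can treat a
missing or outdated 'file' exactly like an undetectable character set.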

Leslie

-- 
Current Main System: LFS Linux dreadnought 2.4.20 #18 Thu May 15 19:11:10 CEST 
2003 i686
Random Religious Statement: I don't like Emacs - it's bloated and expects 
disturbing control sequences.



