beaver-devel

Re: [Beaver-devel] Charset detection


From: Michael Terry
Subject: Re: [Beaver-devel] Charset detection
Date: Wed, 28 May 2003 16:46:33 -0400
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030322

Leslie Polzer wrote:
> We might rely on good old 'file' for character set detection. Manpage:
>
> "If a file does not match any of the entries in the magic file, it is examined
> to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit
> extended-ASCII character sets (such as those used on Macintosh and IBM PC
> systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character
> sets can be distinguished by the different ranges and sequences of bytes that
> constitute printable text in each set. If a file passes any of these tests,
> its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII
> files are identified as ``text'' because they will be mostly readable on
> nearly any terminal; UTF-16 and EBCDIC are only ``character data'' because,
> while they contain text, it is text that will require translation before it
> can be read. In addition, file will attempt to determine other characteristics
> of text-type files. If the lines of a file are terminated by CR, CRLF,
> or NEL, instead of the Unix-standard LF, this will be reported. Files that
> contain embedded escape sequences or overstriking will also be identified."

Cool.

> I guess this is a fairly new feature (2000) which might not be present in older
> or other versions of 'file', but
>
> 1) if someone has libxml2 there's quite a chance he has a modern GNU file, too
> 2) it's better than nothing
> 3) we can just take their code and incorporate it in Beaver

Eh, I'd like to avoid 3 if I could.

As for 1, that's not entirely true, is it? I mean, not everyone will have GNU stuff -- what about Sun hardware? Do they use GNU?

Though, I would love to bump our requirement to modern stuff. The formal UNIX spec is pretty restrictive, and we could make our search code a good deal sexier if we were allowed to use modern arguments to grep.

It doesn't really bother me to up our requirements. But, I thought glib had some good support for charset detection. No go?
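
For reference, the "ranges and sequences of bytes" idea from the manpage can be sketched in plain C without depending on either 'file' or glib. This is only a rough illustration, not code taken from 'file', glib, or Beaver: the names text_class, utf8_seq_len, and classify_text are made up for the example, and it only separates ASCII, well-formed UTF-8, and "some other 8-bit set". It makes no attempt to tell ISO-8859-x variants apart or to handle UTF-16 or EBCDIC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of Beaver, glib, or file(1). */
typedef enum {
    TEXT_ASCII,      /* every byte is below 0x80 */
    TEXT_UTF8,       /* at least one multi-byte sequence, all of them well-formed */
    TEXT_OTHER_8BIT  /* high bytes present that do not form valid UTF-8 */
} text_class;

/* Return the length of the UTF-8 sequence starting at p, or 0 if it is
 * malformed or truncated. `remaining` is the number of bytes left at p. */
static size_t utf8_seq_len(const unsigned char *p, size_t remaining)
{
    size_t need, i;
    unsigned char c = p[0];

    if (c < 0x80)
        return 1;                       /* plain ASCII byte */
    else if (c >= 0xC2 && c <= 0xDF)
        need = 2;
    else if (c >= 0xE0 && c <= 0xEF)
        need = 3;
    else if (c >= 0xF0 && c <= 0xF4)
        need = 4;
    else
        return 0;                       /* stray continuation or invalid lead byte */

    if (need > remaining)
        return 0;                       /* sequence cut off by end of buffer */

    for (i = 1; i < need; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */

    /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
    if (c == 0xE0 && p[1] < 0xA0) return 0;
    if (c == 0xED && p[1] > 0x9F) return 0;
    if (c == 0xF0 && p[1] < 0x90) return 0;
    if (c == 0xF4 && p[1] > 0x8F) return 0;

    return need;
}

/* Classify a buffer by walking it one UTF-8 sequence at a time. */
text_class classify_text(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int saw_multibyte = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        size_t n = utf8_seq_len(buf + i, len - i);
        if (n == 0)
            return TEXT_OTHER_8BIT;
        if (n > 1)
            saw_multibyte = 1;
        i += n;
    }
    return saw_multibyte ? TEXT_UTF8 : TEXT_ASCII;
}
```

If the buffer is only the first chunk of a larger file, a multi-byte sequence split at the chunk boundary would be reported as TEXT_OTHER_8BIT here, so a real caller would want to look at the whole file or trim the tail before classifying.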

-mt
