Re: [Beaver-devel] Charset detection

From:	Michael Terry
Subject:	Re: [Beaver-devel] Charset detection
Date:	Wed, 28 May 2003 16:46:33 -0400
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030322

Leslie Polzer wrote:

We might rely on good old 'file' for character set detection. Manpage:

"If a file does not match any of the entries in the magic file, it is examined
to see if it seems to be a text file. ASCII, ISO-8859-x, non-ISO 8-bit
extended-ASCII character sets (such as those used on Macintosh and IBM PC
systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character
sets can be distinguished by the different ranges and sequences of bytes that
constitute printable text in each set. If a file passes any of these tests,
its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII
files are identified as ``text'' because they will be mostly readable on
nearly any terminal; UTF-16 and EBCDIC are only ``character data'' because,
while they contain text, it is text that will require translation before it
can be read. In addition, file will attempt to determine other characteristics
of text-type files. If the lines of a file are terminated by CR, CRLF,
or NEL, instead of the Unix-standard LF, this will be reported. Files that
contain embedded escape sequences or overstriking will also be identified."


Cool.
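
For illustration only, here is a rough sketch of the kind of byte-range
tests the manpage describes. It is not file's actual code: the function
names are made up, UTF-16 is recognised only by its byte-order mark, and
EBCDIC, NEL and the binary/control-character checks are left out.

```python
def guess_charset(data: bytes) -> str:
    """Very rough charset guess based on byte ranges (hypothetical helper)."""
    if not data:
        return "empty"
    # Assumption: only BOM-prefixed UTF-16 is recognised in this sketch.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"
    # Nothing above 0x7F means plain 7-bit ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # High bytes that form valid multi-byte sequences are taken as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # ISO-8859-x has no printable characters in 0x80-0x9F, so any byte
    # in that range suggests a non-ISO 8-bit set (Mac, IBM PC, ...).
    if not any(0x80 <= b <= 0x9F for b in data):
        return "iso-8859"
    return "extended-ascii"


def line_terminators(data: bytes) -> set:
    """Report which of CRLF, CR and LF occur in the data."""
    found = set()
    crlf = data.count(b"\r\n")
    if crlf:
        found.add("CRLF")
    # A CR or LF that is not part of a CRLF pair counts on its own.
    if data.count(b"\r") > crlf:
        found.add("CR")
    if data.count(b"\n") > crlf:
        found.add("LF")
    return found
```

An editor would presumably run something like this on the first few
kilobytes of a file when opening it, then convert to its internal
encoding.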

I guess this is a fairly new feature (2000) which might not be present in older
or other versions of 'file', but

1) if someone has libxml2 there's quite a chance he has a modern GNU file, too
2) it's better than nothing
3) we can just take their code and incorporate it in Beaver


Eh, I'd like to avoid 3 if I could.

As for 1, that's not entirely true, is it? I mean, not everyone will have GNU stuff -- what about Sun hardware? Do they use GNU?

Though, I would love to bump our requirements to modern stuff. The formal UNIX spec is pretty restrictive, and we could make our search code a good deal sexier if we were allowed to use modern arguments to grep.

It doesn't really bother me to up our requirements. But, I thought glib had some good support for charset detection. No go?
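
To make the trade-off concrete, a minimal sketch of "use file if it is
there, otherwise fall back to something simple" might look like the
following. It assumes a file that understands -b and --mime-encoding
(older or non-GNU versions may not, in which case the call is treated as
a failure). The function names and the "unknown-8bit" label are
invented for the example. The fallback is only a UTF-8 validity check,
which is roughly what glib's g_utf8_validate() gives you; it is not real
charset detection.

```python
import subprocess


def charset_via_file(path: str):
    """Ask the external 'file' tool for the encoding; None if unavailable."""
    try:
        result = subprocess.run(
            ["file", "-b", "--mime-encoding", path],
            capture_output=True, text=True, check=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # 'file' is missing, too old for this option, or timed out.
        return None
    answer = result.stdout.strip()
    return answer or None


def detect_charset(path: str) -> str:
    """Prefer 'file'; otherwise only distinguish valid UTF-8 from the rest."""
    answer = charset_via_file(path)
    if answer:
        return answer
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
        return "utf-8"  # note: plain ASCII also lands here
    except UnicodeDecodeError:
        return "unknown-8bit"  # hypothetical label for "something else"
```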

-mt
