Re: grep RFE: End-of-Line choices
From: Mabry Tyson
Subject: Re: grep RFE: End-of-Line choices
Date: Fri, 27 Feb 2004 03:29:58 -0800
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.6) Gecko/20040113
I apologize if there have been discussions that I've missed, but I'm not
on the grep mailing lists. It sounds as if there is some hesitation
to implement a general solution.
It was trivial to *hack* dosbuf.c in grep so that guess_type also
looks for bare CRs (separately from, and consistently with, its
existing check for CRLF) and so that undossify_input does the right
thing for Mac files (no char position mapping for CR files, just
translation, of course). It was more complicated getting that file
compiled for Mac OS X. So I have a grep that does what I want (handling
LF, CR, and CRLF).
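As a rough, standalone illustration of the idea (not the actual grep code), the sketch below classifies a buffer by its line endings and translates bare CRs in place. The names eol_kind, classify_eol, and cr_to_lf are hypothetical and exist only for this example. It assumes the whole text is in one buffer, so a CRLF pair split across two reads would be counted as a bare CR.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; grep's dosbuf.c uses File_type. */
typedef enum { EOL_BINARY, EOL_LF, EOL_CRLF, EOL_CR } eol_kind;

/* Classify a buffer: a NUL byte means binary, CR LF means DOS text,
   a CR not followed by LF means Mac text, anything else is Unix text.
   CRLF wins over bare CR if both appear. */
static eol_kind
classify_eol (const char *buf, size_t len)
{
  int crlf_seen = 0, cr_seen = 0;
  size_t i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\0')
        return EOL_BINARY;
      if (buf[i] == '\r')
        {
          if (i + 1 < len && buf[i + 1] == '\n')
            crlf_seen = 1;
          else
            cr_seen = 1;
        }
    }
  return crlf_seen ? EOL_CRLF : cr_seen ? EOL_CR : EOL_LF;
}

/* Translate every CR to LF in place.  The length does not change,
   so byte offsets need no remapping (unlike the CRLF case). */
static void
cr_to_lf (char *buf, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    if (buf[i] == '\r')
      buf[i] = '\n';
}
```

A caller would run classify_eol on the first buffer read from a file and call cr_to_lf on each buffer only when the result is EOL_CR.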
But that isn't the right way to do this, and I'm the kind of person who
would rather do it the right way than settle for the quick hack. (By
way of introduction, I got my PhD in AI more than 20 years ago.
I've done my share of sys admin work, but I haven't hacked system or
kernel code for decades. Mainly I do research.) I didn't feel that
I'd be the right person to make a proper change, as I had never looked
at grep's code until the other day and C isn't my native language.
At least in the world I see, there is a growing tendency toward more
heterogeneous file systems. I am frequently on Mac, Unix, and Windows
systems in the same day. We develop on platforms of choice and merge
the code together. Then we deploy the same code to the various
systems. We move files between OSs all the time. People copy whole
file systems between Windows and Unix for backup purposes. We mount
file systems back and forth.
The dual nature of Mac OS X (or more than dual if you run a Windows
emulator that has access to the Mac file systems, as I did before I got
a separate Windows box) is an extreme example of a heterogeneous file
system. We have a number of dual-boot Windows/Linux systems that run
into the same issues.
Some transfer programs do EOL translation. That's fine for a limited
number of file transfers, but I wouldn't want to trust it if I'm
exchanging lots of random directories and files.
This issue isn't new. dos2unix and unix2dos have been around a long
time. Emacs has gone to some effort to adapt to a file's EOL
convention. That obviously was a much bigger effort than having grep
adapt to a file's EOL convention (on all OSes, not just DOS).
I have to say that I can't think of any text files that I've ever had to
deal with that would have screwed up the detection of the EOL
convention. In the case of binary files, all bets are off. If I had
to make a choice between a grep that only did LF and a grep that always
chooses among LF, CRLF, or CR, I'd take the latter. But I'd prefer one
that had switches to prevent screw-ups for a case I can't even imagine.
(In response to one comment, I don't think the issue of "cat mac_file
dos_file unix_file | grep" is significant. But a solution that didn't
do the dosbuf.c mapping would be welcome, and I can imagine that
solution detecting CRLF, CR, or LF whenever they show up and then using
OS info or switch settings to decide whether it has found an EOL. Such
a solution could accept {CRLF | CR | LF} as an EOL convention (as
opposed to choosing one of CRLF or CR or LF as the EOL convention for
the whole file).)
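To make the {CRLF | CR | LF} idea concrete, here is a minimal sketch of a scanner that accepts any of the three terminators wherever it shows up, rather than picking one convention per file. The function names next_line and count_lines are made up for this example, and the sketch ignores the OS info and switch settings mentioned above.

```c
#include <stddef.h>

/* Find the end of the line that starts at buf.  *content_len receives
   the number of bytes before the terminator; the return value is the
   number of bytes consumed, including the terminator (LF, CR, or CRLF).
   A final line with no terminator consumes the rest of the buffer. */
static size_t
next_line (const char *buf, size_t len, size_t *content_len)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\n')
        {
          *content_len = i;
          return i + 1;
        }
      if (buf[i] == '\r')
        {
          *content_len = i;
          /* CR LF counts as one terminator, a lone CR as another. */
          return (i + 1 < len && buf[i + 1] == '\n') ? i + 2 : i + 1;
        }
    }
  *content_len = len;
  return len;
}

/* Count lines in a buffer that may mix all three conventions. */
static size_t
count_lines (const char *buf, size_t len)
{
  size_t n = 0, content;

  while (len > 0)
    {
      size_t used = next_line (buf, len, &content);
      buf += used;
      len -= used;
      n++;
    }
  return n;
}
```

With this approach the output of "cat mac_file dos_file unix_file" splits into the expected lines, at the cost of treating a stray CR inside a Unix file as a line break.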
Done properly, the capability to use grep on a text file with a
different EOL convention need not interfere with the efficiency of grep
on "natural" EOL files. Use the switch to turn it on when you need it
(and I'll leave the switch on unless I need it off).
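One way to picture the switch, as a sketch only: keep the native path as a single memchr call and take the more general scan only when the user asks for it. The any_eol flag and the find_eol name are hypothetical and do not correspond to an existing grep option.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first byte of the next line terminator,
   or NULL if the buffer contains none.  With any_eol off, only LF
   counts and the search is a plain memchr, so native files pay
   nothing extra.  With any_eol on, CR also ends a line. */
static const char *
find_eol (const char *buf, size_t len, int any_eol)
{
  size_t i;

  if (!any_eol)
    return memchr (buf, '\n', len);

  for (i = 0; i < len; i++)
    if (buf[i] == '\n' || buf[i] == '\r')
      return buf + i;
  return NULL;
}
```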
I would urge you to make grep be general purpose and agnostic about a
file's OS of origin.
Thanks for considering this....
Mabry Tyson
address@hidden
P.S. Here's the diff of the changes I made. (Whoops! I see that
mac_file_type and mac_use_file_type are extraneous and should be
removed.)
Mabry-Tysons-Computer 3:08am<2> 105: diff -C 3 dosbuf.c.20000119 dosbuf.c
*** dosbuf.c.20000119   Wed Jan 19 20:43:03 2000
--- dosbuf.c    Tue Feb 24 19:50:13 2004
***************
*** 8,17 ****
         functions won't work correctly);
       * Reporting correct byte count with -b for any kind of file.

  */

  typedef enum {
!   UNKNOWN, DOS_BINARY, DOS_TEXT, UNIX_TEXT
  } File_type;

  struct dos_map {
--- 8,20 ----
         functions won't work correctly);
       * Reporting correct byte count with -b for any kind of file.

+    Also handles MAC text files whose lines end in bare CR.
+      * Change CR to LF but otherwise leave the file alone.
+
  */

  typedef enum {
!   UNKNOWN, DOS_BINARY, DOS_TEXT, UNIX_TEXT, MAC_TEXT
  } File_type;

  struct dos_map {
***************
*** 29,39 ****
--- 32,47 ----
  static int dos_pos_map_used = 0;
  static int inp_map_idx = 0, out_map_idx = 1;

+ static File_type mac_file_type = UNKNOWN;
+ static File_type mac_use_file_type = UNKNOWN;
+
+
  /* Guess DOS file type by looking at its contents.  */
  static inline File_type
  guess_type (char *buf, register size_t buflen)
  {
    int crlf_seen = 0;
+   int cr_seen = 0;
    register char *bp = buf;

    while (buflen--)
***************
*** 47,56 ****
        else if (*bp == '\r' && buflen && bp[1] == '\n')
          crlf_seen = 1;

        bp++;
      }

!   return crlf_seen ? DOS_TEXT : UNIX_TEXT;
  }

  /* Convert external DOS file representation to internal.
--- 55,69 ----
        else if (*bp == '\r' && buflen && bp[1] == '\n')
          crlf_seen = 1;

+       /* Bare CR means MAC text file (unless we later see
+          binary characters)  */
+       else if (*bp == '\r' )
+         cr_seen = 1;
+
        bp++;
      }

!   return crlf_seen ? DOS_TEXT : cr_seen ? MAC_TEXT : UNIX_TEXT;
  }

  /* Convert external DOS file representation to internal.
***************
*** 140,148 ****
--- 153,185 ----
        return chars_left;
      }
+   else if (dos_file_type == MAC_TEXT)
+     {
+       char *destp = buf;
+
+       while (buflen--)
+         {
+           if (*buf != '\r')
+             {
+               *destp++ = *buf++;
+               chars_left++;
+             }
+           else
+             {
+               /* Insert an LF */
+               *destp++ = '\n';
+               buf++;
+               chars_left++;
+
+             }
+         }
+
+       return chars_left;
+     }
    return buflen;
  }
Mabry-Tysons-Computer 3:08am<2> 104: diff -b -C 3 system.h.20010208 system.h
*** system.h.20010208   Thu Feb  8 09:01:32 2001
--- system.h    Tue Feb 24 19:48:46 2004
***************
*** 57,67 ****
--- 57,69 ----
  # undef O_BINARY /* BeOS 5 has O_BINARY and O_TEXT, but they have no effect.  */
  #endif
  #ifdef HAVE_DOS_FILE_CONTENTS
+ # if defined(__MSDOS__) || defined(_WIN32)
  # include <io.h>
  # ifdef HAVE_SETMODE
  #  define SET_BINARY(fd) setmode (fd, O_BINARY)
  # else
  #  define SET_BINARY(fd) _setmode (fd, O_BINARY)
+ # endif
  # endif
  #endif