help-gnu-emacs

RE: Q on call-process and grep


From: Drew Adams
Subject: RE: Q on call-process and grep
Date: Thu, 22 Dec 2005 13:11:45 -0800

    If it were me, I'd make a copy of the file, and then chop it into
    smaller pieces where I can illustrate the problem in a manageable
    length (say, 10 or 20 lines, but the fewer the better). The sed command

       sed -n 17,35p bigfile >smallfile

There are over 30,000 lines.

    will print lines 17 to 35, inclusive, so you can do your testing. But
    you say that some of the lines are quite long. So try this:

       awk '{print length($0)}' smallfile

    to see how long is too long. If the lines are under 4000 chars, I'd
    feel safe in guessing that line length isn't a problem. If you have
    lines 20,000 chars or more, then I'd start thinking about the input.

I was hoping that I was missing something simple. You seem to be confirming
that I didn't miss anything obvious (to you) ;-).

The longest line is over 12,000 characters.
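For what it's worth, the awk one-liner above can be condensed to report just the maximum line length instead of printing every length. A minimal sketch (the sample file here is fabricated for illustration; substitute your real smallfile):

```shell
# Sketch of the line-length check, condensed to print only the maximum.
# "smallfile" is a stand-in created here for illustration.
printf 'short\na much longer line of text\n' > smallfile
awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' smallfile
# prints 26
```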

    Does each line in the problem set end in a CR/LF? I've had datafiles
    that gave me bad data because somehow some lines ended with CR/LF,
    others with CR/CR/LF, and others with CR only. How I got the problem
    isn't relevant. But to normalize the input, try

       tr -d '\r' <smallfile | sed -n p >clean_smallfile

    which should remove any extraneous CRs that might be causing
    corruption and restore the line endings to your Cygwin default (Unix or
    DOS, whichever you picked).

Did that on the complete original file. `ediff' shows no difference from the
original.
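For anyone following along, a minimal sketch of that normalization step, run on a fabricated file with deliberately mixed line endings (CR/CR/LF, CR/LF, and plain LF):

```shell
# Sketch of the CR-stripping pipeline on a file with mixed endings.
# The input file here is fabricated; substitute your real smallfile.
printf 'one\r\r\ntwo\r\nthree\n' > smallfile
tr -d '\r' < smallfile | sed -n p > clean_smallfile
od -c clean_smallfile        # no \r characters should remain
```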

I tried using a small file - just a few lines of the original - no change.
Terms that can't be found still aren't; those that can be found still are.

    Use tr to delete all the characters that are permissible or
    expected, and whatever is left must be an unexpected character. Examine
    the output with cat -A or od or your tool of choice. E.g.,

       tr -d '\n\r\t\40-\176' <infile >outfile

Did that. outfile is empty, so I guess everything was ASCII.
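As a sanity check on the filter itself, here's a sketch run on a fabricated file that does contain a stray non-ASCII byte, so outfile comes out non-empty:

```shell
# Sketch of the "delete everything expected" filter from above.
# This fabricated input contains one byte (octal \200) outside the
# allowed set (LF, CR, TAB, printable ASCII), so that byte survives.
printf 'plain ascii line\nwith a stray \200 byte\n' > infile
tr -d '\n\r\t\40-\176' < infile > outfile
wc -c < outfile              # nonzero: something unexpected was present
```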

    If it were me, I might wonder about embedded backspaces or carriage
    returns in the text. Just a thought. Good luck on your hunting!

My guess is that the line lengths and number of lines don't matter here,
because it works fine for other words, including 1) words in the longest
line and 2) words in the last line of the file. It's a mystery to me why it
doesn't work for certain words.

Thanks for your suggestions, though - they were good things to try, even if
I haven't yet solved the problem.




