help-gnu-utils

Re: Diff: obtain only the different lines of the newest file


From: Davide Brini
Subject: Re: Diff: obtain only the different lines of the newest file
Date: Thu, 12 Aug 2010 09:56:41 +0100
User-agent: KMail/1.13.5 (Linux/2.6.34-gentoo-r1; KDE/4.4.5; x86_64; ; )

On Thursday 12 Aug 2010 08:44:35 Bob Proulx wrote:

> Kimahri Ronso wrote:
> > My question is about the diff command.
> > 
> > I have 2 files containing almost the same information with around 70.000
> > records.
> > 
> > What I would like to know is if there is a possibility to obtain only the
> > different lines from the second file without anything else.
> 
> Instead of using 'diff' you might find 'comm' more the right tool
> there.
> 
>        Compare sorted files FILE1 and FILE2 line by line.
> 
>        With no options, produce three-column output.  Column one
>        contains lines unique to FILE1, column two contains lines
>        unique to FILE2, and column three contains lines common to
>        both files.
> 
>        -1     suppress lines unique to FILE1
> 
>        -2     suppress lines unique to FILE2
> 
>        -3     suppress lines that appear in both files
> 
> Here is an example.  Given:
> 
>   $ cat /tmp/a
>   one
>   two
>   three
> 
>   $ cat /tmp/b
>   one
>   two
>   three
>   four
>   five
>   six
> 
> Then:
> 
>   $ comm -13 /tmp/a /tmp/b
>   four
>   five
>   six

I think this works by accident, since comm needs sorted input and these
files are not sorted ("three" sorts before "two").
I get this:

$ comm -13 /tmp/a /tmp/b
four
comm: file 2 is not in sorted order
five
six
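
A minimal sketch of one way to avoid that warning, assuming it is acceptable
for the result to come back in sorted order rather than in the original file
order. The scratch directory and file names are made up for illustration, and
LC_ALL=C is set only so that sort and comm agree on the collation order:

```shell
# Work in a scratch directory so nothing outside it is touched.
tmpdir=$(mktemp -d)
printf '%s\n' one two three > "$tmpdir/a"
printf '%s\n' one two three four five six > "$tmpdir/b"

# comm expects both inputs to be sorted, so sort copies of them first.
LC_ALL=C sort "$tmpdir/a" > "$tmpdir/a.sorted"
LC_ALL=C sort "$tmpdir/b" > "$tmpdir/b.sorted"

# -13 suppresses columns 1 and 3, leaving only lines unique to the second file.
only_in_b=$(LC_ALL=C comm -13 "$tmpdir/a.sorted" "$tmpdir/b.sorted")
printf '%s\n' "$only_in_b"

rm -rf "$tmpdir"
```

With these sample files the unique lines are printed as five, four, six, that
is, in sort order and not in the order they appear in the file.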
 
> > I just need to know the content of the changed lines in the newest file.
> 
> For that you would need to determine the newest file first and then
> handle it appropriately.  Something like this, untested:
> 
>   if [ "$(stat --format %Y /tmp/a)" -lt "$(stat --format %Y /tmp/b)" ]; then
>     comm -13 /tmp/a /tmp/b
>   else
>     comm -13 /tmp/b /tmp/a
>   fi
> 
> The stat with %Y emits the modification time as an integer number of
> seconds and that is compared to determine the newest file.

Here's an awk solution, assuming the newer file has previously been determined 
(for example with stat as you suggest):

awk 'NR==FNR{a[$0];next} !($0 in a)' oldfile newfile

That prints lines in "newfile" that are not in "oldfile".
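
To make the mechanics concrete, here is a small sketch of the same one-liner
run on made-up sample data. The scratch directory, the file contents, and the
array name "seen" are only for illustration:

```shell
# Build two small sample files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' one two three > "$tmpdir/oldfile"
printf '%s\n' one TWO three four > "$tmpdir/newfile"

# NR==FNR is true only while awk reads the first file: each line is stored
# as an array key and "next" skips the rest of the program.
# For the second file, a line is printed only if it is not a key in the array.
new_lines=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen)' \
    "$tmpdir/oldfile" "$tmpdir/newfile")
printf '%s\n' "$new_lines"

rm -rf "$tmpdir"
```

Here the output is TWO and four, in the order they appear in newfile, and no
sorting is needed. One caveat: if oldfile is empty, NR==FNR also holds while
reading newfile, so nothing is printed.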

-- 
D.


