bug-coreutils

Re: UNIX join command bug


From: James Youngman
Subject: Re: UNIX join command bug
Date: Thu, 21 Aug 2008 21:57:29 +0100

On Thu, Aug 21, 2008 at 4:45 PM, Guillaume Smits <address@hidden> wrote:
> Dear GNU,
>
>
> I have two files exactly identical composed of:
>
> 6 Fields, tab separated, with a /n

That would be \n - I assume you mean ASCII LF.

> at the end of the line, sorted
> numerically on the key identifier (field #2).
>
>
> Here is the head of the files:
>
>
> File1
>
> CHR     SNP     A1      A2      MAF     NCHROBS
> 13      rs4     G       A       0.0648148       216
> 7       rs8     T       C       0.166667        216
> 7       rs16    T       C       0.475962        208
> ...
>
>
> File2
>
> CHR     SNP     A1      A2      MAF     NCHROBS
> 7       rs8     A       G       0.215674        9876
> 7       rs16    G       A       0.477102        9870
> 7       rs19    G       A       0.385628        9880
> ...
>
>
>
> The first file is ~ 1,400,000 lines long
>
> The second file is ~ 330,000 lines long

You're not making it easy for people to help you. You don't
indicate what version of coreutils you are using. You don't provide
a minimal example. You just tell us you have two vast inputs, which
you won't show us, that don't join in the way you expect.



> When I perform a very simple join command as follows:
>
> Join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
>
>
> I obtain a joinedfile of ~213.000 lines in place of the expected
> ~322.000 lines (65% of the lines).
>
> The lines missing are scattered everywhere in the original files (at the
> beginning, middle or end). There is also no logic to find while
> considering the SNP identifier of the missing lines.
>
>
>
> For example a line which is missing is the following one:

This is not a helpful example; 99% of join problems are caused by
out-of-order input and you haven't provided a complete example that
demonstrates the problem so that we can eliminate that possibility.


> I can't find any difference between the files (e.g., no hidden
> characters) or the key identifiers. The files are sorted in the same
> way, tabulated in the same way,...

My guess is that this is not actually the case.

> The only difference is the number of lines (1.4 million in file 1; 300
> thousands in file 2). While big, these line numbers should not be a
> limiting factor to the join command... (and why would be the missing
> line scattered all along the files?)
>
>
> Using a Perl script to print lines having the same field 2 identifier, I
> obtain the ~322,000 lines expected proving that it is nearly surely a
> join command bug.
>
>
>
> Question: Is there any trivial (or less trivial) explanation to this
> join command bug?

Operator error? Try coreutils 6.11, which should notify you if
the input is out of order - see the Info documentation for details.
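
As a minimal sketch of what "in order" means here: join expects both
inputs sorted on the join field in the collation order of the current
locale, not numerically. Keys such as rs4, rs8, rs16 are in numeric
order, but compared as strings rs16 sorts before rs4. The example
below assumes GNU sort and join, tab-separated fields and no header
line; the file names and the three-line samples are made up for
illustration and are not the poster's data.

```shell
# Work in a scratch directory; all file names here are illustrative.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
TAB=$(printf '\t')

# Tiny samples in numeric key order (rs4, rs8, rs16); header lines omitted.
printf '13\trs4\tG\tA\n7\trs8\tT\tC\n7\trs16\tT\tC\n' > file1.txt
printf '7\trs8\tA\tG\n7\trs16\tG\tA\n7\trs19\tG\tA\n' > file2.txt

# Re-sort both files on field 2 as plain bytes (LC_ALL=C), so that
# sort and join agree on the ordering of the keys.
LC_ALL=C sort -t "$TAB" -k2,2 file1.txt > file1.sorted
LC_ALL=C sort -t "$TAB" -k2,2 file2.txt > file2.sorted

# Join on field 2 of each file; --check-order asks GNU join to fail
# with a diagnostic if either input turns out to be unsorted.
LC_ALL=C join --check-order -t "$TAB" -1 2 -2 2 \
    file1.sorted file2.sorted > joined.txt
```

If the real files carry a header line, it would need to be removed
(or handled separately) before sorting, since it would otherwise be
sorted in among the data.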

James.



