Re: GNU coreutils' `join` behaves differently on macOS versus Linux?
From: Robert Leach
Subject: Re: GNU coreutils' `join` behaves differently on macOS versus Linux?
Date: Tue, 11 Jul 2023 17:36:26 -0400
Thanks! That is very helpful in that it confirms my sanity. I didn't know
about the locale stuff.
I have more info, since I spent yesterday working around this issue (but
hadn't found a complete solution yet). I have it all working consistently now.
I noted some interesting things that may help explain the behavior, since
it *may*(?) not be entirely attributable to the sort strategies...
- The sort error seemed to be fatal on Linux, since join never output
anything there. But once I added `set -euxo pipefail`, join output nothing on
either system, and it exited non-zero on both (according to the command
echo output).
- Adding `--nocheck-order` then allowed join on both OSes to generate output
and exit successfully, but the output on Linux unexpectedly included the
comment at the top of the file (which wasn't included in the macOS output, and
which caused problems downstream of the join command).
Then, once I added a `grep -v "#"` to filter the input files that join consumed,
join started generating the same output on both systems.
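For reference, the workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two miniature input files and their names are made up (modeled on the featureCounts sample, with fewer columns), and `grep -v '^#'` is a slightly stricter variant of the `grep -v "#"` mentioned above, since it only drops lines that start with `#`.

```shell
# Hypothetical miniature inputs; the real files have 7 columns and many more peaks.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf '# Program:featureCounts (comment line for sample A)\nGeneid\tChr\tCount\npeak9\t19\t39\npeak10\t19\t194\n' > "$tmp/a.tsv"
printf '# Program:featureCounts (comment line for sample B)\nGeneid\tChr\tCount\npeak9\t19\t34\npeak10\t19\t228\n' > "$tmp/b.tsv"

# Drop the leading comment line from each input before join sees it.
grep -v '^#' "$tmp/a.tsv" > "$tmp/a.clean.tsv"
grep -v '^#' "$tmp/b.tsv" > "$tmp/b.clean.tsv"

# Join on column 1 with the order check disabled (GNU join option).
# This relies on both files listing the same keys in the same order.
joined=$(join --nocheck-order -t "$tab" -1 1 -2 1 -o 1.1,1.3,2.3 \
  "$tmp/a.clean.tsv" "$tmp/b.clean.tsv")
printf '%s\n' "$joined"
```

Skipping the order check is only reasonable here because the upstream tool emits the same rows in the same order for every sample.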
The upstream command (featureCounts) reliably generates its output in the same
order with the same number of lines, so we didn't really need the order check.
Otherwise, I would have added the locale setting you suggested, but it's
definitely good to know.
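If the order check were needed, the locale-based alternative might look something like the sketch below. It is an illustration, not what the workflow actually runs: the file names and the two-column data are invented, the `sort_body` helper is hypothetical, and it sorts on the whole first field with `-k1,1` in the C locale so that sort and join agree on the collation order.

```shell
# Hypothetical two-column inputs with a header line and no comment line.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'Geneid\tCount\npeak9\t39\npeak10\t194\npeak2\t1\n' > "$tmp/a.tsv"
printf 'Geneid\tCount\npeak9\t34\npeak10\t228\npeak2\t12\n' > "$tmp/b.tsv"

# Keep the header line first and sort only the body on field 1 in the C locale.
sort_body() {
  head -n 1 "$1"
  tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1
}
sort_body "$tmp/a.tsv" > "$tmp/a.sorted.tsv"
sort_body "$tmp/b.tsv" > "$tmp/b.sorted.tsv"

# --header (GNU join) prints the first line of each file as a header
# without trying to pair it with the data lines.
sorted_join=$(LC_ALL=C join --header -t "$tab" -o 1.1,1.2,2.2 \
  "$tmp/a.sorted.tsv" "$tmp/b.sorted.tsv")
status=$?   # exit status of join itself
printf '%s\n' "$sorted_join"
```

Running both sort and join under the same `LC_ALL=C` is the point of the sketch: the rows come out in byte order (peak10 before peak2 before peak9) rather than numeric order, but the result should not depend on the host's default locale.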
As a sample, here is the top of one of the input files that was going into the
join command. All of the input files differ only in column 7:
```
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "-T" "4" "-g" "peak" "-t" "peak" "-F" "GTF" "--fracOverlap" "0.3" "-a" "results/all_atac_peaks.gtf" "-o" "results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv" "results/sorted_atac_alignments/SRR17656980_19_60m_end.bam"
Geneid Chr Start End Strand Length results/sorted_atac_alignments/SRR17656980_19_60m_end.bam
peak1 19 9991 10388 . 398 22
peak2 19 13446 13578 . 133 1
peak3 19 18765 19285 . 521 1072
peak4 19 19695 19840 . 146 457
peak5 19 24499 24928 . 430 1086
peak6 19 33839 34320 . 482 169
peak7 19 34412 34567 . 156 36
peak8 19 34653 35197 . 545 212
```
Rob
Robert William Leach
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544
> On Jul 11, 2023, at 5:48 AM, Pádraig Brady <P@draigBrady.com> wrote:
>
> On 10/07/2023 22:26, Robert Leach via GNU coreutils General Discussion wrote:
>> Hi,
>> I wanted to ask about the `join` utility in `coreutils` 9.3. I'm building a
>> snakemake workflow and am debugging an error that only occurs when the
>> workflow is run on a linux system. I have narrowed the difference down to
>> the `join` utility provided by the `coreutils` conda package. An error is
>> produced on both systems, but since my script had not set `set -euxo
>> pipefail`, the error was silent. On linux, this produced an error in the
>> workflow rule that executes after the one that uses the join utility,
>> because the input file was empty.
>> So I manually ran the join command and noticed the difference in behavior on:
>> macOS:
>> ```
>> (coreutils) gen-rl-imac[2023-07-10
>> 17:01:59]:...CT-LOCAL/YURI/ATACC/REPOS/ATACCompendium$ join -1 1 -2 1 -o
>> 1.1,1.7,2.7 -t ' '
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
>>
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
>> Geneid results/sorted_atac_alignments/SRR17656980_19_60m_end.bam
>> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
>> peak1 22 28
>> peak2 1 12
>> peak3 1072 1637
>> peak4 457 942
>> peak5 1086 1507
>> peak6 169 67
>> peak7 36 85
>> peak8 212 198
>> join:
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
>> is not sorted: peak10 19 39038 39248 . 211 194
>> join:
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:12:
>> is not sorted: peak10 19 39038 39248 . 211 228
>> peak9 39 34
>> peak10 194 228
>> peak11 2178 2778
>> ...
>> join: input is not in sorted order
>> ```
>> and linux:
>> ```
>> (coreutils) [rleach@argo-comp2 ATACCompendium]$ join -1 1 -2 1 -o
>> 1.1,1.7,2.7 -t ' '
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
>>
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
>> join:
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
>> is not sorted: peak10 19 39038 39248 . 211 194
>> join:
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:2:
>> is not sorted: Geneid Chr Start End Strand Length
>> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
>> join: input is not in sorted order
>> ```
>> Is this a bug in either the macOS or linux versions of the coreutils join
>> utility, a known issue, or what?
>
> Well the output from join(1) is giving ample clues
> that the input files aren't sorted appropriately.
>
> Details:
>
> The above should be warnings and not impact the exit status of the join
> process.
>
> The difference in output from Linux and MacOS is probably due to locale
> settings.
> Note how "Geneid" is the first disorder on your Linux system, which suggests
> MacOS is using the C locale, while your Linux system is using en_US or
> equivalent.
> So you may get better consistency with the join --header option,
> and that may be enough to address all your issues.
>
> If --header doesn't suffice, you may need to `LC_ALL=C sort -k1.5n` your
> input files
> before passing to join.
>
> If that doesn't suffice, you may get desired operation with the
> --nocheck-order option.
>
> cheers,
> Pádraig