bug-parallel

GNU Parallel Bug Reports GNU parallel with NCBI-BLASTP


From: Andrew James Alverson
Subject: GNU Parallel Bug Reports GNU parallel with NCBI-BLASTP
Date: Tue, 12 Jul 2016 22:05:25 +0000

Hello,

I’ve been using GNU parallel to run NCBI-BLASTP, a bioinformatics program for finding similarity between protein sequences. If you’d like some background, my usage is similar to what is described here: https://www.biostars.org/p/88624/

BLASTP outputs a tab-delimited text file, and I’ve recently discovered that my output can get corrupted in two ways when using GNU parallel. First, each line of output should contain 16 columns. Most lines do, but a small number contain anywhere from 3 to 31 columns. Second, in some instances the data within the columns is itself corrupted. I’ve only explored this for one particular column, which another program flagged as erroneous and which first alerted me to these issues. The issues are generally repeatable, though the exact errors are not always the same between runs. I’ve gotten similar results on two different hardware environments. I have confirmed that the issue is not with BLASTP itself: the problems appear only when I run it through GNU parallel.

I initially found the problem using parallel v. 20150822. The problem still exists with parallel v. 20160622. Here is the output of parallel --version:

parallel_test $ /share/apps/parallel/20160622/bin/parallel --version
GNU parallel 20160422
Copyright (C) 2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.


When using programs that use GNU Parallel to process data for publication
please cite as described in 'parallel --citation'.
parallel_test $ 

To reproduce the error, you can download the necessary files from here: https://drive.google.com/folderview?id=0B2Vc98jiSniOeWtRcDI3blVGcUE&usp=sharing

Then do the following:

1. You need the BLAST package, which can be downloaded from here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.3.0/
2. Unzip Species0.fa.gz, then run BLAST using the four downloaded files (your paths to the programs will be different):

cat Species0.fa | /share/apps/parallel/20160622/bin/parallel --will-cite --progress --recstart '>' --pipe /share/apps/blast/ncbi-blast-2.3.0+/bin/blastp -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore' -evalue 0.001 -num_threads 4 -db BlastDBSpecies0 -out Blast0_0.latest_parallel.txt

3. All of the names in columns 1 and 3 should have the form [0-9]*_[0-9]. One problem is that some of these names get corrupted, taking the form [0-9]*_[0-9]*_[0-9]. I look for this with awk:

awk '$1~/[0-9]*_[0-9]*_[0-9]/ || $3~/[0-9]*_[0-9]*_[0-9]/' Blast0_0.latest_parallel.txt

4. Each line should have 16 columns. I check the column count with Perl, printing every line that does not have exactly 16 fields:

perl -anwe '@F == 16 or print' Blast0_0.latest_parallel.txt
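
As a rough complement to steps 3 and 4, the two checks can be sketched as a pair of shell functions. The function names column_histogram and bad_names are hypothetical, and the stricter name pattern used here (digits, an underscore, then digits, and nothing else) is an assumption about the ID format in this data set, not something BLAST guarantees.

```shell
# Hypothetical helper: tally how many lines have each column count in a
# tab-delimited file. An uncorrupted run with the 16-field output format
# used above should report a single row for 16 columns.
column_histogram() {
    awk -F'\t' '{ counts[NF]++ }
                END { for (n in counts) print n "\t" counts[n] }' "$1" | sort -n
}

# Hypothetical helper: print "line number: line" for every line whose
# column 1 or column 3 is not exactly digits_digits. The anchored pattern
# is an assumption about the ID format; lines too short to have a third
# column are reported as well.
bad_names() {
    awk -F'\t' '$1 !~ /^[0-9]+_[0-9]+$/ || $3 !~ /^[0-9]+_[0-9]+$/ {
                    print NR ": " $0
                }' "$1"
}
```

For example, column_histogram Blast0_0.latest_parallel.txt prints one row per distinct column count, which shows at a glance how many lines deviate from 16, and bad_names Blast0_0.latest_parallel.txt gives the line numbers of the suspect names.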

I realize this is a lengthy report. Thanks for your time and any suggestions you might have.

Andy


