A BLAST array job for the SANBI cluster

If you want to query a BLAST database with a large number of input query sequences, you might want to use this script. The easy way to gain speed for a BLAST search is to split the input set of query sequences (using a script such as split_fasta.py or (if the sequences don’t contain linebreaks) you can use split or you can use csplit) into multiple parts and run the BLAST search as an array job. For this script, you need a working directory containing these subdirectories:

in/ - a directory containing your split queries in files named *.fasta
out/ - an empty output directory
logs/ - an empty log directory

Tune your splitting for efficiency: if your queries are too small, the time to start running will make the search inefficient. If your queries are too large, the jobs will run too long – remember that the timelimit on the default all.q is 8 hours.

Uljana and I wrote the script below to actually run the array job. If your working directory was, for example, /cip0/research/rosemary/blast and you saved this script as blastplus_array.sh and you have 20 input query files, then you could submit the script with:

qsub -wd /cip0/research/rosemary/blast -t 1-20 blastplus_array.sh

Note that each use can have at most 20 jobs running on the cluster at any one time, so your queries will run in blocks of 20 jobs at a time. The raw source code is available and easier to copy than the listing below. Also note that you probably want to customise the actual BLAST command line (at the end of the script). The one in here was designed to pick up taxonomy information from the local install of the NR database – useful for doing a metagenomic scan.

#!/bin/sh

# requirement:
# working directory with:
# in/ - files named .fasta that are query sequences
# out/ - empty directory to put outputs in
# logs/ - empty directory to put logs in 
#
# qsub with:
# qsub -t 1-2 -wd ./my-work-dir blastplus_array.sh

#$ -o logs/$JOB_NAME.o$JOB_ID.$TASK_ID
#$ -e logs/$JOB_NAME.e$JOB_ID.$TASK_ID


### -----
### define input and output directories

base_dir=`pwd`
in_dir=$base_dir/in
out_dir=$base_dir/out

cd $in_dir

### -----
### get all the file names into a file

filelist=../logs/file_list.txt
ls *.fasta > $filelist

### -----
### access the fasta files by the ${SGE_TASK_ID}

fasta=`awk "NR == ${SGE_TASK_ID} {print}" $filelist` # ${file_list[$counter]}
echo $fasta

### -----
### add the blast module and run blast

. /etc/profile.d/module.sh
module add blastplus/default

blastn -query $in_dir/$fasta -db nt -out $out_dir/$fasta.txt -outfmt "6 std slen qlen qcovs qcovhsp staxids sscinames sskingdoms" -soft_masking false -max_target_seqs 3 -evalue 10