How to submit a job to the SANBI computing cluster

I keep trying to finish the documentation about our computing cluster at SANBI and SGE (Sun Grid Engine)[1] and how to run jobs. In the meantime, however, here’s a quick guide to running a job on the cluster.

Firstly, the structure of the cluster. Our storage, for now, is provided by a single storage server, which means that your home directory and the /cip0 storage area are visible from every node. We still need to implement better research data management practices, but for now you should do your work in a scratch directory and store your results in your research directory. I’m not going to say more about that here because the system is in flux.
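
A rough sketch of that pattern, with made-up paths (myproject is just a placeholder; check what the current layout should be before relying on this):

mkdir -p /cip0/scratch/$USER/myproject    # do the work in a scratch directory (placeholder path)
cd /cip0/scratch/$USER/myproject
# ... run your analysis here ...
cp -r results ~/myproject-results         # copy the final results back to your research directory (placeholder path)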

Secondly, the cluster has a number of compute nodes and a single submit node. The submit node is queue00.sanbi.ac.za, so log in there to submit your job. It is a smallish virtual machine, so don’t run anything substantial on the submit node!
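
For example, replacing username with your own SANBI username:

ssh username@queue00.sanbi.ac.za    # log in to the submit node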

So let’s imagine that you want to run a tool like fastqc on the cluster. First, is the tool available? We use a system called environment modules to manage the available software. This allows us to install software in a central place and just add the relevant environment variables when you need to run a particular tool. The module avail command lists the available modules, so module avail 2>&1 | grep fastqc will show us that fastqc is indeed available.
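
For example, on the submit node (the module names and versions you see will differ):

module avail                       # list everything that is installed
module avail 2>&1 | grep fastqc    # check for fastqc specifically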

Next we need a working directory (e.g. /cip0/scratch/pvh) and a script that will run the command. Here is a little script (let’s imagine it is called run_fastqc.sh):

#!/bin/sh

# make the module command available in this non-interactive shell
. /etc/profile.d/module.sh

# put fastqc (and FASTQC_HOME) into our environment
module add fastqc

# fastqc won't create its output directory, so create it if it doesn't exist
if [ ! -d out ] ; then
    mkdir out
fi

# run fastqc on data.fastq with a single thread, writing its report into ./out
fastqc -t 1 -out `pwd`/out -f fastq data.fastq

The . /etc/profile.d/module.sh line ensures that the module command is available. The module add fastqc line adds fastqc to our path so that it is available to our script. You can use module add outside a script if you want to examine how to run a command. It also sets a variable, FASTQC_HOME, pointing to where the command is installed, so you can ls $FASTQC_HOME to see if there is perhaps a README or other useful data in that directory.
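
For instance, to poke around interactively before writing your script (the exact output depends on the installed version):

module add fastqc
fastqc --help       # check the usage and options
ls $FASTQC_HOME     # look for a README or other useful files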

Then fastqc needs you to create the output directory; it won’t do it itself, so the script creates a directory named out under the current directory. Now you need to send the script to the cluster’s scheduler:

qsub -wd $(pwd) -q all.q -N myfastqc run_fastqc.sh

This will send run_fastqc.sh to the scheduler and tell it to run it on all.q (which happens to be the default queue) with the job named myfastqc. This queue has a time limit of 8 hours, so if you need to run for longer than that you need to specify -q long.q instead. The long.q queue has no time limit but fewer CPUs available. The -wd flag sets the job’s working directory, in this example the directory you are in when you submit the job.
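
So a longer-running job would be submitted with something like:

qsub -wd $(pwd) -q long.q -N myfastqc run_fastqc.sh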

You can check the status of your job with qstat. Job output is, by default, written into the working directory in two files, one for stderr and the other for stdout.
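
With the job name used above and Grid Engine’s default output naming, that looks something like this (the numeric job ID will differ):

qstat                          # list your queued and running jobs
ls myfastqc.o* myfastqc.e*     # e.g. myfastqc.o1234 (stdout) and myfastqc.e1234 (stderr)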

There’s much more to say about the cluster and the use of the qsub, qstat, qacct and qhost commands, but this should be enough to get you started with your first cluster job. The rest will have to wait till I’ve got time to write more extensive documentation.
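
In the meantime, here are two of those in brief (assuming your job’s ID was 1234):

qhost             # show the compute nodes and their current load
qacct -j 1234     # resource usage and accounting for a finished job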

Footnotes

[1] Sun Grid Engine is part of the Grid Engine family of job schedulers that has undergone a complex evolution in recent years due to Sun’s takeover by Oracle and the subsequent forking of the codebase. See the Grid Engine Wiki for details.

Adventures in Galaxy output collections

For the Galaxy IUC Tools and Collections codefest we (the SANBI software developers) decided to take on what we thought would be a simple job: make the bamtools_split tool output a dataset collection instead of multiple datasets. So here’s the output clause of the old (multiple datasets) version of bamtools_split:

  <outputs>
    <data format="txt" name="report" label="BAMSplitter Run" hidden="true">
      <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam" visible="true"/>
    </data>
  </outputs> 

and this needed to change to:

  <outputs>
    <collection name="report" type="list" label="BAMSplitter Run">
        <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam"/>
    </collection>
  </outputs>

In other words, the <data> element just gets changed to a <collection> element and the <discover_datasets> element remains essentially the same. So we made this change and everything ran fine, except that the output collection was empty. Why?

Lots of debugging followed, based on a fresh checkout of the Galaxy codebase. We discovered that the crucial function here is collect_dynamic_collections() in the galaxy.tools.parameters.output_collect module. This is called by the finish() method of the Job class, via the Tool class’s method of the same name.

The collect_dynamic_collections function identifies output collections in a tool’s definition and then uses a collection builder to map job output files to a dataset collection type. The collection builder is a factory class defined in galaxy.dataset_collections.builder and each dataset collection type (defined in galaxy.dataset_collections.builder.types) has its own way of moving output elements into the members of a collection type.

Anyway, we traced this code all the way through to the point where it was obvious the dataset collection was being created successfully and then turned to the other Galaxy devs (John Chilton specifically) to ask for help, only to discover that the problem was gone. The dataset collection was somehow populated! It turns out that if your Galaxy tool creates an output dataset collection with an uncertain number of members (like a list collection), it is populated asynchronously and you need to refresh the history to see its members – this is a known bug.

So that’s been quite a learning curve. The final tool is on GitHub. The collection tag for outputs was introduced above. We haven’t explored its pair mode, but check out Peter Briggs’ trimmomatic tool, which has an option to output a pair-type dataset collection.

In the test section of the tool configuration, you can use a dataset collection like this:

<test>
    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
      <element name="MAPPED" file="bamtools-split-MAPPED1.bam" />
      <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam" />
    </output_collection>
</test>

The output_collection tag essentially groups outputs together, with each element tag taking the place of an individual output tag. Each element tag has a name that maps to one of the names identified by the discover_datasets pattern (perhaps index numbers can be used instead of names, I don’t know) and can use the test attributes that output provides.

With the tests updated and some suitable sample data in place, the tests pass and the tool is ready for a pull request. There was some discussion, though, on the semantics of this tool… for more, go and read the comments on the PR.