MegaRAID write cache policy with lsmcli

A couple of weeks ago I had a near disaster when some of our servers lost power while their RAID controllers were set to "write-back" caching despite having non-working batteries. The result was filesystem corruption and failure on 2 out of 3 Ceph monitor servers. In the past I have written about using MegaCLI for RAID admin. MegaCLI has been replaced by StorCLI, which I found here on the Broadcom web pages. I unpacked the various zip files until I got the storcli-007.0309.0000.0000-1.noarch.rpm RPM and installed that to get the MegaRAID storcli64 tool. Instead of using that directly, though, I'm using the libstoragemgmt tools with the lsmcli tool. On CentOS 7 this required installing libstoragemgmt and libstoragemgmt-megaraid-plugin and starting the lsmd daemon with systemctl start libstoragemgmt.

With this all set up, I found the volume with lsmcli -u megaraid:// list --type VOLUMES:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// list --type VOLUMES
ID                               | Name | SCSI VPD 0x83                    | Size         | Disabled | Pool ID | System ID | Disk Paths
6003048016dfd2001cf1d19f0af655a3 | VD 0 | 6003048016dfd2001cf1d19f0af655a3 | 597998698496 | No       | :DG0    |           | /dev/sda  

then the volume-cache-info command:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-cache-info --vol  6003048016dfd2001cf1d19f0af655a3
Volume ID                        | Write Cache Policy | Write Cache | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Back         | Write Back  | Enabled           | Enabled    | Use Disk Setting   

and set the policy to AUTO (which means write-back when the battery is ok, write-through otherwise):

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-write-cache-policy-update --vol  6003048016dfd2001cf1d19f0af655a3 --policy AUTO
Volume ID                        | Write Cache Policy | Write Cache   | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Through      | Write Through | Enabled           | Enabled    | Use Disk Setting   

There doesn’t seem to be a direct way to query the battery backup unit (BBU) with lsmcli but /opt/MegaRAID/storcli/storcli64 show will show you what the status is.

MaterializeCSS vs ReactJS: the case of the select

For historical reasons, the COMBAT TB web interface uses materialize for its styling. So far so good. That is, until I tried to deploy my code, written with ReactJS. See, as I understand it, materialize has in some cases decided to replace some HTML elements with its own versions of them, notably the <select> element. And ReactJS relies on these elements for its own operation.

The first problem I had was that <select> elements vanished. Turns out you need a bit of Javascript to make them work:

$(document).ready(function() {
    $('select').material_select();
});

The next problem, however, was that the onChange handlers that ReactJS uses don’t trigger events. Luckily that has been discussed before.

I’ve got two types of <select> element in the code I was writing for the COMBAT TB Explorer web application: static ones (there from the birth of the page) and dynamically generated ones. For the static ones I added some code to link up the events in the componentDidMount handler:

componentDidMount: function() {
    $(document).ready(function() {
        $('#modeselectdiv').on('change', 'select', null, this.handleModeChange);
        $('#multicompselectdiv').on('change', 'select', null, this.handleMultiCompChange);
    }.bind(this));
},

but this didn’t work for the dynamically generated elements, I think because they are only rendered after an AJAX call returns. Since I know a state change triggers a render, I added the handler hook-up after the data was returned and deployed (to the application’s state), for example:

success: function(datasets) {
    var dataset_list = [];
    var dataset_list_length = datasets.length;
    for (var i = 0; i < dataset_list_length; i++) {
        dataset_list.push({'name': datasets[i]['name'], 'id': datasets[i]['id']});
    }
    this.setState({datasets: dataset_list, dataset_id: dataset_list[0].id});
    $('#datasetselectdiv').on('change', 'select', null, this.handleDatasetChange);
}.bind(this)

Turns out this works. The state change handlers are now linked in, they keep the state up to date with what the user is doing on the form, and the whole thing (that links the application to a Galaxy instance) works. Yay!

Making Ubuntu 14.04 and CentOS 7 NFS work together

I just spent a frustrating morning configuring our servers to talk NFS to each other properly. So we have:

1) NFS servers (ceph-mon1 and so on) running CentOS 7.
2) NFS clients (gridj1 and so on) running Ubuntu 14.04.

The first problem: RBD mounting and NFS startup were not configured on the servers. I fixed that by adding entries to /etc/ceph/rbdmap and enabling the rbdmap and nfs-server services using systemctl enable. I also used e2label to label the ext4 filesystems in the RBDs and then used those labels in /etc/fstab instead of device names, adding the _netdev mount option because these devices are network devices.
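For example (the pool, image, label and mount point names here are hypothetical), the rbdmap entry and the matching fstab line look something like this:

```
# /etc/ceph/rbdmap: poolname/imagename followed by map options
rbd/nfsdata  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# /etc/fstab: mount by ext4 label, with _netdev since the RBD is a network device
LABEL=nfsdata  /srv/nfsdata  ext4  defaults,_netdev  0 0
```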

The second problem: I had to add the insecure option to the exports in /etc/exports. This is because the mount request comes from a port higher than 1024, a so-called insecure port. And then exportfs -r to resync everything.
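An export line with the insecure option might look like this (the path and client network are hypothetical):

```
/srv/nfsdata  192.168.1.0/24(rw,sync,insecure,no_subtree_check)
```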

And the third problem: Ubuntu autofs makes a NFS4 mount request by default (even though I had specified nfsvers=3 in the mount options), and I haven’t configured NFS4’s authenticated mounts, so I was getting authenticated mount request from messages in /var/log/messages on the NFS server. I switched the NFS server to not do NFS4 by adding --no-nfs-version 4 to the RPCNFSDARGS variable in /etc/sysconfig/nfs on the server, restarted the NFS server (systemctl restart nfs-server) and the mounts finally worked.
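The relevant fragment of /etc/sysconfig/nfs on the server then reads:

```
# disable NFSv4 so that the Ubuntu clients fall back to v3 mounts
RPCNFSDARGS="--no-nfs-version 4"
```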

Finally, documented this here for posterity…

Faster Galaxy with uWSGI

I recently switched our local Galaxy server to run using uWSGI and supervisord instead of the standard script (which uses Paste under the hood). I followed the Galaxy scaling guide and it was pretty accurate except for a few details. I won’t be showing the changes to Galaxy config files; they are exactly as related on that page.

I installed supervisord by doing pip install supervisor in the virtualenv that Galaxy uses. Then I put a supervisord.conf in the config/ directory of our Galaxy install and it starts like this:

[inet_http_server]
port =

[supervisord]

[supervisorctl]
The [inet_http_server] section directs supervisord to listen on localhost port 9001. The following two sections, [supervisord] and [supervisorctl] need to be present but can be empty. The rest of the configuration is as per that on the Scaling page with a few changes I’ll explain below:

[program:galaxy_uwsgi]
command         = /opt/galaxy/.venv/bin/uwsgi --plugin python --ini-paste /opt/galaxy/config/galaxy.ini --die-on-term
directory       = /opt/galaxy
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 10
user            = galaxy
environment     = PATH=/opt/galaxy/.venv:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin,PYTHON_EGG_CACHE=/opt/galaxy/.python-eggs,PYTHONPATH=/opt/galaxy/eggs/PasteDeploy-1.5.0-py2.7.egg,SGE_ROOT=/var/lib/gridengine
numprocs        = 1
stopsignal      = TERM

[program:handler]
command         = /opt/galaxy/.venv/bin/python ./scripts/ serve config/galaxy.ini --server-name=handler%(process_num)s --pid-file=/opt/galaxy/handler%(process_num)s.pid --log-file=/opt/galaxy/handler%(process_num)s.log
directory       = /opt/galaxy
process_name    = handler%(process_num)s
numprocs        = 2
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 15
user            = galaxy
environment     = PYTHON_EGG_CACHE=/opt/galaxy/.python-eggs,SGE_ROOT=/var/lib/gridengine

The SGE_ROOT is necessary because our cluster uses Sun Grid Engine and the SGE DRMAA library requires this environment variable. Otherwise this config uses uWSGI installed (using pip) in the virtualenv that Galaxy uses.

This snippet of nginx configuration shows what was commented out and what was added to link nginx to uWSGI:

#proxy_set_header REMOTE_USER $remote_user;
#proxy_set_header X-Forwarded-Host $host;
#proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
#proxy_set_header X-URL-SCHEME https;
#proxy_pass http://galaxy_app;
uwsgi_param UWSGI_SCHEME $scheme;
include uwsgi_params;

Then, how to start and stop it all? Firstly, the supervisord config. The basis for this was the debian-norrgard script from the supervisord initscripts repository; the final script is in this gist. Note the lines that link supervisord to the Galaxy settings. Then /etc/init.d/galaxy is in this gist. It depends on the supervisord startup script and starts and stops Galaxy using supervisorctl.

Two things remain unsatisfactory:

  1. The shutdown of Galaxy doesn’t work reliably. The use of uWSGI’s --die-on-term and stopsignal = TERM in the supervisord.conf is an attempt to remedy this.

  2. The uWSGI config relies on the PasteDeploy egg. This exists on our Galaxy server because it was downloaded by the historical Galaxy startup script. With the switch towards wheel based (instead of egg based) packages, this script is no longer part of a Galaxy install. The uWSGI settings might need to be changed because of this, however, the PasteDeploy package is installed in the virtualenv that Galaxy uses, so perhaps no change is necessary. I haven’t tested this.

With these limitations, however, our Galaxy server is working and much more responsive than before.

Automatically commit and push IPython notebook

I’m currently teaching Python at a Software Carpentry workshop at North West University in Potchefstroom. As always there are concerns about pace and about how people can catch up if they fall behind. In a recent discussion on this topic on the Software Carpentry mailing list, David Dotson mentioned that he commits his IPython notebooks by pressing a custom keyboard shortcut which triggers an automatic git add/commit/push. No code was available, but I poked around a bit and found this StackOverflow question and answer which showed how to add a post-save hook to an IPython notebook (with details on doing the same for the newer Project Jupyter notebooks).

So here’s my code:

import os
from subprocess import check_call
from shlex import split

def post_save(model, os_path, contents_manager):
    """post-save hook for doing a git commit / push"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    workdir, filename = os.path.split(os_path)
    if filename.startswith('Scratch') or filename.startswith('Untitled'):
        return # skip scratch and untitled notebooks
    # now do git add / git commit / git push
    check_call(split('git add {}'.format(filename)), cwd=workdir)
    check_call(split('git commit -m "notebook save" {}'.format(filename)), cwd=workdir)
    check_call(split('git push'), cwd=workdir)

c.FileContentsManager.post_save_hook = post_save

This code obviously assumes that your working directory is a git repository and it has been configured with a remote to push to. For this workshop my notebooks are in this git repo on GitHub.

I created a new IPython profile (ipython profile create swcteaching) for use while teaching and added that code to the profile’s file. You can find this file’s location with ipython profile locate swcteaching.

The one little niggle is that the commit message is always the same. I don’t know IPython’s front-end code well enough, but perhaps there is a way to pop up a window and request a commit message (going towards something more like David Dotson’s solution and less like mine).

docker -G and non-local groups

So we (like many other labs) store our user identity information in LDAP. I created a docker group in LDAP so that its membership is valid across our cluster. When I tried to run a docker command, however, I got this error:

Get http:///var/run/docker.sock/v1.20/containers/json: dial unix /var/run/docker.sock: permission denied.
* Are you trying to connect to a TLS-enabled daemon without TLS?
* Is your docker daemon up and running?

Turns out that /var/run/docker.sock was owned by root:root, not root:docker as expected. Running docker in debug mode I saw this message:

DEBU[0000] Warning: could not change group /var/run/docker.sock to docker: Group docker not found 

After a bit of poking around and verifying that the group did exist, I came across the code in unix_socket.go. To make a longish (lines 41-83) story short, docker relies on libcontainer for its user/group lookups, and these parse the /etc/group file directly, ignoring nsswitch.conf (and thus identity providers like LDAP).
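The difference between the two lookup styles is easy to demonstrate; here is a small sketch (not docker's actual code) contrasting a direct parse of /etc/group with Python's grp module, which goes through NSS and therefore sees LDAP groups:

```python
import grp

def group_in_etc_group(name, path='/etc/group'):
    """Mimic libcontainer's lookup: scan /etc/group line by line,
    bypassing nsswitch.conf (and therefore LDAP)."""
    with open(path) as group_file:
        for line in group_file:
            if line.split(':')[0] == name:
                return True
    return False

# grp.getgrnam() resolves via NSS, so an LDAP-only group is found there
# but not by the file parse above: exactly docker's problem.
print(grp.getgrnam('root').gr_gid)   # NSS lookup
print(group_in_etc_group('root'))    # direct file parse
```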

If you use the numeric gid (docker daemon -G 555 for instance) then you get some strange messages in the log (example is from debug mode):

WARN[0000] Could not find GID 555                       
DEBU[0000] 555 group found. gid: 555

but the ownership of the docker Unix socket is set as expected.

How Galaxy resolves dependencies (or not)

There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML that specifies a mapping between the Galaxy web user interface and the tool command line and tool dependencies that specify how to source the actual packages that implement the tool’s commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI’s local Galaxy installation.

Requirements and resolvers

In the tool XML dependencies are specified using <requirement> clauses, for example:

<requirement type="package" version="">bwa</requirement>

This is taken from the bwa tool XML that I installed from the toolshed and specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however – this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:

<requirement type="package">bedtools</requirement>

These dependencies are fed into the dependency manager (DependencyManager in lib/galaxy/tools/deps/ which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).

The default config

Galaxy provides a configuration file, config/dependency_resolvers_conf.xml to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:

<dependency_resolvers>
  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />
</dependency_resolvers>

These aren’t the only available resolvers – there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules – but what is shown above is the default.

A quick note: if all the resolvers fail, you’ll see a message such as:

Failed to resolve dependency on 'bedtools', ignoring

in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy’s PATH.
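The chain behaves like a first-match-wins lookup; here is a minimal sketch of that behaviour (the resolver functions and paths are hypothetical, not Galaxy's actual classes):

```python
def resolve(requirement, resolvers):
    """Try each resolver in order; the first one that can satisfy the
    requirement wins, otherwise the dependency is unresolved (None)."""
    for resolver in resolvers:
        commands = resolver(requirement)
        if commands is not None:
            return commands
    return None  # logged as "Failed to resolve dependency ..., ignoring"

# hypothetical resolvers: each maps (tool, version) to a shell command
tool_shed = {('bwa', '0.7.12'): '. /deps/bwa/0.7.12/'}.get
galaxy_pkgs = {('bedtools', None): '. /deps/bedtools/default/'}.get

print(resolve(('bedtools', None), [tool_shed, galaxy_pkgs]))
print(resolve(('missing', '1.0'), [tool_shed, galaxy_pkgs]))
```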

Tool Shed Packages resolver

The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I’ll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision, so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number, except when they refer to type="set_environment" rather than type="package" requirements (I’ll explain set_environment a bit later). In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.
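The path layout can be captured in a one-liner; the example values are the ones from our BLAST+ install mentioned above (with an assumed '/deps' base_path):

```python
import os.path

def tool_shed_package_path(base_path, tool, version, owner, package, revision):
    # layout: base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision
    return os.path.join(base_path, tool, version, owner, package, revision)

print(tool_shed_package_path('/deps', 'blast+', '2.2.31',
                             'iuc', 'package_blast_plus_2_2_31', 'e36f75574aec'))
# -> /deps/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec
```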

Underneath it all, the tool_shed_packages resolver looks for a file called provided as part of the packaged tool. This file contains settings that are sourced (i.e. .) into the job script Galaxy ends up executing.

Galaxy Packages resolver

The galaxy_packages resolver is oriented towards manually installed packages. Note that it is called twice: first in its default, version-supporting, form and second with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that’s where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file named it will emit code to source that file. This means that you can use the script any way you want to add a tool to Galaxy’s path. For example, here is our for bedtools:

#!/bin/sh

if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/
fi

module add bedtools/bedtools-2.20.1

That uses our existing Environment Modules installation to add bedtools to the PATH.

The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link at base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version-numbered directory containing a bin/ subdirectory or an file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.

Supporting versionless requirements

As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you’d have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the buried within the package’s folders into that base_path/toolname/version directory. You could do that, or you could just manually add an as per the galaxy_packages schema and not install packages through the toolshed.

By the way, the set_environment type of requirement mentioned earlier, the one referred to when talking about the tool_shed_packages resolver, is a special kind of ‘package’ that simply contains an script. These are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they’re the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.

In Conclusion

If you’ve read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton’s fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.

The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a speck of phytoplankton floating in a computing sea and I think he’s right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren’t enough hands on the job. Much better resourced fields are still trying to solve this problem and it’s clear that Galaxy is going to try and benefit from that work.

Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our personal Galaxy installation I’m going to probably be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we’ve got going at SANBI. And hopefully I’ll get less confused by those Failed to resolve dependency messages in the future!

How to submit a job to the SANBI computing cluster

I keep trying to finish the documentation about our computing cluster at SANBI and SGE (Sun Grid Engine)[1] and how to run jobs. In the meantime, however, here’s how to run a job on the cluster at SANBI.

Firstly, the structure of the cluster. Our storage, for now, is provided by a storage server and shared across the whole cluster. This means that your home directory and the /cip0 storage area is shared across the whole cluster. We still need to implement better research data management practices but you should do your work in a scratch directory and store your results in your research directory. I’m not going to talk more about that now because the system is in flux.

Secondly, the cluster has a number of compute nodes and a single submit node, so log in to the submit node to submit your job. It is a smallish virtual machine, so don’t run anything substantial on the submit node!

So let’s imagine that you want to run a tool like fastqc on the cluster. First, is the tool available? We use a system called environment modules to manage the available software. This allows us to install software in a central place and just add the relevant environment variables to run the tool you need. The module avail command lists available modules, so module avail 2>&1 | grep fastqc will show us that fastqc is indeed available.

Next we need a working directory (e.g. /cip0/scratch/pvh) and a script that will run the command. Here is a little script (let’s imagine it is called


#!/bin/sh

. /etc/profile.d/

module add fastqc

if [ ! -d out ] ; then
    mkdir out
fi

fastqc -t 1 -o `pwd`/out -f fastq data.fastq

The . /etc/profile.d/ line ensures that the module command is available. The module add fastqc line adds the path to fastqc so that it is available to our script. You can use module add outside a script if you want to examine how to run a command. It also sets a variable, FASTQC_HOME, pointing to where the command is installed, so you can ls $FASTQC_HOME to see if there is perhaps a README or other useful data in that directory.

Then, fastqc needs you to create the output directory; it won’t do it itself, so the script does that, creating a directory named out under the current directory. Now you need to send the script to the cluster’s scheduler:

qsub -wd $(pwd) -q all.q -N myfastqc

This will send to the scheduler and tell it to run on all.q (which happens to be the default queue) with the job named myfastqc. This queue has a time limit of 8 hours, so if you need to run for longer than that you need to add -q long.q. The long.q queue has no time limit but fewer CPUs available. The -wd flag sets the job’s working directory, in this example the directory you are in when you submit the job.

You can check the status of your job with qstat. Job output is, by default, written into the working directory in two files, one for stderr and the other for stdout.

There’s much more to say about the cluster and the use of the qsub, qstat, qacct and qhost commands, but this should be enough to get you started with your first cluster job. The rest will have to wait till I’ve got time to write more extensive documentation.


[1] Sun Grid Engine is part of the Grid Engine family of job schedulers that has undergone complex evolution over the last years due to Sun’s takeover by Oracle and the subsequent forking of the codebase. See the Grid Engine Wiki for details.

Adventures in Galaxy output collections

For the Galaxy IUC Tools and Collections codefest we (the SANBI software developers) decided to take on what we thought would be a simple job: make the bamtools_split tool output a dataset collection instead of multiple datasets. So here’s the output clause of the old (multiple datasets) version of bamtools_split:

    <data format="txt" name="report" label="BAMSplitter Run" hidden="true">
      <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam" visible="true"/>
    </data>

and this needed to change to:

    <collection name="report" type="list" label="BAMSplitter Run">
        <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam"/>
    </collection>

In other words, the <data> element just gets changed to a <collection> element and the <discover_datasets> element remains essentially the same. So we did this change and everything ran fine except: the output collection was empty. Why?

Lots of debugging followed, based on a fresh checkout of the Galaxy codebase. We discovered that the crucial function here is collect_dynamic_collections() in the module. This is called by the finish() method of the Job class, via the Tool class’s method of the same name.

The collect_dynamic_collections function identifies output collections in a tool’s definition and then uses a collection builder to map job output files to a dataset collection type. The collection builder is a factory class defined in galaxy.dataset_collections.builder and each dataset collection type (defined in galaxy.dataset_collections.builder.types) has its own way of moving output elements into the members of a collection type.

Anyway, we traced this code all the way through to the point where it was obvious the dataset collection was being created successfully and then turned to the other Galaxy devs (John Chilton specifically) to ask for help, only to discover that the problem was gone. The dataset collection was somehow populated! It turns out that if your Galaxy tool creates an output dataset collection that has an uncertain number of members (like a list collection), then it is populated asynchronously and you need to refresh the history to see its members; this is a known bug.

So that’s been quite a learning curve. The final tool is on Github. The collection tag for outputs was introduced above. We haven’t explored its pair mode, but check out Peter Briggs’ trimmomatic tool which has an option to output as a pair type dataset collection.

In the test section of the tool configuration, you can use a dataset collection like this:

    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
      <element name="MAPPED" file="bamtools-split-MAPPED1.bam" />
      <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam" />
    </output_collection>

The output_collection tag essentially groups outputs together, with each element tag taking the place of an individual output tag. Each element tag has a name that maps to one of the names identified by the discover_datasets pattern (perhaps index numbers can be used instead of names, I don’t know) and can use the test attributes that output provides.

With the tests updated and some suitable sample data in place the tests pass and the tool is ready for a pull request. There was some discussion though on the semantics of this tool… for more go and read the comments on the PR.

A BLAST array job for the SANBI cluster

If you want to query a BLAST database with a large number of input query sequences, you might want to use this script. The easy way to gain speed for a BLAST search is to split the input set of query sequences into multiple parts (using a FASTA-splitting script, or, if the sequences don’t contain linebreaks, split or csplit) and run the BLAST search as an array job. For this script, you need a working directory containing these subdirectories:

in/ - a directory containing your split queries in files named *.fasta
out/ - an empty output directory
logs/ - an empty log directory

Tune your splitting for efficiency: if your queries are too small, the time to start running will make the search inefficient. If your queries are too large, the jobs will run too long – remember that the timelimit on the default all.q is 8 hours.
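The splitting step itself is simple; here is a minimal Python sketch (the record grouping is an assumption to tune for your queue, and writing the chunks out to in/1.fasta, in/2.fasta and so on is left to the caller):

```python
import io

def fasta_records(handle):
    """Yield one '>'-delimited FASTA record (header plus sequence lines)."""
    record = []
    for line in handle:
        if line.startswith('>') and record:
            yield ''.join(record)
            record = []
        record.append(line)
    if record:
        yield ''.join(record)

def split_fasta(handle, seqs_per_file):
    """Group records into chunks of seqs_per_file, ready to be written
    out as the in/*.fasta inputs of the array job."""
    chunk = []
    for record in fasta_records(handle):
        chunk.append(record)
        if len(chunk) == seqs_per_file:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# two records per chunk from a three-record file:
data = io.StringIO('>a\nACGT\n>b\nGGCC\n>c\nTTAA\n')
for i, chunk in enumerate(split_fasta(data, 2), start=1):
    print(i, len(chunk))
```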

Uljana and I wrote the script below to actually run the array job. If your working directory was, for example, /cip0/research/rosemary/blast, you saved this script as and you have 20 input query files, then you could submit the script with:

qsub -wd /cip0/research/rosemary/blast -t 1-20

Note that each user can have at most 20 jobs running on the cluster at any one time, so your queries will run in blocks of 20 jobs at a time. The raw source code is available and easier to copy than the listing below. Also note that you probably want to customise the actual BLAST command line (at the end of the script). The one in here was designed to pick up taxonomy information from the local install of the NR database, useful for doing a metagenomic scan.


#!/bin/sh
# requirement:
# working directory with:
# in/ - files named *.fasta that are query sequences
# out/ - empty directory to put outputs in
# logs/ - empty directory to put logs in 
# qsub with:
# qsub -t 1-2 -wd ./my-work-dir

#$ -o logs/$JOB_NAME.o$JOB_ID.$TASK_ID
#$ -e logs/$JOB_NAME.e$JOB_ID.$TASK_ID

### -----
### define input and output directories
### (locations relative to the job working directory)

in_dir=$(pwd)/in
out_dir=$(pwd)/out
filelist=$(pwd)/filelist.txt

cd $in_dir

### -----
### get all the file names into a file

ls *.fasta > $filelist

### -----
### access the fasta files by the ${SGE_TASK_ID}

fasta=$(awk "NR == ${SGE_TASK_ID} {print}" $filelist)
echo $fasta

### -----
### add the blast module and run blast

. /etc/profile.d/
module add blastplus/default

blastn -query $in_dir/$fasta -db nt -out $out_dir/$fasta.txt -outfmt "6 std slen qlen qcovs qcovhsp staxids sscinames sskingdoms" -soft_masking false -max_target_seqs 3 -evalue 10