Adventures in Galaxy output collections

For the Galaxy IUC Tools and Collections codefest we (the SANBI software developers) decided to take on what we thought would be a simple job: make the bamtools_split tool output a dataset collection instead of multiple datasets. So here’s the output clause of the old (multiple datasets) version of bamtools_split:

  <outputs>
    <data format="txt" name="report" label="BAMSplitter Run" hidden="true">
      <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam" visible="true"/>
    </data>
  </outputs> 

and this needed to change to:

  <outputs>
    <collection name="report" type="list" label="BAMSplitter Run">
        <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam"/>
    </collection>
  </outputs>

In other words, the <data> element just gets changed to a <collection> element and the <discover_datasets> element remains essentially the same. So we made this change and everything ran fine, except that the output collection was empty. Why?

Lots of debugging followed, based on a fresh checkout of the Galaxy codebase. We discovered that the crucial function here is collect_dynamic_collections() in the galaxy.tools.parameters.output_collect module. This is called by the finish() method of the Job class, via the Tool class's method of the same name.

The collect_dynamic_collections function identifies output collections in a tool’s definition and then uses a collection builder to map job output files to a dataset collection type. The collection builder is a factory class defined in galaxy.dataset_collections.builder and each dataset collection type (defined in galaxy.dataset_collections.builder.types) has its own way of moving output elements into the members of a collection type.

Anyway, we traced this code all the way through to the point where it was obvious the dataset collection was being created successfully and then turned to the other Galaxy devs (John Chilton specifically) to ask for help, only to discover that the problem was gone. The dataset collection was somehow populated! It turns out that if your Galaxy tool creates an output dataset collection with an uncertain number of members (like a list collection), it is populated asynchronously and you need to refresh the history to see its members – this is a known bug.

So that's been quite a learning curve. The final tool is on Github. The collection tag for outputs was introduced above; we haven't explored its pair mode, but check out Peter Briggs' trimmomatic tool, which has an option to output a pair-type dataset collection.

In the test section of the tool configuration, you can use a dataset collection like this:

<test>
    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
      <element name="MAPPED" file="bamtools-split-MAPPED1.bam" />
      <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam" />
    </output_collection>
</test>

The output_collection tag essentially groups outputs together, with each element tag taking the place of an individual output tag. Each element tag has a name that maps to one of the names identified by the discover_datasets pattern (perhaps index numbers can be used instead of names, I don't know) and can use the test attributes that output provides.
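
To make the mapping concrete, here is a hypothetical illustration (not from the original post) of the files bamtools_split might leave in the job working directory when run with -mapped, and the collection elements the discover_datasets pattern above would derive from them:

split_bam.MAPPED.bam      ->  element name="MAPPED"
split_bam.UNMAPPED.bam    ->  element name="UNMAPPED"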

With the tests updated and some suitable sample data in place the tests pass and the tool is ready for a pull request. There was some discussion though on the semantics of this tool… for more go and read the comments on the PR.

A BLAST array job for the SANBI cluster

If you want to query a BLAST database with a large number of input query sequences, you might want to use this script. The easy way to speed up a BLAST search is to split the input set of query sequences into multiple parts (using a script such as split_fasta.py, or, if the sequences don't contain linebreaks, plain split or csplit) and run the BLAST search as an array job. For this script, you need a working directory containing these subdirectories:

in/ - a directory containing your split queries in files named *.fasta
out/ - an empty output directory
logs/ - an empty log directory

Tune your splitting for efficiency: if your query files are too small, job startup time will make the search inefficient. If your query files are too large, the jobs will run too long – remember that the time limit on the default all.q is 8 hours.
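
As an example of the splitting step, here is a minimal awk sketch (the file name queries.fasta and the chunk size of 500 are placeholders; adjust them to your data). Unlike split or csplit, this works even if sequences are wrapped over multiple lines:

# split queries.fasta into chunks of 500 sequences each, written to in/
awk 'BEGIN { size = 500 }
     /^>/  { if (n % size == 0) { chunk++; out = sprintf("in/query_%03d.fasta", chunk) }; n++ }
           { print > out }' queries.fasta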

Uljana and I wrote the script below to actually run the array job. If your working directory is, for example, /cip0/research/rosemary/blast, you have saved this script as blastplus_array.sh and you have 20 input query files, then you can submit the script with:

qsub -wd /cip0/research/rosemary/blast -t 1-20 blastplus_array.sh

Note that each user can have at most 20 jobs running on the cluster at any one time, so your queries will run in blocks of 20 jobs at a time. The raw source code is available and easier to copy than the listing below. Also note that you probably want to customise the actual BLAST command line (at the end of the script). The one here was designed to pick up taxonomy information from the local install of the NR database – useful for doing a metagenomic scan.

#!/bin/sh

# requirement:
# working directory with:
# in/ - files named *.fasta that are query sequences
# out/ - empty directory to put outputs in
# logs/ - empty directory to put logs in 
#
# qsub with:
# qsub -t 1-2 -wd ./my-work-dir blastplus_array.sh

#$ -o logs/$JOB_NAME.o$JOB_ID.$TASK_ID
#$ -e logs/$JOB_NAME.e$JOB_ID.$TASK_ID


### -----
### define input and output directories

base_dir=`pwd`
in_dir=$base_dir/in
out_dir=$base_dir/out

cd $in_dir

### -----
### get all the file names into a file

filelist=../logs/file_list.txt
ls *.fasta > $filelist

### -----
### access the fasta files by the ${SGE_TASK_ID}

fasta=`awk "NR == ${SGE_TASK_ID} {print}" $filelist` # ${file_list[$counter]}
echo $fasta

### -----
### add the blast module and run blast

. /etc/profile.d/module.sh
module add blastplus/default

blastn -query $in_dir/$fasta -db nt -out $out_dir/$fasta.txt -outfmt "6 std slen qlen qcovs qcovhsp staxids sscinames sskingdoms" -soft_masking false -max_target_seqs 3 -evalue 10

OrthoMCL and BLAST: Adventures in the (SANBI) Galaxy

BLAST in Galaxy

Part of my work for the week was to start using Galaxy more extensively at SANBI, i.e. to make it more usable. Last week I wrote an authentication plugin to allow a Galaxy server to authenticate using PAM. This got accepted into the 15.07 release of Galaxy, so I updated our Galaxy server to that release. I had neglected to include the example auth_conf.xml in the code I committed, but working off the example on my laptop I got PAM authentication working as a replacement for the previous HTTP authentication (which also spoke to PAM on the backend). I also took the opportunity to switch our server to HTTPS using the SANBI wildcard certificate.

My first attempt at a practical use for our server came when I needed to run the BLAST step of the OrthoMCL pipeline. OrthoMCL uses an all-against-all BLAST as its input dataset, and based on the data from our colleagues, I had a collection of about 300,000 proteins to BLAST against each other. I started this off as an array job at CHPC but thought I could try to work locally as well, as a proof of concept. (Actually there was a previous step, filtering out poor proteins, but I'll get to that below.) My first attempt at using BLAST hit a bug: "NotFound: cannot find 'files_path' while searching for 'db_opts.histdb.files_path'". This exception was thrown from __build_command_line in Galaxy's lib/galaxy/tools/evaluation.py because the BLAST wrappers use an attribute called files_path instead of extra_files_path. Peter Cock and John Chilton discuss the problem in this Github issue and Peter quickly committed a workaround to the BLAST tools.

Having fixed that, and having prepared the protein set (outside Galaxy), I decided to take a chance on the Galaxy “parallelisation” code. This is enabled through appropriate tags in the tool XML, and in the case of the blastp wrapper splits the query dataset into chunks of 1000 sequences each before submitting jobs (in Galaxy terms, actually tasks, not fully fledged jobs) to the cluster. Unfortunately these are individual jobs, not an array job, because array jobs are only implemented in the still-only-on-the-horizon DRMAA version 2. In any event, our cluster can handle thousands of job submissions so I hit go, saw the history item turn from grey to yellow, and waited. Unfortunately, after a day or so it went red (failed), but by then I was too busy with other stuff to debug it. To be continued…

(As an aside, the BLAST wrapper wraps BLAST+, whereas OrthoMCL uses legacy BLAST. I still need to check that the BLAST wrapper exposes enough flags in order to guarantee equivalence. A useful guide for some of the corresponding flags can be found on this page about ortholog finding).

OrthoMCL in and out of Galaxy

As mentioned previously, I was running BLAST as part of the OrthoMCL pipeline. OrthoMCL uses BLAST, MCL and a database (in the version we use, SQLite3) to compute the orthologs in a set of proteins. The pipeline has two steps before the BLAST stage (orthomclAdjustFasta and orthomclFilterFasta), five between the BLAST and MCL stages and a final step to process the MCL output. Currently I use a Makefile to execute the pipeline, but at the GCC2015 Hackathon AJ started work on some wrappers for the steps in the pipeline. There has been previous work on executing OrthoMCL within Galaxy but that ran the entire workflow as a single tool. We want to implement the pipeline as a Galaxy workflow because that way we can (in theory at least) benefit from improvements in how BLAST is executed (e.g. parallelism) or even replace the BLAST step with a similar (but apparently faster) tool such as Diamond. The OrthoMCL pipeline is pretty linear, so even given the limited capabilities of workflows in current Galaxy (as discussed by John Chilton at BOSC 2015), creating an OrthoMCL workflow should be pretty easy.

To that end we've now got a Github repository for the tool wrappers. I'm trying to follow the structure that groups like the IUC use. AJ's working on orthomclAdjustFasta, so I decided to tackle orthomclFilterFasta, a tool that takes a directory full of FASTA files as input, does some simple filtering and outputs a combined FASTA file. I'm not 100% sure of the requirements for the command line (I need to go back into the code and see how it is executed), so I've got a tool that generates a single shell command in the form:

mkdir inputs && /bin/bash orthomcl_prepare_dataset_for_filter.sh dataset1.dat && /bin/bash orthomcl_prepare_dataset_for_filter.sh dataset2.dat && orthomclFilterFasta inputs/ <p1> <p2>

The prepare_dataset_for_filter.sh is just a simple script that takes a FASTA file, extracts the tag that OrthoMCL uses to identify sets (added by orthomclAdjustFasta) and renames the file according to that tag. The orthomclFilterFasta tool insists that input files end in .fasta and are named according to their tag.
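
For the curious, here is a hypothetical sketch of what such a prepare script could look like (the real orthomcl_prepare_dataset_for_filter.sh is in the repository and may well differ):

#!/bin/bash
# orthomclAdjustFasta writes headers in the form ">TAG|sequence_id", so read
# the tag from the first header and expose the dataset as inputs/TAG.fasta,
# which is the naming scheme orthomclFilterFasta insists on
dataset=$1
tag=$(head -n 1 "$dataset" | sed -e 's/^>//' -e 's/|.*$//')
cp "$dataset" "inputs/${tag}.fasta"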

In any event the tool runs fine on a local Galaxy install. The next step is to get tool dependencies right, which is where the stuff in the package directory comes in. Galaxy can install packages for you (in an admin-configurable folder). The dependencies a tool needs are specified in a file called tool_dependencies.xml that lives in the same folder as the tool XML.

The tool dependencies specify packages to install. For OrthoMCL two new packages have been written (see here): one for OrthoMCL and one for the Perl DBD::SQLite module that it depends on. The OrthoMCL package in turn depends on Perl and DBD::SQLite, which is expressed using a repository_dependencies.xml file – I'm still not sure if this is the correct approach, but in any event it follows the guide for simple repository dependencies in Galaxy. One limitation of repository dependencies is that they apparently only work within a single toolshed, so what do you do if the package you require is in another toolshed?

Thus far the tool dependency stuff has been tested on a local Galaxy installation. It doesn't work: directories are created, but they are empty. Further testing and a better testing procedure are needed. Eric Rasche mentioned that Marius van den Beek has a Jenkins-based testing framework that uses Docker to create sandboxes, and it is here – so perhaps getting this up and running is a next step.

And then finally, FastOrtho seems like a possibly viable alternative to OrthoMCL. Its output seems roughly similar to OrthoMCL's and it is much faster (and, as a single tool with no Perl dependencies, easier to package), but as with all new tools in bioinformatics, we'll have to prove that it works well enough to replace OrthoMCL (which is somewhat of a standard in this domain). Well, check back in a few weeks for updates…

Mixed HTTP / WordPress authentication with nginx

Our WordPress blog / site network at wp.sanbi.ac.za uses Daniel Westermann-Clark’s HTTP Authentication plugin. Despite the scary warning on the plugin’s page over at wordpress.org, the setup has worked well for us over the years. The only problem has been allowing external (non-SANBI) users to edit pages. This is something we really need for the annual bioinformatics course website.

After quite a few false starts I realised that the solution is to use passive authentication, i.e. not to force the user to use HTTP authentication but rather to make it an optional extra. We use PAM for nginx authentication (via nginx_auth_pam, included in the nginx build in Ubuntu's nginx-full package) and the crucial bit of the nginx config looks like this:

        location /sanbi-login.html {
            auth_pam  "SANBI authentication";
            auth_pam_service_name "nginx";
        }
        # Process only the requests to wp-login and wp-admin
        location ~ /wp-(admin|login|includes|content) {
          try_files $uri $uri/ \1/index.php?args;

          location ~ \.php$ {
             try_files $uri =404;
             include fastcgi_params;
             fastcgi_param REMOTE_USER $remote_user;
             fastcgi_index index.php;
             fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
             fastcgi_pass  php;
             fastcgi_intercept_errors on;
          }
        }

(Full nginx config is here). The nginx server forwards the REMOTE_USER variable to the php5-fpm backend but doesn’t actually force HTTP authentication for any of the WordPress content. Instead only a single page is protected with HTTP authentication, and that page simply redirects to the WordPress admin interface. Here’s the source of sanbi-login.html:

<html>
<head>
<title>SANBI login</title>
<meta http-equiv="refresh" content="0; url=/wp-admin" />
</head>
<body>
</body>
</html>

This works well because ours is a sub-domain based WordPress network. In the HTTP Authentication plugin settings, we point to the address of the sanbi-login.html page as the login page, so the login page ends up looking like this:

WP login screen.

The Log in with HTTP Authentication link takes you to the sanbi-login.html page, which requests HTTP authentication and immediately redirects back to /wp-admin. The login form itself uses conventional WordPress authentication.

One remaining challenge is how to set this option for all sites in the network. Currently it needs to be configured for each site individually. The setting is stored in the WordPress database in the wp_BLOGNUMBER_options table (where BLOGNUMBER is the number of the blog in the WordPress network) with option_name = 'http_authentication_options', as part of a serialised array. This code suggests how to set an option for all sites in a network, but as yet there is no way to do this from the WordPress web interface, and I'm a bit loath to work on the database directly.

Reflections on Big Data, Bioinformatics and the recent UCT/UWC workshop

Monday and Tuesday of this week were largely consumed by a focus on Big Data. First, Ton Engbersen from the IBM/ASTRON Centre for Exascale Technology presented a talk at UCT on microservers, data gravity and microclouds.

The microserver in question is the DOME (http://www.hpcwire.com/2014/04/10/dome-ibm-research-microserver-freescale/), a system-on-chip based device that crams 128 computers into a 19″ 2U rack drawer. Each computer takes up 13 cm x 5 cm (and is 6 mm thick) and provides 12 PowerPC cores with up to 48 GB RAM, resulting in a 2U drawer with 1536 cores and over 6 TB of RAM. The whole thing is cooled with warm water, an IBM innovation that is currently in use on the SuperMUC supercomputer at the Leibniz Supercomputing Centre near Munich; read more about its benefits on their page.

The DOME server is being developed to analyse data from the SKA, an exascale computing problem. The SKA is anticipated to generate between 300 and 1500 petabytes of data per year, putting it on the extreme end of scientific enterprises in terms of data volume. While big data is commonly associated with data volume, researchers at IBM identify four V's of big data: volume, velocity, variety and veracity. Volume is straightforward. Velocity speaks to the rate at which new data appears. With the amount of sequence data available in GenBank growing at an exponential rate, both volume and velocity of data threaten to outstrip the ability of bioinformatics centres to analyse data. In terms of integration of data, however, my presentation on the big data of tuberculosis focussed more on the variety and veracity of available data. A survey of the data published alongside research articles in the field shows that much of the variety of data gleaned through bioinformatics experiments is lost or only retained in closed institutional databases (and thus effectively lost to the field). An overview of health data collected as part of the NIH-funded Centre for Predictive Computational Phenotyping illustrates the problem of data veracity: electronic health records for patients are often incomplete and lack the vocabulary researchers require to identify disease presence or progression.

Managing the data collections necessary to study e.g. the global state of TB prevalence and treatment will require digital curation of multiple datasets drawn from experiments performed in a range of domains. As Ton Engbersen pointed out, the growing size of data means that "compared to the cost of moving bytes around, everything else is free" (originally a Jim Gray quote). Add to this the (much more tractable) fact that the skills required to build stores for and curate these datasets are unevenly distributed, and data collections are set to become "the new oil". Engbersen proposes a solution: micro-clouds that offer the possibility to move code to the data rather than the other way round. Such entities would require a sophisticated cross-institutional authentication framework – almost certainly built on digital certificates – to allow authorised software agents to interface with data. This immediately suggests a set of research priorities to add to SANBI's existing research projects on data storage and data movement. Luckily this research overlaps with some research interests at UWC Computer Science.

The workshop concluded with some agreements to collaborate between UCT and UWC on big data, but the perspectives delivered show that there is much more at play than the SKA. The fact that both UWC and UCT have established bioinformatics expertise and are located on the established SANReN backbone means that there’s an immediate opportunity to share knowledge and experiments on projects that tackle all four V’s of big data. Lots of ideas… the coming year will see how they can be put into practice.

An experimental Ceph storage cluster for the Computer Science Netlab

Ceph is an open source distributed object store and filesystem originally developed by Sage Weil for his PhD. Saying that again, but slower this time: Ceph is an object store where the objects are distributed over a collection of drives and servers. And there's a filesystem component too. The basic building blocks of Ceph are object storage daemons (OSDs) and monitoring daemons (MONs). Each OSD manages a block of storage into which objects are placed, aggregated into placement groups (PGs). A data structure called a CRUSH map, essentially a distributed hash table, is used to look up where particular data objects are stored. Distributed hash tables underlie most distributed storage software (you also find one in GlusterFS, for example). Keeping track of the state of everything are the MONs, which use the Paxos algorithm to maintain a consistent view of the state of the cluster. There is always an odd number of MONs: typically 3 for a small cluster.

Once you put it all together you get RADOS, the Reliable Autonomic Distributed Object Store (see the 2007 paper if you want). Storage in RADOS is divided into pools where you set per-pool policy for object size and striping and replication. So a pool might have say 4096 PGs containing objects of maximum 4 MB each filled with 64 KB stripes of data, replicated to ensure that there are 3 copies of each PG available at all times.
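
As a concrete example, creating a pool roughly along those lines with the standard ceph CLI would look something like this (the pool name "demo" and the numbers are just illustrative):

# create a pool with 4096 placement groups and keep 3 replicas of each object
sudo ceph osd pool create demo 4096
sudo ceph osd pool set demo size 3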

Ceph provides 3 options for accessing RADOS: first there is the RADOS object gateway, a RESTful service allowing individual objects (e.g. VM images) to be stored and retrieved. Then there's the RADOS Block Device (RBD), an iSCSI-like block device that can be mapped to a virtual drive on the client computer. And finally there is CephFS, a POSIX-compliant filesystem running on top of storage pools in RADOS (the filesystem uses an extra daemon, the metadata server (MDS), to map filesystem locations such as /home to objects).

The Ceph architecture allows heterogenous disks and servers to be aggregated into a storage pool in a way that is more flexible than traditional RAID and with substantially shorter rebuild times.

So to demo Ceph I set it up in the Computer Science Netlab at UWC, using a fair sprinkling of ansible scripting and the ceph-deploy tool. The architecture is three MONs (normandy, netlab2-ws and netlab6-ws) and three OSDs (netlab2-ws, netlab6-ws and netlab17-ws). On the OSDs I've used the existing filesystem as the Ceph store – Ceph should really have its own storage partitions, but I didn't want to go to the trouble of repartitioning machines for the demo. The release used is Giant, the latest and as-yet-unreleased Ceph release.

TODO: detail the initial installation

I'll hopefully get around to documenting the initial steps, but for now let me show how I add a new OSD to the Ceph cluster.

First, create the /dfs directory on the new OSD (in this case netlab10-ws). The “-s -k -K” arguments for ansible mean ask for SSH password, ask for sudo password and use sudo. Since the commands I’m running need to be run as root on the remote machine, I need all that.

ansible -s -k -K -m file -a "name=/dfs owner=root state=directory" netlab10-ws

Then add a user to the remote machine. The password set here is never actually used, since we'll be using SSH public keys to log in, plus passwordless sudo.

ansible  -u pvh -k -K -m user -a 'name=netlab-ceph home=/dfs/netlab-ceph createhome=yes password="SOMEENCRYPTEDSTUFF" shell=/bin/bash comment="Ceph User" state=present' netlab10-ws

ansible -k -s -K -m authorized_key -a 'user=netlab-ceph key="SSH PUBLIC KEY GOES HERE"' netlab10-ws

ansible -k -K -s -m copy -a 'content="netlab-ceph ALL = (root) NOPASSWD:ALL\n" dest=/etc/sudoers.d/050_ceph mode=0444 owner=root' netlab10-ws 

Next we need to install an NTP daemon and synchronise the time (we should really run an in-lab NTP server, but for now we're using a public one to set the time). Ceph relies on all the nodes being in close time synchronisation to operate.

ansible -s -k -K -m apt -a 'name=ntp state=present' netlab10-ws

ansible -s -k -K -m command -a 'ntpdate 0.pool.ntp.org' netlab10-ws
ansible -s -k -K -m service -a 'name=ntp state=started enabled=true' netlab10-ws

Then use the ceph-deploy tool to install the actual ceph packages, create a directory for the OSD to put its data in, initialize, activate and we’re done!

ceph-deploy install --release=giant netlab10-ws

ansible -s -k -K -m file -a 'name=/dfs/osd3 state=directory' netlab10-ws

ceph-deploy osd prepare netlab10-ws:/dfs/osd3
ceph-deploy osd activate netlab10-ws:/dfs/osd3

You can check on the state of the cluster with sudo ceph status, where you’ll see something like this:

netlab-ceph@normandy:~$ sudo ceph status
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 18 pgs degraded; 220 pgs peering; 93 pgs stuck inactive; 93 pgs stuck unclean; recovery 296/2880 objects degraded (10.278%)
     monmap e1: 3 mons at {netlab2-ws=10.0.0.16:6789/0,netlab6-ws=10.0.0.21:6789/0,normandy=10.0.0.1:6789/0}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4413: 320 pgs, 3 pools, 3723 MB data, 960 objects
            63303 MB used, 799 GB / 907 GB avail
            296/2880 objects degraded (10.278%)
                  18 active+degraded
                 220 peering
                  82 active+clean
recovery io 11307 kB/s, 2 objects/s
  client io 6596 kB/s wr, 4 op/s

Or you can watch it rebalancing itself with ceph -w:

netlab-ceph@normandy:~$ sudo ceph -w
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 121 pgs degraded; 8 pgs recovering; 28 pgs stuck unclean; recovery 1308/3864 objects degraded (33.851%)
     monmap e1: 3 mons at {netlab2-ws=10.0.0.16:6789/0,netlab6-ws=10.0.0.21:6789/0,normandy=10.0.0.1:6789/0}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4445: 320 pgs, 3 pools, 4991 MB data, 1288 objects
            63424 MB used, 799 GB / 907 GB avail
            1308/3864 objects degraded (33.851%)
                 113 active+degraded
                 199 active+clean
                   8 active+recovering+degraded
recovery io 14070 kB/s, 3 objects/s

2014-10-16 11:44:39.069514 mon.0 [INF] pgmap v4445: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63424 MB used, 799 GB / 907 GB avail; 1308/3864 objects degraded (33.851%); 14070 kB/s, 3 objects/s recovering
2014-10-16 11:44:41.178062 mon.0 [INF] pgmap v4446: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63473 MB used, 799 GB / 907 GB avail; 1306/3864 objects degraded (33.799%); 9782 kB/s, 2 objects/s recovering

To remove an OSD, you can use these commands (using our newly created osd.3 as an example) – they take the OSD out of the storage cluster, stop the daemon, remove it from the CRUSH map, delete its authentication keys and finally remove the OSD from the cluster's list of OSDs.

netlab-ceph@normandy:~$ sudo ceph osd out 3
marked out osd.3. 
netlab-ceph@normandy:~$ ssh netlab10-ws sudo stop ceph-osd-all
ceph-osd-all stop/waiting
netlab-ceph@normandy:~$ sudo ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
netlab-ceph@normandy:~$ sudo ceph auth del osd.3
updated
netlab-ceph@normandy:~$ sudo ceph osd rm 3
removed osd.3

Then, for good measure, you can remove the data:

netlab-ceph@normandy:~$ ssh netlab10-ws sudo rm -rf /dfs/osd3/\*

Also not covered in this blog post is how I added an RBD device and how I created and mounted a CephFS filesystem. Well… bug me till I finish writing this thing.

Installing Slurm on CentOS using Ansible

Helping the UWC Student Cluster Challenge team prepare for the final round (at the CHPC National Meeting) has given me an excuse to play with some new toys: I've got a mini-cluster of three VMs running on my laptop (using KVM and libvirt), and I've been looking into Slurm as a cluster scheduler. At SANBI we run SGE and lots of other people use Torque, but I've been interested in Slurm for a while, because it's a fully open source scheduler with some big name users and seemingly a bright future. Then, I'm big into systems administration task automation: at SANBI we use puppet (and I personally use fabric), but Bruce Becker introduced me to Ansible, and so I took the opportunity to build an Ansible playbook to install Slurm on my mini-cluster.

Ansible playbooks are written in YAML and describe a set of tasks that need to be applied to a set of servers. These tasks are defined in terms of a set of modules and executed using ssh  (optionally using ZeroMQ to speed up data transfer) so you need to have ssh access to the machines you want to administer. I did this by adding my ssh key to the authorized_keys file for the root users on my mini-cluster. Like puppet recipes, ansible playbooks are (largely) declarative: you specify what you want, not how to achieve it. Unlike puppet recipes, ansible playbooks run tasks in order, first to last.

So my cluster has three nodes: head (the head node), and two workers: worker1 and worker2. These are on a private (virtual) LAN with DNS being provided by the head node (so the DNS names are head.cluster, etc). My laptop is the VM host and the IP addresses of all nodes are in /etc/hosts on the machine. The laptop has Ansible 1.4 installed from Rodney Quillo’s PPA. This is crucial: I use a bunch of features that are only available in 1.4.

The Slurm Quick Start Administrator Guide outlines the steps needed to install Slurm in a general way. First, I downloaded Slurm 2.6.4 and then installed the dependencies I needed to compile it:

openmpi-devel pam-devel hwloc-devel rrdtool-devel ncurses-devel munge-devel

This is not an exhaustive list: I had previously installed software on the nodes, so there might be stuff I left off this list. To identify missing dependencies, look at the config.log after the configure stage and search for WARNING messages. I unpacked Slurm and did:

./configure --prefix=/opt/slurm
make
sudo make install
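
(As mentioned above, a quick way to spot missing build dependencies is to check the configure output before running make:)

grep -i warning config.log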

This installed Slurm in /opt/slurm. I then created an archive of the slurm install:

cd /opt
tar jcf /var/tmp/slurm-bin.tar.bz2 slurm/*

I deployed this with Ansible to the nodes in the cluster. My Ansible setup uses a hostfile (/etc/ansible/hosts) that defines the hosts and host groups:

[workers]
worker1.cluster
worker2.cluster
[head]
head.cluster
[cluster:children]
head
workers
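
With the hosts defined, it's worth confirming that ssh access to all the nodes works before running any playbooks; a quick check (not part of the original setup notes) uses Ansible's ping module:

# should return "pong" from head.cluster, worker1.cluster and worker2.cluster
ansible cluster -u root -m ping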

(You don’t need to use a system-wide hosts file like I did, you can specify an alternative hostfile with the -i flag on the ansible-playbook command line.) This was my initial ansible playbook (slurm.yml):

---
- hosts: cluster
  remote_user: root
  tasks:
  - name: create slurm user
    user: name=slurm createhome=no home=/opt/slurm shell=/sbin/nologin state=present
  - name: install slurm dependencies
    yum: name={{item}}
    with_items:
    - pam
    - hwloc
    - rrdtool
    - ncurses
    - munge
  - name: create slurm directories
    file: path=/var/spool/slurmd owner=slurm mode=0755 state=directory
  - name: copy slurm binaries to /tmp
    copy: src=slurm-bin.tar.bz2 dest=/tmp/slurm-bin.tar.bz2
  - name: unpack slurm binary distribution
    command: /bin/tar jxf /tmp/slurm-bin.tar.bz2 chdir=/opt creates=/opt/slurm
  - name: install slurm configuration file
    copy: src=slurm.conf dest=/opt/slurm/etc/slurm.conf
    notify: restart slurm
  - name: install slurm path file in /etc/profile.d
    copy: src=slurm.sh dest=/etc/profile.d/slurm.sh mode=0755 owner=root
  - name: install slurm startup script in /etc/init.d
    copy: src=init.d.slurm dest=/etc/init.d/slurm mode=0755 owner=root
  - name: enable munge service
    service: name=munge state=started enabled=yes
  - name: enable slurm service startup
    service: name=slurm state=started enabled=yes
  handlers:
  - name: restart slurm
    service: name=slurm state=restarted
As mentioned previously, ansible playbooks are read top down. So the steps taken are:

  1. Create slurm user.
  2. Install slurm dependencies (the non-devel versions of the previously mentioned packages).
  3. Create slurm spool directory (/var/spool/slurmd) and make it owned by the slurm user.
  4. Upload and unpack the slurm-bin.tar.bz2 that was previously created.
  5. Install the slurm configuration file to /opt/slurm/etc/slurm.conf. The first draft of this was created with the Slurm 2.6 configuration tool and the final version is here.
  6. Install the slurm.sh script to /etc/profile.d that sets the PATH to include slurm binaries. This file contains:

    PATH=$PATH:/opt/slurm/bin:/opt/slurm/sbin

    if [ -z "$MANPATH" ] ; then
    MANPATH=/opt/slurm/share/man
    else
    MANPATH=$MANPATH:/opt/slurm/share/man
    fi
    export PATH MANPATH 

  7. Install init.d.slurm from slurm distribution’s etc/ directory to /etc/init.d/slurm. This handles start/stop of both slurmctld (on head) and slurmd (on worker nodes).
  8. Ensure that the munge daemon is started. I have previously generated a munge key using the instructions on the munge website and distributed this to /etc/munge/munge.key on each of the nodes in the cluster. This was done with another ansible playbook (not shown); a sketch of that step appears after this list.

  9. Once the file is installed the service is enabled (with something like chkconfig slurm on) and the slurm daemons are started (service slurm start).
  10. The playbook was then split up into Ansible roles and roles were added to create a NFS server on the head node and share /home from the head node and mount it over /home on the worker nodes.
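
The munge key step referred to in item 8 isn't shown here; a minimal sketch of it, assuming the key is generated once on the control machine and pushed out with Ansible's copy module, would be:

# generate a 1 KB random key, as per the munge documentation
dd if=/dev/urandom bs=1 count=1024 > munge.key

# push it to every node; munge requires restrictive ownership and permissions
ansible cluster -u root -m copy -a 'src=munge.key dest=/etc/munge/munge.key owner=munge group=munge mode=0400'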

Once this was all up and running, the system was tested by using sbatch to run a simple script. Here’s the script, hello.sh:

#!/bin/sh

echo Hello World

This was submitted with the command:

sbatch hello.sh

After that worked the AMG benchmark was run using MPI. Here’s the test_amg.sh script:

#!/bin/sh
echo NTASKS $SLURM_NTASKS
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
export LD_LIBRARY_PATH
mpirun src/test/amg2013 -laplace -P 1 1 $SLURM_NTASKS -n 64 64 64 -solver 2

and run using:

sbatch -n 2 test_amg.sh

Compared to my experience with SGE, Slurm seems to run jobs really fast and compared to Torque+Maui it seems pretty easy to set up.

As mentioned above, I switched my playbook over to using Ansible roles. Roles allow you to split out the components of your configuration into a particular directory structure and then mix these into your final playbook. So the roles structure I currently have is:

roles
├── munge
│   ├── files
│   │   └── munge.key
│   ├── handlers
│   │   └── main.yml
│   └── tasks
│       └── main.yml
├── nfs-client
│   └── tasks
│       └── main.yml
├── nfs-common
│   └── tasks
│       └── main.yml
├── nfs-server
│   ├── files
│   │   └── exports
│   ├── handlers
│   │   └── main.yml
│   └── tasks
│       └── main.yml
└── slurm
    ├── files
    │   ├── init.d.slurm
    │   ├── slurm-bin.tar.bz2
    │   ├── slurm.conf
    │   └── slurm.sh
    ├── handlers
    │   └── main.yml
    └── tasks
        └── main.yml

Effectively what Ansible roles do is to split the sections of your playbook out into a directory structure. This is then used in the final playbook (slurm.yml):

---
- hosts: cluster
  remote_user: root
  roles:
  - munge
  - slurm
  - nfs-common
  tasks:
  - name: disable firewall
    service: name=iptables enabled=no state=stopped
- hosts: head
  remote_user: root
  roles:
  - nfs-server
- hosts: workers
  remote_user: root
  roles:
  - nfs-client

And finally I’m at a stage where I can run:

ansible-playbook slurm.yml

And have the complete infrastructure required for a Slurm install set up on my virtual cluster.
[edited to add whitespace to Ansible playbooks as per suggestion from Michael de Haan @laserllama]

 

SFP, fastlink and funnies with a Dell switch

A couple of weeks ago I dropped in on the UWC Student Cluster Competition team to see how they were progressing with their cluster configuration, and I discovered that they were struggling with the networking on their cluster. As a test, they'd set up two Dell rack mounted servers, connected them to a 10 Gb switch (a Dell 8100 series switch as I recall) and then connected the switch to the campus network to try and get an IP via DHCP. The switch was getting an IP, but the servers weren't. The servers are running CentOS, by the way.

As a test, we set up a DHCP server on Nicole’s laptop (we tried on Eugene’s first, but I just couldn’t quite get my head around how Arch Linux does things) and watched the traffic. After some time, we saw DHCP traffic and got DHCP working, but in a mysterious way: if we restarted the networking, the interface would fail to acquire an IP. Then if we ran ifup some time later, it would acquire an IP quite fine. After I left I decided to google around a bit (having discovered the LINKDELAY setting in CentOS network scripts), and lo and behold, someone else reported exactly the same problem and suggested that fastlink be enabled.
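
Incidentally, the LINKDELAY workaround on the host side is just a one-line addition to the interface configuration – a sketch, assuming the interface is eth0:

# /etc/sysconfig/network-scripts/ifcfg-eth0 (CentOS)
# wait 10 seconds after the link comes up before attempting DHCP
LINKDELAY=10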

So what is this fastlink thing? It seems that in other contexts it is called Port Fast, and it's a known way to solve DHCP negotiation issues. By default network switches run the Spanning Tree Protocol (STP) on their ports in order to organise themselves into a spanning tree (and avoid loops or unreachable ports). This involves a delay as a port becomes active – the delay that caused the DHCP query to time out and the problem we saw. If you know you have a host connected to a port, you can set Port Fast on that port, thereby avoiding the delay. Ah well, everyone has to encounter some or other funny the first time they set up a server. And by the way, for a rhythmic description of STP, consult the Algorhyme (or listen to it).

A mouldy myth

Someone at my home institution, the University of the Western Cape, has decided that the way to attract students is to wave a picture of mouldy bread at them. Presumably they don't think that having top class postgraduate programmes at places like BCB or PLAAS or SANBI is worth advertising. Nope, instead we should talk about mouldy bread. Or rather, a myth about Alexander Fleming and mouldy bread. Thus the modified (Gimped in fact) image to the left.

The original proclaims “What if [Fleming] never looked twice at something as ordinary as stale bread?”. Well, I don’t think we really know what Alexander Fleming thought of stale bread. What we do know is that stale bread had nothing to do with his (re)discovery of the antibacterial action of Penicillium moulds. Instead:

“Returning from holiday on September 3, 1928, Fleming began to sort through petri dishes containing colonies of Staphylococcus, bacteria that cause boils, sore throats and abscesses. He noticed something unusual on one dish. It was dotted with colonies, save for one area where a blob of mold was growing. The zone immediately around the mold—later identified as a rare strain of Penicillium notatum—was clear, as if the mold had secreted something that inhibited bacterial growth.” [source]

So it was a lazy attitude towards cleaning the lab — not mouldy bread — that led to Fleming's discovery. That's the first thing this blurb got wrong. What offends me more, however, is the clichéd image of the heroic scientist's discovery sparking a paradigm shift. In reality, the antibacterial effect of Penicillium was known before Fleming, with a range of scientists and traditional knowledges describing the antibacterial effects of mould, or of Penicillium specifically. Just four years before Fleming's discovery, Andre Gratia and Sara Dath discovered the antibacterial effect of a species of Penicillium, also as a result of contamination of a bacterial culture.

What made Fleming's discovery significant was not the moment of discovery and subsequent insight, but rather what he did afterwards: instead of merely publishing a paper and moving on to another topic, he spent years trying to get other scientists — chemists especially — interested in the new substance's potential. It was just over a decade later that Howard Florey's team assembled a strange collection of baths and milk churns as part of the first penicillin production line. Before Florey, however, there was Dr Cecil Paine, a student of Fleming's, who used a crude penicillin extract to successfully treat an eye infection in 1931. (Paine was later a colleague of Florey's.) And Ernst Chain, the scientist in Florey's lab who led the penicillin research, allegedly extracted the compound from a sample of mould that had been sub-cultured from Fleming's original isolate. Florey's science also drew on the clinical trials conducted by his wife, Ethel Florey. So the links between Fleming, the Floreys, Chain and the practical use of penicillin drew on a rich culture of openness and experimentation. In addition, as efforts got under way to industrialise penicillin production, Sara Dath's work in collecting a "monumental number of moulds and bacteria" proved useful. Science was then, and is now, a product of a process of collective enquiry and effort, and while "scientific interest" is key, reducing the story of penicillin to a Scot staring at stale bread does violence to history. And UWC could do better than to peddle this mouldy myth.

 

Cluster building at UWC

Motse, Eugene (hidden) and Saeed assembling a Dell R710 they're going to use in their cluster

Last year the Centre for High Performance Computing (CHPC) ran a student cluster building competition for the first time, alongside their national meeting. The winning team progressed to the International Student Cluster Challenge in Leipzig and won top honours there. Observing the teams at work last year convinced me this is something we need to introduce to UWC, so this year, when David Macleod from the CHPC's ACE Lab announced the second installment of the competition, I contacted Computer Science to make sure we had a team. From that side Reg Dodds is facilitating things, and their sysadmin, Daniel Leenderts, is offering a helping hand. The team is being mentored by Motse Lehata, and includes Warren Jacobus, Saeed Natha, Nicole Thomas and Eugene de Beste.

On Tuesday Long and I wandered over to CS to observe and assist with the unpacking and installation of the practice cluster that Dell had sponsored. I'll be thin on the technical details in case they don't want them shared, but it provides enough hardware for installing and testing an operating system and applications for benchmarking. I'm hoping to use this cluster building competition as an opportunity to get students (and faculty) interested in building cyberinfrastructure as an area for research and maybe even future careers. After all, right now I've got the distinct impression that the small number of people I know who run the (mostly Linux) servers that power South African e-Research infrastructure ended up in that career path largely by accident. With big international projects like the SKA and H3Africa coming on stream in the next few years, we're going to need a much larger pool of expertise in scientific computing, High Performance Computing, scientific workflows (my personal research bugbear), data curation, storage and re-use, and so on. Right now, as far as I can see, there is no decent curriculum out there to train these people, something that I'm trying to address in my small way as part of the H3ABionet, and there is no clear track through the educational institutions into the research infrastructure (as opposed to pure research) side of things. It's gotta change!