Galaxy and the notorious rfind() error: HistoryDatasetAssociation objects aren’t strings

I have repeatedly triggered this error when writing Galaxy tool wrappers:

2018-03-26 13:36:58,408 ERROR [] (2) Failure preparing job
Traceback (most recent call last):
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/runners/", line 170, in prepare_job
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/", line 909, in prepare
self.command_line, self.extra_filenames, self.environment_variables =
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/tools/", line 445, in build
raise e
AttributeError: 'HistoryDatasetAssociation' object has no attribute 'rfind'

The command in the tool wrapper at the time included:

#import os.path
#set report_name os.path.splitext(os.path.basename($input_vcf))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

In the main part of the template, $input_vcf, which is a reference to an input dataset, effectively behaves like a string, as
it is substituted with the filename of the input dataset. In the #set part, however, it is a Python variable that refers
to the underlying HistoryDatasetAssociation. Thus the obscure looking error message, because a HDA is indeed not a string
and has no .rfind() method.

The error can be fixed by wrapping $input_vcf in a str() call to convert it into its string representation, i.e.
the filename I am interested in:

#import os.path
#set report_name os.path.splitext(os.path.basename(str($input_vcf)))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

Thanks to Marius van den Beek (@mvdbeek) for catching this for me.

A Galaxy 18.01 install

We are preparing for Galaxy Africa in a few weeks’ time, which will feature some Galaxy training. In preparation for that I installed a new Ubuntu 16.04 virtual machine to host a 18.01 Galaxy server. The aim is to set up a production Galaxy server. To that end, the server is being hosted on a 1 TB Ceph RBD partition mounted on /galaxy. A user called galaxyuser was created on our FreeIPA authentication environment, and /galaxy/galaxysrv was created to host Galaxy files.

The first step of setup was to clone Galaxy 18.01 release and configure it for production use. The postgresql database server was installed and a user created for galaxyuser and then that user used to create the galaxy database. I configured the database and added myself ( as an admin user.

The next step was to install nginx. As far as possible I tried to not alter the “out of the box” nginx configuration, to make it easier to do upgrades later. To that end, firstly, a SSL certificate was added using certbot and Let’s Encrypt by installing the certbot and python-certbot-nginx packages, and running certbox --nginx certonly. This yielded /etc/letsencrypt/live/ and associated files. The /etc/nginx/ssl directory as created and a /etc/nginx/ssl/dhparam.pem file was created with openssl dhparam -out /etc/nginx/ssl/dhparam.pem 4096. This was in order to create a more secure configuration than default as explained here.

Following the instructions from the Galaxy Receiving Files with nginx documentation and advice from Marius van den Beek, nginx-extras was installed from the recommended PPA, yielding nginx, nginx-common and nginx-extras packages for version 1.10.3-0ubuntu0.16.04.2ppa1. Then a file /etc/nginx/conf.d/custom.conf was created with content as per this gist. This is effectively a combination of the options suggested by the Galaxy admin docs with those in /etc/letsencrypt/options-ssl-nginx.conf. The server configuration directives from the recommended Galaxy configuration were adapted and put in /etc/nginx/sites-available/galaxy. The resulting configuration is in this gist. Once added, the configuration was activated by removing the /etc/nginx/sites-enabled/default file and linking the galaxy configuration fie in its place. Finally, /etc/nginx/nginx.conf was altered by changing the user used to run the server to galaxyuser (i.e. “user galaxyuser”). To connect Galaxy to nginx, the socket: option in the Galaxy config/galaxy.yml and the configuration in the nginx site configuration were harmonised as per the relevant documentation. Since the unix socket was not created on startup, a http connection and thus TCP socket on localhost was used.

The third step was configuring Galaxy to start using supervisord. This was based on the [program:web] configuration from the Galaxy starting and stopping configuration guide. And this is where things started going wrong. Using this configuration, the data upload tool didn’t work as it used the system Python, not Python from the Galaxy virtualenv configured in /galaxy/galaxysrv/galaxy/.venv. To ensure that the Galaxy virtualenv was activated before running the upload tool, the VIRTUAL_ENV config was added to the /etc/supervisor/conf.d/galaxy configuration, resulting in the config shown in this gist.

The fourth step was to configure CVMFS to allow access to the reference data collection used on I installed the cvmfs package by following the instructions to install the apt repository and then apt-get install cvmfs. The correct configuration was learned from Björn Grüning (@bgruening)’s bgruening/galaxy-stable Docker container with some help from @scholtalbers on gitter:

a. In /etc/cvmfs/domain.d/ put the line as per this gist.

b. In /etc/cvmfs/default.local put:


c. In /etc/cvmfs/keys/ put:

-----END PUBLIC KEY-----

d. Add the line /cvmfs /etc/auto.cvmfs to /etc/auto.master yielding a file looking like this gist.

e. A /cvmfs directory was created to be a mount point (mkdir /cvmfs).

f. The autofs service was restarted (systemctl restart autofs) and then a ls /cvmfs/ shows a pretty collection of reference data.

g. updated The config/galaxy.yml file was updated so that the tool_data_table_config_path key contains references to the files that are stored in CVMFS. The final value of this key was:

tool_data_table_config_path: /cvmfs/,/cvmfs/,config/tool_data_table_conf.xml

This might not fit on your screen, so see the config fragment here. After the update to the Galaxy config all service were restarted (with sudo supervisorctl restart all).

My fifth and final step was to test the Galaxy server by installing bowtie2 from the toolshed and working through the first steps of the mapping tutorial. Both human (hg19) and fruitfly (dm3) reference genomes were downloaded (and apparently stored in /var/lib/cvmfs/shared) using CVMFS and the bowtie2 mapping was run against them successfully, yielding the results expected from the tutorial.

Future work? There is lots – I have to connect the server to a cluster, using Slurm, and enable Interactive Environments… I’ll blog about that when I get there.

MaterializeCSS vs ReactJS: the case of the select

For historical reasons, the COMBAT TB web interface uses materialize for its styling. So far so good. That is, until I tried to deploy my code, written with React.JS. See, as I understand it, materialize has in some cases decided to replace some HTML elements with its own version of them. Notably the <select> element. And ReactJS relies on these elements for its own operation.

The first problem I had was that <select> elements vanished. Turns out you need a bit of Javascript to make them work:

$(document).ready(function() {

The next problem, however, was that the onChange handlers that ReactJS uses don’t trigger events. Luckily that has been discussed before.

I’ve got two types of <select> element in the code I was writing for the COMBAT TB Explorer web application: static ones (there from the birth of the page) and dynamically generated ones. For the static ones I added some code to link up the events in the componentDidMount handler:

componentDidMount: function() {
    $(document).ready(function() {
    $('#modeselectdiv').on('change', 'select', null, this.handleModeChange);
    $('#multicompselectdiv').on('change', 'select', null, this.handleMultiCompChange);


but this didn’t work for the dynamically generated elements, I think because they are only rendered after an AJAX call returns. For since I know a state change triggers the render event, I added the handler hook-up after the data was return and deployed (to the application’s state), for example:

success: function(datasets) {
    var dataset_list = [];
    var dataset_list_length = datasets.length
    for (var i = 0; i < dataset_list_length; i++) {
        dataset_list.push({'name': datasets[i]['name'], id: datasets[i]['id']});
    this.setState({datasets: dataset_list, dataset_id: dataset_list[0].id});
    $('#datasetselectdiv').on('change', 'select', null, this.handleDatasetChange);

Turns out this works. The state change handlers are now linked in, they keep the state up to date with what the user is doing on the form, and the whole thing (that links the application to a Galaxy instance) works. Yay!

How Galaxy resolves dependencies (or not)

There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML that specifies a mapping between the Galaxy web user interface and the tool command line and tool dependencies that specify how to source the actual packages that implement the tool’s commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI‘s local Galaxy installation.

Requirements and resolvers

In the tool XML dependencies are specified using <requirement> clauses, for example:

<requirement type="package" version="">bwa</requirement>

This is taken from the bwa tool XML that I installed from the toolshed and specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however – this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:

<requirement type="package">bedtools</requirement>

These dependencies are fed into the dependency manager (DependecyManager in lib/galaxy/tools/deps/ which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).

The default config

Galaxy provides a configuration file, config/dependency_resolvers_conf.xml to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:

  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />

These aren’t the only available resolvers – there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules – but what is shown above is the default.

A quick note: if all the resolvers fail, you’ll see a message such as:

Failed to resolve dependency on 'bedtools', ignoring

in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy’s PATH.

Tool Shed Packages resolver

The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I’ll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number except for when they are referring to type="set_environment", not type="package" requirements. Uhhhhh, ok I’ll explain set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.

Underneath it all, the tool_shed_packages resolver looks for a file called provided as part of the packaged tool. This file contains settings that are sourced (i.e. . into the job script Galaxy ends up executing.

Galaxy Packages resolver

The galaxy_packages resolver is oriented towards manually installed packages. Note that it is called twice – first with its default – version supporting – form and secondly with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that’s where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file name it will emit code to source that file. This means that you can use the script any way you want to add a tool to Galaxy’s path. For example, here is our for bedtools:


if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/

module add bedtools/bedtools-2.20.1

That uses our existing Environment Modules installation to add bedtools to the PATH.

The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link in base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version numbered directory containing a bin/ subdirectory or file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.

Supporting versionless requirements

As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you’d have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the buried within the package’s folders to that base_path/toolname/version directory. You could do that, or you could just manually add a as per the galaxy_packages schema and not install packages through the toolshed.

By the way, the set_environment type of requirement mentioned earlier is related to thing referred to when talking about tool_shed_packages is a special kind of ‘package’ that simply contains an script. They are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they’re the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.

In Conclusion

If you’re read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton‘s fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.

The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a spec of phytoplankton floating in a computing sea and I think he’s right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren’t enough hands on the job. Much better resources fields are still trying to solve this problem and its clear that Galaxy is going to try and benefit from that work.

Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our personal Galaxy installation I’m going to probably be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we’ve got going at SANBI. And hopefully I’ll get less confused by those Failed to resolve dependency messages in the future!

Adventures in Galaxy output collections

For the Galaxy IUC Tools and Collections codefest we (the SANBI software developers) decided to take on what we thought would be a simple job: make the bamtools_split tool output a dataset collection instead of multiple datasets. So here’s the output clause of the old (multiple datasets) version of bamtools_split:

    <data format="txt" name="report" label="BAMSplitter Run" hidden="true">
      <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam" visible="true"/>

and this needed to change to:

    <collection name="report" type="list" label="BAMSplitter Run">
        <discover_datasets pattern="split_bam\.(?P&lt;designation&gt;.+)\.bam" ext="bam"/>

In other words, the <data> element just gets changed to a <collection> element and the <discover_datasets> element remains essentially the same. So we did this change and everything ran fine except: the output collection was empty. Why?

Lots of debugging followed, based on a fresh checkout of the Galaxy codebase. We discovered that the crucial function here is collect_dynamic_collections() in the module. This is called by the finish() method of the Jobclass, via the Toolclass’ method of the same name.

The collect_dynamic_collections function identifies output collections in a tool’s definition and then uses a collection builder to map job output files to a dataset collection type. The collection builder is a factory class defined in galaxy.dataset_collections.builder and each dataset collection type (defined in galaxy.dataset_collections.builder.types) has its own way of moving output elements into the members of a collection type.

Anyway, we traced this code all the way through to the point where it was obvious the dataset collection was being created successfully and then turned to the other Galaxy devs (John Chilton specifically) to ask for help, only to discover that the problem was gone. The dataset collection was somehow populated! It turns out that if your Galaxy tool creates an output dataset collection that has an uncertain number of members (like a list collection) then it is populated asynchronously and you need to refresh the history to see its members – this is known bug.

So that’s been quite a learning curve. The final tool is on Github. The collection tag for outputs was introduced above. We haven’t explored its pair mode, but check out Peter Briggs’ trimmomatic tool which has an option to output as a pair type dataset collection.

In the test section of the tool configuration, you can use a dataset collection like this:

    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
      <element name="MAPPED" file="bamtools-split-MAPPED1.bam" />
      <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam" />

The output_collection tag essentially groups outputs together, with each element tag taking the place that of an individual output tag. Each element tag has a name that maps to one of the names identified by the discover_datasets pattern (perhaps index numbers can be used instead of names, I don’t know) and can use the test attributes that output provides.

With the tests updated and some suitable sample data in place the tests pass and the tool is ready for a pull request. There was some discussion though on the semantics of this tool… for more go and read the comments on the PR.

A BLAST array job for the SANBI cluster

If you want to query a BLAST database with a large number of input query sequences, you might want to use this script. The easy way to gain speed for a BLAST search is to split the input set of query sequences (using a script such as or (if the sequences don’t contain linebreaks) you can use split or you can use csplit) into multiple parts and run the BLAST search as an array job. For this script, you need a working directory containing these subdirectories:

in/ - a directory containing your split queries in files named *.fasta
out/ - an empty output directory
logs/ - an empty log directory

Tune your splitting for efficiency: if your queries are too small, the time to start running will make the search inefficient. If your queries are too large, the jobs will run too long – remember that the timelimit on the default all.q is 8 hours.

Uljana and I wrote the script below to actually run the array job. If your working directory was, for example, /cip0/research/rosemary/blast and you saved this script as and you have 20 input query files, then you could submit the script with:

qsub -wd /cip0/research/rosemary/blast -t 1-20

Note that each use can have at most 20 jobs running on the cluster at any one time, so your queries will run in blocks of 20 jobs at a time. The raw source code is available and easier to copy than the listing below. Also note that you probably want to customise the actual BLAST command line (at the end of the script). The one in here was designed to pick up taxonomy information from the local install of the NR database – useful for doing a metagenomic scan.


# requirement:
# working directory with:
# in/ - files named .fasta that are query sequences
# out/ - empty directory to put outputs in
# logs/ - empty directory to put logs in 
# qsub with:
# qsub -t 1-2 -wd ./my-work-dir

#$ -o logs/$JOB_NAME.o$JOB_ID.$TASK_ID
#$ -e logs/$JOB_NAME.e$JOB_ID.$TASK_ID

### -----
### define input and output directories


cd $in_dir

### -----
### get all the file names into a file

ls *.fasta > $filelist

### -----
### access the fasta files by the ${SGE_TASK_ID}

fasta=`awk "NR == ${SGE_TASK_ID} {print}" $filelist` # ${file_list[$counter]}
echo $fasta

### -----
### add the blast module and run blast

. /etc/profile.d/
module add blastplus/default

blastn -query $in_dir/$fasta -db nt -out $out_dir/$fasta.txt -outfmt "6 std slen qlen qcovs qcovhsp staxids sscinames sskingdoms" -soft_masking false -max_target_seqs 3 -evalue 10

OrthoMCL and BLAST: Adventures in the (SANBI) Galaxy

BLAST in Galaxy

Part of my work for the week was to start using Galaxy more extensively at SANBI. I.e. to make it more usable. Last week I wrote an authentication plugin to allow a Galaxy server to authenticate using PAM. This got accepted into the 15.07 release of Galaxy, so I updated our Galaxy server to that release. I had neglected to include the example auth_conf.xml in the code I committed, but working off the example on my laptop I got PAM authentication working as a replacement for the previous HTTP authentication (which also spoke to PAM on the backend). I also took the opportunity to switch our server to using HTTPS using the SANBI wildcard certificate.

My first attempt at a practical use for our server came when I needed to run the BLAST step of the OrthoMCL pipeline. OrthoMCL uses an all-against-all BLAST as its input dataset, and based on the data I had from our colleagues, I had a collection of about 300,000 proteins to BLAST against each other. I started this off as an array job at CHPC but thought I could try and work locally as well, as a proof of concept. (Actually there was a previous step, filtering out poor proteins, but I’ll get to that below.) My first attempt at using BLAST hit a bug: “NotFound: cannot find ‘files_path’ while searching for ‘db_opts.histdb.files_path'”. This exception was thrown from __build_command_line in Galaxy’s lib/galaxy/tools/ because the BLAST wrappers use an attribute called files_path instead of extra_files_path. Peter Cock and John Chilton discuss the problem in this Github issue and Peter quickly committed a workaround to the BLAST tools.

Having fixed that, and having prepared the protein set (outside Galaxy), I decided to take a chance on the Galaxy “parallelisation” code. This is enabled through appropriate tags in the tool XML, and in the case of the blastp wrapper splits the query dataset into chunks of 1000 sequences each before submitting jobs (in Galaxy terms, actually tasks, not fully fledged jobs) to the cluster. Unfortunately these are individual jobs, not an array job, because array jobs are only implemented in the still-only-on-the-horizon DRMAA version 2. In any event, our cluster can handle thousands of job submissions so I hit go, saw the history item turn from grey to yellow, and waited. Unfortunately, after a day or so it went red (failed), but by then I was too busy with other stuff to debug it. To be continued…

(As an aside, the BLAST wrapper wraps BLAST+, whereas OrthoMCL uses legacy BLAST. I still need to check that the BLAST wrapper exposes enough flags in order to guarantee equivalence. A useful guide for some of the corresponding flags can be found on this page about ortholog finding).

OrthoMCL in and out of Galaxy

As mentioned previously, I was running BLAST as part of the OrthoMCL pipeline. OrthoMCL uses BLAST, MCL and a database (in the version we use, SQLite3) to compute the orthologs in a set of proteins. The pipeline has two steps before the BLAST stage (orthomclAdjustFasta and orthomclFilterFasta), five between the BLAST and MCL stages and a final step to process the MCL output. Currently I use a Makefile to execute the pipeline but at the GCC2015 Hackathon AJ started work on some wrappers for the steps in the pipeline. There has been previous work on executing OrthoMCL within Galaxy but that ran the entire workflow as a single tool. We want to implement the pipeline as a Galaxy workflow because that way we can (in theory at least) benefit from improvements in how BLAST is executed (e.g. parallelism) or even replace the BLAST step with a similar (but apparently faster) tool such as Diamond. The OrthoMCL pipeline is pretty linear so even given the limited capabilities of workflows in current Galaxy (as discussed by John Chilton at BOSC 2015) creating a OrthoMCL workflow should be pretty easy.

To that end we’ve now got a Github repository for the tool wrappers. I’m trying to follow the structure that groups like IUC use. AJ’s working on orthomclAdjustFasta, so I decided to tackle orthomclFilterFasta, a tool that takes a directory full of FASTA files as input, does some simple filtering and outputs a combined FASTA file. I’m not 100% sure on the requirements for the command line (I need to go back into the code and see how it is executed) so I’ve got a tool that generates a single shell command in the form:

mkdir inputs && /bin/bash dataset1.dat && /bin/bash orthomcl dataset2.dat && orthomclFilterFasta inputs/ <p1> <p2>

The is just a simple script to take a FASTA file, extract the tag that OrthoMCL uses to identify sets (added by orthomclAdjustFasta) and renames the file according to that tag. The orthomclFilterFasta tool insists that input files end in .fasta and are named according to their tag.

In any event the tool runs fine on a local Galaxy install. The next step is to get tool dependencies right, which is where the stuff in the package directory comes in. Galaxy can install packages for you (in an admin-configurable folder). For a tool the dependencies it needs are specified in a file called tool_dependencies.xml that is in the same folder as the tool XML.

The tool dependencies specify packages to install. For OrthoMCL two new packages have been written (see here, one for OrthoMCL and one for the Perl DBD::SQLite module that it depends on. OrthoMCL in turn depends on Perl and DBD::SQLite. This is done using a repository_dependencies.xml file – I’m still not sure if this is the correct approach, but in any event it follows the guide for simple repository dependencies in Galaxy. One limitation to repository dependencies is that they apparently only work within a single toolshed, so what to do if the package you require is in another toolshed?

Thus far the tool dependency stuff has been tested on a local Galaxy installation. It doesn’t work. Directories are created, but they are empty. Further testing and a better testing procedure is needed. Eric Rasche mentioned that Marius van den Beek has some Jenkins based testing framework that uses Docker to create sandboxes, and it is here – so perhaps getting this up and running is a next step.

And then finally, FastOrtho seems like a possibly viable alternative to OrthoMCL. The output seems roughly similar to OrthoMCLs and it is much faster (and as a single tool with no Perl dependencies, easier to package), but as with all new tools in bioinformatics, we’ll have to prove that it works well enough to replace OrthoMCL (which is somewhat of a standard in this domain). Well check back in a few weeks for updates…

Reflections on Big Data, Bioinformatics and the recent UCT/UWC workshop

Monday and Tuesday of this week was largely consumed by a focus on Big Data. First, Ton Engbersen from the IBM/ASTRON Centre for Exascale Technology presented a talk at UCT on microservers, data gravity and microclouds.

The microserver in question is the [DOME]( microserver-freescale/), a system-on-chip based device that crams 128 computers into a 19″ 2U rack drawer. Each computer takes up 13cm x 5 cm (and is 6 mm thick) and provides 12 PowerPC cores with up to 48 GB RAM, resulting in a 2U rack with 1536 cores and over 6 TB of RAM. The whole thing is cooled with warm water, an IBM innovation that is currently in use on the SuperMUC supercomputer in Leipzig, read more about its benefits on their page.

The DOME server is being developed to analyse data from the SKA, an exascale computing problem. The SKA is anticipated to generate between 300 to 1500 petabytes of data per year, putting it on the extreme end of scientific enterprises in terms of data volume. While big data is commonly associated with data volume, researchers at IBM identify four V’s of big data: volume, velocity, variety and veracity. Volume is straightforward. Velocity speaks to the rate at which new data appears. With the amount of sequence data available in GenBank growing at an exponential rate, both volume and velocity of data threaten to outstrip the ability of bioinformatics centres to analyse data. In terms of integration of data, however, my presentation on the big data of tuberculosis focussed more on the variety and veracity of available data. A survey of the data published alongside research articles in the field shows that much of the variety of data gleaned through bioinformatics experiments is lost or only retained in closed institutional databases (and thus effectively lost to the field). An overview of health data collected as part of the NIH-funded Centre for Predictive Computational Phenotyping illustrates the problem of data veracity: electronic health records for patients are often incomplete and lack the vocabulary researchers require to identify disease presence of progression.

Managing the data collections necessary to study e.g. the global state of TB prevalence and treatment will require digital curation of multiple datasets drawn from experiments performed in a range of domains. As Ton Engbersen pointed out that the growing size of data means that “compared to the cost of moving bytes around, everything else is free” (originally a Jim Gray quote). Add to this the (much more tractable) fact that the skills required to build stores and curate these datasets are unevenly distributed, data collections are set to become “the new oil”. Engebersen proposes a solution: micro-clouds that offer the possibility to move code to the data rather than the other way round. Such entities would require a sophisticated cross-institutional authentication framework – almost certainly built on digital certificates – to allow authorised software agents to interface with data. This immediate suggests a set of research priorities to add to SANBI’s existing research projects on data storage and data movement. Luckily this research overlaps with some research interests at UWC Computer Science.

The workshop concluded with some agreements to collaborate between UCT and UWC on big data, but the perspectives delivered show that there is much more at play than the SKA. The fact that both UWC and UCT have established bioinformatics expertise and are located on the established SANReN backbone means that there’s an immediate opportunity to share knowledge and experiments on projects that tackle all four V’s of big data. Lots of ideas… the coming year will see how they can be put into practice.

A Puppet definition for a Ensembl API server

At SANBI we use Puppet to manage system configuration for our servers. This significantly reduces the management headache, allowing us to make changes in a central location (e.g. what the DNS server IP addresses are) and also allows us to create “classes” of servers for different roles. Recently we hosted a course on the Ensembl Genome Browser taught by Bert Overduin of the EBI. In addition to teaching people how to use the Ensembl website, Bert taught a number of students how to use the Ensembl Perl API. I set up a VM, using the web interface to SANBI’s private VM cloud, and created a puppet definition that would install the Ensembl API on the server. So here’s a commented version of the definition I created.

First, a note about puppet: Puppet configuration is declarative, in other words it defines what should be, not (necessarily) how to get there. Each configuration item creates a “resource”. Puppet provides a bunch of resource types out of the box and allows you to define your own types. For this server, I defined two types, the download and the unpack types, referring to a resource that required downloading and a resource that required unpacking respectively. These definitions went in my .pp file ahead of my server definition, along with a download_and_unpack type that combined the two definitions. The download_and_unpack type uses resource ordering, in its arrow (->) form. Since the Puppet configuration language is declarative, not imperative, you cannot assume that resources are created in the order that you specify, so if order is a requirement you need to specify it. Anyway here are these types:

define download( $url, $dist='defaultvalue', $download_dir='/var/tmp' ) {

    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    $downloaded_dist = "$download_dir/$dist_file"
    exec { "download_$title":
        creates => $downloaded_dist,
        path => '/usr/bin',
        command => "wget -O $downloaded_dist $url",

define unpack ( $dist, $creates, $dest='/opt', $download_dir='/var/tmp' ) {
    $suffix = regsubst($dist, '^.*(gz|bz2)$', '\1', 'I')
    if $suffix == 'gz' {
         $comp_flag = 'z'
    } elsif $suffix == 'bz2' {
         $comp_flag = 'j'
    } else { 
         $comp_flag = ''

    exec { "unpack_$title":
         creates => "$dest/$creates",
         command => "tar -C $dest -${comp_flag}xf $download_dir/$dist",
         path => '/bin',

define download_and_unpack ( $url, $dist='defaultvalue', 
                             $creates, $dest='/opt',
                             $download_dir='/var/tmp' ) {
    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    download { "get_$title":
        url => $url,
        dist => $dist_file, 
        download_dir => $download_dir 
    unpack { "install_$title":
        dist => $dist_file, 
        creates => $creates, 
        dest => $dest, 
        download_dir => $download_dir 

Just one last notes on these types: they use exec, that executes a command. In Puppet exec will be executed each time the config is run, unless you use a creates, onlyif or unless statement. I thus use knowledge of what the commands do to specify that they should NOT be run if certain files exist.

Then there is one more type I need: a Ensembl course user with a particular defined password (the password matches the username – yes, very insecure, but this is on a throwaway VM for a single course). This is defined in terms of a user and an exec resource. The exec resource checks for the presence of the username *without* a password in /etc/shadow, and if it exists uses usermod to set the password (first generating it using openssl). Note that the generate() function runs on the Puppet server, not the client, so anything you are using there needs to be installed on the server (in this case it was openssl that was installed on the server already).

define enscourse_createuser {
    $tmp = generate("/usr/bin/openssl","passwd","-1",$name)
    $password_hash = inline_template('<%= @tmp.chomp %>')
    user { "$name":
      require => Group['enscourse'],
      ensure => present,
      gid => 'enscourse',
      comment => "Ensembl Course User $name",
      home => "/home/$name",
      managehome => true,
      shell => '/bin/bash',
    exec { "/usr/sbin/usermod -p '${password_hash}' ${name}":
      onlyif => "/bin/egrep -q '^${name}:[*!]' /etc/shadow",
      require => User[$name],

With the custom types out of the way we can start looking at the Puppet node that defines the “” server configuration:

node '' inherits 'sanbi-server-ubuntu1204' {
    network::interface { "eth0":
         ipaddr  => "",
         netmask => "",

We have an established “base machine definition” that we inherit from. This is *not* the recommended way to create Puppet configs, but we didn’t know that when we started using Puppet at SANBI. Puppet’s type system encourages a kind of mixin style programming, so there should be a set of Puppet classes e.g. sanbi-server or ubuntu-1204-server, and we should include them in the node definition. Just a quick note: Puppet classes are effectively singleton objects: they define a collection of resources that is declared once (as soon as the class is used in an include statement) in the entire Puppet catalog (a Puppet catalog is the collection of resources that will be applied to a particular system). Read Craig Dunn’s blog for a bit on the difference between Puppet defined types and classes.

We then define the network interface parameters (an entry on SANBI’s private Class C network). And then onwards to an Augeas definition that ensures that pam_mkhomedir is enabled. Augeas is a configuration management tool that parses text files and turns them into a tree that can be addressed and manipulating using a path specification language.

    augeas { 'mod_mkhomedir in pam':
        context => '/files/etc/pam.d/common-session',
        changes => [ 'ins 1000 after *[last()]',
                     'set 1000/type session',
                     'set 1000/control required',
					 'set 1000/module',
					 'set 1000/argument umask=0022',
	    onlyif => "match *[module=''] size == 0",

And now on to some package definitions. Ensembl requires a specific version of Bioperl (version 1.7.3) so we need to ensure that the Bioperl from the Ubuntu repositories is not installed. And then we provide a few text editors, the CVS version control system, and the mysql server.

    # pvh - 03/09/2013 - can't use bioperl from ubuntu repo. must be v 1.2.3
    package {['bioperl','bioperl-run']:
        ensure => "absent",

    package {['emacs23-nox', 'joe', 'jupp']:
        ensure => "present",

    package {'cvs':
        ensure => "present",

    package { 'mysql-server':
        ensure => "present",

Now we get to use our download_and_unpack resource type to download and unpack the modules, as specificed by the Ensembl API installation instructions. Then define a /etc/profile.d/ file so that the Ensembl stuff gets added to users’ PERL5LIB environment variables:

    download_and_unpack { 'bioperl':
        url => '',
        creates => 'bioperl-1.2.3/t/trim.t',

    download_and_unpack { 'ensembl':
        url => '',
        creates => 'ensembl/sql/table.sql',

    download_and_unpack { 'ensembl-compara':
        url => '',
        creates => 'ensembl-compara/sql/tree-stats.sql',

    download_and_unpack { 'ensembl-variation':
        url => '',
        creates => 'ensembl-variation/sql/var_web_config.sql',

    download_and_unpack { 'ensembl-functgenomics':
        url => '',
        creates => 'ensembl-functgenomics/sql/trimmed_funcgen_schema.xls',

    file { '/etc/profile.d/':
        content => '#!/bin/sh
export PERL5LIB
        owner => root,
        mode => 0644,

While much of the Ensembl API is pure Perl, Bert wanted the calc_genotypes tool compiled for use during the course, so we need a few more packages and an exec resource to do the compilation (with the associated creates statement to stop it being re-run on each puppet run):

    # for compiling calc_genotypes
    package { ['libipc-run-perl', 'build-essential']:
       ensure => present,

    exec { 'build_calc_genotypes':
       creates => '/opt/ensembl-variation/C_code/calc_genotypes',
       require => [Download_and_unpack['ensembl-variation'],
       command => 'make calc_genotypes',
       cwd => '/opt/ensembl-variation/C_code',
       user => 'root',
       path => '/bin:/usr/bin',


And finally some ugly hackery. I need a list of users to create, but Puppet doesn’t have an easy way to do this. So I wrote a little Python script that generates a list of usernames, separated by @. When I use this with generate() I need to get rid of the spurious newline, which I do using an inline template, and finally generate the list using split(). Yes I know, really ugly. Its this kind of stuff that is making us here at SANBI consider switching to Salt Stack (also because we love Python here).

Anyway, once we’ve got a list we can just pass it to define a collect of enscourse_createuser resources. The resource naming is a bit off, since “createuser” implies something imperative. I should have just called this enscourse_user or something. And finally close off the curly braces, our node definition is complete!

     $tmp = generate('/usr/local/bin/', 'user', 25)
     $user_string = inline_template('<%= @tmp.chomp %>')
     notice("user string :${user_string}:")
     $user_list = split($user_string, '@')

     group { 'enscourse':
       ensure => present

     enscourse_createuser { $user_list: }

Here is that little Python script by the way:


import sys

base = sys.argv[1]
limit = int(sys.argv[2])
num_list = [base + str(x) for x in range(1,limit+1)]
print "@".join(num_list),

Remember that generate() is run on the Puppet server, so this script is installed on there. Well that’s it! And here is the whole thing as one block in case you want to copy and paste it:

Continue reading