Galaxy 22.05 and the cluster_venv

I recently upgraded one of my Galaxy servers to version 22.05 (once again using Ansible and the excellent Galaxy Ansible roles). Unfortunatey this exposed a problem with our environment. The Galaxy server that I configured is running on Ubuntu 20.04, but the HPC cluster's worker nodes use Ubuntu 18.04. Ubuntu 18.04 comes with Python 3.6, which is too old for modern Galaxy.

Thanks to advice from Galaxy admin's chat I installed a Conda environment by downloading Miniconda3 and doing a batch install ( -b -p GALAXY_DIR/conda, where GALAXY_DIR is my Galaxy installation directory). Then I activated this with source GALAXY_DIR/conda/bin/activate) and proceeded to run Galaxy's as described in this post. My GALAXY_VIRTUAL_ENV in job_conf.xml points to the location of the virtualenv.

The end result is that Galaxy uses this virtual environment to provide Python on cluster nodes, and job submission now works again.

Loading Nanopore data into Galaxy

Basecalled nanopore data is output in a structure like this:


In other words, for each barcode there is a collection of multiple fastq files. While it is straightforward to concatenate these together on the command line, for non-command-line-users that is not as easy. To get this into Galaxy for analysis, you can follow these steps:

  1. Upload the whole fastq_pass directory to the Galaxy server using FTP upload
  2. Using the Rule Based Uploader, select that you want to create Collections and that your source is the FTP Directory and use these rules (these rules expect there to be a single directory in your FTP area with barcode directories. If that is not the case, you need to change the .*/barcode\\d+/(.*) regular expression, to match your upload directory. For example if your upload directory was called fastq_pass make it fastq_pass/barcode\\d+/(.*). These rules also assume that your data is gzipped FASTQ. If it is not gzipped fastq, changed the fastq.gz$ to fastq$ and changed the fastqsanger.gz to fastqsanger. You can also find these rules here:
    "rules": [
      "type": "add_column_metadata",
      "value": "path"
      "type": "add_filter_regex",
      "target_column": 0,
      "expression": "fastq.gz$",
      "invert": false
      "type": "add_column_regex",
      "target_column": 0,
      "expression": ".*/barcode\\d+/(.*)",
      "group_count": 1
      "type": "add_column_regex",
      "target_column": 0,
      "expression": "./(barcode\\d+)/.*",
      "group_count": 1
      "type": "swap_columns",
      "target_column_0": 1,
      "target_column_1": 2
    "mapping": [
      "type": "ftp_path",
      "columns": [
      "type": "list_identifiers",
      "columns": [
      "editing": false
    "extension": "fastqsanger.gz"
  3. Give the collection a name (e.g. samples1) and Upload. A new list of lists called samples1 will be created.
  4. Create a tabular mapping file with the first column being the barcode name and the second column to sample name that you want to use. You can either create this as a TSV (e.g. using Excel) or you can type it into the "Paste" box in the Galaxy uploader (and make sure to select the type as tabular and the "convert spaces to tabs" option in the settings).
  5. Using the input collection and tabular renaming file as inputs, run this workflow. The result is a list with the elements of the list being the concatenated reads from the barcode directories.

You can watch a video demo of this method here.

Making a Galaxy data manager idempotent (a hack)

In programming the term idempotent is used to mean that you can run a function or an application more than once and get the same results each time. In this blog post I will be discussing a hack I used in the primer_scheme_bedfiles data manager. The data manager installs reference files used by pipelines designed for processing ARTIC protocol amplicon sequencing results. The ARTIC / PrimalSeq protocol uses tiled PCR to amplify a genetic sample of a virus and this approach is the backbone to most SARS-CoV-2 sequencing happening around the world today. A pre-sequencing step of the protocol uses pools of specially designed primers, and the sequence corresponding to these primers needs to be removed from the reads during the bioinformatic analysis. Primer locations are described in BED-like files that get fed into tools like ARTIC minion and ivar trim. To avoid the need for the user to supply these each time an analysis is run, the [primer_scheme_bedfiles data manager]( stores these in a shared data area accessible to all tools on the Galaxy server it runs on.

While users can supply their own BED files to upload a new primer scheme description, most commonly users will want to download them from the relevant websites where they are published. This is where a challenge comes in: only one copy of each primer scheme file should be stored, but the download module does not know which schemes the Galaxy server has installed. Offering to install the same scheme twice is an error. The solution is to use a little-known feature of Galaxy tool authoring: the Galaxy tool actually has access to the state of the currently running Galaxy server (using the $__app__ variable) and this can be used to look up the contents of data tables. Here is the code in question:

            #set $data_table = $__app__.tool_data_tables.get("primer_scheme_bedfiles")
            #if $data_table is not None:
                #set $known_primers = [ row[0] for row in $data_table.get_fields() ]
                #set $primer_list = ','.join([ primer_name for primer_name in $input.primers if primer_name not in known_primers ])
                #set $primer_list = $input.primers
            #end if

The reference to $__app__.tool_data_tables is a ToolDataTableManager (defined here) which effectively acts like a hash keyed on the tool data table name. The primer_scheme_bedfiles tool data table is a TabularToolDataTable, which in turn provides the get_fields() method that returns a two-dimensional array. The first dimension is rows and in this particular table I'm talking about, the first column is the value field, i.e. the name of the primer scheme. The code above then computes the difference between the primer schemes requested by the user and the ones already installed and only downloads those that aren't already present. It could have used sets instead of lists for a little extra efficiency but the logic would largely remain the same.

One peculiarity discovered was the processing of the list comprehension - the primer_name variable used here is used without a $ because the Cheetah template system wants it that way. The $input.primers becomes a string (val1,val2) when used as a value in the template but in the context of the #set and list comprehension is a list (['val1', 'val2']).

Note that as Marius van den Beek pointed out when I mentioned this technique on Twitter, using $__app__ to look inside the state of the Galaxy server is not recommended and may well break in the (near?) future. I hope that by that time Galaxy gives you a different way to see the current state of a data table to preserve the possibilities of idempotent behaviour.

I'm largely writing this post to document how things work (this is documented in the comments in the code) in case I forget next time I need to write a similar DM or in case this post helps others in the Galaxy tool developer community.

Galaxy 21.05 upgrade and cluster_venv

As part of my (rather prolonged) work towards a M.Sc. in bioinformatics, I maintain a Galaxy server at SANBI. I've recently upgraded to Galaxy 21.05, at the time of this writing the latest Galaxy release. You can read more about that release here.

My Galaxy server is deployed using Ansible with a combination of the standard Galaxy roles and ones developed at SANBI to match our infrastructure. Specifically, we have roles for integrating with our infrastructure's authentication, monitoring and CephFS filesystem. I also wrote a workaround for deploying letsencrypt based SSL. You can find this configuration in this repository.

The Galaxy server integrates with our cluster, the worker nodes of which are running Ubuntu 18.04 (the Galaxy server is on Ubuntu 20.04). For a number of tasks, Galaxy requires tools to have some access to Python libraries that are not part of core Python for the business of "finishing" jobs (i.e. feeding results back into Galaxy) and so on. In the past I have found that using the single virtualenv that the Galaxy roles configure on the Galaxy server causes problems when running jobs on the cluster. Thus I have a specific venv for running on the cluster that is configured on the cluster. I.e. after the Galaxy server install was completed, I logged into one of the cluster worker nodes as root, deleted the old cluster_venv and ran:

cd /projects/galaxy/pvh_masters_galaxy1
export GALAXY_VIRTUAL_ENV=$(pwd)/cluster_venv
cd server
scripts/ --skip-client-build --skip-samples

Obviously it would be better to automate the above, but I have not got around to doing so yet. I'm not sure if this is the best approach but it works at least for our environment, so I'm writing this blog post in case it is useful to others (or to jog my own memory down the line!). This cluster_venv setup is exposed to the job runners in job_conf.xml - here is a snippet of my configuration:

    <plugins workers="4">
        <plugin id="local" type="runner" load=""/>
        <plugin id="slurm" type="runner" load=""/>
    <destinations default="dynamic">
        <destination id="slurm" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
        <destination id="local" runner="local"/>
        <destination id="dynamic" runner="dynamic">
            <param id="tmp_dir">True</param>
            <param id="type">dtd</param>
        <destination id="cluster_default" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="SLURM_CONF">/tools/admin/slurm/etc/slurm.conf</env>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
            <param id="nativeSpecification">--mem=10000</param>
            <resubmit condition="memory_limit_reached" destination="cluster_20G" />

P.S. this was the only manual task I had to perform (on the Galaxy side of things). Mostly the update consisted of updating our SANBI ansible roles to support Ubuntu 20.04 (and Ceph octopus), switching to the latest roles (as described in the training material for Galaxy admins), flicking the version number from release_20.09 to release_21.05 and running the Ansible playbook.

Galaxy and the notorious rfind() error: HistoryDatasetAssociation objects aren’t strings

I have repeatedly triggered this error when writing Galaxy tool wrappers:

2018-03-26 13:36:58,408 ERROR [] (2) Failure preparing job
Traceback (most recent call last):
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/runners/", line 170, in prepare_job
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/", line 909, in prepare
self.command_line, self.extra_filenames, self.environment_variables =
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/tools/", line 445, in build
raise e
AttributeError: 'HistoryDatasetAssociation' object has no attribute 'rfind'

The command in the tool wrapper at the time included:

#import os.path
#set report_name os.path.splitext(os.path.basename($input_vcf))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

In the main part of the template, $input_vcf, which is a reference to an input dataset, effectively behaves like a string, as
it is substituted with the filename of the input dataset. In the #set part, however, it is a Python variable that refers
to the underlying HistoryDatasetAssociation. Thus the obscure looking error message, because a HDA is indeed not a string
and has no .rfind() method.

The error can be fixed by wrapping $input_vcf in a str() call to convert it into its string representation, i.e.
the filename I am interested in:

#import os.path
#set report_name os.path.splitext(os.path.basename(str($input_vcf)))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

Thanks to Marius van den Beek (@mvdbeek) for catching this for me.

A Galaxy 18.01 install

We are preparing for Galaxy Africa in a few weeks' time, which will feature some Galaxy training. In preparation for that I installed a new Ubuntu 16.04 virtual machine to host a 18.01 Galaxy server. The aim is to set up a production Galaxy server. To that end, the server is being hosted on a 1 TB Ceph RBD partition mounted on /galaxy. A user called galaxyuser was created on our FreeIPA authentication environment, and /galaxy/galaxysrv was created to host Galaxy files.

The first step of setup was to clone Galaxy 18.01 release and configure it for production use. The postgresql database server was installed and a user created for galaxyuser and then that user used to create the galaxy database. I configured the database and added myself ( as an admin user.

The next step was to install nginx. As far as possible I tried to not alter the "out of the box" nginx configuration, to make it easier to do upgrades later. To that end, firstly, a SSL certificate was added using certbot and Let's Encrypt by installing the certbot and python-certbot-nginx packages, and running certbox --nginx certonly. This yielded /etc/letsencrypt/live/ and associated files. The /etc/nginx/ssl directory as created and a /etc/nginx/ssl/dhparam.pem file was created with openssl dhparam -out /etc/nginx/ssl/dhparam.pem 4096. This was in order to create a more secure configuration than default as explained here.

Following the instructions from the Galaxy Receiving Files with nginx documentation and advice from Marius van den Beek, nginx-extras was installed from the recommended PPA, yielding nginx, nginx-common and nginx-extras packages for version 1.10.3-0ubuntu0.16.04.2ppa1. Then a file /etc/nginx/conf.d/custom.conf was created with content as per this gist. This is effectively a combination of the options suggested by the Galaxy admin docs with those in /etc/letsencrypt/options-ssl-nginx.conf. The server configuration directives from the recommended Galaxy configuration were adapted and put in /etc/nginx/sites-available/galaxy. The resulting configuration is in this gist. Once added, the configuration was activated by removing the /etc/nginx/sites-enabled/default file and linking the galaxy configuration fie in its place. Finally, /etc/nginx/nginx.conf was altered by changing the user used to run the server to galaxyuser (i.e. "user galaxyuser"). To connect Galaxy to nginx, the socket: option in the Galaxy config/galaxy.yml and the configuration in the nginx site configuration were harmonised as per the relevant documentation. Since the unix socket was not created on startup, a http connection and thus TCP socket on localhost was used.

The third step was configuring Galaxy to start using supervisord. This was based on the [program:web] configuration from the Galaxy starting and stopping configuration guide. And this is where things started going wrong. Using this configuration, the data upload tool didn't work as it used the system Python, not Python from the Galaxy virtualenv configured in /galaxy/galaxysrv/galaxy/.venv. To ensure that the Galaxy virtualenv was activated before running the upload tool, the VIRTUAL_ENV config was added to the /etc/supervisor/conf.d/galaxy configuration, resulting in the config shown in this gist.

The fourth step was to configure CVMFS to allow access to the reference data collection used on I installed the cvmfs package by following the instructions to install the apt repository and then apt-get install cvmfs. The correct configuration was learned from Björn Grüning (@bgruening)'s bgruening/galaxy-stable Docker container with some help from @scholtalbers on gitter:

a. In /etc/cvmfs/domain.d/ put the line as per this gist.

b. In /etc/cvmfs/default.local put:


c. In /etc/cvmfs/keys/ put:

-----END PUBLIC KEY-----

d. Add the line /cvmfs /etc/auto.cvmfs to /etc/auto.master yielding a file looking like this gist.

e. A /cvmfs directory was created to be a mount point (mkdir /cvmfs).

f. The autofs service was restarted (systemctl restart autofs) and then a ls /cvmfs/ shows a pretty collection of reference data.

g. updated The config/galaxy.yml file was updated so that the tool_data_table_config_path key contains references to the files that are stored in CVMFS. The final value of this key was:

tool_data_table_config_path: /cvmfs/,/cvmfs/,config/tool_data_table_conf.xml

This might not fit on your screen, so see the config fragment here. After the update to the Galaxy config all service were restarted (with sudo supervisorctl restart all).

My fifth and final step was to test the Galaxy server by installing bowtie2 from the toolshed and working through the first steps of the mapping tutorial. Both human (hg19) and fruitfly (dm3) reference genomes were downloaded (and apparently stored in /var/lib/cvmfs/shared) using CVMFS and the bowtie2 mapping was run against them successfully, yielding the results expected from the tutorial.

Future work? There is lots - I have to connect the server to a cluster, using Slurm, and enable Interactive Environments... I'll blog about that when I get there.

MaterializeCSS vs ReactJS: the case of the select

For historical reasons, the COMBAT TB web interface uses materialize for its styling. So far so good. That is, until I tried to deploy my code, written with React.JS. See, as I understand it, materialize has in some cases decided to replace some HTML elements with its own version of them. Notably the <select> element. And ReactJS relies on these elements for its own operation.

The first problem I had was that <select> elements vanished. Turns out you need a bit of Javascript to make them work:

$(document).ready(function() {

The next problem, however, was that the onChange handlers that ReactJS uses don't trigger events. Luckily that has been discussed before.

I've got two types of <select> element in the code I was writing for the COMBAT TB Explorer web application: static ones (there from the birth of the page) and dynamically generated ones. For the static ones I added some code to link up the events in the componentDidMount handler:

componentDidMount: function() {
    $(document).ready(function() {
    $('#modeselectdiv').on('change', 'select', null, this.handleModeChange);
    $('#multicompselectdiv').on('change', 'select', null, this.handleMultiCompChange);


but this didn't work for the dynamically generated elements, I think because they are only rendered after an AJAX call returns. For since I know a state change triggers the render event, I added the handler hook-up after the data was return and deployed (to the application's state), for example:

success: function(datasets) {
    var dataset_list = [];
    var dataset_list_length = datasets.length
    for (var i = 0; i < dataset_list_length; i++) {
        dataset_list.push({'name': datasets[i]['name'], id: datasets[i]['id']});
    this.setState({datasets: dataset_list, dataset_id: dataset_list[0].id});
    $('#datasetselectdiv').on('change', 'select', null, this.handleDatasetChange);

Turns out this works. The state change handlers are now linked in, they keep the state up to date with what the user is doing on the form, and the whole thing (that links the application to a Galaxy instance) works. Yay!

How Galaxy resolves dependencies (or not)

There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML that specifies a mapping between the Galaxy web user interface and the tool command line and tool dependencies that specify how to source the actual packages that implement the tool's commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI's local Galaxy installation.

Requirements and resolvers

In the tool XML dependencies are specified using <requirement> clauses, for example:

<requirement type="package" version="">bwa</requirement>

This is taken from the bwa tool XML that I installed from the toolshed and specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however - this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:

<requirement type="package">bedtools</requirement>

These dependencies are fed into the dependency manager (DependecyManager in lib/galaxy/tools/deps/ which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).

The default config

Galaxy provides a configuration file, config/dependency_resolvers_conf.xml to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:

  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />

These aren't the only available resolvers - there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules - but what is shown above is the default.

A quick note: if all the resolvers fail, you'll see a message such as:

Failed to resolve dependency on 'bedtools', ignoring

in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy's PATH.

Tool Shed Packages resolver

The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I'll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number except for when they are referring to type="set_environment", not type="package" requirements. Uhhhhh, ok I'll explain set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.

Underneath it all, the tool_shed_packages resolver looks for a file called provided as part of the packaged tool. This file contains settings that are sourced (i.e. . into the job script Galaxy ends up executing.

Galaxy Packages resolver

The galaxy_packages resolver is oriented towards manually installed packages. Note that it is called twice - first with its default - version supporting - form and secondly with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that's where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file name it will emit code to source that file. This means that you can use the script any way you want to add a tool to Galaxy's path. For example, here is our for bedtools:


if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/

module add bedtools/bedtools-2.20.1

That uses our existing Environment Modules installation to add bedtools to the PATH.

The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link in base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version numbered directory containing a bin/ subdirectory or file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.

Supporting versionless requirements

As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you'd have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the buried within the package's folders to that base_path/toolname/version directory. You could do that, or you could just manually add a as per the galaxy_packages schema and not install packages through the toolshed.

By the way, the set_environment type of requirement mentioned earlier is related to thing referred to when talking about tool_shed_packages is a special kind of 'package' that simply contains an script. They are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they're the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.

In Conclusion

If you're read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton's fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.

The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a spec of phytoplankton floating in a computing sea and I think he's right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren't enough hands on the job. Much better resources fields are still trying to solve this problem and its clear that Galaxy is going to try and benefit from that work.

Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our personal Galaxy installation I'm going to probably be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we've got going at SANBI. And hopefully I'll get less confused by those Failed to resolve dependency messages in the future!