Galaxy and the notorious rfind() error: HistoryDatasetAssociation objects aren’t strings

I have repeatedly triggered this error when writing Galaxy tool wrappers:

2018-03-26 13:36:58,408 ERROR [galaxy.jobs.runners] (2) Failure preparing job
Traceback (most recent call last):
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/runners/__init__.py", line 170, in prepare_job
job_wrapper.prepare()
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/__init__.py", line 909, in prepare
self.command_line, self.extra_filenames, self.environment_variables = tool_evaluator.build()
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/tools/evaluation.py", line 445, in build
raise e
AttributeError: 'HistoryDatasetAssociation' object has no attribute 'rfind'
FAIL

The command in the tool wrapper at the time included:

#import os.path
#set report_name os.path.splitext(os.path.basename($input_vcf))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

In the main part of the template, $input_vcf, which is a reference to an input dataset, effectively behaves like a string, as
it is substituted with the filename of the input dataset. In the #set part, however, it is a Python variable that refers
to the underlying HistoryDatasetAssociation (HDA). Hence the obscure-looking error message: an HDA is indeed not a string
and has no .rfind() method, which is what os.path.basename() expects of its argument.

The error can be fixed by wrapping $input_vcf in a str() call to convert it into its string representation, i.e.
the filename I am interested in:

#import os.path
#set report_name os.path.splitext(os.path.basename(str($input_vcf)))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output
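
The difference between the two #set lines can be reproduced outside Galaxy. What follows is a minimal Python sketch, not Galaxy code: FakeDataset is a hypothetical stand-in for the dataset object, whose only relevant feature is that str() returns a file path, and the path itself is made up. On Python 2 the failure is the AttributeError from the traceback; on Python 3 os.path.basename() rejects the object with a TypeError instead, so the sketch catches both.

```python
import os.path


class FakeDataset(object):
    """Hypothetical stand-in for a dataset object: not a str, but str() gives a path."""

    def __init__(self, file_name):
        self.file_name = file_name

    def __str__(self):
        return self.file_name


def report_name_for(dataset):
    # Mirrors the fixed #set line: convert to str before using os.path helpers.
    return os.path.splitext(os.path.basename(str(dataset)))[0] + '.html'


# Made-up path, for illustration only.
input_vcf = FakeDataset('/galaxy/files/000/dataset_1.vcf')

try:
    # Mirrors the broken #set line: the object itself is handed to os.path.
    os.path.basename(input_vcf)
    failed = False
except (AttributeError, TypeError):
    failed = True

report_name = report_name_for(input_vcf)
```

Here failed ends up True and report_name is built from the string form of the object, which is the behaviour the str() call restores in the template.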

Thanks to Marius van den Beek (@mvdbeek) for catching this for me.

A Galaxy 18.01 install

We are preparing for Galaxy Africa in a few weeks’ time, which will feature some Galaxy training. In preparation for that I installed a new Ubuntu 16.04 virtual machine to host an 18.01 Galaxy server. The aim is to set up a production Galaxy server. To that end, the server is being hosted on a 1 TB Ceph RBD partition mounted on /galaxy. A user called galaxyuser was created in our FreeIPA authentication environment, and /galaxy/galaxysrv was created to host Galaxy files.

The first step of setup was to clone the Galaxy 18.01 release and configure it for production use. The PostgreSQL database server was installed, a database user was created for galaxyuser, and that user was then used to create the galaxy database. I configured the database and added myself (pvh@sanbi.ac.za) as an admin user.

The next step was to install nginx. As far as possible I tried not to alter the “out of the box” nginx configuration, to make it easier to do upgrades later. To that end, firstly, an SSL certificate was added using certbot and Let’s Encrypt by installing the certbot and python-certbot-nginx packages, and running certbot --nginx certonly. This yielded /etc/letsencrypt/live/galaxy.sanbi.ac.za and associated files. The /etc/nginx/ssl directory was created and a /etc/nginx/ssl/dhparam.pem file was generated with openssl dhparam -out /etc/nginx/ssl/dhparam.pem 4096. This was in order to create a more secure configuration than the default, as explained here.

Following the instructions from the Galaxy Receiving Files with nginx documentation and advice from Marius van den Beek, nginx-extras was installed from the recommended PPA, yielding nginx, nginx-common and nginx-extras packages for version 1.10.3-0ubuntu0.16.04.2ppa1. Then a file /etc/nginx/conf.d/custom.conf was created with content as per this gist. This is effectively a combination of the options suggested by the Galaxy admin docs with those in /etc/letsencrypt/options-ssl-nginx.conf. The server configuration directives from the recommended Galaxy configuration were adapted and put in /etc/nginx/sites-available/galaxy. The resulting configuration is in this gist. Once added, the configuration was activated by removing the /etc/nginx/sites-enabled/default file and linking the galaxy configuration file in its place. Finally, /etc/nginx/nginx.conf was altered by changing the user used to run the server to galaxyuser (i.e. “user galaxyuser”). To connect Galaxy to nginx, the socket: option in the Galaxy config/galaxy.yml and the configuration in the nginx site configuration were harmonised as per the relevant documentation. Since the unix socket was not created on startup, an HTTP connection, and thus a TCP socket on localhost, was used instead.

The third step was configuring Galaxy to start using supervisord. This was based on the [program:web] configuration from the Galaxy starting and stopping configuration guide. And this is where things started going wrong. Using this configuration, the data upload tool didn’t work as it used the system Python, not Python from the Galaxy virtualenv configured in /galaxy/galaxysrv/galaxy/.venv. To ensure that the Galaxy virtualenv was activated before running the upload tool, the VIRTUAL_ENV config was added to the /etc/supervisor/conf.d/galaxy configuration, resulting in the config shown in this gist.
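
When chasing this kind of problem it helps to confirm which interpreter a process actually picked up. The following is a small hedged sketch; the function names are mine and are not part of Galaxy or supervisord. It can be run with whichever python a job ends up using, to see whether that interpreter lives in a virtualenv.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to be running inside a virtualenv/venv."""
    # Classic virtualenv (Python 2 era) sets sys.real_prefix.
    if hasattr(sys, 'real_prefix'):
        return True
    # The stdlib venv module (Python 3) makes sys.base_prefix differ from sys.prefix.
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix


def describe_interpreter():
    """One-line summary: interpreter path plus the virtualenv check result."""
    return '%s (virtualenv: %s)' % (sys.executable, in_virtualenv())


summary = describe_interpreter()
```

Printing summary from inside a failing job would show at a glance whether the system Python or the one under .venv is in use.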

The fourth step was to configure CVMFS to allow access to the reference data collection used on usegalaxy.org. I installed the cvmfs package by following the instructions to install the apt repository and then apt-get install cvmfs. The correct configuration was learned from Björn Grüning (@bgruening)’s bgruening/galaxy-stable Docker container with some help from @scholtalbers on gitter:

a. In /etc/cvmfs/domain.d/galaxyproject.org.conf put the line as per this gist.

b. In /etc/cvmfs/default.local put:

CVMFS_REPOSITORIES="data.galaxyproject.org"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="4000"
CVMFS_USE_GEOAPI="yes"

c. In /etc/cvmfs/keys/data.galaxyproject.org.pub put:

-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5LHQuKWzcX5iBbCGsXGt
6CRi9+a9cKZG4UlX/lJukEJ+3dSxVDWJs88PSdLk+E25494oU56hB8YeVq+W8AQE
3LWx2K2ruRjEAI2o8sRgs/IbafjZ7cBuERzqj3Tn5qUIBFoKUMWMSIiWTQe2Sfnj
GzfDoswr5TTk7aH/FIXUjLnLGGCOzPtUC244IhHARzu86bWYxQJUw0/kZl5wVGcH
maSgr39h1xPst0Vx1keJ95AH0wqxPbCcyBGtF1L6HQlLidmoIDqcCQpLsGJJEoOs
NVNhhcb66OJHah5ppI1N3cZehdaKyr1XcF9eedwLFTvuiwTn6qMmttT/tHX7rcxT
owIDAQAB
-----END PUBLIC KEY-----

d. Add the line /cvmfs /etc/auto.cvmfs to /etc/auto.master yielding a file looking like this gist.

e. A /cvmfs directory was created to be a mount point (mkdir /cvmfs).

f. The autofs service was restarted (systemctl restart autofs), after which ls /cvmfs/data.galaxyproject.org/byhand shows a pretty collection of reference data.

g. The config/galaxy.yml file was updated so that the tool_data_table_config_path key contains references to the files that are stored in CVMFS. The final value of this key was:

tool_data_table_config_path: /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml,config/tool_data_table_conf.xml

This might not fit on your screen, so see the config fragment here. After the update to the Galaxy config all services were restarted (with sudo supervisorctl restart all).
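
Because that value is one long comma-separated string, a typo in it is easy to miss. Below is a minimal Python sketch for checking that every entry exists once CVMFS is mounted. The helper names are hypothetical, and resolving relative entries against a galaxy_root directory is an assumption of the sketch rather than documented Galaxy behaviour.

```python
import os


def split_config_paths(value):
    """Split a comma-separated config value into a list of stripped, non-empty paths."""
    return [part.strip() for part in value.split(',') if part.strip()]


def missing_paths(value, galaxy_root='.'):
    """Return the entries of a comma-separated path list that do not exist on disk.

    Relative entries are resolved against galaxy_root (an assumption of this sketch).
    """
    missing = []
    for path in split_config_paths(value):
        full = path if os.path.isabs(path) else os.path.join(galaxy_root, path)
        if not os.path.exists(full):
            missing.append(path)
    return missing


value = (
    '/cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,'
    '/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml,'
    'config/tool_data_table_conf.xml'
)
paths = split_config_paths(value)
```

Calling missing_paths(value, '/galaxy/galaxysrv/galaxy') on the server should return an empty list if the CVMFS mount and the local config file are both in place.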

My fifth and final step was to test the Galaxy server by installing bowtie2 from the toolshed and working through the first steps of the mapping tutorial. Both human (hg19) and fruitfly (dm3) reference genomes were downloaded (and apparently stored in /var/lib/cvmfs/shared) using CVMFS and the bowtie2 mapping was run against them successfully, yielding the results expected from the tutorial.

Future work? There is lots – I have to connect the server to a cluster, using Slurm, and enable Interactive Environments… I’ll blog about that when I get there.

MaterializeCSS vs ReactJS: the case of the select

For historical reasons, the COMBAT TB web interface uses materialize for its styling. So far so good. That is, until I tried to deploy my code, written with ReactJS. See, as I understand it, materialize has in some cases decided to replace some HTML elements with its own version of them. Notably the <select> element. And ReactJS relies on these elements for its own operation.

The first problem I had was that <select> elements vanished. Turns out you need a bit of JavaScript to make them work:

$(document).ready(function() {
    $('select').material_select();
});

The next problem, however, was that the onChange handlers that ReactJS relies on were never triggered. Luckily that has been discussed before.

I’ve got two types of <select> element in the code I was writing for the COMBAT TB Explorer web application: static ones (there from the birth of the page) and dynamically generated ones. For the static ones I added some code to link up the events in the componentDidMount handler:

componentDidMount: function() {
    $(document).ready(function() {
        $('select').material_select();
    });
    $('#modeselectdiv').on('change', 'select', null, this.handleModeChange);
    $('#multicompselectdiv').on('change', 'select', null, this.handleMultiCompChange);

},

but this didn’t work for the dynamically generated elements, I think because they are only rendered after an AJAX call returns. Since I know a state change triggers a render, I added the handler hook-up after the data was returned and stored in the application’s state, for example:

success: function(datasets) {
    var dataset_list = [];
    var dataset_list_length = datasets.length;
    for (var i = 0; i < dataset_list_length; i++) {
        dataset_list.push({'name': datasets[i]['name'], id: datasets[i]['id']});
    }
    this.setState({datasets: dataset_list, dataset_id: dataset_list[0].id});
    $('select').material_select();
    $('#datasetselectdiv').on('change', 'select', null, this.handleDatasetChange);
}.bind(this),

Turns out this works. The state change handlers are now linked in, they keep the state up to date with what the user is doing on the form, and the whole thing (that links the application to a Galaxy instance) works. Yay!

How Galaxy resolves dependencies (or not)

There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML, which specifies a mapping between the Galaxy web user interface and the tool command line, and tool dependencies, which specify how to source the actual packages that implement the tool’s commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI’s local Galaxy installation.

Requirements and resolvers

In the tool XML, dependencies are specified using <requirement> clauses, for example:

<requirement type="package" version="0.7.10.039ea20639">bwa</requirement>

This is taken from the bwa tool XML that I installed from the toolshed and specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however – this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:

<requirement type="package">bedtools</requirement>

These dependencies are fed into the dependency manager (DependencyManager in lib/galaxy/tools/deps/__init__.py) which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).
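
To make the input to the dependency manager concrete, here is a minimal Python sketch that pulls the requirements out of a tool XML. It is not how Galaxy parses tools: the surrounding tool and requirements elements and the list_requirements helper are illustrative assumptions, and only the two requirement lines come from the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down tool XML wrapping the two requirements quoted above.
TOOL_XML = """
<tool id="example">
  <requirements>
    <requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
    <requirement type="package">bedtools</requirement>
  </requirements>
</tool>
"""


def list_requirements(tool_xml):
    """Return (name, type, version) tuples; version is None when not specified."""
    root = ET.fromstring(tool_xml)
    return [
        (req.text.strip(), req.get('type'), req.get('version'))
        for req in root.iter('requirement')
    ]


requirements = list_requirements(TOOL_XML)
```

The bedtools entry comes back with a version of None, which is exactly the versionless case that causes trouble later on.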

The default config

Galaxy provides a configuration file, config/dependency_resolvers_conf.xml, to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:

<dependency_resolvers>
  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />
</dependency_resolvers>

These aren’t the only available resolvers – there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules – but what is shown above is the default.

A quick note: if all the resolvers fail, you’ll see a message such as:

Failed to resolve dependency on 'bedtools', ignoring

in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy’s PATH.
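
The "try each resolver in order, give up with a log message" behaviour can be sketched in a few lines of Python. This is a simplified model written for this post, not Galaxy's DependencyManager: the function names, the resolver callables and the /deps path are all hypothetical.

```python
def resolve_dependencies(requirements, resolvers):
    """Try each resolver in order for each requirement; the first non-None result wins.

    requirements: list of (name, version) tuples, version may be None.
    resolvers: list of callables taking (name, version) and returning a shell
    fragment as a string, or None if they cannot resolve the requirement.
    """
    commands, messages = [], []
    for name, version in requirements:
        for resolver in resolvers:
            result = resolver(name, version)
            if result is not None:
                commands.append(result)
                break
        else:
            # No resolver succeeded: log it and carry on without the dependency.
            messages.append("Failed to resolve dependency on '%s', ignoring" % name)
    return commands, messages


# Hypothetical resolvers, for illustration only.
def versioned_only(name, version):
    known = {('bwa', '0.7.10.039ea20639'): '. /deps/bwa/0.7.10.039ea20639/env.sh'}
    return known.get((name, version))


def versionless(name, version):
    return None  # pretend no default links exist


commands, messages = resolve_dependencies(
    [('bwa', '0.7.10.039ea20639'), ('bedtools', None)],
    [versioned_only, versionless],
)
```

In this toy run bwa resolves through the first resolver, while the versionless bedtools requirement falls through every resolver and ends up in messages.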

Tool Shed Packages resolver

The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I’ll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number except for when they are referring to type="set_environment", not type="package" requirements. Uhhhhh, ok I’ll explain set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.

Underneath it all, the tool_shed_packages resolver looks for a file called env.sh provided as part of the packaged tool. This file contains settings that are sourced (i.e. . env.sh) into the job script Galaxy ends up executing.
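
As a rough illustration of that layout, the Python sketch below builds the path where env.sh would be expected and turns it into a source command. The helper names are mine, and placing env.sh directly in the changeset_revision directory is an assumption based on the description above rather than something checked against Galaxy's code.

```python
import os
import posixpath


def tool_shed_env_path(base_path, name, version, owner, package_name, changeset_revision):
    """Build the assumed env.sh location for a Tool Shed package."""
    return posixpath.join(
        base_path, name, version, owner, package_name, changeset_revision, 'env.sh'
    )


def source_command(env_path):
    """Shell fragment that sources env.sh, or None if the file is absent."""
    if os.path.isfile(env_path):
        return '. %s' % env_path
    return None


path = tool_shed_env_path(
    '/base_path', 'blast+', '2.2.31', 'iuc', 'package_blast_plus_2_2_31', 'e36f75574aec'
)
```

Note that version is a required part of the path, which is one way to see why a versionless package requirement cannot be looked up this way.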

Galaxy Packages resolver

The galaxy_packages resolver is oriented towards manually installed packages. Note that it is called twice: first in its default (version-supporting) form and secondly with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that’s where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file named env.sh it will emit code to source that file. This means that you can use the env.sh script any way you want to add a tool to Galaxy’s path. For example, here is our env.sh for bedtools:

#!/bin/sh

if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/module.sh
fi

module add bedtools/bedtools-2.20.1

That uses our existing Environment Modules installation to add bedtools to the PATH.

The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link in base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version numbered directory containing a bin/ subdirectory or env.sh file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.
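
Both forms of the lookup can be modelled with a short Python sketch. This is an approximation for illustration, not the resolver's actual code: the function name is hypothetical, and checking env.sh before bin/ as well as the exact PATH fragment emitted are assumptions.

```python
import os
import tempfile


def galaxy_package_commands(base_path, name, version=None):
    """Return a shell fragment for a manually installed package, or None.

    With a version, look in base_path/name/version; without one, follow the
    base_path/name/default symlink. Checking env.sh before bin/ is an
    assumption of this sketch.
    """
    if version is not None:
        path = os.path.join(base_path, name, version)
    else:
        path = os.path.join(base_path, name, 'default')
        if not os.path.islink(path):
            return None
    env_sh = os.path.join(path, 'env.sh')
    bin_dir = os.path.join(path, 'bin')
    if os.path.isfile(env_sh):
        return '. %s' % env_sh
    if os.path.isdir(bin_dir):
        return 'PATH="%s:$PATH"; export PATH' % bin_dir
    return None


# Build a throwaway layout: bedtools/2.22/env.sh plus a default -> 2.22 link.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'bedtools', '2.22'))
open(os.path.join(base, 'bedtools', '2.22', 'env.sh'), 'w').close()
os.symlink('2.22', os.path.join(base, 'bedtools', 'default'))

versioned = galaxy_package_commands(base, 'bedtools', '2.22')
versionless = galaxy_package_commands(base, 'bedtools')
unknown = galaxy_package_commands(base, 'samtools')
```

With the default link in place the versionless call succeeds; for a tool with no default link (samtools here) it returns None, which corresponds to the "Failed to resolve dependency" case.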

Supporting versionless requirements

As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you’d have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the env.sh buried within the package’s folders to that base_path/toolname/version directory. You could do that, or you could just manually add an env.sh as per the galaxy_packages schema and not install packages through the toolshed.

By the way, the set_environment type of requirement mentioned earlier, when talking about tool_shed_packages, is a special kind of ‘package’ that simply contains an env.sh script. These are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they’re the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.

In Conclusion

If you’ve read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton’s fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.

The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a speck of phytoplankton floating in a computing sea and I think he’s right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren’t enough hands on the job. Much better resourced fields are still trying to solve this problem and it’s clear that Galaxy is going to try and benefit from that work.

Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our own Galaxy installation I’m probably going to be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we’ve got going at SANBI. And hopefully I’ll get less confused by those Failed to resolve dependency messages in the future!