How Galaxy resolves dependencies (or not)

There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML that specifies a mapping between the Galaxy web user interface and the tool command line and tool dependencies that specify how to source the actual packages that implement the tool’s commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI‘s local Galaxy installation.

Requirements and resolvers

In the tool XML dependencies are specified using <requirement> clauses, for example:

<requirement type="package" version="">bwa</requirement>

This is taken from the bwa tool XML that I installed from the toolshed and specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however – this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:

<requirement type="package">bedtools</requirement>

These dependencies are fed into the dependency manager (DependecyManager in lib/galaxy/tools/deps/ which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).

The default config

Galaxy provides a configuration file, config/dependency_resolvers_conf.xml to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:

  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />

These aren’t the only available resolvers – there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules – but what is shown above is the default.

A quick note: if all the resolvers fail, you’ll see a message such as:

Failed to resolve dependency on 'bedtools', ignoring

in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy’s PATH.

Tool Shed Packages resolver

The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I’ll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number except for when they are referring to type="set_environment", not type="package" requirements. Uhhhhh, ok I’ll explain set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.

Underneath it all, the tool_shed_packages resolver looks for a file called provided as part of the packaged tool. This file contains settings that are sourced (i.e. . into the job script Galaxy ends up executing.

Galaxy Packages resolver

The galaxy_packages resolver is oriented towards manually installed packages. Note that it is called twice – first with its default – version supporting – form and secondly with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that’s where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file name it will emit code to source that file. This means that you can use the script any way you want to add a tool to Galaxy’s path. For example, here is our for bedtools:


if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/

module add bedtools/bedtools-2.20.1

That uses our existing Environment Modules installation to add bedtools to the PATH.

The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link in base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version numbered directory containing a bin/ subdirectory or file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.

Supporting versionless requirements

As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you’d have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the buried within the package’s folders to that base_path/toolname/version directory. You could do that, or you could just manually add a as per the galaxy_packages schema and not install packages through the toolshed.

By the way, the set_environment type of requirement mentioned earlier is related to thing referred to when talking about tool_shed_packages is a special kind of ‘package’ that simply contains an script. They are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they’re the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.

In Conclusion

If you’re read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton‘s fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.

The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a spec of phytoplankton floating in a computing sea and I think he’s right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren’t enough hands on the job. Much better resources fields are still trying to solve this problem and its clear that Galaxy is going to try and benefit from that work.

Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our personal Galaxy installation I’m going to probably be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we’ve got going at SANBI. And hopefully I’ll get less confused by those Failed to resolve dependency messages in the future!