There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML, which specifies a mapping between the Galaxy web user interface and the tool command line, and the tool dependencies, which specify how to source the actual packages that implement the tool's commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI's local Galaxy installation.
Requirements and resolvers
In the tool XML, dependencies are specified using <requirement> clauses, for example:
<requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
This is taken from the bwa tool XML that I installed from the toolshed, and it specifies a particular version of the bwa short read aligner. Not all requirements have a version attached, however. This one is from the BAM to BigWig converter, one of the datatype converters that come with Galaxy (in the lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:
<requirement type="package">bedtools</requirement>
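To make the difference between the two concrete, here is a small sketch of pulling the name, type and optional version out of <requirement> clauses. It uses Python's standard xml.etree module; the wrapping <requirements> element and the parse_requirements helper are my own illustration, not Galaxy's actual parsing code.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment combining the two examples above (the wrapper
# element is an assumption made for this sketch).
TOOL_XML = """
<requirements>
    <requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
    <requirement type="package">bedtools</requirement>
</requirements>
"""


def parse_requirements(xml_text):
    """Return (name, type, version) tuples; version is None when absent."""
    root = ET.fromstring(xml_text)
    return [
        (elem.text.strip(), elem.get("type"), elem.get("version"))
        for elem in root.iter("requirement")
    ]


for name, req_type, version in parse_requirements(TOOL_XML):
    print(name, req_type, version)
```

The versionless bedtools requirement comes back with a version of None, and that missing version is what the rest of this post is about.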
These dependencies are fed into the dependency manager (DependencyManager in lib/galaxy/tools/deps/__init__.py), which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the prepare method of JobWrapper).
The default config
Galaxy provides a configuration file, config/dependency_resolvers_conf.xml, to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:
<dependency_resolvers>
    <tool_shed_packages />
    <galaxy_packages />
    <galaxy_packages versionless="true" />
</dependency_resolvers>
These aren't the only available resolvers: there is a homebrew resolver and one (with big warning stickers) that uses Environment Modules. What is shown above is just the default.
A quick note: if all the resolvers fail, you'll see a message such as:
Failed to resolve dependency on 'bedtools', ignoring
in your Galaxy logs (paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy's PATH.
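The way the resolvers are tried in order, with that log message as the fall-through, can be sketched roughly like this. This is my own simplification, not Galaxy's code: the resolve function, the three stand-in resolver functions and the /galaxy/deps path are all hypothetical, and each resolver is modelled as a plain function returning shell commands or None.

```python
def resolve(resolvers, name, version=None):
    """Try each resolver in order; return the first shell commands found."""
    for resolver in resolvers:
        commands = resolver(name, version)
        if commands is not None:
            return commands
    # Nothing matched: mimic the message that shows up in the Galaxy logs.
    print("Failed to resolve dependency on '%s', ignoring" % name)
    return None


# Hypothetical stand-ins for the three entries in the default config.
def tool_shed_packages(name, version):
    return None  # pretend nothing was installed from the toolshed


def galaxy_packages(name, version):
    return None  # pretend there is no matching versioned directory


def galaxy_packages_versionless(name, version):
    # Pretend only bedtools has a default link; the path is made up.
    if name == "bedtools":
        return ". /galaxy/deps/bedtools/default/env.sh"
    return None


DEFAULT_RESOLVERS = [
    tool_shed_packages,
    galaxy_packages,
    galaxy_packages_versionless,
]
```

With these stand-ins, resolve(DEFAULT_RESOLVERS, "bedtools") falls through the first two resolvers and is answered by the third, while a request for bwa hits the failure message.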
Tool Shed Packages resolver
The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as tool_dependency_dir in config/galaxy.ini. I'll refer to this location as base_path. Toolshed packages are installed as base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision, so for example BLAST+ on our system is base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The tool_shed_packages resolver cannot handle <requirement> clauses without a version number, with one exception: requirements of type="set_environment" (as opposed to type="package"). Uhhhhh, ok, I'll explain set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.
Underneath it all, the tool_shed_packages resolver looks for a file called env.sh provided as part of the packaged tool. This file contains settings that are sourced (i.e. with . env.sh) into the job script Galaxy ends up executing.
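As a rough sketch of that lookup: build the package directory from its five components and, if an env.sh is there, emit a command to source it. The function names are hypothetical, and I'm assuming env.sh sits directly inside the changeset_revision directory; this is not Galaxy's actual implementation.

```python
import os


def tool_shed_package_dir(base_path, name, version, owner, repo_name, changeset):
    """base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision"""
    return os.path.join(base_path, name, version, owner, repo_name, changeset)


def tool_shed_commands(base_path, name, version, owner, repo_name, changeset):
    """Return a shell command sourcing the package's env.sh, or None."""
    if version is None:
        # Versionless type="package" requirements are not handled here.
        return None
    package_dir = tool_shed_package_dir(
        base_path, name, version, owner, repo_name, changeset
    )
    # Assumption: env.sh lives at the top of the changeset directory.
    env_script = os.path.join(package_dir, "env.sh")
    if os.path.isfile(env_script):
        return ". " + env_script
    return None
```

For the BLAST+ example above, the arguments would be "blast+", "2.2.31", "iuc", "package_blast_plus_2_2_31" and "e36f75574aec".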
Galaxy Packages resolver
The galaxy_packages resolver is oriented towards manually installed packages. Note that it appears twice in the default config: first in its default form, which supports versions, and then with versionless="true". The software found by the galaxy_packages resolver is installed under the same base_path as for the tool_shed_packages resolver, but that's where the similarity ends. This resolver looks under base_path/toolname/version by default, so for example base_path/bedtools/2.22. If it finds a bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file named env.sh it will emit code to source that file. This means that you can use the env.sh script any way you want to add a tool to Galaxy's path. For example, here is our env.sh for bedtools:
#!/bin/sh
if [ -z "$MODULEPATH" ] ; then
    . /etc/profile.d/module.sh
fi
module add bedtools/bedtools-2.20.1
That uses our existing Environment Modules installation to add bedtools to the PATH.
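The env.sh versus bin/ decision can be sketched as below. The galaxy_package_commands name and the exact shell text it returns are my own illustration, and I'm assuming env.sh takes precedence when both exist, which is how I read the behaviour described above.

```python
import os


def galaxy_package_commands(base_path, name, version):
    """Return shell commands for base_path/name/version, or None."""
    package_dir = os.path.join(base_path, name, version)
    env_script = os.path.join(package_dir, "env.sh")
    bin_dir = os.path.join(package_dir, "bin")
    # Assumption: an env.sh wins over a bin/ directory if both are present.
    if os.path.isfile(env_script):
        return ". " + env_script
    if os.path.isdir(bin_dir):
        return 'PATH="%s:$PATH"; export PATH' % bin_dir
    return None
```

So for bedtools 2.22, a directory containing only bin/ would produce a PATH update, while dropping in an env.sh like ours switches it to sourcing that script instead.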
The versionless="true" incarnation of the galaxy_packages resolver works similarly, except that it looks for a symbolic link at base_path/toolname/default, e.g. base_path/bedtools/default. This needs to point to a version-numbered directory containing a bin/ subdirectory or env.sh file as above. It is this support for versionless="true" that allows for the resolution of <requirement> specifications with no version.
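A minimal sketch of the versionless lookup, assuming the default entry really must be a symbolic link: follow the link and return the directory it points to. The versionless_package_dir helper is hypothetical, not a Galaxy function.

```python
import os


def versionless_package_dir(base_path, name):
    """Follow base_path/name/default and return the real versioned directory."""
    default_link = os.path.join(base_path, name, "default")
    if not os.path.islink(default_link):
        return None  # a plain directory called "default" does not count
    target = os.path.realpath(default_link)
    return target if os.path.isdir(target) else None
```

The directory that comes back would then be checked for env.sh or bin/ in the same way as a versioned requirement.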
Supporting versionless requirements
As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a default link that is resolved with the galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you'd have to make a symlink from base_path/toolname/default to base_path/toolname/version and symlink the env.sh buried within the package's folders into that base_path/toolname/version directory. You could do that, or you could just manually add an env.sh as per the galaxy_packages scheme and not install packages through the toolshed.
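If you did want to go the symlink route, the two links could be created with something like the following. This is an untested-on-a-real-Galaxy sketch: expose_toolshed_package is a hypothetical helper, and it assumes the toolshed package keeps its env.sh at the top of the changeset_revision directory.

```python
import os


def expose_toolshed_package(base_path, name, version, owner, repo_name, changeset):
    """Link a toolshed package's env.sh so galaxy_packages can find it."""
    version_dir = os.path.join(base_path, name, version)
    # Assumption: env.sh sits directly in the changeset_revision directory.
    packaged_env = os.path.join(version_dir, owner, repo_name, changeset, "env.sh")
    if not os.path.isfile(packaged_env):
        raise FileNotFoundError(packaged_env)

    # base_path/toolname/version/env.sh -> the env.sh buried in the package
    env_link = os.path.join(version_dir, "env.sh")
    if not os.path.lexists(env_link):
        os.symlink(packaged_env, env_link)

    # base_path/toolname/default -> version (a relative link)
    default_link = os.path.join(base_path, name, "default")
    if not os.path.lexists(default_link):
        os.symlink(version, default_link)
    return default_link
```

Existing links are left alone, so running it twice is harmless, but it also means it won't repoint default if you later install a newer version.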
By the way, the set_environment type of requirement mentioned earlier when talking about tool_shed_packages refers to a special kind of 'package' that simply contains an env.sh script. These are expected to be installed as base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they're the only thing checked for by tool_shed_packages when trying to resolve a versionless requirement.
In Conclusion
If you've read this far, congratulations. I hope this information is useful, though of course everything could change with a wriggle of John Chilton's fingers (wink). I believe part of the reason that Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.
The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a speck of phytoplankton floating in a computing sea and I think he's right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren't enough hands on the job. Much better resourced fields are still trying to solve this problem and it's clear that Galaxy is going to try and benefit from that work.
Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our personal Galaxy installation I'm probably going to be doing a bit of manual maintenance and using the galaxy_packages scheme more than packages from the toolshed. I might try and fix up the modules resolver, at least to make it work with what we've got going at SANBI. And hopefully I'll get less confused by those Failed to resolve dependency messages in the future!