There are two parts to building a link between Galaxy and command line bioinformatics tools: the tool XML, which specifies a mapping between the Galaxy web user interface and the tool command line, and the tool dependencies, which specify how to source the actual packages that implement the tool's commands. I spent a bit of time today digging into how that second set of components operates as part of my work on SANBI's local Galaxy installation.
Requirements and resolvers
In the tool XML, dependencies are specified using
<requirement> clauses, for example:
<requirement type="package" version="0.7.10.039ea20639">bwa</requirement>
This is taken from the
bwa tool XML that I installed from the toolshed and specifies a particular version of the
bwa short read aligner. Not all requirements have a version attached, however - this is from the BAM to BigWig converter, one of the datatype converters that comes with Galaxy (in the
lib/galaxy/datatypes/converters directory) and is crucial to the operation of Trackster, the in-Galaxy genome browser:
These dependencies are fed into the dependency manager (
lib/galaxy/tools/deps/__init__.py) which uses various dependency resolvers to generate shell commands that make the dependency available at runtime. These shell commands are passed on to the Galaxy job that actually executes the tool (see the
prepare method of
The default config
Galaxy provides a configuration file,
config/dependency_resolvers_conf.xml to configure how the dependency resolvers are used. There is no sample provided but this pull request shows that the default is:
<dependency_resolvers>
  <tool_shed_packages />
  <galaxy_packages />
  <galaxy_packages versionless="true" />
</dependency_resolvers>
These aren't the only available resolvers - there is a
homebrew resolver and one (with big warning stickers) that uses Environment Modules - but what is shown above is the default.
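To make the idea of a resolver list concrete, here is a minimal sketch of how I picture it: each resolver is tried in the configured order and the first one that produces shell commands wins. This is my own illustration and not Galaxy's code; the names resolve, never_resolves and knows_bwa are hypothetical, the /deps path is made up, and the first-match-wins behaviour is an assumption based on the ordering of the config file.

```python
from typing import Callable, Optional, Sequence

# A resolver takes (name, version) and returns shell commands, or None on failure.
Resolver = Callable[[str, Optional[str]], Optional[str]]


def resolve(name: str, version: Optional[str],
            resolvers: Sequence[Resolver]) -> Optional[str]:
    """Try each resolver in configured order; return the first hit (assumed)."""
    for resolver in resolvers:
        commands = resolver(name, version)
        if commands is not None:
            return commands
    # Same wording as the log message Galaxy writes when nothing matches.
    print(f"Failed to resolve dependency on '{name}', ignoring")
    return None


# Hypothetical stand-in resolvers, for illustration only.
def never_resolves(name: str, version: Optional[str]) -> Optional[str]:
    return None


def knows_bwa(name: str, version: Optional[str]) -> Optional[str]:
    # "/deps" is a made-up dependency directory.
    return ". /deps/bwa/env.sh" if name == "bwa" else None
```

Calling resolve("bwa", None, [never_resolves, knows_bwa]) falls through the first resolver and returns the command from the second, while an unknown name falls through both and returns None.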
A quick note: if all the resolvers fail, you'll see a message such as:
Failed to resolve dependency on 'bedtools', ignoring
in your Galaxy logs (
paster.log) and, quite likely, the job depending on that requirement will fail, since the required tool is not in Galaxy's
Tool Shed Packages resolver
The tool_shed_packages resolver is designed to find packages installed from the Galaxy Toolshed. These are installed in the location specified as
config/galaxy.ini. I'll refer to this location as
base_path. Toolshed packages are installed as
base_path/toolname/version/toolshed_owner/toolshed_package_name/changeset_revision so for example BLAST+ on our system is
base_path/blast+/2.2.31/iuc/package_blast_plus_2_2_31/e36f75574aec at the moment. The
tool_shed_packages resolver cannot handle
<requirement> clauses without a version number, except when they refer to type="set_environment" rather than
type="package" requirements. I'll explain
set_environment a bit later. In general, if you want to resolve your dependency through the toolshed, you need to specify a version number.
Underneath it all, the
tool_shed_packages resolver looks for a file called
env.sh provided as part of the packaged tool. This file contains settings that are sourced (i.e.
. env.sh) into the job script Galaxy ends up executing.
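The directory layout and the sourcing step can be sketched in a few lines. This is only an illustration of the layout described above: the function names are hypothetical, "/deps" stands in for base_path, and I am assuming that env.sh sits directly inside the changeset_revision directory. The real command Galaxy emits may well carry more than a bare source line.

```python
import posixpath
import shlex


def tool_shed_env_sh(base_path, name, version, owner, repo, revision):
    """Build the env.sh path following the toolshed layout described above.

    Assumption: env.sh lives at the top of the changeset_revision directory.
    """
    return posixpath.join(base_path, name, version, owner, repo, revision, "env.sh")


def source_command(env_sh):
    """Return a shell line that sources env.sh (the '. env.sh' form)."""
    return ". " + shlex.quote(env_sh)


# Example using the BLAST+ values from this post; "/deps" is a made-up base_path.
blast_env = tool_shed_env_sh(
    "/deps", "blast+", "2.2.31", "iuc", "package_blast_plus_2_2_31", "e36f75574aec"
)
```

Passing blast_env to source_command gives the line that would be placed in the job script under these assumptions.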
Galaxy Packages resolver
The galaxy_packages resolver is oriented towards manually installed packages. Note that it appears twice in the default config: first in its default, version-aware form, and then with
versionless="true". The software found by the
galaxy_packages resolver is installed under the same
base_path as for the
tool_shed_packages resolver, but that's where the similarity ends. This resolver looks under
base_path/toolname/version by default, so for example
base_path/bedtools/2.22. If it finds a
bin/ directory in the specified path it will add that to the path Galaxy uses to find binaries. If, however, it finds a file named
env.sh it will emit code to source that file. This means that you can use the
env.sh script any way you want to add a tool to Galaxy's path. For example, here is our
env.sh for bedtools:
#!/bin/sh
if [ -z "$MODULEPATH" ] ; then
  . /etc/profile.d/module.sh
fi
module add bedtools/bedtools-2.20.1
That uses our existing Environment Modules installation to add
bedtools to the PATH.
The versionless="true" incarnation of the
galaxy_packages resolver works similarly, except that it looks for a symbolic link in
base_path/bedtools/default. This needs to point to a version numbered directory containing a
bin/ subdirectory or
env.sh file as above. It is this support for
versionless="true" that allows for the resolution of
<requirement> specifications with no version.
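Here is a rough sketch of the lookup described above, written as one function for brevity. It is not Galaxy's implementation: galaxy_packages_resolve is a hypothetical name, passing no version stands in for the versionless="true" form, checking env.sh before bin/ is my assumption about precedence, and the PATH line is my own illustration of "adding bin/ to the path".

```python
import os
import shlex
from typing import Optional


def galaxy_packages_resolve(base_path: str, name: str,
                            version: Optional[str] = None) -> Optional[str]:
    """Return shell commands for a manually installed package, or None."""
    if version is None:
        # Versionless form: only a 'default' symlink is considered.
        path = os.path.join(base_path, name, "default")
        if not os.path.islink(path):
            return None
    else:
        path = os.path.join(base_path, name, version)

    env_sh = os.path.join(path, "env.sh")
    if os.path.isfile(env_sh):
        # Assumed precedence: env.sh wins over bin/ when both exist.
        return ". " + shlex.quote(env_sh)

    bin_dir = os.path.join(path, "bin")
    if os.path.isdir(bin_dir):
        # Illustrative only; the exact command Galaxy emits may differ.
        return "PATH=" + shlex.quote(bin_dir) + ":$PATH; export PATH"

    return None
```

With base_path/bedtools/2.22/bin in place and base_path/bedtools/default linked to 2.22, both the versioned and the versionless call return a PATH line; dropping an env.sh into the 2.22 directory switches both to a source line instead.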
Supporting versionless requirements
As might not be obvious from the discussion thus far, given the default Galaxy setup, the only way versionless requirements can be satisfied is with a manually installed package with a
default link that is resolved with the
galaxy_packages resolver. So even if you have the relevant requirement installed via a toolshed package, it will not be used to resolve a versionless requirement: you'd have to create a default symlink pointing to
base_path/toolname/version and symlink the
env.sh buried within the package's folders to that
base_path/toolname/version directory. You could do that, or you could just manually add an
env.sh as per the
galaxy_packages schema and not install packages through the toolshed.
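For the manual route, creating the default link could be scripted roughly as follows. This is a small sketch under my own assumptions: link_default is a hypothetical helper, it expects the version directory (with its bin/ or env.sh) to exist already, and it uses a relative link so that the dependency directory can be moved without breaking it.

```python
import os


def link_default(base_path, tool, version):
    """Point base_path/tool/default at an existing version directory."""
    tool_dir = os.path.join(base_path, tool)
    if not os.path.isdir(os.path.join(tool_dir, version)):
        raise FileNotFoundError(f"{tool} {version} is not installed under {base_path}")

    link = os.path.join(tool_dir, "default")
    if os.path.islink(link):
        os.remove(link)  # replace a stale link from an older version
    os.symlink(version, link)  # relative target, resolved inside tool_dir
    return link
```

Running link_default(base_path, "bedtools", "2.22") would create base_path/bedtools/default, which is what the versionless resolver looks for.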
By the way, the
set_environment type of requirement mentioned earlier, when talking about
tool_shed_packages, is a special kind of 'package' that simply contains an
env.sh script. They are expected to be installed as
base_path/environment_settings/toolname/toolshed_owner/toolshed_package_name/changeset_revision and they're the only thing checked for by
tool_shed_packages when trying to resolve a versionless requirement.
If you've read this far, congratulations. I hope this information is useful; after all, everything could change with a wriggle of John Chilton's fingers (wink). Part of the reason that I believe Galaxy dependency resolution is so complicated is that the Galaxy community has been trying to solve the bioinformatics package management problem, something that has bedevilled the community ever since I started working in the field in the 90s.
The problem is this: bioinformatics is a small field of computing. A colleague of mine compared us to a speck of phytoplankton floating in a computing sea and I think he's right. Scientific software is often poorly engineered and most often not well packaged, and keeping up with the myriad ways of installing the software we use, keeping it up to date and having it all work together is dizzyingly hard and time consuming. And there aren't enough hands on the job. Much better resourced fields are still trying to solve this problem and it's clear that Galaxy is going to try and benefit from that work.
Some of the by-default-unused dependency resolvers try and leverage old (Environment Modules) and new (Homebrew) solutions for dependency management, and with luck and effort (probably more of the latter) things will get better in the future. For our own Galaxy installation I'm probably going to be doing a bit of manual maintenance and using the
galaxy_packages scheme more than packages from the toolshed. I might try and fix up the
modules resolver, at least to make it work with what we've got going at SANBI. And hopefully I'll get less confused by those
Failed to resolve dependency messages in the future!