Making a Galaxy data manager idempotent (a hack)

In programming, the term idempotent means that you can run a function or an application more than once and get the same result each time. In this blog post I discuss a hack I used in the primer_scheme_bedfiles data manager. The data manager installs reference files used by pipelines that process ARTIC protocol amplicon sequencing results. The ARTIC / PrimalSeq protocol uses tiled PCR to amplify a genetic sample of a virus, and this approach is the backbone of most SARS-CoV-2 sequencing happening around the world today. A pre-sequencing step of the protocol uses pools of specially designed primers, and the sequence corresponding to these primers needs to be removed from the reads during the bioinformatic analysis. Primer locations are described in BED-like files that get fed into tools like ARTIC minion and ivar trim. To avoid the need for the user to supply these files each time an analysis is run, the [primer_scheme_bedfiles data manager](https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_primer_scheme_bedfiles) stores them in a shared data area accessible to all tools on the Galaxy server it runs on.
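
As a toy illustration of the difference in Python (the directory path here is arbitrary), os.mkdir fails if run twice, while os.makedirs with exist_ok=True can be run any number of times and leaves the system in the same state:

    import os

    # Not idempotent: the second call raises FileExistsError.
    os.mkdir("/tmp/demo-dir")
    # os.mkdir("/tmp/demo-dir")

    # Idempotent: repeated calls all succeed and leave the same state.
    os.makedirs("/tmp/demo-dir", exist_ok=True)
    os.makedirs("/tmp/demo-dir", exist_ok=True)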

While users can supply their own BED files to upload a new primer scheme description, most commonly users will want to download schemes from the relevant websites where they are published. This is where a challenge comes in: only one copy of each primer scheme file should be stored, but the download module does not know which schemes the Galaxy server already has installed, and offering to install the same scheme twice is an error. The solution uses a little-known feature of Galaxy tool authoring: a Galaxy tool actually has access to the state of the currently running Galaxy server (through the $__app__ variable), which can be used to look up the contents of data tables. Here is the code in question:

            ## Look up the primer_scheme_bedfiles tool data table on the
            ## running Galaxy server via the $__app__ application object.
            #set $data_table = $__app__.tool_data_tables.get("primer_scheme_bedfiles")
            #if $data_table is not None:
                ## First column of each row is the 'value' field, i.e. the name
                ## of an installed scheme; only download the ones not yet present.
                #set $known_primers = [ row[0] for row in $data_table.get_fields() ]
                #set $primer_list = ','.join([ primer_name for primer_name in $input.primers if primer_name not in known_primers ])
            #else
                ## No data table yet: install everything the user asked for.
                #set $primer_list = $input.primers
            #end if

The reference to $__app__.tool_data_tables is a ToolDataTableManager (defined here), which effectively acts like a hash keyed on the tool data table name. The primer_scheme_bedfiles tool data table is a TabularToolDataTable, which in turn provides the get_fields() method returning a two-dimensional array: the first dimension is rows, and in this particular table the first column of each row is the value field, i.e. the name of the primer scheme. The code above then computes the difference between the primer schemes requested by the user and those already installed, and only downloads the ones that aren't already present. It could have used sets instead of lists for a little extra efficiency, but the logic would remain largely the same.
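
For readers who don't speak Cheetah, here is a minimal plain-Python sketch of the same difference computation; the rows and scheme names are invented for illustration, and get_fields() is mocked as a list of lists:

    # Mocked get_fields() output: one row per installed scheme, with the
    # scheme name ('value' field) in the first column.
    installed_rows = [
        ["SARS-CoV-2-ARTICv3", "..."],
        ["SARS-CoV-2-ARTICv4", "..."],
    ]
    requested = ["SARS-CoV-2-ARTICv3", "SARS-CoV-2-ARTICv4.1"]

    known_primers = {row[0] for row in installed_rows}  # a set, for efficiency
    primer_list = ",".join(name for name in requested if name not in known_primers)
    print(primer_list)  # -> SARS-CoV-2-ARTICv4.1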

One peculiarity discovered was the processing of the list comprehension: the primer_name variable is used without a $ because the Cheetah template system wants it that way. Similarly, $input.primers becomes a string (val1,val2) when used as a value in the template, but in the context of the #set and the list comprehension it is a list (['val1', 'val2']).
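
This behaviour can be reproduced with the Cheetah3 package on its own; the following small sketch (with invented variable names) renders a template that filters a list inside a #set:

    from Cheetah.Template import Template

    # Inside the comprehension, names appear without a $; template
    # variables outside the comprehension keep their $ prefix.
    source = """\
    #set $wanted = ['a', 'b', 'c']
    #set $kept = ','.join([name for name in $wanted if name != 'b'])
    $kept"""

    print(Template(source))  # -> a,c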

Note that, as Marius van den Beek pointed out when I mentioned this technique on Twitter, using $__app__ to look inside the state of the Galaxy server is not recommended and may well break in the (near?) future. I hope that by then Galaxy will offer a different way to inspect the current state of a data table, to preserve the possibility of idempotent behaviour.

I'm largely writing this post to document how things work (it is also documented in comments in the code), in case I forget the next time I need to write a similar data manager, or in case this post helps others in the Galaxy tool developer community.

Galaxy and the notorious rfind() error: HistoryDatasetAssociation objects aren’t strings

I have repeatedly triggered this error when writing Galaxy tool wrappers:

2018-03-26 13:36:58,408 ERROR [galaxy.jobs.runners] (2) Failure preparing job
Traceback (most recent call last):
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/runners/init.py", line 170, in prepare_job
job_wrapper.prepare()
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/init.py", line 909, in prepare
self.command_line, self.extra_filenames, self.environment_variables = tool_evaluator.build()
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/tools/evaluation.py", line 445, in build
raise e
AttributeError: 'HistoryDatasetAssociation' object has no attribute 'rfind'
FAIL

The command in the tool wrapper at the time included:

#import os.path
#set report_name = os.path.splitext(os.path.basename($input_vcf))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' '$output'

In the main part of the template, $input_vcf, which is a reference to an input dataset, effectively behaves like a string, as it is substituted with the filename of the input dataset. In the #set part, however, it is a Python variable that refers to the underlying HistoryDatasetAssociation (HDA). Hence the obscure-looking error message: an HDA is indeed not a string and has no .rfind() method.
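
The failure is easy to reproduce in plain Python; the FakeHDA class below is a made-up stand-in for Galaxy's class, not its real implementation:

    import os.path

    class FakeHDA:
        """Hypothetical stand-in for a HistoryDatasetAssociation."""
        def __str__(self):
            # Galaxy substitutes the dataset's file path when the
            # object is rendered as a value in the template body.
            return "/galaxy/database/files/dataset_1.dat"

    hda = FakeHDA()

    # On Python 2 (the Galaxy of this era), os.path.basename() calls
    # p.rfind('/') on its argument, so passing the object itself raises
    # AttributeError: 'FakeHDA' object has no attribute 'rfind'.
    # (Python 3 raises a TypeError from os.fspath() instead.)
    # os.path.basename(hda)

    # Converting explicitly first yields the filename we want:
    print(os.path.basename(str(hda)))  # -> dataset_1.dat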

The error can be fixed by wrapping $input_vcf in a str() call to convert it into its string representation, i.e.
the filename I am interested in:

#import os.path
#set report_name = os.path.splitext(os.path.basename(str($input_vcf)))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' '$output'

Thanks to Marius van den Beek (@mvdbeek) for catching this for me.

A Galaxy 18.01 install

We are preparing for Galaxy Africa in a few weeks' time, which will feature some Galaxy training. In preparation for that I installed a new Ubuntu 16.04 virtual machine to host an 18.01 Galaxy server, with the aim of setting up a production Galaxy server. To that end, the server is being hosted on a 1 TB Ceph RBD partition mounted on /galaxy. A user called galaxyuser was created in our FreeIPA authentication environment, and /galaxy/galaxysrv was created to host the Galaxy files.

The first step of setup was to clone the Galaxy 18.01 release and configure it for production use. The PostgreSQL database server was installed, a database user was created for galaxyuser, and that user was then used to create the galaxy database. I then configured Galaxy to use this database and added myself (pvh@sanbi.ac.za) as an admin user.
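
One way to sanity-check this setup is to try the connection URL with SQLAlchemy, which is what Galaxy itself uses; the URL below is an assumption (peer authentication over the local unix socket) and should match the database_connection setting in config/galaxy.yml:

    from sqlalchemy import create_engine, text

    # Hypothetical URL: adjust user/host to match config/galaxy.yml.
    engine = create_engine("postgresql:///galaxy?host=/var/run/postgresql")

    with engine.connect() as conn:
        # If this prints the PostgreSQL version, Galaxy can connect too.
        print(conn.execute(text("SELECT version()")).scalar())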

The next step was to install nginx. As far as possible I tried not to alter the "out of the box" nginx configuration, to make later upgrades easier. To that end, an SSL certificate was first added using certbot and Let's Encrypt, by installing the certbot and python-certbot-nginx packages and running certbot --nginx certonly. This yielded /etc/letsencrypt/live/galaxy.sanbi.ac.za and associated files. The /etc/nginx/ssl directory was created, and a /etc/nginx/ssl/dhparam.pem file was generated with openssl dhparam -out /etc/nginx/ssl/dhparam.pem 4096, in order to create a more secure configuration than the default, as explained here.

Following the instructions from the Galaxy Receiving Files with nginx documentation and advice from Marius van den Beek, nginx-extras was installed from the recommended PPA, yielding the nginx, nginx-common and nginx-extras packages at version 1.10.3-0ubuntu0.16.04.2ppa1. Then a file /etc/nginx/conf.d/custom.conf was created with content as per this gist; this is effectively a combination of the options suggested by the Galaxy admin docs and those in /etc/letsencrypt/options-ssl-nginx.conf. The server configuration directives from the recommended Galaxy configuration were adapted and put in /etc/nginx/sites-available/galaxy; the resulting configuration is in this gist. Once added, the configuration was activated by removing the /etc/nginx/sites-enabled/default file and linking the galaxy configuration file in its place. Finally, /etc/nginx/nginx.conf was altered to change the user used to run the server to galaxyuser (i.e. "user galaxyuser"). To connect Galaxy to nginx, the socket: option in Galaxy's config/galaxy.yml and the configuration in the nginx site configuration were harmonised as per the relevant documentation. Since the unix socket was not created on startup, an http connection, and thus a TCP socket on localhost, was used instead.
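
Before blaming nginx when something fails, it is worth confirming that Galaxy is actually listening on that localhost TCP socket; here is a small diagnostic sketch (the port is an assumption and should match the http: setting in config/galaxy.yml):

    import socket

    # Hypothetical host/port: match the http: setting in config/galaxy.yml.
    host, port = "127.0.0.1", 8080

    try:
        conn = socket.create_connection((host, port), timeout=5)
        conn.close()
        print("Galaxy is listening on %s:%d" % (host, port))
    except socket.error as exc:
        print("no listener on %s:%d: %s" % (host, port, exc))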

The third step was configuring Galaxy to start using supervisord, based on the [program:web] configuration from the Galaxy starting and stopping configuration guide. And this is where things started going wrong: with this configuration, the data upload tool didn't work, because it used the system Python rather than the Python from the Galaxy virtualenv configured in /galaxy/galaxysrv/galaxy/.venv. To ensure that the Galaxy virtualenv was activated before running the upload tool, a VIRTUAL_ENV setting was added to the /etc/supervisor/conf.d/galaxy configuration, resulting in the config shown in this gist.
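
A quick way to confirm which interpreter a job actually sees is a check along these lines (a diagnostic sketch, not part of the supervisor config; the .venv path matches the one above):

    import sys

    # When the virtualenv is active, the interpreter lives under .venv;
    # the system Python lives in /usr/bin instead.
    expected = "/galaxy/galaxysrv/galaxy/.venv"

    print("interpreter: %s" % sys.executable)
    if not sys.executable.startswith(expected):
        print("WARNING: not running from the Galaxy virtualenv")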

The fourth step was to configure CVMFS to allow access to the reference data collection used on usegalaxy.org. I installed the cvmfs package by following the instructions for adding the apt repository and then running apt-get install cvmfs. The correct configuration was learned from Björn Grüning (@bgruening)'s bgruening/galaxy-stable Docker container, with some help from @scholtalbers on Gitter:

a. In /etc/cvmfs/domain.d/galaxyproject.org.conf put the line as per this gist.

b. In /etc/cvmfs/default.local put:

CVMFS_REPOSITORIES="data.galaxyproject.org"
CVMFS_HTTP_PROXY="DIRECT"
CVMFS_QUOTA_LIMIT="4000"
CVMFS_USE_GEOAPI="yes"

c. In /etc/cvmfs/keys/data.galaxyproject.org.pub put:

-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA5LHQuKWzcX5iBbCGsXGt
6CRi9+a9cKZG4UlX/lJukEJ+3dSxVDWJs88PSdLk+E25494oU56hB8YeVq+W8AQE
3LWx2K2ruRjEAI2o8sRgs/IbafjZ7cBuERzqj3Tn5qUIBFoKUMWMSIiWTQe2Sfnj
GzfDoswr5TTk7aH/FIXUjLnLGGCOzPtUC244IhHARzu86bWYxQJUw0/kZl5wVGcH
maSgr39h1xPst0Vx1keJ95AH0wqxPbCcyBGtF1L6HQlLidmoIDqcCQpLsGJJEoOs
NVNhhcb66OJHah5ppI1N3cZehdaKyr1XcF9eedwLFTvuiwTn6qMmttT/tHX7rcxT
owIDAQAB
-----END PUBLIC KEY-----

d. Add the line /cvmfs /etc/auto.cvmfs to /etc/auto.master, yielding a file like this gist.

e. A /cvmfs directory was created to be a mount point (mkdir /cvmfs).

f. The autofs service was restarted (systemctl restart autofs), after which ls /cvmfs/data.galaxyproject.org/byhand showed a pretty collection of reference data.

g. The config/galaxy.yml file was updated so that the tool_data_table_config_path key contains references to the files stored in CVMFS. The final value of this key was:

tool_data_table_config_path: /cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml,/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml,config/tool_data_table_conf.xml

This might not fit on your screen, so see the config fragment here. After the update to the Galaxy config, all services were restarted (with sudo supervisorctl restart all).
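
As a sanity check that the CVMFS-hosted files are reachable and well formed, a short script along these lines (hypothetical, assuming the mounts above) can list how many tables each file defines; merely opening the files triggers the autofs mount:

    import xml.etree.ElementTree as ET

    # The same paths as in tool_data_table_config_path above.
    configs = [
        "/cvmfs/data.galaxyproject.org/byhand/location/tool_data_table_conf.xml",
        "/cvmfs/data.galaxyproject.org/managed/location/tool_data_table_conf.xml",
        "config/tool_data_table_conf.xml",
    ]

    for path in configs:
        root = ET.parse(path).getroot()  # root element is <tables>
        names = [table.get("name") for table in root.findall("table")]
        print("%s: %d tables" % (path, len(names)))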

My fifth and final step was to test the Galaxy server by installing bowtie2 from the Galaxy Tool Shed and working through the first steps of the mapping tutorial. Both human (hg19) and fruitfly (dm3) reference genomes were downloaded via CVMFS (and apparently cached in /var/lib/cvmfs/shared), and the bowtie2 mapping was run against them successfully, yielding the results expected from the tutorial.

Future work? There is lots: I have to connect the server to a cluster using Slurm, enable Interactive Environments, and more. I'll blog about that when I get there.