Making a Galaxy data manager idempotent (a hack)

In programming, the term idempotent means that you can run a function or an application more than once and get the same result each time. In this blog post I will discuss a hack I used in the primer_scheme_bedfiles data manager. The data manager installs reference files used by pipelines that process ARTIC protocol amplicon sequencing results.

The ARTIC / PrimalSeq protocol uses tiled PCR to amplify a genetic sample of a virus, and this approach is the backbone of most SARS-CoV-2 sequencing happening around the world today. A pre-sequencing step of the protocol uses pools of specially designed primers, and the sequence corresponding to these primers needs to be removed from the reads during the bioinformatic analysis. Primer locations are described in BED-like files that get fed into tools like ARTIC minion and ivar trim. To avoid the need for the user to supply these files each time an analysis is run, the [primer_scheme_bedfiles data manager](https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_primer_scheme_bedfiles) stores them in a shared data area accessible to all tools on the Galaxy server it runs on.

While users can supply their own BED files to upload a new primer scheme description, most of the time they will want to download schemes from the websites where they are published. This is where a challenge comes in: only one copy of each primer scheme file should be stored, but the download module does not know which schemes the Galaxy server already has installed, and installing the same scheme a second time would be an error. The solution is to use a little-known feature of Galaxy tool authoring: a Galaxy tool has access to the state of the currently running Galaxy server (through the $__app__ variable), and this can be used to look up the contents of data tables. Here is the code in question:

            #set $data_table = $__app__.tool_data_tables.get("primer_scheme_bedfiles")
            #if $data_table is not None:
                #set $known_primers = [ row[0] for row in $data_table.get_fields() ]
                #set $primer_list = ','.join([ primer_name for primer_name in $input.primers if primer_name not in known_primers ])
            #else
                #set $primer_list = $input.primers
            #end if

$__app__.tool_data_tables refers to a ToolDataTableManager (a class defined in the Galaxy codebase), which effectively acts like a hash keyed on the tool data table name. The primer_scheme_bedfiles tool data table is a TabularToolDataTable, which in turn provides the get_fields() method that returns a two-dimensional array. The first dimension is the rows, and in this particular table the first column of each row is the value field, i.e. the name of the primer scheme. The code above builds the list of installed scheme names, filters the schemes requested by the user against it, and only downloads those that aren't already present. It could have used sets instead of lists for a little extra efficiency, but the logic would largely remain the same.
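The same filtering logic can be sketched in plain Python, outside Galaxy. Everything here is a stand-in: FakeDataTable is a hypothetical class that only mimics get_fields(), a plain dict plays the role of the ToolDataTableManager, and the scheme names and extra columns are made up for illustration. It uses a set for the installed names, as mentioned above.

```python
# Hypothetical stand-in for a TabularToolDataTable: only get_fields() is mimicked.
class FakeDataTable:
    def __init__(self, rows):
        self._rows = rows

    def get_fields(self):
        # A list of lists: one inner list per table row.
        return [list(row) for row in self._rows]


def primers_to_install(data_tables, requested):
    """Return the requested scheme names that are not yet in the data table."""
    data_table = data_tables.get("primer_scheme_bedfiles")
    if data_table is None:
        # No table available: nothing is known to be installed, so keep everything.
        return list(requested)
    # Column 0 is the value field, i.e. the primer scheme name.
    known = {row[0] for row in data_table.get_fields()}
    return [name for name in requested if name not in known]


# A plain dict stands in for the ToolDataTableManager (an assumption for this sketch).
tables = {
    "primer_scheme_bedfiles": FakeDataTable(
        [["SCHEME_A", "Scheme A (made-up entry)", "/made/up/path/a.bed"]]
    )
}
to_install = primers_to_install(tables, ["SCHEME_A", "SCHEME_B"])
primer_list = ",".join(to_install)
print(primer_list)
```

Running the function a second time after SCHEME_B has been added to the table would return an empty list, which is the idempotent behaviour the data manager is after.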

One peculiarity I discovered is in how the list comprehension is processed: the primer_name variable is written without a $ because the Cheetah template system wants it that way. Also, $input.primers becomes a string (val1,val2) when used as a value in the template, but in the context of the #set and the list comprehension it is a list (['val1', 'val2']).
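To make the string-versus-list behaviour concrete, here is a small Python sketch of an object that acts this way. MultiSelectValue is a hypothetical class written for this post; it is not Galaxy's actual parameter wrapper, and I am not claiming Galaxy implements the behaviour like this.

```python
# Hypothetical illustration only: not Galaxy's real parameter wrapper class.
class MultiSelectValue:
    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Iterating (e.g. in a list comprehension) yields the individual values.
        return iter(self._values)

    def __str__(self):
        # Rendering as text joins the values with commas.
        return ",".join(self._values)


primers = MultiSelectValue(["val1", "val2"])
as_text = str(primers)                # what substitution into template text would show
as_list = [name for name in primers]  # what a list comprehension sees
print(as_text)
print(as_list)
```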

Note that, as Marius van den Beek pointed out when I mentioned this technique on Twitter, using $__app__ to look inside the state of the Galaxy server is not recommended and may well break in the (near?) future. I hope that by then Galaxy offers a different way to see the current state of a data table, so that idempotent behaviour remains possible.

I'm largely writing this post to document how things work (this is also documented in the comments in the code), in case I forget by the next time I need to write a similar data manager, or in case this post helps others in the Galaxy tool developer community.