For the Galaxy IUC Tools and Collections codefest we (the SANBI software developers) decided to take on what we thought would be a simple job: make the
bamtools_split tool output a dataset collection instead of multiple datasets. Here is the outputs section of the old (multiple datasets) version of the tool:
<outputs>
    <data format="txt" name="report" label="BAMSplitter Run" hidden="true">
        <discover_datasets pattern="split_bam\.(?P<designation>.+)\.bam" ext="bam" visible="true"/>
    </data>
</outputs>
and this needed to change to:
<outputs>
    <collection name="report" type="list" label="BAMSplitter Run">
        <discover_datasets pattern="split_bam\.(?P<designation>.+)\.bam" ext="bam"/>
    </collection>
</outputs>
In other words, the <data> element simply becomes a <collection> element and the <discover_datasets> element remains essentially the same. We made this change and everything ran fine, except that the output collection was empty. Why?
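To make the role of the discover_datasets pattern concrete, here is a minimal Python sketch of how a pattern like this picks out files and names the collection elements. It assumes the pattern is applied to each file name with re.match; the file listing and the designations() helper are hypothetical and are not part of Galaxy.

```python
import re

# Pattern copied from the tool's discover_datasets element.
PATTERN = re.compile(r"split_bam\.(?P<designation>.+)\.bam")


def designations(filenames):
    """Map each designation captured by the pattern to the file it came from."""
    found = {}
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            # The named group becomes the identifier of the collection element.
            found[match.group("designation")] = name
    return found


# Hypothetical listing of a job working directory after bamtools split has run.
files = ["split_bam.MAPPED.bam", "split_bam.UNMAPPED.bam", "input.bam", "report.txt"]
result = designations(files)
print(result)
```

Files that do not match the pattern (input.bam and report.txt here) are simply ignored, which is why the number of collection members is not known until the job has finished.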
Lots of debugging followed, based on a fresh checkout of the Galaxy codebase. We discovered that the crucial function here is collect_dynamic_collections() in the galaxy.tools.parameters.output_collect module. This is called by the finish() method of the Job class, via the Tool class’ method of the same name.
The collect_dynamic_collections function identifies output collections in a tool’s definition and then uses a collection builder to map job output files to a dataset collection type. The collection builder is a factory class defined in galaxy.dataset_collections.builder and each dataset collection type (defined in galaxy.dataset_collections.builder.types) has its own way of moving output elements into the members of a collection type.
Anyway, we traced this code all the way through to the point where it was obvious the dataset collection was being created successfully. We then turned to the other Galaxy devs (John Chilton specifically) to ask for help, only to discover that the problem was gone. The dataset collection was somehow populated! It turns out that if your Galaxy tool creates an output dataset collection that has an uncertain number of members (like a list collection) then it is populated asynchronously and you need to refresh the history to see its members. This is a known bug.
So that’s been quite a learning curve. The final tool is on GitHub. The collection tag for outputs was introduced above. We haven’t explored its pair mode, but check out Peter Briggs’ trimmomatic tool, which has an option to output as a pair type dataset collection.
In the test section of the tool configuration, you can use a dataset collection like this:
<test>
    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
        <element name="MAPPED" file="bamtools-split-MAPPED1.bam" />
        <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam" />
    </output_collection>
</test>
The output_collection tag essentially groups outputs together, with each element tag taking the place of an individual output tag. Each element tag has a name that maps to one of the names identified by the discover_datasets pattern (perhaps index numbers can be used instead of names, I don’t know) and can use the test attributes that an output tag can use.
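As a rough illustration of that grouping, the following Python sketch reads a test block with the standard library's ElementTree and lists which element names, and which expected files, belong to each output collection. This is not how Galaxy's test framework parses tests; expected_elements() is a hypothetical helper written only to show the structure.

```python
import xml.etree.ElementTree as ET

# The test block from the tool configuration, reproduced as a string.
TEST_XML = """
<test>
    <param name="input_bam" ftype="bam" value="bamtools-input1.bam"/>
    <param name="analysis_type_selector" value="-mapped"/>
    <output_collection name="report">
        <element name="MAPPED" file="bamtools-split-MAPPED1.bam"/>
        <element name="UNMAPPED" file="bamtools-split-UNMAPPED1.bam"/>
    </output_collection>
</test>
"""


def expected_elements(test_xml):
    """Return {collection name: {element name: expected file}} for a test block."""
    root = ET.fromstring(test_xml)
    expected = {}
    for collection in root.findall("output_collection"):
        expected[collection.get("name")] = {
            element.get("name"): element.get("file")
            for element in collection.findall("element")
        }
    return expected


expected = expected_elements(TEST_XML)
print(expected)
```

The element names (MAPPED and UNMAPPED) are the same designations that the discover_datasets pattern captures from the split output file names, so a test only passes if both sides agree on them.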
With the tests updated and some suitable sample data in place the tests pass and the tool is ready for a pull request. There was some discussion though on the semantics of this tool… for more go and read the comments on the PR.