OrthoMCL and BLAST: Adventures in the (SANBI) Galaxy

BLAST in Galaxy

Part of my work for the week was to start using Galaxy more extensively at SANBI, i.e. to make it more usable. Last week I wrote an authentication plugin to allow a Galaxy server to authenticate using PAM. This got accepted into the 15.07 release of Galaxy, so I updated our Galaxy server to that release. I had neglected to include the example auth_conf.xml in the code I committed, but working off the example on my laptop I got PAM authentication working as a replacement for the previous HTTP authentication (which also spoke to PAM on the backend). I also took the opportunity to switch our server to HTTPS, using the SANBI wildcard certificate.

My first attempt at a practical use for our server came when I needed to run the BLAST step of the OrthoMCL pipeline. OrthoMCL uses an all-against-all BLAST as its input dataset, and based on the data I had from our colleagues, I had a collection of about 300,000 proteins to BLAST against each other. I started this off as an array job at CHPC but thought I could try to work locally as well, as a proof of concept. (Actually there was a previous step, filtering out poor proteins, but I’ll get to that below.) My first attempt at using BLAST hit a bug: “NotFound: cannot find ‘files_path’ while searching for ‘db_opts.histdb.files_path’”. This exception was thrown from __build_command_line in Galaxy’s lib/galaxy/tools/evaluation.py because the BLAST wrappers use an attribute called files_path instead of extra_files_path. Peter Cock and John Chilton discuss the problem in this GitHub issue and Peter quickly committed a workaround to the BLAST tools.

Having fixed that, and having prepared the protein set (outside Galaxy), I decided to take a chance on the Galaxy “parallelisation” code. This is enabled through appropriate tags in the tool XML, and in the case of the blastp wrapper splits the query dataset into chunks of 1000 sequences each before submitting jobs (in Galaxy terms, actually tasks, not fully fledged jobs) to the cluster. Unfortunately these are individual jobs, not an array job, because array jobs are only implemented in the still-only-on-the-horizon DRMAA version 2. In any event, our cluster can handle thousands of job submissions so I hit go, saw the history item turn from grey to yellow, and waited. Unfortunately, after a day or so it went red (failed), but by then I was too busy with other stuff to debug it. To be continued…
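
To make the splitting step concrete, here is a minimal Python sketch of chunking a FASTA query set into groups of 1000 records. This is not Galaxy's own code; the function names read_fasta_records and chunk_records are made up for illustration, and the only detail taken from the text above is the chunk size of 1000.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], chunk_size: int = 1000) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size (hypothetical helper)."""
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, chunk
        yield chunk
```

With chunks of 1000 sequences, a query set of about 300,000 proteins would turn into roughly 300 separate tasks, each of which is submitted to the cluster on its own.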

(As an aside, the BLAST wrapper wraps BLAST+, whereas OrthoMCL uses legacy BLAST. I still need to check that the BLAST wrapper exposes enough flags in order to guarantee equivalence. A useful guide for some of the corresponding flags can be found on this page about ortholog finding).

OrthoMCL in and out of Galaxy

As mentioned previously, I was running BLAST as part of the OrthoMCL pipeline. OrthoMCL uses BLAST, MCL and a database (in the version we use, SQLite3) to compute the orthologs in a set of proteins. The pipeline has two steps before the BLAST stage (orthomclAdjustFasta and orthomclFilterFasta), five between the BLAST and MCL stages and a final step to process the MCL output. Currently I use a Makefile to execute the pipeline but at the GCC2015 Hackathon AJ started work on some wrappers for the steps in the pipeline. There has been previous work on executing OrthoMCL within Galaxy but that ran the entire workflow as a single tool. We want to implement the pipeline as a Galaxy workflow because that way we can (in theory at least) benefit from improvements in how BLAST is executed (e.g. parallelism) or even replace the BLAST step with a similar (but apparently faster) tool such as Diamond. The OrthoMCL pipeline is pretty linear so even given the limited capabilities of workflows in current Galaxy (as discussed by John Chilton at BOSC 2015) creating an OrthoMCL workflow should be pretty easy.

To that end we’ve now got a GitHub repository for the tool wrappers. I’m trying to follow the structure that groups like IUC use. AJ’s working on orthomclAdjustFasta, so I decided to tackle orthomclFilterFasta, a tool that takes a directory full of FASTA files as input, does some simple filtering and outputs a combined FASTA file. I’m not 100% sure of the requirements for the command line (I need to go back into the code and see how it is executed) so I’ve got a tool that generates a single shell command in the form:

mkdir inputs && /bin/bash orthomcl_prepare_dataset_for_filter.sh dataset1.dat && /bin/bash orthomcl_prepare_dataset_for_filter.sh dataset2.dat && orthomclFilterFasta inputs/ <p1> <p2>

The prepare_dataset_for_filter.sh script is just a simple script that takes a FASTA file, extracts the tag that OrthoMCL uses to identify sets (added by orthomclAdjustFasta) and renames the file according to that tag. The orthomclFilterFasta tool insists that input files end in .fasta and are named according to their tag.
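
The real helper is a shell script, but its logic can be sketched in Python. The sketch below assumes that orthomclAdjustFasta has written headers of the form >tag|identifier, so that the tag is everything between the > and the first | on the first header line. The function names and the choice to copy (rather than move) the dataset into the inputs directory are illustrative assumptions, not a description of the actual script.

```python
import shutil
from pathlib import Path


def taxon_tag(fasta_path: Path) -> str:
    """Return the tag from the first header, assuming '>tag|identifier'."""
    with fasta_path.open() as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].split("|", 1)[0].strip()
    raise ValueError(f"no FASTA header found in {fasta_path}")


def prepare_dataset_for_filter(dataset: Path, inputs_dir: Path) -> Path:
    """Place the dataset in inputs_dir as <tag>.fasta (hypothetical helper)."""
    inputs_dir.mkdir(parents=True, exist_ok=True)
    # orthomclFilterFasta wants files named after their tag, ending in .fasta
    target = inputs_dir / f"{taxon_tag(dataset)}.fasta"
    shutil.copyfile(dataset, target)
    return target
```

Calling prepare_dataset_for_filter(Path("dataset1.dat"), Path("inputs")) on a file whose first header is >mtub|Rv0001 would, under these assumptions, produce inputs/mtub.fasta.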

In any event the tool runs fine on a local Galaxy install. The next step is to get tool dependencies right, which is where the stuff in the package directory comes in. Galaxy can install packages for you (in an admin-configurable folder). A tool’s dependencies are specified in a file called tool_dependencies.xml that sits in the same folder as the tool XML.

The tool dependencies specify packages to install. For OrthoMCL two new packages have been written (see here), one for OrthoMCL and one for the Perl DBD::SQLite module that it depends on. OrthoMCL in turn depends on Perl and DBD::SQLite. These dependencies are declared using a repository_dependencies.xml file – I’m still not sure if this is the correct approach, but in any event it follows the guide for simple repository dependencies in Galaxy. One limitation of repository dependencies is that they apparently only work within a single toolshed, so what to do if the package you require is in another toolshed?

Thus far the tool dependency stuff has been tested on a local Galaxy installation. It doesn’t work. Directories are created, but they are empty. Further testing and a better testing procedure are needed. Eric Rasche mentioned that Marius van den Beek has a Jenkins-based testing framework that uses Docker to create sandboxes, and it is here – so perhaps getting this up and running is a next step.

And then finally, FastOrtho seems like a possibly viable alternative to OrthoMCL. The output seems roughly similar to OrthoMCL’s and it is much faster (and as a single tool with no Perl dependencies, easier to package), but as with all new tools in bioinformatics, we’ll have to prove that it works well enough to replace OrthoMCL (which is somewhat of a standard in this domain). Well, check back in a few weeks for updates…

Reflections on Big Data, Bioinformatics and the recent UCT/UWC workshop

Monday and Tuesday of this week were largely consumed by a focus on Big Data. First, Ton Engbersen from the IBM/ASTRON Centre for Exascale Technology presented a talk at UCT on microservers, data gravity and microclouds.

The microserver in question is the [DOME](http://www.hpcwire.com/2014/04/10/dome-ibm-research-microserver-freescale/), a system-on-chip based device that crams 128 computers into a 19″ 2U rack drawer. Each computer takes up 13 cm x 5 cm (and is 6 mm thick) and provides 12 PowerPC cores with up to 48 GB RAM, resulting in a 2U rack drawer with 1536 cores and over 6 TB of RAM. The whole thing is cooled with warm water, an IBM innovation that is currently in use on the SuperMUC supercomputer at the Leibniz Supercomputing Centre near Munich; you can read more about its benefits on their page.
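
The per-drawer totals follow directly from the per-computer figures quoted above. As a quick check in Python (the variable names are mine; the numbers are the ones from the talk):

```python
# Figures quoted for the DOME microserver
NODES_PER_DRAWER = 128      # computers per 2U drawer
CORES_PER_NODE = 12         # PowerPC cores per computer
MAX_RAM_GB_PER_NODE = 48    # maximum RAM per computer, in GB

total_cores = NODES_PER_DRAWER * CORES_PER_NODE
total_ram_gb = NODES_PER_DRAWER * MAX_RAM_GB_PER_NODE

print(f"{total_cores} cores, {total_ram_gb} GB RAM per drawer")
```

That gives 1536 cores and 6144 GB of RAM, which is where the "over 6 TB" comes from when every computer carries the maximum 48 GB.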

The DOME server is being developed to analyse data from the SKA, an exascale computing problem. The SKA is anticipated to generate between 300 and 1500 petabytes of data per year, putting it on the extreme end of scientific enterprises in terms of data volume. While big data is commonly associated with data volume, researchers at IBM identify four V’s of big data: volume, velocity, variety and veracity. Volume is straightforward. Velocity speaks to the rate at which new data appears. With the amount of sequence data available in GenBank growing at an exponential rate, both volume and velocity of data threaten to outstrip the ability of bioinformatics centres to analyse data. In terms of integration of data, however, my presentation on the big data of tuberculosis focussed more on the variety and veracity of available data. A survey of the data published alongside research articles in the field shows that much of the variety of data gleaned through bioinformatics experiments is lost or only retained in closed institutional databases (and thus effectively lost to the field). An overview of health data collected as part of the NIH-funded Centre for Predictive Computational Phenotyping illustrates the problem of data veracity: electronic health records for patients are often incomplete and lack the vocabulary researchers require to identify disease presence or progression.

Managing the data collections necessary to study e.g. the global state of TB prevalence and treatment will require digital curation of multiple datasets drawn from experiments performed in a range of domains. Ton Engbersen pointed out that the growing size of data means that “compared to the cost of moving bytes around, everything else is free” (originally a Jim Gray quote). Add to this the (much more tractable) fact that the skills required to build stores and curate these datasets are unevenly distributed, and data collections are set to become “the new oil”. Engbersen proposes a solution: micro-clouds that offer the possibility to move code to the data rather than the other way round. Such entities would require a sophisticated cross-institutional authentication framework – almost certainly built on digital certificates – to allow authorised software agents to interface with data. This immediately suggests a set of research priorities to add to SANBI’s existing research projects on data storage and data movement. Luckily this research overlaps with some research interests at UWC Computer Science.

The workshop concluded with some agreements to collaborate between UCT and UWC on big data, but the perspectives delivered show that there is much more at play than the SKA. The fact that both UWC and UCT have established bioinformatics expertise and are located on the established SANReN backbone means that there’s an immediate opportunity to share knowledge and experiments on projects that tackle all four V’s of big data. Lots of ideas… the coming year will see how they can be put into practice.

A mouldy myth

Someone at my home institution, the University of the Western Cape, has decided that the way to attract students is to wave a picture of mouldy bread at them. Presumably they don’t think that having top class postgraduate programmes at places like BCB or PLAAS or SANBI is worth advertising. Nope, instead we should talk about mouldy bread. Or rather, a myth about Alexander Fleming and mouldy bread. Thus the modified (Gimped in fact) image to the left.

The original proclaims “What if [Fleming] never looked twice at something as ordinary as stale bread?”. Well, I don’t think we really know what Alexander Fleming thought of stale bread. What we do know is that stale bread had nothing to do with his (re)discovery of the antibacterial action of Penicillium moulds. Instead:

“Returning from holiday on September 3, 1928, Fleming began to sort through petri dishes containing colonies of Staphylococcus, bacteria that cause boils, sore throats and abscesses. He noticed something unusual on one dish. It was dotted with colonies, save for one area where a blob of mold was growing. The zone immediately around the mold—later identified as a rare strain of Penicillium notatum—was clear, as if the mold had secreted something that inhibited bacterial growth.” [source]

So it was a lazy attitude towards cleaning the lab, not mouldy bread, that led to Fleming’s discovery. That’s the first thing this blurb got wrong. What offends me more, however, is the clichéd image of the heroic scientist’s discovery sparking a paradigm shift. In reality, the antibacterial effect of Penicillium was known before Fleming, with a range of scientists and traditional knowledges describing the antibacterial effects of mould, or of Penicillium specifically. Just four years before Fleming’s discovery, Andre Gratia and Sara Dath discovered the antibacterial effect of a species of Penicillium, also as a result of contamination of a bacterial culture.

What made Fleming’s discovery significant was not the moment of discovery and subsequent insight, but rather what he did afterwards: instead of merely publishing a paper and moving on to another topic, he spent years trying to get other scientists, chemists especially, interested in the new substance’s potential. It was just over a decade later that Howard Florey’s team assembled a strange collection of baths and milk churns as part of the first penicillin production line. Before Florey, however, there was Dr Cecil Paine, a student of Fleming’s, who used a crude penicillin extract to successfully treat an eye infection in 1931. (Paine was later a colleague of Florey’s.) And Ernst Chain, the scientist in Florey’s lab who led the penicillin research, allegedly extracted the compound from a sample of mould that had been sub-cultured from Fleming’s original isolate. Florey’s science also drew on the clinical trials conducted by his wife, Ethel Florey. So the links between Fleming, the Floreys, Chain and the practical use of penicillin drew on a rich culture of openness and experimentation. In addition, as efforts got under way to industrialise penicillin production, Sara Dath’s work in collecting a “monumental number of moulds and bacteria” proved useful. Science was then, and is now, a product of a process of collective enquiry and effort, and while “scientific interest” is key, reducing the story of penicillin to a Scot staring at stale bread does violence to history. And UWC could do better than to peddle this mouldy myth.