Galaxy 22.05 and the cluster_venv

I recently upgraded one of my Galaxy servers to version 22.05 (once again using Ansible and the excellent Galaxy Ansible roles). Unfortunately this exposed a problem with our environment: the Galaxy server that I configured is running on Ubuntu 20.04, but the HPC cluster's worker nodes use Ubuntu 18.04. Ubuntu 18.04 comes with Python 3.6, which is too old for modern Galaxy.

Thanks to advice from the Galaxy admins' chat, I installed a Conda environment by downloading Miniconda3 and doing a batch install (Miniconda3-latest-Linux-x86_64.sh -b -p GALAXY_DIR/conda, where GALAXY_DIR is my Galaxy installation directory). I then activated it with source GALAXY_DIR/conda/bin/activate and ran Galaxy's common_startup.sh as described in this post. My GALAXY_VIRTUAL_ENV in job_conf.xml points to the location of the virtualenv.
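Put together, the procedure looked roughly like this (a sketch only; GALAXY_DIR stands in for my Galaxy installation directory and the path used here is hypothetical):

# hypothetical installation directory
GALAXY_DIR=/projects/galaxy/my_galaxy
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$GALAXY_DIR/conda"
source "$GALAXY_DIR/conda/bin/activate"
# build the cluster virtualenv using the Conda-provided Python
export GALAXY_VIRTUAL_ENV="$GALAXY_DIR/cluster_venv"
cd "$GALAXY_DIR/server"
scripts/common_startup.sh --skip-client-build --skip-samples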

The end result is that Galaxy uses this virtual environment to provide Python on cluster nodes, and job submission now works again.

Loading Nanopore data into Galaxy

Basecalled nanopore data is output in a structure like this:

fastq_pass/
.....barcode01/
........XXXX.fastq
........YYYY.fastq
.....barcode02/
........AAAA.fastq
........BBBB.fastq

In other words, for each barcode there is a collection of multiple fastq files. While it is straightforward to concatenate these together on the command line, for users who don't work on the command line it is not as easy. To get this data into Galaxy for analysis, you can follow these steps:

  1. Upload the whole fastq_pass directory to the Galaxy server using FTP upload
  2. Using the Rule Based Uploader, select that you want to create Collections with the FTP Directory as your source, and use the rules below. These rules expect there to be a single directory in your FTP area containing the barcode directories. If that is not the case, change the .*/barcode\\d+/(.*) regular expression to match your upload directory; for example, if your upload directory is called fastq_pass, make it fastq_pass/barcode\\d+/(.*). The rules also assume that your data is gzipped FASTQ. If it is not gzipped fastq, change fastq.gz$ to fastq$ and fastqsanger.gz to fastqsanger. You can also find these rules here:
    {
      "rules": [
        {
          "type": "add_column_metadata",
          "value": "path"
        },
        {
          "type": "add_filter_regex",
          "target_column": 0,
          "expression": "fastq.gz$",
          "invert": false
        },
        {
          "type": "add_column_regex",
          "target_column": 0,
          "expression": ".*/barcode\\d+/(.*)",
          "group_count": 1
        },
        {
          "type": "add_column_regex",
          "target_column": 0,
          "expression": "./(barcode\\d+)/.*",
          "group_count": 1
        },
        {
          "type": "swap_columns",
          "target_column_0": 1,
          "target_column_1": 2
        }
      ],
      "mapping": [
        {
          "type": "ftp_path",
          "columns": [
            2
          ]
        },
        {
          "type": "list_identifiers",
          "columns": [
            1,
            2
          ],
          "editing": false
        }
      ],
      "extension": "fastqsanger.gz"
    }
    
  3. Give the collection a name (e.g. samples1) and Upload. A new list of lists called samples1 will be created.
  4. Create a tabular mapping file with the first column being the barcode name and the second column being the sample name that you want to use (see the example after this list). You can either create this as a TSV (e.g. using Excel) or type it into the "Paste" box in the Galaxy uploader (making sure to select tabular as the type and to enable the "convert spaces to tabs" option in the settings).
  5. Using the input collection and tabular renaming file as inputs, run this workflow. The result is a list with the elements of the list being the concatenated reads from the barcode directories.
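As an example of the mapping file described in step 4 (the sample names are hypothetical), with a tab between the two columns:

barcode01	sample1
barcode02	sample2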

You can watch a video demo of this method here.

An upgrade-friendly Slurm Installation

At SANBI we have a small HPC (see our Annual report) that uses Slurm as a scheduler. It's always a good idea to keep this up to date, and unfortunately the version available in the Ubuntu package repository tends to be quite old (e.g. Slurm 17.11 for Ubuntu 18.04 and 19.05 for Ubuntu 20.04).

The Slurm upgrade procedure is described in the Quick Start Administrator Guide. In short, the daemons need to be upgraded in a specific order, starting with slurmdbd, followed by slurmctld and then the slurmd on each compute node. To facilitate this process, our install is on shared storage (CephFS, but it could also be NFS) and looks as follows:

    /tools/admin/slurm
    ├── 18.08.9
    ├── 19.05.7
    ├── 20.02.5
    ├── ctld -> /tools/admin/slurm/20.02.5
    ├── current -> /tools/admin/slurm/20.02.5
    ├── d -> /tools/admin/slurm/20.02.5
    ├── dbd -> /tools/admin/slurm/20.02.5
    ├── etc

To install Slurm, the source is unpacked and compiled, with configure options like:

    ./configure --prefix=/tools/admin/slurm/20.02.5 --sysconfdir=/tools/admin/slurm/etc
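This is followed by the usual build and install into that prefix (a sketch; the -j value is just an example):

    make -j4
    make install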

As can be seen from the listing above, the d, ctld, dbd and current symlinks point to the version of Slurm currently in use.
Each daemon is managed by systemd and configured with a file in /etc/systemd/system. For example, here is the configuration of slurmctld (i.e. /etc/systemd/system/slurmctld.service):

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/tools/admin/slurm/etc/slurm.conf

[Service]
Type=oneshot
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStartPre=-/usr/bin/pkill -KILL slurmctld
ExecStart=/tools/admin/slurm/ctld/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=-/bin/kill -HUP $MAINPID
ExecStop=-/usr/bin/pkill -KILL slurmctld
PIDFile=/var/run/slurm/slurmctld.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

After installing this file, you need to run sudo systemctl daemon-reload.

Note the ExecStart line in the above config file: the executable is run via the /tools/admin/slurm/ctld/sbin directory. Because /tools/admin/slurm/ctld is a symlink, upgrading slurmctld is simply a matter of changing the symlink to point at the new Slurm version.
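For example, assuming a (hypothetical) newer release has already been built into its own prefix, upgrading the controller is roughly:

    # 20.11.9 here is a placeholder for the newly built version
    systemctl stop slurmctld
    ln -sfn /tools/admin/slurm/20.11.9 /tools/admin/slurm/ctld
    systemctl start slurmctld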

The upgrade process for slurmdbd and slurmctld is quite straightforward: just follow the procedure for database backup and upgrade described in the docs. For slurmd, the upgrade procedure (backup of StateSaveLocation and restart of slurmd) needs to happen on each worker node; this is best automated using Ansible. As noted in the Slurm admin documentation, you can upgrade across at most two major releases. Due to a security issue, older Slurm versions are no longer available from the main download page, but you can still get them from GitHub (e.g. version 19.05).

A final note: this procedure, with per-version symlinks and so on, was based on something I read online before implementing it at SANBI. I can't recall where I read it, but if you were the source and would like credit, please look me up and let me know.

P.S. After each Slurm upgrade, I recompile and re-install the slurm-drmaa module that you can find here.

Galaxy 21.05 upgrade and cluster_venv

As part of my (rather prolonged) work towards an M.Sc. in bioinformatics, I maintain a Galaxy server at SANBI. I've recently upgraded it to Galaxy 21.05, at the time of writing the latest Galaxy release. You can read more about that release here.

My Galaxy server is deployed using Ansible, with a combination of the standard Galaxy roles and roles developed at SANBI to match our infrastructure. Specifically, we have roles for integrating with our infrastructure's authentication, monitoring and CephFS filesystem. I also wrote a workaround for deploying letsencrypt-based SSL. You can find this configuration in this repository.

The Galaxy server integrates with our cluster, whose worker nodes run Ubuntu 18.04 (the Galaxy server itself is on Ubuntu 20.04). For a number of tasks, such as "finishing" jobs (feeding results back into Galaxy), Galaxy tools need access to Python libraries that are not part of core Python. In the past I have found that using the single virtualenv that the Galaxy roles configure on the Galaxy server causes problems when running jobs on the cluster, so I keep a separate venv for cluster jobs that is built on the cluster itself. That is, after the Galaxy server install was completed, I logged into one of the cluster worker nodes as root, deleted the old cluster_venv and ran:

cd /projects/galaxy/pvh_masters_galaxy1
export GALAXY_VIRTUAL_ENV=$(pwd)/cluster_venv
cd server
scripts/common_startup.sh --skip-client-build --skip-samples
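
A quick sanity check that the resulting virtualenv provides a recent enough Python (a sketch):

/projects/galaxy/pvh_masters_galaxy1/cluster_venv/bin/python --version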

Obviously it would be better to automate the above, but I have not got around to doing so yet. I'm not sure if this is the best approach but it works at least for our environment, so I'm writing this blog post in case it is useful to others (or to jog my own memory down the line!). This cluster_venv setup is exposed to the job runners in job_conf.xml - here is a snippet of my configuration:

<job_conf>
    <plugins workers="4">
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="slurm" type="runner" load="galaxy.jobs.runners.slurm:SlurmJobRunner"/>
    </plugins>
    <destinations default="dynamic">
        <destination id="slurm" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
        </destination>
        <destination id="local" runner="local"/>
        <destination id="dynamic" runner="dynamic">
            <param id="tmp_dir">True</param>
            <param id="type">dtd</param>
        </destination>
        <destination id="cluster_default" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="SLURM_CONF">/tools/admin/slurm/etc/slurm.conf</env>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
            <param id="nativeSpecification">--mem=10000</param>
            <resubmit condition="memory_limit_reached" destination="cluster_20G" />
        </destination>

P.S. This was the only manual task I had to perform (on the Galaxy side of things). Mostly the update consisted of updating our SANBI Ansible roles to support Ubuntu 20.04 (and Ceph Octopus), switching to the latest Galaxy roles (as described in the training material for Galaxy admins), bumping the version number from release_20.09 to release_21.05 and running the Ansible playbook.

Solving Bluetooth Audio Delay on Ubuntu 20.04

For quite some time I've been frustrated by the state of Ubuntu's support for Bluetooth audio. As a result, I've always gone with wired headphones. Now maybe it's something about me, but I've not had the best of luck with those: headphones last a few months before some plug or wire breaks, and in the worst case the headphone jack on my laptop starts giving issues... so wireless is great. If it works.

I had to replace my headset recently and bought a HAVIT H2590BT, a fairly entry-level thing. It worked fine for listening to music, but as soon as I wanted more "real time" audio there were problems. I first noticed this on Duolingo, and a bit of debugging showed that the problem was a delay in the audio. This became rather embarrassing when doing a call with colleagues.

It turns out that Bluetooth has audio profiles that affect the operation of the headset. A2DP focuses on the best audio quality, whereas HFP and HSP are more focused on real-time responsiveness. Unfortunately, with the standard PulseAudio (13.99.1) on my Ubuntu 20.04 I could only connect to the headphones using A2DP. I came across posts on askubuntu.com from 2015 onwards talking about this issue, and some suggested switching profiles, but I couldn't seem to get that right.

Then I found this post from @normankev141. Unfortunately the plugin that he suggested has been deprecated by the author, who suggested moving to PipeWire. I switched to PipeWire using the instructions from this askubuntu post, rebooted and now I've got a much richer selection of profiles:

$ pactl list cards
[...]
Card #54
Name: bluez_card.C5_78_21_3A_9F_DB
Driver: module-bluez5-device.c
Owner Module: n/a
[...]
Profiles:
    off: Off (sinks: 0, sources: 0, priority: 0, available: yes)
    a2dp-sink: High Fidelity Playback (A2DP Sink) (sinks: 1, sources: 0, priority: 0, available: yes)
    headset-head-unit: Headset Head Unit (HSP/HFP) (sinks: 1, sources: 1, priority: 0, available: yes)
    a2dp-sink-sbc: High Fidelity Playback (A2DP Sink, codec SBC) (sinks: 1, sources: 0, priority: 0, available: yes)
    headset-head-unit-cvsd: Headset Head Unit (HSP/HFP, codec CVSD) (sinks: 1, sources: 1, priority: 0, available: yes)
Active Profile: a2dp-sink-sbc

I can now switch profiles with pactl set-card-profile bluez_card.C5_78_21_3A_9F_DB a2dp-sink or pactl set-card-profile bluez_card.C5_78_21_3A_9F_DB headset-head-unit. I've made these two aliases for my shell:

alias goodaudio="pactl set-card-profile $(pactl list cards |grep 'Name: bluez' |awk '{print $2}') a2dp-sink"
alias headset="pactl set-card-profile $(pactl list cards |grep 'Name: bluez' |awk '{print $2}') headset-head-unit"
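
One caveat about the aliases: because the definitions use double quotes, the command substitution runs when the alias is defined (e.g. when .bashrc is sourced), so the card name is baked in at that point. If you would rather look the card up each time, equivalent shell functions (a sketch) behave that way:

goodaudio() { pactl set-card-profile "$(pactl list cards | grep 'Name: bluez' | awk '{print $2}')" a2dp-sink; }
headset() { pactl set-card-profile "$(pactl list cards | grep 'Name: bluez' | awk '{print $2}')" headset-head-unit; }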

I haven't yet got around to linking these profile switches to some kind of Gnome utility so that I can toggle them from the desktop. It's on the TODO list.

MegaRAID write cache policy with lsmcli

A couple of weeks ago I had a near disaster when some of our servers lost power while their RAID was set to "write-back" caching with non-working batteries. The result was filesystem corruption and failure on 2 out of 3 Ceph monitor servers. In the past I have written about using MegaCLI for RAID admin. MegaCLI has been replaced by StorCLI, which I found here on the Broadcom web pages. I unpacked the various zip files until I got the storcli-007.0309.0000.0000-1.noarch.rpm RPM and installed that to get the MegaRAID storcli64 tool. Instead of using storcli64 directly, I'm using the libstoragemgmt tools, specifically lsmcli. On CentOS 7 this required installing libstoragemgmt and libstoragemgmt-megaraid-plugin and starting the lsmd daemon (systemctl start libstoragemgmt).

With this all set up, I found the volume with lsmcli -u megaraid:// list --type VOLUMES:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// list --type VOLUMES
ID                               | Name | SCSI VPD 0x83                    | Size         | Disabled | Pool ID | System ID | Disk Paths
---------------------------------------------------------------------------------------------------------------------------------------
6003048016dfd2001cf1d19f0af655a3 | VD 0 | 6003048016dfd2001cf1d19f0af655a3 | 597998698496 | No       | :DG0    |           | /dev/sda  

then the volume-cache-info command:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-cache-info --vol  6003048016dfd2001cf1d19f0af655a3
Volume ID                        | Write Cache Policy | Write Cache | Read Cache Policy | Read Cache | Physical Disk Cache
--------------------------------------------------------------------------------------------------------------------------
6003048016dfd2001cf1d19f0af655a3 | Write Back         | Write Back  | Enabled           | Enabled    | Use Disk Setting   

and set the policy to AUTO (which means write-back when the battery is ok, write-through otherwise):

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-write-cache-policy-update --vol  6003048016dfd2001cf1d19f0af655a3 --policy AUTO
Volume ID                        | Write Cache Policy | Write Cache   | Read Cache Policy | Read Cache | Physical Disk Cache
----------------------------------------------------------------------------------------------------------------------------
6003048016dfd2001cf1d19f0af655a3 | Write Through      | Write Through | Enabled           | Enabled    | Use Disk Setting   

There doesn't seem to be a direct way to query the battery backup unit (BBU) with lsmcli, but /opt/MegaRAID/storcli/storcli64 show will show you what the status is.
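
If you need the BBU status specifically, storcli64 can also report on it directly; something like the following should work (the /c0 controller index is an assumption, adjust it for your system):

/opt/MegaRAID/storcli/storcli64 /c0/bbu show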

MaterializeCSS vs ReactJS: the case of the select

For historical reasons, the COMBAT TB web interface uses materialize for its styling. So far so good. That is, until I tried to deploy my code, written with ReactJS. See, as I understand it, materialize has in some cases decided to replace some HTML elements with its own versions of them, notably the <select> element. And ReactJS relies on these elements for its own operation.

The first problem I had was that <select> elements vanished. Turns out you need a bit of Javascript to make them work:

$(document).ready(function() {
    $('select').material_select();
});

The next problem, however, was that the onChange handlers that ReactJS uses don't trigger events. Luckily that has been discussed before.

I've got two types of <select> element in the code I was writing for the COMBAT TB Explorer web application: static ones (there from the birth of the page) and dynamically generated ones. For the static ones I added some code to link up the events in the componentDidMount handler:

componentDidMount: function() {
    $(document).ready(function() {
        $('select').material_select();
    });
    $('#modeselectdiv').on('change', 'select', null, this.handleModeChange);
    $('#multicompselectdiv').on('change', 'select', null, this.handleMultiCompChange);

},

but this didn't work for the dynamically generated elements, I think because they are only rendered after an AJAX call returns. Since I know a state change triggers a re-render, I added the handler hook-up after the returned data was stored in the application's state, for example:

success: function(datasets) {
    var dataset_list = [];
    var dataset_list_length = datasets.length
    for (var i = 0; i < dataset_list_length; i++) {
        dataset_list.push({'name': datasets[i]['name'], id: datasets[i]['id']});
    }
    this.setState({datasets: dataset_list, dataset_id: dataset_list[0].id});
    $('select').material_select();
    $('#datasetselectdiv').on('change', 'select', null, this.handleDatasetChange);
}.bind(this),

Turns out this works. The state change handlers are now linked in, they keep the state up to date with what the user is doing on the form, and the whole thing (that links the application to a Galaxy instance) works. Yay!

Making Ubuntu 14.04 and CentOS 7 NFS work together

I just spent a frustrating morning configuring our servers to talk NFS to each other properly. So we have:

1) NFS servers (ceph-mon1 and so on) running CentOS 7.
2) NFS clients (gridj1 and so on) running Ubuntu 14.04.

The first problem: RBD mounting and NFS startup were not configured on the servers. I fixed that by adding entries to /etc/ceph/rbdmap and enabling the rbdmap and nfs-server services using systemctl enable. I also used e2label to label the ext4 filesystems in the RBDs and then used these labels in /etc/fstab instead of device names, together with the _netdev mount option because these are network devices.
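
For illustration, the relevant entries looked something like this (the pool, image, label and mount point names are hypothetical):

# /etc/ceph/rbdmap: pool/image plus the credentials to map it with
rbd/nfsdata    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# /etc/fstab: mount by label, with _netdev so mounting waits for the network
LABEL=nfsdata  /srv/nfsdata  ext4  defaults,_netdev  0 0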

The second problem: I had to add the insecure option to the exports in /etc/exports. This is because the mount request comes from a port higher than 1024, a so-called insecure port. I then ran exportfs -r to resync everything.
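
With the option added, an export line looks roughly like this (the path and client network are hypothetical):

/srv/nfsdata  192.168.6.0/24(rw,sync,no_subtree_check,insecure)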

And the third problem: Ubuntu's autofs makes an NFSv4 mount request by default (even though I had specified nfsvers=3 in the mount options), and I haven't configured NFSv4's authenticated mounts, so I was getting "authenticated mount request from 192.168.6.71:862" messages in /var/log/messages on the NFS server. I switched the NFS server to not do NFSv4 by adding --no-nfs-version 4 to the RPCNFSDARGS variable in /etc/sysconfig/nfs on the server, restarted the NFS server (systemctl restart nfs-server) and the mounts finally worked.
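
The corresponding line in /etc/sysconfig/nfs ends up looking like this:

RPCNFSDARGS="--no-nfs-version 4"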

Finally, I've documented this here for posterity...

Faster Galaxy with uWSGI

I recently switched our local Galaxy server to run under uWSGI and supervisord instead of the standard run.sh (which uses Paste under the hood). I followed the Galaxy scaling guide and it was pretty accurate except for a few details. I won't be showing the changes to the Galaxy config files; they are exactly as described on that page.

I installed supervisord by doing pip install supervisor in the virtualenv that Galaxy uses. Then I put a supervisord.conf in the config/ directory of our Galaxy install and it starts like this:

[inet_http_server]
port=127.0.0.1:9001

[supervisord]

[supervisorctl]

The [inet_http_server] section directs supervisord to listen on localhost port 9001. The following two sections, [supervisord] and [supervisorctl], need to be present but can be empty. The rest of the configuration is as per the Scaling page, with a few changes I'll explain below:

[program:galaxy_uwsgi]
command         = /opt/galaxy/.venv/bin/uwsgi --plugin python --ini-paste /opt/galaxy/config/galaxy.ini --die-on-term
directory       = /opt/galaxy
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 10
user            = galaxy
environment     = PATH=/opt/galaxy/.venv:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin,PYTHON_EGG_CACHE=/opt/galaxy/.python-eggs,PYTHONPATH=/opt/galaxy/eggs/PasteDeploy-1.5.0-py2.7.egg,SGE_ROOT=/var/lib/gridengine
numprocs        = 1
stopsignal      = TERM

[program:handler]
command         = /opt/galaxy/.venv/bin/python ./scripts/paster.py serve config/galaxy.ini --server-name=handler%(process_num)s --pid-file=/opt/galaxy/handler%(process_num)s.pid --log-file=/opt/galaxy/handler%(process_num)s.log
directory       = /opt/galaxy
process_name    = handler%(process_num)s
numprocs        = 2
umask           = 022
autostart       = true
autorestart     = true
startsecs       = 15
user            = galaxy
environment     = PYTHON_EGG_CACHE=/opt/galaxy/.python-eggs,SGE_ROOT=/var/lib/gridengine

The SGE_ROOT is necessary because our cluster uses Sun Grid Engine and the SGE DRMAA library requires this environment variable. Otherwise this config uses uWSGI installed (using pip) in the virtualenv that Galaxy uses.

This snippet of nginx configuration shows what was commented out and what was added to link nginx to uWSGI:

#proxy_set_header REMOTE_USER $remote_user;
#proxy_set_header X-Forwarded-Host $host;
#proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
#proxy_set_header X-URL-SCHEME https;
#proxy_pass http://galaxy_app;
uwsgi_pass 127.0.0.1:4001;
uwsgi_param UWSGI_SCHEME $scheme;
include uwsgi_params;

Then, how to start and stop it all? First, the supervisord init script: the basis for this was the debian-norrgard script from the supervisord initscripts repository. The final script is in this gist. Note these lines:

NAME=supervisord
GALAXY_USER=galaxy
GALAXY_HOME=/opt/galaxy
GALAXY_VENV=$GALAXY_HOME/.venv
DAEMON=$GALAXY_VENV/bin/$NAME
SUPERVISORCTL=$GALAXY_VENV/bin/supervisorctl

They link supervisord to Galaxy settings. Then /etc/init.d/galaxy is in this gist. It depends on the supervisord startup script and starts and stops Galaxy using supervisorctl.

Two things remain unsatisfactory:

  1. The shutdown of Galaxy doesn't work reliably. The use of uWSGI's --die-on-term and stopsignal = TERM in the supervisord.conf is an attempt to remedy this.

  2. The uWSGI config relies on the PasteDeploy egg. This exists on our Galaxy server because it was downloaded by the historical Galaxy startup script. With the switch towards wheel-based (instead of egg-based) packages, this script is no longer part of a Galaxy install. The uWSGI settings might need to be changed because of this; however, since the PasteDeploy package is installed in the virtualenv that Galaxy uses, perhaps no change is necessary. I haven't tested this.

With these limitations, however, our Galaxy server is working and much more responsive than before.

Automatically commit and push IPython notebook

I'm currently teaching Python at a Software Carpentry workshop at North-West University in Potchefstroom. As always, there are concerns about pace and about how people can catch up if they fall behind. In a recent discussion on this topic on the Software Carpentry mailing list, David Dotson mentioned that he commits his IPython notebooks by pressing a custom keyboard shortcut that triggers an automatic git add/commit/push. No code was available, but I poked around a bit and found this StackOverflow question and answer, which showed how to add a post-save hook to an IPython notebook (with details on doing the same for the newer Project Jupyter notebooks).

So here's my code:

import os
from subprocess import check_call
from shlex import split

def post_save(model, os_path, contents_manager):
    """post-save hook for doing a git commit / push"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    workdir, filename = os.path.split(os_path)
    if filename.startswith('Scratch') or filename.startswith('Untitled'):
        return # skip scratch and untitled notebooks
    # now do git add / git commit / git push
    check_call(split('git add {}'.format(filename)), cwd=workdir)
    check_call(split('git commit -m "notebook save" {}'.format(filename)), cwd=workdir)
    check_call(split('git push'), cwd=workdir)

# 'c' is the config object that IPython provides in ipython_notebook_config.py
c.FileContentsManager.post_save_hook = post_save

This code obviously assumes that your working directory is a git repository and it has been configured with a remote to push to. For this workshop my notebooks are in this git repo on GitHub.

I created a new IPython profile (ipython profile create swcteaching) for use while teaching and added that code to the ipython_notebook_config.py file. You can find this file's location with ipython profile locate swcteaching.
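
I then start the notebook server with that profile, roughly like this (this is the IPython-era notebook command; Jupyter handles profiles differently):

ipython notebook --profile=swcteaching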

The one little niggle is that the commit message is always the same. I don't know IPython's front-end code well enough, but perhaps there is a way to pop up a window and request a commit message (going towards something more like David Dotson's solution and less like mine).