An upgrade-friendly Slurm Installation

At SANBI we have a small HPC (see our Annual report) that uses Slurm as a scheduler. It's always a good idea to keep this up to date, and unfortunately the version available in the Ubuntu package repository tends to be quite old (e.g. 17.11 for Ubuntu 18.04 and 19.05 for Ubuntu 20.04).

The Slurm upgrade procedure is described in their Quick Start Administrator Guide. In short, the daemons need to be upgraded in a specific order, starting with slurmdbd, followed by a nested upgrade of slurmctld and the slurmd on each compute node. To facilitate this process, our install lives on shared storage (CephFS, though NFS would work too) and is laid out as follows:

    ├── 18.08.9
    ├── 19.05.7
    ├── 20.02.5
    ├── ctld -> /tools/admin/slurm/20.02.5
    ├── current -> /tools/admin/slurm/20.02.5
    ├── d -> /tools/admin/slurm/20.02.5
    ├── dbd -> /tools/admin/slurm/20.02.5
    ├── etc

To install Slurm, the source is unpacked and compiled with configure options like:

    ./configure --prefix=/tools/admin/slurm/20.02.5 --sysconfdir=/tools/admin/slurm/etc
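
For completeness, here is a rough sketch of how the whole build step could be scripted. It is not the exact script we use: the function names, the tarball argument and the temporary build directory are assumptions for illustration, and only the two configure flags come from the command above.

```shell
# Root of the shared install, matching the listing above.
SLURM_ROOT=${SLURM_ROOT:-/tools/admin/slurm}

# Print the configure flags for one Slurm version (hypothetical helper).
slurm_configure_flags() {
    printf '%s\n' "--prefix=$SLURM_ROOT/$1 --sysconfdir=$SLURM_ROOT/etc"
}

# Unpack, configure, build and install one release (hypothetical helper).
# $1 = version, e.g. 20.02.5; $2 = path to the downloaded source tarball.
build_slurm() {
    version=$1
    tarball=$2
    workdir=$(mktemp -d) || return 1
    # Drop the top-level directory of the tarball so we land in the source tree.
    tar -xf "$tarball" -C "$workdir" --strip-components=1 || return 1
    (
        cd "$workdir" &&
        # Word splitting of the flags is intentional here.
        ./configure $(slurm_configure_flags "$version") &&
        make &&
        make install
    )
}
```

Because every version gets its own prefix, building a new release never touches the one that is currently running. Only the shared etc directory is common to all versions.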

As can be seen from the above listing, the d, ctld, dbd and current links all point to the version of Slurm currently in use.
Each daemon is managed by systemd and configured with a unit file in /etc/systemd/system. For example, here is the configuration of slurmctld (i.e. /etc/systemd/system/slurmctld.service):

    [Unit]
    Description=Slurm controller daemon
    After=munge.service

    [Service]
    ExecStartPre=-/usr/bin/pkill -KILL slurmctld
    ExecStart=/tools/admin/slurm/ctld/sbin/slurmctld $SLURMCTLD_OPTIONS
    ExecReload=-/bin/kill -HUP $MAINPID
    ExecStop=-/usr/bin/pkill -KILL slurmctld


After installing this file, you need to run sudo systemctl daemon-reload.

Note the ExecStart line in the above config file. The executable is run via the /tools/admin/slurm/ctld/sbin folder. Because this /tools/admin/slurm/ctld path is a symlink, upgrading slurmctld simply involves changing the symlink to point to the new Slurm version and restarting the daemon.
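
As an illustration, switching one of these links could look something like the sketch below. The function name switch_slurm_link is made up for this post, and the existence check on the target directory is my own addition rather than part of any official procedure.

```shell
# Root of the shared install, matching the listing above.
SLURM_ROOT=${SLURM_ROOT:-/tools/admin/slurm}

# Point one link (dbd, ctld, d or current) at an installed version.
# $1 = link name, $2 = version directory, e.g. 20.02.5
switch_slurm_link() {
    link=$1
    version=$2
    target="$SLURM_ROOT/$version"
    if [ ! -d "$target" ]; then
        echo "no such Slurm install: $target" >&2
        return 1
    fi
    # -n replaces the link itself instead of creating a new link
    # inside the directory that the old link points to.
    ln -sfn "$target" "$SLURM_ROOT/$link"
}
```

Following the upgrade order described earlier, the dbd link would be switched first (followed by a restart of slurmdbd), then ctld, and finally d on the compute nodes. Since the install is on shared storage, the d link changes for all nodes at once, so the slurmd restarts should follow promptly.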

The upgrade process for slurmdbd and slurmctld is quite straightforward: follow the procedure for database backup and upgrade described in the docs. For slurmd, the upgrade procedure (backup of StateSaveLocation and restart of slurmd) needs to happen on each worker node, which is best automated using Ansible. As noted in the Slurm admin documentation, a direct upgrade is only supported from one of the two preceding major releases, so a larger jump has to go through an intermediate version. Due to a security issue, older Slurm versions are not available from the main download page, but you can still get them from GitHub (e.g. version 19.05).
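
To make the rule about major releases concrete, here is a small sketch that checks whether a jump between two versions can be done in one step. The helper names are hypothetical, and the hard-coded release list is an assumption that only covers the releases relevant to this post, so it would need extending for newer versions.

```shell
# Major releases in order, oldest first (assumed list, extend as needed).
SLURM_MAJORS="17.02 17.11 18.08 19.05 20.02"

# Strip the patch level: 19.05.7 -> 19.05
slurm_major() {
    echo "$1" | cut -d. -f1,2
}

# Print the position of a major release in SLURM_MAJORS; fail if unknown.
slurm_major_index() {
    i=0
    for m in $SLURM_MAJORS; do
        i=$((i + 1))
        if [ "$m" = "$1" ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Succeed if going from version $1 to version $2 spans at most two
# major releases. Returns 2 if either version is not in the list.
can_upgrade_directly() {
    from=$(slurm_major_index "$(slurm_major "$1")") || return 2
    to=$(slurm_major_index "$(slurm_major "$2")") || return 2
    gap=$((to - from))
    [ "$gap" -ge 0 ] && [ "$gap" -le 2 ]
}
```

With this list, can_upgrade_directly 18.08.9 20.02.5 succeeds, while can_upgrade_directly 17.11.2 20.02.5 fails. Coming from the 17.11 packaged with Ubuntu 18.04, you would therefore need an intermediate release first.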

A final note: this procedure, with its per-version symlinks, is based on something I read online before implementing it at SANBI. I can't recall where I read it, but if you were the source and would like credit, please look me up and let me know.

P.S. After the system upgrade, I recompile and reinstall the slurm-drmaa module that you can find here.