At SANBI we have a small HPC (see our Annual report) that uses Slurm as a scheduler. It's always a good idea to keep this up to date, and unfortunately the version available in the Ubuntu package repository tends to be quite old (e.g. 17.11 for Ubuntu 18.04 and 19.05 for Ubuntu 20.04).
The Slurm upgrade procedure is mentioned in their Quick Start Administrator Guide. In short, the daemons need to be upgraded in a specific order, starting with slurmdbd, followed by a nested upgrade of slurmctld and the slurmd on each compute node. To facilitate this process our install is on shared storage (CephFS, but could also be NFS) and looks as follows:
/tools/admin/slurm
├── 18.08.9
├── 19.05.7
├── 20.02.5
├── ctld -> /tools/admin/slurm/20.02.5
├── current -> /tools/admin/slurm/20.02.5
├── d -> /tools/admin/slurm/20.02.5
├── dbd -> /tools/admin/slurm/20.02.5
├── etc
To install Slurm, the source is unpacked and compiled, with configure options like:
./configure --prefix=/tools/admin/slurm/20.02.5 --sysconfdir=/tools/admin/slurm/etc
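Since every version gets its own prefix while all versions share one configuration directory, the configure arguments can be derived from the version number. The following is a minimal sketch of that idea; the function name `slurm_configure_args` and the `SLURM_BASE` variable are hypothetical helpers for illustration, not part of Slurm.

```shell
# Hypothetical helper: print the configure arguments for one Slurm version.
# SLURM_BASE is an assumed override; it defaults to the install root used here.
slurm_configure_args() {
    local version
    local base
    version="$1"
    base="${SLURM_BASE:-/tools/admin/slurm}"
    if [ -z "$version" ]; then
        echo "usage: slurm_configure_args VERSION" >&2
        return 1
    fi
    # Per-version prefix, shared sysconfdir.
    printf -- '--prefix=%s/%s --sysconfdir=%s/etc\n' "$base" "$version" "$base"
}
```

Inside the unpacked source tree this could be used as `./configure $(slurm_configure_args 20.02.5)`, followed by the usual `make` and `make install`.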
As can be seen from the directory listing above, the ctld, d, dbd and
current links all point to the version of Slurm currently in use.
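Before and after an upgrade it helps to see at a glance which version each link points to. A small sketch, assuming the layout shown above (the function name `show_slurm_links` is made up for this example):

```shell
# Print "name -> target" for each daemon symlink under the install root.
# The root defaults to the path used in this post.
show_slurm_links() {
    local base
    local name
    base="${1:-/tools/admin/slurm}"
    for name in dbd ctld d current; do
        if [ -L "$base/$name" ]; then
            printf '%s -> %s\n' "$name" "$(readlink "$base/$name")"
        else
            printf '%s is not a symlink\n' "$name"
        fi
    done
}
```

Halfway through an upgrade you would expect dbd to point at the new version while ctld and d still point at the old one.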
Each daemon is managed by systemd and configured with a unit file in
/etc/systemd/system. For example, here is the configuration of slurmctld:
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/tools/admin/slurm/etc/slurm.conf

[Service]
Type=oneshot
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStartPre=-/usr/bin/pkill -KILL slurmctld
ExecStart=/tools/admin/slurm/ctld/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=-/bin/kill -HUP $MAINPID
ExecStop=-/usr/bin/pkill -KILL slurmctld
PIDFile=/var/run/slurm/slurmctld.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
After installing this file, you need to run
sudo systemctl daemon-reload.
Note the ExecStart line in the above config file. The executable is run via the
/tools/admin/slurm/ctld/sbin folder. Because the
/tools/admin/slurm/ctld path is a symlink, upgrading
slurmctld simply involves changing the symlink to point to the new Slurm version.
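The symlink change itself can be wrapped in a small function that refuses to point a link at a version that is not installed. This is a sketch under the layout described above; `switch_slurm_link` is a hypothetical name, not something Slurm provides.

```shell
# Repoint one daemon symlink (dbd, ctld, d or current) at an installed version.
# Arguments: install root, link name, version directory name.
switch_slurm_link() {
    local base
    local name
    local version
    base="$1"
    name="$2"
    version="$3"
    if [ ! -d "$base/$version" ]; then
        echo "no such install: $base/$version" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace the existing link,
    # -n: treat the old link as a file instead of descending into its target
    ln -sfn "$base/$version" "$base/$name"
}
```

For example, `switch_slurm_link /tools/admin/slurm ctld 20.02.5` followed by a restart of the controller service (assuming the unit is named slurmctld). Following the upgrade order, the dbd link would be switched first, then ctld, then d.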
The upgrade process for
slurmctld is quite straightforward: just follow the procedure for database backup and upgrade as mentioned in the docs. For
slurmd, the upgrade procedure (backup of
StateSaveLocation and restart of
slurmd) needs to happen on each worker node. This can best be automated using
Ansible. As noted in the Slurm admin documentation, a single upgrade can span at most two major releases. Due to a security issue, older Slurm versions are not available from the main download page, but you can still get them from GitHub (e.g. version 19.05).
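The two-release limit can be checked before starting. The sketch below assumes a hand-maintained list of major releases (only the ones mentioned in this post are listed) and full version strings of the form X.Y.Z; the names `SLURM_RELEASES`, `release_index` and `can_upgrade` are hypothetical.

```shell
# Major releases, oldest first. Maintained by hand (an assumption of this
# sketch); extend it as new releases appear.
SLURM_RELEASES="17.11 18.08 19.05 20.02"

# Print the 1-based position of a major release in SLURM_RELEASES.
release_index() {
    local want
    local i
    local r
    want="$1"
    i=0
    for r in $SLURM_RELEASES; do
        i=$((i + 1))
        if [ "$r" = "$want" ]; then
            echo "$i"
            return 0
        fi
    done
    return 1
}

# Succeed if upgrading from $1 to $2 (e.g. 18.08.9 to 20.02.5) spans at most
# two major releases. Returns 2 if either release is not in the list.
can_upgrade() {
    local from
    local to
    # ${1%.*} strips the patch level: 18.08.9 becomes 18.08
    from=$(release_index "${1%.*}") || return 2
    to=$(release_index "${2%.*}") || return 2
    [ "$to" -ge "$from" ] && [ $((to - from)) -le 2 ]
}
```

With this list, `can_upgrade 18.08.9 20.02.5` succeeds, while `can_upgrade 17.11.13 20.02.5` fails, which would mean stepping through an intermediate version first.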
A final note - this procedure, with per-version symlinks etc, was based on something I read online before executing at SANBI. I can't recall where I read this but if you were the source and would like credit, please look me up and let me know.
P.S. after the system upgrade, I recompile and re-install the slurm-drmaa module that you can find here.