So what’s in my sample?

When dealing with clinical samples like those from SARS-CoV-2 (COVID-19) testing, one invariably ends up with a mix of organisms in the sample being sequenced. The same can be true for cultured organisms, where we don't know if any contaminants have crept into the sample, especially if the sample was sent from elsewhere and we don't know what protocols were followed in preparing it for sequencing. This little guide will show you how to use Galaxy (specifically on the server) to make sense of which species you have sequenced.

I will start with two samples in a Galaxy history. These are single-end samples from Oxford Nanopore sequencing. I presume you have uploaded these and ensured that they are fastqsanger.gz datatypes.

The first step is to organise these into a collection. Since these are single-end samples I can use a list collection; for paired-end samples you'd want a list of pairs. Tick the checkbox to select multiple samples in a history, select all the samples that you want to group together and then choose Build Dataset List or Build List of Dataset Pairs as appropriate.

Then use kraken2 to assign taxonomic labels to each read. The key options to choose here are, firstly, the input: select the folder icon so that you can use the collection as input. Then select the correct database; it should be a copy of the Standard database, as recent as possible.

Galaxy will run a copy of kraken2 for each dataset. This will take some time (several minutes) but each sample is processed in parallel (up to the capacity of the Galaxy server). While kraken2 is running you can set up the next analyses:

  1. Convert Kraken data to Galaxy taxonomy representation using the most recent taxonomy database available. Select column 2 for the read name and column 3 for the taxonomy ID.

  2. Krona pie chart from taxonomic profile - here you can choose what resolution to display: from Class for a high-level summary down to e.g. Species for a very detailed view. Note that kraken2's resolution might not always be accurate down to the lowest taxonomic levels.
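Once kraken2 finishes, you can also inspect its per-read output directly. Below is a minimal sketch that tallies classified reads per taxonomy ID, assuming kraken2's standard per-read output format (tab-separated: classified flag, read name, taxonomy ID, read length, LCA k-mer mapping); the read names and taxonomy IDs shown are illustrative only:

```python
from collections import Counter

def tally_kraken2(lines):
    """Tally classified reads per taxonomy ID from kraken2's per-read
    output: tab-separated classified flag (C/U), read name, taxonomy
    ID, read length, LCA k-mer mapping."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0] == "C":   # keep classified reads only
            counts[fields[2]] += 1        # third column: taxonomy ID
    return counts

reads = [
    "C\tread1\t9606\t150\t...",     # e.g. human
    "C\tread2\t2697049\t150\t...",  # e.g. SARS-CoV-2
    "U\tread3\t0\t150\t...",        # unclassified
]
print(tally_kraken2(reads))
```

This is the same pair of columns (read name in column 2, taxonomy ID in column 3) that the taxonomy conversion step above consumes.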

Depending on the load on the Galaxy server, you might wait some time for your jobs to start running. Once they do, however, each of the analyses mentioned above will be run in sequence.

All analyses until the Krona pie chart one produce a list of outputs, one output for each input sample. The Krona pie chart analysis produces a single output with a selector that will show the pie chart for each sample. In this screenshot the first dataset is shown: sample002.fastq.gz from the original sample list.

The Krona pie chart can be explored to focus on the different segments, and a snapshot can be taken, giving the option to download the visualisation as an SVG graphic.

Turning your analysis into a workflow

Finally, in the future one might want to repeat the steps of this analysis using a workflow rather than running them one by one. To create a workflow, select Extract workflow from the History menu (the drop-down button at the top right).

You can then give a descriptive name to the workflow and save it for later use. Select Create workflow to create a re-usable workflow.

And then from the next page edit the workflow to see it in detail. In the workflow editor you can describe the inputs needed and select which steps' outputs you want to see. In my example, I have unselected the blue checkboxes for the Kraken2 and Convert Kraken2 tools, as those are intermediate results that I am not generally interested in. Once you save the resulting workflow, you can re-use it from the Workflow menu.

I have made my workflows for Paired End Samples and for Single Ended Samples available for re-use on the server.

Acknowledgements and references

The idea for this post was partly inspired by this thread on the Galaxy help forum. And of course I am using these awesome tools:

Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2013). Krona: Interactive Metagenomic Visualization in a Web Browser. In K. E. Nelson (Ed.), Encyclopedia of Metagenomics (pp. 1–8). Springer.
Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Čech, M., Chilton, J., Clements, D., Coraor, N., Grüning, B. A., Guerler, A., Hillman-Jackson, J., Hiltemann, S., Jalili, V., Rasche, H., Soranzo, N., Goecks, J., Taylor, J., Nekrutenko, A., & Blankenberg, D. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46(W1), W537–W544.
Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257.

Galaxy 22.05 and the cluster_venv

I recently upgraded one of my Galaxy servers to version 22.05 (once again using Ansible and the excellent Galaxy Ansible roles). Unfortunately this exposed a problem with our environment. The Galaxy server that I configured is running on Ubuntu 20.04, but the HPC cluster's worker nodes use Ubuntu 18.04. Ubuntu 18.04 comes with Python 3.6, which is too old for modern Galaxy.

Thanks to advice from the Galaxy admins' chat, I installed a Conda environment by downloading Miniconda3 and doing a batch install (-b -p GALAXY_DIR/conda, where GALAXY_DIR is my Galaxy installation directory). I then activated it with source GALAXY_DIR/conda/bin/activate and proceeded to set up Galaxy as described in this post. My GALAXY_VIRTUAL_ENV in job_conf.xml points to the location of the virtualenv.

The end result is that Galaxy uses this virtual environment to provide Python on cluster nodes, and job submission now works again.

Loading Nanopore data into Galaxy

Basecalled nanopore data is output in a structure like this:


In other words, for each barcode there is a collection of multiple fastq files. While it is straightforward to concatenate these together on the command line, for non-command-line-users that is not as easy. To get this into Galaxy for analysis, you can follow these steps:

  1. Upload the whole fastq_pass directory to the Galaxy server using FTP upload
  2. Using the Rule Based Uploader, select that you want to create Collections and that your source is the FTP Directory, then use the rules below. These rules expect there to be a single directory in your FTP area containing the barcode directories; if that is not the case, you need to change the .*/barcode\\d+/(.*) regular expression to match your upload directory. For example, if your upload directory was called fastq_pass, make it fastq_pass/barcode\\d+/(.*). The rules also assume that your data is gzipped FASTQ; if it is not, change fastq.gz$ to fastq$ and fastqsanger.gz to fastqsanger. You can also find these rules here:
    {
      "rules": [
        {
          "type": "add_column_metadata",
          "value": "path"
        },
        {
          "type": "add_filter_regex",
          "target_column": 0,
          "expression": "fastq.gz$",
          "invert": false
        },
        {
          "type": "add_column_regex",
          "target_column": 0,
          "expression": ".*/barcode\\d+/(.*)",
          "group_count": 1
        },
        {
          "type": "add_column_regex",
          "target_column": 0,
          "expression": ".*/(barcode\\d+)/.*",
          "group_count": 1
        },
        {
          "type": "swap_columns",
          "target_column_0": 1,
          "target_column_1": 2
        }
      ],
      "mapping": [
        {
          "type": "ftp_path",
          "columns": [
            0
          ]
        },
        {
          "type": "list_identifiers",
          "columns": [
            1,
            2
          ],
          "editing": false
        }
      ],
      "extension": "fastqsanger.gz"
    }
  3. Give the collection a name (e.g. samples1) and Upload. A new list of lists called samples1 will be created.
  4. Create a tabular mapping file with the first column being the barcode name and the second column being the sample name that you want to use. You can either create this as a TSV (e.g. using Excel) or type it into the "Paste" box in the Galaxy uploader (making sure to select the tabular datatype and the "convert spaces to tabs" option in the settings).
  5. Using the input collection and tabular renaming file as inputs, run this workflow. The result is a list with the elements of the list being the concatenated reads from the barcode directories.

You can watch a video demo of this method here.
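The regular expressions in the rules above can be tried outside Galaxy too. Here is a small sketch of what the two patterns do, i.e. extracting the barcode (outer list identifier) and the filename (inner list identifier) from a path; the example path is hypothetical:

```python
import re

# The two patterns from the rules: one captures the filename after a
# barcodeNN directory, the other captures the barcode itself.
FILE_RE = re.compile(r".*/barcode\d+/(.*)")
BARCODE_RE = re.compile(r".*/(barcode\d+)/.*")

def identifiers(path):
    """Return (barcode, filename) for a gzipped FASTQ under a barcode
    directory, or None if the path would be filtered out."""
    barcode = BARCODE_RE.match(path)
    fname = FILE_RE.match(path)
    if barcode and fname and fname.group(1).endswith("fastq.gz"):
        return barcode.group(1), fname.group(1)
    return None

print(identifiers("fastq_pass/barcode01/reads0.fastq.gz"))
# ('barcode01', 'reads0.fastq.gz')
```

Paths that don't match a barcode directory, or that aren't gzipped FASTQ, are dropped, mirroring the add_filter_regex rule.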

Making a Galaxy data manager idempotent (a hack)

In programming the term idempotent is used to mean that you can run a function or an application more than once and get the same results each time. In this blog post I will be discussing a hack I used in the primer_scheme_bedfiles data manager. The data manager installs reference files used by pipelines designed for processing ARTIC protocol amplicon sequencing results. The ARTIC / PrimalSeq protocol uses tiled PCR to amplify a genetic sample of a virus, and this approach is the backbone of most SARS-CoV-2 sequencing happening around the world today. A pre-sequencing step of the protocol uses pools of specially designed primers, and the sequence corresponding to these primers needs to be removed from the reads during the bioinformatic analysis. Primer locations are described in BED-like files that get fed into tools like ARTIC minion and ivar trim. To avoid the need for the user to supply these each time an analysis is run, the primer_scheme_bedfiles data manager stores these in a shared data area accessible to all tools on the Galaxy server it runs on.

While users can supply their own BED files to upload a new primer scheme description, most commonly users will want to download them from the relevant websites where they are published. This is where a challenge comes in: only one copy of each primer scheme file should be stored, but the download module does not know which schemes the Galaxy server has installed. Offering to install the same scheme twice is an error. The solution is to use a little-known feature of Galaxy tool authoring: the Galaxy tool actually has access to the state of the currently running Galaxy server (using the $__app__ variable) and this can be used to look up the contents of data tables. Here is the code in question:

            #set $data_table = $__app__.tool_data_tables.get("primer_scheme_bedfiles")
            #if $data_table is not None:
                #set $known_primers = [ row[0] for row in $data_table.get_fields() ]
                #set $primer_list = ','.join([ primer_name for primer_name in $input.primers if primer_name not in known_primers ])
            #else:
                #set $primer_list = $input.primers
            #end if

The reference to $__app__.tool_data_tables is a ToolDataTableManager (defined here) which effectively acts like a hash keyed on the tool data table name. The primer_scheme_bedfiles tool data table is a TabularToolDataTable, which in turn provides the get_fields() method that returns a two-dimensional array. The first dimension is rows, and in this particular table the first column is the value field, i.e. the name of the primer scheme. The code above then computes the difference between the primer schemes requested by the user and the ones already installed and only downloads those that aren't already present. It could have used sets instead of lists for a little extra efficiency, but the logic would largely remain the same.

One peculiarity discovered was the processing of the list comprehension: the primer_name variable is used without a $ because the Cheetah template system wants it that way. $input.primers becomes a string (val1,val2) when used as a value in the template, but in the context of #set and the list comprehension it is a list (['val1', 'val2']).
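The same logic can be written as plain Python, which makes the set-versus-list point above concrete. In this sketch, installed_rows stands in for what get_fields() returns, and the scheme names are illustrative, not necessarily real data table contents:

```python
def schemes_to_download(requested, installed_rows):
    """Plain-Python rendition of the Cheetah logic: installed_rows
    mimics get_fields(), a list of rows whose first column is the
    'value' field (the primer scheme name). Returns only the schemes
    not already present in the data table."""
    known = {row[0] for row in installed_rows}   # a set, for fast lookup
    return [name for name in requested if name not in known]

# Illustrative scheme names:
rows = [["nCoV-2019-V3", "..."], ["nCoV-2019-V4", "..."]]
print(schemes_to_download(["nCoV-2019-V3", "nCoV-2019-V4.1"], rows))
# ['nCoV-2019-V4.1']
```

Requesting an already-installed scheme is simply a no-op, which is exactly the idempotent behaviour the data manager needs.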

Note that as Marius van den Beek pointed out when I mentioned this technique on Twitter, using $__app__ to look inside the state of the Galaxy server is not recommended and may well break in the (near?) future. I hope that by that time Galaxy gives you a different way to see the current state of a data table to preserve the possibilities of idempotent behaviour.

I'm largely writing this post to document how things work (this is documented in the comments in the code) in case I forget next time I need to write a similar DM or in case this post helps others in the Galaxy tool developer community.

An upgrade-friendly Slurm Installation

At SANBI we have a small HPC (see our Annual report) that uses Slurm as a scheduler. It's always a good idea to keep this up to date, and unfortunately the version available in the Ubuntu package repository tends to be quite old (e.g. 17.11 for Ubuntu 18.04 and 19.05 for Ubuntu 20.04).

The Slurm upgrade procedure is mentioned in their Quick Start Administrator Guide. In short, the daemons need to be upgraded in a specific order, starting with slurmdbd, followed by a nested upgrade of slurmctld and the slurmd on each compute node. To facilitate this process our install is on shared storage (CephFS, but could also be NFS) and looks as follows:

    ├── 18.08.9
    ├── 19.05.7
    ├── 20.02.5
    ├── ctld -> /tools/admin/slurm/20.02.5
    ├── current -> /tools/admin/slurm/20.02.5
    ├── d -> /tools/admin/slurm/20.02.5
    ├── dbd -> /tools/admin/slurm/20.02.5
    ├── etc

To install Slurm, the source is unpacked and compiled, with configure options like:

    ./configure --prefix=/tools/admin/slurm/20.02.5 --sysconfdir=/tools/admin/slurm/etc

As can be seen from the above listing, the d, ctld, dbd and current links link to the current version of Slurm in use.
Each daemon is managed by systemd and configured with a file in /etc/systemd/system. For example, here is the configuration of slurmctld (i.e. /etc/systemd/system/slurmctld.service):

    [Unit]
    Description=Slurm controller daemon
    After=munge.service

    [Service]
    ExecStartPre=-/usr/bin/pkill -KILL slurmctld
    ExecStart=/tools/admin/slurm/ctld/sbin/slurmctld $SLURMCTLD_OPTIONS
    ExecReload=-/bin/kill -HUP $MAINPID
    ExecStop=-/usr/bin/pkill -KILL slurmctld


After installing this file, you need to run sudo systemctl daemon-reload.

Note the ExecStart line in the above config file: the executable is run via the /tools/admin/slurm/ctld/sbin folder. Because /tools/admin/slurm/ctld is a symlink, upgrading slurmctld involves simply changing the symlink to point to the new Slurm version.

The upgrade process for slurmdbd and slurmctld is quite straightforward: just follow the procedure for database backup and upgrade as mentioned in the docs. For slurmd, the upgrade procedure (backup of StateSaveLocation and restart of slurmd) needs to happen on each worker node. This is best automated using Ansible. As noted in the Slurm admin documentation, you can at most upgrade between two major releases. Due to a security issue, older Slurm versions are not available from the main download page, but you can still get them from GitHub (e.g. version 19.05).

A final note - this procedure, with per-version symlinks etc, was based on something I read online before executing at SANBI. I can't recall where I read this but if you were the source and would like credit, please look me up and let me know.

P.S. after the system upgrade, I recompile and re-install the slurm-drmaa module that you can find here.

Galaxy 21.05 upgrade and cluster_venv

As part of my (rather prolonged) work towards a M.Sc. in bioinformatics, I maintain a Galaxy server at SANBI. I've recently upgraded to Galaxy 21.05, at the time of this writing the latest Galaxy release. You can read more about that release here.

My Galaxy server is deployed using Ansible with a combination of the standard Galaxy roles and ones developed at SANBI to match our infrastructure. Specifically, we have roles for integrating with our infrastructure's authentication, monitoring and CephFS filesystem. I also wrote a workaround for deploying letsencrypt based SSL. You can find this configuration in this repository.

The Galaxy server integrates with our cluster, whose worker nodes run Ubuntu 18.04 (the Galaxy server is on Ubuntu 20.04). For a number of tasks, such as "finishing" jobs (i.e. feeding results back into Galaxy), Galaxy requires tools to have access to Python libraries that are not part of core Python. In the past I have found that using the single virtualenv that the Galaxy roles configure on the Galaxy server causes problems when running jobs on the cluster, so I have a specific venv for running on the cluster, configured on the cluster itself. After the Galaxy server install was completed, I logged into one of the cluster worker nodes as root, deleted the old cluster_venv and ran:

    cd /projects/galaxy/pvh_masters_galaxy1
    export GALAXY_VIRTUAL_ENV=$(pwd)/cluster_venv
    cd server
    scripts/ --skip-client-build --skip-samples

Obviously it would be better to automate the above, but I have not got around to doing so yet. I'm not sure if this is the best approach but it works at least for our environment, so I'm writing this blog post in case it is useful to others (or to jog my own memory down the line!). This cluster_venv setup is exposed to the job runners in job_conf.xml - here is a snippet of my configuration:

    <plugins workers="4">
        <plugin id="local" type="runner" load=""/>
        <plugin id="slurm" type="runner" load=""/>
    </plugins>
    <destinations default="dynamic">
        <destination id="slurm" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
        </destination>
        <destination id="local" runner="local"/>
        <destination id="dynamic" runner="dynamic">
            <param id="tmp_dir">True</param>
            <param id="type">dtd</param>
        </destination>
        <destination id="cluster_default" runner="slurm">
            <param id="tmp_dir">True</param>
            <env id="SLURM_CONF">/tools/admin/slurm/etc/slurm.conf</env>
            <env id="GALAXY_VIRTUAL_ENV">/projects/galaxy/pvh_masters_galaxy1/cluster_venv</env>
            <env id="GALAXY_CONFIG_FILE">/projects/galaxy/pvh_masters_galaxy1/config/galaxy.yml</env>
            <param id="nativeSpecification">--mem=10000</param>
            <resubmit condition="memory_limit_reached" destination="cluster_20G" />
        </destination>
    </destinations>

P.S. this was the only manual task I had to perform (on the Galaxy side of things). Mostly the update consisted of updating our SANBI ansible roles to support Ubuntu 20.04 (and Ceph octopus), switching to the latest roles (as described in the training material for Galaxy admins), flicking the version number from release_20.09 to release_21.05 and running the Ansible playbook.

Solving Bluetooth Audio Delay on Ubuntu 20.04

For quite some time I've been frustrated at the state of Ubuntu's support for Bluetooth audio. As a result, I've always gone with wired headphones. Now maybe it's something about me, but I've not had the best of luck with those: headphones last a few months before some plug or wire breaks, and in the worst case the headphone jack on my laptop starts giving issues... so wireless is great. If it works.

I had to replace my headset recently and bought a HAVIT H2590BT, a fairly entry-level thing. It worked fine for listening to music, but as soon as I wanted more "real time" audio there were problems. I first noticed this on Duolingo, and a bit of debugging showed that the problem was a delay in the audio. This became rather embarrassing when doing a call with colleagues.

Turns out that Bluetooth has audio profiles that affect the operation of the headset. A2DP focuses on giving the best audio quality, whereas HFP and HSP are more focused on real-time responsiveness. Unfortunately, with the standard PulseAudio (13.99.1) on my Ubuntu 20.04 I could only connect to the headphones using A2DP. I came across posts from 2015 onwards talking about this issue, some of which suggested switching profiles, but I couldn't seem to get that right.

Then I found this post from @normankev141. Unfortunately the plugin that he suggested has been deprecated by the author, who suggested moving to PipeWire. I switched to PipeWire using the instructions from this askubuntu post, rebooted and now I've got a much richer selection of profiles:

$ pactl list cards
Card #54
Name: bluez_card.C5_78_21_3A_9F_DB
Driver: module-bluez5-device.c
Owner Module: n/a
    off: Off (sinks: 0, sources: 0, priority: 0, available: yes)
    a2dp-sink: High Fidelity Playback (A2DP Sink) (sinks: 1, sources: 0, priority: 0, available: yes)
    headset-head-unit: Headset Head Unit (HSP/HFP) (sinks: 1, sources: 1, priority: 0, available: yes)
    a2dp-sink-sbc: High Fidelity Playback (A2DP Sink, codec SBC) (sinks: 1, sources: 0, priority: 0, available: yes)
    headset-head-unit-cvsd: Headset Head Unit (HSP/HFP, codec CVSD) (sinks: 1, sources: 1, priority: 0, available: yes)
Active Profile: a2dp-sink-sbc

I can now switch profiles with pactl set-card-profile bluez_card.C5_78_21_3A_9F_DB a2dp-sink or pactl set-card-profile bluez_card.C5_78_21_3A_9F_DB headset-head-unit. I've made these two aliases for my shell:

alias goodaudio="pactl set-card-profile \$(pactl list cards |grep 'Name: bluez' |awk '{print \$2}') a2dp-sink"
alias headset="pactl set-card-profile \$(pactl list cards |grep 'Name: bluez' |awk '{print \$2}') headset-head-unit"

I haven't yet got around to linking these to some kind of Gnome utility so that I can toggle the profiles from the desktop. It's on the TODO list.
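For scripting beyond shell aliases, the card name can also be extracted in Python rather than with grep/awk. A small sketch (the sample output below is abbreviated from the pactl listing above):

```python
def bluez_card_name(pactl_output):
    """Extract the bluez card name from `pactl list cards` output,
    like the grep/awk pipeline in the aliases above."""
    for line in pactl_output.splitlines():
        line = line.strip()
        if line.startswith("Name: bluez"):
            return line.split(None, 1)[1]
    return None

sample = """Card #54
\tName: bluez_card.C5_78_21_3A_9F_DB
\tDriver: module-bluez5-device.c"""
print(bluez_card_name(sample))  # bluez_card.C5_78_21_3A_9F_DB
```

The returned name could then be passed to pactl set-card-profile via subprocess.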

Galaxy and the notorious rfind() error: HistoryDatasetAssociation objects aren’t strings

I have repeatedly triggered this error when writing Galaxy tool wrappers:

2018-03-26 13:36:58,408 ERROR [] (2) Failure preparing job
Traceback (most recent call last):
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/runners/", line 170, in prepare_job
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/jobs/", line 909, in prepare
self.command_line, self.extra_filenames, self.environment_variables =
File "/tmp/tmp9OzX0h/galaxy-dev/lib/galaxy/tools/", line 445, in build
raise e
AttributeError: 'HistoryDatasetAssociation' object has no attribute 'rfind'

The command in the tool wrapper at the time included:

#import os.path
#set $report_name = os.path.splitext(os.path.basename($input_vcf))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

In the main part of the template, $input_vcf, which is a reference to an input dataset, effectively behaves like a string, as
it is substituted with the filename of the input dataset. In the #set part, however, it is a Python variable that refers
to the underlying HistoryDatasetAssociation. Thus the obscure looking error message, because a HDA is indeed not a string
and has no .rfind() method.

The error can be fixed by wrapping $input_vcf in a str() call to convert it into its string representation, i.e.
the filename I am interested in:

#import os.path
#set $report_name = os.path.splitext(os.path.basename(str($input_vcf)))[0] + '.html'
tbvcfreport generate '$input_vcf' &&
mv '$report_name' $output

Thanks to Marius van den Beek (@mvdbeek) for catching this for me.
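The behaviour is easy to reproduce outside Galaxy with a stand-in class (FakeHDA below is purely illustrative, not Galaxy's real class, and the path it returns is made up):

```python
import os.path

class FakeHDA:
    """Illustrative stand-in for a HistoryDatasetAssociation: renders
    as a file path via __str__ but is not itself a string."""
    def __str__(self):
        return "/galaxy/files/dataset_42.dat"

hda = FakeHDA()
# Passing hda directly to os.path.basename() fails: under Python 2 it
# called .rfind() on its argument (hence the AttributeError in the
# traceback above); Python 3 raises a TypeError instead.
print(os.path.basename(str(hda)))  # dataset_42.dat
```

Converting with str() first gives basename() the plain path string it expects.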

A Galaxy 18.01 install

We are preparing for Galaxy Africa in a few weeks' time, which will feature some Galaxy training. In preparation for that, I installed a new Ubuntu 16.04 virtual machine to host an 18.01 Galaxy server. The aim is to set up a production Galaxy server. To that end, the server is hosted on a 1 TB Ceph RBD partition mounted on /galaxy. A user called galaxyuser was created in our FreeIPA authentication environment, and /galaxy/galaxysrv was created to host Galaxy files.

The first step of setup was to clone the Galaxy 18.01 release and configure it for production use. The postgresql database server was installed, a database user was created for galaxyuser, and that user was then used to create the galaxy database. I configured the database and added myself as an admin user.

The next step was to install nginx. As far as possible I tried not to alter the "out of the box" nginx configuration, to make it easier to do upgrades later. To that end, firstly, an SSL certificate was added using certbot and Let's Encrypt by installing the certbot and python-certbot-nginx packages, and running certbot --nginx certonly. This yielded /etc/letsencrypt/live/ and associated files. The /etc/nginx/ssl directory was created and a /etc/nginx/ssl/dhparam.pem file was created with openssl dhparam -out /etc/nginx/ssl/dhparam.pem 4096. This was in order to create a more secure configuration than the default, as explained here.

Following the instructions from the Galaxy Receiving Files with nginx documentation and advice from Marius van den Beek, nginx-extras was installed from the recommended PPA, yielding nginx, nginx-common and nginx-extras packages for version 1.10.3-0ubuntu0.16.04.2ppa1. Then a file /etc/nginx/conf.d/custom.conf was created with content as per this gist. This is effectively a combination of the options suggested by the Galaxy admin docs with those in /etc/letsencrypt/options-ssl-nginx.conf. The server configuration directives from the recommended Galaxy configuration were adapted and put in /etc/nginx/sites-available/galaxy. The resulting configuration is in this gist. Once added, the configuration was activated by removing the /etc/nginx/sites-enabled/default file and linking the galaxy configuration file in its place. Finally, /etc/nginx/nginx.conf was altered by changing the user used to run the server to galaxyuser (i.e. "user galaxyuser"). To connect Galaxy to nginx, the socket: option in the Galaxy config/galaxy.yml and the configuration in the nginx site configuration were harmonised as per the relevant documentation. Since the unix socket was not created on startup, a http connection and thus a TCP socket on localhost was used.

The third step was configuring Galaxy to start using supervisord. This was based on the [program:web] configuration from the Galaxy starting and stopping configuration guide. And this is where things started going wrong. Using this configuration, the data upload tool didn't work as it used the system Python, not Python from the Galaxy virtualenv configured in /galaxy/galaxysrv/galaxy/.venv. To ensure that the Galaxy virtualenv was activated before running the upload tool, the VIRTUAL_ENV config was added to the /etc/supervisor/conf.d/galaxy configuration, resulting in the config shown in this gist.

The fourth step was to configure CVMFS to allow access to the reference data collection used on I installed the cvmfs package by following the instructions to install the apt repository and then apt-get install cvmfs. The correct configuration was learned from Björn Grüning (@bgruening)'s bgruening/galaxy-stable Docker container with some help from @scholtalbers on gitter:

a. In /etc/cvmfs/domain.d/ put the line as per this gist.

b. In /etc/cvmfs/default.local put:


c. In /etc/cvmfs/keys/ put:

-----END PUBLIC KEY-----

d. Add the line /cvmfs /etc/auto.cvmfs to /etc/auto.master yielding a file looking like this gist.

e. A /cvmfs directory was created to be a mount point (mkdir /cvmfs).

f. The autofs service was restarted (systemctl restart autofs) and then a ls /cvmfs/ shows a pretty collection of reference data.

g. The config/galaxy.yml file was updated so that the tool_data_table_config_path key contains references to the files that are stored in CVMFS. The final value of this key was:

tool_data_table_config_path: /cvmfs/,/cvmfs/,config/tool_data_table_conf.xml

This might not fit on your screen, so see the config fragment here. After the update to the Galaxy config, all services were restarted (with sudo supervisorctl restart all).

My fifth and final step was to test the Galaxy server by installing bowtie2 from the toolshed and working through the first steps of the mapping tutorial. Both human (hg19) and fruitfly (dm3) reference genomes were downloaded (and apparently stored in /var/lib/cvmfs/shared) using CVMFS and the bowtie2 mapping was run against them successfully, yielding the results expected from the tutorial.

Future work? There is lots - I have to connect the server to a cluster, using Slurm, and enable Interactive Environments... I'll blog about that when I get there.

MegaRAID write cache policy with lsmcli

A couple of weeks ago I had a near disaster when some of our servers lost power while their RAID was set to "write-through" caching with non-working batteries. The result was filesystem corruption and failure on 2 out of 3 Ceph monitor servers. In the past I have written about using MegaCLI for RAID admin. MegaCLI has been replaced by StorCLI, which I found here on the Broadcom web pages. I unpacked the various zip files until I got the storcli-007.0309.0000.0000-1.noarch.rpm RPM and installed that to get the MegaRAID storcli64 tool. Instead of using that directly, I'm using the libstoragemgmt tools with the lsmcli tool. On CentOS 7 this required installing libstoragemgmt and libstoragemgmt-megaraid-plugin and starting the lsmd daemon (systemctl start libstoragemgmt).

With this all set up, I found the volume with lsmcli -u megaraid:// list --type VOLUMES:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// list --type VOLUMES
ID                               | Name | SCSI VPD 0x83                    | Size         | Disabled | Pool ID | System ID | Disk Paths
6003048016dfd2001cf1d19f0af655a3 | VD 0 | 6003048016dfd2001cf1d19f0af655a3 | 597998698496 | No       | :DG0    |           | /dev/sda  

then the volume-cache-info command:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-cache-info --vol  6003048016dfd2001cf1d19f0af655a3
Volume ID                        | Write Cache Policy | Write Cache | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Back         | Write Back  | Enabled           | Enabled    | Use Disk Setting   

and set the policy to AUTO (which means write-back when the battery is ok, write-through otherwise):

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-write-cache-policy-update --vol  6003048016dfd2001cf1d19f0af655a3 --policy AUTO
Volume ID                        | Write Cache Policy | Write Cache   | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Through      | Write Through | Enabled           | Enabled    | Use Disk Setting   

There doesn't seem to be a direct way to query the battery backup unit (BBU) with lsmcli, but /opt/MegaRAID/storcli/storcli64 show will show you what its status is.
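If you want to check cache policies from a script rather than by eye, the pipe-separated tables lsmcli prints are easy to parse. A rough sketch, assuming the layout shown in the output above (the sample is abbreviated to three columns):

```python
def parse_lsmcli_table(text):
    """Parse lsmcli's pipe-separated table output into one dict per
    row, keyed by the header row's column names."""
    lines = [l for l in text.splitlines() if l.strip()]
    headers = [h.strip() for h in lines[0].split("|")]
    return [dict(zip(headers, (c.strip() for c in row.split("|"))))
            for row in lines[1:]]

sample = """Volume ID | Write Cache Policy | Write Cache
6003048016dfd2001cf1d19f0af655a3 | Write Back | Write Back"""
for row in parse_lsmcli_table(sample):
    print(row["Write Cache Policy"])  # Write Back
```

A monitoring check could then alert whenever a volume's effective Write Cache drops to Write Through, which is exactly the condition that preceded the power-loss incident.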