MegaRAID write cache policy with lsmcli

A couple of weeks ago I had a near disaster when some of our servers lost power while their RAID was set to “write-back” caching despite having non-working batteries. The result was filesystem corruption and failure on 2 out of 3 Ceph monitor servers. In the past I have written about using MegaCLI for RAID admin. MegaCLI has been replaced by StorCLI, which I found on the Broadcom web pages. I unpacked the various zip files until I got to the storcli-007.0309.0000.0000-1.noarch.rpm RPM and installed that to get the MegaRAID storcli64 tool. Instead of using that directly, though, I’m using the libstoragemgmt tools, specifically lsmcli. On CentOS 7 this required installing libstoragemgmt and libstoragemgmt-megaraid-plugin and starting the lsmd daemon (systemctl start libstoragemgmt).

With this all set up, I found the volume with lsmcli -u megaraid:// list --type VOLUMES:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// list --type VOLUMES
ID                               | Name | SCSI VPD 0x83                    | Size         | Disabled | Pool ID | System ID | Disk Paths
6003048016dfd2001cf1d19f0af655a3 | VD 0 | 6003048016dfd2001cf1d19f0af655a3 | 597998698496 | No       | :DG0    |           | /dev/sda  

then the volume-cache-info command:

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-cache-info --vol  6003048016dfd2001cf1d19f0af655a3
Volume ID                        | Write Cache Policy | Write Cache | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Back         | Write Back  | Enabled           | Enabled    | Use Disk Setting   

and set the policy to AUTO (which means write-back when the battery is ok, write-through otherwise):

[root@ceph-mon2 ~]# lsmcli -u megaraid:// volume-write-cache-policy-update --vol  6003048016dfd2001cf1d19f0af655a3 --policy AUTO
Volume ID                        | Write Cache Policy | Write Cache   | Read Cache Policy | Read Cache | Physical Disk Cache
6003048016dfd2001cf1d19f0af655a3 | Write Through      | Write Through | Enabled           | Enabled    | Use Disk Setting   

There doesn’t seem to be a direct way to query the battery backup unit (BBU) with lsmcli but /opt/MegaRAID/storcli/storcli64 show will show you what the status is.
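If you want a quick way to check, something like this should do it (the controller index 0 here is an assumption; adjust it for your setup):

# a sketch: ask controller 0 about its battery backup unit
/opt/MegaRAID/storcli/storcli64 /c0/bbu show
# or dump everything about the controller, which also reports BBU state
/opt/MegaRAID/storcli/storcli64 /c0 show all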

Making Ubuntu 14.04 and CentOS 7 NFS work together

I just spent a frustrating morning configuring our servers to talk NFS to each other properly. So we have:

1) NFS servers (ceph-mon1 and so on) running CentOS 7.
2) NFS clients (gridj1 and so on) running Ubuntu 14.04.

The first problem: RBD mounting and NFS startup were not configured on the servers. I fixed that by adding entries in /etc/ceph/rbdmap and enabling the rbdmap and nfs-server services using systemctl enable. I also used e2label to label the ext4 filesystems in the RBDs and then used these labels in /etc/fstab instead of device names. And used the _netdev option in the mount options because these devices are network devices.
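For illustration, the rbdmap entry and fstab line look roughly like this (the pool, image and label names here are made-up examples, not our real ones):

# /etc/ceph/rbdmap: pool/image to map at boot, plus the credentials to use
rbd/nfsdata  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring

# /etc/fstab: mount by ext4 label, marked as a network device
LABEL=nfsdata  /export/nfsdata  ext4  defaults,_netdev  0 0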

The second problem: I had to add the insecure option to the exports in /etc/exports. This is because the mount request comes from a port higher than 1024, a so-called insecure port. And then exportfs -r to resync everything.
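The resulting export looks something like this (the path is an example, and gridj1 stands in for the client spec we actually use):

# /etc/exports: "insecure" allows mount requests from ports above 1024
/export/nfsdata  gridj1(rw,insecure)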

And the third problem: Ubuntu autofs makes an NFS4 mount request by default (even though I had specified nfsvers=3 in the mount options), and I haven’t configured NFS4’s authenticated mounts, so I was getting “authenticated mount request from …” messages in /var/log/messages on the NFS server. I switched the NFS server to not do NFS4 by adding --no-nfs-version 4 to the RPCNFSDARGS variable in /etc/sysconfig/nfs on the server, restarted the NFS server (systemctl restart nfs-server) and the mounts finally worked.
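For reference, the relevant line in /etc/sysconfig/nfs ends up looking like this (keep any arguments you already had in RPCNFSDARGS):

# /etc/sysconfig/nfs on the CentOS 7 server: don't serve NFSv4
RPCNFSDARGS="--no-nfs-version 4"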

Finally, documented this here for posterity…

docker -G and non-local groups

So we (like many other labs) store our user identity information in LDAP. I created a docker group in LDAP so that its membership is valid across our cluster. When I tried to run a docker command, however, I got this error:

Get http:///var/run/docker.sock/v1.20/containers/json: dial unix /var/run/docker.sock: permission denied.
* Are you trying to connect to a TLS-enabled daemon without TLS?
* Is your docker daemon up and running?

Turns out that /var/run/docker.sock was owned by root:root, not root:docker as expected. Running docker in debug mode I saw this message:

DEBU[0000] Warning: could not change group /var/run/docker.sock to docker: Group docker not found 

After a bit of poking around and verifying that the group did exist I came across the code in unix_socket.go. To make a longish (lines 41-83) story short, docker relies on libcontainer for its user/group lookups and these parse the /etc/group file, ignoring nsswitch.conf (and thus identity providers like LDAP).

If you use the numeric gid (docker daemon -G 555 for instance) then you get some strange messages in the log (example is from debug mode):

WARN[0000] Could not find GID 555                       
DEBU[0000] 555 group found. gid: 555

but the ownership of the docker Unix socket is set as expected.
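A workaround sketch, since getent does go through nsswitch (and therefore LDAP): look the GID up when starting the daemon instead of hard-coding it.

# resolve the docker group's GID via nsswitch/LDAP and pass it numerically
DOCKER_GID=$(getent group docker | cut -d: -f3)
docker daemon -G "$DOCKER_GID"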

Mixed HTTP / WordPress authentication with nginx

Our WordPress blog / site network uses Daniel Westermann-Clark’s HTTP Authentication plugin. Despite the scary warning on the plugin’s page, the setup has worked well for us over the years. The only problem has been allowing external (non-SANBI) users to edit pages. This is something we really need for the annual bioinformatics course website.

After quite a few false starts I realised that the solution is to use passive authentication, i.e. do not force the user to use HTTP authentication, but rather make it an optional extra. We use PAM for nginx authentication (via nginx_auth_pam, included in the nginx build in Ubuntu’s nginx-full package) and the crucial bit of the nginx config looks like this:

        location /sanbi-login.html {
            auth_pam              "SANBI authentication";
            auth_pam_service_name "nginx";
        }

        # Process only the requests to wp-login and wp-admin
        location ~ /wp-(admin|login|includes|content) {
            try_files $uri $uri/ /index.php?args;

            location ~ \.php$ {
                try_files $uri =404;
                include fastcgi_params;
                fastcgi_param REMOTE_USER $remote_user;
                fastcgi_index index.php;
                fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
                fastcgi_pass  php;
                fastcgi_intercept_errors on;
            }
        }
(Full nginx config is here). The nginx server forwards the REMOTE_USER variable to the php5-fpm backend but doesn’t actually force HTTP authentication for any of the WordPress content. Instead only a single page is protected with HTTP authentication, and that page simply redirects to the WordPress admin interface. Here’s the source of sanbi-login.html:

<title>SANBI login</title>
<meta http-equiv="refresh" content="0; url=/wp-admin" />

This works well because ours is a sub-domain based WordPress network. In the HTTP Authentication plugin settings, we point to the address of the sanbi-login.html page as the login page, so the login page ends up looking like this:

WP login screen.

The “Log in with HTTP Authentication” link takes you to the sanbi-login.html page, which requests HTTP authentication and immediately redirects back to /wp-admin. The login form itself uses conventional WordPress authentication.

One remaining challenge is how to set this option for all sites in the network. Currently it needs to be configured for each site individually. The setting is stored in the WordPress database in the wp_BLOGNUMBER_options table (where BLOGNUMBER is the number of the blog in the WordPress network) with option_name = ‘http_authentication_options’, as part of a serialised array. This code suggests how to set an option for all sites in a network but as yet there is no way to do this from the WordPress web interface, and I’m a bit loath to work on the database directly.
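For example, to inspect the current setting for blog number 2 (the database name and user below are placeholders):

# show the serialised http_authentication_options for blog number 2
mysql -u wpuser -p wordpress \
  -e "SELECT option_value FROM wp_2_options WHERE option_name = 'http_authentication_options'"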

An experimental Ceph storage cluster for the Computer Science Netlab

Ceph is an open source distributed object store and filesystem originally developed by Sage Weil for his PhD. Saying that again, but slower this time: Ceph is an object store where the objects are distributed over a collection of drives and servers. And there’s a filesystem component too. The basic building blocks of Ceph are object storage daemons (OSDs) and monitoring daemons (MONs). Each OSD manages a block of storage into which objects are placed, aggregated into placement groups (PGs). A data structure called the CRUSH map, essentially a distributed hash table, is used to look up where particular data objects are stored. Distributed hash tables underlie most distributed storage software (e.g. you also find one in GlusterFS). Keeping track of the state of everything are the MONs, which use the PAXOS algorithm to maintain a consistent set of knowledge about the state of the cluster. There is always an odd number of MONs: typically 3 for a small cluster.

Once you put it all together you get RADOS, the Reliable Autonomic Distributed Object Store (see the 2007 paper if you want). Storage in RADOS is divided into pools where you set per-pool policy for object size and striping and replication. So a pool might have say 4096 PGs containing objects of maximum 4 MB each filled with 64 KB stripes of data, replicated to ensure that there are 3 copies of each PG available at all times.
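As a concrete sketch (the pool name is hypothetical), creating such a pool and setting its replication level looks like this:

# create a pool named "demo" with 4096 placement groups...
sudo ceph osd pool create demo 4096
# ...and keep 3 copies of every object placed in it
sudo ceph osd pool set demo size 3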

Ceph provides 3 options for accessing RADOS: first there is the RADOS object gateway, a RESTful service allowing individual objects (e.g. VM images) to be stored and retrieved. Then there’s the RADOS Block Device (RBD), an iSCSI-like block device that can be mapped to a virtual drive on the client computer. And finally there is CephFS, a POSIX-compliant filesystem running on top of storage pools in RADOS (the filesystem uses an extra daemon, the metadata server (MDS), to map filesystem locations such as /home to objects).
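To make the RBD case concrete, mapping an image as a block device goes roughly like this (the pool and image names are made up):

# create a 4 GB image in the "demo" pool, map it, then treat it like any other disk
sudo rbd create demo/testimage --size 4096
sudo rbd map demo/testimage        # appears as e.g. /dev/rbd0
sudo mkfs.ext4 /dev/rbd0
sudo mount /dev/rbd0 /mnt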

The Ceph architecture allows heterogenous disks and servers to be aggregated into a storage pool in a way that is more flexible than traditional RAID and with substantially shorter rebuild times.

So to demo Ceph I set it up in the Computer Science Netlab at UWC, using a fair sprinkling of ansible scripting and the ceph-deploy tool. The architecture is three MONs (normandy, netlab2-ws and netlab6-ws) and three OSDs (netlab2-ws, netlab6-ws and netlab17-ws). On the OSDs I’ve used the existing filesystem as the Ceph store – Ceph should really have its own storage partitions, but I didn’t want to go to the trouble of repartitioning machines for the demo. The release used is Giant, the latest and as-yet-unreleased Ceph release.

TODO: detail the initial installation

I’ll hopefully get around to documenting the initial steps, but let me show how I add a new OSD to the Ceph cluster.

First, create the /dfs directory on the new OSD (in this case netlab10-ws). The “-s -k -K” arguments for ansible mean use sudo, ask for the SSH password and ask for the sudo password. Since the commands I’m running need to be run as root on the remote machine, I need all that.

ansible -s -k -K -m file -a "name=/dfs owner=root state=directory" netlab10-ws

Then add a user to the remote machine. The password that gets set is never actually used, since we’ll be using SSH public keys to log in, together with passwordless sudo.

ansible  -u pvh -k -K -m user -a 'name=netlab-ceph home=/dfs/netlab-ceph createhome=yes password="SOMEENCRYPTEDSTUFF" shell=/bin/bash comment="Ceph User" state=present' netlab10-ws

ansible -k -s -K -m authorized_key -a 'user=netlab-ceph key="SSH PUBLIC KEY GOES HERE"' netlab10-ws

ansible -k -K -s -m copy -a 'content="netlab-ceph ALL = (root) NOPASSWD:ALL\n" dest=/etc/sudoers.d/050_ceph mode=0444 owner=root' netlab10-ws 

Next we need to install an NTP daemon and synchronise time (we should actually run an in-lab NTP server, but for now we’re using a public one to set the time). Ceph relies on close time synchronisation between its nodes to operate.

ansible -s -k -K -m apt -a 'name=ntp state=present' netlab10-ws

ansible -s -k -K -m command -a 'ntpdate' netlab10-ws
ansible -s -k -K -m service -a 'name=ntp state=started enabled=true' netlab10-ws

Then use the ceph-deploy tool to install the actual ceph packages, create a directory for the OSD to put its data in, initialize, activate and we’re done!

ceph-deploy install --release=giant netlab10-ws

ansible -s -k -K -m file -a 'name=/dfs/osd3 state=directory' netlab10-ws

ceph-deploy osd prepare netlab10-ws:/dfs/osd3
ceph-deploy osd activate netlab10-ws:/dfs/osd3

You can check on the state of the cluster with sudo ceph status, where you’ll see something like this:

netlab-ceph@normandy:~$ sudo ceph status
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 18 pgs degraded; 220 pgs peering; 93 pgs stuck inactive; 93 pgs stuck unclean; recovery 296/2880 objects degraded (10.278%)
     monmap e1: 3 mons at {netlab2-ws=,netlab6-ws=,normandy=}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4413: 320 pgs, 3 pools, 3723 MB data, 960 objects
            63303 MB used, 799 GB / 907 GB avail
            296/2880 objects degraded (10.278%)
                  18 active+degraded
                 220 peering
                  82 active+clean
recovery io 11307 kB/s, 2 objects/s
  client io 6596 kB/s wr, 4 op/s

Or you can watch it rebalancing itself with ceph -w:

netlab-ceph@normandy:~$ sudo ceph -w
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 121 pgs degraded; 8 pgs recovering; 28 pgs stuck unclean; recovery 1308/3864 objects degraded (33.851%)
     monmap e1: 3 mons at {netlab2-ws=,netlab6-ws=,normandy=}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4445: 320 pgs, 3 pools, 4991 MB data, 1288 objects
            63424 MB used, 799 GB / 907 GB avail
            1308/3864 objects degraded (33.851%)
                 113 active+degraded
                 199 active+clean
                   8 active+recovering+degraded
recovery io 14070 kB/s, 3 objects/s

2014-10-16 11:44:39.069514 mon.0 [INF] pgmap v4445: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63424 MB used, 799 GB / 907 GB avail; 1308/3864 objects degraded (33.851%); 14070 kB/s, 3 objects/s recovering
2014-10-16 11:44:41.178062 mon.0 [INF] pgmap v4446: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63473 MB used, 799 GB / 907 GB avail; 1306/3864 objects degraded (33.799%); 9782 kB/s, 2 objects/s recovering

To remove an OSD, you can use these commands (using our newly created osd.3 as an example) – they take the OSD out of the storage cluster, stop the daemon, remove it from the CRUSH map, delete authentication keys and finally remove the OSD from the cluster’s list of OSDs.

netlab-ceph@normandy:~$ sudo ceph osd out 3
marked out osd.3. 
netlab-ceph@normandy:~$ ssh netlab10-ws sudo stop ceph-osd-all
ceph-osd-all stop/waiting
netlab-ceph@normandy:~$ sudo ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
netlab-ceph@normandy:~$ sudo ceph auth del osd.3
netlab-ceph@normandy:~$ sudo ceph osd rm 3
removed osd.3

Then, for good measure, you can remove the data:

netlab-ceph@normandy:~$ ssh netlab10-ws sudo rm -rf /dfs/osd3/\*

Also not covered in this blog post is how I added an RBD device and how I created and mounted a CephFS filesystem. Well… bug me till I finish writing this thing.

Installing Slurm on CentOS using Ansible

Helping the UWC Student Cluster Challenge team prepare for the final round (at the CHPC National Meeting) has given me an excuse to play with some new toys: I’ve got a mini-cluster of three VMs running on my laptop (using KVM and libvirt), and I’ve been looking into Slurm as a cluster scheduler. At SANBI we run SGE, lots of other people use Torque, but I’ve been interested in Slurm for a while, because it’s a fully open source scheduler with some big name users and seemingly a bright future. I’m also big into systems administration task automation: at SANBI we use puppet (and I personally use fabric), but Bruce Becker introduced me to Ansible, and so I took the opportunity to build an Ansible playbook to install Slurm on my mini-cluster.

Ansible playbooks are written in YAML and describe a set of tasks that need to be applied to a set of servers. These tasks are defined in terms of a set of modules and executed using ssh (optionally using ZeroMQ to speed up data transfer), so you need to have ssh access to the machines you want to administer. I did this by adding my ssh key to the authorized_keys file of the root user on each node of my mini-cluster. Like puppet recipes, ansible playbooks are (largely) declarative: you specify what you want, not how to achieve it. Unlike puppet recipes, ansible playbooks run tasks in order, first to last.

So my cluster has three nodes: head (the head node), and two workers: worker1 and worker2. These are on a private (virtual) LAN with DNS being provided by the head node (so the DNS names are head.cluster, etc). My laptop is the VM host and the IP addresses of all nodes are in /etc/hosts on the machine. The laptop has Ansible 1.4 installed from Rodney Quillo’s PPA. This is crucial: I use a bunch of features that are only available in 1.4.

The Slurm Quick Start Administrator Guide outlines the steps needed to install Slurm in a general way. First, I downloaded Slurm 2.6.4 and then installed the dependencies I needed to compile it:

openmpi-devel pam-devel hwloc-devel rrdtool-devel ncurses-devel munge-devel

This is not an exhaustive list: I had previously installed software on the nodes, so there might be stuff I left off this list. To identify missing dependencies, look at the config.log after the configure stage and search for WARNING messages. I unpacked Slurm and did:

./configure --prefix=/opt/slurm
sudo make install

This installed Slurm in /opt/slurm. I then created an archive of the slurm install:

cd /opt
tar jcf /var/tmp/slurm-bin.tar.bz2 slurm/*

I deployed this with Ansible to the nodes in the cluster. My Ansible setup uses a hostfile (/etc/ansible/hosts) that defines the hosts and host groups:
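
A minimal sketch of that file, assuming the host and group names used elsewhere in this post:

# /etc/ansible/hosts (sketch; the group layout is assumed from how the playbooks below target hosts)
[head]
head.cluster

[workers]
worker1.cluster
worker2.cluster

[cluster:children]
head
workers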


(You don’t need to use a system-wide hosts file like I did, you can specify an alternative hostfile with the -i flag on the ansible-playbook command line.) This was my initial ansible playbook (slurm.yml):

- hosts: cluster
  remote_user: root
  tasks:
    - name: create slurm user
      user: name=slurm createhome=no home=/opt/slurm shell=/sbin/nologin state=present

    - name: install slurm dependencies
      yum: name={{ item }} state=present
      with_items:
        - pam
        - hwloc
        - rrdtool
        - ncurses
        - munge

    - name: create slurm directories
      file: path=/var/spool/slurmd owner=slurm mode=0755 state=directory

    - name: copy slurm binaries to /tmp
      copy: src=slurm-bin.tar.bz2 dest=/tmp/slurm-bin.tar.bz2

    - name: unpack slurm binary distribution
      command: /bin/tar jxf /tmp/slurm-bin.tar.bz2 chdir=/opt

    - name: install slurm configuration file
      copy: src=slurm.conf dest=/opt/slurm/etc/slurm.conf
      notify: restart slurm

    # the filename of the profile.d script is assumed to be slurm.sh
    - name: install slurm path file in /etc/profile.d
      copy: src=slurm.sh dest=/etc/profile.d/slurm.sh mode=0755 owner=root

    - name: install slurm startup script in /etc/init.d
      copy: src=init.d.slurm dest=/etc/init.d/slurm mode=0755 owner=root

    - name: enable munge service
      service: name=munge state=started enabled=yes

    - name: enable slurm service startup
      service: name=slurm state=started enabled=yes

  handlers:
    - name: restart slurm
      service: name=slurm state=restarted

As mentioned previously, ansible playbooks are read top down. So the steps taken are:

  1. Create slurm user.
  2. Install slurm dependencies (the non-devel versions of the previously mentioned packages).
  3. Create slurm spool directory (/var/spool/slurmd) and make it owned by the slurm user.
  4. Upload and unpack the slurm-bin.tar.bz2 that was previously created.
  5. Install the slurm configuration file to /opt/slurm/etc/slurm.conf. The first draft of this was created with the Slurm 2.6 configuration tool and the final version is here.
  6. Install the script to /etc/profile.d that sets the PATH to include slurm binaries. This file contains:


    # sketch: PATH and MANPATH additions for the /opt/slurm prefix used above
    PATH=/opt/slurm/bin:/opt/slurm/sbin:$PATH
    if [ -z "$MANPATH" ] ; then MANPATH=/opt/slurm/share/man ; fi
    export PATH MANPATH

  7. Install init.d.slurm from the slurm distribution’s etc/ directory to /etc/init.d/slurm. This handles start/stop of both slurmctld (on head) and slurmd (on worker nodes).
  8. Ensure that the munge daemon is started. I have previously generated a munge key using the instructions on the munge website and distributed this to /etc/munge/munge.key on each of the nodes in the cluster. This was done with another ansible playbook (not shown).

  9. Once the file is installed the service is enabled (with something like chkconfig slurm on) and the slurm daemons are started (service slurm start).
  10. The playbook was then split up into Ansible roles, and roles were added to create an NFS server on the head node, sharing /home from the head node and mounting it over /home on the worker nodes.

Once this was all up and running, the system was tested by using sbatch to run a simple script. Here’s the script:


#!/bin/bash
echo Hello World

This was submitted with the command:


After that worked the AMG benchmark was run using MPI. Here’s the script:

#!/bin/bash
mpirun src/test/amg2013 -laplace -P 1 1 $SLURM_NTASKS -n 64 64 64 -solver 2

and run using:

sbatch -n 2

Compared to my experience with SGE, Slurm seems to run jobs really fast and compared to Torque+Maui it seems pretty easy to set up.

As mentioned above, I switched my playbook over to using Ansible roles. Roles allow you to split out the components of your configuration into a particular directory structure and then mix these into your final playbook. So the roles structure I currently have is:

├── munge
│   ├── files
│   │   └── munge.key
│   ├── handlers
│   │   └── main.yml
│   └── tasks
│       └── main.yml
├── nfs-client
│   └── tasks
│       └── main.yml
├── nfs-common
│   └── tasks
│       └── main.yml
├── nfs-server
│   ├── files
│   │   └── exports
│   ├── handlers
│   │   └── main.yml
│   └── tasks
│       └── main.yml
└── slurm
    ├── files
    │   ├── init.d.slurm
    │   ├── slurm-bin.tar.bz2
    │   ├── slurm.conf
    │   └──
    ├── handlers
    │   └── main.yml
    └── tasks
        └── main.yml

Effectively what Ansible roles do is to split the sections of your playbook out into a directory structure. This is then used in the final playbook (slurm.yml):

- hosts: cluster
  remote_user: root
  roles:
    - munge
    - slurm
    - nfs-common
  tasks:
    - name: disable firewall
      service: name=iptables enabled=no state=stopped

- hosts: head
  remote_user: root
  roles:
    - nfs-server

- hosts: workers
  remote_user: root
  roles:
    - nfs-client

And finally I’m at a stage where I can run:

ansible-playbook slurm.yml

And have the complete infrastructure required for a Slurm install set up on my virtual cluster.
[edited to add whitespace to Ansible playbooks as per suggestion from Michael de Haan @laserllama]


SFP, fastlink and funnies with a Dell switch

A couple of weeks ago I dropped in on the UWC Student Cluster Competition team to see how they were progressing with their cluster configuration, and I discovered that they were struggling with the networking on their cluster. As a test, they’d set up two Dell rack mounted servers, connected them to a 10 Gb switch (a Dell 8100 series switch as I recall) and then connected the switch to the campus network to try and get an IP via DHCP. The switch was getting an IP, but the servers weren’t. The servers are running CentOS, by the way.

As a test, we set up a DHCP server on Nicole’s laptop (we tried on Eugene’s first, but I just couldn’t quite get my head around how Arch Linux does things) and watched the traffic. After some time, we saw DHCP traffic and got DHCP working, but in a mysterious way: if we restarted the networking, the interface would fail to acquire an IP. Then if we ran ifup some time later, it would acquire an IP quite fine. After I left I decided to google around a bit (having discovered the LINKDELAY setting in CentOS network scripts), and lo and behold, someone else reported exactly the same problem and suggested that fastlink be enabled.
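For reference, LINKDELAY is a per-interface setting in the CentOS network scripts; the interface name and delay below are just examples:

# /etc/sysconfig/network-scripts/ifcfg-em1: wait up to 10 seconds for the link
# to come up before trying to configure the interface (e.g. via DHCP)
LINKDELAY=10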

So what is this fastlink thing? It seems that in other contexts it is called Port Fast, and it’s a known way to solve DHCP negotiation issues. By default network switches implement the Spanning Tree Protocol (STP) on their ports in order to organise the network into a spanning tree (and avoid loops or unreachable ports). This involves a delay as a port becomes active, and it is that delay which caused the DHCP query to time out and the problem we saw. If you know you have a host connected to a port, you can set Port Fast on that port, thereby avoiding the delay. Ah well, everyone has to encounter some or other funny the first time they set up a server. And by the way, for a rhythmic description of STP, consult the Algorhyme (or listen to it).
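Enabling it on a port is usually a one-liner in the switch CLI; the sketch below uses IOS-style syntax with a placeholder port, and the exact commands on the Dell will differ, so check its manual:

# on the switch, for a port with a single host attached (IOS-style syntax)
configure terminal
interface <port with the server attached>
spanning-tree portfast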

Cluster building at UWC

students assembling compute cluster

Motse, Eugene (hidden) and Saeed assembling a Dell R710 they’re going to use in their cluster

Last year the Centre for High Performance Computing (CHPC) ran a student cluster building competition for the first time, alongside their national meeting. The winning team progressed to the International Student Cluster Challenge in Leipzig and won top honours there. Observing the teams at work last year convinced me this is something we need to introduce to UWC, so this year, when David Macleod from the CHPC’s ACE Lab announced the second installment of the competition, I contacted Computer Science to make sure we had a team. From that side Reg Dodds is facilitating things, and their sysadmin, Daniel Leenderts, is offering a helping hand. The team is being mentored by Motse Lehata, and includes Warren Jacobus, Saeed Natha, Nicole Thomas and Eugene de Beste.

On Tuesday Long and I wandered over to CS to observe and assist with the unpacking and installation of the practice cluster that Dell had sponsored. I’ll be thin on the technical details in case they don’t want it shared, but it provides enough hardware for installing and testing an operating system and applications for benchmarking. I’m hoping to use this cluster building as an opportunity to get students (and faculty) interested in building cyberinfrastructure as an area for research and maybe even future careers. After all, right now I’ve got the distinct impression that the small number of people I know that run the (mostly Linux) servers that power South African e-Research infrastructure ended up in that career path largely by accident. With big international projects like the SKA and H3Africa coming on stream in the next few years, we’re going to need a much larger pool of expertise in scientific computing, High Performance Computing, scientific workflows (my personal research bugbear), data curation, storage and re-use, and so on. Right now, as far as I can see, there is no decent curriculum out there to train these people, something that I’m trying to address in my small way as part of the H3ABionet, and there is no clear track through the educational institutions into the research infrastructure (as opposed to pure research) side of things. It’s gotta change!

e-Research Africa 2013

Quite by accident I ended up attending (and speaking at) the e-Research Africa 2013 conference. This was held in Cape Town, and largely organised, I gather, by Ed Rybicki and Sakkie Janse van Rensburg, from UCT. Ed is the Academic Liaison to the UCT Research Portal project, and Sakkie is the Executive Director of ICTS (basically Campus IT services) at UCT. Sakkie was at the University of the Free State previously (which in my mind is currently most notable for providing employment to Albert van Eck, one of the more experienced HPC admins I know).

The conference started with a keynote from Paul Bonnington, the Director of e-Research at Monash University, and what struck me about Paul’s presentation was the careful attention given to the human and institutional factors that go into e-Research productivity. The topic was “eResearch: Building the Scientific Instruments of the 21st Century – 10 Lessons Learned”, and it set the tone for the conference with a few key messages:

  1. e-Research infrastructure is built for an unknown future. Paul gave the example of PlyC lysin, a novel bacteria-killing compound, data on whose structure was captured in 2008 and stored on Monash’s myTardis repository. This data was only analysed in 2011: i.e. careful capture and preservation of data from previous experiments was key to a major discovery. Contrast this with research and teaching pipelines that focus on single end points (papers or graduates). Which leads me to:
  2. e-Research infrastructure development should follow a spiral model. For those not familiar with spiral models, they’re a process model that Barry Boehm came up with in the 1980s and they’re specifically designed to manage successive iterations of requirements gathering, risk assessment, development and planning and…
  3. The role of the University is to be the enduring home for the e-Research process.

Think about this a bit: if research output is no longer (simply) papers, but also includes data and code, what allows research to have long term value? Long term, past research maintains value because it is kept accessible by a structure of support that provides it to present researchers. This is, in Paul’s vision, the university, but it’s also a set of people, technologies and processes. So it’s the data and code repositories, it’s the curation effort that ensures that data is stored in accessible ways and according to meaningful schema, it’s the metadata that allows us to find prior work. And value for whom? At the biggest picture level, society, but in a more immediate sense, value for researchers. Thus three more things:

  1. That “unknown future” is best known by people actually doing academic research. So their input in the “spiral” process is vital. In personal terms, I’m more than ever convinced that UWC needs an “e-Research Reference Group” drawn from interested academic staff from different departments that can outline requirements for future e-Research infrastructure.1
  2. Academics are, of course, not infrastructure builders. Infrastructure builders come in different forms – library people, IT people, etc – but in order to build effective e-Research infrastructure, they need to be partners with academics. In other words, there needs to be a common goal: research output. This is different to traditional “IT support”. In my little bubble at SANBI I’ve worked this way over the years: I’ll often partner with individuals or small groups to get work done, with them providing the “domain knowledge” and me grounding the process in computing realities (and hopefully adding a bit of software engineering wisdom etc).
  3. This partnership implies that there needs to be a growth path that recognises and rewards the work of these infrastructure-building partners.2 Paul referred to this as a “third track” in the university, distinct from both academic staff and non-academic support staff. (Ok this is a bit self-interested because I’ve been one of those “non-academic support staff (that participates in research)” for years.)

Ed’s written a blog post about the conference, and there were loads of interesting bits and pieces, such as Yvonne Sing Min’s work on building both a database (the “Vault”) and web front end to allow UCT researchers to have a central toolset for managing their research profiles (something similar to what we’re doing for H3ABionet with the NetCapDB), Hein de Jager mentioning that they’re using Backblaze storage pods at UCT (gotta go see those!), and Andre le Roux’s presentation on redesigning infrastructure to accommodate research, with its focus on people, process and technology. I fear that my talk on scientific workflow systems might have been pitched at the wrong level, but it happened regardless. The presentations are online; unfortunately they don’t include the presentations from day 4 (the workshop day) yet, so Dr Musa Mhlanga’s fascinating talk on using high throughput microscopy for studying biological pathways is missing. I (and other people) tweeted a bit from the conference, using the #eresearch2013 hashtag.

Besides the talks, there was some good networking, since admins / ops people from SANBI, UWC ICS, the University of Stellenbosch and UCT were all present at various times. We had a lunchtime meeting (along with Inus from the CHPC) to launch an HPC Forum, which basically means that we have a mailing list and also a set of physical meetings to share experience and knowledge with regards to running High Performance Computing sites. If you’re interested in this, drop me a mail.


1. As an illustration of investing in this unknown future, in “Where Wizards Stay Up Late: The Origins Of The Internet”, Hafner and Lyon report on J. C. R. Licklider’s request to buy a computer for BBN:

[Licklider] believed the future of scientific research was going to be linked to high-speed computers, and he thought computing was a good field for BBN to enter. He had been at BBN for less than a year when he told Beranek he’d like to buy a computer. By way of persuasion, Lick stressed that the computer he had in mind was a very modern machine—its programs and data were punched on paper tape rather than the conventional stacks of IBM cards.

“What will it cost?” Beranek asked him.
“Around $25,000.”
“That’s a lot of money,” Beranek replied. “What are you going to do with it?”
“I don’t know.”
Licklider was convinced the company would be able to get contracts from the government to do basic research using computers. The $25,000, he assured Beranek, wouldn’t be wasted.

None of the company’s three principals knew much about computers. Beranek knew that Lick, by contrast, was almost evangelistic in his belief that computers would change not only the way people thought about problems but the way problems were solved. Beranek’s faith in Licklider won the day. “I decided it was worth the risk to spend $25,000 on an unknown machine for an unknown purpose,” Beranek said.

2. For a little rant on how difficult hiring computational people to support biologists is, see C. Titus Brown’s “Dear Abby” blog post.


Gotchas in dual-mail-server setup

At SANBI we do spam filtering on a dedicated machine, where we run qpsmtpd with various plugins. This machine faces the big scary Internet, and any mail that passes its filters is delivered to our main mailserver, where the mailboxes live. Some years ago I wrote a plugin for qpsmtpd that does recipient checking, i.e. it connects to the main mailserver and uses the RCPT TO command to check if the mail can be delivered. I discovered a significant gotcha with this approach: any mail passing the spam filter was being accepted. I.e. I’d accidentally created an open relay (but only for mail that didn’t trigger the spam filters). So this post is just a note to self (and others that might make this mistake): your final mail server should treat the spam filtering proxy as an external mailserver, i.e. relaying should not be permitted. I did this by changing the mynetworks setting in the main mailserver’s Postfix configuration to exclude the spam filtering server’s IP. (Note that exclusions must come before inclusions in this setting, so !<spam filter IP> had to come before <spam filter IP’s network>.)
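In Postfix terms the change is a one-liner in main.cf; the addresses below are examples rather than our real ones:

# main.cf: trust the local network but explicitly exclude the spam filter host;
# the exclusion has to appear before the network that contains it
mynetworks = 127.0.0.0/8, !192.168.1.25, 192.168.1.0/24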

Now things are working again, and hopefully we’ll be out of the blocklists soon. However, I took the opportunity to look at what’s out there as filtering SMTP proxies, and it seems that Haraka is interesting. Haraka is Node.js based, so it’s an event-based server written (largely) in JavaScript. Kind of like Python’s Twisted. So maybe in the future we’ll switch to Haraka: that is, if we don’t just migrate all our mail to Gmail.

POSTSCRIPT: I forgot that we use our spam filter machine as a mailserver for external clients (when authenticated with SMTP AUTH), so my plan didn’t work. Turns out that what I actually needed was to enable the check_rcpt plugin together with my own plugin, because check_rcpt checks for mail relaying.

PPS: The correct response from a plugin if you think the message is kosher is DECLINED, not OK. OK means we’re sure the message is OK, whereas DECLINED means pass it to the next plugin. Drat!