e-Research Africa 2013

Quite by accident I ended up attending (and speaking at) the e-Research Africa 2013 conference. This was held in Cape Town, and largely organised, I gather, by Ed Rybicki and Sakkie Janse van Rensburg, from UCT. Ed is the Academic Liaison to the UCT Research Portal project, and Sakkie is the Executive Director of ICTS (basically Campus IT services) at UCT. Sakkie was previously at the University of the Free State (which in my mind is currently most notable for providing employment to Albert van Eck, one of the more experienced HPC admins I know).

The conference started with a keynote from Paul Bonnington, the Director of e-Research at Monash University, and what struck me about Paul's presentation was the careful attention given to the human and institutional factors that go into e-Research productivity. The topic was "eResearch: Building the Scientific Instruments of the 21st Century - 10 Lessons Learned", and it set the tone for the conference with a few key messages:

  1. e-Research infrastructure is built for an unknown future. Paul gave the example of PlyC lysin, a novel bacteria-killing compound: data on its structure was captured in 2008 and stored in Monash's myTardis repository, but only analysed in 2011. I.e. careful capture and preservation of data from previous experiments was key to a major discovery. Contrast this with research and teaching pipelines that focus on single end points (papers or graduates). Which leads me to:
  2. e-Research infrastructure development should follow a spiral model. For those not familiar with spiral models, they're a process model that Barry Boehm came up with in the 1980s and they're specifically designed to manage successive iterations of requirements gathering, risk assessment, development and planning and...
  3. The role of the University is to be the enduring home for the e-Research process.

Think about this a bit: if research output is no longer (simply) papers, but also includes data and code, what allows research to have long-term value? Past research maintains value because it is kept accessible by a structure of support that provides it to present researchers. This is, in Paul's vision, the university, but it's also a set of people, technologies and processes. So it's the data and code repositories, it's the curation effort that ensures that data is stored in accessible ways and according to meaningful schema, it's the metadata that allows us to find prior work. And value for whom? At the biggest-picture level, society, but in a more immediate sense, researchers. Thus three more things:

  1. That "unknown future" is best known by people actually doing academic research. So their input in the "spiral" process is vital. In personal terms, I'm more than ever convinced that UWC needs an "e-Research Reference Group", drawn from interested academic staff from different departments, that can outline requirements for future e-Research infrastructure.1
  2. Academics are, of course, not infrastructure builders. Infrastructure builders come in different forms - library people, IT people, etc - but in order to build effective e-Research infrastructure, they need to be partners with academics. In other words, there needs to be a common goal: research output. This is different to traditional "IT support". In my little bubble at SANBI I've worked this way over the years: I'll often partner with individuals or small groups to get work done, with them providing the "domain knowledge" and me grounding the process in computing realities (and hopefully adding a bit of software engineering wisdom etc).
  3. This partnership implies that there needs to be a growth path that recognises and rewards the work of these infrastructure-building partners.2 Paul referred to this as a "third track" in the university, distinct from both academic staff and non-academic support staff. (Ok this is a bit self-interested because I've been one of those "non-academic support staff (that participates in research)" for years.)

Ed's written a blog post about the conference, and there were loads of interesting bits and pieces, such as Yvonne Sing Min's work on building both a database (the "Vault") and a web front end to allow UCT researchers to have a central toolset for managing their research profiles (something similar to what we're doing for H3ABionet with the NetCapDB), Hein de Jager mentioning that they're using Backblaze storage pods at UCT (gotta go see those!), and Andre le Roux's presentation on redesigning infrastructure to accommodate research, with its focus on people, process and technology. I fear that my talk on scientific workflow systems might have been pitched at the wrong level, but it happened regardless. The presentations are online; unfortunately they don't yet include the presentations from day 4 (the workshop day), so Dr Musa Mhlanga's fascinating talk on using high throughput microscopy for studying biological pathways is missing. I (and other people) tweeted a bit from the conference, using the #eresearch2013 hashtag.

Besides the talks, there was some good networking, since admins / ops people from SANBI, UWC ICS, Stellenbosch University and UCT were all present at various times. We had a lunchtime meeting (along with Inus from the CHPC) to launch an HPC Forum, which basically means that we have a mailing list and a set of physical meetings to share experience and knowledge about running High Performance Computing sites. If you're interested in this, drop me a mail.


1. As an illustration of investing in this unknown future, in "Where Wizards Stay Up Late: The Origins Of The Internet", Hafner and Lyon report on J. C. R. Licklider's request to buy a computer for BBN:

[Licklider] believed the future of scientific research was going to be linked to high-speed computers, and he thought computing was a good field for BBN to enter. He had been at BBN for less than a year when he told Beranek he’d like to buy a computer. By way of persuasion, Lick stressed that the computer he had in mind was a very modern machine—its programs and data were punched on paper tape rather than the conventional stacks of IBM cards.

“What will it cost?” Beranek asked him.
“Around $25,000.”
“That’s a lot of money,” Beranek replied. “What are you going to do with it?”
“I don’t know.”
Licklider was convinced the company would be able to get contracts from the government to do basic research using computers. The $25,000, he assured Beranek, wouldn’t be wasted.

None of the company’s three principals knew much about computers. Beranek knew that Lick, by contrast, was almost evangelistic in his belief that computers would change not only the way people thought about problems but the way problems were solved. Beranek’s faith in Licklider won the day. “I decided it was worth the risk to spend $25,000 on an unknown machine for an unknown purpose,” Beranek said.

2. For a little rant on how difficult hiring computational people to support biologists is, see C. Titus Brown's "Dear Abby" blog post.


Gotchas in dual-mail-server setup

At SANBI we do spam filtering on a dedicated machine, where we run qpsmtpd with various plugins. This machine faces the big scary Internet, and any mail that passes its filters is delivered to our main mailserver, where the mailboxes live. Some years ago I wrote a plugin for qpsmtpd that does recipient checking, i.e. it connects to the main mailserver and uses the RCPT TO command to check if the mail can be delivered. I discovered a significant gotcha with this approach: any mail passing the spam filter was being accepted. I.e. I'd accidentally created an open relay (but only for non-spam-filter-triggering mail). So this post is just a note to self (and others that might make this mistake): your final mail server should treat the spam filtering proxy as an external mailserver, i.e. relaying should not be permitted. I did this by changing the mynetworks setting in the main mailserver's Postfix configuration to exclude the spam filtering server's IP. (Note that exclusions must come before inclusions in this setting, so !<spam filter IP> had to come before <spam filter IP's network>.)
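As a minimal sketch (with invented example addresses: assume the spam filter is 192.168.1.25 on the internal 192.168.1.0/24 network), the Postfix setting looks something like this:

```
# /etc/postfix/main.cf on the final mailserver
# the exclusion (!) must come before the network that contains it
mynetworks = 127.0.0.0/8, !192.168.1.25, 192.168.1.0/24
```

With this in place, mail arriving from the filter for local mailboxes is still accepted, but the filter is no longer trusted to relay to arbitrary external destinations.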

Now things are working again, and hopefully we'll be out of the blocklists soon. However, I took the opportunity to look at what's out there as filtering SMTP proxies, and it seems that Haraka is interesting. Haraka is Node.js based, so it's an event-based server written (largely) in JavaScript, kind of like Python's Twisted. So maybe in the future we'll switch to Haraka: that is, if we don't just migrate all our mail to Gmail.

POSTSCRIPT: I forgot that we use our spam filter machine as a mailserver for external clients (when authenticated with SMTP AUTH), so my plan didn't work. Turns out that what I actually needed was to enable the check_rcpt plugin together with my own plugin, because check_rcpt checks for mail relaying.

PPS: The correct response from a plugin if you think the message is kosher is DECLINED, not OK. OK means we're sure the message is OK, whereas DECLINED means pass it to the next plugin. Drat!

A Puppet definition for an Ensembl API server

At SANBI we use Puppet to manage system configuration for our servers. This significantly reduces the management headache, allowing us to make changes in a central location (e.g. what the DNS server IP addresses are) and also allows us to create "classes" of servers for different roles. Recently we hosted a course on the Ensembl Genome Browser taught by Bert Overduin of the EBI. In addition to teaching people how to use the Ensembl website, Bert taught a number of students how to use the Ensembl Perl API. I set up a VM, using the web interface to SANBI's private VM cloud, and created a puppet definition that would install the Ensembl API on the server. So here's a commented version of the definition I created.

First, a note about puppet: Puppet configuration is declarative, in other words it defines what should be, not (necessarily) how to get there. Each configuration item creates a "resource". Puppet provides a bunch of resource types out of the box and allows you to define your own types. For this server, I defined two types, the download and the unpack types, referring to a resource that required downloading and a resource that required unpacking respectively. These definitions went in my .pp file ahead of my server definition, along with a download_and_unpack type that combined the two definitions. The download_and_unpack type uses resource ordering, in its arrow (->) form. Since the Puppet configuration language is declarative, not imperative, you cannot assume that resources are created in the order that you specify, so if order is a requirement you need to specify it. Anyway here are these types:

define download( $url, $dist='defaultvalue', $download_dir='/var/tmp' ) {

    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    }
    $downloaded_dist = "$download_dir/$dist_file"
    exec { "download_$title":
        creates => $downloaded_dist,
        path => '/usr/bin',
        command => "wget -O $downloaded_dist $url",
    }
}
define unpack ( $dist, $creates, $dest='/opt', $download_dir='/var/tmp' ) {
    $suffix = regsubst($dist, '^.*(gz|bz2)$', '\1', 'I')
    if $suffix == 'gz' {
         $comp_flag = 'z'
    } elsif $suffix == 'bz2' {
         $comp_flag = 'j'
    } else {
         $comp_flag = ''
    }
    exec { "unpack_$title":
         creates => "$dest/$creates",
         command => "tar -C $dest -${comp_flag}xf $download_dir/$dist",
         path => '/bin',
    }
}
define download_and_unpack ( $url, $dist='defaultvalue',
                             $creates, $dest='/opt',
                             $download_dir='/var/tmp' ) {
    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    }
    download { "get_$title":
        url => $url,
        dist => $dist_file,
        download_dir => $download_dir,
    } ->
    unpack { "install_$title":
        dist => $dist_file,
        creates => $creates,
        dest => $dest,
        download_dir => $download_dir,
    }
}
Just one last note on these types: they use exec, which executes a command. In Puppet an exec will run each time the config is applied, unless you use a creates, onlyif or unless parameter. I thus use knowledge of what the commands do to specify that they should NOT be run if certain files exist.

Then there is one more type I need: an Ensembl course user with a particular defined password (the password matches the username - yes, very insecure, but this is on a throwaway VM for a single course). This is defined in terms of a user and an exec resource. The exec resource checks for the presence of the username *without* a password in /etc/shadow, and if it is found uses usermod to set the password (first generating the hash using openssl). Note that the generate() function runs on the Puppet server, not the client, so anything you use there needs to be installed on the server (in this case openssl, which was installed on the server already).

define enscourse_createuser {
    $tmp = generate("/usr/bin/openssl","passwd","-1",$name)
    $password_hash = inline_template('<%= @tmp.chomp %>')
    user { "$name":
      require => Group['enscourse'],
      ensure => present,
      gid => 'enscourse',
      comment => "Ensembl Course User $name",
      home => "/home/$name",
      managehome => true,
      shell => '/bin/bash',
    }
    exec { "/usr/sbin/usermod -p '${password_hash}' ${name}":
      onlyif => "/bin/egrep -q '^${name}:[*!]' /etc/shadow",
      require => User[$name],
    }
}

With the custom types out of the way we can start looking at the Puppet node that defines the "enscourse.sanbi.ac.za" server configuration:

node 'enscourse.sanbi.ac.za' inherits 'sanbi-server-ubuntu1204' {
    network::interface { "eth0":
         ipaddr  => "",
         netmask => "",
    }

We have an established "base machine definition" that we inherit from. This is *not* the recommended way to create Puppet configs, but we didn't know that when we started using Puppet at SANBI. Puppet's type system encourages a kind of mixin style programming, so there should be a set of Puppet classes e.g. sanbi-server or ubuntu-1204-server, and we should include them in the node definition. Just a quick note: Puppet classes are effectively singleton objects: they define a collection of resources that is declared once (as soon as the class is used in an include statement) in the entire Puppet catalog (a Puppet catalog is the collection of resources that will be applied to a particular system). Read Craig Dunn's blog for a bit on the difference between Puppet defined types and classes.
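To illustrate the difference with a hypothetical sketch (all class and resource names here are invented for illustration, not from our actual config): a class can be included many times but its resources are declared only once, while a defined type can be declared repeatedly with different titles:

```
# hypothetical example, not our real config
class sanbi_base {                     # a class: a singleton
  package { 'ntp': ensure => present }
}

define backup_dir($path) {             # a defined type: many instances allowed
  file { $path: ensure => directory }
}

node 'example.sanbi.ac.za' {
  include sanbi_base                   # declares the class's resources...
  include sanbi_base                   # ...and this second include is a no-op
  backup_dir { 'home': path => '/home' }
  backup_dir { 'data': path => '/data' }
}
```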

We then define the network interface parameters (an entry on SANBI's private Class C network). And then onwards to an Augeas definition that ensures that pam_mkhomedir is enabled. Augeas is a configuration management tool that parses text files and turns them into a tree that can be addressed and manipulated using a path specification language.

    augeas { 'mod_mkhomedir in pam':
        context => '/files/etc/pam.d/common-session',
        changes => [ 'ins 1000 after *[last()]',
                     'set 1000/type session',
                     'set 1000/control required',
                     'set 1000/module pam_mkhomedir.so',
                     'set 1000/argument umask=0022',
                   ],
        onlyif => "match *[module='pam_mkhomedir.so'] size == 0",
    }

And now on to some package definitions. Ensembl requires a specific version of BioPerl (version 1.2.3), so we need to ensure that the bioperl from the Ubuntu repositories is not installed. And then we provide a few text editors, the CVS version control system, and the MySQL server.

    # pvh - 03/09/2013 - can't use bioperl from ubuntu repo. must be v 1.2.3
    package {['bioperl','bioperl-run']:
        ensure => "absent",
    }

    package {['emacs23-nox', 'joe', 'jupp']:
        ensure => "present",
    }

    package {'cvs':
        ensure => "present",
    }

    package { 'mysql-server':
        ensure => "present",
    }

Now we get to use our download_and_unpack resource type to download and unpack the modules, as specified by the Ensembl API installation instructions. Then we define a /etc/profile.d/ensembl.sh file so that the Ensembl module directories get added to users' PERL5LIB environment variable:

    download_and_unpack { 'bioperl':
        url => 'http://bioperl.org/DIST/old_releases/bioperl-1.2.3.tar.gz',
        creates => 'bioperl-1.2.3/t/trim.t',
    }

    download_and_unpack { 'ensembl':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-72.tar.gz',
        creates => 'ensembl/sql/table.sql',
    }

    download_and_unpack { 'ensembl-compara':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-compara-72.tar.gz',
        creates => 'ensembl-compara/sql/tree-stats.sql',
    }

    download_and_unpack { 'ensembl-variation':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-variation-72.tar.gz',
        creates => 'ensembl-variation/sql/var_web_config.sql',
    }

    download_and_unpack { 'ensembl-functgenomics':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-functgenomics-72.tar.gz',
        creates => 'ensembl-functgenomics/sql/trimmed_funcgen_schema.xls',
    }

    file { '/etc/profile.d/ensembl.sh':
        content => '#!/bin/sh
PERL5LIB=/opt/bioperl-1.2.3:/opt/ensembl/modules:/opt/ensembl-compara/modules:/opt/ensembl-variation/modules:/opt/ensembl-functgenomics/modules
export PERL5LIB
',
        owner => root,
        mode => 0644,
    }

While much of the Ensembl API is pure Perl, Bert wanted the calc_genotypes tool compiled for use during the course, so we need a few more packages and an exec resource to do the compilation (with the associated creates statement to stop it being re-run on each puppet run):

    # for compiling calc_genotypes
    package { ['libipc-run-perl', 'build-essential']:
       ensure => present,
    }

    exec { 'build_calc_genotypes':
       creates => '/opt/ensembl-variation/C_code/calc_genotypes',
       require => [Download_and_unpack['ensembl-variation']],
       command => 'make calc_genotypes',
       cwd => '/opt/ensembl-variation/C_code',
       user => 'root',
       path => '/bin:/usr/bin',
    }


And finally some ugly hackery. I need a list of users to create, but Puppet doesn't have an easy way to do this. So I wrote a little Python script that generates a list of usernames, separated by @. When I use this with generate() I need to get rid of the spurious newline, which I do using an inline template, and finally I generate the list using split(). Yes, I know, really ugly. It's this kind of stuff that is making us here at SANBI consider switching to Salt Stack (also because we love Python here).

Anyway, once we've got a list we can just pass it in to declare a collection of enscourse_createuser resources. The resource naming is a bit off, since "createuser" implies something imperative; I should have just called it enscourse_user or something. And finally we close off the curly braces, and our node definition is complete!

     $tmp = generate('/usr/local/bin/gen_user_list.py', 'user', 25)
     $user_string = inline_template('<%= @tmp.chomp %>')
     notice("user string :${user_string}:")
     $user_list = split($user_string, '@')

     group { 'enscourse':
       ensure => present,
     }

     enscourse_createuser { $user_list: }
}

Here is that little Python script by the way:


import sys

base = sys.argv[1]
limit = int(sys.argv[2])
num_list = [base + str(x) for x in range(1,limit+1)]
print "@".join(num_list),
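As a quick sanity check of what this script produces, here is a Python 3 re-rendering of the same logic (runnable anywhere, independent of the Puppet server):

```python
def gen_user_list(base, limit):
    """Mimic gen_user_list.py: join base1..baseN with '@' separators."""
    return "@".join(base + str(x) for x in range(1, limit + 1))

user_string = gen_user_list("user", 3)
print(user_string)             # user1@user2@user3
print(user_string.split("@"))  # what Puppet's split() recovers on the other end
```

Splitting on "@" on the Puppet side gives back exactly the list of usernames the script generated.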

Remember that generate() is run on the Puppet server, so this script is installed there. Well, that's it!

Gotchas with HTTP and HTTPS

I recently installed a WordPress Network to provide blogs (and simple websites) for SANBI (the bioinformatics SANBI, not the biodiversity SANBI), with authentication provided by nginx's auth_pam module (and thus linked to our site-wide authentication, so we don't need to maintain separate WordPress users and passwords). The login page for WordPress is protected with SSL, but I was serving the rest of the site using plain HTTP. This led to a strange bug - when I wanted to edit a post, the editor was a very narrow column.

Like this, with all
the text cramped

What the heck? I tried changing WordPress settings, but as someone said out there on the net "Before you waste hours switching off / on plugins, check the javascript debugger in your browser!". Turns out that Chrome was blocking content because the combination of HTTP and HTTPS meant I had created a "mixed content" site, and modern browsers (including Chrome) frown on such behaviour. Melissa Koenig has a little blog post on mixed content for the curious. After much googling, I found a blog post by Ken Chen detailing how to set this up right: in the SSL config (but not in the main HTTP config) you serve the main WordPress content using HTTPS, ensuring that "mixed content" is avoided.

So here's the nginx config file:

upstream php {
    server unix:/var/run/php5-fpm.sock;
}

server {
    listen 443;
    ssl on;
    server_name .wp.sanbi.ac.za *.wp.sanbi.ac.za blog.sanbi.ac.za *.blog.sanbi.ac.za;

    ssl_certificate /etc/ssl/certs/sanbi.pem;
    ssl_certificate_key /etc/ssl/private/sanbi.key;

    root /usr/lib/wordpress;
    index index.php index.html index.htm;

    # Process only the requests to wp-login and wp-admin
    location ~ /wp-(admin|login|includes|content) {
        auth_pam "SANBI authentication";
        auth_pam_service_name "nginx";
        try_files $uri $uri/ \1/index.php?args;
    }

    location ~ \.php$ {
        try_files $uri =404;
        include fastcgi_params;
        fastcgi_param REMOTE_USER $remote_user;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass php;
        fastcgi_intercept_errors on;
    }

    # Redirect everything else to port 80
    location / {
        return 301 http://$host$request_uri;
    }
}

server {
    #listen 80; ## listen for ipv4; this line is default and implied
    #listen [::]:80 default ipv6only=on; ## listen for ipv6

    server_name wp.sanbi.ac.za *.wp.sanbi.ac.za blog.sanbi.ac.za *.blog.sanbi.ac.za;
    root /usr/lib/wordpress;

    index index.php;

    location ~ /wp-(?:admin|login) {
        return 301 https://$host$request_uri;
    }

    location = /favicon.ico {
        log_not_found off;
        access_log off;
    }

    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }

    location / {
        # This is cool because no php is touched for static content.
        # include the "?$args" part so non-default permalinks doesn't break when using query string
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
        location ~ /wp-(admin|login) {
            return 301 https://$host$request_uri;
        }
        try_files $uri =404;
        include fastcgi_params;
        fastcgi_intercept_errors on;
        fastcgi_pass php;
    }

    location ~* ^.+\.(ogg|ogv|svg|svgz|eot|otf|woff|mp4|ttf|rss|atom|jpg|jpeg|gif|png|ico|zip|tgz|gz|rar|bz2|doc|xls|exe|ppt|tar|mid|midi|wav|bmp|rtf)$ {
        access_log off; log_not_found off; expires max;
    }

    location ~ /\. { deny all; access_log off; log_not_found off; }
}


Note that this uses a Unix socket for php5-fpm. I've found that I don't need to use WordPress's FORCE_SSL_ADMIN (or the WordPress SSL plugin), and since I had some issues with the SSL plugin (it created an invalid default SSL link for newly created network sites), I left that bit out of the config. Oh, and by the way, I use the HTTP authentication plugin to provide an authentication dialog to users. You can read this guide on the basics of a WordPress network setup on nginx. A quick note: when you add a new site to the network, make sure to go into the dashboard of the new site and enable the "automatically create new users" setting in the HTTP authentication settings, at least until the new site's owner has created an account for themselves (by logging in).

Finally, the nginx auth_pam module is compiled into the nginx I installed on Ubuntu (from the nginx-full package). I created a file /etc/pam.d/nginx that directs authentication to both our Kerberos setup and the Unix accounts on the local server (so that I could create ad-hoc users, e.g. the default WordPress admin user):

auth [success=2 default=ignore] pam_krb5.so minimum_uid=1000 ignore_k5login
auth [success=1 default=ignore] pam_unix.so use_first_pass
auth requisite pam_deny.so
auth required pam_permit.so

MegaCLI for LSI MegaRAID on Linux

Here at SANBI we have a bunch of Dell blade servers, including Dell M620s. These have on-board LSI MegaRAID controllers (the SAS 2208, to be exact), and recently I had a need to pull some diagnostic info from a controller (because our drives mysteriously went offline, one after another, a couple of days apart). We're running Ubuntu 12.04 on the servers, and luckily there is a command line tool, MegaCLI, that you can use to interface with the RAID controller. You can get it from the LSI website, but it's a bit of a hunt. Here is a link to version 8.05.71, which is a zip file that I simply extracted into /opt. The tool has a bewildering number of options (shown with MegaCli64 -h), but there are guides to useful commands here and here.
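For reference, here are the sort of status queries those guides describe (your path to MegaCli64 will depend on where you unpacked the zip; -aALL queries all adapters):

```
# controller summary, logical drive status, and per-disk state / error counts
MegaCli64 -AdpAllInfo -aALL
MegaCli64 -LDInfo -Lall -aALL
MegaCli64 -PDList -aALL
```

The -PDList output in particular shows per-drive media error and predictive failure counts, which is what you want when drives start dropping out one by one.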