A Puppet definition for an Ensembl API server

At SANBI we use Puppet to manage system configuration for our servers. This significantly reduces the management headache: we can make changes in a central location (e.g. what the DNS server IP addresses are) and create “classes” of servers for different roles. Recently we hosted a course on the Ensembl Genome Browser taught by Bert Overduin of the EBI. In addition to teaching people how to use the Ensembl website, Bert taught a number of students how to use the Ensembl Perl API. I set up a VM, using the web interface to SANBI’s private VM cloud, and created a Puppet definition that would install the Ensembl API on the server. So here’s a commented version of the definition I created.

First, a note about Puppet: Puppet configuration is declarative; in other words, it defines what should be, not (necessarily) how to get there. Each configuration item creates a “resource”. Puppet provides a bunch of resource types out of the box and allows you to define your own types. For this server, I defined two types, the download and the unpack types, referring to a resource that required downloading and a resource that required unpacking respectively. These definitions went in my .pp file ahead of my server definition, along with a download_and_unpack type that combined the two definitions. The download_and_unpack type uses resource ordering, in its arrow (->) form. Since the Puppet configuration language is declarative, not imperative, you cannot assume that resources are created in the order that you specify, so if order is a requirement you need to specify it. Anyway, here are these types:

define download( $url, $dist='defaultvalue', $download_dir='/var/tmp' ) {

    # default the local filename to the last element of the URL path
    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    }
    $downloaded_dist = "$download_dir/$dist_file"
    exec { "download_$title":
        creates => $downloaded_dist,
        path => '/usr/bin',
        command => "wget -O $downloaded_dist $url",
    }
}

define unpack ( $dist, $creates, $dest='/opt', $download_dir='/var/tmp' ) {
    # pick the tar compression flag from the file suffix
    $suffix = regsubst($dist, '^.*(gz|bz2)$', '\1', 'I')
    if $suffix == 'gz' {
         $comp_flag = 'z'
    } elsif $suffix == 'bz2' {
         $comp_flag = 'j'
    } else {
         $comp_flag = ''
    }

    exec { "unpack_$title":
         creates => "$dest/$creates",
         command => "tar -C $dest -${comp_flag}xf $download_dir/$dist",
         path => '/bin',
    }
}

define download_and_unpack ( $url, $dist='defaultvalue',
                             $creates, $dest='/opt',
                             $download_dir='/var/tmp' ) {
    if $dist == 'defaultvalue' {
        $path_els = split($url, '/')
        $dist_file = $path_els[-1]
    } else {
        $dist_file = $dist
    }
    # the arrow orders the two resources: download first, then unpack
    download { "get_$title":
        url => $url,
        dist => $dist_file,
        download_dir => $download_dir
    } ->
    unpack { "install_$title":
        dist => $dist_file,
        creates => $creates,
        dest => $dest,
        download_dir => $download_dir
    }
}
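
To make the ordering point concrete, here is a small sketch in Python (not Puppet) of how a declared dependency, rather than the position in the file, determines the order. The resource names are just illustrative labels modelled on the titles above, and the standard-library graphlib module stands in for Puppet's own dependency graph, which is certainly more involved than this.

```python
from graphlib import TopologicalSorter

# Map each resource to the set of resources it must come after.
# The unpack entry is listed first on purpose: listing order is irrelevant.
dependencies = {
    "Unpack[install_ensembl]": {"Download[get_ensembl]"},
    "Download[get_ensembl]": set(),
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['Download[get_ensembl]', 'Unpack[install_ensembl]']
```

Without the dependency entry, either order would be a valid answer, which is the situation the arrow is there to rule out.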

Just one last note on these types: they use exec, which executes a command. In Puppet an exec will be executed each time the config is run, unless you use a creates, onlyif or unless statement. I thus use knowledge of what the commands do to specify that they should NOT be run if certain files exist.
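
For readers who think more easily in Python, the logic of these types can be sketched roughly as follows. This is only an approximation under my own naming: dist_file_from_url, tar_flag and run_unless_exists are hypothetical helpers, not anything Puppet provides, and the last one mimics only the creates check, not onlyif or unless.

```python
import os
import re
import subprocess


def dist_file_from_url(url):
    """Last path element of the URL, like split($url, '/')[-1]."""
    return url.split("/")[-1]


def tar_flag(dist):
    """tar compression flag chosen from the file suffix ('' if unknown)."""
    match = re.search(r"(gz|bz2)$", dist, re.IGNORECASE)
    if match is None:
        return ""
    return {"gz": "z", "bz2": "j"}[match.group(1).lower()]


def run_unless_exists(creates, command):
    """Run command only if the path `creates` is absent; return True if it ran."""
    if os.path.exists(creates):
        return False
    subprocess.run(command, check=True)
    return True


url = "http://www.ensembl.org/cvsdownloads/ensembl-72.tar.gz"
dist = dist_file_from_url(url)
print(dist, tar_flag(dist))  # ensembl-72.tar.gz z
```

The guard is what makes repeated runs cheap: once the file named in creates is on disk, the command is skipped.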

Then there is one more type I need: an Ensembl course user with a particular defined password (the password matches the username; yes, very insecure, but this is on a throwaway VM for a single course). This is defined in terms of a user and an exec resource. The exec resource checks for the presence of the username *without* a password in /etc/shadow, and if it exists uses usermod to set the password (first generating it using openssl). Note that the generate() function runs on the Puppet server, not the client, so anything you are using there needs to be installed on the server (in this case openssl, which was already installed on the server).

define enscourse_createuser {
    # generate() runs on the Puppet server; chomp strips the trailing newline
    $tmp = generate("/usr/bin/openssl","passwd","-1",$name)
    $password_hash = inline_template('<%= @tmp.chomp %>')
    user { "$name":
      require => Group['enscourse'],
      ensure => present,
      gid => 'enscourse',
      comment => "Ensembl Course User $name",
      home => "/home/$name",
      managehome => true,
      shell => '/bin/bash',
    }
    # only set the password while the shadow entry is still locked (* or !)
    exec { "/usr/sbin/usermod -p '${password_hash}' ${name}":
      onlyif => "/bin/egrep -q '^${name}:[*!]' /etc/shadow",
      require => User[$name],
    }
}
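
The onlyif test is the part that keeps usermod from running on every Puppet run, so here is a rough Python rendering of the same check. The sample shadow lines are made up for illustration, and needs_password is a hypothetical helper; unlike the egrep above it also escapes the username before building the pattern.

```python
import re


def needs_password(shadow_text, name):
    """True if the shadow entry for `name` has a password field starting with * or !."""
    pattern = re.compile(r"^" + re.escape(name) + r":[*!]", re.MULTILINE)
    return pattern.search(shadow_text) is not None


# Invented sample data: user1 is freshly created, user2 already has a hash.
sample = (
    "root:$6$abc$def:19000:0:99999:7:::\n"
    "user1:!:19000:0:99999:7:::\n"
    "user2:$1$xyz$hash:19000:0:99999:7:::\n"
)

print(needs_password(sample, "user1"))  # True
print(needs_password(sample, "user2"))  # False
```

Once usermod has written a real hash, the field no longer starts with * or !, so the check fails and the exec is skipped.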

With the custom types out of the way we can start looking at the Puppet node that defines the “enscourse.sanbi.ac.za” server configuration:

node 'enscourse.sanbi.ac.za' inherits 'sanbi-server-ubuntu1204' {
    network::interface { "eth0":
         ipaddr  => "",
         netmask => "",
    }

We have an established “base machine definition” that we inherit from. This is *not* the recommended way to create Puppet configs, but we didn’t know that when we started using Puppet at SANBI. Puppet’s type system encourages a kind of mixin style programming, so there should be a set of Puppet classes e.g. sanbi-server or ubuntu-1204-server, and we should include them in the node definition. Just a quick note: Puppet classes are effectively singleton objects: they define a collection of resources that is declared once (as soon as the class is used in an include statement) in the entire Puppet catalog (a Puppet catalog is the collection of resources that will be applied to a particular system). Read Craig Dunn’s blog for a bit on the difference between Puppet defined types and classes.

We then define the network interface parameters (an entry on SANBI’s private Class C network). And then onwards to an Augeas definition that ensures that pam_mkhomedir is enabled. Augeas is a configuration management tool that parses text files and turns them into a tree that can be addressed and manipulated using a path specification language.

    augeas { 'mod_mkhomedir in pam':
        context => '/files/etc/pam.d/common-session',
        changes => [ 'ins 1000 after *[last()]',
                     'set 1000/type session',
                     'set 1000/control required',
                     'set 1000/module pam_mkhomedir.so',
                     'set 1000/argument umask=0022',
                   ],
        onlyif => "match *[module='pam_mkhomedir.so'] size == 0",
    }
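
The effect of that onlyif guard can be approximated in Python as an “append only if missing” operation. This is a plain-text sketch under my own assumptions, not how Augeas works internally (Augeas operates on a parsed tree, not on raw lines), and ensure_mkhomedir and PAM_LINE are names made up for the example.

```python
PAM_LINE = "session required pam_mkhomedir.so umask=0022"


def ensure_mkhomedir(lines):
    """Append the pam_mkhomedir entry unless an active line already names the module."""
    for line in lines:
        fields = line.split()
        if fields and not fields[0].startswith("#") and "pam_mkhomedir.so" in fields:
            return list(lines)
    return list(lines) + [PAM_LINE]


before = ["session required pam_unix.so"]
once = ensure_mkhomedir(before)
twice = ensure_mkhomedir(once)
print(once == twice)  # True
```

Applying the change a second time leaves the result untouched, which is the property the onlyif match is there to provide.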

And now on to some package definitions. Ensembl requires a specific version of Bioperl (version 1.2.3), so we need to ensure that the Bioperl from the Ubuntu repositories is not installed. And then we provide a few text editors, the CVS version control system, and the MySQL server.

    # pvh - 03/09/2013 - can't use bioperl from ubuntu repo. must be v 1.2.3
    package {['bioperl','bioperl-run']:
        ensure => "absent",
    }

    package {['emacs23-nox', 'joe', 'jupp']:
        ensure => "present",
    }

    package {'cvs':
        ensure => "present",
    }

    package { 'mysql-server':
        ensure => "present",
    }

Now we get to use our download_and_unpack resource type to download and unpack the modules, as specified by the Ensembl API installation instructions. Then we define an /etc/profile.d/ensembl.sh file so that the Ensembl stuff gets added to users’ PERL5LIB environment variables:

    download_and_unpack { 'bioperl':
        url => 'http://bioperl.org/DIST/old_releases/bioperl-1.2.3.tar.gz',
        creates => 'bioperl-1.2.3/t/trim.t',
    }

    download_and_unpack { 'ensembl':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-72.tar.gz',
        creates => 'ensembl/sql/table.sql',
    }

    download_and_unpack { 'ensembl-compara':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-compara-72.tar.gz',
        creates => 'ensembl-compara/sql/tree-stats.sql',
    }

    download_and_unpack { 'ensembl-variation':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-variation-72.tar.gz',
        creates => 'ensembl-variation/sql/var_web_config.sql',
    }

    download_and_unpack { 'ensembl-functgenomics':
        url => 'http://www.ensembl.org/cvsdownloads/ensembl-functgenomics-72.tar.gz',
        creates => 'ensembl-functgenomics/sql/trimmed_funcgen_schema.xls',
    }

    file { '/etc/profile.d/ensembl.sh':
        content => '#!/bin/sh
export PERL5LIB
',
        owner => root,
        mode => '0644',
    }

While much of the Ensembl API is pure Perl, Bert wanted the calc_genotypes tool compiled for use during the course, so we need a few more packages and an exec resource to do the compilation (with the associated creates statement to stop it being re-run on each puppet run):

    # for compiling calc_genotypes
    package { ['libipc-run-perl', 'build-essential']:
       ensure => present,
    }

    exec { 'build_calc_genotypes':
       creates => '/opt/ensembl-variation/C_code/calc_genotypes',
       # the source tree must be unpacked, and make/gcc installed, before building
       require => [Download_and_unpack['ensembl-variation'],
                   Package['build-essential']],
       command => 'make calc_genotypes',
       cwd => '/opt/ensembl-variation/C_code',
       user => 'root',
       path => '/bin:/usr/bin',
    }


And finally some ugly hackery. I need a list of users to create, but Puppet doesn’t have an easy way to do this. So I wrote a little Python script that generates a list of usernames, separated by @. When I use this with generate() I need to get rid of the spurious newline, which I do using an inline template, and finally generate the list using split(). Yes I know, really ugly. It’s this kind of stuff that is making us here at SANBI consider switching to Salt Stack (also because we love Python here).

Anyway, once we’ve got a list we can just pass it in to define a collection of enscourse_createuser resources. The resource naming is a bit off, since “createuser” implies something imperative. I should have just called this enscourse_user or something. And finally we close off the curly braces, and our node definition is complete!

     $tmp = generate('/usr/local/bin/gen_user_list.py', 'user', '25')
     $user_string = inline_template('<%= @tmp.chomp %>')
     notice("user string :${user_string}:")
     $user_list = split($user_string, '@')

     group { 'enscourse':
       ensure => present
     }

     enscourse_createuser { $user_list: }
}

Here is that little Python script by the way:


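#!/usr/bin/env python3
import sys

# usage: gen_user_list.py BASE LIMIT  ->  BASE1@BASE2@...@BASELIMIT
base = sys.argv[1]
limit = int(sys.argv[2])
num_list = [base + str(x) for x in range(1, limit + 1)]
# print() adds a trailing newline, which the Puppet side removes with chomp
print("@".join(num_list))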
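
As a quick sanity check of the round trip (script output, then chomp, then split), here is a small Python sketch of what the Puppet side ends up with. user_list_string is a hypothetical helper that repeats the script’s logic, and rstrip is only a rough stand-in for Ruby’s chomp.

```python
def user_list_string(base, limit):
    """Same joining logic as the script: base1@base2@...@baseN."""
    return "@".join(base + str(x) for x in range(1, limit + 1))


# What generate() would hand back: the joined string plus a trailing newline.
raw = user_list_string("user", 25) + "\n"

# Strip the newline (chomp in the inline template), then split on '@'.
names = raw.rstrip("\n").split("@")
print(len(names), names[0], names[-1])  # 25 user1 user25
```

Skipping the chomp step would leave the newline attached to the last username, which is why the inline template is needed.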

Remember that generate() is run on the Puppet server, so this script is installed there. Well, that’s it! And here is the whole thing as one block in case you want to copy and paste it:
