An experimental Ceph storage cluster for the Computer Science Netlab

Ceph is an open source distributed object store and filesystem originally developed by Sage Weil for his PhD. Saying that again, but slower this time: Ceph is an object store where the objects are distributed over a collection of drives and servers, and there's a filesystem component too. The basic building blocks of Ceph are object storage daemons (OSDs) and monitor daemons (MONs). Each OSD manages a block of storage into which objects are placed, aggregated into placement groups (PGs).

Placement is determined by a data structure called the CRUSH map, which functions as a distributed hash table used to look up where particular data objects are stored. Distributed hash tables underlie most distributed storage software (e.g. you also find one in GlusterFS). Keeping track of the state of everything are the MONs, which use the Paxos algorithm to maintain a consistent view of the state of the cluster. There is always an odd number of MONs: typically 3 for a small cluster.
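
To make the object-to-PG idea concrete, here's a deliberately simplified sketch in shell. Real Ceph hashes the object name with the rjenkins hash and then consults the CRUSH map to choose OSDs; the modulo step below only captures the basic "hash into a fixed set of placement groups" idea, and the object name and PG count are made up for illustration.

```shell
# Toy illustration of object placement (NOT real CRUSH):
# hash the object name, then take it modulo the pool's PG count.
pg_num=4096
obj="vm-image-0001"
hash=$(printf '%s' "$obj" | cksum | cut -d' ' -f1)
pg=$((hash % pg_num))
echo "object $obj maps to PG $pg"
```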

Once you put it all together you get RADOS, the Reliable Autonomic Distributed Object Store (see the 2007 paper if you want). Storage in RADOS is divided into pools where you set per-pool policy for object size and striping and replication. So a pool might have say 4096 PGs containing objects of maximum 4 MB each filled with 64 KB stripes of data, replicated to ensure that there are 3 copies of each PG available at all times.
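
To put some rough numbers on that policy, a quick back-of-the-envelope in shell (the figures are illustrative): with 4 MB objects and 3-way replication, the object count and raw storage consumed fall out directly. The stripe size doesn't change the totals, it only affects how data is spread within objects.

```shell
# Rough arithmetic for the example pool policy above (illustrative figures):
data_mb=3723      # logical data stored in the pool
object_mb=4       # maximum object size
replicas=3        # copies of each PG
objects=$(( (data_mb + object_mb - 1) / object_mb ))  # ceiling division
raw_mb=$(( data_mb * replicas ))                      # raw space consumed
echo "~$objects objects, ~$raw_mb MB of raw storage"
```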

Ceph provides 3 options for accessing RADOS: first, there is the RADOS object gateway, a RESTful service allowing individual objects (e.g. VM images) to be stored and retrieved. Then there's the RADOS Block Device (RBD), an iSCSI-like block device that can be mapped to a virtual drive on the client computer. And finally there is CephFS, a POSIX-compliant filesystem running on top of storage pools in RADOS (the filesystem uses an extra daemon, the metadata server (MDS), to map filesystem locations such as /home to objects).
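
As a rough sketch of what the latter two look like in practice (the pool and image names here are made up, and the exact mount options depend on your auth setup), creating an RBD image and mounting CephFS go something like this:

```shell
# Create a 1 GB RBD image in a pool, map it, and put a filesystem on it
rbd create mypool/myimage --size 1024
sudo rbd map mypool/myimage          # appears as e.g. /dev/rbd0
sudo mkfs.ext4 /dev/rbd0
sudo mount /dev/rbd0 /mnt/rbd

# Mount CephFS from a monitor host (kernel client)
sudo mount -t ceph normandy:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
```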

The Ceph architecture allows heterogenous disks and servers to be aggregated into a storage pool in a way that is more flexible than traditional RAID and with substantially shorter rebuild times.

So to demo Ceph, I set it up in the Computer Science Netlab at UWC, using a fair sprinkling of Ansible scripting and the ceph-deploy tool. The architecture is three MONs (normandy, netlab2-ws and netlab6-ws) and three OSDs (netlab2-ws, netlab6-ws and netlab17-ws). On the OSDs I've used the existing filesystem as the Ceph store - Ceph should really have its own storage partitions, but I didn't want to go to the trouble of repartitioning machines for the demo. The release used is Giant, the latest, as-yet-unreleased Ceph release.

TODO: detail the initial installation

I'll hopefully get around to documenting the initial steps, but for now let me show how I add a new OSD to the Ceph cluster.

First, create the /dfs directory on the new OSD (in this case netlab10-ws). The "-s -k -K" arguments to ansible mean: use sudo, ask for the SSH password, and ask for the sudo password. Since the commands I'm running need to be run as root on the remote machine, I need all of that.

ansible -s -k -K -m file -a "name=/dfs owner=root state=directory" netlab10-ws

Then add a user to the remote machine. The password set here is never actually used, since we'll be using SSH public keys to log in and passwordless sudo to run commands.

ansible  -u pvh -k -K -m user -a 'name=netlab-ceph home=/dfs/netlab-ceph createhome=yes password="SOMEENCRYPTEDSTUFF" shell=/bin/bash comment="Ceph User" state=present' netlab10-ws

ansible -k -s -K -m authorized_key -a 'user=netlab-ceph key="SSH PUBLIC KEY GOES HERE"' netlab10-ws

ansible -k -K -s -m copy -a 'content="netlab-ceph ALL = (root) NOPASSWD:ALL\n" dest=/etc/sudoers.d/050_ceph mode=0444 owner=root' netlab10-ws 

Next we need to install an NTP daemon and synchronise the time (we should really run an in-lab NTP server, but for now we're using a public one to set the time). Ceph relies on close time synchronisation between nodes to operate.

ansible -s -k -K -m apt -a 'name=ntp state=present' netlab10-ws

ansible -s -k -K -m command -a 'ntpdate pool.ntp.org' netlab10-ws
ansible -s -k -K -m service -a 'name=ntp state=started enabled=true' netlab10-ws

Then use the ceph-deploy tool to install the actual Ceph packages, create a directory for the OSD to put its data in, prepare, activate, and we're done!

ceph-deploy install --release=giant netlab10-ws

ansible -s -k -K -m file -a 'name=/dfs/osd3 state=directory' netlab10-ws

ceph-deploy osd prepare netlab10-ws:/dfs/osd3
ceph-deploy osd activate netlab10-ws:/dfs/osd3

You can check on the state of the cluster with sudo ceph status, where you'll see something like this:

netlab-ceph@normandy:~$ sudo ceph status
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 18 pgs degraded; 220 pgs peering; 93 pgs stuck inactive; 93 pgs stuck unclean; recovery 296/2880 objects degraded (10.278%)
     monmap e1: 3 mons at {netlab2-ws=,netlab6-ws=,normandy=}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4413: 320 pgs, 3 pools, 3723 MB data, 960 objects
            63303 MB used, 799 GB / 907 GB avail
            296/2880 objects degraded (10.278%)
                  18 active+degraded
                 220 peering
                  82 active+clean
recovery io 11307 kB/s, 2 objects/s
  client io 6596 kB/s wr, 4 op/s

Or you can watch it rebalancing itself with ceph -w:

netlab-ceph@normandy:~$ sudo ceph -w
    cluster 915d5e83-2950-4860-ba97-2118c061036f
     health HEALTH_WARN 121 pgs degraded; 8 pgs recovering; 28 pgs stuck unclean; recovery 1308/3864 objects degraded (33.851%)
     monmap e1: 3 mons at {netlab2-ws=,netlab6-ws=,normandy=}, election epoch 14, quorum 0,1,2 normandy,netlab2-ws,netlab6-ws
     mdsmap e5: 1/1/1 up {0=normandy=up:active}
     osdmap e64: 4 osds: 4 up, 4 in
      pgmap v4445: 320 pgs, 3 pools, 4991 MB data, 1288 objects
            63424 MB used, 799 GB / 907 GB avail
            1308/3864 objects degraded (33.851%)
                 113 active+degraded
                 199 active+clean
                   8 active+recovering+degraded
recovery io 14070 kB/s, 3 objects/s

2014-10-16 11:44:39.069514 mon.0 [INF] pgmap v4445: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63424 MB used, 799 GB / 907 GB avail; 1308/3864 objects degraded (33.851%); 14070 kB/s, 3 objects/s recovering
2014-10-16 11:44:41.178062 mon.0 [INF] pgmap v4446: 320 pgs: 113 active+degraded, 199 active+clean, 8 active+recovering+degraded; 4991 MB data, 63473 MB used, 799 GB / 907 GB avail; 1306/3864 objects degraded (33.799%); 9782 kB/s, 2 objects/s recovering

To remove an OSD, you can use these commands (using our newly created osd.3 as an example) - they take the OSD out of the storage cluster, stop the daemon, remove it from the CRUSH map, delete its authentication keys and finally remove the OSD from the cluster's list of OSDs.

netlab-ceph@normandy:~$ sudo ceph osd out 3
marked out osd.3. 
netlab-ceph@normandy:~$ ssh netlab10-ws sudo stop ceph-osd-all
ceph-osd-all stop/waiting
netlab-ceph@normandy:~$ sudo ceph osd crush remove osd.3
removed item id 3 name 'osd.3' from crush map
netlab-ceph@normandy:~$ sudo ceph auth del osd.3
netlab-ceph@normandy:~$ sudo ceph osd rm 3
removed osd.3

Then, for good measure, you can remove the data:

netlab-ceph@normandy:~$ ssh netlab10-ws sudo rm -rf /dfs/osd3/\*

Also not covered in this blog post is how I added an RBD device and how I created and mounted a CephFS filesystem. Well... bug me till I finish writing this thing.