This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.
When building a Ceph cluster, it was important for us to plan ahead. Not only does one usually start out with a minimum of ~5 servers, but one should also expect the cluster to grow. Running the cluster also means patching the operating system and Ceph itself, and with Ceph being a crucial infrastructure component, a proper rollback procedure is also desirable.
Using CI to maintain the image
We’ve grown really fond of ram-disk nodes. Using a Jenkins instance, we run a build service that boots up a VM with an updated version of CentOS 7, connects to Puppet, and provisions all the basic components that we want the server to have. This image is converted into a root file system and uploaded to our redundant service hosts for availability.
Covering this build process could be an entire Sysadvent calendar on its own, but this is the gist of it.
Booting it up
There’s a lovely piece of software that’s incredibly powerful for booting things over the network - and it’s called iPXE. We chain-load iPXE in the PXE/UEFI boot process, and it lets us run a small script before deciding which image to boot. In this script, we pass information to the kernel command line about which Puppet environment the server belongs to, how to configure bonding, and other pieces of information that are unique to that server.
We then boot up the kernel/rootfs of the image, parse /proc/cmdline, configure networking (changing a dynamic lease on a single interface into a bonded interface with a static IP address), download Puppet certificates, and then run Puppet.
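As an illustration, extracting one such parameter from the kernel command line can be done with a small helper along these lines (a sketch only - the parameter names puppet_env and bond_slaves are made-up examples, not necessarily the ones we actually pass):

```shell
#!/bin/bash
# Print the value of a key=value parameter found on a kernel command line.
parse_cmdline() {
    local key="$1" cmdline="$2" word
    for word in $cmdline; do
        case "$word" in
            "$key"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# In the real boot flow this would be: CMDLINE=$(cat /proc/cmdline)
CMDLINE="ro quiet puppet_env=production bond_slaves=eno1,eno2"
parse_cmdline puppet_env "$CMDLINE"    # prints "production"
parse_cmdline bond_slaves "$CMDLINE"   # prints "eno1,eno2"
```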
Puppet puts the last pieces in place
The image has all the basic software in place, but it may or may not have the necessary keys to be allowed to talk to the cluster monitors. Puppet ensures that ceph.conf and the keys are placed where they should be, in a secure manner. Puppet also takes care of day-to-day configuration changes - such as adding new monitoring probes and tools as we build upon and improve them.
We continuously make changes on the servers through Puppet, based on our best practices, and we try not to update the image unless it’s a matter of security or availability.
What about OSD state?
And this is the point I’m trying to drive home with this post: I guess you’re wondering how we know which drive is which OSD in our cluster. There’s a really simple trick that we figured out:
- For each file system (we use XFS in production), query the OSD ID using the UUID of that file system.
- If the UUID is unknown, a new OSD ID is created - but for existing OSDs, the correct ID is returned.
From that point on, we have all information we need to mount the OSDs and start the systemd-unit for it.
Something along these lines will get us up and running:
#!/bin/bash
for a in /dev/disk/by-uuid/*; do
    FS=$(blkid -o value -s TYPE "$a")
    if [ "$FS" == "xfs" ]; then
        # Look up the OSD id from the monitors; a new one is created if the UUID is unknown
        UUID=$(basename "$a")
        OSDID=$(ceph osd create "$UUID")
        if [ ! -e "/var/lib/ceph/osd/ceph-$OSDID" ]; then
            mkdir "/var/lib/ceph/osd/ceph-$OSDID"
        fi
        if mountpoint -q "/var/lib/ceph/osd/ceph-$OSDID"; then
            echo "Already mounted, skipping. (OSD: $OSDID, UUID: $UUID)"
        else
            if grep -q "$UUID" /etc/fstab; then
                echo "Mountpoint already in fstab, not adding"
            else
                echo "UUID=$UUID /var/lib/ceph/osd/ceph-$OSDID xfs defaults 0 0" >> /etc/fstab
            fi
            mount "$a" "/var/lib/ceph/osd/ceph-$OSDID"
        fi
        if [ ! -e "/var/lib/ceph/osd/ceph-$OSDID/keyring" ]; then
            # Initialize the OSD data directory, register its key, and place it in the CRUSH map
            ceph-osd -i "$OSDID" --mkfs --mkkey --osd-uuid "$UUID"
            ceph auth add "osd.$OSDID" osd 'allow *' mon 'allow profile osd' -i "/var/lib/ceph/osd/ceph-$OSDID/keyring"
            ceph osd crush add "osd.$OSDID" 1.0 host=$(facter hostname)
        else
            echo "Already initialized, skipping. (OSD: $OSDID, UUID: $UUID)"
        fi
        RUNNING=$(pgrep -c -f "^ceph-osd -i $OSDID\$")
        if [ "$RUNNING" -ge 1 ]; then
            echo "Already running, not starting."
        else
            systemctl enable "ceph-osd@${OSDID}.service"
            systemctl start "ceph-osd@${OSDID}.service"
        fi
    fi
done
This code has the unintended side effect that it can also be used to easily add new OSDs to the cluster: just insert a new disk, run mkfs.xfs on it, run start_osds, and it will be added to the cluster. As always, just be careful about the performance impact of the backfill operations that adding new OSDs may cause.
For journals, we use raw partitions. To ensure that journals keep working across boots - even if hardware comes up in a different order than before (turning sda into sdb, or the other way around) - we always use /dev/disk/by-wwn/ to look up the correct partition.
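A minimal sketch of that lookup (the WWN below is a made-up placeholder, and the commented ceph-osd invocation is only an illustration of how the resolved path could be used):

```shell
#!/bin/bash
# Resolve a persistent disk alias to the kernel device it currently points at.
resolve_dev() {
    readlink -f "$1"
}

# Example usage during boot (hypothetical WWN and OSD id):
# JOURNAL=$(resolve_dev /dev/disk/by-wwn/0x5000c500a1b2c3d4-part2)
# ceph-osd -i "$OSDID" --mkjournal --osd-journal "$JOURNAL"
```

Because the WWN is a property of the drive itself, the symlink keeps pointing at the right partition no matter which sdX name the kernel hands out on a given boot.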
What did we really achieve?
This simplifies scaling up a lot.
- When the hardware has arrived and the physical part of the job is taken care of, initializing a new node is done by booting it, formatting the drives, and starting the OSDs.
- Adding new drives is solved by formatting the drives and starting the OSDs.
It also ensures that, as long as all nodes have booted the same version of the image, we can expect them to contain the same software versions and identical configuration.
Upgrading the image is just a matter of running ceph osd set noout and performing a rolling reboot of the cluster. Don’t like what we just patched in? Perform a rollback! We keep the old image, and we know exactly what to expect from it!
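The rolling reboot can be sketched roughly like this - a simplified illustration, not our actual orchestration; it assumes SSH access to the nodes, and the node list and polling interval are arbitrary choices:

```shell
#!/bin/bash
# Poll a status command until it reports HEALTH_OK.
wait_for_health() {
    until $1 | grep -q HEALTH_OK; do
        sleep 30
    done
}

# The rolling reboot itself would then look roughly like:
#
#   ceph osd set noout                 # keep CRUSH from rebalancing while OSDs are down
#   for node in $NODES; do             # NODES: the cluster's hosts
#       ssh "$node" reboot || true     # the SSH connection drops as the node goes down
#       sleep 60                       # give the node time to actually go down
#       wait_for_health "ceph health"  # wait until its OSDs have rejoined
#   done
#   ceph osd unset noout
```

Waiting for HEALTH_OK between nodes ensures we never have more than one node's worth of OSDs down at a time.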