This post appeared originally in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.
As mentioned in the previous Ansible post, we use Ansible quite a lot for day-to-day operations. While we prefer Puppet for configuration management, Ansible is excellent for automating maintenance procedures.
One such procedure is gracefully applying package upgrades to application servers, including any required reboot. In this post we’ll take a look at upgrading a cluster of web application servers defined in the Ansible hostgroup “webservers”. They’re located behind a redundant pair of HAProxy load balancers in the “loadbalancers” hostgroup. The web servers in this example are running Ubuntu 16.04.
The process
In short, we want to:
- Verify that the cluster is working as it should (we don’t want to bring down anything for maintenance if other parts of the cluster are already broken).
- For each web server, one at a time:
- Bring it out of the load-balanced cluster
- Upgrade packages and reboot if needed
- Add it back into the load-balanced cluster
Prerequisites
This playbook needs something like the UNIX “cut” program for massaging list output in a Jinja2 template. To do this we create a new filter plugin and tell Ansible where to find it: create a directory for filter plugins, and point to it from your ansible.cfg:
# in ansible.cfg, under the [defaults] section:
filter_plugins = /some/path/filter_plugins/
Now put the following filter plugin into the file “splitpart.py” in the above directory:
def splitpart(value, index, char=','):
    """Split a string (or each string in a list) on char and return element number index."""
    if isinstance(value, (list, tuple)):
        ret = []
        for v in value:
            ret.append(v.split(char)[index])
        return ret
    else:
        return value.split(char)[index]


class FilterModule(object):
    def filters(self):
        return {'splitpart': splitpart}
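A quick way to verify that Ansible picks up the new filter is a one-off debug task; the hostname here is purely illustrative:

- name: Demonstrate the splitpart filter
  ansible.builtin.debug:
    msg: "{{ 'web1.example.com' | splitpart(0, '.') }}" # prints "web1"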
The playbook
Let’s do a breakdown of the full playbook…
Before we do any modifications we want to verify that everything is already working as it should. This is monitored by the monitoring system as well, but sysadmins are paranoid. First the task-list header…
---
- name: Ensure services are up before doing anything
  hosts: webservers
  any_errors_fatal: true # stop if anything is wrong
  serial: 1 # one server at a time
  become: false # no need for root
  tasks:
Let’s say we have two virtualhosts that need probing (site1.example.com and site2.example.com).
- name: Verify that site1.example.com is up
  ansible.builtin.uri:
    url: http://localhost/
    status_code: 200
    follow_redirects: none
    headers:
      Host: site1.example.com

- name: Verify that site2.example.com is up
  ansible.builtin.uri:
    url: http://localhost/
    status_code: 200
    follow_redirects: none
    headers:
      Host: site2.example.com
This uses the Ansible uri module to fetch the front page of the two local sites on all the web servers, and verify that they yield a 200 response code. The default for the “status_code” attribute is already 200, but I included it for easy tuning.
Next, we’ll also make sure that all the web servers are in the load-balancing cluster. This will enable any web servers that were out of the cluster.
- name: Make sure all nodes are enabled and up in haproxy
  delegate_to: "{{ item }}"
  become: true # this part needs root
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname | splitpart(0, '.') }}"
    socket: /var/run/haproxy.sock
    wait: true
  with_items: "{{ groups.loadbalancers }}"
Using with_items makes this whole task a loop that is executed once for each host in the “loadbalancers” hostgroup. For each iteration, the variable “item” is set to the current loadbalancer server, and we use this variable in delegate_to to tell Ansible to carry out the current task on each loadbalancer in order. Since the task-list including this task is performed once for every server in the “webservers” hostgroup, this task is in effect done for every web server on every loadbalancer.
On the loadbalancers, the Ansible HAProxy module enables us to ensure that each web server is enabled and “UP”. The wait: true setting ensures that the task doesn’t finish before the server is actually in an “UP” state according to the loadbalancer probes, as opposed to just enabling it if it was in maintenance state. The host attribute takes the inventory_hostname (the FQDN of the web server in this case) and picks out the first element (the shortname of the host), since that’s the name of the server in our HAProxy definition. The {{ … }} is a Jinja2 template, which opens up a lot of options when customisation is required.
For this to work, the stats socket in haproxy.cfg on the loadbalancer servers needs to be defined with the “admin” level, e.g.:
global
    stats socket /var/run/haproxy.sock user haproxy group haproxy mode 440 level admin
    # […etc]
In addition to checking state, this authorises disabling/enabling web servers through the socket.
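The Ansible HAProxy module drives this socket with the same text commands you can send by hand, which is handy for debugging. For example, with socat installed on a loadbalancer (the backend/server pair “webfarm/web1” is hypothetical):

# inspect current backend/server states
echo "show stat" | socat stdio /var/run/haproxy.sock
# what the module does on our behalf
echo "disable server webfarm/web1" | socat stdio /var/run/haproxy.sock
echo "enable server webfarm/web1" | socat stdio /var/run/haproxy.sock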
At this point we have confirmed that both websites are up on all the web servers, and that all the web servers are active on all the loadbalancers. It’s time to start doing the actual work. We need to start off a new task-list:
---
- name: Upgrade packages and reboot (if necessary)
  hosts: webservers
  serial: 1 # one host at a time
  become: true # as root
  any_errors_fatal: true
  max_fail_percentage: 0
  vars: # used by nagios downtime/undowntime tasks
    icinga_server: monitoring.example.com
  tasks:
This task-list loops through the web servers, one after the other, as root, and will abort the whole playbook run if anything goes wrong at any point in the task-list.
The var “icinga_server” is used for setting/removing downtime in our Icinga monitoring system. If you haven’t got one, just remove that bit, along with the downtime tasks further down.
At this point we initially jumped straight to the apt-get upgrade part. But over time, the effect of “it’d be handy if the automated package update also did X and Y, wouldn’t it?” has evolved the task-list into something more complex and even more useful. We see this effect on other Ansible playbooks as well.
Let’s first figure out what we want to upgrade…
# do an "apt-get update", to ensure latest package lists
- name: Apt-get update
  ansible.builtin.apt:
    update_cache: true
  changed_when: false

# get a list of packages that have updates
- name: Get list of pending upgrades
  ansible.builtin.command: apt-get --simulate dist-upgrade
  args:
    warn: false # don't warn us about apt having its own module
  register: apt_simulate
  changed_when: false

# pick out the list of pending updates from the command output. This
# essentially takes the above output from "apt-get --simulate dist-upgrade"
# and pipes it through "cut -f2 -d' ' | sort"
- name: Parse apt-get output to get list of changed packages
  ansible.builtin.set_fact:
    updates: '{{ apt_simulate.stdout_lines | select("match", "^Inst ") | list | splitpart(1, " ") | list | sort }}'
  changed_when: false

# tell user about packages being updated
- name: Show pending updates
  ansible.builtin.debug:
    var: updates
  when: updates.0 is defined
…that was a handful. We first do an apt-get update through the Ansible apt module. Even though this changes files in /var/lib/apt/ we don’t really care; we only want Ansible to mark a web server as changed if it actually upgraded any packages. We therefore force the change flag to never be set using the changed_when meta parameter. We do this in many tasks throughout this playbook for the same reason.
Next we run apt-get --simulate dist-upgrade and store the command output in a variable called “apt_simulate” for use by later tasks. We do this through the Ansible command module, since the apt module has no support for --simulate. The command module will notice that we’re running apt-get directly and warn us that we might want to use the apt module instead. We tell it to skip that warning through the warn option.
The next task then picks the lines of stdout that start with Inst to get a full list of all the packages that will be updated.
The list of packages is useful for the sysadmin to know, so we print it using the Ansible debug module.
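To illustrate, two typical lines of “apt-get --simulate dist-upgrade” output (the package versions are made up):

Inst libssl1.0.0 [1.0.2g-1ubuntu4.19] (1.0.2g-1ubuntu4.20 Ubuntu:16.04/xenial-updates [amd64])
Inst linux-image-4.4.0-210-generic (4.4.0-210.242 Ubuntu:16.04/xenial-updates [amd64])

…would end up in the “updates” fact as the sorted list ['libssl1.0.0', 'linux-image-4.4.0-210-generic'].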
When starting to use Ansible playbooks for routines like this, it can be quite useful to ask for sysadmin confirmation before doing any actual changes. If you want to request such a confirmation, this is a good place to do it.
# request manual ack before proceeding with package upgrade
- name: Pause
  ansible.builtin.pause:
  when: updates.0 is defined
We now know what will be updated (if anything), and we’ve got sysadmin confirmation if we’re about to do any changes. Let’s get to work!
# if a new kernel is incoming, remove old ones to avoid a full /boot
- name: Apt-get autoremove
  ansible.builtin.command: apt-get -y autoremove
  args:
    warn: false
  when: '"Inst linux-image-" in apt_simulate.stdout'
  changed_when: false
Most Debian/Ubuntu admins have at some time ended up with a full /boot when upgrading kernels because of old kernel packages staying around. While there are other ways to avoid this (especially in newer distro versions), it doesn’t hurt to make sure to get rid of any old kernel packages that are no longer needed.
# do the actual apt-get dist-upgrade
- name: Apt-get dist-upgrade
  ansible.builtin.apt:
    upgrade: dist # upgrade all packages to latest version
Finally the actual command we set out to do! This is pretty self-explanatory.
…but…what did we do? Did we upgrade libc? Systemd? The kernel? Something else that needs a reboot? Newer Debian-based systems create the file /var/run/reboot-required if a reboot is necessary after a package upgrade. Let’s look at that…
# check if we need a reboot
- name: Check if reboot needed
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: file_reboot_required
Using the Ansible stat module, the result of a stat of the file /var/run/reboot-required has now been stored in the variable “file_reboot_required”.
We could now add a when: clause checking the “exists” flag to each of the remaining tasks that perform the reboot, but that would be quite a lot of clutter. There is a more elegant way: skip the rest of the task-list for the current web server and jump straight to the next one.
# "meta: end_play" aborts the rest of the tasks in the current «tasks:»
# section, for the current webserver
# the "when:" clause ensures that "meta: end_play" only triggers if the
# current webserver does _not_ need a reboot
- name: Stop the play
  ansible.builtin.meta: end_play
  when: not file_reboot_required.stat.exists
In other words, we stop the current task-list for the current web server unless the file /var/run/reboot-required exists. If the file exists we need a reboot, but if not we can just skip the reboot and continue with the next web server.
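As an aside, end_play only has this per-host effect because the play runs with serial: 1; it ends the play for the whole current batch, which here is a single host. On Ansible 2.8 and newer, meta: end_host expresses the intent directly, regardless of the serial setting. A minimal sketch:

- name: Skip to the next host if no reboot is required
  ansible.builtin.meta: end_host
  when: not file_reboot_required.stat.exists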
This means that the rest of the task-list will only be executed if the current web server needs a reboot, so let’s start prepping just that.
# add nagios downtime for the webserver
- name: Set nagios downtime for host
  delegate_to: "{{ icinga_server }}" # do this on the monitoring server
  community.general.nagios:
    action: downtime
    comment: OS Upgrades
    service: all
    minutes: 30
    host: "{{ inventory_hostname }}"
    author: "{{ lookup('ansible.builtin.env', 'USER') }}"
False positives in the monitoring system are bad, so we use the Ansible Nagios module to SSH to the Icinga server and set downtime for the web server we’re about to reboot, as well as for all services on it.
Next we take the web server out of the loadbalancer cluster.
- name: Disable haproxy backend {{ inventory_hostname }}
  delegate_to: "{{ item }}"
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname | splitpart(0, '.') }}"
    socket: /var/run/haproxy.sock
    wait: true
    #drain: true # requires ansible 2.4
  loop: "{{ groups.loadbalancers }}"
Using the same HAProxy module that we earlier used to ensure that all HAProxy backend servers were enabled, we now disable the web server we’re about to reboot on all the loadbalancer servers. state: disabled means we want the server to end up in “MAINT” mode. Optimally we’d want the drain parameter as well, as the combination of the drain and wait flags ensures that all active connections to the web server get to finish gracefully before proceeding to the reboot. The drain option was added in Ansible 2.4, and some of our management nodes don’t have new enough Ansible versions to support that parameter. Use it if you can.
Since Ansible re-uses SSH connections to servers for consecutive tasks, we need to jump through a couple of hoops when rebooting.
- name: Reboot node
  ansible.builtin.shell: sleep 2 && shutdown -r now "Reboot triggered by ansible"
  async: 1
  poll: 0
  ignore_errors: true

# poll SSH port until we get a tcp connect
- name: Wait for node to finish booting
  become: false
  local_action: wait_for host={{ inventory_hostname }} port=22 state=started delay=5 timeout=600
  # port 22 assumed; adjust if sshd listens elsewhere

# give SSHD time to start fully
- name: Wait for SSH to start fully
  ansible.builtin.pause:
    seconds: 15
We first do a reboot through the Ansible shell module with a sleep and some flags to avoid getting an Ansible connection error.
The second block waits until the SSH port on the web server starts accepting connections. Before Ubuntu 16.04 this was enough, but in 16.04 SSH accepts connections before it properly accepts logins during boot, so we do an extra wait to ensure that we can log into the web server.
Ansible 2.3 has a wait_for_connection module which can probably replace the second and third block, but again some of our management nodes have older versions.
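On Ansible 2.7 and newer, the whole three-step dance can likely be collapsed into the reboot module, which handles the connection juggling and the waiting by itself. A minimal sketch, mirroring the timeouts used above:

- name: Reboot node and wait for it to come back
  ansible.builtin.reboot:
    msg: Reboot triggered by ansible
    reboot_timeout: 600   # same 10 minute budget as the wait_for above
    post_reboot_delay: 15 # same grace period for sshd to settle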
We’ve now rebooted the server. Before we re-add it to the loadbalancing cluster, we need to make sure that the applications work as they should.
# verify that services are back up
- name: Verify that site1.example.com is up
  ansible.builtin.uri:
    url: http://localhost/
    status_code: 200
    follow_redirects: none
    headers:
      Host: site1.example.com
  register: probe
  until: probe.status == 200
  retries: 60
  delay: 2

- name: Verify that site2.example.com is up
  ansible.builtin.uri:
    url: http://localhost/
    status_code: 200
    follow_redirects: none
    headers:
      Host: site2.example.com
  register: probe
  until: probe.status == 200
  retries: 60
  delay: 2
This is essentially what we did to ensure functioning servers before the package upgrade, except that this version keeps retrying for up to two minutes instead of requiring an immediate 200 response. It’s not uncommon for web applications to take a while to start. The uri module will be retried up to 60 times with a 2 second delay between each retry, until it returns a 200 response code.
Now we’re pretty much done. We’ve upgraded and rebooted the web server, and have confirmed that the virtualhosts respond with 200. It’s time to clean up.
# reenable disabled services
- name: Re-enable haproxy backend {{ inventory_hostname }}
  delegate_to: "{{ item }}"
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname | splitpart(0, '.') }}"
    socket: /var/run/haproxy.sock
    wait: true
  with_items: "{{ groups.loadbalancers }}"
# remove nagios downtime for the host
- name: Remove nagios downtime for host
  delegate_to: "{{ icinga_server }}" # do this on the monitoring server
  community.general.nagios:
    action: delete_downtime
    host: "{{ inventory_hostname }}"
    service: all
These just undo what we did before the reboot. Note that the wait flag on the HAProxy module asserts that the web server actually ends up in an “UP” state on the loadbalancers after it is brought out of maintenance mode. In other words, we’ll notice (and the Ansible playbook will abort) if the HAProxy probe thinks the web server is unhealthy.
Heavily loaded web servers often need a bit of time to get “warm”. To ensure stability we wait a few minutes before we proceed to the next web server.
# wait a few minutes between hosts, unless we're on the last
- name: Waiting between hosts
  ansible.builtin.pause:
    minutes: 10
  when: inventory_hostname != ansible_play_hosts[-1]
Result
The end result is a playbook that we can trust to do its own thing without much oversight. If anything fails it’ll stop in its tracks, meaning that at most one web server should end up in a failed state.