Systemd at 3am

This post originally appeared in our sysadvent series, and was moved here after the sysadvent microsite was discontinued.

A few systemd features that help you and your fellow sysadmins.

At 3am, I want to sleep. I do not want an SMS saying “Service X is down”, and I do not want my systems to wake the on-call personnel, only for them to scratch their heads and call me with “Service X is down, and I need help fixing it”.

There are a couple of things you can do to avoid this.

Automatic restarts

Sometimes processes die, and they seem to prefer inconvenient times. In many cases, the fix is to “restart it, and figure out the cause later”. You can configure systemd to restart your service automatically. If the restart succeeds, the service is back up, and no SMS is sent.

[Service]
Restart=always

The “Restart=” directive tells systemd to restart the service if the process terminates. With “always”, the service is restarted no matter how the process exited. Read the systemd.service(5) manual page to see if one of the other values, such as “on-failure”, makes more sense for you.

Just make sure you follow up on unexpected service restarts. They are logged in the journal, and you should add them to your monitoring.
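One way to follow up is a small check that your monitoring system can run. The sketch below assumes a systemd version that exposes the “NRestarts” property through “systemctl show”. The function names and the threshold are made up for illustration, and the exit codes follow the common “0 is OK, 1 is WARNING” convention.

```shell
# Extract the number from a "NRestarts=<n>" line on stdin.
# This is the output format of: systemctl show -p NRestarts UNIT
parse_nrestarts() {
    sed -n 's/^NRestarts=\([0-9][0-9]*\)$/\1/p'
}

# Print a one-line status, and return 1 if the count exceeds the threshold.
report_restarts() {
    unit="$1"
    count="$2"
    max="$3"
    if [ "$count" -gt "$max" ]; then
        echo "WARNING: $unit restarted $count times (threshold $max)"
        return 1
    fi
    echo "OK: $unit restarted $count times"
}

# Ask systemd for the restart counter of a unit and report on it.
# Assumption: if the property is missing, the count is treated as 0.
check_restarts() {
    unit="$1"
    max="${2:-0}"
    count=$(systemctl show -p NRestarts "$unit" | parse_nrestarts)
    report_restarts "$unit" "${count:-0}" "$max"
}
```

Calling “check_restarts mystery.service 0” would then warn as soon as systemd has restarted the service at least once.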

Improved documentation

Not all services are well known, or well documented. The person on call may not be the one responsible for the architecture or the day-to-day operations of that server.

You don’t need to edit the original unit file. Instead, add a drop-in file in /etc/systemd/system/<yourservice>.service.d/<something>.conf:

$ mkdir /etc/systemd/system/mystery.service.d
$ cat > /etc/systemd/system/mystery.service.d/documentation.conf
[Unit]
Documentation=https://wiki.corp.example.org/SomeClient/CommonFailures \
  https://www.enterpricy.example.org/Documentation/ \
  man:mysteryd(8) \
  file:///opt/mystery/doc/index.html
^D
$ systemctl daemon-reload

Once systemd has reloaded its configuration (“systemctl daemon-reload”), the content of the “Documentation=” directive is shown by “systemctl status servicename”. When the alarm goes off, this helps the person on call figure out what is wrong and how to fix it. Add links to both your own service documentation and the upstream documentation.

The output will look like this:

$ systemctl status mystery.service
● mystery.service - MYSTERY Scheduler
   Loaded: loaded (/lib/systemd/system/mystery.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/mystery.service.d
           └─documentation.conf
   Active: active (running) since Mon 2016-11-28 06:25:01 CET; 6h ago
     Docs: man:mysteryd(8)
           https://wiki.corp.example.org/SomeClient/CommonFailures
           https://www.enterpricy.example.org/Documentation/
           man:mysteryd(8)
           file:///opt/mystery/doc/index.html
 Main PID: 10015 (mysteryd)
      CPU: 251ms
   CGroup: /system.slice/mystery.service
           ├─10015 /usr/sbin/mysteryd -l
           └─10218 /usr/lib/mystery/notifier/dbus dbus://

Nov 28 06:25:01 turbotape systemd[1]: Started MYSTERY Scheduler.
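If you maintain many services, the drop-in can also be written by a small helper. This is only a sketch: the function name and its arguments are hypothetical, and the base directory is a parameter so you can try it in a scratch directory first. It writes one “Documentation=” line per URI instead of using line continuations, which works because systemd merges repeated “Documentation=” lines into one list.

```shell
# Write a documentation drop-in for a unit.
# Usage: write_doc_dropin UNIT BASEDIR URI...
write_doc_dropin() {
    unit="$1"
    basedir="$2"
    shift 2
    dir="$basedir/$unit.d"
    mkdir -p "$dir" || return 1
    {
        echo "[Unit]"
        # One Documentation= line per URI given on the command line.
        for uri in "$@"; do
            echo "Documentation=$uri"
        done
    } > "$dir/documentation.conf"
}
```

On a real system this would be called as root with /etc/systemd/system as the base directory, for example “write_doc_dropin mystery.service /etc/systemd/system 'man:mysteryd(8)'”, followed by “systemctl daemon-reload”.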

Show connections for a service

Systemd tracks all processes per service by placing them in the same cgroup. Using “ps”, “awk” and “lsof”, we can print network connections for a single service, across multiple processes.

The one-liner

…ironically enough not on one line

ps -e -o pid,cgroup \
  | awk '$2 ~ /dovecot\.service/ {print "-p", $1}' \
  | xargs -r lsof -n -i -a

What does it do?

The example lists the network connections of all processes belonging to “dovecot.service”.

  • List all running processes, and print PID and cgroup on each line.
  • For each line, check if the “cgroup” matches our regular expression, and print the PID. Actually, print a “-p”, and the PID, since this is used by lsof.
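  • Use “xargs” to take the “-p $pid” lines from STDIN, and add them to the “lsof” command line. Here, “-i” selects network files, “-a” combines that with the PID list using AND, and “-n” skips DNS lookups.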
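The steps above can be wrapped in a shell function, so the unit name becomes an argument. This is a sketch with hypothetical function names. It also swaps the regular expression for a literal substring match on “/<unit>”, so the dots in the unit name are not treated as wildcards.

```shell
# Read "PID CGROUP" lines on stdin (the output of: ps -e -o pid,cgroup)
# and print "-p PID" for every process whose cgroup contains "/<unit>".
pids_for_unit() {
    awk -v unit="$1" '
        # index() is a literal substring search, not a regex match.
        index($2, "/" unit) { print "-p", $1 }
    '
}

# Print the network connections of all processes in a unit.
# "xargs -r" skips lsof entirely when no process matched.
service_connections() {
    ps -e -o pid,cgroup \
        | pids_for_unit "$1" \
        | xargs -r lsof -n -i -a
}
```

With this in place, “service_connections dovecot.service” runs the same pipeline as the one-liner.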

Example output

Here, we see that the “dovecot.service” unit has a number of listening ports, and one established session.

$ ps -e -o pid,cgroup \
    | awk '$2 ~ /dovecot\.service/ {print "-p", $1}' \
    | xargs -r lsof -n -i -a
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
dovecot 17335 root   31u  IPv4 11520166      0t0  TCP *:imap2 (LISTEN)
dovecot 17335 root   32u  IPv6 11520167      0t0  TCP *:imap2 (LISTEN)
dovecot 17335 root   33u  IPv4 11520168      0t0  TCP *:imaps (LISTEN)
dovecot 17335 root   34u  IPv6 11520169      0t0  TCP *:imaps (LISTEN)
imap-logi 17564 dovenull   18u  IPv6 25385800      0t0  TCP [2001:db8::de:caf:bad]:imaps->[2001:db8::c0:ff:ee]:55043 (ESTABLISHED)

Stig Sandbeck Mathisen

Former Senior Systems Architect at Redpill Linpro
