Systemd at 3am

This post appeared originally in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite

A few of systemd features that helps you and your fellow sysadmins.

At 3am, I want to sleep. I do not want SMS with “Service X is down”, and I do not want my systems to wake the on-call personnel, so they can scratch their heads and call me about “Service X is down, and I need help fixing it”.

There are a couple of things you can do to avoid this.

Automatic restarts

Sometimes processes die. Particularly at inconvenient times, it seems. In many cases, the fix is to “restart it, and figure out the cause later”. You can configure systemd to restart your service. If the restart is successful, the service is not unavailable, and no SMS is sent.

[Service]
Restart=always

The “Restart=” directive tells systemd to restart the service if the process terminates. You can set it to “always”, or read the manual page to see if the other values make sense for you.

Just ensure you follow up on unexpected service restarts. This is logged in the journal, and you should add this to your monitoring.

Improved documentation

Not all services are well known, or well documented. The on-call personnel may not be the one responsible for the architecture or the day-to-day operations for that server.

You don’t need to edit the original unit file, you can add a drop-in file in /etc/systemd/system/<yourservice>.d/<something>.conf:

$ mkdir /etc/systemd/system/mystery.service.d
$ cat > /etc/systemd/system/mystery.service.d/documentation.conf
[Unit]
Documentation=https://wiki.corp.example.org/SomeClient/CommonFailures \
  https://www.enterpricy.example.org/Documentation/ \
  man:mysteryd(8) \
  file:///opt/mystery/doc/index.html
^D

The content of the “Documentation=” directive is visible when running “systemctl status servicename”. This helps your on-call person, when the alarm goes off, to figure out what is wrong, and how to fix it. Add your own service documentation, and a link to the upstream documentation.

The output will look like this:

$ systemctl status mystery.service
● mystery.service - MYSTERY Scheduler
   Loaded: loaded (/lib/systemd/system/mystery.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/mystery.service.d
           └─documentation.conf
   Active: active (running) since Mon 2016-11-28 06:25:01 CET; 6h ago
     Docs: man:mysteryd(8)
           https://wiki.corp.example.org/SomeClient/CommonFailures
           https://www.enterpricy.example.org/Documentation/
           man:mysteryd(8)
           file:///opt/mystery/doc/index.html
 Main PID: 10015 (mysteryd)
      CPU: 251ms
   CGroup: /system.slice/mystery.service
           ├─10015 /usr/sbin/mysteryd -l
           └─10218 /usr/lib/mystery/notifier/dbus dbus://

Nov 28 06:25:01 turbotape systemd[1]: Started MYSTERY Scheduler.

Show connections for a service

Systemd tracks all processes per service by placing them in the same cgroup. Using “ps”, “awk” and “lsof”, we can print network connections for a single service, across multiple processes.

The one-liner

…ironically enough not on one line

ps -e -o pid,cgroup \
  | awk '$2 ~ /dovecot.service/ {print "-p", $1}' \
  | xargs -r lsof -n -i -a

What does it do?

The example lists all processes started by “dovecot.service”.

  • List all running processes, and print PID and cgroup on each line.
  • For each line, check if the “cgroup” matches our regular expression, and print the PID. Actually, print a “-p”, and the PID, since this is used by lsof.
  • Use “xargs” to take the “-p $pid” lines from STDIN, and add them to the “lsof” command line.

Example output

Here, we see that the “dovecot.service” unit has a number of listening ports, and one established session.

$ ps -e -o pid,cgroup \
    | awk '$2 ~ /dovecot.service/ {print "-p", $1}' \
    | xargs -r lsof -n -i -a
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
dovecot 17335 root   31u  IPv4 11520166      0t0  TCP *:imap2 (LISTEN)
dovecot 17335 root   32u  IPv6 11520167      0t0  TCP *:imap2 (LISTEN)
dovecot 17335 root   33u  IPv4 11520168      0t0  TCP *:imaps (LISTEN)
dovecot 17335 root   34u  IPv6 11520169      0t0  TCP *:imaps (LISTEN)
imap-logi 17564 dovenull   18u  IPv6 25385800      0t0  TCP [2001:db8::de:caf:bad]:imaps->[2001:db8::c0:ff:ee]:55043 (ESTABLISHED)

Stig Sandbeck Mathisen

Former Senior Systems Architect at Redpill Linpro

Thoughts on the CrowdStrike Outage

Unless you’ve been living under a rock, you probably know that last Friday a global crash of computer systems caused by ‘CrowdStrike’ led to widespread chaos and mayhem: flights were cancelled, shops closed their doors, even some hospitals and pharmacies were affected. When things like this happen, I first have a smug feeling “this would never happen at our place”, then I start thinking. Could it?

Broken Software Updates

Our department do take responsibility for keeping quite a lot ... [continue reading]

Alarms made right

Published on June 27, 2024

Just-Make-toolbox

Published on March 22, 2024