This post originally appeared in our sysadvent series and was moved here after the sysadvent microsite was discontinued.
A few systemd features that help you and your fellow sysadmins.
At 3am, I want to sleep. I do not want an SMS saying “Service X is down”, and I do not want my systems to wake the on-call personnel so that they can scratch their heads and call me to say “Service X is down, and I need help fixing it”.
There are a couple of things you can do to avoid this.
Automatic restarts
Sometimes processes die. Particularly at inconvenient times, it seems. In many cases, the fix is to “restart it, and figure out the cause later”. You can configure systemd to restart your service for you. If the restart succeeds, the service stays available, and no SMS is sent.
[Service]
Restart=always
The “Restart=” directive tells systemd to restart the service if the process terminates. You can set it to “always”, or read the manual page to see if the other values make sense for you.
Just make sure you follow up on unexpected service restarts. Restarts are logged in the journal, and you should add them to your monitoring.
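One way to follow up is to read the restart counter that systemd keeps per unit. This is a minimal sketch, assuming a systemd version that exposes the “NRestarts” property through “systemctl show” (235 or later, as far as I know). The function names “restart_count” and “check_restarts” are invented for this example, and the WARNING/OK wording is only a suggestion.

```shell
#!/bin/sh
# Hypothetical helper: read "Key=Value" lines (as printed by
# "systemctl show") on stdin and print the NRestarts value.
# Prints 0 when the property is missing.
restart_count() {
    awk -F= '$1 == "NRestarts" { n = $2 } END { print n + 0 }'
}

# Hypothetical check: returns 1 when the unit has been restarted
# more often than the given threshold (default 0).
check_restarts() {
    unit=$1
    threshold=${2:-0}
    count=$(systemctl show -p NRestarts "$unit" | restart_count)
    if [ "$count" -gt "$threshold" ]; then
        echo "WARNING: $unit restarted $count times"
        return 1
    fi
    echo "OK: $unit restarted $count times"
}
```

Called as “check_restarts mystery.service 0”, the return code can feed whatever monitoring system you already use. The journal entries for the same unit are available with “journalctl -u mystery.service”.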
Improved documentation
Not all services are well known, or well documented. The on-call person may not be the one responsible for the architecture or the day-to-day operations of that server.
You don’t need to edit the original unit file; you can add a drop-in file in /etc/systemd/system/<yourservice>.service.d/<something>.conf:
$ mkdir /etc/systemd/system/mystery.service.d
$ cat > /etc/systemd/system/mystery.service.d/documentation.conf
[Unit]
Documentation=https://wiki.corp.example.org/SomeClient/CommonFailures \
https://www.enterpricy.example.org/Documentation/ \
man:mysteryd(8) \
file:///opt/mystery/doc/index.html
^D
The content of the “Documentation=” directive is visible when running “systemctl status servicename”. This helps your on-call person, when the alarm goes off, to figure out what is wrong, and how to fix it. Add your own service documentation, and a link to the upstream documentation.
The output will look like this:
$ systemctl status mystery.service
● mystery.service - MYSTERY Scheduler
Loaded: loaded (/lib/systemd/system/mystery.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/mystery.service.d
└─documentation.conf
Active: active (running) since Mon 2016-11-28 06:25:01 CET; 6h ago
Docs: man:mysteryd(8)
https://wiki.corp.example.org/SomeClient/CommonFailures
https://www.enterpricy.example.org/Documentation/
man:mysteryd(8)
file:///opt/mystery/doc/index.html
Main PID: 10015 (mysteryd)
CPU: 251ms
CGroup: /system.slice/mystery.service
├─10015 /usr/sbin/mysteryd -l
└─10218 /usr/lib/mystery/notifier/dbus dbus://
Nov 28 06:25:01 turbotape systemd[1]: Started MYSTERY Scheduler.
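The same list can also be read from a script, for example to paste the links into an alert message. This is a small sketch, assuming that “systemctl show -p Documentation” prints the entries as a single space-separated “Documentation=” line. The function name “list_docs” is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read "systemctl show -p Documentation" output
# on stdin and print one documentation URI per line.
list_docs() {
    sed -n 's/^Documentation=//p' | tr ' ' '\n' | sed '/^$/d'
}
```

Usage would be “systemctl show -p Documentation mystery.service | list_docs”.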
Show connections for a service
Systemd tracks all processes per service by placing them in the same cgroup. Using “ps”, “awk” and “lsof”, we can print network connections for a single service, across multiple processes.
The one-liner
…ironically enough not on one line
ps -e -o pid,cgroup \
| awk '$2 ~ /dovecot.service/ {print "-p", $1}' \
| xargs -r lsof -n -i -a
What does it do?
The example lists the network connections of all processes that belong to “dovecot.service”.
- List all running processes, and print PID and cgroup on each line.
- For each line, check if the “cgroup” matches our regular expression, and print the PID. Actually, print a “-p”, and the PID, since this is used by lsof.
- Use “xargs” to take the “-p $pid” lines from STDIN, and add them to the “lsof” command line.
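The steps above can be wrapped in a function that takes the unit name as an argument. This is a sketch, not a polished tool: the names “unit_pid_args” and “unit_connections” are invented here, and it assumes procps “ps” and GNU “xargs” (for “-r”), like the one-liner. It matches the unit name as a literal substring rather than as a regular expression, so the dot in “dovecot.service” is not treated as a wildcard.

```shell
#!/bin/sh
# Hypothetical helper: read "PID CGROUP" lines on stdin and print
# "-p PID" for every process whose cgroup contains "/<unit>".
unit_pid_args() {
    awk -v unit="$1" 'index($2, "/" unit) { print "-p", $1 }'
}

# Hypothetical wrapper: show network connections for all processes
# in the given unit. The "pid=" and "cgroup=" forms suppress the
# ps header line.
unit_connections() {
    ps -e -o pid= -o cgroup= \
        | unit_pid_args "$1" \
        | xargs -r lsof -n -i -a
}
```

Running “unit_connections dovecot.service” should then give the same kind of listing as the one-liner.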
Example output
Here, we see that the “dovecot.service” unit has a number of listening ports, and one established session.
$ ps -e -o pid,cgroup \
| awk '$2 ~ /dovecot.service/ {print "-p", $1}' \
| xargs -r lsof -n -i -a
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dovecot 17335 root 31u IPv4 11520166 0t0 TCP *:imap2 (LISTEN)
dovecot 17335 root 32u IPv6 11520167 0t0 TCP *:imap2 (LISTEN)
dovecot 17335 root 33u IPv4 11520168 0t0 TCP *:imaps (LISTEN)
dovecot 17335 root 34u IPv6 11520169 0t0 TCP *:imaps (LISTEN)
imap-logi 17564 dovenull 18u IPv6 25385800 0t0 TCP [2001:db8::de:caf:bad]:imaps->[2001:db8::c0:ff:ee]:55043 (ESTABLISHED)