This post originally appeared in our sysadvent series and has been moved here following the discontinuation of the sysadvent microsite.
I’m sure we have all had that feeling at least once. You patch your desktop or laptop, then type reboot in a shell to restart your computer. And that crucial server you were working on starts shutting down.
But fear not - a solution exists for this and similar problems.
History
A molly-guard was (according to the Internet) originally an improvised plexiglass cover shielding the kill switch on an IBM 4341. It was named after a programmer's daughter - Molly - who tripped this kill switch repeatedly. The name has obviously stuck around.
Your own virtual Molly guard
molly-guard is a small program that tries to prevent you from accidentally shutting down or rebooting servers. On Debian and derivatives, it can usually be installed by running
apt install molly-guard
while repackaged RPM packages can be found for RedHat and derivatives.
How it works
In a nutshell, this program works by forcing one or more checks before the commands halt, shutdown, poweroff or reboot are run. To achieve this, these commands are replaced with scripts that invoke the molly-guard functionality (i.e. the check scripts).
The checks reside in the /etc/molly-guard/run.d directory - all scripts in this directory are run, and all of them need to exit successfully, that is, return an exit value of 0. After all scripts have exited successfully, the original command is executed.
Typically, these checks include having to type in the name of the host you want to halt or reboot when you are logged in via SSH. Entering the wrong name will abort your command.
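The idea behind such a hostname check can be sketched in a few lines of shell. This is a simplified stand-in, not molly-guard's actual script: the function name confirm_hostname, the prompt text and the optional argument are all made up for illustration.

```shell
#!/bin/sh
# Simplified stand-in for a hostname check; not molly-guard's own script.

# confirm_hostname [EXPECTED]
# Reads one line from stdin and succeeds only if it equals EXPECTED
# (default: this machine's node name as reported by uname -n).
confirm_hostname() {
    expected="${1:-$(uname -n)}"
    answer=""
    printf 'Please type in the hostname of the machine to confirm: ' >&2
    # read fails on end-of-file; answer then stays empty and will not match.
    read -r answer || :
    if [ "$answer" != "$expected" ]; then
        echo "Hostname mismatch, aborting." >&2
        return 1
    fi
    return 0
}
```

A non-zero return value is what makes the surrounding command abort, which matches the "all checks must exit with 0" rule described above.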
If you are in a re-attached screen session, molly-guard will not find your SSH process and will think you are on a local console, where the chance of a screw-up is smaller, so it won’t ask.
If you set the ALWAYS_QUERY_HOSTNAME variable in the /etc/molly-guard/rc configuration file, this script will also force a check when you are in screen/tmux, logged in via the console, etc.
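Assuming the rc file uses plain shell-style variable assignments (as the packaged default appears to), enabling this might look like the fragment below. Check the comments in your own copy of the file before relying on it.

```shell
# /etc/molly-guard/rc
# Ask for the hostname on every invocation, not only over SSH.
ALWAYS_QUERY_HOSTNAME=true
```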
Other hypothetical checks can force the user to give a reason or work-order for rebooting a server in order to comply with more strict operations regimes.
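As a rough sketch of what such a "give a reason" check could look like: the file name 40-require-reason, the function name and the prompt are hypothetical, and nothing like this ships with molly-guard as far as this post is concerned.

```shell
#!/bin/sh
# Hypothetical check script, e.g. /etc/molly-guard/run.d/40-require-reason
# (file name, function name and prompt are assumptions for this sketch).

require_reason() {
    reason=""
    printf 'Reason or work-order for this action: ' >&2
    # read fails on end-of-file; reason then stays empty.
    read -r reason || :
    if [ -z "$reason" ]; then
        echo "No reason given, aborting." >&2
        return 1
    fi
    echo "Reason recorded: $reason" >&2
    return 0
}

# As a real check script, the file would end by calling the function so that
# its return value becomes the script's exit status:
# require_reason
```

A stricter regime could replace the final echo with a call to a logging or ticketing tool, so that the reason ends up somewhere an auditor can find it.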
On RedHat
Since molly-guard isn’t packaged for EL systems, we use a simpler approach, aliasing the commands reboot, shutdown etc. It is as simple as this:
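clumsy_protect() {
    local cmd="$1"
    shift
    echo "Running $cmd on $HOSTNAME in 2 seconds!"
    # An interrupted sleep (^C) returns non-zero, so we bail out here.
    sleep 2 || return
    # "command" bypasses the alias and runs the real binary.
    command "$cmd" "$@"
}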
alias reboot="clumsy_protect reboot"
alias shutdown="clumsy_protect shutdown"
alias poweroff="clumsy_protect poweroff"
Put the code in /etc/profile.d/clumsy-protect.sh. This gives you 2 seconds to realise that you typed reboot on the wrong machine, and if you hit ^C in time, you can heave a great sigh of relief.
(Aside: the fact that an interrupted sleep returns failure can be used in idioms like while sleep 1; do something; done instead of while true; do something; done, where you may have to mash ^C like a madman to make it stop.)
The downside to the alias approach is that sudo does not look for aliases, so only sysadmins who like to do everything in a root shell get this protection.
More on sudo
In bash, you can work around that alias problem. If an alias value ends with a space, bash will also check the next word of the command for alias expansion!
alias sudo="sudo "
Now the problem becomes that the function clumsy_protect is not available in root’s environment. The easy fix is to put the function in a script file in $PATH instead.
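A minimal sketch of such a script follows. The path /usr/local/sbin/clumsy-protect and the CLUMSY_DELAY variable are assumptions made for this example; the delay knob in particular is not part of the original function. It uses uname -n rather than $HOSTNAME so that it also works under plain sh, and it prints the warning to stderr.

```shell
#!/bin/sh
# Hypothetical standalone version, saved as e.g. /usr/local/sbin/clumsy-protect.
# CLUMSY_DELAY is an assumed knob for this sketch; the default stays at 2 seconds.

clumsy_protect() {
    if [ "$#" -eq 0 ]; then
        echo "usage: clumsy-protect COMMAND [ARGS...]" >&2
        return 64
    fi
    cmd="$1"
    shift
    echo "Running $cmd on $(uname -n) in ${CLUMSY_DELAY:-2} seconds!" >&2
    # If sleep is interrupted (for example by ^C), stop here.
    sleep "${CLUMSY_DELAY:-2}" || return
    # Aliases are not expanded inside scripts, so this finds the real binary.
    command "$cmd" "$@"
}

# Only act when arguments were given, so sourcing the file stays harmless.
[ "$#" -eq 0 ] || clumsy_protect "$@"
```

With the script in place and executable, the aliases would point at it instead of the function, for example alias reboot="clumsy-protect reboot". Whether sudo finds it depends on the directory being in sudo's search path on your system.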
Going further
So you rebooted the correct server - but that quick reboot didn’t turn out to be so quick! That terabyte file-system needs fsck, or worse, your initrd was corrupt!
Introducing the all-singing, all-dancing clumsy_protect. It includes a utility which checks your /etc/fstab for typos like:
- does that LABEL exist?
- did you forget to remove mounting of that logical volume you deleted?
- did you update file-system type when you upgraded from ext3 to ext4?
clumsy_protect also checks that your initrd has the correct format. On RedHat, it even reruns prelink (if needed) before the reboot so that you don’t get that annoying alert that your server is running outdated libraries.
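To give a feel for the first fstab check in the list above ("does that LABEL exist?"), here is a rough sketch. It is not the code from the clumsy_protect repository: the function names are invented, it relies on findfs from util-linux being installed, and it assumes labels without spaces or quoting.

```shell
#!/bin/sh
# Sketch only: not the code from the clumsy_protect repository.

# Print the label of every LABEL=... entry in the given fstab-style file.
# Comment and blank lines never match, since field 1 must start with LABEL=.
fstab_labels() {
    awk '$1 ~ /^LABEL=/ { sub(/^LABEL=/, "", $1); print $1 }' "$1"
}

# Report labels that no block device currently carries.
# Returns non-zero if at least one label is missing.
check_fstab_labels() {
    status=0
    for label in $(fstab_labels "${1:-/etc/fstab}"); do
        if ! findfs "LABEL=$label" >/dev/null 2>&1; then
            echo "fstab: no device with LABEL=$label" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running check_fstab_labels before the reboot, and refusing to continue on a non-zero status, would catch a mistyped or stale label while the machine is still up.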
Pulling it all together
So where can you get this wonder? Look no further - it’s on GitHub!