Alarms made right

At my work, we’re very much dependent on alarms. The systems need to be operational 24/7. When unexpected issues arise, timely manual intervention may be essential. We need good monitoring systems to catch whatever is wrong and good alarm systems to make someone aware that something urgently needs attention. I would claim that we’re quite good at setting up, tuning, and handling alarms.

When I’m not at work, I’m often sailing, frequently single-handed and over longer distances. Alarms are important to me when sailing as well. I’m very annoyed with the alarms from my Raymarine navigation system, hence I felt the need to write a blog post (I really should look into replacing it with OpenCPN one of these days).

Looking at maritime incidents, it’s evident that even premium alarm systems on commercial ships sometimes fall short of preventing entirely preventable accidents.

Defining Alarms

An alarm is an alert that something is wrong and requires immediate attention, typically accompanied by loud or annoying sounds. This causes recipients to stop their current tasks or wake up and respond immediately. The urgency may vary. Per our standard contract, our on-call technician should respond within half an hour (though we’re usually on the ball much faster). Alarms on a fast-moving boat may require a response within seconds. The principle remains the same - this definition even fits the fire alarm.

Monitoring systems detect various issues, but only the most urgent ones should trigger alarms. Less critical issues can be categorized as warnings, which don’t require immediate action but should still be addressed regularly. Unhandled warnings can escalate into alarms if neglected.
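To make the distinction concrete, here is a minimal sketch of a two-level check. The disk-usage metric, the function name and the thresholds (80% and 95%) are assumptions chosen for illustration, not values from any real setup. It shows how a warning that nobody handles turns into an alarm as the underlying problem grows.

```python
def classify(disk_used_percent: float,
             warn_at: float = 80.0,
             alarm_at: float = 95.0) -> str:
    """Map a disk-usage reading to 'ok', 'warning' or 'alarm'.

    The thresholds are hypothetical; real ones have to be tuned per system.
    """
    if disk_used_percent >= alarm_at:
        return "alarm"      # urgent: someone should be interrupted now
    if disk_used_percent >= warn_at:
        return "warning"    # not urgent: handle during regular work
    return "ok"


# A disk that slowly fills up while the warning is left unhandled.
readings = [70.0, 82.0, 90.0, 96.0]
states = [classify(r) for r in readings]
print(states)
```

If somebody cleans up the disk while it is still in the warning state, the alarm never fires at all.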

False positives are alarms that trigger when nothing needs urgent attention, while false negatives are real issues for which no alarm was raised. True positives are valid alarms that require intervention.
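These categories can be tallied from an incident log. The sketch below assumes each log entry is a hypothetical pair of booleans (did an alarm fire, did the situation really need urgent attention); the function names are made up for illustration. A real (true) positive is an entry where both are true.

```python
from collections import Counter


def classify_outcome(alarm_fired: bool, needed_attention: bool) -> str:
    """Name the outcome of one monitoring event."""
    if alarm_fired and needed_attention:
        return "true positive"    # valid alarm
    if alarm_fired:
        return "false positive"   # alarm, but nothing urgent
    if needed_attention:
        return "false negative"   # real issue, no alarm
    return "true negative"        # nothing wrong, no alarm


def tally(events):
    """Count outcomes for an iterable of (alarm_fired, needed_attention)."""
    return Counter(classify_outcome(fired, needed) for fired, needed in events)


# Hypothetical log: three alarms (one of them false) and one missed issue.
log = [(True, True), (True, False), (True, True), (False, True)]
print(tally(log))
```

Counts like these give a meeting something specific to discuss: many false positives suggest tuning, while any false negative suggests missing monitoring.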

Some Maritime Accidents

Every now and then, there is a shipwreck or near-accident, and I believe a well-designed alarm system onboard would have prevented many accidents from happening.

Goðafoss

In 2011, the cargo ship Goðafoss crashed right into a rock in Hvaler, Norway, leaking oil into the national park. The captain didn’t see the rock due to the cargo, and the pilot left the ship too early. Of course, the nav systems knew very well that the rock was there. Of course, the nav systems knew very well about the ship’s intended route, speed, and direction. It’s trivial to alert the captain that the current route is dangerous and that the ship is heading for the rocks.

In November 2018, the Norwegian warship Helge Ingstad ran straight into an oil tanker. It supposedly had some of the most advanced electronics and radar systems that money can buy. It had seven people on the bridge, six of whom didn’t see the tanker (allegedly because of overly bright deck lights on the tanker), and the observant one was waiting for the command to turn the rudder. Like all big ships (and quite a few smaller boats), the oil tanker had an AIS transponder. It is a computationally very easy task to have an alarm system inform the people on the bridge, “Hey, do you realize that you’re on a collision course with an oil tanker? It can still be avoided if you turn the rudder hard now, but in ten seconds, it will be too late!” This can also be done using radar rather than AIS, although that is computationally harder.
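To illustrate how little computation such a warning needs, here is a minimal sketch of the textbook closest-point-of-approach calculation. It assumes both vessels keep a straight course and constant speed, and that positions have already been converted to metres on a local flat plane; the function names and the thresholds (500 metres, ten minutes) are hypothetical. It is not how any particular bridge system works.

```python
import math


def cpa(own_pos, own_vel, other_pos, other_vel):
    """Return (tcpa_seconds, dcpa_metres) for two vessels on straight courses.

    Positions are (x, y) in metres, velocities are (vx, vy) in metres per
    second. A flat local plane and constant velocities are assumptions.
    """
    # Position and velocity of the other vessel relative to our own.
    rx, ry = other_pos[0] - own_pos[0], other_pos[1] - own_pos[1]
    vx, vy = other_vel[0] - own_vel[0], other_vel[1] - own_vel[1]
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        tcpa = 0.0  # same velocity: the distance never changes
    else:
        # Time at which the distance is smallest; if that time is in the
        # past the vessels are moving apart, so the closest point is now.
        tcpa = max(0.0, -(rx * vx + ry * vy) / speed_sq)
    dcpa = math.hypot(rx + vx * tcpa, ry + vy * tcpa)
    return tcpa, dcpa


def collision_alarm(own_pos, own_vel, other_pos, other_vel,
                    min_distance_m=500.0, horizon_s=600.0):
    """True if the other vessel will pass too close within the time horizon."""
    tcpa, dcpa = cpa(own_pos, own_vel, other_pos, other_vel)
    return dcpa < min_distance_m and tcpa <= horizon_s


# Head-on: we go north at 8 m/s, the other vessel comes south at 4 m/s.
print(collision_alarm((0, 0), (0, 8), (0, 4000), (0, -4)))
```

The hard part is not the arithmetic but choosing thresholds that alert on the tanker without also alerting on every moored ship in a port.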

In March 2019, the cruise ship Viking Sky had engine problems on the Norwegian west coast. It was probably only minutes away from becoming a major catastrophe. There were multiple alarms that could have told the crew that the engines needed more oil, but they were dismissed. Add some waves, and the oil pumps in all the engines started sucking air; the engines then shut down automatically to prevent destruction.

For at least the latter two cases, I consider the main problem to be alarm fatigue. Viking Sky had too many alarms (and also a lack of routing - alarms on low oil level in the engine and alarms on the temperature in the swimming pool on the deck went to the same people). The crew grew complacent and ignored important alarms. On Helge Ingstad, there was, on average, one alarm per minute in the hours before the accident, but they never got an alarm on the tanker. To avoid far too many alarms, they had chosen to get alarms only on ships they had manually chosen to track (aka the ships they were aware of). With alarms on all ships enabled, they would have had problems with alarm floods every time they passed a port. This sounds very much like a tuning problem to me.

The Cost of Alarms

Alarms have hidden costs, like productivity loss (or fragile things breaking) as people drop whatever they have in their hands to deal with the alarm. Too many false alarms cause complacency, and the resulting risk that a high-value warship sinks must be counted as a cost. When I take my sailboat into a difficult harbour, I may easily make expensive mistakes as my attention gets diverted into turning off the annoying “low depth” alarm.

I can say a lot of positive things about my previous workplace, but we definitely didn’t nail the alarm bit. Whenever something went wrong with the system, my boss would complain, “Why didn’t we have monitoring on that?” and we would set up yet another monitoring check. However, we never cared much about the monitoring anyway. It was all clogged up with false positives, so the added checks weren’t really useful. We relied on people calling in by telephone, telling us that things didn’t work.

Unlike at my previous workplace, we currently get overtime payments for handling alarms outside of office hours. Arguably, it may give us the perverse incentive to do shoddy daytime work to get extra payments for handling the alarms in the evenings, but on the other hand, the alarms actually trigger real costs that can easily be measured and metered. For us at Redpill Linpro CloudOps, it’s a department-wide priority to keep the number of alarms low. At the same time, we have very few false negatives - very few telephone calls; most of the time, we’re getting alarms and are on the ball long before the customer notices anything is wrong. I’d say that over the years we’ve become quite good at this, and we’re continuously trying to improve. The warnings have less visible costs; admittedly, we’re not equally good at handling warnings.

A key feature in almost any alarm system is the ability to mute alarms for a certain period, both when we expect an alarm to be triggered (sometimes I remove the fire alarm from the ceiling before frying food) and when it has already been triggered (I can push my fire alarm, and it stays muted for 15 minutes; my Raymarine system has several OK buttons that I can press to “clear” the alarm for a generous 15 seconds).
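A time-limited mute is simple to sketch. The class and method names below are hypothetical and not taken from any real alarm product. The same call covers both cases: muting in advance of an expected alarm, and silencing one that has already gone off.

```python
from datetime import datetime, timedelta


class MuteRegistry:
    """Keeps track of which alarms are muted, and until when."""

    def __init__(self):
        self._muted_until = {}  # alarm name -> time at which the mute expires

    def mute(self, alarm_name: str, now: datetime, duration: timedelta) -> None:
        """Silence an alarm for a limited period starting at `now`."""
        self._muted_until[alarm_name] = now + duration

    def is_muted(self, alarm_name: str, now: datetime) -> bool:
        until = self._muted_until.get(alarm_name)
        return until is not None and now < until

    def should_notify(self, alarm_name: str, now: datetime) -> bool:
        """An alarm condition only reaches people when it is not muted."""
        return not self.is_muted(alarm_name, now)


mutes = MuteRegistry()
start = datetime(2024, 3, 1, 18, 0)
mutes.mute("smoke detector", start, timedelta(minutes=15))  # before frying food
print(mutes.should_notify("smoke detector", start + timedelta(minutes=10)))
print(mutes.should_notify("smoke detector", start + timedelta(minutes=16)))
```

Because the mute expires on its own, a forgotten mute cannot silence the alarm forever. The duration matters, though: 15 minutes is useful, 15 seconds much less so.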

Ideally, there should be near-zero alarms (zero alarms would probably also cause complacency and a lack of preparation for alarms to come). In our team and department meetings, we have a fixed agenda point to talk about alarms. For false alarms, the alarm systems should probably be tuned and tweaked or perhaps even turned off. For real alarms, perhaps we need to do things differently so the problem does not appear out of the blue next time.

Conclusion

A well-working alarm system may save your day (or your ship from sinking). To be really useful, alarms should only happen exceptionally and only when urgent attention from the alarm recipients is needed. Almost regardless of the situation, no person should experience getting one alarm per minute. Every alarm should be considered to have a cost, even if it’s not directly visible in the accounting. Many alarm systems need to be improved and tuned continuously. It may be necessary to spend some time thinking through and discussing every alarm and incident thoroughly. Sometimes the alarm came (or didn’t come) because the monitoring and alarms weren’t set up and tuned well enough; other times the alarm could have been avoided by people anticipating and muting the alarm before it was triggered or by avoiding the dangerous situation in the first place.

My domain expertise is limited to system administration (and, to some extent, sailing). I’m admittedly clueless when it comes to subjects like bridge management, and I haven’t researched the subject of alarm management. It would be arrogant of me to claim that I know how to prevent ships from crashing. Still, I strongly believe that my statements above hold true for most alarm systems - alarms on a big ship, fire alarms, health-care alarms, alarm systems in a nuclear reactor, and whatnot. Of course, incidents at our workplace rarely cause deaths or nuclear meltdowns, but even when, and maybe particularly when, the stakes are really high, it’s important to avoid alarm fatigue. The alarm system should be considered broken when the number of alarms is so high that people stop caring about them.

Credits

  • Goðafoss photo, Drdoht, CC BY-SA 3.0, via Wikimedia Commons
  • First round of spelling corrections - ChatGPT (new typos probably introduced after spell check)

Tobias Brox

Senior Systems Consultant at Redpill Linpro

Tobias started working as a developer when he finished his degree at The University of Tromsø. He joined Redpill Linpro as a system administrator a decade ago and has embraced working with our customers and maintaining and improving our internal tools.
