
How to deal with alert fatigue head-on


Everyone experiences stress at work—thankfully, it’s a topic folks aren’t shying away from anymore.

But for on-call engineers, alert fatigue is a phenomenon much closer to home. Unfortunately, it can be just as insidious as stress, and it can drastically impact those it affects.

First discussed in the context of hospital settings, the term later entered engineering circles. Alert fatigue is when an excessive number of alerts overwhelms the individuals responsible for responding to them, often over a prolonged period, resulting in missed or delayed responses, or alerts being ignored altogether.

The impact of this fatigue reaches beyond the individual and can create significant risks for your organization.

But, if you approach on-call the right way, you can mitigate the impacts of alert fatigue or, better yet, avoid it altogether. Here, we'll dive into the tactics teams can implement to address alert fatigue and its underlying causes.

The issue of alert fatigue

Alert fatigue is a common issue impacting various sectors, including healthcare, aviation, IT, and engineering teams.

These days, teams are responsible for a vast web of complex products, used by user bases that grow daily. Put together, the potential for incidents has increased significantly, and with it the volume of pager noise.


The massive volume of alerts generated by the systems monitoring organizations’ products can overwhelm folks, leading to many ignored or dismissed alerts. This reality is precisely why it’s imperative for organizations to proactively address the causes and symptoms of alert fatigue before they spiral into something untenable.

Some causes of alert fatigue

To address alert fatigue effectively, it helps to first understand its many causes.

  • An overwhelming volume of alerts: When responders are bombarded with alerts left and right, it becomes easy to eventually tune them all out as noise.
  • A lack of ongoing alert management: Operationally mature teams do their best to ship new code with alerts to confirm it works as expected. Often these alerts get shipped, but never reviewed again.
  • The absence of clarity around triggered alerts: When responders look at an alert and wonder why it made it through in the first place, it can be easy to dismiss it. This becomes an issue when one legitimately important incident comes through during a barrage of inconsequential ones.
  • A lack of correlation between similar alerts: Several alerts being triggered for similar incidents is a common occurrence. But when these alerts aren’t appropriately grouped, it can create the feeling of dealing with multiple incidents at the same time, even though there’s likely only one underlying issue.
  • Poorly configured alert thresholds: If your alert thresholds are set very low, meaning even low-severity or non-impacting issues trigger alerts, folks on-call can end up responding to many incidents over the course of a shift. Down the line, this can lead to desensitization and delayed responses.

Practical approaches to combat alert fatigue

Preventing alert fatigue is possible, but it requires a proactive approach. Here are some practical steps to tackle this widespread issue head-on:

Treat every alert as an incident

This is likely to feel a little uncomfortable, especially in a world where you’re drowning in alerts, but it’s the single most important change you can make to your workflow on the path to improvement.

By treating every alert (or group of alerts) as an incident, you can start to build up a picture of what you're actually doing when they fire. This is vital in completing the feedback loop, and it will quickly help you spot the alerts that correlate with no action, and those that are commonly linked to high-severity incidents.
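To make that concrete, here’s a minimal sketch of the kind of feedback loop this unlocks, assuming you can export a list of incidents with the alert that opened them and the outcome your team recorded. The field names and data here are invented for illustration; your incident tooling will have its own export format.

```python
from collections import Counter

# Hypothetical export: one record per incident, with the alert that
# opened it and the outcome the team recorded afterwards.
incidents = [
    {"alert": "api-latency-p99", "outcome": "no_action"},
    {"alert": "api-latency-p99", "outcome": "no_action"},
    {"alert": "db-replica-lag", "outcome": "major_incident"},
    {"alert": "disk-usage-80pct", "outcome": "minor_fix"},
    {"alert": "api-latency-p99", "outcome": "major_incident"},
]

# Tally outcomes per alert so the noisy ones stand out.
by_alert: dict[str, Counter] = {}
for record in incidents:
    by_alert.setdefault(record["alert"], Counter())[record["outcome"]] += 1

for alert, outcomes in by_alert.items():
    total = sum(outcomes.values())
    noise = outcomes["no_action"] / total
    print(f"{alert}: fired {total} times, {noise:.0%} led to no action")
```

Even a crude tally like this makes the “fires constantly, never actionable” alerts obvious.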

Filter and prioritize

Another important change you can make is to ensure that alerts waking engineers out-of-hours genuinely require immediate attention. You want to avoid disturbing someone for low-priority alerts that could wait until their working day or be ignored altogether.
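What counts as “working hours” and “page-worthy” will vary by team and tooling, but the routing decision itself is small. Here’s a rough sketch, with invented priority labels and hours:

```python
from datetime import datetime, time

WORKING_HOURS = (time(9, 0), time(17, 0))   # assumed local working day
PAGE_WORTHY = {"critical", "high"}          # invented priority labels

def should_page_now(priority: str, now: datetime) -> bool:
    """Page immediately for urgent alerts; defer the rest to working hours."""
    in_hours = WORKING_HOURS[0] <= now.time() <= WORKING_HOURS[1]
    if priority in PAGE_WORTHY:
        return True      # always wake someone for genuinely urgent issues
    return in_hours      # low-priority alerts wait for the working day

# A "low" alert at 01:30 goes to a morning queue, not someone's pager.
print(should_page_now("low", datetime(2024, 3, 12, 1, 30)))       # False
print(should_page_now("critical", datetime(2024, 3, 12, 1, 30)))  # True
```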

Regular reviews of alert effectiveness, informed by team feedback and system changes, can help ensure alerts remain relevant and actionable.

In fact, there’s a debate to be had around whether low severity alerts should exist at all. If they’re purely informational and don’t require action, you could make a case that they should be replaced by dashboards.

If you’re linking alerts to incidents, an easy way to stay on top of whether the alerts require action is to include an alert review step as part of your post-incident processes.

If alerts fired but didn’t provide a useful signal, or they were false alarms, they should be considered immediate candidates for a reduced priority, or for removal entirely. Conversely, an alert review step is a good opportunity to add alerts that might be missing!


Tune your thresholds

Often your alerts will be directionally good, but poorly tuned. Thresholds can feel like a moving target, but a review of metrics to understand your systems' normal operating ranges can quickly surface misconfigurations.
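One way to anchor a threshold in your system’s normal operating range, rather than a guess, is to derive it from recent measurements, for example a high percentile plus some headroom. Here’s a small sketch with made-up latency samples; the percentile and headroom are illustrative choices, not recommendations:

```python
import statistics

# Hypothetical p99 request latencies (ms) sampled during normal operation.
baseline_ms = [120, 135, 150, 142, 128, 160, 155, 149, 131, 170, 144, 138]

# Alert only when we are well clear of the normal operating range:
# here, the 95th percentile of the baseline plus 50% headroom.
p95 = statistics.quantiles(baseline_ms, n=20)[18]  # 19th cut point = p95
threshold_ms = p95 * 1.5

print(f"baseline p95 ≈ {p95:.0f} ms, alert threshold ≈ {threshold_ms:.0f} ms")
```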


We’ve seen success in the past by introducing a recurring meeting to review the alerts that have been fired. If you sort your alerts by how often they’ve fired over the review period, and work from top to bottom, you can make a huge step forward in reducing noise by tackling the worst offenders.
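If you can export alert counts from your monitoring tooling, producing the agenda for that review is a simple sort. A sketch with invented numbers:

```python
# Hypothetical fire counts over the review period (say, the last 30 days).
fire_counts = {
    "api-latency-p99": 412,
    "queue-depth-high": 190,
    "disk-usage-80pct": 57,
    "db-replica-lag": 3,
}

# Work from the noisiest alert downwards: tackling the top one or two
# usually removes most of the pager noise.
for name, count in sorted(fire_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{count:>4}  {name}")
```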

Look for patterns and groups

Effective alert consolidation and de-duplication can significantly reduce notification noise. By grouping alerts into incidents, you can start to track which ones fire for the same or similar underlying causes.

Many monitoring and alerting tools let you configure alert suppression rules, which allow you to say: “If alert X is firing, don’t tell me about alert Y.” A good example of this: if a database failure causes multiple services to alert, consider reconfiguring your alerts to avoid a barrage of noise.
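The syntax differs between tools (Prometheus Alertmanager calls these inhibition rules, for instance), but the idea fits in a few lines. A sketch using made-up alert names:

```python
# Hypothetical suppression map: while the key alert is firing, drop
# notifications for the downstream alerts listed against it.
SUPPRESSION_RULES = {
    "database-primary-down": {"orders-service-errors", "billing-service-errors"},
}

def notifiable(firing: set[str]) -> set[str]:
    """Return the alerts worth notifying about, given everything firing now."""
    suppressed: set[str] = set()
    for parent, children in SUPPRESSION_RULES.items():
        if parent in firing:
            suppressed |= children
    return firing - suppressed

# One database failure: page about the cause, not every downstream symptom.
print(notifiable({"database-primary-down",
                  "orders-service-errors",
                  "billing-service-errors"}))
# -> {'database-primary-down'}
```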

Don’t be afraid to remove alerts!

In our experience, people are far happier to add alerts than to remove them. It stands to reason since adding an alert provides a sense of security: “I can worry less if I know an alert will fire here.”

And it’s the reverse of this situation that makes removing alerts a less comfortable practice. When you ask about it, you’ll almost certainly hear: “But what if X happens and we don’t know about it?”

When it comes to removing alerts, here are a few things to consider:

  • When did the alert last fire?: Any alerts that have fired frequently and not provided a signal on a genuine issue should be removed. Any alerts that have provided mixed signals on real issues and false alarms should be candidates for reconfiguration first, and then removal.
  • Do other alerts provide a higher signal?: If you have other alerts that provide more useful signals on genuine issues, then there’s little use to having duplicates. This will only contribute to the fatigue.
  • If the alert is mostly noise, is it ever going to be useful?: Even if the alert helps to catch an incident some proportion of the time, if it fires false alarms most of the time, people will tune it out. Keeping this alert around is a net negative.
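Taken together, these questions can be turned into a rough rule of thumb. Here’s a sketch, assuming you track when each alert last fired and how often it pointed at a genuine issue; the names, numbers, and cutoffs are illustrative, not prescriptive:

```python
from datetime import date, timedelta

# Hypothetical per-alert stats gathered from your incident history.
alerts = [
    {"name": "api-latency-p99", "last_fired": date(2024, 3, 10),
     "fires": 412, "real_issues": 6},
    {"name": "legacy-cron-failed", "last_fired": date(2023, 6, 2),
     "fires": 1, "real_issues": 0},
    {"name": "db-replica-lag", "last_fired": date(2024, 3, 1),
     "fires": 9, "real_issues": 8},
]

today = date(2024, 3, 12)

for alert in alerts:
    signal = alert["real_issues"] / alert["fires"]          # how often it was real
    stale = (today - alert["last_fired"]) > timedelta(days=180)
    if stale or signal < 0.1:
        verdict = "candidate for reconfiguration or removal"
    else:
        verdict = "keep"
    print(f"{alert['name']}: signal {signal:.0%}, {verdict}")
```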

Think about the humans

The scheduling of on-call rotations greatly affects exposure to alerts. Strive for balance to prevent burnout, ensuring enough responders are in rotation without losing familiarity with the system.


It can be helpful to consider various scheduling types, like daily, weekly, or follow-the-sun rotations to accommodate geographical and workload diversity.

Here's a quick breakdown of other rotation types you'll come across, but there are many more!

  • Bi-weekly: The bi-weekly on-call schedule rotates team members every other week, or roughly twice a month.
  • Week and weekends: In a week/weekend schedule, one set of team members is on-call during the week, while another set takes over during the weekend. This schedule is handy when overnight hours are involved, as it gives employees breaks from night shifts.
  • Follow-the-sun: This schedule arranges on-call team members based on their work locations. Follow-the-sun schedules are ideal for remote teams with members across different geographic areas. It ensures that there is always an employee available during their regular work hours to handle incidents.
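To make the mechanics concrete, here’s a minimal sketch of a plain weekly rotation, with invented names and dates. Real on-call tooling handles overrides, handoffs, and time zones for you, so treat this as an illustration of the shape of a rota rather than something to run in production:

```python
from datetime import date, timedelta

engineers = ["ana", "bo", "chen", "dee"]   # hypothetical rota members
rotation_start = date(2024, 1, 1)          # a Monday

def on_call_for(day: date) -> str:
    """Weekly rotation: each engineer takes one full week in turn."""
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]

# Publish the next few handoffs so everyone can plan around their week.
for week in range(4):
    monday = rotation_start + timedelta(weeks=week)
    print(f"week of {monday}: {on_call_for(monday)}")
```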

Alert fatigue doesn’t have to be the rule

It’s true that alert fatigue affects a lot of people on-call. But the good news is that more people are talking about it openly and are treating it as the exception, not the rule.

No one should be expected to be the hero for their organization and spend days on end on-call. No one should deal with dozens of low-severity alerts at 1 AM. The reality is that working these shifts is hard, disruptive, and can get in the way of personal life and responsibilities.

The best thing you can do is set up processes to ensure that, when folks are on-call, they do so in a way that is fair for everyone involved.

By making it a less stressful experience, you can help everyone do their best work, reduce operational risks, and ultimately deal with incidents much more effectively.
