Principles for incident response
Join Sarah Manicke and me for a lightning-fast incident investigation that highlights the principles every incident responder should master.
I enjoy working with junior software engineers. They are eager to learn, and they are a great reminder of the path behind us that we can help trace for them.
A few weeks ago, I was in a 1:1 with Sarah Manicke, a brilliant junior software engineer on my team, when I was called in to help with an ongoing incident investigation. Although it interrupted us, Sarah was excited about the learning opportunity, as she was looking forward to joining the incident response team. We joined the incident chat, got briefed, and opened a graph that had been shared with us. It tracked a non-critical operation of our Android app and showed a sharp increase in error rate, almost from 0 to 100%.
I opened our routing changelog dashboard to look for a corresponding change in our skipper configuration, and found a changeset that matched the exact time of the error rate spike. It was not immediately obvious that this was the culprit, but it was relatively harmless to revert, so we did. The graph we had opened earlier began to show a sharp drop in error rate, which confirmed that the changeset was indeed the root cause.
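The correlation I did by eye on the dashboard can be sketched in a few lines of Python. This is a minimal illustration and not our actual tooling: `ChangeEvent`, `changes_near_spike`, and the five-minute window are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One entry from a changelog (hypothetical shape, for illustration only)."""
    source: str          # e.g. "routing", "deployment", "feature-toggle"
    description: str
    applied_at: datetime


def changes_near_spike(changes, spike_start, window=timedelta(minutes=5)):
    """Return changes applied at most `window` before the spike, closest first.

    Changes applied after the spike started are ignored, since they
    cannot have caused it.
    """
    candidates = [
        change for change in changes
        if timedelta(0) <= spike_start - change.applied_at <= window
    ]
    return sorted(candidates, key=lambda change: spike_start - change.applied_at)


if __name__ == "__main__":
    spike = datetime(2024, 1, 1, 12, 0)  # made-up timestamp
    changelog = [
        ChangeEvent("routing", "update route", datetime(2024, 1, 1, 11, 59)),
        ChangeEvent("deployment", "service rollout", datetime(2024, 1, 1, 9, 0)),
        ChangeEvent("feature-toggle", "flip flag", datetime(2024, 1, 1, 12, 3)),
    ]
    for change in changes_near_spike(changelog, spike):
        print(change.source, change.description, change.applied_at)
```

A short window keeps the candidate list small. A match in time is only a hint, which is why we confirmed it by reverting and watching the graph.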
At this point, Sarah declared, “Wow, you’re so lucky, you immediately opened the right dashboard and found the issue.” I was a bit taken aback by this statement, because when I was paged, people had already been investigating for some time and had not found the issue. I started reflecting with Sarah on what had happened in my incident responder brain that let me find the root cause so quickly.
There are two core principles that incident responders should master: 1) understanding the symptoms of an incident and 2) knowing where to look for changes. Incidents always happen as a result of a system change, which may even be out of your control.
1. Understanding the symptoms of an incident is important because it tells you what you are dealing with. Do you have a complete disruption of service? A high error rate or a slow burn? Is it happening across all platforms? Is the increase steady or sharp? In our case with Sarah, we saw a very sharp increase in errors at one point in time. That already ruled out a new app release, because adoption takes several days. It also ruled out service deployments, because we use canary deployments (staged releases), which roll a change out gradually.
2. Knowing where to look is part of your incident training. You should have a high-level idea of the places where changes can lead to issues. This includes deployments, of course, but also configuration and feature toggle changes.
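The elimination step behind these two principles can be written down as a small lookup. This is a sketch under stated assumptions: the rollout profile assigned to each change source reflects the setup described above (staged app adoption, canary deployments) and will differ elsewhere, and `Onset`, `ROLLOUT_PROFILE`, and `plausible_sources` are hypothetical names.

```python
from enum import Enum


class Onset(Enum):
    SHARP = "sharp"      # error rate jumps at a single point in time
    GRADUAL = "gradual"  # error rate climbs over hours or days


# Assumed rollout profile per change source (illustrative, not universal).
ROLLOUT_PROFILE = {
    "app release": Onset.GRADUAL,            # adoption takes several days
    "service deployment": Onset.GRADUAL,     # canary / staged release
    "configuration change": Onset.SHARP,     # takes effect at once
    "feature toggle change": Onset.SHARP,    # takes effect at once
}


def plausible_sources(observed_onset):
    """Return the change sources worth checking first for a given onset.

    A sharp onset rules out sources that only roll out gradually.
    A gradual onset rules nothing out, so every source is returned.
    """
    if observed_onset is Onset.SHARP:
        return [
            source for source, profile in ROLLOUT_PROFILE.items()
            if profile is Onset.SHARP
        ]
    return list(ROLLOUT_PROFILE)


if __name__ == "__main__":
    print(plausible_sources(Onset.SHARP))
```

The point is less the code than the habit: read the shape of the symptom first, then let it shrink the list of places you need to check.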
Since the issue was coming from one of our apps and showed a sharp increase, we were probably looking at a configuration change. What tipped me off to the routing change was that we could not find any traces of network calls behind our reverse proxy, which suggested that requests were failing before they reached our services. So I opened my bookmarked routing changelog dashboard, and we were able to resolve the issue. The best part? A junior software engineer learned the ropes and will be ready to be a great incident responder.