Like death and taxes, IT incidents are inevitable. Issues like server outages and broken code are so common that, according to the data analytics company Splunk, many companies experience about five incidents per month, at a cost of more than $100,000 each. Annually, incidents cost enterprise businesses $700 billion in lost productivity, according to the incident response platform PagerDuty and the research company IHS Markit. That’s why a solid incident management strategy is a must for any organization.
“People solve incidents, but we can’t do it alone,” said Ali Rayl, Slack’s vice president of customer experience, at our recent Frontiers conference. “We all need to coordinate across an evolving set of apps, data and information in order to get ourselves through an incident to a positive resolution.”
At Frontiers, we gathered specialists from PagerDuty, the telecom company T-Mobile and RPI Ambulance, an emergency medical services agency based at Rensselaer Polytechnic Institute, to talk about how they use Slack to anticipate, monitor and manage incidents. Here’s a look into some of their strategies and best practices.
Incident management 101: Using Slack channels to gather the right experts
“When something is going on, when something is burning, you need to do things in real time,” said Rachel Obstler, PagerDuty’s vice president of product. “You don’t have the time to send questions up the ranks and wait for someone to respond, so traditional command and control just doesn’t work.”
PagerDuty serves more than 11,000 companies in various industries, from fashion to finance. Its app for Slack connects our two platforms so that our shared users can analyze and resolve incidents within Slack channels, without having to switch tools. This helps the appropriate incident response team members organize their data in one place and collectively determine the best course of action.
Obstler said it’s best practice to limit channel participation to the people who are actively involved in resolving the incident. Anyone else may be tempted to ask questions about why the incident happened, which creates distractions and wastes valuable time.
“The best person to solve an incident is the person who last pushed the code or the person who solved the last incident,” Obstler said. “They need to be able to work with each other in a collaborative way and solve the problem.”
Looping in leadership without creating interruptions
Jon Soini, a principal technology product manager at T-Mobile, described how his team uses Slack channels to keep leadership and stakeholders informed during an incident. While stakeholders are invited to watch activity in the channel, they understand when to ask questions and when to step back and let their teams resolve the incident.
“That [Slack] channel may be for that technology team, so it’s kind of like [their] home,” Soini said. “You don’t go in and start breaking all the china, right? Hopefully, if you’re a stakeholder, you can watch and be a part of it, which I think is really cool.”
He added that T-Mobile’s teams choose a scribe to capture notes as incidents are resolved. This practice creates transparent records for stakeholders, executives and other team members to review later.
“The incident commander outranks the CEO when you have an incident,” Obstler said. Even so, it’s still critical to give stakeholders status updates.
“It’s almost like two parallel sets of work have to happen: thinking about how you have to communicate to everyone, while fixing the problem itself. You don’t want any distractions while fixing the problem.”
Using Slack apps and integrations across the incident management lifecycle
T-Mobile uses automated Slack integrations to look out for underlying issues within the network’s infrastructure. Soini said it’s important to use an alert system that does more than signal a potential weak link. Incident teams need to receive information about:
- The duration of the alert
- The behavior of the service before the alert
- The system activity after the alert triggered
Slack channels are a good place to capture all of this data, so the entire team shares visibility and can work off the same set of information.
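As a rough illustration of the three details listed above, here is a minimal Python sketch of an alert summary that carries all of them. The `AlertContext` class, its field names and the sample values are hypothetical and are not taken from T-Mobile's setup. The sketch only builds the message text; posting it to a channel would be handled by whatever integration a team already uses.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AlertContext:
    """Hypothetical container for the three details an incident team needs."""
    service: str              # name of the affected service (illustrative)
    duration: timedelta       # how long the alert has been firing
    behavior_before: str      # how the service behaved before the alert
    activity_after: str       # what the system did after the alert triggered


def format_alert_message(alert: AlertContext) -> str:
    """Build a plain-text summary suitable for an incident channel."""
    minutes = int(alert.duration.total_seconds() // 60)
    return "\n".join([
        f"Alert on {alert.service}",
        f"Duration: {minutes} min",
        f"Before the alert: {alert.behavior_before}",
        f"After the alert: {alert.activity_after}",
    ])


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    example = AlertContext(
        service="checkout-api",
        duration=timedelta(minutes=12),
        behavior_before="p95 latency steady at 200 ms",
        activity_after="error rate rose to 4%",
    )
    print(format_alert_message(example))
```

Keeping the before-and-after context in the same message as the alert means responders do not have to hunt through dashboards to reconstruct it.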
But sometimes incidents happen despite a team’s best efforts to stay on top of the system’s health. After incidents are resolved, T-Mobile also turns to Slack for postmortem analysis.
Soini said, “Slack is fantastic for bringing everyone together in the same space so you have a written record of what’s been going on. You can see which alerts and integrations have fired.”
The PagerDuty app for Slack lets teams:
- Trigger, view, acknowledge and resolve PagerDuty incidents directly in Slack
- Streamline collaboration during incidents
- Maintain transparent communications throughout the incident lifecycle
- Reduce resolution times
To help keep incident logs organized, T-Mobile is experimenting with uploading custom Slack emoji to its workspace that signify different milestones in the incident timeline. “It’s a way to pick out the signals from the noise over what could be many, many hours,” Soini said.
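To show how milestone emoji can separate signal from noise, here is a small Python sketch that filters a channel log down to the messages tagged with a milestone marker. The emoji names and the message format (a list of dictionaries with `ts` and `text` keys) are assumptions made for illustration; the article does not say which emoji T-Mobile uses or how its logs are stored.

```python
from typing import NamedTuple

# Hypothetical custom emoji names, chosen only for this example.
MILESTONE_EMOJI = (
    ":incident-detected:",
    ":incident-mitigated:",
    ":incident-resolved:",
)


class Milestone(NamedTuple):
    ts: float   # message timestamp in seconds
    emoji: str  # the milestone marker found in the message
    text: str   # full message text


def extract_milestones(messages, emoji_names=MILESTONE_EMOJI):
    """Return only the messages tagged with a milestone emoji, oldest first."""
    found = []
    for message in messages:
        for emoji in emoji_names:
            if emoji in message["text"]:
                found.append(Milestone(message["ts"], emoji, message["text"]))
                break  # record each message once, under the first marker that matches
    return sorted(found, key=lambda m: m.ts)


if __name__ == "__main__":
    # Made-up channel log, for illustration only.
    log = [
        {"ts": 30.0, "text": "Rolled back the deploy :incident-mitigated:"},
        {"ts": 10.0, "text": ":incident-detected: pager fired for checkout"},
        {"ts": 20.0, "text": "Checking dashboards now"},
    ]
    for milestone in extract_milestones(log):
        print(milestone.ts, milestone.emoji)
```

The result is a short, ordered timeline that a postmortem can start from, however long the original conversation ran.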
Incident management requires swift and smart teamwork. PagerDuty and T-Mobile address incidents in Slack by:
- Setting up a dedicated channel for incident response the moment an incident happens. In these channels, teammates loop in the right people and data, assign tasks and follow up on incident status and next steps
- Adding apps to Slack, like PagerDuty, that integrate with their existing software stack and that send alerts to a Slack channel the second something is out of sync
- Reviewing channel history so that teams can host objective postmortems after an incident
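One way to make dedicated incident channels quick to create and easy to find later is a predictable naming scheme. The Python sketch below is a hypothetical helper, not something PagerDuty or T-Mobile describes. It assumes channel names must be lowercase, free of spaces and no longer than 80 characters, and the `inc` prefix is an arbitrary choice.

```python
import re
from datetime import date


def incident_channel_name(summary: str, opened: date, prefix: str = "inc") -> str:
    """Build a channel name such as 'inc-2024-05-01-checkout-errors'."""
    # Lowercase the summary and collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"{prefix}-{opened.isoformat()}-{slug}"
    # Stay within the assumed 80-character limit, then drop any trailing hyphen.
    return name[:80].rstrip("-")


if __name__ == "__main__":
    print(incident_channel_name("Checkout errors!", date(2024, 5, 1)))
```

Putting the date first in the name keeps incident channels sorted chronologically, which helps when teams return to the channel history for a postmortem.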
As Rayl said, “Incidents do happen to everyone no matter what industry you’re in, no matter what business you’re in. They’re just part of what we do. Slack can speed up that time to resolution by bringing the right people, information and apps together in one place.”