Transformation

Stuff happens: using Slack for incident management

IT incidents cost enterprises billions of dollars in lost productivity each year. Here’s how Slack helps teams stay ahead

By Lauren Johnson, June 12, 2019

Like death and taxes, IT incidents are inevitable. Issues like server outages and broken code are so common that, according to the data analytics company Splunk, many companies experience about five incidents per month, at more than $100,000 a pop. Annually, incidents cost enterprise businesses $700 billion in lost productivity, according to the incident response platform PagerDuty and research company IHS Markit. That’s why a solid incident management strategy is a must for any organization.

“People solve incidents, but we can’t do it alone,” said Ali Rayl, Slack’s vice president of customer experience, at our recent Frontiers conference. “We all need to coordinate across an evolving set of apps, data and information in order to get ourselves through an incident to a positive resolution.”

At Frontiers, we gathered specialists from PagerDuty, the telecom company T-Mobile and RPI Ambulance, an emergency medical services agency based at Rensselaer Polytechnic Institute, to talk about how they use Slack to anticipate, monitor and manage incidents. Here’s a look into some of their strategies and best practices.

Incident management 101: Using Slack channels to gather the right experts

“When something is going on, when something is burning, you need to do things in real time,” said Rachel Obstler, PagerDuty’s vice president of product. “You don’t have the time to send questions up the ranks and wait for someone to respond, so traditional command and control just doesn’t work.”

PagerDuty serves more than 11,000 companies in various industries, from fashion to finance. Its app for Slack connects our two platforms so that our shared users can analyze and resolve incidents within Slack channels, without having to switch tools. This helps the appropriate incident response team members organize their data in one place and collectively determine the best course of action.

Obstler said it’s best practice to limit channel participation to the people who are actively involved in resolving the incident. Anyone else may be tempted to ask questions about why the incident happened, which creates distractions and wastes valuable time.

“The best person to solve an incident is the person who last pushed the code or the person who solved the last incident,” Obstler said. “They need to be able to work with each other in a collaborative way and solve the problem.”

Looping in leadership without creating interruptions

Jon Soini, a principal technology product manager at T-Mobile, described how his team uses Slack channels to keep leadership and stakeholders informed during an incident. While stakeholders are invited to watch activity in the channel, they understand when to ask questions and when to step back and let their teams resolve the incident.

“That [Slack] channel may be for that technology team, so it’s kind of like [their] home,” Soini said. “You don’t go in and start breaking all the china, right? Hopefully, if you’re a stakeholder, you can watch and be a part of it, which I think is really cool.”

He added that T-Mobile’s teams choose a scribe to capture notes as incidents are resolved. This practice creates transparent records for stakeholders, executives and other team members to review later.  

“The incident commander outranks the CEO when you have an incident,” said Obstler, although it’s still critical to give stakeholders status updates.

“It’s almost like two parallel sets of work have to happen: thinking about how you have to communicate to everyone, while fixing the problem itself. You don’t want any distractions while fixing the problem.”

Using Slack apps and integrations across the incident management lifecycle

T-Mobile uses automated Slack integrations to look out for underlying issues within the network’s infrastructure. Soini said it’s important to use an alert system that does more than signal a potential weak link. Incident teams need to receive information about:

  • The duration of the alert
  • The behavior of the service before the alert
  • The system activity after the alert triggered

Slack channels are a good place to capture all of this data, so the entire team shares visibility and can work off the same set of information.
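
As a rough illustration of the three pieces of context listed above, here is a minimal Python sketch that bundles them into one record and renders a message body. The `AlertContext` class, its field names and `to_slack_message` are hypothetical and are not taken from T-Mobile's setup. The only Slack-specific assumption is that a message payload is a JSON object with a `text` field, which is the shape Slack incoming webhooks accept.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AlertContext:
    """Hypothetical record of the context an incident team needs for one alert."""
    service: str
    duration: timedelta      # how long the alert has been firing
    behavior_before: str     # what the service looked like before the alert
    activity_after: str      # what the system did after the alert triggered


def to_slack_message(alert: AlertContext) -> dict:
    """Render the alert as a payload with a single `text` field."""
    minutes = int(alert.duration.total_seconds() // 60)
    lines = [
        f"Alert on {alert.service}",
        f"Duration: {minutes} min",
        f"Before the alert: {alert.behavior_before}",
        f"After the alert: {alert.activity_after}",
    ]
    return {"text": "\n".join(lines)}


# Example values are made up for illustration.
example = AlertContext(
    service="checkout-api",
    duration=timedelta(minutes=12),
    behavior_before="p99 latency steady",
    activity_after="error rate rising",
)
print(to_slack_message(example)["text"])
```

Keeping all three fields in one message means everyone in the channel reads the same summary, rather than piecing it together from separate alerts.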

But sometimes incidents happen despite a team’s best efforts to stay on top of the system’s health. After incidents are resolved, T-Mobile also turns to Slack for postmortem analysis.

Soini said, “Slack is fantastic for bringing everyone together in the same space so you have a written record of what’s been going on. You can see which alerts and integrations have fired.”

Use the PagerDuty app for Slack to:
  • Trigger, view, acknowledge, and resolve PagerDuty incidents directly in Slack
  • Streamline collaboration during incidents
  • Maintain transparent communications throughout the incident lifecycle
  • Reduce resolution times

To help keep incident logs organized, T-Mobile is experimenting with uploading custom Slack emoji to its workspace that signify different milestones in the incident timeline. “It’s a way to pick out the signals from the noise over what could be many, many hours,” Soini said.
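
One way to picture the milestone-emoji idea is a small filter over channel history. The sketch below is an assumption-heavy illustration, not T-Mobile's tooling: the emoji names in `MILESTONES` are invented, it assumes the emoji are typed into the message text as `:name:`, and it assumes each message is a dictionary with `ts` and `text` fields, as in the message objects Slack returns for channel history.

```python
import re

# Hypothetical custom emoji names; a real workspace would define its own.
MILESTONES = {"incident-start", "mitigated", "resolved"}

# Matches emoji shortcodes such as :incident-start: inside message text.
EMOJI = re.compile(r":([a-z0-9_+\-]+):")


def milestone_timeline(messages):
    """Return (timestamp, milestone, text) tuples for tagged messages, oldest first."""
    timeline = []
    for msg in messages:
        text = msg.get("text", "")
        for name in EMOJI.findall(text):
            if name in MILESTONES:
                timeline.append((float(msg["ts"]), name, text))
    return sorted(timeline)


# Made-up channel history for illustration.
history = [
    {"ts": "1560340900.000100", "text": ":mitigated: rolled back the deploy"},
    {"ts": "1560340800.000200", "text": ":incident-start: checkout errors spiking"},
    {"ts": "1560340850.000300", "text": "still looking at logs"},
]
for ts, milestone, text in milestone_timeline(history):
    print(ts, milestone, text)
```

Untagged messages drop out, which leaves only the milestones a postmortem reader would want to scan first.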

Incident management requires swift and smart teamwork. PagerDuty and T-Mobile address incidents in Slack by:

  • Setting up dedicated channels for incident response the moment an incident happens. In these channels, teammates loop in the right people and data, assign tasks, and follow up on incident status and next steps
  • Adding apps to Slack, like PagerDuty, that integrate with their existing software stack and that send alerts to a Slack channel the second something is out of sync
  • Reviewing channel history so that teams can host objective postmortems after an incident
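
For the first practice in the list, a team might settle on a predictable naming scheme so that incident channels are easy to find later. The helper below is a hypothetical sketch: the `inc-<date>-<summary>` convention and the function name are invented for illustration. It assumes Slack's channel naming limits of lowercase letters, numbers, hyphens and underscores, with a maximum of 80 characters.

```python
import re
from datetime import date


def incident_channel_name(opened: date, summary: str, max_len: int = 80) -> str:
    """Build a channel name such as 'inc-20190612-checkout-outage' (hypothetical convention)."""
    # Lowercase the summary and collapse anything that is not a letter or digit into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"inc-{opened:%Y%m%d}-{slug}"
    # Truncate to the length limit and avoid ending on a hyphen.
    return name[:max_len].rstrip("-")


print(incident_channel_name(date(2019, 6, 12), "Checkout outage!"))
```

Putting the date first keeps the channels sorted chronologically in any alphabetical listing.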

As Rayl said, “Incidents do happen to everyone no matter what industry you’re in, no matter what business you’re in. They’re just part of what we do. Slack can speed up that time to resolution by bringing the right people, information and apps together in one place.”

Watch the full Frontiers 2019 session
