
Stuff happens: using Slack for incident management

IT incidents cost enterprises billions of dollars in lost productivity each year. Here’s how Slack helps teams stay ahead

By Lauren Johnson, June 12, 2019

Like death and taxes, IT incidents are inevitable. Issues like server outages and broken code are so common that, according to the data analytics company Splunk, many companies experience about five incidents per month, at a cost of more than $100,000 each. Annually, incidents cost enterprise businesses $700 billion in lost productivity, according to the incident response platform PagerDuty and research company IHS Markit. That’s why a solid incident management strategy is a must for any organization.

“People solve incidents, but we can’t do it alone,” said Ali Rayl, Slack’s vice president of customer experience, at our recent Frontiers conference. “We all need to coordinate across an evolving set of apps, data and information in order to get ourselves through an incident to a positive resolution.”

At Frontiers, we gathered specialists from PagerDuty, the telecom company T-Mobile and RPI Ambulance, an emergency medical services agency based at Rensselaer Polytechnic Institute, to talk about how they use Slack to anticipate, monitor and manage incidents. Here’s a look into some of their strategies and best practices.

Incident management 101: Using Slack channels to gather the right experts

“When something is going on, when something is burning, you need to do things in real time,” said Rachel Obstler, PagerDuty’s vice president of product. “You don’t have the time to send questions up the ranks and wait for someone to respond, so traditional command and control just doesn’t work.”

PagerDuty serves more than 11,000 companies in various industries, from fashion to finance. Its app for Slack connects our two platforms so that our shared users can analyze and resolve incidents within Slack channels, without having to switch tools. This helps the appropriate incident response team members organize their data in one place and collectively determine the best course of action.

Obstler said it’s best practice to limit channel participation to the people who are actively involved in resolving the incident. Anyone else may be tempted to ask questions about why the incident happened, which creates distractions and wastes valuable time.

“The best person to solve an incident is the person who last pushed the code or the person who solved the last incident,” Obstler said. “They need to be able to work with each other in a collaborative way and solve the problem.”

Looping in leadership without creating interruptions

Jon Soini, a principal technology product manager at T-Mobile, described how his team uses Slack channels to keep leadership and stakeholders informed during an incident. While stakeholders are invited to watch activity in the channel, they understand when to ask questions and when to step back and let their teams resolve the incident.

“That [Slack] channel may be for that technology team, so it’s kind of like [their] home,” Soini said. “You don’t go in and start breaking all the china, right? Hopefully, if you’re a stakeholder, you can watch and be a part of it, which I think is really cool.”

He added that T-Mobile’s teams choose a scribe to capture notes as incidents are resolved. This practice creates transparent records for stakeholders, executives and other team members to review later.  

“The incident commander outranks the CEO when you have an incident,” said Obstler, although it’s still critical to give stakeholders status updates.

“It’s almost like two parallel sets of work have to happen: thinking about how you have to communicate to everyone, while fixing the problem itself. You don’t want any distractions while fixing the problem.”

Using Slack apps and integrations across the incident management lifecycle

T-Mobile uses automated Slack integrations to look out for underlying issues within the network’s infrastructure. Soini said it’s important to use an alert system that does more than signal a potential weak link. Incident teams need to receive information about:

  • The duration of the alert
  • The behavior of the service before the alert
  • The system activity after the alert triggered

Slack channels are a good place to capture all of this data, so the entire team shares visibility and can work off the same set of information.
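As a rough illustration of the alert details listed above, here is a minimal Python sketch of a record that carries them, plus a helper that renders the record as text a team could post in a channel. The `AlertContext` class, its field names and the message layout are assumptions made for this example; they are not T-Mobile’s or Slack’s actual format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertContext:
    """Hypothetical record of one alert; field names are illustrative."""
    service: str
    triggered_at: datetime
    duration: timedelta      # how long the alert has been firing
    behavior_before: str     # what the service was doing before the alert
    activity_after: str      # what the system did after the alert triggered


def format_alert_message(alert: AlertContext) -> str:
    """Render the alert as plain text suitable for posting in a channel."""
    minutes = int(alert.duration.total_seconds() // 60)
    return "\n".join([
        f"Alert on {alert.service} at {alert.triggered_at:%Y-%m-%d %H:%M}",
        f"Duration: {minutes} min",
        f"Before the alert: {alert.behavior_before}",
        f"After the alert: {alert.activity_after}",
    ])
```

Keeping all three details in one message means responders who join the channel late see the same context as the people who were there when the alert fired.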

But sometimes incidents happen despite a team’s best efforts to stay on top of the system’s health. After incidents are resolved, T-Mobile also turns to Slack for postmortem analysis.

Soini said, “Slack is fantastic for bringing everyone together in the same space so you have a written record of what’s been going on. You can see which alerts and integrations have fired.”

Use the PagerDuty app for Slack to:
  • Trigger, view, acknowledge, and resolve PagerDuty incidents directly in Slack
  • Streamline collaboration during incidents
  • Maintain transparent communications throughout the incident lifecycle
  • Reduce resolution times

To help keep incident logs organized, T-Mobile is experimenting with uploading custom Slack emoji to its workspace that signify different milestones in the incident timeline. “It’s a way to pick out the signals from the noise over what could be many, many hours,” Soini said.
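One way to picture the milestone-emoji idea: if responders tag key messages with agreed-upon emoji, a short script can pull just those messages out of a long channel history. The Python sketch below assumes hypothetical emoji names (`:incident-start:` and so on) and a simple list of message dictionaries with `ts` and `text` keys; a real workspace would choose its own emoji and data source.

```python
import re

# Hypothetical custom emoji names; a real workspace would choose its own.
MILESTONE_EMOJI = {
    "incident-start": "Incident declared",
    "incident-mitigated": "Mitigation in place",
    "incident-resolved": "Incident resolved",
}

# Matches emoji shortcodes written as :name: inside message text.
EMOJI_PATTERN = re.compile(r":([a-z0-9_+-]+):")


def extract_timeline(messages):
    """Return (ts, label, text) for each message carrying a milestone emoji.

    `messages` is assumed to be a list of dicts with "ts" (a numeric string)
    and "text" keys, similar in shape to exported channel history.
    """
    timeline = []
    # Sort by timestamp so the timeline reads oldest to newest.
    for message in sorted(messages, key=lambda m: float(m["ts"])):
        for name in EMOJI_PATTERN.findall(message["text"]):
            if name in MILESTONE_EMOJI:
                timeline.append(
                    (message["ts"], MILESTONE_EMOJI[name], message["text"])
                )
    return timeline
```

Messages with ordinary emoji or no emoji are skipped, which is the "signal from the noise" effect Soini describes.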

Incident management requires swift and smart teamwork. PagerDuty and T-Mobile address incidents in Slack by:

  • Setting up dedicated channels for incident response the moment an incident happens. In these channels, teammates loop in the right people and data, assign tasks and follow up on incident status and next steps
  • Adding apps to Slack, like PagerDuty, that integrate with their existing software stack and that send alerts to a Slack channel the second something is out of sync
  • Reviewing channel history so that teams can host objective postmortems after an incident
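For teams that want to automate the first step above, the following Python sketch shows one possible shape: build a predictable channel name, create the channel, invite responders and post a kickoff message. The `inc-` naming convention and both function names are assumptions made for this example. The `client` argument is assumed to behave like the `WebClient` from Slack’s `slack_sdk` package, whose `conversations_create`, `conversations_invite` and `chat_postMessage` methods are used here; this is not a description of how PagerDuty or T-Mobile do it.

```python
import re
from datetime import date


def incident_channel_name(summary: str, day: date) -> str:
    """Build a channel name such as 'inc-20190612-checkout-api-outage'.

    The 'inc-' prefix and date layout are an illustrative convention.
    Slack channel names must be lowercase and at most 80 characters.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"inc-{day:%Y%m%d}-{slug}"[:80].rstrip("-")


def open_incident_channel(client, summary, responder_ids, day):
    """Create the channel, invite responders and post a kickoff message.

    `client` is assumed to behave like slack_sdk's WebClient.
    """
    name = incident_channel_name(summary, day)
    response = client.conversations_create(name=name)
    channel_id = response["channel"]["id"]
    client.conversations_invite(channel=channel_id, users=responder_ids)
    client.chat_postMessage(
        channel=channel_id, text=f"Incident opened: {summary}"
    )
    return channel_id
```

Passing the client in as an argument keeps the sketch easy to try with a stand-in object before pointing it at a real workspace.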

As Rayl said, “Incidents do happen to everyone no matter what industry you’re in, no matter what business you’re in. They’re just part of what we do. Slack can speed up that time to resolution by bringing the right people, information and apps together in one place.”

Watch the full Frontiers 2019 session
