
Stuff happens: using Slack for incident management

IT incidents cost enterprises billions of dollars in lost productivity each year. Here’s how Slack helps teams stay ahead

By Lauren Johnson, June 12, 2019

Like death and taxes, IT incidents are inevitable. Issues like server outages and broken code are so common, the data analytics company Splunk reports, that many companies experience about five incidents per month, to the tune of more than $100,000 a pop. Annually, incidents cost enterprise businesses $700 billion in lost productivity, according to the incident response platform PagerDuty and research company IHS Markit. That’s why a solid incident management strategy is a must for any organization.

“People solve incidents, but we can’t do it alone,” said Ali Rayl, Slack’s vice president of customer experience, at our recent Frontiers conference. “We all need to coordinate across an evolving set of apps, data and information in order to get ourselves through an incident to a positive resolution.”

At Frontiers, we gathered specialists from PagerDuty, the telecom company T-Mobile and RPI Ambulance, an emergency medical services agency based at Rensselaer Polytechnic Institute, to talk about how they use Slack to anticipate, monitor and manage incidents. Here’s a look into some of their strategies and best practices.

Incident management 101: Using Slack channels to gather the right experts

“When something is going on, when something is burning, you need to do things in real time,” said Rachel Obstler, PagerDuty’s vice president of product. “You don’t have the time to send questions up the ranks and wait for someone to respond, so traditional command and control just doesn’t work.”

PagerDuty serves more than 11,000 companies in various industries, from fashion to finance. Its app for Slack connects our two platforms so that our shared users can analyze and resolve incidents within Slack channels, without having to switch tools. This helps the appropriate incident response team members organize their data in one place and collectively determine the best course of action.

Obstler said it’s best practice to limit channel participation to the people who are actively involved in resolving the incident. Anyone else may be tempted to ask questions about why the incident happened, which creates distractions and wastes valuable time.

“The best person to solve an incident is the person who last pushed the code or the person who solved the last incident,” Obstler said. “They need to be able to work with each other in a collaborative way and solve the problem.”

Looping in leadership without creating interruptions

Jon Soini, a principal technology product manager at T-Mobile, described how his team uses Slack channels to keep leadership and stakeholders informed during an incident. While stakeholders are invited to watch activity in the channel, they understand when to ask questions and when to step back and let their teams resolve the incident.

“That [Slack] channel may be for that technology team, so it’s kind of like [their] home,” Soini said. “You don’t go in and start breaking all the china, right? Hopefully, if you’re a stakeholder, you can watch and be a part of it, which I think is really cool.”

He added that T-Mobile’s teams choose a scribe to capture notes as incidents are resolved. This practice creates transparent records for stakeholders, executives and other team members to review later.  

“The incident commander outranks the CEO when you have an incident,” Obstler said. Even so, it’s still critical to give stakeholders status updates.

“It’s almost like two parallel sets of work have to happen: thinking about how you have to communicate to everyone, while fixing the problem itself. You don’t want any distractions while fixing the problem.”

Using Slack apps and integrations across the incident management lifecycle

T-Mobile uses automated Slack integrations to look out for underlying issues within the network’s infrastructure. Soini said it’s important to use an alert system that does more than signal a potential weak link. Incident teams need to receive information about:

  • The duration of the alert
  • The behavior of the service before the alert
  • The system activity after the alert triggered

Slack channels are a good place to capture all of this data, so the entire team shares visibility and can work off the same set of information.
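To make the three pieces of alert context listed above concrete, here is a minimal Python sketch of how an alert might be bundled and rendered as a single channel message. It is an illustration only, not T-Mobile’s actual integration: the `AlertContext` class, its field names and the `format_alert_message` helper are all hypothetical, and the sketch only builds the text rather than posting it anywhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AlertContext:
    """Hypothetical container for the context an incident team needs."""
    service: str
    triggered_at: datetime
    duration: timedelta      # how long the alert has been firing
    behavior_before: str     # summary of the service before the alert
    activity_after: str      # summary of system activity after the trigger


def format_alert_message(alert: AlertContext) -> str:
    """Render the alert as plain text that could be posted to a channel."""
    minutes = int(alert.duration.total_seconds() // 60)
    return "\n".join([
        f"Alert on {alert.service} at {alert.triggered_at:%Y-%m-%d %H:%M}",
        f"Duration: {minutes} min",
        f"Before the alert: {alert.behavior_before}",
        f"After the alert: {alert.activity_after}",
    ])
```

Keeping all three fields in one message means everyone in the channel reads the same snapshot instead of piecing it together from separate tools.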

But sometimes incidents happen despite a team’s best efforts to stay on top of the system’s health. After incidents are resolved, T-Mobile also turns to Slack for postmortem analysis.

Soini said, “Slack is fantastic for bringing everyone together in the same space so you have a written record of what’s been going on. You can see which alerts and integrations have fired.”

Use the PagerDuty app for Slack to:
  • Trigger, view, acknowledge, and resolve PagerDuty incidents directly in Slack
  • Streamline collaboration during incidents
  • Maintain transparent communications throughout the incident lifecycle
  • Reduce resolution times

To help keep incident logs organized, T-Mobile is experimenting with uploading custom Slack emoji to its workspace that signify different milestones in the incident timeline. “It’s a way to pick out the signals from the noise over what could be many, many hours,” Soini said.
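As a rough sketch of how milestone emoji could help pick signals out of a long channel history, the Python below scans a list of messages for emoji codes and builds a sorted timeline. The emoji names in `MILESTONES` and the `extract_timeline` helper are hypothetical, not T-Mobile’s actual setup. The sketch assumes each message is a dictionary with a `ts` timestamp string and a `text` field, and that custom emoji appear in the text as `:name:`.

```python
import re
from datetime import datetime, timezone

# Hypothetical custom emoji names mapped to timeline milestones.
MILESTONES = {
    "incident-detected": "Detected",
    "incident-mitigated": "Mitigated",
    "incident-resolved": "Resolved",
}

# Matches emoji codes written as :name: inside message text.
EMOJI_PATTERN = re.compile(r":([a-z0-9_+\-]+):")


def extract_timeline(messages):
    """Return (time, milestone, text) tuples for messages tagged with a milestone emoji."""
    timeline = []
    for message in messages:
        for name in EMOJI_PATTERN.findall(message["text"]):
            if name in MILESTONES:
                when = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
                timeline.append((when, MILESTONES[name], message["text"]))
    # Sort chronologically so the postmortem reads top to bottom.
    return sorted(timeline, key=lambda entry: entry[0])
```

Messages without a milestone emoji are skipped, so many hours of discussion collapse into a handful of timeline entries.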

Incident management requires swift and smart teamwork. PagerDuty and T-Mobile address incidents in Slack by:

  • Setting up dedicated channels for incident response the moment one happens. In these channels, teammates loop in the right people and data, assign tasks and follow up on incident status and next steps
  • Adding apps to Slack, like PagerDuty, that integrate with their existing software stack and that send alerts to a Slack channel the second something is out of sync
  • Reviewing channel history so that teams can host objective postmortems after an incident
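Setting up a dedicated channel quickly is easier when its name follows a predictable pattern. The Python sketch below shows one possible convention. The `inc-<date>-<summary>` scheme and the `incident_channel_name` helper are hypothetical, not something PagerDuty or T-Mobile described, and it assumes Slack’s channel-name rules (lowercase letters, numbers, hyphens and underscores, up to 80 characters).

```python
import re
from datetime import date

MAX_CHANNEL_NAME_LENGTH = 80  # assumed Slack limit for channel names


def incident_channel_name(opened_on: date, summary: str, prefix: str = "inc") -> str:
    """Build a channel name such as 'inc-20190612-checkout-errors'."""
    # Lowercase the summary and collapse anything else into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    name = f"{prefix}-{opened_on:%Y%m%d}-{slug}"
    # Truncate to the length limit and avoid ending on a hyphen.
    return name[:MAX_CHANNEL_NAME_LENGTH].rstrip("-")
```

Putting the date first keeps incident channels sorted chronologically, which makes it easier to find the right history for a postmortem.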

As Rayl said, “Incidents do happen to everyone no matter what industry you’re in, no matter what business you’re in. They’re just part of what we do. Slack can speed up that time to resolution by bringing the right people, information and apps together in one place.”

Watch the full Frontiers 2019 session
