How engineers at Netflix and PagerDuty outsmart incidents with Slack

An open source incident management tool and intuitive new Slack integration help these companies quickly respond to software issues

By Jess Dawson, January 8, 2021

Any company with an online platform or website will inevitably deal with an incident or outage. But what sets the best organizations apart is the ability to speedily resolve these issues. That’s why companies across industries are coming up with new and inspired ways to identify and resolve all manner of incidents with Slack, the secure channel-based messaging platform.

Take PagerDuty, which helps more than 12,000 companies around the world pinpoint and tackle incidents with its real-time operations platform. A new integration combines PagerDuty with Slack and connects stakeholders so they can manage and track issues before they escalate.

Then there’s Netflix, where engineers used the Slack API platform to build Dispatch, an open source incident management tool that works with Slack to reduce response times—and is now available for anyone to use on development platform GitHub.

At Slack Frontiers, our annual conference focused on transforming how everyone works, we explored both approaches with Andrew Marshall, the director of product marketing at PagerDuty, and Marc Vilanova, a senior security engineer at Netflix.

Leveling up incident response with PagerDuty and Slack

Traditionally, incident management is built on a command-and-control model: Decisions made at the top trickle down.

However, today’s incident response requires more of a swarm approach: connecting the right information to the right responders at the right time. During this response phase, teams rely on real-time communication to react to evolving issues, reassign or escalate incidents, and add responders. Enter the new PagerDuty and Slack integration, which was released at Slack Frontiers 2020.

“This was a major milestone for us because for many teams, Slack is where work happens,” Marshall says. “PagerDuty integrates with over 350 tools to ingest and contextualize signals. Slack then connects PagerDuty’s incident contact to the right team members so they can solve issues a lot faster.”

Marshall explains that with the integration, Slack essentially becomes a third interface for PagerDuty, along with the desktop and mobile experience. This frees developers from switching contexts unnecessarily, keeping teams engaged and unlocking productivity.

“You can drive PagerDuty actions directly through the Slack UI without wasting time toggling between apps,” Marshall says.

“Our integration connects Slack’s hub for communication with PagerDuty’s digital operation platform and the result powers real-time ops for modern businesses across the world.”

Andrew Marshall, Director of Product Marketing, PagerDuty

Bringing key stakeholders together quickly

A number of PagerDuty customers have what Marshall describes as a “hybrid ops environment,” where the new integration connects disparate teams. When all is well, they use PagerDuty and Slack as part of a well-oiled ecosystem to collaborate and make quick decisions. But when something goes awry:

  1. PagerDuty detects the incident
  2. The team is notified in Slack, where engineers work on a resolution
  3. Other stakeholders—sales managers, customer service, executives—are updated directly in Slack as needed, without unnecessary disruption
  4. Post-incident, information can be pulled from Slack to complete a PagerDuty postmortem
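
The four steps above can be sketched as a small routing routine. This is an illustration only, not PagerDuty's or Slack's API: the `Incident` class, the `post` and `handle_incident` functions, and the channel names are all hypothetical, and messages are collected in an in-memory list rather than sent anywhere.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; not a PagerDuty or Slack type."""
    title: str
    timeline: list = field(default_factory=list)


def post(incident, outbox, channel, text):
    """Record a message for a channel and keep it on the incident timeline."""
    outbox.append((channel, text))
    incident.timeline.append(f"[{channel}] {text}")


def handle_incident(title, resolution):
    """Walk one incident through the four steps listed above."""
    outbox = []
    incident = Incident(title)  # step 1: the incident is detected
    # Step 2: responders are notified in their working channel.
    post(incident, outbox, "#incident-responders", f"Detected: {title}")
    # Step 3: stakeholders get a short update in a separate channel.
    post(incident, outbox, "#incident-updates", f"Investigating: {title}")
    post(incident, outbox, "#incident-responders", f"Resolved: {resolution}")
    # Step 4: the collected timeline becomes the postmortem draft.
    postmortem = "\n".join(incident.timeline)
    return outbox, postmortem
```

Keeping responder and stakeholder messages in separate channels mirrors the idea of updating stakeholders without disrupting the people working on the fix, while the shared timeline gives the postmortem a single source to draw from.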

“Slack and PagerDuty fulfill three major objectives,” Marshall explains. “Enabling efficient communication and collaboration, accelerating the rapid resolution of an incident and improving the overall resolution process.”

Equipping teams at Netflix to efficiently address incidents

At Netflix, Vilanova drives security incidents to resolution and develops programmatic solutions for crisis management. This includes Dispatch, a custom incident management automation framework that deeply integrates with Netflix’s existing tools, including Slack.

“The last part is particularly important, as we want to keep the learning curve for incident participants as flat as possible,” Vilanova says. “There’s no worse time to learn how to use a new tool than during an incident.”

It can take a lot of time just to engage the right people and bring them up to speed. With all this in mind, Vilanova knew Dispatch should accomplish four things:

  1. Reduce cognitive load on participants so they can focus on the resolution
  2. Maximize efficiency by providing a consistent experience
  3. Provide easy and intuitive ways to manage the incident
  4. Collect information for future learnings

Using a simple command in Slack, anyone at Netflix can instantly report a security incident with Dispatch. “The less friction the better. We want employees to report incidents as quickly as possible,” Vilanova says.

“At Netflix, we use Slack for real-time communications and to manage all aspects of an incident.”

Marc Vilanova, Senior Security Engineer, Netflix

Empowering teams to focus on the resolution, not the process

During an incident, the number of responders can grow quickly, making it cumbersome to track who’s in the Slack channel. Dispatch announces participants as they join, including their name, team location and role in the incident.

Participants joining the channel receive a welcome message with all relevant information, including links to resources. “This frees the incident commander and other participants from having to provide context, and allows everyone to start contributing right away,” Vilanova says.
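
As a rough sketch of the announcement and welcome messages described above, the snippet below builds both as plain strings. It is not Dispatch's actual code: the `Participant` fields, the function names, and the message wording are assumptions made for illustration. The only Slack-specific detail is the `<url|label>` link syntax from Slack's mrkdwn message format.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant record; field names are assumptions."""
    name: str
    team: str
    location: str
    role: str


def announce(p):
    """Build the channel announcement for a participant who just joined."""
    return f"{p.name} ({p.team}, {p.location}) joined as {p.role}."


def welcome(p, incident_title, resources):
    """Build a welcome message with context and resource links."""
    lines = [f"Welcome, {p.name}. You are joining: {incident_title}", "Resources:"]
    # Slack's mrkdwn format writes links as <url|label>.
    lines += [f"- <{url}|{label}>" for label, url in resources.items()]
    return "\n".join(lines)
```

Generating both messages from the same participant record keeps the experience consistent from one incident to the next, so nobody has to retype context by hand.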

Through another Slack command, the incident commander and participants can engage and page the on-call team, which is defined in the Dispatch web UI in advance.

A consistent Dispatch experience translates to faster responses over time. Incident commanders can manage the entire incident lifecycle right in Slack, and collect metrics and metadata to inform reports and future decision-making.

Getting ahead of the incident curve

While different in their strategies, both Marshall and Vilanova integrate Slack with the best tools for their business, providing responders with exactly what they need to find a resolution, quickly. The beauty of these approaches lies in each team’s ability to gain insight with every incident—learning to solve future issues faster and get ahead of others before they even begin.
