How engineers at Netflix and PagerDuty outsmart incidents with Slack

An open-source incident management tool and intuitive new Slack integration help these companies quickly respond to software issues

Author: Jess Dawson, 8th January 2021

Any company with an online platform or website will inevitably deal with an incident or outage. But what sets the best organisations apart is the ability to speedily resolve these issues. That’s why companies across industries are coming up with new and inspired ways to identify and resolve all manner of incidents with Slack, the secure channel-based messaging platform.

Take PagerDuty, which helps more than 12,000 companies around the world pinpoint and tackle incidents with its real-time operations platform. A new integration combines PagerDuty with Slack to connect stakeholders, enabling them to manage and track issues before they escalate.

Then there’s Netflix, where engineers used the Slack API platform to build Dispatch, an open-source incident management tool that works with Slack to reduce response times – and is now available for anyone to use on development platform GitHub.

At Slack Frontiers, our annual conference focused on transforming how everyone works, we explored both approaches with Andrew Marshall, the director of product marketing at PagerDuty, and Marc Vilanova, a senior security engineer at Netflix.

Levelling up incident response with PagerDuty and Slack

Traditionally, incident management is built on a command-and-control model: decisions made at the top trickle down.

However, today’s incident response requires more of a swarm approach: connecting the right information to the right responders at the right time. During this response phase, teams rely on real-time communication to react to evolving issues, reassign or escalate incidents and add responders. Enter the new PagerDuty and Slack integration, which was released at Slack Frontiers 2020.

‘This was a major milestone for us because, for many teams, Slack is where work happens,’ Marshall says. ‘PagerDuty integrates with over 350 tools to ingest and contextualise signals. Slack then connects PagerDuty’s incident context to the right team members so they can solve issues a lot faster.’

Marshall explains that with the integration, Slack essentially becomes a third interface for PagerDuty, along with the desktop and mobile experience. This frees developers from switching contexts unnecessarily, keeping teams engaged and unlocking productivity.

‘You can drive PagerDuty actions directly through the Slack UI without wasting time toggling between apps,’ Marshall says.

‘Our integration connects Slack’s hub for communication with PagerDuty’s digital operation platform, and the result powers real-time ops for modern businesses across the world.’

Andrew Marshall, Director of Product Marketing, PagerDuty

Bringing key stakeholders together quickly

A number of PagerDuty customers have what Marshall describes as a ‘hybrid ops environment’, where the new integration connects disparate teams. When all is well, they use PagerDuty and Slack as part of a well-oiled ecosystem to collaborate and make quick decisions. But when something goes awry:

  1. PagerDuty detects the incident
  2. The team is notified in Slack, where engineers work on a resolution
  3. Other stakeholders – sales managers, customer service, executives – are updated directly in Slack as needed, without unnecessary disruption
  4. Post-incident, information can be pulled from Slack to complete a PagerDuty postmortem

‘Slack and PagerDuty fulfil three major objectives,’ Marshall explains. ‘Enabling efficient communication and collaboration, accelerating the rapid resolution of an incident and improving the overall resolution process.’

Equipping teams at Netflix to efficiently address incidents

At Netflix, Vilanova drives security incidents to resolution and develops programmatic solutions for crisis management. This includes Dispatch, a custom incident management automation framework that deeply integrates with Netflix’s existing tools, including Slack.

‘The last part is particularly important, as we want to keep the learning curve for incident participants as flat as possible,’ Vilanova says. ‘There’s no worse time to learn how to use a new tool than during an incident.’

It can take a lot of time just to engage the right people and bring them up to speed. With all this in mind, Vilanova knew Dispatch should accomplish four things:

  1. Reduce cognitive load on participants so that they can focus on the resolution
  2. Maximise efficiency by providing a consistent experience
  3. Provide easy and intuitive ways to manage the incident
  4. Collect information for future learnings

Using a simple command in Slack, anyone at Netflix can instantly report a security incident with Dispatch. ‘The less friction the better. We want employees to report incidents as quickly as possible,’ Vilanova says.
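As a rough illustration of what sits behind a command like this: Slack delivers a slash command to an app as a form-encoded HTTP request whose fields include `command`, `text`, `user_id` and `channel_id`. The sketch below only parses such a request body into a report. It is not Dispatch's code; the `/report-incident` command name, the `IncidentReport` class and the `parse_report_command` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs


@dataclass
class IncidentReport:
    """Hypothetical record of who reported what, and from which channel."""
    reporter_id: str
    channel_id: str
    title: str


def parse_report_command(body: str) -> IncidentReport:
    """Turn a form-encoded slash-command request body into a report."""
    # parse_qs returns a list per field; a slash command sends one value each.
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    # Everything the user typed after the command arrives in the "text" field.
    title = fields.get("text", "").strip()
    if not title:
        raise ValueError("an incident report needs a short title")
    return IncidentReport(
        reporter_id=fields["user_id"],
        channel_id=fields["channel_id"],
        title=title,
    )


# Example body for a hypothetical "/report-incident Suspicious login" command.
body = "command=%2Freport-incident&text=Suspicious+login&user_id=U123&channel_id=C456"
report = parse_report_command(body)
```

Asking for nothing more than a short title is one way to keep reporting low-friction; further details can be gathered once responders are engaged.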

Marc Vilanova, a senior security engineer at Netflix

‘At Netflix, we use Slack for real-time communications and to manage all aspects of an incident.’

Marc Vilanova, Senior Security Engineer, Netflix

Empowering teams to focus on the resolution, not the process

During an incident, the number of responders can grow exponentially, making it cumbersome to track who’s in the Slack channel. Dispatch announces participants as they join, including their name, team, location and role in the incident.

Participants joining the channel receive a welcome message with all relevant information, including links to resources. ‘This frees the incident commander and other participants from having to provide context and allows everyone to start contributing right away,’ Vilanova says.
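The two messages described above can be sketched as plain text builders. This is a minimal illustration rather than Dispatch's implementation: the `Participant` class, the `announce` and `welcome` functions, the wording of the messages and the example URLs are all assumptions, and treating team and location as separate fields is one reading of the description.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant record; the field names are assumptions."""
    name: str
    team: str
    location: str
    role: str


def announce(p: Participant) -> str:
    """Channel-wide message posted when someone joins the incident."""
    return f"{p.name} joined as {p.role} ({p.team}, {p.location})"


def welcome(p: Participant, incident_title: str, resources: dict[str, str]) -> str:
    """Message for the newcomer, with context and links to resources."""
    lines = [f"Welcome, {p.name}. You are joining: {incident_title}", "Resources:"]
    # One line per resource, in the order the caller supplied them.
    lines += [f"- {label}: {url}" for label, url in resources.items()]
    return "\n".join(lines)


responder = Participant("Alex", "Security", "London", "participant")
announcement = announce(responder)
greeting = welcome(
    responder,
    "Suspicious login",
    {"Incident document": "https://example.com/doc"},
)
```

Generating both messages from the same participant record keeps the channel announcement and the welcome message consistent without anyone typing out the context by hand.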

Through another Slack command, the incident commander and participants can engage and page the on-call team, which is defined in the Dispatch web UI in advance.

A consistent Dispatch experience translates to faster responses over time. Incident commanders can manage the entire incident lifecycle right in Slack and collect metrics and metadata to inform reports and future decision-making.

Getting ahead of the incident curve

While different in their strategies, both Marshall and Vilanova integrate Slack with the best tools for their business, providing responders with exactly what they need to find a resolution, quickly. The beauty of these approaches lies in each team’s ability to gain insight with every incident – learning to solve future issues faster and get ahead of others before they even begin.
