For most SaaS companies, a single software incident can send a shock wave through the ranks, igniting a flurry of late-night pages, phone calls and emails aimed at mitigating negative impact. Maybe a payment system has gone down, and customer orders aren’t going through. Or even worse, the whole service has gone offline, and entire teams need to mobilize.
Launched just last year, Incident.io is an incident-response platform that lets you manage and respond to incidents right within Slack, giving companies complete, premeditated control over their incident-response protocol. From what needs to be done to who needs to be brought into the conversation (or just kept in the loop), everything can be automated up front, or as one of its co-founders put it, “in the calm of day.”
In this installment of our On the Platform interview series, Slack open-source engineer Alissa Renz connected with Incident.io co-founder and chief product officer Chris Evans to talk about the company, his background as a software engineer and his experience building an incident-response tool specifically for the Slack platform.
The following is a condensed transcript; answers have been edited for length and clarity.
Building Incident.io: “a sensible place to converge around an incident”
Alissa Renz: Incident.io was conceptualized from a team you worked with during your time at Monzo Bank. Can you tell us what led to the idea for Incident.io and how you evolved it from an idea into a company?
Chris Evans: I joined Monzo in the relatively early days. Back then they had a very small on-call rotation, only four engineers. They were the same engineers that had been there since almost day one, and they were all pretty burned out. On top of the small rotation, there was a lot of extra pressure that came from looking after a bank. When things go wrong, it could mean customers aren’t able to access their money, and there could be regulatory impacts.
There was also complexity introduced by the way the organization was set up. Many incidents required that customer support folks be pulled into the loop, or risk and compliance. What they needed was a tool that could be used more widely across their organization for incident response.
I was tasked with taking over on-call. I ended up building what you could say was an early prototype for what we have now with Incident.io. It was purely to make on-call easier for engineers, a sensible place for people to converge around an incident. By the time I left in 2021, it was used in every corner of the company as a single source of truth for incidents across the board.
“Philosophically, we have this approach that the only time you should ever leave Slack is when you’re actually fixing the thing.”
Renz: Can you tell us more about Incident.io? How has the company grown and evolved since it first launched?
Evans: The first and most obvious thing is that it’s built on top of Slack. Users can use a single Slack slash command to declare an incident and manage the entirety of their response, including their existing tools, without having to leave their incident channel. Whether it’s PagerDuty or Opsgenie for getting a hold of people, Jira for logging actions or Statuspage and email for external communications, Incident.io pulls all of these different silos into a single layer within Slack.
For example, if you have a partner document that says, “If this incident is a data breach, then page the data protection officer,” or, if it’s a situation when execs need to be pulled in, you can configure those things in the calm of day. Then, when it’s 2 a.m. and there are pages going off, nobody has to think about the process. Incident.io makes sure that everyone gets into the right room and instructs you along the way.
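Editor's sketch: the "configure it in the calm of day" idea can be pictured as a small table of rules that is written ahead of time and only evaluated when an incident is declared. The Python below is a minimal illustration under stated assumptions, not Incident.io's actual configuration format or API; the names `Rule`, `Incident`, `RULES`, `roles_to_page` and the role strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: which roles to page for a given incident type."""
    incident_type: str
    page_roles: tuple


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record created when someone declares an incident."""
    title: str
    incident_type: str
    needs_exec: bool = False


# Written ahead of time, "in the calm of day" (illustrative values only).
RULES = (
    Rule("data_breach", ("data_protection_officer",)),
)


def roles_to_page(incident, rules=RULES):
    """Return the roles to page for this incident, in rule order."""
    roles = []
    for rule in rules:
        if rule.incident_type == incident.incident_type:
            roles.extend(rule.page_roles)
    # Assumed escalation flag for situations where execs need to be pulled in.
    if incident.needs_exec:
        roles.append("executive_on_call")
    return roles


if __name__ == "__main__":
    breach = Incident("Customer data exposed", "data_breach", needs_exec=True)
    print(roles_to_page(breach))
```

The point of the design is that the decision logic lives in data that can be reviewed calmly, so whoever is on call at 2 a.m. only has to declare the incident.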
Renz: An incident can arise across many different services. Which services does Incident.io offer support for?
Evans: We’re integrated with all the major paging providers. That’s PagerDuty and Opsgenie for getting people together in the room. Then there’s Jira, GitHub and Clubhouse. Statuspage, so you can keep the public updated without leaving Slack. Zoom and Google Meet if you need to jump right into a face-to-face chat.
With Incident.io, you don’t have to jump off of Slack to do things. Philosophically, we have this approach that the only time you should ever leave Slack is when you’re actually fixing the thing. Everything else should be possible without leaving Slack.
On choosing Slack and building for the Slack platform
Renz: What aspects of the Slack platform make it a particularly good fit for something as critical as incident response?
Evans: From the beginning, we knew what kind of companies we wanted to go after. The kind of companies that value the rich context in communication that Slack offers. As founders and engineers, we were all really familiar with Slack already, and we knew that it would be a natural fit to build on the platform.
When you look more generally, great incidents are founded on great communication. That’s the key tenet, that everyone is able to communicate really clearly. Slack was a communication-first tool that was already ubiquitous across so many organizations, as much as calendars or email.
But we came to realize that Slack could give us a lot more than just communications. It’s a genuinely powerful platform that makes it easy to build applications that can be used all across the organization. It’s frictionless for engineers, who can get started with a single Slack command. Meanwhile, folks who have never used it before can jump right into the channel and we’ll help them along with tips and messages.
Slack was simply the most sensible place to build.
“We came to realize that Slack could give us a lot more than just communications.”
Renz: What’s been an interesting technical challenge or problem you’ve faced and what role has Slack had in overcoming it?
Evans: The constant challenge is making technical integrations more friendly for non-technical folks. Our ambition is to become this tool for entire organizations. You need to be able to take someone who has never used a paging program like PagerDuty and give them a crystal-clear UX. Inside of Slack, everything looks similar and familiar, which makes it easier for everyone.
The pandemic and why the future is about “more tools, playing more nicely together”
Renz: As the way we work continues to evolve, how has this affected the nature of incidents and how organizations should approach their response strategies?
Evans: Before the pandemic, jumping into an incident typically meant jumping into a meeting room in an office together. We were still using Slack but were sitting across the desk from one another. Then a lot of companies went 100% remote, and many are staying that way. So there’s an increasing pressure on tooling like ours to bridge that gap.
Flipping to the customer side of that lens, online services have become much more critical. Grocery shopping, Amazon, banking … customers are more dependent on these services than they were before, and the providers of those services have recognized that. There’s a strong driver from both sides to make incidents as low-impact as possible, make them happen as little as possible and get them recovered as quickly as possible.
“Suddenly, we’re in a world where these tools are like Lego blocks that you can use to build an incredible organization.”
Renz: At Slack, we’ve evolved the platform to meet the needs of today’s developers. Looking forward, are there any tools or technologies you’re excited about?
Evans: For me, what’s incredible is the idea of more tools playing more nicely together.
When you look at the old SaaS models, it was all about building a thing and trying to get everyone to gravitate towards it, log in and then make it the center of their universe. Now, there’s such a rich ecosystem of SaaS products, and people want to use them all.
Suddenly, we’re in a world where these tools are like Lego blocks that you can use to build an incredible organization. Even 10 years ago you’d have to build something from scratch or just paper over the cracks by hiring more people. Now you can take the integrations you need and plug these things together in Slack.
Renz: Which tools or technologies do you find most useful when it comes to making work more efficient, effective or fun?
Evans: In terms of making life easier, it’s the fun things. 99.9% of people probably have Giphy installed in Slack. Being able to inject some levity into serious situations is great.
But the real superpower for people building on Slack is using Block Kit Builder to collaborate on design. It’s incredible. We all go into Block Kit Builder and decide exactly how we want everything to look and work. That way, you don’t have to write a single line of code until you’re really happy with it.
You can also use Block Kit Builder to compose really nice messages or announcements to your organization. You can embed nice images, section headings and design elements, then send it out straight from Block Kit.
Renz: What advice do you have for folks that are interested in working with or building on the Slack platform?
Evans: My advice would be to find all the things that are causing you even tiny little micro frictions. Then figure out if those things can go into Slack.
As an example, at Monzo we wanted to keep a record of all the decisions we made over time, so we could go back and take a look. Originally, we logged all that in a Notion database. But it didn’t really get used because it was a micro friction.
So I built a super-trivial little app that just brings up a modal that asks, “What was the context? What led you to this decision? Who was involved?” After you submitted the form, everything was sent to GitHub, and we didn’t have to worry about it anymore, all with just one little bit of code.
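Editor's sketch: a decision-log modal like the one described could look roughly like the following in Python. This is an assumed reconstruction, not the app Evans built. It only builds a Block Kit modal payload with the three questions and formats a submitted view into text; the `callback_id`, block IDs, `action_id` and function names are hypothetical, and both opening the modal through Slack's API and writing the result to GitHub are deliberately left out.

```python
import json

# The three prompts quoted in the interview, keyed by hypothetical block IDs.
QUESTIONS = (
    ("context", "What was the context?"),
    ("reasoning", "What led you to this decision?"),
    ("people", "Who was involved?"),
)


def build_decision_modal():
    """Build a Block Kit modal view with one text input per question."""
    blocks = [
        {
            "type": "input",
            "block_id": block_id,
            "label": {"type": "plain_text", "text": question},
            "element": {
                "type": "plain_text_input",
                "action_id": "answer",
                "multiline": True,
            },
        }
        for block_id, question in QUESTIONS
    ]
    return {
        "type": "modal",
        "callback_id": "decision_log",  # hypothetical identifier
        "title": {"type": "plain_text", "text": "Log a decision"},
        "submit": {"type": "plain_text", "text": "Submit"},
        "close": {"type": "plain_text", "text": "Cancel"},
        "blocks": blocks,
    }


def format_submission(view):
    """Turn a submitted view's state into a text record ready to be stored."""
    values = view["state"]["values"]
    sections = []
    for block_id, question in QUESTIONS:
        answer = values[block_id]["answer"]["value"] or ""
        sections.append(f"## {question}\n{answer}\n")
    return "\n".join(sections)


if __name__ == "__main__":
    # The JSON printed here can be pasted into Block Kit Builder to preview it.
    print(json.dumps(build_decision_modal(), indent=2))
```

In a real app, the dictionary from `build_decision_modal` would be passed as the view when opening a modal, and `format_submission` would run on the view included in the submission event before the text is stored wherever the team keeps its records.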