Using Slack to resolve incidents faster at England’s Department for Education

“We knew we needed a new incident-response process, and we wanted something that was efficient, effective and allowed us to better serve our users.”

James Glenn, Software Developer, Department for Education

If you’ve applied for a highly selective job or training program, you know that the application process can be nerve-racking. You spend hours tailoring your CV, gathering references, writing the perfect cover letter and making sure everything is in order before the deadline. The last thing you want to see, then, is an error saying that the site is down and your materials can’t be submitted on time.

Part of the mission at England’s Department for Education (DfE) is to help great candidates find the perfect teacher training course. It’s essential work, which is why the DfE’s developers work hard to avoid site outages like the hypothetical one above.

The department’s online application process collects information from candidates, such as a personal statement, their education, work history and other qualifying information. When incidents occurred in the past (for example, a failed deployment or a third-party integration going down), developers dropped their work and self-organized to respond. But the response process was documented in various Google Docs, and depending on who kicked off an incident, teams could find themselves handling it in different ways.

“We knew we needed a new incident-response process, and we wanted something that was efficient, effective and allowed us to better serve our users,” says James Glenn, a software developer at the DfE.

Slack’s API helped them do just that by allowing Glenn and his team to code a custom bot that creates dedicated incident channels, labels them according to severity and loops in stakeholders automatically.

Solving a clunky incident resolution process

The DfE’s previous method for resolving incidents was cumbersome for developers. “Our services can be quite time-critical. We need to respond and get problems resolved quickly because they stop people from becoming teachers,” Glenn says.

Devs would flag an issue by starting a thread in a general channel, but it could quickly get lost among unrelated conversations, resulting in stakeholders losing visibility. Incident-response instructions lived inside a Google Doc that had to be manually updated; plus, people needed to be reminded of its whereabouts.

Even after an incident was resolved, it was still a challenge to document what had happened in a way that set the stage for a thorough review. The team realized too much time was being spent remembering what to do.

They needed a single source of truth that was automated and accessible to everyone.

Building a bespoke bot with Ruby 

Glenn and his team had a couple of options: they could choose a well-designed, off-the-shelf solution in the Slack App Directory or they could use the Slack API to build a custom solution tailored to their needs. To improve their incident response, the developer team elected to build a custom bot that could accomplish three distinct goals:

  1. Automate a consistent process
  2. Reflect the DfE’s transparent and blameless culture
  3. Improve and standardize documentation to facilitate effective postmortems

At the DfE, devs mainly work with the C# and Ruby programming languages, and they decided to use Ruby to build the bot. “We reviewed Slack’s API documentation and could see it offered plenty of functionality and could provide everything we needed,” Glenn says.

Enabling faster, more efficient responses

Because they built their own bot, Glenn and his colleagues were able to streamline incident response: all essential information is captured in a purpose-built channel, and everyone who needs to see it is alerted.

Here’s how it works: A single slash command, /incident open, summons the bot. A window appears and asks for the following information:

  • Title
  • Description
  • Impacted service
  • Priority level

The bot also asks for three incident leads, each focused on a different aspect of the response. The comms lead communicates with internal and external stakeholders and advises them on policy. The tech lead addresses the nuts and bolts of the resolution effort. And the support lead manages the help desk by wrangling incoming tickets.
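The form described above could be expressed as a Slack modal. The following is a minimal sketch, not the DfE's actual code: it builds a Block Kit view as a plain Ruby hash, which a bot would then pass to Slack's views.open method along with the trigger ID from the slash command. The block IDs, the callback ID and the priority labels are all hypothetical.

```ruby
require "json"

# Hypothetical priority labels; the article does not list the DfE's actual levels.
PRIORITIES = %w[P1 P2 P3].freeze

# Builds a free-text input block.
def text_input(block_id, label, multiline: false)
  {
    type: "input",
    block_id: block_id,
    label: { type: "plain_text", text: label },
    element: { type: "plain_text_input", action_id: "value", multiline: multiline }
  }
end

# Builds an input block that lets the user pick a workspace member.
def user_select(block_id, label)
  {
    type: "input",
    block_id: block_id,
    label: { type: "plain_text", text: label },
    element: { type: "users_select", action_id: "value" }
  }
end

# Builds a dropdown of the (assumed) priority levels.
def priority_select
  options = PRIORITIES.map do |priority|
    { text: { type: "plain_text", text: priority }, value: priority.downcase }
  end
  {
    type: "input",
    block_id: "priority",
    label: { type: "plain_text", text: "Priority level" },
    element: { type: "static_select", action_id: "value", options: options }
  }
end

# Assembles the full modal: four incident fields plus the three leads.
def incident_modal
  {
    type: "modal",
    callback_id: "incident_open", # hypothetical identifier
    title: { type: "plain_text", text: "Open an incident" },
    submit: { type: "plain_text", text: "Open" },
    blocks: [
      text_input("title", "Title"),
      text_input("description", "Description", multiline: true),
      text_input("service", "Impacted service"),
      priority_select,
      user_select("comms_lead", "Comms lead"),
      user_select("tech_lead", "Tech lead"),
      user_select("support_lead", "Support lead")
    ]
  }
end

puts JSON.pretty_generate(incident_modal)
```

Keeping the view as plain data makes it easy to test without calling Slack at all, which suits a bot that is tried out in a separate workspace first.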

Using the information provided, the bot spins up a new, dedicated channel that’s easy to find because it follows a defined incident-service-date-title naming convention. Stakeholders are automatically invited, and key documentation is pinned to the top of the channel.
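A naming convention like this is straightforward to automate. The sketch below is an illustration under assumptions rather than the DfE's implementation: the date format, the slug rules and the example service name are all hypothetical. It does respect Slack's own constraints, which require channel names to be lowercase, free of spaces and periods, and no longer than 80 characters.

```ruby
require "date"

# Slack limits channel names to 80 characters.
MAX_CHANNEL_NAME_LENGTH = 80

# Lowercases text and collapses anything other than letters and digits into hyphens.
def slugify(text)
  text.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

# Builds a name in the form incident-service-date-title.
# The YYYYMMDD date format is an assumption; the article does not specify one.
def incident_channel_name(service:, title:, date: Date.today)
  parts = ["incident", slugify(service), date.strftime("%Y%m%d"), slugify(title)]
  # Truncate to Slack's limit, then drop any hyphen left dangling at the end.
  parts.join("-")[0, MAX_CHANNEL_NAME_LENGTH].sub(/-+\z/, "")
end

# "Apply" is a made-up service name used only for this example.
puts incident_channel_name(
  service: "Apply",
  title: "Failed deployment",
  date: Date.new(2024, 1, 15)
)
# prints "incident-apply-20240115-failed-deployment"
```

The resulting string could then be passed to Slack's conversations.create method to open the channel.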

If any of the channel’s key data needs to be revised in real time, it’s as easy as typing /incident update. Once the incident is over, one last /incident close command will pin a summary report and ping leads to let them know the situation has been resolved.
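One way to route the three subcommands is a small dispatcher. Slack delivers a slash command as a form-encoded request whose command field holds the command itself and whose text field holds whatever the user typed after it. The sketch below assumes that payload has already been parsed into a hash; the handler bodies are placeholders, and the reply wording is invented for illustration.

```ruby
# Placeholder handlers: a real bot would open a modal, update the channel,
# or pin a summary here. The reply strings are hypothetical.
HANDLERS = {
  "open"   => ->(_params) { "Opening the incident form." },
  "update" => ->(_params) { "Opening the update form." },
  "close"  => ->(_params) { "Closing the incident and pinning a summary." }
}.freeze

USAGE = "Usage: /incident open | update | close"

# Takes the parsed slash-command payload and returns a response hash.
# "ephemeral" means only the person who ran the command sees the reply.
def handle_incident_command(params)
  subcommand = params["text"].to_s.strip.split.first.to_s.downcase
  handler = HANDLERS[subcommand]
  text = handler ? handler.call(params) : USAGE
  { response_type: "ephemeral", text: text }
end

puts handle_incident_command("command" => "/incident", "text" => "close")[:text]
# prints "Closing the incident and pinning a summary."
```

Unknown or missing subcommands fall through to a usage hint, so a mistyped command never fails silently.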

“In the end, after designing, building and thoroughly testing the bot in a separate space, we released it into the wilds of our main Slack workspace,” Glenn says. “Since then, we’ve used it to respond to live incidents faster and more successfully.”

Building better relationships with partners using Slack Connect

While the custom bot is the pièce de résistance of the DfE’s incident response strategy in Slack, that’s not where its resolution efforts end.

A handful of software providers integrate with the application process and help surface aspiring teachers to the roughly 70 universities across the country that offer training courses. In the old days, it wasn’t always easy to tell if the APIs were functioning properly or if the universities were seeing the relevant candidates. Now, thanks to automated daily monitoring, the support team can promptly alert partners about any outages.

“We send them a report every morning that might say, ‘All but one of your universities connected successfully in the last 24 hours and got the latest applications,’ ” says Duncan Brown, lead developer for teacher services.
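A summary like the one Brown describes could be generated from a record of when each university last connected. The sketch below is purely illustrative: the article does not say how the DfE tracks connections, so the input format (a hash of university names to last-seen timestamps), the function name and the message wording are all assumptions.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Summarizes which universities have connected in the last 24 hours.
# `last_seen` maps a university name to the Time of its most recent
# successful connection, or nil if it has never connected.
def connection_summary(last_seen, now: Time.now)
  cutoff = now - SECONDS_PER_DAY
  stale = last_seen.select { |_name, seen_at| seen_at.nil? || seen_at < cutoff }.keys.sort
  total = last_seen.size

  if stale.empty?
    "All #{total} universities connected successfully in the last 24 hours."
  else
    "#{total - stale.size} of #{total} universities connected successfully " \
      "in the last 24 hours. Not seen: #{stale.join(', ')}."
  end
end

# Made-up data for the example.
now = Time.utc(2024, 1, 16, 8, 0, 0)
puts connection_summary(
  { "University A" => Time.utc(2024, 1, 16, 2, 0, 0), "University B" => Time.utc(2024, 1, 14, 9, 0, 0) },
  now: now
)
# prints "1 of 2 universities connected successfully in the last 24 hours. Not seen: University B."
```

Naming the universities that were not seen gives the partner something specific to investigate as soon as the message arrives.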

The report kicks off a collaborative discussion that diagnoses the problem and prescribes a solution, all within a dedicated Slack Connect channel shared with the external vendor. There’s also the added benefit of having a linkable record that can be sent to internal DfE stakeholders (instead of having to bring them up to speed with a manual recap).

“It’s a massive improvement on the emails and phone calls going back and forth before,” Brown says. “Being able to resolve it there and then in Slack really improves our relationships with our partners.”

Sharing key learnings

Robust digital services don’t just mean a smoother experience for candidates. They also help the DfE achieve its goal of getting more teachers into the profession where they’re most needed.

It’s also a chance for developers to thrive in their roles behind the scenes. The bot has been such a success—attracting attention from across the department and raising awareness of how successful digital services operate—that Glenn wants to spread the word about how Slack can enable a cleaner, more efficient response system and help other DfE teams implement similar processes.

“Unfortunately, incidents do happen,” he says. “You can’t stop that, but you can change how you deal with them.”