A digital transformation is a big undertaking for any organization. But the process can be especially challenging for those in more traditional industries with substantial workforces, or for those that depend on an unusually high level of security to protect their products and customers. For leaders in this space, it can seem as though the transparency other companies embrace is incompatible with these exceptional circumstances.
British Gas and Starling Bank: secure, on-demand credentialing and self-correcting documentation
Energy provider British Gas needs a high level of secure, constant outage monitoring for a simple reason: It cannot afford to jeopardize its ability to provide consistent heat to customers throughout the cold months of the year. Its development teams interact with more than 100 Amazon Web Services (AWS) accounts; to manage those credentials without worrying about accidentally leaked passwords, British Gas’s employees use the AWS Security Token Service’s short-lived tokens to access the system in lieu of logins. In order to be issued a token, though, employees need another authorization token—which is where Slack came in.
“One of my engineers said, ‘Everybody who needs these tokens is in Slack,’ ” recalled Chris Livermore, the head of operations for British Gas’s division Connected Homes. So the team built a Slack integration that allows engineers to request a token. The request is cross-checked against a database of authorized employees, and a token to enable the AWS login is automatically issued.
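The request flow described above (look the requester up in a list of authorized employees, then issue a short-lived token) can be sketched roughly as follows. This is a minimal illustration, not British Gas's implementation: the `AUTHORIZED` table, the `Token` type, the `request_token` function, the account names and the 15-minute lifetime are all assumptions, and a random string stands in for the step where the real system would obtain credentials from the AWS Security Token Service.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical allow-list: Slack user ID -> AWS accounts that user may access.
AUTHORIZED = {
    "U123": {"prod-heating", "staging"},
    "U456": {"staging"},
}


@dataclass
class Token:
    user_id: str
    account: str
    value: str
    expires_at: datetime


def request_token(user_id, account, now=None, ttl_minutes=15):
    """Return a short-lived Token if the user is authorized, else None."""
    now = now or datetime.now(timezone.utc)
    if account not in AUTHORIZED.get(user_id, set()):
        return None  # unknown user, or account not on their allow-list
    # Stand-in for the real issuing step (the article says AWS STS does this).
    value = secrets.token_urlsafe(16)
    return Token(user_id, account, value, now + timedelta(minutes=ttl_minutes))
```

Because the token carries its own expiry, a leaked value is only useful for a few minutes, which is the property the short-lived credentials are meant to provide.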
The mobile-only Starling Bank also gives its engineers AWS access through a temporary escalation of permissions, which is activated through the bank’s custom Slack chatbot, Starbot. Starbot asks for the person’s role and the reason he or she needs the permission, typically with a Jira ticket number. If the bot denies that access, the engineer can escalate the request to a human, who can enable access directly. This process creates complete documentation of every incident for auditors.
“It also provides great visibility, which discourages security abuse,” added Jason Maude, Starling Bank’s chief technology advocate. “Loads of people are watching the [Slack] channel where all of these things are going on. Eyebrows will be raised if you ask for a very high level of privilege. People will say, ‘What’s going on? Why do you need that?’ ”
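One way to picture the Starbot flow is as a small decision function that always writes an audit entry, whatever the outcome. The sketch below is an assumption-laden illustration rather than Starling's code: the `ALLOWED` policy table, the privilege level names, the `AuditLog` class and the rule that a request must mention a Jira-style ticket key are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Matches Jira-style issue keys such as "OPS-142".
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

# Hypothetical policy: privilege levels each role may receive automatically.
ALLOWED = {
    "engineer": {"read-only", "deploy"},
    "sre": {"read-only", "deploy", "admin"},
}


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, **event):
        self.entries.append(event)


def review_request(user, role, level, reason, log):
    """Return "granted" or "needs-human"; every request is logged."""
    has_ticket = JIRA_KEY.search(reason) is not None
    if has_ticket and level in ALLOWED.get(role, set()):
        decision = "granted"
    else:
        decision = "needs-human"  # the bot declines; a person can still approve
    log.record(user=user, role=role, level=level, reason=reason, decision=decision)
    return decision
```

Logging denied requests as well as granted ones is what makes the record useful to auditors: the trail shows who asked for what, and why, even when the bot said no.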
Starling is able to mitigate the pressure of incident management with another custom Slack integration, called IncidentBot. IncidentBot asks for various data points on a new incident and then spins up a new channel where everyone can come together to resolve it. This provides that same consistent documentation, both for people coming online mid-incident and for reference later on, after the heat of the moment has passed. Over time, when teams come to rely on the value of this record, they naturally start adding more and more documentation in subsequent incidents, building a better security operation in the long term.
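The intake step can be sketched as a function that refuses to proceed until the required data points are present, then derives a channel name and topic from them. This is only an illustration: the field names in `REQUIRED_FIELDS`, the naming scheme and the `open_incident` helper are assumptions, and the actual call that creates the Slack channel is left out. Slack channel names must be lowercase with no spaces, so the title is reduced to a slug.

```python
import re
from datetime import date

# Hypothetical data points the bot asks for before opening an incident.
REQUIRED_FIELDS = ("title", "severity", "reporter")


def channel_name(title, day):
    """Build a lowercase, hyphenated channel name of at most 80 characters."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"incident-{day:%Y%m%d}-{slug}"[:80]


def open_incident(answers, day):
    """Validate the answers and describe the channel to create."""
    missing = [name for name in REQUIRED_FIELDS if not answers.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {
        "channel": channel_name(answers["title"], day),
        "topic": f"[{answers['severity']}] {answers['title']} "
                 f"(reported by {answers['reporter']})",
    }
```

Putting the date and title into the channel name means the list of incident channels doubles as a rough index when someone looks back later.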
Nationwide Building Society: compliance-friendly incident management that stays human
Nationwide Building Society’s central concern in digital transformation is DevOps alerts. Last year, when the U.K. financial institution announced a £1.3 billion investment in its technology strategy, one of its first priorities was to find a platform that would enable global collaboration and 24/7 monitoring of those alerts while maintaining the due diligence required of the banking sector.
“We needed a tool that could support all of our engineers, no matter where they were, no matter what time it was, and bring them into the right space,” Emma Bellamy, a project manager at Nationwide, told Frontiers London. “All of the security documentation that Slack provided really gave our governance and risk control teams the confidence that Slack was really invested in proving out its security program.”
Because incident management can be a high-stress operation (justifiably, considering that a one-hour outage for Amazon on its Prime Day recently cost the company $100 million), it may be easy to forget that it’s people who are working to resolve a given incident. With Slack, Bellamy said, Nationwide is able to keep a human face on its support teams, improving collaboration and morale across the organization.
“It’s been really great to see the human connection come back to incident management,” she said. “In the morning, our support teams will announce to relevant channels who’s on support that day: ‘Hi, I’m here to help you today.’ So the customer on the other side of the product or the application knows who they’re interacting with on that day and can really have a conversation, work together to resolve the problem, rather than throwing things over the fence or submitting a ticket.”
Monzo: proactive alerting and incident management without leaving the conversation
Like Starling, Monzo is a digital, cloud-based bank. As the organization grows, its mission is to use its mobile-only model to become a hub for the entire financial lives of its more than 3 million customers. Earning that level of trust requires constant, reliable monitoring and resolution of incidents, so Monzo depends on Slack to help integrate, assemble and analyze its incident management process at scale.
Monzo’s tool AlertManager monitors every facet of the business, from CPU, memory usage, and network connections to customer support queues and wait times. Originally, alerts would be directed into a Slack channel but would then require engineers to fully diagnose and resolve an issue. The bank has since built a uniquely robust version of the tool in Slack, changing it from a simple reporting function into one that initiates incident resolution. Now AlertManager delivers a notification that includes direct links to the apps Runbook, Query and Dashboard, along with context and possible diagnoses.
“This version essentially allows us to be much more proactive about responding to alerts,” said Chris Evans, the platform team lead at Monzo. “Before, the system just relied on people knowing what to do, and we don’t want to be in that position.”
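To make the difference between a bare alert and an actionable one concrete, here is a minimal sketch of a formatter that attaches possible diagnoses and links to a notification. The alert dictionary shape, the `format_alert` function and the example.com URLs are hypothetical; only the `*bold*` and `<url|label>` conventions come from Slack's message formatting.

```python
# Order in which links appear; mirrors the Runbook, Query and Dashboard trio.
LINK_ORDER = ("runbook", "query", "dashboard")


def format_alert(alert):
    """Render an alert as Slack-formatted text with context and links."""
    lines = [f"*{alert['name']}*: {alert['summary']}"]
    causes = alert.get("possible_causes", [])
    if causes:
        lines.append("Possible causes: " + "; ".join(causes))
    links = alert.get("links", {})
    link_text = " | ".join(
        f"<{links[key]}|{key.title()}>" for key in LINK_ORDER if key in links
    )
    if link_text:
        lines.append(link_text)
    return "\n".join(lines)


example = {
    "name": "SupportQueueBacklog",
    "summary": "wait time above 10 minutes",
    "possible_causes": ["staffing gap", "upstream outage"],
    "links": {
        "runbook": "https://example.com/runbooks/support-queue",
        "dashboard": "https://example.com/dashboards/support",
    },
}
print(format_alert(example))
```

The point of the design is that whoever picks up the alert starts from a runbook and a dashboard rather than from memory.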
As Monzo has scaled and its incident volume has grown, it has built more integrations into its Slack workspace to create a streamlined incident management process. One is /incident, a slash command that opens a new investigation, prompts the user to add a description with additional information and links, and provides several button options such as “create comms channel” or “page on-caller.” A summary of the incident is also cross-posted to an #incidents channel that serves as a running feed for anyone in the organization.
Completing the full cycle of transparency, the slash command relays a message to an outside service that updates the external status page Monzo customers see, so they are informed in as close to real time as possible.
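The fan-out performed by a command like /incident can be sketched as a pure function that turns the incoming request into a reply plus a list of follow-up actions. Slack delivers slash commands as form fields such as `command`, `text` and `user_name`; everything else here (the `handle_incident_command` name, the action dictionaries and the idea that the status page receives the raw description) is an assumption made for illustration, not a description of Monzo's tooling.

```python
def handle_incident_command(payload):
    """Turn a /incident request into a reply and follow-up actions."""
    if payload.get("command") != "/incident":
        raise ValueError("unexpected command")
    description = payload.get("text", "").strip()
    if not description:
        # Nothing to open yet; ask the user for a description instead.
        return {
            "reply": "Usage: /incident <short description>",
            "buttons": [],
            "actions": [],
        }
    summary = f"New incident opened by {payload['user_name']}: {description}"
    return {
        "reply": summary,
        # Button labels taken from the article.
        "buttons": ["create comms channel", "page on-caller"],
        "actions": [
            # Cross-post to the running feed.
            {"type": "post", "channel": "#incidents", "text": summary},
            # Hypothetical hand-off to the external status page service.
            {"type": "status_page", "text": description},
        ],
    }
```

Returning the actions as data, rather than performing them inline, keeps the handler easy to test and leaves the network calls to a separate layer.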
Engineers can revisit the #incidents channel to confirm the details that were ported over, and they can pin messages in this feed to automatically add them to the official incident document. Evans describes the channel as “a fantastic, rich resource for us when we’re doing postmortems, so we have a really concise timeline of the whole event.”
“The value there is that we can then deal with an incident, we can be integrated with tools like this, without having to leave the conversation at all,” he said. “So there’s none of that context-switching cost that teams would otherwise incur.”
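The postmortem timeline Evans mentions can be pictured as a filter and sort over the channel's messages. In this minimal sketch the message dictionaries, including the boolean `pinned` flag, are a simplified and hypothetical shape; the only detail borrowed from Slack is that message timestamps (`ts`) arrive as strings of seconds since the epoch.

```python
from datetime import datetime, timezone


def build_timeline(messages):
    """Return pinned messages as time-ordered lines, with times in UTC."""
    pinned = [m for m in messages if m.get("pinned")]
    pinned.sort(key=lambda m: float(m["ts"]))
    lines = []
    for m in pinned:
        when = datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc)
        lines.append(f"{when:%H:%M} {m['user']}: {m['text']}")
    return lines
```

Since the pins are made during the incident, the timeline is assembled as a by-product of the conversation rather than reconstructed afterward.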