Senior Engineer - Safety Engineering - Incident Management
About the Team
Slack is seeking an experienced systems engineer and incident management specialist to improve incident management and command as part of our Safety Engineering program in Slack Service Engineering.
The mission of Safety Engineering is to empower and enable Slack to move fast and grow big with confidence and industry-leading reliability. We use the disciplines of resilience engineering, chaos engineering, incident management and response, and site reliability and systems engineering.
Safety Engineering’s goals are:
- Make Slack's incident command and response world-class.
- Nurture the development of a learning organization, maximizing the depth of learning created from post-incident analysis and maximizing the breadth of absorption of lessons that reduce response time or prevent repeat incidents.
- Discover people and engineering failure modes before they impact service to customers or cascade into full outages.
As the incident response expert, you will partner with development teams and engineering leadership to foster a community of practice and continuous improvement for incident response. During an incident, engineers should feel confident in their response, and incident commanders should be effective leaders of the situation.
No matter how much process and practice we do, bad things will still happen. As part of Safety Engineering you will make sure that our organization is effectively learning from our post-incident reviews. You will embrace digging deep into the “Whys” as part of our blameless post-mortem process, especially around the incident response process itself. You will develop programs to ensure lessons learned are absorbed by the rest of the organization to improve response.
Your responsibilities may include:
- Design and construct frameworks and processes to ensure a fast, organized, and effective incident response.
- Train individuals and groups in incident response and incident command.
- Facilitate post-incident briefings to drive deep thinking about human factors and learning beyond basic Root Cause Analysis and Action Items.
- Ensure lessons that either improve response time for on-call responders or prevent repeat incidents are spread across Engineering.
You might be a good fit if:
- You enjoy creating new engineering programs focused on improving reliability and incident response.
- You are biased towards action and incremental progress you can observe and learn from.
- You are an effective facilitator capable of asking important questions about Human Factors and process rather than just technical cause and effect.
- You communicate well in writing, in presentations, when leading classes, and of course in Slack.
- You can work independently and communicate across multiple time zones.
- You have a demonstrated track record of improving the reliability of distributed systems through SRE, Chaos or Production Engineering.
- You have a demonstrated track record of designing and implementing incident response policies.
- You have experience leading incident response for cloud-based distributed systems.
- You have experience with a 24/7 on-call rotation.
- You have AWS experience (a plus).
- You have a specific background in SRE or Resilience Engineering.
Slack is the collaboration hub of choice for companies of all sizes, all across the world. By using Slack, they ensure that the right people are always in the loop, that key information is always at their fingertips, and that new team members can get up to speed easily. With Slack, teams are better connected.
Launched in February 2014, Slack is the fastest-growing business application ever and is used by thousands of teams and millions of users every day. We currently have nine offices worldwide, in San Francisco, Vancouver, Dublin, Melbourne, New York, London, Tokyo, Toronto and Denver.
Ensuring a diverse and inclusive workplace where we learn from each other is core to Slack's values. We welcome people of different backgrounds, experiences, abilities and perspectives. We are an equal opportunity employer and a pleasant and supportive place to work. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
Come do the best work of your life here at Slack.