Technical Lead Manager - Safety Engineering
About the Team
Slack is seeking an experienced systems engineer to lead our new Safety Engineering program in Slack’s Service Engineering Team.
Safety Engineering at Slack combines the disciplines of Resilience Engineering and Chaos Engineering. Its goals are:
- Nurture the development of a learning organization, maximizing the depth of learning created from post-incident analysis and maximizing the breadth of absorption of lessons that improve response time or prevent repeat incidents.
- Discover failure modes before they impact service to customers or cascade into full outages.
You will partner with our Production Engineering team, working with our software development teams to build the tooling necessary to make their software resilient in a complex distributed cloud environment. Through experimentation and testing, you will form hypotheses about steady-state behavior, inject real-world events theorized to disrupt that steady state, observe the resulting behavior, and define mitigating software designs for what you discover. You will experiment with Production systems in a safe way that prevents cascading or catastrophic failures. You will create experimentation frameworks and automation that give developers key failure visibility automatically and continuously.
No matter how much tooling we build, visibility we gain or experimentation we do, bad things will still happen. As part of Safety Engineering, you will make sure that our organization learns effectively from its post-incident reviews. This includes effective facilitation that looks beyond root cause and explores the human factors involved in the event. You will embrace digging deep into the “Whys” as part of our blameless post-mortem process. You will develop programs to ensure lessons learned are absorbed by the rest of the organization to prevent repeat incidents or improve response times.
Your responsibilities may include:
- Develop tooling that tests failure modes of service software in controlled production experiments
- Identify necessary monitoring and alerting points for known failure modes where software cannot recover automatically
- Participate in the Production Engineering on-call rotation
- Embed with software teams to enhance the failure detection and recovery features they add to their software
- Design and construct frameworks and automation to continuously test for failures and notify team members
- Facilitate post-incident briefings to drive deep thinking about human factors and learning beyond basic Root Cause Analysis and Action Items
- Ensure lessons that either improve response time for on-call responders or prevent repeat incidents are spread across Engineering
You might be a good fit if:
- You enjoy innovating new engineering programs focused on improving reliability
- You are biased towards action and incremental progress you can observe and learn from
- You are an effective facilitator capable of asking important questions about Human Factors and process rather than just technical cause and effect
- You communicate well, helping develop the charter of the Safety Engineering program and demonstrating the value it creates in improving the reliability of Slack
- You can work independently and communicate across multiple time zones
- You have a demonstrated track record of improving the reliability of distributed systems through SRE, Chaos or Production Engineering
- You have experience leading network and system design for cloud-based distributed systems
- You have experience with a 24/7 on-call rotation
- You have AWS experience (a plus)
- You have a specific background in Chaos Engineering or Resilience Engineering
Slack is the collaboration hub of choice for companies of all sizes, all across the world. By using Slack, they ensure that the right people are always in the loop, that key information is always at their fingertips, and that new team members can get up to speed easily. With Slack, teams are better connected.
Launched in February 2014, Slack is the fastest-growing business application ever and is used by thousands of teams and millions of users every day. We currently have nine offices worldwide: San Francisco, Vancouver, Dublin, Melbourne, New York, London, Tokyo, Toronto and Denver.
Ensuring a diverse and inclusive workplace where we learn from each other is core to Slack's values. We welcome people of different backgrounds, experiences, abilities and perspectives. We are an equal opportunity employer and a pleasant and supportive place to work. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
Come do the best work of your life here at Slack.