About the job
Join XBOW and help shape the future of offensive security. In an era where attackers leverage AI to outpace defenders, we are at the forefront of creating a security platform that ensures organizations stay one step ahead. Our cutting-edge, AI-driven system autonomously identifies, validates, and exploits vulnerabilities, providing proof-backed results in mere hours instead of weeks.
Founded by Oege de Moor, the visionary behind GitHub Copilot, and supported by top-tier investors such as Sequoia and Altimeter, XBOW is tackling one of the most pressing challenges in cybersecurity. Over the past year, our exceptional AI team, composed of leading AI experts and renowned security researchers, has discovered thousands of real-world zero-days in the software that billions depend on, securing the top position on HackerOne’s global leaderboard.
We are a dynamic group of innovators, hackers, and researchers who thrive on addressing complex challenges. If you are eager to explore the limits of AI, redefine cybersecurity, and be part of a team that is paving the way for a new era of defense, we would love to hear from you.
Your Role: Site Reliability Engineer (SRE) focused on Automation and Incident Response
As a Site Reliability Engineer at XBOW, you will play a crucial role in maintaining the stability, observability, and resilience of our production systems as we scale. You will develop and maintain automated reliability tools for monitoring, alerting, and self-healing, and you will set and track service level objectives for both production and development environments.
This position requires close collaboration with infrastructure and feature teams to manage cloud systems through Infrastructure as Code (IaC), assess architectural changes for their impact on reliability and capacity, and respond to incidents during local working hours as part of a “follow the sun” model.
When incidents arise, you will lead or assist in root cause investigations, analyze incident trends across the organization, and implement improvements to mitigate future risks. Additionally, you will help maintain internal and customer-facing status dashboards to effectively communicate system health and uptime.
Responsibilities:
Automating site reliability infrastructure, monitoring, and self-healing systems.
Defining and owning Service Level Objectives for production and development deployments.
Implementing Infrastructure as Code for production and development systems in collaboration with the infrastructure engineering team.
