About the job
About Ditto:
Ditto is changing how data moves at the edge, enabling developers to build resilient, real-time applications regardless of network conditions. Whether in a stadium, on an airplane, or at a remote military base, Ditto’s peer-to-peer synchronization engine keeps devices connected and data consistent, even without internet access. Backed by over $145 million in funding and trusted by organizations such as Chick-fil-A, Delta Air Lines, and the U.S. military, Ditto supports mission-critical operations in aviation, retail, travel, hospitality, and defense. As a rapidly growing, globally distributed startup, we are committed to building a diverse and inclusive team whose range of perspectives helps us solve the world’s most complex connectivity challenges.
About the Position:
Ditto is scaling to meet the needs of its enterprise customers, and we need skilled Site Reliability Engineers to maintain enterprise-grade reliability across our infrastructure.
This role is an opportunity to join a specialized team focused on observability, system reliability, and operational excellence for our edge-to-cloud database technology.
As a Site Reliability Engineer, you will help ensure the reliability, performance, and scalability of Ditto's cloud infrastructure. You will collaborate with product engineering teams to improve system resilience, lead incident management processes, and develop observability solutions tailored to our distributed architecture.
Key Responsibilities:
Develop and maintain observability solutions leveraging platforms such as Datadog, Prometheus, and Grafana.
Lead incident management efforts, coordinating response strategies, troubleshooting issues, and determining follow-up actions.
Collaborate with product engineering teams to design reliable systems, recover from incidents, and derive insights from failures.
Work with teams to establish and uphold SLOs, monitoring frameworks, and alerting mechanisms that ensure reliability at scale.
Design and implement automation and support tools to enhance system resilience, maintain operational safety, and minimize operational overhead.
Lead the creation and upkeep of runbooks, alert definitions, and incident response protocols.
Participate in on-call rotations to deliver 24/7 support for critical production systems.

