About the job
ABOUT TALON.ONE:
Talon.One is a leading incentives engine that integrates loyalty, promotions, and gamification into a single, robust platform. With enterprise-grade security and scalability, Talon.One enables businesses to create tailored, profitable promotional and loyalty programs using any available data.
Currently, over 250 esteemed brands, including Adidas, Sephora, and Carlsberg, leverage Talon.One to enhance customer engagement and foster lasting loyalty.
ABOUT THE TEAM:
As a Senior Site Reliability Engineer, you will take ownership of reliability across the Talon.One platform, making a significant impact in this hands-on senior role. You will influence our approach to designing, measuring, and enhancing reliability throughout the engineering organization.
You will build and refine our reliability foundations: observability architecture, SLO frameworks, incident management, and production standards. You will prevent recurring incidents by eliminating their root causes, automate operational tasks to minimize toil, improve signal quality across our monitoring systems, and guide engineering teams in building services that are resilient and scalable by design.
If you are passionate about constructing practical systems, setting technical direction, and achieving measurable improvements in reliability across a complex distributed platform, this opportunity is perfect for you.
ONCE YOU ARE HERE YOU WILL:
- Take ownership of reliability metrics: availability, latency, error rates, and overall operational health.
- Define and implement SLOs and error budgets to establish clear reliability objectives and drive engineering focus.
- Provide guidance to the engineering organization on designs, standards, and best practices to ensure reliability and stability across the Talon.One product.
- Develop and enhance observability through metrics, logs, and traces, ensuring the system is comprehensible, not merely monitored.
- Design and improve our end-to-end monitoring and observability architecture, including data pipelines, signal quality, alert strategies, dashboards, and SLO implementation, with cost-aware scalability in mind.
- Reduce operational toil by creating reliability tools and automations that decrease repetitive tasks and enhance system resilience.
- Drive structural improvements by identifying and fixing the root causes of incidents, rather than just managing their symptoms.
- Lead and continually refine incident management practices: on-call readiness, severity handling, stakeholder communication, and post-incident reviews.

