About the job
Electrum is a pioneering payment software technology firm.
Since our inception in 2012, we have consistently provided trusted, enterprise-grade, cloud-native solutions to enhance financial transaction processing. Our extensive expertise has positioned us as a reputable partner in high-volume, low-value payment schemes, enabling our clients to deliver services to millions of South Africans every day.
At Electrum, our mission is driven by impact – we prioritize designing solutions that matter, acting with urgency, and fostering continuous learning as we scale. We stand by the principle of collaboration – working closely with our clients and teams to create meaningful, sustainable solutions. Safety is paramount; we promote open communication, smart risk-taking, and trust, ensuring that creativity and alignment can flourish. We believe in empowering strong teams – we hire exceptional talent, collaborate vigorously, and hold one another to high standards while leading with empathy and kindness.
The Role
As a Core Reliability Engineer, you will act as a central enabler for our software teams. Your responsibilities will include defining standards, implementing observability tools, and establishing automation frameworks that empower our product teams to independently manage their service health.
In our unique FinTech environment, reliability transcends mere server uptime; it encompasses the processing of high-volume, impactful financial transactions where even a single dropped message can have significant real-world implications. We seek an innovative systems thinker eager to tackle challenging industry problems, architect solutions for scalability while ensuring reliability, and help us set new benchmarks for reliability in payments.
Your primary objective will be to make building reliable software straightforward, and to ensure that we are alerted to failures before our clients notice them.
Responsibilities
Enablement & RelOps Culture
- Implement the Observability Ladder: Guide teams from basic monitoring to advanced metric tracking. Collaborate with product teams to define SLAs, SLIs, and SLOs, while creating dashboards that monitor error budgets effectively.
- Empower Product Teams: Develop frameworks and deployment tools (e.g., CI/CD, internal tool integrations) that enable teams to make informed, data-driven decisions about deployment safety and that automate rollbacks when error budgets are exhausted.
- Champion Reliability: Foster a blameless post-mortem culture focused on actionable insights, system enhancements, and quantifiable metrics (MTBF, MTTR).
Frameworks & Automation
- Standardized Alerting & On-Call: Continuously refine our company-wide alerting and on-call frameworks to minimize alert fatigue and ensure clarity when alerts are triggered.
