About the job
About Our Team:
Join the Core Platform team at Sift, where you'll play a pivotal role in sustaining and enhancing the data, infrastructure, messaging, and service platforms that power our online systems. We keep these systems available, reliable, and performant enough to meet customer requirements. When disruptions occur, we follow well-established recovery protocols to restore service quickly. Operating intricate, large-scale systems like these requires ongoing monitoring and proactive maintenance to uphold our high standards.
Your Responsibilities:
Design and construct immutable infrastructure and fault-tolerant, multi-AZ/multi-region systems that are resilient and self-healing.
Implement multi-region deployments, including Bigtable clusters that span multiple regions, ensuring specific customers are routed to designated regions (e.g., sticky sessions at the regional level).
Streamline local development and testing processes to be quick, efficient, and seamless.
Create dynamic environments that allow real-time interactions between specific services and other environments.
Develop automated bot solutions for deployment and monitoring, including integration with Slack for streamlined updates.
Participate in on-call support and incident response, providing 12/7 coverage for one calendar week approximately every 3-4 weeks.
Technical Stack: GCP, AWS, Terraform, Kubernetes, Vault, Jenkins, Kafka, Snowflake, Spark, Java, Python 3
Ideal Candidate:
You possess a robust understanding of large-scale computing and treat infrastructure as code. You are passionate about constructing immutable infrastructure and resilient, multi-AZ/multi-region systems capable of withstanding failures. While you appreciate the significance of monitoring and alerting, your ultimate ambition is to design self-healing systems. Collaboration is essential to you, and you strive to be a force multiplier by making thoughtful trade-offs that drive success.

