About the job
Senior DevOps / SRE Engineer
Location: Based in a time zone between US and GMT
Remote | Full-time
Compensation: $120K - $150K
Join our client, a company working at the intersection of decentralized finance and artificial intelligence, as a Senior DevOps / SRE Engineer. The company builds infrastructure for autonomous AI trading agents operating on the Hyperliquid network. You will be responsible for the reliability of high-stakes environments where infrastructure resilience directly protects capital.
As the successful candidate, you will architect and maintain systems that support many concurrent AI agents and keep them available, fast, and secure. In this environment, downtime leaves financial positions unprotected, so experience building resilient, zero-downtime systems for real-time financial workloads is essential.
Key Responsibilities
- Agent Infrastructure Management: Build and maintain the infrastructure for concurrent AI trading agents, including complex cron schedules, state files, and trailing-stop processes.
- Deployment & Orchestration: Implement and manage agent environments, ensuring workspace persistence, isolated session management, and Model Context Protocol (MCP) server connectivity.
- CI/CD Pipeline Development: Architect and manage pipelines that deploy trading skills and plugins to production without interrupting live trading.
- Zero-Downtime Operations: Employ deployment strategies (blue/green, canary) to safeguard active financial positions throughout infrastructure changes.
- Observability & Monitoring: Establish alerting across the entire stack, using metrics, logs, and traces, to detect agent failures, state file corruption, and infrastructure regressions early.
- Cloud & Database Scaling: Manage and scale core platform infrastructure, including Kubernetes (EKS) clusters, Redis, Postgres, ClickHouse, and Kafka.
- Blockchain Reliability: Ensure the stability of blockchain node infrastructure and connectivity to exchange APIs and on-chain transaction systems.
- Incident Leadership: Guide incident response and on-call practices, including debugging, mitigation, and post-mortems to enhance long-term platform reliability.
Interview Process
The interview process aims to assess both technical expertise and the capacity to navigate high-pressure production scenarios.
- Initial Technical Screening: A conversation focused on your technical proficiency and experience...
