About the job
Endava is on the lookout for a proactive Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to take charge of infrastructure provisioning, CI/CD automation, telemetry pipelines, and the deployment of AI-powered services, agents, and orchestration systems.
This role is infrastructure-intensive and SRE-focused, dedicated to maintaining the reliability, observability, scalability, security, cost-efficiency, and safe operational standards of AI systems in production.
In this position, you will be instrumental in establishing and sustaining the foundational platform that allows AI services to function securely and efficiently at scale.
Key Responsibilities:
- Infrastructure Provisioning & Automation:
  - Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar).
  - Provision and maintain Kubernetes clusters and associated services.
  - Automate the setup of environments across development, staging, and production.
  - Oversee networking, IAM, secrets, storage, and compute scaling.
  - Guarantee high availability, resilience, and disaster recovery readiness.
- CI/CD & Deployment Engineering:
  - Develop and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts.
  - Implement automated testing and reliability validation gates.
  - Facilitate blue/green and canary deployments.
  - Create safe rollback mechanisms for services and models.
  - Incorporate reliability and health checks into deployment workflows.
- Model & Agent Deployment Governance:
  - Package, version, and deploy models into containerized environments.
  - Manage model artifact storage and promotion across environments.
  - Monitor model performance and detect degradation.
  - Support retraining cycles and model refresh workflows.
  - Ensure safe rollout and rollback of model versions.
  - Implement monitoring for inference latency, throughput, and cost.
- Data Pipelines for Telemetry & Observability:
  - Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events).
  - Enable structured telemetry for AI and orchestration workflows.
  - Ensure reliability for real-time and batch processing.
  - Optimize pipeline scalability and performance.
- AIOps Platform Integration:
  - Evaluate, deploy, and integrate AIOps platforms.
  - Enhance anomaly detection, correlation, and alert intelligence.
  - Minimize alert noise and improve signal quality.
  - Integrate AIOps outputs into operational workflows.

