About the job
At Tensorwave Cloud, we are dedicated to building seamless, secure, and resilient AI infrastructure at scale, breaking down barriers and redefining the norms to empower innovators and nurture AI advancements.
Role Overview
We are looking for a proactive Site Reliability Engineer with a strong software engineering background to design, build, and maintain highly scalable, secure, and resilient infrastructure.
In this pivotal role, you will engage in low-level systems design, automate infrastructure using contemporary tools, and ensure platform reliability.
This position is ideal for individuals who thrive at the intersection of systems programming and DevOps, who code in Go, JavaScript, Rust, C, or Zig and manage infrastructure with NixOS, Kubernetes, and Terraform.
Key Responsibilities
Design, build, and sustain infrastructure systems utilizing Linux and NixOS.
Utilize Terraform for infrastructure-as-code to provision and scale resources effectively.
Architect and operate Kubernetes clusters with an emphasis on performance, security, and automation.
Develop high-performance tools and internal utilities in Go or Rust.
Create and manage CI/CD pipelines for infrastructure and code deployments.
Monitor system performance, troubleshoot issues, and enhance reliability through observability tools.
Work collaboratively with engineering teams to support deployment strategies and development workflows.
Required Qualifications
Bachelor's degree in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience.
5+ years of experience in DevOps, Site Reliability, or Infrastructure Engineering roles.
Proficiency in one or more systems programming languages such as Rust or Go.
Extensive experience with Linux systems and configuration management.
Hands-on experience with Terraform, Kubernetes, and containerized environments.
Strong understanding of systems programming, performance tuning, and operating system internals.
Familiarity with CI/CD practices and infrastructure monitoring/alerting tools.

