About TensorWave
At TensorWave, our mission is to provide seamless, secure, and reliable AI compute at scale. Our innovative cloud platform removes infrastructure barriers, allowing developers to focus on creativity rather than technical limitations. We believe that groundbreaking AI should progress at the pace of innovation, not be hindered by infrastructure challenges.
Role Overview
We are seeking an Operations Engineer to join our Global Operations Center team, which serves as the backbone of TensorWave’s commitment to customer infrastructure reliability. This pivotal role involves real-time monitoring of customer environments, proactive issue detection to avert workload disruptions, and acting as the first line of response to customer-reported incidents. Based at our Las Vegas headquarters, you will join our dedicated 24/7 team, which is responsible for system monitoring, runbook execution, and collaboration with onsite and engineering teams during escalations.
As an early member of our Operations Center, you will play a crucial role in shaping how we ensure the stability of our customers’ most critical workloads. This position is perfect for individuals who thrive under pressure, have a keen eye for detail, and find motivation in knowing their contributions safeguard positive customer outcomes.
Key Responsibilities
Monitor customer environments across TensorWave data centers using monitoring and observability tools
Track critical health metrics, including GPU usage, node availability, network performance, storage health, and Kubernetes cluster status
Detect anomalies and potential issues proactively before they escalate into customer-impacting events
Maintain situational awareness of active workloads, scheduled maintenance, and known issues across the fleet
Provide regular health updates and identify trends that may indicate systemic risks to customer environments
Act as the first responder to customer-reported incidents and alerts, performing initial triage and classification
Execute established procedures to diagnose and resolve common infrastructure problems such as node failures and connectivity issues
Escalate complex issues to engineering or onsite teams with clear context for effective resolution

