Operations Engineer at TensorWave | Las Vegas, NV

TensorWave, Las Vegas, Nevada
On-site, Full-time


Experience Level

Entry Level

Qualifications

  • Strong analytical and problem-solving skills

  • Detail-oriented with a focus on maintaining high standards in customer service

  • Ability to work effectively under pressure in a fast-paced environment

  • Experience with monitoring tools and observability platforms is a plus

  • Familiarity with cloud services and infrastructure management

  • Excellent communication skills, both written and verbal

About the job

About TensorWave

At TensorWave, our mission is to provide seamless, secure, and reliable AI compute at scale. Our innovative cloud platform removes infrastructure barriers, allowing developers to focus on creativity rather than technical limitations. We believe that groundbreaking AI should progress at the pace of innovation, not be hindered by infrastructure challenges.

Role Overview

We are seeking an Operations Engineer to join our Global Operations Center team, which serves as the backbone of TensorWave’s commitment to customer infrastructure reliability. This pivotal role involves real-time monitoring of customer environments, proactive issue detection to avert workload disruptions, and acting as the first line of response to customer-reported incidents. Located at our Las Vegas headquarters, you will be part of our dedicated 24/7 team, responsible for system monitoring, runbook execution, and collaboration with onsite teams and engineering during escalations.

As an early member of our Operations Center, you will play a crucial role in shaping how we ensure the stability of our customers’ most critical workloads. This position is perfect for individuals who thrive under pressure, have a keen eye for detail, and find motivation in knowing their contributions safeguard positive customer outcomes.

Key Responsibilities

  • Monitor customer environments across TensorWave data centers utilizing advanced monitoring and observability tools

  • Track critical health metrics, including GPU usage, node availability, network performance, storage health, and Kubernetes cluster status

  • Detect anomalies and potential issues proactively before they escalate into customer-impacting events

  • Maintain situational awareness of active workloads, scheduled maintenance, and known issues across the fleet

  • Provide regular health updates and identify trends that may indicate systemic risks to customer environments

  • Act as the initial responder to customer-reported incidents and alerts, performing initial triage and classification

  • Execute established procedures to diagnose and resolve common infrastructure problems such as node failures and connectivity issues

  • Escalate complex issues to engineering or onsite teams with clear context for effective resolution

About TensorWave

TensorWave is at the forefront of revolutionizing AI compute with a commitment to delivering a seamless, secure, and resilient cloud platform. Our goal is to empower innovators by removing infrastructure barriers, enabling them to focus on their groundbreaking ideas.
