About the job
About TensorWave
At TensorWave, our goal is clear: to provide seamless, secure, and reliable AI compute at scale. We've developed a flexible cloud platform that removes infrastructure barriers, allowing innovators to concentrate on creativity rather than technical hurdles. We believe that transformational AI should operate at the pace of ideas, not infrastructure.
About the Role
As a leading GPU cloud infrastructure provider, TensorWave delivers high-performance computing to the most demanding AI and machine learning applications. With data centers strategically located across the United States, we are rapidly expanding to meet the surging demand for GPU computing.
We are establishing dedicated Data Center Operations teams for each of our facilities and are looking for a dynamic leader to oversee this essential function. This position is crucial for maintaining the integrity of our physical infrastructure. You will be responsible for ensuring 24/7 availability, security, and operational efficiency within our data center environment. This role is not merely about maintaining the status quo; it involves optimizing high-density computing environments, leading a proficient technical team, and managing the hardware lifecycle. You will connect high-level operational strategy with hands-on execution, turning our uptime goals into reality.
This is a distinctive opportunity to shape a function that directly influences our customers' experiences. In this role, you will not just manage equipment; you will be the custodian of the data powering our business. We provide a fast-paced environment where your expertise in optimizing the RMA process and nurturing technical talent will significantly impact our organizational success.
What You’ll Do
Team Leadership: Guide, mentor, and schedule a team of 8 Data Center Technicians, promoting a culture of technical excellence and responsibility.
Performance Management: Conduct regular one-on-one meetings, performance evaluations, and skill-gap assessments to ensure the team remains ahead of evolving technologies, such as liquid cooling and AI-optimized racking.
Process Improvement: Supervise the Inventory/RMA Specialist to streamline hardware replacement cycles and ensure that 'dead on arrival' (DOA) equipment is managed with minimal downtime.
Infrastructure Management: Oversee the installation, cabling, and decommissioning of server, storage, and networking hardware.
Uptime Management: Serve as the primary escalation point for operational issues and drive initiatives to maintain our high availability standards.
