
Senior HPC & GPU Infrastructure Engineer

Sciforium, San Francisco
On-site, Full-time



Experience Level

Senior

Qualifications

The ideal candidate will have a strong background in high-performance computing and GPU technologies, coupled with hands-on experience in Linux system administration. Proficiency in managing the machine learning software stack, including CUDA, PyTorch, and JAX, is essential. You should also have excellent problem-solving skills and the ability to work collaboratively in a fast-paced environment. A Bachelor's degree in Computer Science, Engineering, or a related field is preferred.

About the job

At Sciforium, we are at the forefront of AI infrastructure, pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.

About the Role

We are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer who will be responsible for ensuring the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will serve as the crucial link between hardware operations, distributed systems, and machine learning workflows. This position encompasses a range of responsibilities, from hands-on Linux systems engineering and GPU driver setup to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and aspire to create world-class AI infrastructure, we would love to hear from you.

Your Responsibilities

1. System Health & Reliability (SRE)

  • On-Call Response: Be the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, ensuring rapid issue resolution to minimize downtime.

  • Cluster Monitoring: Develop and maintain monitoring protocols for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.

  • Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

  • OS Management: Oversee the installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation for large node fleets.

  • Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.

  • Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems like NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

  • Deployment & Bring-Up: Spearhead the deployment of new GPU nodes, including BIOS configuration and software integration to ensure optimal performance.

About Sciforium

Sciforium is a cutting-edge AI infrastructure company committed to developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. With significant investment backing and direct support from AMD, we are rapidly growing our team to build the comprehensive stack that powers advanced AI models and real-time applications.
