Experience Level
Senior
About the job
Join Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll engage with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.
Key Responsibilities
Provide on-call support for our production systems, responding quickly when issues affect customers.
Design and implement new capabilities within our storage services.
Design interactions in distributed systems focusing on atomicity and idempotency.
Deploy and generalize infrastructure across multiple cloud environments.
Adapt to evolving customer needs amidst ambiguity.
Lead engineering teams through complex decisions and provide insightful PR feedback.
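The atomicity-and-idempotency responsibility above is easiest to see in miniature. The sketch below is illustrative only (the service, method, and key names are invented for this example, not Archil APIs): an idempotency key lets a retried request be applied exactly once, so a network retry cannot double-apply a side effect.

```python
import uuid

class PaymentService:
    """Toy service: applying the same request twice must not double-apply."""
    def __init__(self):
        self.balance = 0
        self._seen = {}  # idempotency key -> result of the first application

    def credit(self, amount, idempotency_key):
        # A replay of an already-processed request returns the stored
        # result instead of re-applying the side effect.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        result = {"balance": self.balance}
        self._seen[idempotency_key] = result
        return result

svc = PaymentService()
key = str(uuid.uuid4())
svc.credit(100, key)
svc.credit(100, key)   # simulated network retry: same key, no double-apply
assert svc.balance == 100
```

The design choice worth noting: the caller, not the server, mints the key, which is what makes retries after an ambiguous failure safe.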
Join Krea's Innovative Team
At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.
We view AI as a transformative medium that enables expression across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.
Supercomputing and AI Infrastructure at Krea
Our team builds and manages the foundational infrastructure that supports Krea's research and inference: distributed training systems, Kubernetes clusters with 1,000+ GPUs, and petabyte-scale data pipelines. Much of our work involves bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, designed to handle modern AI workloads efficiently.
Key Projects You Will Contribute To:
Distributed Data Systems: Design and implement multi-stage pipelines that transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
GPU Infrastructure: Manage distributed training and inference across Kubernetes clusters with 1,000+ GPUs; address orchestration and scaling challenges for large-scale GPU jobs; optimize research workflows across multiple datacenters.
Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during large training runs; develop fault-tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
Applied ML Pipelines: Identify clean scenes in millions of videos using distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity to research outcomes.
About Our Team
The Platform Systems team at OpenAI merges advanced AI technologies with large-scale distributed systems. We build the engineering and research infrastructure essential for training OpenAI's premier models on some of the most powerful custom-built supercomputers in the world.
Our team develops the core software for model training, working deep in the stack: collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we design are pivotal to OpenAI's research, enabling reliable and efficient training at the leading edge.
We work in close partnership with researchers across the organization, continuously integrating insights from OpenAI projects to advance our training platform.
About the Role
As a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that provide visibility into large-scale training operations and ensure their dependable operation at scale. You will design systems for failure detection, tracing, and observability that pinpoint slow or malfunctioning nodes, identify performance bottlenecks, and help engineers optimize large distributed training jobs. This infrastructure is integral to OpenAI's training stack and continuously evolves to accommodate new use cases and increasingly intricate workloads. The position sits at the center of our training infrastructure, combining systems engineering, performance analysis, and large-scale debugging.
Key Responsibilities
Design and develop distributed failure detection, tracing, and profiling systems tailored for large-scale AI training jobs.
Create tools to identify slow, faulty, or errant nodes and deliver actionable insights into system behavior.
Enhance observability, reliability, and performance across OpenAI's training platform.
Troubleshoot and resolve issues within complex, high-throughput distributed systems.
Collaborate with systems, infrastructure, and research teams to advance platform capabilities.
Adapt and extend failure detection and tracing systems to support new training paradigms and workloads.
Ideal Candidate Profile
Deep passion for performance, stability, and observability in distributed systems.
Proficiency in systems engineering and performance analysis.
Experience debugging high-throughput distributed systems.
Strong collaboration skills and a track record of working with cross-functional teams.
Adaptability and eagerness to embrace new technologies and methodologies.
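As a hedged illustration of what "identify slow, faulty, or errant nodes" can mean in practice (a generic textbook heuristic, not OpenAI's actual detector; node names and timings are made up), per-node step times can be screened with a robust outlier test. Using the median absolute deviation rather than the mean keeps the stragglers themselves from skewing the baseline.

```python
import statistics

def flag_stragglers(step_times, threshold=3.0):
    """Flag nodes whose per-step time exceeds the fleet median by more than
    `threshold` median-absolute-deviations (robust against the outliers
    themselves distorting the baseline)."""
    median = statistics.median(step_times.values())
    mad = statistics.median(abs(t - median) for t in step_times.values()) or 1e-9
    return sorted(node for node, t in step_times.items()
                  if (t - median) / mad > threshold)

times = {"node-0": 1.02, "node-1": 0.98, "node-2": 1.01,
         "node-3": 4.70,   # e.g. a thermally throttled or mis-linked node
         "node-4": 1.00}
print(flag_stragglers(times))  # -> ['node-3']
```

A real system would feed this from tracing data and handle correlated slowdowns (a whole rack, a shared switch), but the core idea of comparing each node against a robust fleet baseline is the same.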
Full-time|$170K/yr - $210K/yr|On-site|South San Francisco, California, USA
Software Engineer, Delivery Network Platform
Join Zipline, where we are revolutionizing logistics with an autonomous delivery network. As part of the Delivery Network Platform team, you will develop the foundational systems that enable aircraft, sites, and infrastructure to operate seamlessly in live delivery scenarios. Your work will involve creating software that gives operators real-time insight and control, designing orchestration systems that manage fleet movements, and developing validation platforms that ensure the network's reliability as it scales.
Your Responsibilities
You will be responsible for software systems that are pivotal to fleet operations, including:
Network Operating Center software for real-time visibility and interventions across aircraft, sites, missions, weather, and demand.
Fleet orchestration systems for assignment, routing, scheduling, and rebalancing tasks.
Maintenance and asset health systems linking issue detection to service readiness.
Simulation and validation platforms to assess topology, load, and policy changes prior to production.
Platform interfaces and configurable control planes that empower other teams to safely extend the network.
Tackling Complex Challenges
Unlike typical software roles focused on digital experiences, this position plays a critical role in managing a live autonomous logistics network. You'll address challenges such as:
Maintaining an accurate real-time view of aircraft and essential site assets across the network.
Keeping the network operational amidst shifting demand, changing weather, infrastructure issues, or capacity constraints.
Creating operator control interfaces that support quick, accurate decision-making under pressure.
Simulating potential future network behaviors to mitigate risks before they reach production.
These systems directly affect operational performance. You will own significant components of the platform, make critical technical and product decisions, and have a substantial impact on the network's effectiveness.
Team Dynamics
Our team operates with a strong emphasis on ownership, trust, and high technical standards. Engineers are expected to identify significant problems, develop a clear vision for system functionality, and drive solutions from conception to production. We also encourage engineers to use AI tools to speed up exploration, implementation, and debugging while upholding strong engineering principles, judgment, and accountability.
About Granica
Granica is an AI research and infrastructure firm dedicated to creating reliable and steerable representations of enterprise data. We build trust through our product Crunch, a policy-driven health layer that keeps large tabular datasets efficient, reliable, and reversible. On this foundation, we are developing Large Tabular Models: systems that learn cross-column and relational structure in order to provide trustworthy answers and automation with built-in provenance and governance.
Our Mission
AI is currently hampered not only by model design but also by the inefficiency of the data that supports it. Every redundant byte, poorly organized dataset, and inefficient data pathway adds cost, latency, and energy waste at scale. Granica aims to eliminate these inefficiencies. We combine research in information theory, probabilistic modeling, and distributed systems to build self-optimizing data infrastructure: systems that continuously improve how information is represented and used by AI.
Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, bridging advances in information theory and learning efficiency with large-scale distributed systems. Together, we believe the next major advance in AI will come from breakthroughs in efficient systems rather than merely larger models.
Your Contributions
Global Metadata Substrate: Design a transactional metadata substrate that supports time-travel, schema evolution, and atomic consistency across petabyte-scale tabular datasets.
Adaptive Engines: Build systems that autonomously reorganize data, learning from access patterns and workloads to maintain peak efficiency without manual tuning.
Intelligent Data Layouts: Optimize bit-level organization (encoding, compression, and layout) to maximize the signal extracted per byte read.
Autonomous Compute Pipelines: Create distributed compute systems that scale predictably, adapt to dynamic loads, and stay reliable under failure.
Research to Production: Apply new algorithms in compression, representation, and optimization emerging from ongoing research, with opportunities to publish and open-source your work.
Latency as Intelligence: Design systems that inherently minimize latency as a measure of intelligence.
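The metadata-substrate contribution above can be made concrete with a toy sketch. This is illustrative only (Granica's actual design is not described in the posting; all class and field names are invented): an append-only snapshot log gives atomic commits, time-travel reads, and per-snapshot schemas in a few lines.

```python
class TableMetadata:
    """Append-only snapshot log: every commit produces a new immutable
    snapshot, so readers can 'time-travel' by pinning any snapshot id.
    Real systems persist this log transactionally; here it is in memory."""
    def __init__(self):
        self.snapshots = [{"files": (), "schema": ("id",)}]  # snapshot 0

    def commit(self, add_files=(), schema=None):
        prev = self.snapshots[-1]
        new = {"files": prev["files"] + tuple(add_files),
               "schema": schema or prev["schema"]}
        self.snapshots.append(new)      # the append is the atomic commit point
        return len(self.snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

meta = TableMetadata()
v1 = meta.commit(add_files=["part-0.parquet"])
v2 = meta.commit(add_files=["part-1.parquet"], schema=("id", "ts"))  # schema evolution
assert meta.read(v1)["schema"] == ("id",)   # a pinned reader sees the old schema
assert meta.read()["files"] == ("part-0.parquet", "part-1.parquet")
```

Because snapshots are immutable, old readers are never disturbed by new commits, which is the property that makes time-travel and schema evolution cheap.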
At Exa, we are on a mission to build a search engine from the ground up, tailored specifically for AI applications. Our team develops large-scale infrastructure that crawls the internet, trains advanced embedding models for indexing, and builds high-performance vector databases in Rust for fast search. We also operate a state-of-the-art $5M H200 GPU cluster that can put thousands of machines to work simultaneously.
As a Software Engineer specializing in Distributed Data Systems, you will design and implement the data infrastructure that drives our operations: from crawling billions of web pages to training sophisticated embedding models and delivering real-time search. You will have significant autonomy in creating systems capable of scaling to hundreds of petabytes. This is your opportunity to work on data pipelines at unprecedented scale.
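For readers unfamiliar with what a vector database does at its core, here is a deliberately naive sketch (not Exa's implementation, which is written in Rust and uses approximate indexes; the documents and vectors are made up): exact nearest-neighbor search by cosine similarity over an in-memory index.

```python
import math

def top_k(query, index, k=2):
    """Exact nearest-neighbor search by cosine similarity. Production
    vector databases replace this O(n) scan with approximate structures
    such as HNSW or IVF to stay fast at billions of vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    scored = sorted(index.items(), key=lambda kv: cos(query, kv[1]), reverse=True)
    return [doc for doc, _ in scored[:k]]

# Tiny 2-d "embeddings" standing in for model outputs
index = {"doc-a": [1.0, 0.0], "doc-b": [0.9, 0.1], "doc-c": [0.0, 1.0]}
print(top_k([1.0, 0.05], index, k=2))  # -> ['doc-a', 'doc-b']
```

The engineering work in a real system is almost entirely in what this sketch omits: sharding the index, keeping recall high under approximation, and streaming updates from the crawl.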
At Genmo, we are advancing artificial intelligence through innovative research in video generation. Our mission is to build open, cutting-edge models that ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our team, you will help redefine the future of AI and expand the horizons of video creation.
We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model-serving stack. If you are passionate about optimization at the microsecond level and thrive on pushing hardware to its limits, this is the opportunity for you.
Key Responsibilities
Use profiling tools such as Nsight Systems and nvprof to analyze and improve GPU workloads.
Develop high-performance CUDA and Triton kernels for essential model functions.
Reduce cold-start latency from seconds to milliseconds in our serving infrastructure.
Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.
Collaborate closely with machine learning engineers to optimize model implementations.
Diagnose and resolve performance issues throughout the application and hardware stack.
Implement custom memory pooling and allocation strategies to enhance performance.
Promote performance optimization techniques and foster a culture of excellence across teams.
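Kernel fusion, mentioned among the responsibilities above, is easiest to see by counting memory traffic. The toy model below is plain Python, not CUDA, and the pass-counting is a simplification: it contrasts three unfused elementwise "kernels" (each round-tripping an intermediate array through memory) with a single fused pass computing the same result.

```python
def unfused(xs):
    """y = relu(x * 2 + 1) as three separate 'kernels': each pass reads and
    writes a full array, so intermediates round-trip through memory."""
    tmp1 = [x * 2 for x in xs]         # pass 1: read xs, write tmp1
    tmp2 = [t + 1 for t in tmp1]       # pass 2: read tmp1, write tmp2
    out = [max(t, 0.0) for t in tmp2]  # pass 3: read tmp2, write out
    return out, 3 * 2 * len(xs)        # 3 passes x (1 read + 1 write) / element

def fused(xs):
    """Same math in one fused kernel: one read and one write per element."""
    out = [max(x * 2 + 1, 0.0) for x in xs]
    return out, 2 * len(xs)

xs = [-1.5, 0.0, 2.0]
y_unfused, traffic_unfused = unfused(xs)
y_fused, traffic_fused = fused(xs)
assert y_unfused == y_fused == [0.0, 1.0, 5.0]
assert traffic_fused < traffic_unfused  # fusion cuts memory traffic 3x here
```

Elementwise chains like this are memory-bound on GPUs, which is why fusing them is one of the highest-leverage kernel optimizations.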
Role overview
Meter Inc. is developing tools to capture and preserve the expertise of network engineers. The team's goal is to build systems that document how experts diagnose network issues, making it possible for future models to manage networks with less manual effort. This work will help Meter support many customer networks while reducing the need for direct engineer intervention.
What makes this work unique
Network engineering lacks the structured archives found in software development. While Git and GitHub record software decisions, the reasoning behind network troubleshooting often disappears once a problem is fixed. This role centers on building a structured, searchable system for network operations, a kind of GitHub for network engineering. The system will capture network state, expert observations, and the logic behind key decisions.
Your first 90 days
First 30 days: Meet with network engineers to learn their workflows. Study what effective diagnostic documentation looks like and identify the necessary data. Review telemetry (ClickHouse), configurations (Postgres), and support history (Salesforce).
By 60 days: Deliver a working annotation interface. Network engineers should be able to review past support tickets, view the network's state during incidents, and record their reasoning. The tool should be practical and encourage regular use.
By 90 days: Network engineers will be able to create training data independently. Initial model benchmarks from your pipeline will be live, showing how your work improves the process.
Technical stack
TypeScript, React, Go, GraphQL, Kafka, Postgres
Collaboration
This role works closely with Meter's co-founder and CEO, who will help guide the product roadmap and set priorities.
Location
This position is based in San Francisco.
At Sciforium, we are building next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration with AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.
About the Role
We are looking for a talented GPU Kernel Engineer eager to maximize performance on modern accelerators. You will design and optimize the custom GPU kernels that drive our large-scale AI systems, working across the hardware-software stack: low-level kernel development and the integration of optimized operations into high-level machine learning frameworks for large-scale training and inference.
This position suits someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and who wants to contribute significantly to the efficiency and scalability of our machine learning platform.
Key Responsibilities
Develop, implement, and optimize custom GPU kernels using C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.
Profile and tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.
Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.
Create performance models, pinpoint bottlenecks, and deliver kernel-level improvements that significantly boost AI workloads.
Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.
Engage closely with hardware vendors (NVIDIA/AMD) and stay current with GPU architecture and compiler/toolchain advancements.
Contribute to tools, documentation, benchmarking suites, and testing frameworks that ensure correctness and performance reproducibility.
Must-Haves
5+ years of industry or research experience in GPU kernel development or high-performance computing.
Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.
Strong programming proficiency in C++ and Python, and familiarity with machine learning frameworks.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
Full-time|$180K/yr - $200K/yr|Remote|New York, New York, United States; Remote; San Francisco, California, United States; Seattle, Washington, United States
About Us
Lightning AI, the force behind PyTorch Lightning, has been reshaping the AI landscape since 2019. We provide an all-encompassing platform that streamlines the development, training, and deployment of AI systems, easing the transition from research to production.
Following our merger with Voltage Park, a cutting-edge neocloud and AI Factory, we unite developer-centric software with cost-effective, large-scale computing. Our tools are tailored for experimentation, training, and production inference, with built-in security, observability, and control.
We serve clients ranging from individual researchers to startups and large enterprises, operating globally with offices in New York, San Francisco, Seattle, and London. We're proud to be backed by investors including Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
Our Core Values
Move Fast: We prioritize speed and accuracy, breaking complex challenges into manageable tasks.
Focus: We pursue one goal at a time, working collaboratively to deliver precise features.
Balance: Sustained performance comes from adequate rest and recovery; we maintain a healthy work-life balance.
Craftsmanship: We strive for excellence in every detail and take pride in our work and its impact.
Minimal: We embrace simplicity, eliminating unnecessary complexity and focusing on what truly matters.
Role Overview
We are looking for a GPU & Compute Infrastructure Engineer to join our Infrastructure Engineering team. In this role, you will manage image systems, diagnostics, and validation across expansive bare-metal computing infrastructure, particularly GPU-optimized systems. You will work at the crossroads of hardware, systems, and software, developing automation, improving reliability, and enabling efficient cluster setups for AI/ML and HPC workloads. Your responsibilities include overseeing our image pipeline, running validation environments and test clusters, and supporting GPU hardware qualification. This role is essential to maintaining the integrity of our infrastructure: consistency, performance, and reliability.
ABOUT BASETEN
At Baseten, we empower the world's leading AI firms—such as Cursor, Notion, and OpenEvidence—by delivering mission-critical inference. Our blend of applied AI research, robust infrastructure, and user-friendly developer tools enables AI pioneers to deploy groundbreaking models effectively. With our recent $300M Series E funding round, supported by investors including BOND and IVP, we're on an exciting growth trajectory. Join our team and help build the platform that drives the next generation of AI products.
THE ROLE
We are looking for an experienced Senior GPU Kernel Engineer to join our team at the forefront of AI acceleration. In this role, your programming expertise will directly improve the performance of cutting-edge machine learning models. You'll develop highly efficient GPU kernels that optimize computation for transformative AI applications.
You'll thrive in a fast-paced, intellectually challenging environment where your technical skills are pivotal; your contributions will directly affect production systems serving millions of users across various platforms. This position offers exceptional opportunities for career advancement for engineers enthusiastic about low-level optimization and impactful systems engineering.
EXAMPLE INITIATIVES
As part of our Model Performance team, you will engage in projects such as:
Baseten Embeddings Inference: the fastest embeddings solution available.
The Baseten Inference Stack.
Model performance optimization.
RESPONSIBILITIES
Design and develop high-performance GPU kernels for essential machine learning operations, including matrix multiplications and attention mechanisms.
Collaborate with cross-functional teams to drive performance improvements and implement optimizations.
Debug and refine kernel code for maximal efficiency and reliability.
Stay abreast of the latest advancements in GPU technology and machine learning frameworks.
About Our Team
Join the Sora team at OpenAI, where we are developing multimodal capabilities for our foundation models. Our hybrid research and product team integrates multimodal functionality into our AI systems, ensuring it is dependable, user-centric, and aligned with our vision of benefiting society at large.
Role Overview
As a Machine Learning Engineer specializing in Distributed Data Systems, you will design and scale the infrastructure that enables large-scale multimodal training and evaluation at OpenAI. You will manage complex distributed data pipelines, collaborate closely with researchers to convert their requirements into robust, production-ready systems, and enhance the pipelines essential to Sora's rapid iteration cycles.
We are seeking detail-oriented engineers with extensive distributed systems experience who thrive in high-stakes environments and excel at building resilient infrastructure.
This position is located in San Francisco, CA, and follows a hybrid work model with three days in the office each week. We provide relocation assistance for new team members.
Key Responsibilities:
Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
Ensure our data platform can scale rapidly while maintaining high reliability and efficiency.
Collaborate with researchers to understand their requirements deeply and translate them into production-ready systems.
Strengthen, optimize, and manage the critical data infrastructure that supports multimodal training and evaluation.
You Will Excel in This Role If You:
Have strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
Show meticulous attention to detail and a commitment to building and maintaining reliable systems.
Demonstrate solid software engineering fundamentals and effective organizational skills.
Thrive amid ambiguity and rapid change.
About OpenAI
OpenAI is an AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
At sfcompute, we are pioneering a new approach to GPU cluster financing, enabling one of the largest infrastructure build-outs in history while mitigating its risks.
Financing GPU clusters and the infrastructure they require carries inherent risk. Traditionally, developers lease clusters through fixed-price long-term contracts, which pushes that risk onto the customer. As AI and computational demands grow, our mission is to democratize access to powerful computing. We aim to create a liquid market for GPU offtake, allowing startups and smaller enterprises to thrive without the burden of long-term contracts that aren't feasible for them.
Role Overview
Join our infrastructure team, responsible for architecting and deploying cutting-edge GPU clusters globally. You'll play a crucial role in maintaining operational excellence, participating in on-call rotations, and driving automation to support large-scale deployments. As a key member of our small but ambitious team, you will help shape our culture, mentor junior engineers, and learn directly from our customers.
Join Cloudflare as a Distributed Systems Engineer and help us build and maintain our innovative Data Platform. In this role, you'll work on our Analytical Database Platform, enhancing data processing and storage technologies to support our global client base. If you are passionate about distributed systems and enjoy solving complex problems, this is the perfect opportunity for you!
About Our Team
OpenAI is seeking talented software engineers to enhance the productivity of our networking teams, which design and manage the high-performance networking systems underpinning OpenAI's training and inference infrastructure at the cutting edge of technology.
About This Role
We are looking for someone passionate about improving the developer experience for engineers working on intricate infrastructure systems, with a focus on build systems, testing architecture, release pipelines, and efficient development workflows. This role is integral to OpenAI's networking team, streamlining how engineers build, test, validate, and deploy changes in multi-server, networked, and hardware-adjacent environments.
Key Responsibilities:
Enhance development workflows for engineers building and operating OpenAI's networking systems.
Design and refine continuous deployment, release, and validation pipelines.
Develop and maintain test harnesses for multi-server, networked, and hardware-backed environments.
Accelerate iteration speed across codebases in C++, Python, and build-system-centric environments.
Collaborate with engineers to uncover and resolve friction points in CI, testing, debugging, and deployment workflows.
Lead the testing and reliability strategy for infrastructure components that support extensive training and inference workloads.
Work closely with centralized developer experience teams while remaining deeply integrated with the networking engineers closest to the systems.
About Our Team
At OpenAI, the Storage Infrastructure team enables data accessibility, placement, and lifecycle management through advanced APIs. We prioritize scalability, reliability, security, and usability to meet the demands of our pioneering AI research.
Role Overview
We are seeking a talented Software Engineer to join our Storage Infrastructure team, where you will architect and maintain exascale systems that efficiently and reliably manage research data across multiple regions. The ideal candidate has extensive experience in distributed systems, particularly in developing exascale data management solutions or distributed filesystems.
Your Responsibilities
Design and develop software to manage exascale data, ensuring accessibility for researchers.
Enhance the reliability, predictability, and cost efficiency of our storage systems.
Collaborate with researchers to understand and address diverse data use cases.
Implement robust security measures to protect our critical datasets.
Ideal Candidate Profile
Strong foundation in distributed systems principles, with a proven ability to design and implement scalable, reliable, and secure storage architectures.
Proficiency in programming languages relevant to storage systems development.
Experience with cloud platforms, particularly Azure.
Familiarity with AI/ML data access patterns.
A proactive, adaptable approach in a fast-paced, dynamic environment.
About OpenAI
OpenAI is an AI research and deployment organization committed to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to push the boundaries of AI capabilities while ensuring safety and human-centric development. We value the diverse perspectives, voices, and experiences that reflect the full spectrum of humanity, and we are proud to be an equal opportunity employer, committed to fostering an inclusive workplace where all individuals are respected and valued.
Join Cloudflare as a Distributed Systems Engineer specializing in our Data Platform. In this role, you will be at the forefront of building and optimizing systems that enhance data delivery, database management, and retrieval processes. You will collaborate with cross-functional teams to innovate and improve our platform, ensuring seamless data access and performance.
This position offers a unique opportunity to work in a dynamic environment, leveraging cutting-edge technologies to shape how data is processed and used across our platform.
We are seeking a talented and driven Distributed Systems Engineer to join our dynamic Data Platform team at Cloudflare. In this role, you will have the opportunity to work on cutting-edge technologies and help shape the future of data delivery, database management, and retrieval systems. You will collaborate with cross-functional teams to build scalable, reliable, and efficient distributed systems that power our services.
At Sciforium, we are pioneering advanced multimodal AI models and an innovative, high-efficiency serving platform. With substantial backing from AMD and a dedicated team of engineers, we are rapidly expanding our capabilities to support the next generation of frontier AI models and real-time applications.
About the Role
We are looking for a highly skilled Senior HPC & GPU Infrastructure Engineer to own the health, reliability, and performance of our GPU compute cluster. As the primary custodian of our high-density accelerator environment, you will be the crucial link between hardware operations, distributed systems, and machine learning workflows. The role spans hands-on Linux systems engineering and GPU driver setup through maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you are passionate about optimizing hardware performance, enjoy troubleshooting GPUs at scale, and want to build world-class AI infrastructure, we would love to hear from you.
Your Responsibilities
1. System Health & Reliability (SRE)
On-Call Response: Serve as the primary responder for system outages, GPU failures, node crashes, and other cluster-wide incidents, resolving issues rapidly to minimize downtime.
Cluster Monitoring: Develop and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and general system load.
Vendor Liaison: Collaborate with data center personnel, hardware vendors, and on-site technicians on repairs, RMA processing, and physical maintenance of the cluster.
2. Linux & Network Administration
OS Management: Oversee installation, patching, and maintenance of Linux distributions (Ubuntu / CentOS / RHEL), ensuring consistent configuration, kernel tuning, and automation across large node fleets.
Security & Access Controls: Set up VPNs, iptables/firewalls, SSH hardening, and network routing to secure our computing infrastructure.
Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity and administer distributed file systems such as NFS, GPFS, or Lustre.
3. GPU & ML Stack Engineering
Deployment & Bring-Up: Lead the deployment of new GPU nodes, including BIOS configuration and software integration, to ensure optimal performance.
Role Overview
Join Archil as a Senior Distributed Systems Engineer, where you will play a critical role in developing our innovative storage solutions. You'll engage with technologies across the entire stack to tackle challenges and contribute to building Archil volumes, significantly influencing both technical design and product strategy.
Key Responsibilities…
Join Krea's Innovative Team
At Krea, we are at the forefront of developing next-generation AI creative tools. Our commitment lies in making AI an intuitive and controllable medium for creatives. We aspire to create tools that enhance human creativity rather than replace it.
We view AI as a transformative medium that enables expression across diverse formats—text, images, video, sound, and even 3D. Our focus is on creating smarter, more adaptable tools that leverage this medium effectively.
The Role of Supercomputing and AI Infrastructure at Krea
Our team is responsible for building and managing the foundational infrastructure that supports Krea's research and inference processes. This includes distributed training systems, 1,000+ GPU Kubernetes clusters, and extensive petabyte-scale data pipelines. Much of our work involves creating bespoke solutions, such as custom distributed datastores, job orchestration systems, and advanced streaming pipelines, designed to handle modern AI workloads efficiently.
Key Projects You Will Contribute To:
Distributed Data Systems: Design and implement multi-stage pipelines to transform petabytes of raw data into clean, annotated datasets; run classification models across billions of images; deploy and integrate large language models to caption extensive multimedia data.
GPU Infrastructure: Manage distributed training and inference across 1,000+ GPU Kubernetes clusters; address orchestration and scaling challenges for large-scale GPU job processing; optimize research workflows across multiple datacenters.
Distributed Training: Profile and enhance dataloaders streaming thousands of images per second; troubleshoot InfiniBand networking during extensive training runs; develop fault tolerance systems for large-scale pretraining; collaborate with researchers to refine reinforcement learning infrastructure.
Applied ML Pipelines: Identify clean scenes in millions of videos using distributed shot-boundary detection; tailor and train models to sift through billions of images for specific queries; construct systems that link raw cluster capacity with research outcomes.
About Our Team
The Platform Systems team at OpenAI is at the forefront of innovation, merging advanced AI technologies with large-scale distributed systems. We are tasked with creating the engineering and research infrastructure essential for training OpenAI's premier models on some of the most powerful, custom-built supercomputers globally.
Our team is dedicated to developing the core software for model training, delving deep into the technology stack. This encompasses collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we design are pivotal to enhancing OpenAI's research capabilities, facilitating reliable and efficient training at the leading edge of technology.
We work in close partnership with researchers across the organization, continuously integrating insights from various OpenAI projects to advance our training platform.
About the Role
As a Software Engineer specializing in Platform Systems, you will architect and develop distributed systems that enhance visibility into large-scale training operations, ensuring their dependable operation at scale.
Your responsibilities will include designing systems for failure detection, tracing, and observability that pinpoint slow or malfunctioning nodes, identify performance bottlenecks, and assist engineers in optimizing extensive distributed training tasks. This infrastructure is integral to the functionality of OpenAI's training stack and is continuously evolving to accommodate new use cases and increasingly intricate workloads.
This position is central to our training infrastructure, merging systems engineering, performance analysis, and large-scale debugging.
Key Responsibilities
Design and develop distributed failure detection, tracing, and profiling systems tailored for large-scale AI training jobs.
Create tools to identify slow, faulty, or errant nodes and deliver actionable insights into system behavior.
Enhance observability, reliability, and performance across OpenAI's training platform.
Troubleshoot and resolve issues within complex, high-throughput distributed systems.
Collaborate effectively with systems, infrastructure, and research teams to advance platform capabilities.
Adapt and expand failure detection and tracing systems to support new training paradigms and workloads.
Ideal Candidate Profile
Possesses a deep passion for performance, stability, and observability in distributed systems.
Demonstrates proficiency in systems engineering and performance analysis.
Has experience in debugging high-throughput distributed systems.
Exhibits strong collaboration skills with a track record of working with cross-functional teams.
Shows adaptability and eagerness to embrace new technologies and methodologies.
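As a hedged illustration of the slow-node detection this role centers on (the function name and threshold below are illustrative, not OpenAI's actual tooling), a minimal sketch might compare each node's per-step time against a robust fleet baseline:

```python
from statistics import median

def flag_slow_nodes(step_times, threshold=3.0):
    """Flag nodes whose per-step time deviates from the fleet median.

    step_times: dict mapping node name -> seconds per training step.
    A node is flagged when its deviation from the median exceeds
    `threshold` times the median absolute deviation (MAD), a robust
    alternative to mean/stddev that tolerates a few extreme outliers.
    """
    times = list(step_times.values())
    med = median(times)
    mad = median(abs(t - med) for t in times) or 1e-9  # guard against zero MAD
    return sorted(
        node for node, t in step_times.items()
        if (t - med) / mad > threshold
    )
```

A production system would stream these statistics continuously and correlate flagged nodes with tracing data before taking action such as cordoning or restarting them.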
Full-time|$170K/yr - $210K/yr|On-site|South San Francisco, California, USA
Software Engineer, Delivery Network Platform
Join Zipline, where we are revolutionizing logistics with an autonomous delivery network. As part of the Delivery Network Platform team, you will develop the foundational systems that enable aircraft, sites, and infrastructure to operate seamlessly in live delivery scenarios. Your work will involve creating software solutions that provide operators with real-time insights and control, designing orchestration systems that manage fleet movements, and developing validation platforms to ensure the network's reliability as it scales.
Your Responsibilities
You will be responsible for software systems that are pivotal to fleet operations, including:
Network Operating Center software for real-time visibility and interventions across aircraft, sites, missions, weather, and demand.
Fleet orchestration systems for assignment, routing, scheduling, and rebalancing tasks.
Maintenance and asset health systems linking issue detection to service readiness.
Simulation and validation platforms to assess topology, load, and policy changes prior to production.
Platform interfaces and configurable control planes that empower other teams to safely extend the network.
Tackling Complex Challenges
Unlike typical software roles focused on digital experiences, this position plays a critical role in managing a live autonomous logistics network. You'll address challenges such as:
Maintaining an accurate real-time view of aircraft and essential site assets across the network.
Ensuring the network remains operational amidst shifting demand, changing weather conditions, infrastructure issues, or capacity constraints.
Creating user-friendly operator control interfaces that facilitate quick and accurate decision-making under pressure.
Simulating potential future network behaviors to mitigate risks before they impact production.
These systems directly affect operational performance. You will own significant components of the platform, make critical technical and product decisions, and have a substantial impact on the network's effectiveness.
Team Dynamics
Our team operates with a strong emphasis on ownership, trust, and high technical standards. Engineers are expected to identify significant problems, develop a clear vision for system functionality, and drive solutions from conception to production. Additionally, we encourage engineers to leverage AI tools to enhance exploration, implementation, and debugging processes while upholding strong engineering principles, judgment, and accountability.
About Granica
Granica is an innovative AI research and infrastructure firm dedicated to creating reliable and steerable representations of enterprise data.
We build trust through our product Crunch, a policy-driven health layer that ensures large tabular datasets remain efficient, reliable, and reversible. On this solid foundation, we are developing Large Tabular Models—systems designed to learn cross-column and relational structures in order to provide trustworthy answers and automation with inherent provenance and governance.
Our Mission
AI is currently hampered not only by the design of models but also by the inefficiencies of the data that supports them. Every redundant byte, poorly organized dataset, and inefficient data pathway contributes to significant costs, latency, and energy waste as we scale.
Granica aims to eliminate these inefficiencies. We merge cutting-edge research in information theory, probabilistic modeling, and distributed systems to craft self-optimizing data infrastructure: systems that consistently enhance the representation and utilization of information by AI.
Our engineering team collaborates closely with the Granica Research group led by Prof. Andrea Montanari of Stanford University, bridging advancements in information theory and learning efficiency with large-scale distributed systems. Together, we firmly believe that the next major advancement in AI will stem from breakthroughs in efficient systems rather than merely larger models.
Your Contributions
Global Metadata Substrate: Design a transactional metadata substrate that facilitates time-travel, schema evolution, and atomic consistency across massive petabyte-scale tabular datasets.
Adaptive Engines: Develop systems that autonomously reorganize data, learning from access patterns and workloads to maintain peak efficiency without the need for manual tuning.
Intelligent Data Layouts: Optimize bit-level organization (including encoding, compression, and layout) to maximize signal extraction per byte read.
Autonomous Compute Pipelines: Create distributed compute systems that scale predictably, adapt to dynamic loads, and ensure reliability under failure conditions.
Research to Production: Apply new algorithms in compression, representation, and optimization that emerge from ongoing research. We encourage opportunities to publish and open-source your work.
Latency as Intelligence: Design systems that treat low latency as a core measure of intelligence.
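To make the "Adaptive Engines" idea above concrete, here is a deliberately tiny sketch (assumed names and logic, not Granica's implementation) of reorganizing a layout from observed access patterns: frequently read columns are moved to the front so common scans touch less data.

```python
from collections import Counter

def adaptive_column_order(access_log, columns):
    """Reorder columns so the most frequently read ones come first.

    access_log: iterable of column names, one entry per observed read.
    This is a hypothetical illustration only: a real adaptive engine
    would also weigh co-access patterns, scan vs. point-lookup mix,
    and the cost of physically rewriting the data.
    """
    counts = Counter(access_log)
    # Stable sort: hot columns first; original order breaks ties.
    return sorted(columns, key=lambda c: (-counts[c], columns.index(c)))
```

The same feedback loop generalizes to encoding and compression choices: observe the workload, score candidate layouts, and rewrite only when the expected read savings exceed the reorganization cost.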
At Exa, we are on a mission to create a cutting-edge search engine from the ground up, tailored specifically for AI applications. Our team is dedicated to developing large-scale infrastructure that efficiently crawls the internet, trains advanced embedding models for indexing, and constructs high-performance vector databases in Rust for optimized searching. We also manage a state-of-the-art $5M H200 GPU cluster that activates thousands of machines simultaneously.
As a Software Engineer specializing in Distributed Data Systems, you will be responsible for designing and implementing the data infrastructure that drives our operations—from crawling billions of web pages to training sophisticated embedding models and delivering real-time search functionalities. You will enjoy significant autonomy in creating systems capable of scaling to hundreds of petabytes. This is your opportunity to work on data pipelines at an unprecedented scale.
At Genmo, we are at the forefront of advancing artificial intelligence through innovative research in video generation. Our mission is to construct open, cutting-edge models that will ultimately contribute to the realization of Artificial General Intelligence (AGI). As part of our dynamic team, you will play a pivotal role in redefining the future of AI and expanding the horizons of video creation.
We are looking for a skilled GPU Performance Engineer who can extract maximum performance from our H100 infrastructure and fine-tune our model serving stack to achieve unparalleled efficiency. If you are passionate about optimizing performance, particularly at the microsecond level, and thrive on pushing hardware to its limits, this is the perfect opportunity for you.
Key Responsibilities
Utilize advanced profiling tools such as Nsight Systems and nvprof to analyze and enhance GPU workloads.
Develop high-performance CUDA and Triton kernels to optimize essential model functions.
Reduce cold start latency from seconds to mere milliseconds in our serving infrastructure.
Optimize memory access patterns, implement kernel fusion, and maximize GPU utilization.
Collaborate closely with machine learning engineers to optimize model implementations.
Diagnose and resolve performance issues throughout the application and hardware stack.
Implement custom memory pooling and allocation strategies to enhance performance.
Promote performance optimization techniques and foster a culture of excellence across teams.
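As a hedged sketch of the "custom memory pooling" responsibility above (a pure-Python illustration under assumed names; a real implementation would manage CUDA device memory and multiple size classes), the core idea is to recycle buffers instead of paying allocation cost on every request:

```python
class MemoryPool:
    """Toy memory pool: recycle fixed-size buffers instead of reallocating.

    Hypothetical sketch of the principle behind custom allocators used to
    cut allocation overhead on hot serving paths.
    """

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self._free = []          # buffers returned by callers, ready for reuse
        self.allocations = 0     # count of fresh allocations actually performed

    def acquire(self):
        # Prefer a recycled buffer; allocate only when the pool is empty.
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free.append(buf)
```

Reusing a released buffer avoids a fresh allocation on the hot path, which is the same principle PyTorch applies with its caching allocator for GPU memory.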
Role overview
Meter Inc. is developing tools to capture and preserve the expertise of network engineers. The team's goal is to build systems that document how experts diagnose network issues, making it possible for future models to manage networks with less manual effort. This work will help Meter support many customer networks while reducing the need for direct engineer intervention.
What makes this work unique
Network engineering lacks the structured archives found in software development. While Git and GitHub record software decisions, the reasoning behind network troubleshooting often disappears once a problem is fixed. This role centers on building a structured, searchable system for network operations, a kind of GitHub for network engineering. The system will capture network state, expert observations, and the logic behind key decisions.
Your first 90 days
First 30 days: Meet with network engineers to learn their workflows. Study what effective diagnostic documentation looks like and identify the necessary data. Review telemetry (ClickHouse), configurations (Postgres), and support history (Salesforce).
By 60 days: Deliver a working annotation interface. Network engineers should be able to review past support tickets, view the network's state during incidents, and record their reasoning. The tool should be practical and encourage regular use.
By 90 days: Network engineers will be able to create training data independently. Initial model benchmarks from your pipeline will be live, showing how your work improves the process.
Technical stack
TypeScript, React, Go, GraphQL, Kafka, Postgres
Collaboration
This role works closely with Meter's co-founder and CEO, who will help guide the product roadmap and set priorities.
Location
This position is based in San Francisco.
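A minimal sketch of what one captured troubleshooting record might look like (the field names and values are hypothetical, not Meter's schema) pairs the network state during an incident with the engineer's observations and the reasoning behind the fix:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DiagnosticAnnotation:
    """One expert-reviewed troubleshooting record (hypothetical schema).

    The idea is to preserve not just what was changed but why: the
    snapshot grounds the record, the reasoning is what a future model
    would learn from.
    """
    ticket_id: str
    network_snapshot: dict       # telemetry/config state at incident time
    observations: list = field(default_factory=list)
    reasoning: str = ""
    resolution: str = ""

# Example record with invented values, for illustration only.
record = DiagnosticAnnotation(
    ticket_id="T-1042",
    network_snapshot={"uplink_loss_pct": 12.5},
    observations=["High packet loss only on the WAN uplink"],
    reasoning="Loss isolated to one uplink suggests a carrier-side issue.",
    resolution="Failed over to secondary uplink; opened a carrier ticket.",
)
```

Serializing such records (e.g. via `asdict`) makes them searchable and usable as supervised training examples downstream.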
At Sciforium, we are at the forefront of AI infrastructure, innovating next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration from AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.
About the Role
We are on the lookout for a talented GPU Kernel Engineer who is eager to explore and maximize performance on modern accelerators. In this role, you will be responsible for designing and optimizing custom GPU kernels that drive our advanced large-scale AI systems. You will navigate the hardware-software stack, engaging in low-level kernel development and integrating optimized operations into high-level machine learning frameworks for large-scale training and inference.
This position is perfect for someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and aims to contribute significantly to the efficiency and scalability of our machine learning platform.
Key Responsibilities
Develop, implement, and enhance custom GPU kernels utilizing C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.
Profile and fine-tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.
Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.
Create performance models, pinpoint bottlenecks, and deliver kernel-level enhancements that significantly boost AI workloads.
Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.
Engage closely with hardware vendors (NVIDIA/AMD) and stay updated on the latest GPU architecture and compiler/toolchain advancements.
Contribute to the development of tools, documentation, benchmarking suites, and testing frameworks ensuring correctness and performance reproducibility.
Must-Haves
5+ years of industry or research experience in GPU kernel development or high-performance computing.
Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.
Strong programming proficiency in C++ and Python, and familiarity with machine learning frameworks.
Join Cloudflare as a Distributed Systems Engineer focusing on our Data Platform, where you will play a pivotal role in developing analytics and alert systems that enhance our services. You will collaborate with a talented team to design scalable and efficient systems to manage and analyze vast amounts of data. Your work will directly impact the performance and reliability of our offerings, ensuring our customers have the best possible experience.
Full-time|$180K/yr - $200K/yr|Remote|New York, New York, United States; Remote; San Francisco, California, United States; Seattle, Washington, United States
About Us
Lightning AI, the innovative force behind PyTorch Lightning, has been revolutionizing the AI landscape since 2019. We provide an all-encompassing platform designed to streamline the development, training, and deployment of AI systems, facilitating the transition from research to production effortlessly.
Following our merger with Voltage Park, a cutting-edge neocloud and AI Factory, we unite developer-centric software with cost-effective, large-scale computing solutions. Our tools are tailored for experimentation, training, and production inference, incorporating built-in security, observability, and control.
We cater to various clients, from individual researchers to startups and large enterprises, operating globally with offices in key cities including New York, San Francisco, Seattle, and London. We're proud to be backed by prestigious investors like Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
Our Core Values
Move Fast: We prioritize speed and accuracy, breaking down complex challenges into manageable tasks.
Focus: We aim to achieve one goal at a time, working collaboratively to deliver precise features.
Balance: We believe sustained performance comes from adequate rest and recovery, ensuring a healthy work-life balance.
Craftsmanship: We strive for excellence in every detail, taking pride in our work and its impact.
Minimal: We embrace simplicity to drive innovation, eliminating unnecessary complexity and focusing on what truly matters.
Role Overview
We are on the lookout for a GPU & Compute Infrastructure Engineer to become a vital member of our Infrastructure Engineering team. In this pivotal role, you will manage image systems, diagnostics, and validation across expansive bare-metal computing infrastructure, particularly for GPU-optimized systems. You will work at the crossroads of hardware, systems, and software, developing automation, enhancing reliability, and facilitating efficient cluster setups for AI/ML and HPC workloads.
Your responsibilities will include overseeing our image pipeline, running validation environments and test clusters, and supporting GPU hardware qualification. This role is essential for maintaining the integrity of our infrastructure, ensuring consistency, performance, and reliability.
ABOUT BASETEN
At Baseten, we empower the world's leading AI firms—such as Cursor, Notion, and OpenEvidence—by delivering mission-critical inference solutions. Our unique blend of applied AI research, robust infrastructure, and user-friendly developer tools enables AI pioneers to effectively deploy groundbreaking models. With our recent achievement of a $300M Series E funding round supported by esteemed investors like BOND and IVP, we're on an exciting growth trajectory. Join our dynamic team and contribute to the platform that drives the next generation of AI products.
THE ROLE
We are looking for an experienced Senior GPU Kernel Engineer to join our innovative team at the forefront of AI acceleration. In this role, your programming expertise will directly enhance the performance of cutting-edge machine learning models. You'll be responsible for developing highly efficient GPU kernels that optimize computational processes, allowing for transformative AI applications.
You'll thrive in a fast-paced, intellectually challenging environment where your technical skills are pivotal. Your contributions will directly affect production systems that serve millions of users across various platforms. This position offers exceptional opportunities for career advancement for engineers enthusiastic about low-level optimization and impactful systems engineering.
EXAMPLE INITIATIVES
As part of our Model Performance team, you will engage in projects like:
Baseten Embeddings Inference: the quickest embeddings solution available
The Baseten Inference Stack
Enhancing model performance optimization
RESPONSIBILITIES
Core Engineering Responsibilities
Design and develop high-performance GPU kernels for essential machine learning operations, including matrix multiplications and attention mechanisms.
Collaborate with cross-functional teams to drive performance improvements and implement optimizations.
Debug and refine kernel code to achieve maximal efficiency and reliability.
Stay abreast of the latest advancements in GPU technology and machine learning frameworks.
About Our Team
Join the innovative Sora team at OpenAI, where we are at the forefront of developing multimodal capabilities for our foundation models. Our hybrid research and product team is dedicated to seamlessly integrating multimodal functionalities into our AI solutions, ensuring they are dependable, user-centric, and aligned with our vision of benefiting society at large.
Role Overview
As a Machine Learning Engineer specializing in Distributed Data Systems, you will be instrumental in designing and scaling the infrastructure that facilitates large-scale multimodal training and evaluation at OpenAI. Your role will involve managing complex distributed data pipelines, collaborating closely with researchers to convert their requirements into robust, production-ready systems, and enhancing pipelines that are essential for Sora's rapid iteration cycles.
We are seeking detail-oriented engineers with extensive experience in distributed systems who thrive in high-stakes environments and excel in building resilient infrastructure.
This position is located in San Francisco, CA, and follows a hybrid work model, requiring three days in the office each week. We also provide relocation assistance for new team members.
Key Responsibilities:
Design, implement, and maintain data infrastructure systems, including distributed computing, data orchestration, distributed storage, streaming infrastructure, and machine learning systems, with a focus on scalability, reliability, and security.
Ensure our data platform can scale exponentially while maintaining high reliability and efficiency.
Collaborate with researchers to gain a deep understanding of their requirements, translating them into production-ready systems.
Strengthen, optimize, and manage critical data infrastructure systems that support multimodal training and evaluation.
You Will Excel in This Role If You:
Possess strong experience with distributed systems and large-scale infrastructure, coupled with a keen interest in data.
Exhibit meticulous attention to detail and a commitment to building and maintaining reliable systems.
Demonstrate solid software engineering fundamentals and effective organizational skills.
Thrive in environments characterized by ambiguity and rapid change.
About OpenAI
OpenAI is a trailblazing AI research and deployment organization committed to ensuring that general-purpose artificial intelligence serves humanity. We continuously push the boundaries of AI capabilities and strive to create technology that benefits everyone.
At sfcompute, we are pioneering a transformative approach to GPU cluster financing, enabling the largest infrastructure build-out in history while effectively mitigating risk.
In the ever-evolving landscape of GPU technology, securing financing for GPU clusters and the essential infrastructure they require involves inherent risks. Historically, that financing has depended on developers leasing clusters through fixed-price, long-term contracts, which offloads risk onto the customer in exchange for financial stability.
As AI and computational demands grow, our mission is to democratize access to powerful computing resources. We aim to create a liquid market for GPU offtake, allowing startups and smaller enterprises to thrive without the burden of long-term contracts that aren't feasible for them.
Role Overview
Join our dynamic infrastructure team, responsible for architecting and deploying cutting-edge GPU clusters globally. You'll play a crucial role in maintaining operational excellence, engaging in on-call rotations, and driving automation to facilitate large-scale deployments. As a key member of our small but ambitious team, you will help shape our culture, mentor junior engineers, and learn directly from our customers.
Join Cloudflare as a Distributed Systems Engineer and help us build and maintain our innovative Data Platform. In this role, you'll work on our Analytical Database Platform, focusing on enhancing data processing and storage technologies to support our global client base. If you are passionate about distributed systems and enjoy solving complex problems, this is the perfect opportunity for you!
About Our Team
Join OpenAI as we seek talented software engineers to enhance the productivity of our networking teams. These teams are responsible for designing and managing high-performance networking systems that underpin OpenAI's training and inference infrastructure at the cutting edge of technology.
About This Role
We are looking for a dedicated individual who is passionate about improving the developer experience for engineers working on intricate infrastructure systems, specifically focusing on build systems, testing architecture, release pipelines, and efficient development workflows.
This role is integral to OpenAI's networking team, aimed at streamlining the processes for engineers to build, test, validate, and deploy changes in multi-server, networked, and hardware-adjacent environments.
Key Responsibilities:
Enhance development workflows for engineers tasked with building and operating OpenAI's networking systems.
Design and refine continuous deployment, release, and validation pipelines.
Develop and sustain test harnesses for multi-server, networked, and hardware-backed environments.
Accelerate iteration speed across codebases in C++, Python, and build-system-centric environments.
Collaborate with engineers to uncover and resolve friction points in CI, testing, debugging, and deployment workflows.
Lead the testing and reliability strategy for infrastructure components that support extensive training and inference workloads.
Work closely with centralized developer experience teams while remaining deeply integrated with networking engineers who are closest to the systems.
About Our Team
At OpenAI, our Storage Infrastructure team is at the forefront of enabling data accessibility, placement, and lifecycle management through advanced APIs. We prioritize scalability, reliability, security, and usability to meet the demands of our pioneering AI research.
Role Overview
We are seeking a talented Software Engineer to join our Storage Infrastructure team, where you will architect and maintain exascale systems designed to efficiently and reliably manage research data across multiple regions.
The ideal candidate will have extensive experience in distributed systems, particularly in developing exascale data management solutions or distributed filesystems.
Your Responsibilities
Design and develop software solutions to manage exascale data, ensuring accessibility for researchers.
Enhance the reliability, predictability, and cost efficiency of our storage systems.
Collaborate with researchers to understand and address diverse data use cases.
Implement robust security measures to protect our critical datasets.
Ideal Candidate Profile
Strong foundation in distributed systems principles with a proven ability to design and implement scalable, reliable, and secure storage architectures.
Proficiency in programming languages relevant to storage systems development.
Experience with cloud platforms, particularly Azure.
Familiarity with AI/ML data access patterns.
A proactive approach and adaptability in a fast-paced, dynamic environment.
About OpenAI
OpenAI is a cutting-edge AI research and deployment organization committed to ensuring that general-purpose artificial intelligence benefits all of humanity. We strive to push the boundaries of AI capabilities while ensuring safety and human-centric development. We embrace diverse perspectives, voices, and experiences that reflect the full spectrum of humanity.
We are proud to be an equal opportunity employer, committed to fostering an inclusive workplace where all individuals are respected and valued.
Join Cloudflare as a Distributed Systems Engineer specializing in our Data Platform. In this role, you will be at the forefront of building and optimizing systems that enhance data delivery, database management, and retrieval processes. Collaborate with cross-functional teams to innovate and improve our platform, ensuring seamless data access and performance.
This position offers a unique opportunity to work in a dynamic environment, leveraging cutting-edge technologies to impact the way data is processed and utilized across our platform.