Site Reliability Engineer Sre At Mithril San Francisco jobs in San Francisco – Browse 11,331 openings on RoboApply Jobs

Open roles matching “Site Reliability Engineer Sre At Mithril San Francisco” with location signals for San Francisco. 11,331 active listings on RoboApply Jobs.

1 - 20 of 11,331 Jobs
Mithril
Full-time|$170K/yr - $230K/yr|On-site|Palo Alto / San Francisco Bay Area

Mithril develops AI infrastructure aimed at making GPU computing more accessible and affordable for enterprises, AI startups, and researchers. Clients include LG AI Research, Saronic, and the Broad Institute. The company was founded by a former Google DeepMind research scientist and a Stanford CS PhD. Mithril has secured $80M in seed and Series A funding fro…

Apr 22, 2026
Mithril
Full-time|$170K/yr - $230K/yr|On-site|San Francisco Bay Area

Mithril is actively seeking skilled professionals for the position of Senior Site Reliability Engineer in our Supply division. We are open to candidates ranging from Senior to Principal level, with specific leveling contingent upon individual experience and proven expertise. Our ideal candidate will possess profound technical insight, strategic acumen, and a consistent record of delivering impactful results. We customize roles to align with each candidate’s distinct strengths and aspirations.
About Mithril
At Mithril, we are revolutionizing the way AI enterprises access computational power. Our mission is to orchestrate global compute capacity, simplifying usage and optimizing it for AI workloads. We are developing a pioneering public cloud specifically designed for AI applications, making high-performance compute as accessible and reliable as flipping a switch.
Founded just over a year ago from a Stanford PhD lab, Mithril has already attracted the support of premier investors, including Sequoia, Lightspeed, Jeff Dean, and Eric Schmidt, amassing $80 million in funding. We have already secured a strong customer base and are generating revenue.
Current machine learning infrastructure is excessively complicated, often forcing engineers to grapple with hardware and capacity challenges when their focus should be on innovation and addressing significant problems. Our Omnicloud Platform simplifies this by abstracting hardware management and offering seamless access to computing resources, enabling engineers to concentrate on developing transformative AI solutions.
We are not merely making compute accessible; we are ensuring it is versatile, scalable, and tailored to meet the unique needs of advanced AI. Our infrastructure marketplace combines state-of-the-art hardware with Mithril’s Omnicloud Platform and software capabilities, empowering AI companies to innovate at an accelerated pace.
Working at Mithril
As a Senior Site Reliability Engineer at Mithril, you will play a crucial role in building the infrastructure that drives the future of AI. Your contributions will be vital not only for scaling our systems but also for ensuring their reliability and security at every level.
You will be instrumental in developing and maintaining solutions that harness global compute resources, making them available to clients within the evolving AI/ML ecosystem, from high-performance clusters to foundational models, fine-tuning, inferencing, and agentic application workflows. You will tackle world-class technical challenges that enable cutting-edge AI workloads to utilize humanity's collective knowledge through planet-scale computing, helping our clients leverage AI's potential to enhance the world.
If you are inspired by the challenge of scaling and securing systems while driving innovation in AI, we encourage you to apply.

Nov 20, 2025
Sierra
Full-time|On-site|San Francisco, CA

About Us
At Sierra, we are pioneering a transformative platform that empowers businesses to forge authentic customer experiences through AI technology. Headquartered in the vibrant city of San Francisco, we also boast a dynamic presence in Atlanta, New York, London, France, Singapore, and Japan.
Our operations are anchored in core values that shape our culture: Trust, Customer Obsession, Craftsmanship, Intensity, and Family. These principles guide our actions and are integral to our mission.
Our visionary founders, Bret Taylor and Clay Bavor, bring unparalleled expertise. Bret, currently the Board Chair of OpenAI, previously co-led Salesforce and served as CTO at Facebook, while Clay led numerous initiatives at Google, including AR/VR projects and Google Workspace.
Your Role
In your capacity as a Software Engineer on the Site Reliability team, you will play a crucial role in establishing and enhancing the reliability, observability, and scalability of Sierra’s AI-centric infrastructure. Collaborating closely with our engineering and product teams, your goal is to ensure our systems remain highly available, efficient, and primed for growth.
Lead the development of Sierra’s observability stack (monitoring, alerting, logging, and tracing) to provide engineers with critical insights into system health and performance.
Collaborate with product and platform engineers to architect systems that prioritize reliability and scalability from the outset, not as an afterthought.
Design and implement robust, scalable, and secure cloud infrastructure on AWS, employing Terraform and cutting-edge DevOps tools.
Enhance the reliability and scalability of our LLM deployments, ensuring they operate efficiently and cost-effectively.
Drive improvements in deployment pipelines, CI/CD tooling, and incident management processes to minimize downtime and accelerate response times.
Define and cultivate SRE practices within Sierra, shaping culture, tooling, and best practices across the engineering organization.
Qualifications
Bachelor's degree in Computer Science or a related field, or equivalent experience.
Proven experience in Site Reliability Engineering or a similar role, with a strong understanding of cloud infrastructure (AWS).
Proficiency in Terraform and modern DevOps practices.
Experience with observability tools and techniques: monitoring, alerting, logging, and tracing.
Strong problem-solving skills with a focus on scalability and performance optimization.
Excellent collaboration and communication skills, with the ability to work effectively in a team environment.

Oct 21, 2025
Mercor
Full-time|On-site|San Francisco

Join the Mercor Team
At Mercor, we stand at the dynamic intersection of labor markets and AI research. Collaborating with premier AI labs and enterprises, we empower the human intelligence that is crucial for AI's evolution.
Our expansive talent network plays a vital role in training cutting-edge AI models, akin to the way educators impart knowledge to their students, by sharing insights, experiences, and contextual understanding that code alone cannot convey. Currently, our network of over 30,000 experts generates more than $2 million daily.
We are pioneering a novel category of work where expertise fuels AI progress. Achieving this vision necessitates an ambitious, fast-paced, and deeply dedicated team. You will collaborate with researchers, operators, and AI firms that are at the forefront of transforming societal structures.
Mercor is a thriving Series C company with a valuation of $10 billion. We operate five days a week in-person at our new headquarters in San Francisco.
About the Role
As a Site Reliability Engineer (SRE) at Mercor, you will take ownership of production reliability for our critical systems, working closely with our infrastructure leadership. You will play a pivotal role in establishing our SRE function and defining how Mercor manages large-scale, high-availability systems.
Your Responsibilities
Ensure the reliability and safety of production for key shared services and customer-facing systems.
Collaborate directly with infrastructure leadership to outline SRE priorities, reliability benchmarks, and the production safety roadmap.
Enhance the structure of our production systems to ensure stability, resource efficiency, isolation, and observability.
Advocate for and implement modern SRE methodologies (e.g., incident management, postmortems, SLIs/SLOs) across engineering teams.
Work alongside engineering and applied AI teams to facilitate sustainable growth.
Promote SRE best practices internally, supporting teams in a safe, scalable, and consistent production onboarding process.
Who We Seek
The ideal candidate will have:
Extensive experience in genuine SRE roles (not merely operations) across various positions or organizations.
A deep understanding of SRE methodologies popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; ideally, 15+ years in total experience for this inaugural SRE position.
A proven track record of managing systems at scale, with a strong grasp of the complexities involved.

Dec 27, 2025
Superhuman, Inc.
Full-time|$214K/yr - $260K/yr|Hybrid|Hub - San Francisco

At Superhuman, we embrace a vibrant hybrid work model that offers our team members the ideal blend of focused individual work and collaborative in-person interactions, fostering trust, innovation, and a robust team culture.
About Superhuman
Superhuman, the AI productivity platform, is on a transformative mission to unlock the superhuman potential within everyone. With the integration of Grammarly's writing assistance and innovative tools like Coda’s collaborative workspaces and Go, our proactive AI assistant, we empower over 40 million individuals and 50,000 organizations globally. Founded in 2009, we strive to eliminate busywork and enhance productivity. Discover more at superhuman.com and explore our values here.
The Opportunity
To meet our ambitious goals, we are seeking a Site Reliability Engineer (SRE) to join our infrastructure team. This pivotal role focuses on developing software solutions to maintain the reliability of our back-end systems while collaborating with engineering teams to strategize our future growth. You will also engage with our production engineering teams in Europe as we transition from a “you build it, you own it” approach.
At Superhuman, our engineers and researchers enjoy the autonomy to innovate and drive breakthroughs, directly impacting our product roadmap. As we rapidly scale our interfaces, algorithms, and infrastructure, the complexity of our technical challenges is growing. Learn more about our technical endeavors on our technical blog.
As an SRE, your responsibilities will include:
Scaling our Kubernetes-based control plane that processes billions of events each day.
Enhancing our automation mechanisms to efficiently respond to workload demands.
Deploying machine learning systems across various departments.

Jun 18, 2025
prosper
Full-time|On-site|San Francisco, CA

Role overview
The Senior Site Reliability Engineer at prosper plays a key role in maintaining and improving the reliability and performance of the company’s core systems. Collaboration with teams across the organization is essential to ensure services remain stable and efficient.
What you will do
Design and set up monitoring tools to track the health and performance of systems.
Automate routine operational tasks to minimize manual intervention and boost efficiency.
Diagnose and resolve complex technical problems that impact infrastructure or services.
Support projects aimed at strengthening infrastructure stability and preparing for future growth.
Location
This role is located in San Francisco, CA.

Apr 27, 2026
EngFlow
Full-time|On-site|San Francisco

Join Our Team at EngFlow
EngFlow is revolutionizing the software development process by enabling developers to save valuable time in their build and test cycles. Our innovative cloud-based distributed service optimizes workflows through advanced remote execution and caching, significantly enhancing efficiency, productivity, and product quality.
Supported by esteemed investors, EngFlow is at the forefront of transforming how organizations develop software and deliver thoroughly tested products. Our solutions can accelerate builds by tenfold or more, and our observability platform provides crucial insights for ongoing optimization. Founded by leading contributors to Bazel, we create tools that empower engineering teams, from startups to Fortune 500 companies, to boost developer velocity and build performance.
Discover more about our mission, culture, and team: EngFlow | Watch Our Video
We are seeking a talented and experienced Site Reliability Engineer to join our dynamic engineering team. In this pivotal role, you will bridge the gap between software engineering and systems operations, ensuring our distributed infrastructure is highly available, performant, and scalable, thereby allowing our engineers to work swiftly and with confidence.

Jan 27, 2026
Carta
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA

Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.

Apr 3, 2026
Latent
Full-time|On-site|San Francisco

Site Reliability Engineer
Location: San Francisco, CA (5 Days In-Office)
As a Site Reliability Engineer at Latent, you will be the backbone of our infrastructure, ensuring the exceptional stability and performance of our cutting-edge clinical AI platform that serves major health systems. Your role is pivotal in enhancing operational excellence, directly impacting patient access to critical treatments.
What Makes a Great Engineer at Latent
We seek individuals who are not just technically skilled but also passionate about ownership and high standards. You will thrive in our dynamic, in-office culture where teamwork and a winning mentality are key.
Tool Proficiency: You are highly adept with your tools, fluent in command line operations, and skilled in keyboard shortcuts.
Ownership: You take pride in managing complex systems and have a successful history of scaling mission-critical deployments.
Automation Drive: You have a passion for automation, consistently seeking innovative methods to enhance efficiency and establish operational excellence.
Problem Solver: You proactively address challenges, stepping in to resolve issues without waiting for others.
Your Responsibilities
As our SRE, you will take full ownership of the production environment and enhance the developer experience:
Infrastructure Ownership: Design, implement, and maintain a robust production environment, having experience with over 500 machine deployments.
Kubernetes Mastery: Utilize your expertise in Kubernetes and Helm to manage our containerized infrastructure, ensuring optimal deployment, scalability, and operational health.
CI/CD & Deployment Optimization: Streamline the deployment pipelines for TypeScript and Python/ML, supporting rapid feature releases while upholding top-notch reliability.
DevX Support: Enhance developer workflows by supporting Developer Experience (DevX) initiatives to improve tool proficiency and CI/CD systems.
Infrastructure as Code (IaC): Manage infrastructure definitions using Terraform.

Dec 5, 2025
Unify
Full-time|On-site|San Francisco Office

About Unify
At Unify, we're pioneering the first AI-driven system of action for revenue teams. Our innovative approach empowers companies to transform their outbound strategies into a leading growth engine, ensuring that go-to-market execution is observable, repeatable, and scalable. Established in 2023 by visionaries from Ramp and Scale AI, our diverse team boasts experience from industry giants such as Airbnb, Meta, Waymo, and Perplexity.
Having achieved an impressive 8x revenue growth in 2024, we proudly serve esteemed clients including Perplexity, Cursor, SoFi, and Justworks. With a dynamic team that has successfully raised $58M from prominent investors like Thrive, Emergence, and OpenAI, we are at the forefront of revolutionizing the future of GTM. Come and be a part of this exciting journey!
About the Role
As a Senior Site Reliability Engineer (SRE) at Unify, you will play a pivotal role in addressing the challenges of scaling and maintaining reliability as we handle immense data volumes and support enterprise clients with stringent uptime standards. Your expertise will span the entire tech stack: optimizing databases, fortifying services, and crafting automation and observability tools to ensure Unify remains fast and dependable at scale.

Jan 5, 2026
Air Apps
Full-time|On-site|San Francisco

Join Our Team at Air Apps
At Air Apps, we are driven by innovation and speed. Founded by a family in 2018 in Lisbon, Portugal, we are on a quest to revolutionize how individuals and entrepreneurs manage their resources through the world’s first AI-powered Personal & Entrepreneurial Resource Planner (PRP). With over 100 million downloads globally, our self-funded journey now spans across offices in Lisbon and San Francisco.
We constantly challenge conventional norms, leveraging AI to develop solutions that genuinely impact lives. As part of our team, you will be a critical player in shaping impactful products that empower users around the world.
Join us as we redefine resource management and make a difference in people’s lives.
Your Role as a Site Reliability Engineer (SRE)
As a Site Reliability Engineer at Air Apps, you will play a pivotal role in maintaining the reliability, availability, and scalability of our systems. Your work will bridge software development and operations by implementing automation, monitoring solutions, and performance optimization strategies to minimize downtime and enhance system resilience.

Mar 27, 2025
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

Thinking Machines Lab brings together scientists, engineers, and innovators who have shaped well-known AI products like ChatGPT and Character.ai, as well as open-weight models such as Mistral. The team also contributes to open-source projects including PyTorch, OpenAI Gym, Fairseq, and Segment Anything. The company’s mission centers on advancing collaborative general intelligence, aiming to make AI accessible and adaptable to individual needs. Tinker, the company’s fine-tuning API, enables researchers and developers to customize advanced AI models using their own data and algorithms. Thinking Machines manages the infrastructure, giving users the flexibility to train open-weight models while focusing on their unique requirements. As Tinker expands, the platform continues to evolve alongside its growing community.
Role overview
The Site Reliability Engineer will focus on improving the reliability and resilience of the Tinker platform. This role involves close collaboration with platform engineers and research teams to strengthen every layer of the system, from infrastructure to user-facing services.
What you will do
Define and take ownership of end-to-end reliability, including CI/CD workflows, production observability, and incident response processes.
Set Service Level Objectives for distributed training systems, balancing reliability, scheduling latency, and development speed.
Design and implement monitoring and observability across the training pipeline.
Manage incident response for Tinker, ensuring prompt recovery, thorough incident analysis, and systematic improvements to prevent recurrence.
Enhance multi-tenant isolation and resource scheduling to support LoRA-based workload co-scheduling, maintaining both reliability and data separation.
Collaborate with security teams to identify and address production vulnerabilities.
This position is based in San Francisco.

Apr 28, 2026
CodeRabbit
Full-time|On-site|San Francisco

About CodeRabbit
CodeRabbit is a pioneering research and development firm dedicated to creating highly efficient human-machine collaboration systems. Our mission is to develop the next generation of AI-driven code review tools, fostering a harmonious partnership between human creativity and advanced algorithms that far exceed the capabilities of individual engineers. By merging language models with human innovation, we aim to elevate the standards of efficiency and quality in software development.
The Role
We are in search of a talented Site Reliability Engineer (SRE) to become a vital part of our Platform Engineering team located in the Bay Area. In this role, you will play a crucial part in maintaining the high availability, performance, and scalability of CodeRabbit's AI-enhanced code review platform. This position lies at the nexus of software engineering and systems operations, where you will construct the foundational platforms and automation that empower our engineering teams to deploy, monitor, and scale our services with reliability.
As a Site Reliability Engineer at CodeRabbit, your responsibilities will include improving the reliability of our essential services that handle millions of code reviews, developing sophisticated automation platforms, and managing the infrastructure that drives our AI analysis engine. You will engage with cutting-edge technologies such as large language models, real-time processing systems, and distributed architectures that function at scale.
Key Responsibilities
Infrastructure & Platform Ownership
Design, implement, and maintain scalable infrastructure on Google Cloud Platform to accommodate CodeRabbit's expanding user base and processing needs.
Take ownership of and operate essential platform services.
Develop and manage Infrastructure as Code using Terraform to guarantee consistent, reproducible, and version-controlled infrastructure deployments.
Reliability & Performance Engineering
Establish and uphold SLI/SLO frameworks for all critical services, ensuring we fulfill our reliability commitments to users.
Implement comprehensive monitoring, alerting, and observability solutions utilizing Datadog and custom instrumentation.
Conduct in-depth incident response, root cause analysis, and post-mortem processes to continually enhance system reliability.
Optimize application and infrastructure performance to manage millions of pull request analyses with minimal latency.

Jan 9, 2026
Plaud Inc.
Full-time|On-site|San Francisco, CA

About Plaud Inc.
Plaud is revolutionizing the way professionals enhance productivity and performance with our trusted AI work companion. Our innovative note-taking solutions have gained the admiration of over 1,500,000 users globally since our inception in 2023. We are on a mission to amplify human intelligence by developing next-generation intelligence infrastructure and interfaces that seamlessly capture, extract, and leverage what you say, hear, see, and think.
Based in San Francisco, Plaud Inc. is a Delaware-incorporated company that is redefining the boundaries of human-AI collaboration through a unique combination of hardware and software solutions. We adhere to the highest standards of data security and privacy protection, with certifications including ISO 27001, ISO 27701, GDPR, SOC 2, HIPAA, and EN 18031 compliance.
Discover more about our innovative solutions by visiting https://www.plaud.ai and follow us on Instagram, X, Facebook, LinkedIn, and YouTube.
Why You Should Join Us
At Plaud, you will play a pivotal role in shaping the future of human-AI interaction. Here’s what we offer:
A thriving, bootstrapped company with a remarkable $250M revenue run rate achieved in just three years.
An opportunity to define the next-generation paradigm for human-AI interaction.
Direct exposure to cutting-edge AI tools for professionals and a chance to contribute to our global expansion.
Collaborate with a passionate team that values innovation, teamwork, and customer success.
Advance your career in a culture that promotes continuous learning and rapid career growth.

Feb 24, 2026
Okta, Inc.
Full-time|$162K/yr - $249K/yr|On-site|San Francisco, California

Okta is seeking a Staff Site Reliability Engineer to join the Infrastructure Platform AGILE SRE team in San Francisco. This position centers on supporting and improving the systems that underpin Okta’s identity infrastructure.
Role overview
The Staff SRE will work closely with multiple teams to develop and maintain critical infrastructure. A core part of this role involves enhancing internal tools and operational processes, ensuring that Okta’s systems remain secure and reliable as the company grows.
What you will do
Provide cross-functional support to teams building and maintaining key infrastructure components.
Collaborate with Infrastructure Operations groups to address complex technical challenges.
Diagnose, troubleshoot, and resolve sophisticated infrastructure issues by developing new tools and strategic solutions.
Who we’re looking for
Experienced SREs who are comfortable working on large-scale, impactful projects.
Engineers who enjoy collaborating across teams and disciplines.
Problem-solvers who can tackle intricate technical challenges and deliver reliable solutions.
This role offers the chance to contribute directly to Okta’s mission of building secure, trusted infrastructure for organizations navigating the evolving landscape of AI and identity.

Apr 27, 2026
Rox Data Corp
Full-time|On-site|San Francisco

About Us
At Rox, we are dedicated to empowering individuals to achieve their greatest potential. Our innovative platform enhances sales efforts through autonomous revenue agents, allowing sellers to prioritize their expertise in selling. Just as coding agents transformed engineering, revenue agents amplify customer interactions.
We are revolutionizing the revenue stack by developing the world’s first revenue operating system, encompassing everything from the application layer to systems of context. At Rox, we envision a future where humans evolve into orchestrators while agents handle the complete customer lifecycle.
Our solutions support Global 2000 leaders across sectors such as banking, construction, and AI, partnering with industry giants like Ramp and Cognition.
Our success stems from a united belief in our mission and an unwavering commitment to making it a reality.
The Team
Our world-class team is the backbone of our innovative approach to redefining business operations.
Our team members have:
Founded and successfully exited companies
Held top roles at Google, AWS, Confluent, and New Relic
Won gold medals in international mathematics competitions
Published groundbreaking research papers
We are proud to be backed by leading investors, having raised $50 million from Sequoia (Alfred Lin), General Catalyst (Hemant Taneja), Google Ventures, Elad Gil, and Chris Ré.
Core Principles
Taste: Craft beautiful experiences.
We meticulously focus on every detail, striving to ensure that each interaction not only helps sellers accomplish their tasks but also enhances their experience. We are relentless in our pursuit of excellence, always exploring new ways to delight our sellers.
Obsession: Commit unreasonably.
We are dedicated to our craft, responding to customer needs proactively and driving value even before they ask. Our commitment to continuous learning and self-improvement is unwavering.
Action: Get it done.
Execution is key; we prioritize thoughtful yet swift decision-making and immediate delivery. Trust is essential in our field, and we earn it through our actions.

Nov 4, 2025
Drata
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco

Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.

Apr 27, 2026
Baseten
Full-time|On-site|San Francisco Office

ABOUT BASETEN
Baseten is at the forefront of powering mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We integrate cutting-edge applied AI research with a flexible infrastructure and intuitive developer tools to empower companies at the leading edge of AI to deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by prominent investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our dynamic team and contribute to creating an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that ensure our infrastructure is scalable, reliable, and efficient. Your responsibilities will encompass everything from automating deployments and monitoring systems to enhancing performance and managing incidents effectively.
Collaboration is key; you will work closely with our users to understand their challenges in operationalizing machine learning, facilitating their onboarding onto our platform, and leveraging these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will engage in exciting projects such as:
Innovative multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Utilizing fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
Design and maintain scalable infrastructures to support the deployment and operational needs of machine learning models.
Establish standards and best practices to enhance reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.

Oct 9, 2025
Hyperbolic Labs
Full-time|On-site|San Francisco, CA

Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience. Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.

Mar 26, 2026
Superhuman
Full-time|$214K/yr - $260K/yr|Hybrid|San Francisco, CA

At Superhuman, we embrace a flexible hybrid working model that combines focused work time with in-person collaboration, fostering trust, innovation, and a vibrant team culture.
About Superhuman
Superhuman, now part of Grammarly, is an AI productivity platform dedicated to unlocking the superhuman potential in everyone. Our suite of applications integrates AI with over 1 million tools and websites, offering innovative solutions such as Grammarly's writing assistance, Coda's collaborative workspaces, Mail's inbox management, and Go, our proactive AI assistant. Since our inception in 2009, we have empowered over 40 million individuals and 50,000 organizations worldwide, enabling them to eliminate busywork and focus on what truly matters. Discover more at superhuman.com and explore our values here.
The Opportunity
In pursuit of our ambitious goals, we are seeking a Site Reliability Engineer to enhance our infrastructure team. This pivotal role involves building software that ensures the reliability of our back-end systems while collaborating closely with our engineering teams. You will also help plan for our future growth as we shift from a “you build it, you own it” model.
Our engineers and researchers enjoy the freedom to innovate and influence our product roadmap, tackling increasingly complex technical challenges as we scale our systems. Learn more about our technical endeavors on our technical blog.
As a Site Reliability Engineer, your responsibilities will include:
Scaling our Kubernetes-based control plane, processing billions of events daily.
Enhancing our automation mechanisms in response to workload demands.
Deploying machine learning systems across the organization.

Mar 18, 2026
