Experience Level
Senior
Qualifications
Key Responsibilities
Design, develop, and sustain scalable infrastructure to facilitate real-time analytics and machine learning workloads. Enhance system reliability and performance through automation, observability, and proactive capacity planning. Lead the evolution of CI/CD pipelines, deployment automation, rollback mechanisms, and configuration management. Establish and maintain monitoring, alerting, and incident response protocols, including SLOs, runbooks, and on-call rotations. Foster collaboration across engineering and data science teams to promote a culture of performance and reliability. Ensure security, compliance, and operational readiness of our cloud infrastructure. Drive post-incident analyses and continuous improvement efforts.
About the job
About the Role
Join Alembic as a Senior Site Reliability Engineer (SRE) and become an integral part of our mission to enhance platform reliability, observability, and operational excellence. In this pivotal role, you will collaborate with engineers and data scientists to architect, automate, and maintain the robust infrastructure that drives our platform, including data pipelines, machine learning workloads, and real-time analytics systems.
This hands-on position offers significant visibility across the technology stack and provides you with the opportunity to shape the future of our infrastructure and operations.
About Alembic
Alembic is dedicated to building innovative solutions that empower organizations to harness the full potential of their data. Our team is committed to fostering a collaborative and dynamic work environment where creativity and technical excellence thrive.
Full-time|$170.1K/yr - $283.6K/yr|On-site|Bay Area, CA, United States of America
At Block, we are a mosaic of diverse teams united by a common mission of economic empowerment. Our foundational teams—including People, Finance, Counsel, Hardware, Information Security, and Platform Infrastructure Engineering—provide essential support and guidance across the organization. We collaborate across various business groups, transcending time zones…
Full-time|On-site|Bay Area, CA, United States of America
Role Overview
Block, Inc. is looking for a Reliability Program Manager focused on hardware solutions in the Bay Area, CA. This role guides efforts to improve the reliability and performance of hardware products. The position works closely with teams across the company to spot and address risks, helping Block deliver products that meet high standards for quality and durability.
Full-time|On-site|Bay Area, CA, United States of America
Block, Inc. brings together teams across People, Finance, Legal, Hardware, Information Security, and Platform Infrastructure Engineering to support economic empowerment. These groups collaborate globally to shape policies, forecast trends, offer legal guidance, protect systems, and drive new initiatives. Diverse perspectives are valued here, and every challenge is seen as a chance to make an impact.
Role Overview
The Executive Operations team at Block does more than relay information. This group generates insights that help leaders anticipate and address issues before they escalate. By connecting observations from across the company, the team identifies bottlenecks, surfaces misalignments, and helps correct course early. This centralized approach allows Block to spot patterns and signals that might otherwise go unnoticed.
What You Will Do
As an Executive Business Partner, you will work closely with leaders from Sales, Guidance, and Risk. The position calls for a deep understanding of how leadership decisions are made and the context needed for effective discussions. Your work will focus on:
Anchoring efforts to Block’s top priorities
Facilitating the flow of information between leaders and teams
Addressing emerging issues before they become roadblocks
Supporting leaders across three distinct functions, building awareness of patterns and connections
Ensuring the right context reaches the right people, maintaining alignment and clarity
This role is embedded in the operational core of the organization, not just coordinating but actively shaping how leadership functions. Contributions here directly influence decision-making and help keep the company moving with clarity and speed.
Location
Bay Area, CA, United States of America
Join our innovative team at Lever, where we are reimagining the recruitment landscape! As a Senior Data Engineer, you will play a pivotal role in developing our cutting-edge hiring software, which is trusted by industry leaders like Netflix, Shopify, and Cirque du Soleil. Your expertise will help us continue to push the boundaries of talent acquisition technology. Lever, founded a decade ago, addresses one of the most critical challenges companies face: attracting and hiring premier talent. Recognized as the #1 workplace in San Francisco and a top employer nationwide, we take pride in our people-first culture, investing in our team members – our greatest asset. If you're passionate about shaping the future of hiring, we want to hear from you!
Join our innovative team as an Engineer at leverdemo-8! This role is part of our ongoing efforts to enhance Lever's testing environment. Please note, this posting is for testing purposes only and not for actual recruitment. At Lever, we are dedicated to revolutionizing the recruitment and hiring process, providing cutting-edge software solutions to industry leaders like Netflix, Yelp, Cirque du Soleil, Shopify, and Spotify. As we continue to grow, we seek talented individuals who are passionate about transforming talent acquisition. Lever has been recognized as the #1 workplace in San Francisco and a top employer across the United States. Our team, known as 'Leveroos', is our most valuable asset, and we are committed to fostering a people-first culture that prioritizes the well-being and growth of our employees.
Full-time|On-site|San Francisco, California; Santa Clara, California; Seattle, WA
Join Carta as a Senior Site Reliability Engineer, where you will play a pivotal role in enhancing our infrastructure and ensuring the reliability of our platforms. You will work collaboratively with cross-functional teams to implement innovative solutions that drive operational excellence and scalability.
Full-time|On-site|Bay Area, CA, United States of America
Role Overview
Block, Inc. is looking for an ASIC Validation Engineer in the Bay Area, CA. This role focuses on validating ASIC designs and confirming their performance meets expectations. The work involves collaborating with teams across different functions and running thorough validation tests.
What You Will Do
Validate ASIC designs using established and custom test procedures
Work closely with engineering and product teams to optimize design outcomes
Contribute to projects that influence future technology directions
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center. The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide, recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Join our innovative team at Lever as we redefine the recruitment landscape! Lever, founded a decade ago, addresses the crucial challenge of attracting and hiring top talent. Our cutting-edge hiring software is trusted by industry leaders such as Netflix, Yelp, Cirque du Soleil, Shopify, and Spotify to enhance their teams. We are proud to be recognized as the top workplace in San Francisco and among the best in the United States. Our culture emphasizes people-first values, and we continually invest in our dedicated team, known as “Leveroos.” Join us in our mission to innovate in talent acquisition!
Join our innovative team at Lever as a Senior Scientist, where you will play a crucial role in advancing our mission to redefine talent acquisition. Lever is at the forefront of developing cutting-edge hiring software that empowers companies such as Netflix, Shopify, and Spotify to attract and retain top talent. We pride ourselves on fostering a people-first culture, investing in our employees, and being recognized as a premier workplace in San Francisco and across the United States. As a Senior Scientist, you will leverage your expertise to contribute significantly to our projects and help shape the future of hiring technology. Your insights and skills will be invaluable as we continue to scale and innovate.
Join our innovative team at Lever where we are redefining the recruitment landscape! Lever, founded a decade ago, is on a mission to solve the most pressing challenge in talent acquisition: attracting and hiring top talent. Our state-of-the-art hiring software is trusted by industry leaders like Netflix, Yelp, Cirque du Soleil, Shopify, and Spotify to elevate their hiring processes. As we continue to grow, we are proud to be recognized as a top workplace in San Francisco and across the United States. Our company culture is centered around our people, whom we refer to as 'Leveroos', and we are committed to investing in their success and well-being. Join us in shaping the future of talent acquisition!
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide an innovative GPU marketplace and AI inference service that ensures both affordability and accessibility. As trailblazers at the convergence of AI and open-source technology, we envision a future where AI innovation is only limited by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in crafting a platform that empowers innovators worldwide to realize their visionary AI projects.
In anticipation of our growth following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are in search of a skilled Site Reliability Engineer to guarantee that Hyperbolic's GPU marketplace and AI infrastructure function with outstanding reliability, performance, and security. As an aggregator of computational resources from numerous global providers, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining service level objectives, developing resilient incident response protocols, managing capacity across our extensive GPU network, and implementing secure rollout and rollback mechanisms to ensure uninterrupted platform operation around the clock.
In this influential role, you'll set the reliability benchmarks that foster customer trust in our platform, design comprehensive monitoring and alerting systems for enhanced infrastructure visibility, automate capacity management and resource allocation processes, lead incident response and post-mortem evaluations, and collaborate closely with engineering teams to bolster system resilience.
Security and infrastructure hardening will be paramount, necessitating strong isolation protocols between tenants and suppliers, the implementation of effective key management systems, and the establishment of compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to providing affordable, accessible AI compute at scale.
Join our innovative team at Lever as a Software Engineer, where you will play a pivotal role in building cutting-edge hiring software that empowers companies to attract and retain top talent. Lever, recognized as one of the best workplaces in San Francisco and across the United States, is at the forefront of transforming the talent acquisition landscape. In this position, you will collaborate with a dynamic group of engineers and contribute to the development of solutions used by leading organizations such as Netflix, Shopify, and Spotify. We are proud of our people-first culture and are looking for talented individuals who share our values to join us in our mission.
Join Arena Intelligence
Arena Intelligence is the premier open platform for assessing the real-world performance of AI models. Founded by innovative researchers from UC Berkeley’s SkyLab, our mission is to enhance and measure the frontier of AI applications in real-world scenarios.
Every month, millions of users leverage Arena Intelligence to delve into the performance of advanced AI systems. We harness community feedback to create transparent, rigorous, and user-centered evaluations. Leading enterprises and AI research labs count on our assessments to gauge real-world reliability, alignment, and impact. Our leaderboards are recognized as the definitive standard for AI performance, shaping the global discourse around model reliability and progress.
Our dynamic team is composed of researchers, engineers, and academics from esteemed institutions like UC Berkeley, Google, Stanford, DeepMind, and Discord. We prioritize truth-seeking, rapid innovation, and craftsmanship, fostering an environment where diverse, thoughtful individuals can excel. Every team member is a deep expert in their respective field, contributing to a vibrant culture of excellence and focus.
Position Overview
We are looking for a Senior Infrastructure Engineer to architect, build, and enhance the technical framework that drives the world’s most transparent AI evaluation platform. This is more than just a role focused on architecture; you will collaborate closely with product, security, and data teams to identify system bottlenecks, design robust systems, and deploy code that ensures our platform remains fast, reliable, and secure at scale.
Your work will have a significant impact across the technology stack: optimizing system performance, creating building blocks that facilitate swift feature delivery, fortifying our systems against misuse, and scaling infrastructure to accommodate billions of API requests and real-time interactions.
If you are passionate about cross-functional collaboration, diving deep into code, and transforming infrastructure into a standalone product, this role is for you.
Key Responsibilities
Lead technical strategy for Arena’s core infrastructure, encompassing networking, APIs, services, data systems, and security frameworks.
Design and develop high-performance systems capable of scaling to hundreds of millions of real-time interactions and evaluations daily.
Collaborate with security engineering to integrate anti-abuse mechanisms, Sybil resistance, and trust systems into the core architecture of the platform.
Work alongside product engineering to implement infrastructure enhancements that expedite feature delivery and enhance overall platform performance.
Full-time|$227.2K/yr - $324.5K/yr|Hybrid|San Francisco, CA (Hybrid)
About the Role: At Tubi, our Site Reliability Engineering (SRE) team transcends traditional operations. We embody a software engineering ethos, leveraging a developer's toolkit to tackle the complexities of large-scale, distributed systems. Our core mission focuses on building resilience from the ground up, empowering our product teams to innovate swiftly while delivering an exceptional user experience. We oversee the availability, latency, performance, and capacity of our platform, driven by a culture of data-informed decision-making, blameless learning, and relentless automation. We are on the lookout for a seasoned and visionary Senior Manager of SRE to lead and expand our newly formed Site Reliability Engineering team. You will be more than just a people manager or tech lead; you will be the strategic architect behind our reliability roadmap. Your role will involve building and mentoring a team of skilled engineers, cultivating an environment of blameless learning and continuous improvement, while advocating for the engineering practices that balance rapid innovation with unwavering stability. You will play a pivotal role within our engineering leadership, collaborating with peers across the organization to embed reliability as a shared responsibility and a fundamental principle of our engineering culture.
Join us as a Software Engineer at leverdemo-8, where we revolutionize the hiring landscape. Lever is pioneering the future of talent acquisition software, trusted by industry giants like Netflix, Yelp, Cirque du Soleil, Shopify, and Spotify. Our mission is to simplify the recruitment process, making it efficient and effective for organizations of all sizes. As we continue to grow, we're looking for passionate individuals who are eager to contribute to our innovative team culture. We're proud to be recognized as the top workplace in San Francisco and a leading employer across the United States. Our team, affectionately known as 'Leveroos', is our greatest asset, and we are committed to nurturing a supportive and inclusive environment.
About Hive
Hive stands at the forefront of cloud-based AI innovation, providing cutting-edge solutions that enable organizations to understand, search, and generate content. Our platform is relied upon by some of the world's most prestigious and forward-thinking companies. We empower developers with an extensive suite of state-of-the-art, pre-trained AI models that handle billions of API requests each month. In addition to our robust model offerings, we deliver comprehensive software applications backed by proprietary AI models and datasets, unlocking transformative applications in various sectors such as content moderation, brand protection, sponsorship measurement, and context-based advertising.
With over $120 million in funding from esteemed investors like General Catalyst, 8VC, Glynn Capital, Bain & Company, and Visa Ventures, Hive has cultivated a vibrant global team of over 250 employees across our San Francisco, Seattle, and Delhi offices. If you’re passionate about shaping the future of AI, we invite you to join our dynamic team!
DevOps and Systems Team
In response to our distinctive machine learning demands, we have developed our own data centers focusing on distributed high-performance computing with GPU integration. While we harness the power of these data centers, our infrastructure remains hybrid, leveraging public cloud solutions when advantageous. As we scale our machine learning models for commercial use, we are expanding our DevOps and Site Reliability team to ensure the reliability of our enterprise SaaS offerings. Our ideal candidate thrives in dynamic environments, embraces automation, and believes that every task can be automated and every server can scale. You take pride in enhancing performance across all layers of our stack and are committed to never performing the same task manually twice.
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
Full-time|$172K/yr - $209K/yr|On-site|San Francisco, CA - US
At Crusoe, our mission is to drive the future of energy and intelligence. We are developing the infrastructure that empowers ambitious AI creations without compromising on scale, speed, or sustainability.
Join us in leading the AI revolution through sustainable technology. At Crusoe, you will be at the forefront of innovation, contributing to impactful projects and collaborating with a team dedicated to transforming cloud infrastructure responsibly.
About This Role:
As a Senior Site Reliability Engineer, you will play a crucial role in ensuring the operational excellence of Crusoe’s energy-efficient, AI-optimized GPU cloud. Your focus will be on maintaining stability, resilience, and performance, driving initiatives that enhance our cloud platform.
This position is perfect for engineers who thrive in dynamic environments, relish the challenge of solving operational issues, and seek to advance their technical careers while enhancing incident response and reliability for a large-scale distributed platform.
You will collaborate closely with senior SREs, infrastructure engineers, and platform teams to bolster reliability, minimize operational toil, and refine our incident management processes.
What You’ll Be Working On:
Work with cross-functional teams to establish and enhance availability metrics for our cloud infrastructure, including the development, tracking, and improvement of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Assist in incident response by diagnosing and resolving service disruptions, while supporting post-incident processes through root cause analysis documentation and participation in reviews.
Build, maintain, and monitor the health of our infrastructure using Crusoe’s observability tools (Prometheus, Grafana, Alertmanager, OpenTelemetry).
Identify and communicate reliability risks and performance bottlenecks, along with early indicators of potential incidents that may impact service availability.
Develop automation and tools to reduce operational toil, minimize manual processes, and improve service recovery and self-healing capabilities.
Collaborate with compute, network, storage, and platform teams to enhance service resilience and strengthen disaster recovery preparedness.
Engage in knowledge sharing and contribute to the development of operational best practices across the organization.