Experience Level
Entry Level
Qualifications
Strong understanding of database technologies, including SQL and NoSQL systems. Experience with database performance tuning and optimization. Proficiency in scripting languages such as Python or Bash. Excellent problem-solving skills and a proactive mindset. Ability to work in a collaborative team environment.
About the job
Join Cloudflare as a Database Reliability Engineer, where you will play a crucial role in ensuring the reliability and performance of our database systems. You will work collaboratively with our engineering teams to develop, implement, and maintain robust database solutions that support our mission of making the internet safer and faster.
Your responsibilities will include monitoring database performance, troubleshooting issues, and optimizing queries to enhance system efficiency. If you are passionate about databases and eager to make an impact in a dynamic environment, we encourage you to apply!
About Cloudflare, Inc.
Cloudflare is a global network that powers over 25 million internet properties, providing security, performance, and reliability solutions. Our mission is to help build a better internet, and we are committed to improving the performance and security of our customers' websites.
Full-time|$44K/yr - $66K/yr|On-site|San Francisco, CA
About the Team
DoorDash Labs, founded in 2018, is the innovative core of DoorDash, dedicated to pioneering automation and robotics solutions that enhance last-mile logistics. Our mission is to develop technologies that empower and augment human networks, ultimately improving efficiency for Dashers, merchants, and consumers. With a remarkable focus on business…
Astranis is at the forefront of satellite technology, developing advanced satellites that operate in high orbits to extend humanity's capabilities throughout the solar system. Our satellites deliver dedicated, secure networks for an array of high-profile clients, including major corporations, government entities, and the U.S. military. With five satellites currently in orbit and numerous launches on the horizon, we're managing a robust pipeline of over $1 billion in commercial contracts.
As a trusted partner in satellite communications, Astranis meets stringent requirements for uptime, data security, network visibility, and customization. Supported by over $750 million in funding from top investors such as Andreessen Horowitz, BlackRock, and Fidelity, our team of 450 engineers and innovators operates from a state-of-the-art 153,000 sq. ft. facility in Northern California.
Reliability Test Intern (Fall 2026)
Our internship program typically spans twelve weeks and is designed for students currently enrolled in a four-year university. As an intern at Astranis, you will work on challenging projects that make a significant impact on our satellite technologies. Many previous interns have contributed to the design and testing of hardware and software destined for space, and many have transitioned into full-time roles at Astranis.
If you have already completed your degree, we invite you to apply for our Associate Engineer position instead.
Astranis is revolutionizing satellite technology by creating advanced satellites designed for high orbits, pushing the boundaries of humanity's presence in space. Our satellites deliver dedicated, secure networks to a diverse clientele, including large enterprises, governments, and military organizations globally. With five operational satellites and numerous launches on the horizon, our portfolio includes over $1 billion in commercial contracts.
Astranis stands out as the go-to partner for satellite communications, serving clients who demand exceptional uptime, stringent data security, network visibility, and bespoke solutions. Backed by over $750 million from investors such as Andreessen Horowitz, BlackRock, and Fidelity, our team of 450 engineers and entrepreneurs works from our 153,000 sq. ft. headquarters in Northern California, USA.
Position Overview: Environmental Test Technician
As an Environmental Test Technician, you will play an integral role in validating our spacecraft, sub-assemblies, and components through rigorous simulations of the extreme conditions of space. You will also oversee the operation and maintenance of our environmental test laboratory, ensuring it is ready to support critical test missions.
Key Responsibilities:
Independently operate thermal vacuum (TVAC) chambers, thermal chambers, and electrodynamic shakers for component and vehicle-level testing.
Use a diverse array of test equipment and data acquisition systems, including building and maintaining specialized test setups.
Employ software tools for test automation, data analysis, documentation, and procurement activities.
Conduct tests and verifications of flight hardware against established requirements, and document results thoroughly to meet product quality standards.
Manage maintenance and calibration of test equipment to ensure optimal performance.
Interpret technical drawings, specifications, assembly procedures, and test protocols.
Assist engineering teams with troubleshooting and root cause analysis.
Engage in area efficiency and 5S improvement initiatives.
Join Astranis as an Assembly, Integration, and Test Technician, where you will play a crucial role in our cutting-edge satellite technology projects. You will work closely with engineers and other technicians to assemble, integrate, and test satellite systems, ensuring they meet the highest standards of quality and performance.
Full-time|On-site|San Francisco, California, United States
Join Redwood Materials as an Engineering Test Technician specializing in energy storage solutions. In this pivotal role, you will be at the forefront of testing and validating cutting-edge technologies that contribute to sustainable energy practices. You will work collaboratively with engineers and technicians to ensure the reliability and performance of our innovative energy storage systems.
Astranis is at the forefront of satellite technology, crafting advanced satellites designed for high orbits that significantly extend humanity’s reach into the solar system. Our satellites provide dedicated and secure networks to sophisticated clients worldwide, including large enterprises, sovereign governments, and the US military. With five satellites successfully in orbit and numerous upcoming launches, we have a backlog exceeding $1 billion in commercial contracts.
Astranis has established itself as the go-to satellite communications partner for clients with stringent demands regarding uptime, data security, network visibility, and customization. Backed by over $750 million from investors such as Andreessen Horowitz, BlackRock, and Fidelity, our team of 450 engineers and entrepreneurs designs, builds, and operates our satellites from our 153,000 sq. ft. headquarters in Northern California, USA.
Propulsion Test Technician
As a Propulsion Test Technician, you will play a pivotal role in ensuring that the hardware used to maneuver our satellites in orbit is fully operational. You will take designs from the engineering desk to the test stand, meticulously checking every valve, thruster, and propellant line to confirm that they are flight-ready. If you have a passion for high-pressure systems and a "zero-fail" mentality, you will thrive in this environment.
Key Responsibilities
Execute proof and burst testing using high-pressure pneumatic (GN2, GHe) and hydrostatic systems.
Conduct ultra-sensitive leak detection using helium sniffers or vacuum detection methods.
Perform functional tests on solenoid valves, regulators, and transducers, verifying cracking pressures and flow coefficients.
Design and build custom test manifolds and ground support equipment (GSE).
Install and calibrate pressure transducers, thermocouples, and flow meters, and interface them with DAQ systems to capture real-time test data.
Diagnose the root causes of "no-go" test outcomes by distinguishing actual hardware failures from test setup anomalies.
Maintain a pristine cleanroom environment and adhere strictly to safety protocols for high-pressure gases and hazardous propellants.
Take ownership of the propulsion test cell, keeping the workspace clean and organized in accordance with 5S standards.
About Gridware
Gridware is a technology firm headquartered in San Francisco, committed to safeguarding and enhancing the reliability of the electrical grid. We have pioneered an approach to grid management known as Active Grid Response (AGR), which monitors the electrical, physical, and environmental factors influencing grid safety and reliability. Our AGR platform uses high-precision sensors to identify potential issues early, enabling proactive maintenance and fault resolution. This holistic strategy is designed to improve safety, minimize outages, and ensure optimal grid performance. We are supported by prominent climate-tech and Silicon Valley investors. To learn more, visit www.Gridware.io.
About the Role
We are seeking a skilled Senior Hardware Reliability Engineer to lead reliability testing, analysis, and lifetime modeling of various outdoor electronic assemblies. This role will concentrate on the electronic components of our products, collaborating closely with our mechanically focused Reliability Engineer and engaging with the broader hardware and cross-functional teams.
Full-time|$166.9K/yr - $225.9K/yr|Hybrid|Hybrid - San Francisco
Drata helps organizations demonstrate their commitment to security and integrity. The platform supports companies as they build and maintain trust with users, customers, partners, and prospects.
Values
Built on Trust: Consistency shapes decisions and actions.
Integrity: Choosing to do what is right, every time.
Customer-Obsessed: Prioritizing customer needs above all else.
Competitive Fire: Striving for higher standards and greater achievements.
Diversity: Welcoming different perspectives to encourage creative solutions.
Automation First: Pursuing efficiency by saving time and resources wherever possible.
How the Team Works
Drata blends high standards with a supportive environment focused on growth. Team members are encouraged to own their work, improve continuously, and deliver meaningful results. The company values quick, informed decisions that drive immediate impact, while always keeping the mission and customer needs at the center.
The San Francisco-based team uses a hybrid work model. Colleagues collaborate in the office Tuesday through Thursday, focusing on alignment and innovation. Mondays and Fridays offer flexibility for deep work or personal needs.
Growth and Culture
Drata has expanded to over 600 professionals worldwide and is recognized for a culture that values trust, speed, and continuous learning. The environment supports both personal and professional development.
See the Speed: CEO Adam Markowitz discusses Drata’s rapid journey to $100M ARR in four years.
Hear the Voice of the Team: Employee stories highlight collaboration and growth at Drata.
Become a vital part of the engineering teams that responsibly bring OpenAI’s transformative technologies to the world!
At OpenAI, our Applied Engineering team collaborates across research, engineering, product management, and design to deliver AI solutions to both consumers and businesses. We are committed to learning from our deployments, maximizing the benefits of AI, and ensuring that this powerful technology is used safely and ethically. Our priority is safety over unchecked growth.
About the Role
As OpenAI continues to expand, we are seeking seasoned engineers who excel at problem-solving to improve the scalability of our systems. Our success hinges on our ability to iterate rapidly on product development while ensuring optimal performance and reliability. You will thrive in a collaborative, fast-paced environment, playing a key role in delivering our technology to millions globally, with a focus on safety and reliability. As a reliability engineer, you will lead efforts to maintain and improve the stability, scalability, and performance of our dynamic infrastructure. You will collaborate closely with cross-functional teams, including software engineers, product managers, and data scientists, to build and sustain robust systems capable of accommodating our growing user base and workload.
Your Responsibilities Include:
Designing and implementing solutions to scale our infrastructure to meet increasing demand.
Developing and maintaining load, chaos, and synthetic testing software that improves the reliability of systems designed by development teams.
Creating and managing automation tools to streamline repetitive tasks and bolster system reliability.
Overseeing the lifecycle management platform for CPU/storage, GPU, and network resources to foster efficiency and support dynamic optimization.
Implementing fault-tolerant and resilient design patterns to minimize service interruptions.
Establishing and maintaining service level objectives (SLOs) and service level indicators (SLIs) to ensure system reliability.
Collaborating with researchers, engineers, product managers, and designers to introduce new features and research advancements to the world.
Participating in an on-call rotation to address critical incidents and ensure 24/7 system availability.
Your Impact: Your contributions will be essential to the reliability and performance of our platforms as we continue to scale our operations.
Full-time|Remote|Denver, Colorado, United States; San Francisco, California, United States
Join Checkr as a Software Engineer focusing on Reliability, where your contributions will enhance our platform's robustness and performance. You will be part of a dynamic team dedicated to building and scaling systems that support our growth and ensure outstanding service delivery to our clients.
About Multiply Labs
Multiply Labs is a startup located in San Francisco, California, backed by renowned investors in technology and life sciences such as Casdin Capital, Lux Capital, and Y Combinator. Our goal is to develop the world's leading robotic systems and use them to make groundbreaking life-saving therapies accessible to everyone.
We are transforming the manufacturing of cell therapies by building robotic systems that automate and scale the production of these crucial treatments. Our robots enable biopharma companies to produce cell therapies efficiently without overhauling their existing processes, minimizing regulatory hurdles and risks. Traditional methods are labor-intensive and costly (often exceeding $1M per patient), so our robotic solutions aim to make these vital treatments more affordable and reachable for those who need them.
To learn more and see our robots in action, please visit www.multiplylabs.com and follow us on LinkedIn.
Position Overview
We are looking for a dedicated Hardware Reliability Engineer to join Multiply Labs’ Reliability Engineering team. As a founding member, you will collaborate closely with the Hardware Product and Systems Integration teams to improve our designs throughout the entire development lifecycle, from initial prototypes to fully deployed GMP production systems. Your contributions will directly support the delivery of life-saving therapies by ensuring our robots operate seamlessly within the high-stakes biotech environment.
Join Our Innovative Team
At OpenAI, our Hardware organization is pioneering silicon and system-level solutions tailored to the demands of advanced AI workloads. We develop next-generation AI-native silicon while collaborating with software and research partners to create hardware that is tightly integrated with AI models. Our mission includes delivering high-performance silicon for OpenAI’s supercomputing infrastructure and designing custom tools and methodologies, optimized for AI applications, that accelerate innovation.
Your Role in Our Mission
We are looking for an experienced Reliability/DFX Engineer with extensive knowledge of scaling machine learning systems. As an integral member of our hardware team, you will collaborate with chip design, platform design, hardware health, and the wider industry ecosystem to architect, implement, and deploy dependable next-generation AI accelerator systems. You will take a holistic approach to evaluating system and chip architecture, pinpointing high-ROI opportunities to improve reliability and availability throughout the stack and translating these insights into actionable strategies and silicon features.
Key Responsibilities:
Lead the architecture, implementation, and execution of DFX strategies in silicon from concept to high-volume deployment, proposing impactful features to boost reliability and fault tolerance. Your focus will encompass design for testability, reliability, availability, and serviceability of high-performance AI hardware.
Develop system-level reliability models based on empirical data to guide the organization’s DFX and reliability strategy. This requires a deep understanding of chip and system architecture, design, implementation, and component-level reliability.
Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and integration of digital/mixed-signal IP, firmware/system software, and DFX methodologies.
Work alongside hardware health and platform design teams to improve reliability and fault tolerance in New Product Introduction (NPI) and High-Volume Manufacturing (HVM), driving continuous, data-driven improvements across the stack through optimized operating conditions and data analysis.
Act as the DFX/reliability advocate, aligning the broader industry ecosystem with OpenAI’s strategic objectives and roadmap.
Qualifications:
Bachelor’s degree in Engineering or a related field with 15+ years of experience, or a Master’s degree with 10+ years of relevant experience.
Proven expertise in DFX methodologies and reliability engineering for high-performance hardware.
Strong analytical and problem-solving skills, with a track record of improving system reliability and performance.
Excellent collaboration and communication abilities, with the capacity to work effectively in a cross-functional team environment.
Familiarity with AI workloads and their hardware requirements is highly desirable.
Join Our Team
At Cognition, we are at the forefront of applied AI, developing software agents that redefine the engineering landscape. Our flagship products, Devin, the pioneering AI software engineer, and Windsurf, an AI-native IDE, embody our commitment to creating AI that collaborates with engineers as a true partner.
Our team is composed of elite talent, including competitive programming champions, visionary founders, and researchers from top AI institutions such as Scale AI, Palantir, Cursor, Google DeepMind, and more.
Your Mission
As a Site Reliability Engineer, you will play a crucial role in ensuring the reliability of our user-facing products, which are used by hundreds of thousands of developers daily. Your mission is to address potential issues before they occur and to resolve incidents swiftly, maintaining a seamless experience for our users.
You will be responsible for overseeing production reliability and improving our platform engineering practices, encompassing SLOs, incident response, and on-call duties, alongside CI/CD pipelines, deployment infrastructure, and developer tools. At Cognition, we believe in building reliability into our systems rather than treating it as an afterthought, and we strive to cultivate a culture that reflects this philosophy.
Your Achievements
Production Reliability: Establish and manage SLOs, SLIs, and error budgets for our products. Develop robust monitoring, alerting, and observability systems to maintain a transparent view of service health.
Incident Management: Lead incident response with precision and speed. Conduct blameless postmortems to derive actionable insights from outages, and create effective runbooks and tools to make on-call sustainable.
Platform Engineering: Oversee deployment pipelines and internal developer tools, ensuring rapid, reliable shipping of code while minimizing unnecessary toil for engineers.
Infrastructure as Code: Manage cloud infrastructure as code, creating reproducible, auditable environments that scale with product demands and prevent configuration drift.
Capacity Planning: Analyze growth trends, anticipate resource requirements, and keep our infrastructure ahead of user demand, optimizing system performance proactively.
Security and Reliability: Integrate security protocols with reliability practices to create a robust framework that safeguards our infrastructure.
Join our dynamic team at fal as a Senior/Staff Site Reliability Engineer. In this key role, you will leverage your expertise to enhance our systems' reliability and performance. If you are passionate about building scalable systems and enjoy working in a collaborative environment, we want to hear from you!
ABOUT BASETEN
Baseten powers mission-critical AI inference for some of the most innovative companies globally, including Cursor, Notion, OpenEvidence, Abridge, Clay, Gamma, and Writer. We combine applied AI research with flexible infrastructure and intuitive developer tools so that companies at the leading edge of AI can deploy sophisticated models effectively. With our recent $300M Series E funding round, supported by investors such as BOND, IVP, Spark Capital, Greylock, and Conviction, we are rapidly expanding. Join our team and help build an essential platform for engineers to launch AI products with ease.
THE ROLE
As a Site Reliability Engineer, you will design and implement resilient systems and processes that keep our infrastructure scalable, reliable, and efficient. Your responsibilities will range from automating deployments and monitoring systems to improving performance and managing incidents effectively.
Collaboration is key: you will work closely with our users to understand their challenges in operationalizing machine learning, help onboard them onto our platform, and use these insights to inform improvements to Baseten.
EXAMPLE INITIATIVES
As part of our Infrastructure team, you will work on projects such as:
Multi-cloud capacity management
Optimizing inference on B200 GPUs
Implementing multi-node inference
Using fractional H100 GPUs for efficient model serving
RESPONSIBILITIES
Design and maintain scalable infrastructure to support the deployment and operational needs of machine learning models.
Establish standards and best practices to improve reliability and performance across the infrastructure.
Proactively identify and resolve reliability issues using monitoring and alerting systems.
Collaborate with cross-functional teams to apply best practices in infrastructure management and incident response.
Create automation scripts to streamline processes and reduce manual intervention.
About Hive
Hive is at the forefront of cloud-based AI, providing solutions that enable organizations to understand, search, and generate content. Our platform is relied upon by some of the world's most prestigious and forward-thinking companies. We give developers an extensive suite of pre-trained AI models that handle billions of API requests each month. In addition to our model offerings, we deliver software applications backed by proprietary AI models and datasets, enabling applications in sectors such as content moderation, brand protection, sponsorship measurement, and context-based advertising.
With over $120 million in funding from investors like General Catalyst, 8VC, Glynn Capital, Bain & Company, and Visa Ventures, Hive has built a global team of over 250 employees across our San Francisco, Seattle, and Delhi offices. If you’re passionate about shaping the future of AI, we invite you to join our team!
DevOps and Systems Team
To meet our distinctive machine learning demands, we have built our own data centers focused on distributed high-performance computing with GPUs. While we rely on these data centers, our infrastructure remains hybrid, using public cloud services when advantageous. As we scale our machine learning models for commercial use, we are expanding our DevOps and Site Reliability team to ensure the reliability of our enterprise SaaS offerings. Our ideal candidate thrives in dynamic environments, embraces automation, and believes that every task can be automated and every server can scale. You take pride in improving performance across all layers of our stack and are committed to never performing the same task manually twice.
Full-time|$135K/yr - $235K/yr|On-site|San Francisco
Astranis is revolutionizing satellite technology by creating advanced spacecraft designed for high orbits, extending humanity's presence in the solar system. Our satellites deliver dedicated and secure networks to an elite clientele, including large corporations, government entities, and the U.S. military. With five satellites successfully launched and a pipeline of over $1 billion in commercial contracts, Astranis is set for growth as we prepare for numerous upcoming launches.
We are the go-to satellite communications partner for clients demanding exceptional uptime, data security, network visibility, and tailored solutions. Backed by over $750 million from investors such as Andreessen Horowitz, BlackRock, and Fidelity, our team of 450 engineers and entrepreneurs works from our 153,000 sq. ft. headquarters in Northern California.
Senior Electrical Reliability Engineer
As a Senior Reliability Engineer at Astranis, you will be pivotal in ensuring that our spacecraft electronics and systems meet our reliability and availability requirements. Collaborating with a multidisciplinary engineering team, you will push the boundaries of geosynchronous spacecraft design and achieve previously unattainable performance in space. Your expertise will keep Design for Reliability central to our engineering efforts.
Who We Are
At Hyperbolic Labs, we are committed to democratizing AI by removing barriers to computing power with our Open-Access AI Cloud. By aggregating global computing resources, we provide a GPU marketplace and AI inference service that is both affordable and accessible. Working at the convergence of AI and open-source technology, we envision a future where AI innovation is limited only by creativity, not by resource availability. We invite forward-thinking individuals who share our dedication to making AI universally accessible, secure, and affordable. Join us in building a platform that empowers innovators worldwide to realize their AI projects.
Following our Series A funding, our team, guided by co-founders with advanced degrees in AI, Mathematics, and Computer Science, is set to transform the computing landscape.
About the Role
We are looking for a skilled Site Reliability Engineer to ensure that Hyperbolic's GPU marketplace and AI infrastructure operate with outstanding reliability, performance, and security. Because we aggregate compute from numerous providers worldwide, our service level objectives (SLOs), trust, and economic efficiency are critical to our product. Your key responsibilities will include defining and maintaining SLOs, developing resilient incident response protocols, managing capacity across our GPU network, and implementing safe rollout and rollback mechanisms to keep the platform running around the clock.
In this role, you'll set the reliability benchmarks that build customer trust in our platform, design comprehensive monitoring and alerting systems for better infrastructure visibility, automate capacity management and resource allocation, lead incident response and post-mortem reviews, and collaborate closely with engineering teams to strengthen system resilience. Security and infrastructure hardening will be paramount, requiring strong isolation between tenants and suppliers, effective key management systems, and compliance frameworks. This high-impact position will directly affect our ability to deliver on our commitment to affordable, accessible AI compute at scale.
Welcome to the Teo Job Test position at reteam! This role is designed to facilitate our internal testing processes. Please note that applications submitted through this platform will not be processed.
Position Overview
This is a unique test job intended solely for evaluation purposes.
We appreciate your understanding in this matter.