Experience Level
Senior
Qualifications
We are looking for candidates with a strong background in software engineering, specifically in reliability and systems design. The ideal candidate will possess:
Proficiency in programming languages such as Python, Java, or Go.
Experience with cloud services and infrastructure management.
Solid understanding of software development methodologies and best practices.
Ability to work collaboratively in a fast-paced environment.
Excellent problem-solving skills and analytical thinking.
About the job
Join Wise as a Senior Software Engineer I focused on reliability! In this pivotal role, you will be at the forefront of ensuring our systems are robust, scalable, and efficient. You will collaborate with cross-functional teams to identify and resolve issues, enhance system performance, and contribute to our mission of making international money transfers easy and affordable. If you are passionate about technology and thrive in a dynamic environment, we want to hear from you!
About Wise
Wise is a leading financial technology company that has transformed the way people transfer money internationally. Our mission is to make money transfers instant, convenient, and affordable. With a diverse and inclusive workforce, we pride ourselves on our commitment to innovation and excellence, constantly striving to improve our services and customer experience.
We are seeking a highly skilled Lead Product Reliability Engineer to join our innovative team at fyxer in London. In this critical role, you will be responsible for enhancing the reliability and performance of our products, ensuring that we meet our high standards and customer expectations. Your expertise will help us develop robust solutions that can withsta…
Join fyxer as a Senior Product Reliability Engineer and play a crucial role in ensuring the reliability and performance of our innovative products. In this position, you will collaborate with cross-functional teams to develop and implement best practices for product reliability, while also troubleshooting issues to enhance user experience.
Location: London, Waterloo (Hybrid, 4 days in-office - Wednesday is our designated work from home day, though you are welcome to join us in the office on Wednesdays if you prefer)
At getground, we are revolutionizing one of the world's most significant asset classes: property. With over £2 billion in assets on our platform and a community of more than 30,000 users across 70 countries, we are shaping the future of asset ownership and tackling wealth inequality.
Our innovative product streamlines property investing from start to finish, making real estate investment accessible to everyone.
Your Key Responsibilities:
Collaborating within cross-functional product teams to transition infrastructure and reliability initiatives from concept to live deployment.
Thriving in a dynamic environment where autonomy and ownership are fundamental to our operations.
Developing and sustaining a robust, scalable infrastructure within our GCP cloud ecosystem. Utilizing Kubernetes, Terraform, Cloudflare, and cutting-edge observability tools to ensure seamless platform functionality.
Working closely with engineering teams to formulate CI/CD pipelines, enhance deployment methodologies, and advocate for reliability as a core engineering principle.
Contributing to the establishment of SRE practices for a rapidly growing fintech platform.
Mentoring fellow engineers as we expand our teams and influence.
Your Day-to-Day Activities:
Designing, implementing, and maintaining cloud infrastructure on Google Cloud Platform (GCP), ensuring it meets scalability, reliability, and security standards.
Taking ownership of our Kubernetes clusters and containerization strategy, including Docker image optimization, cluster management, and deployment orchestration.
Creating and optimizing Infrastructure as Code using Terraform, producing modular, testable, and well-documented configurations that adapt to our rapid growth.
Managing and enhancing our Cloudflare infrastructure, including Workers for edge computing, DNS, CDN, security policies, and performance optimization.
Implementing AI-powered product features in isolated and secure serverless environments.
Establishing comprehensive monitoring and observability with Prometheus and Grafana, defining SLIs/SLOs, and proactively identifying potential issues before they affect users.
Designing and maintaining CI/CD pipelines with appropriate quality gates, testing strategies, and deployment methodologies (blue-green, canary) to facilitate rapid deployments.
What You'll Accomplish:
As a Senior Data Reliability Engineer, you will spearhead the integration of Site Reliability Engineering (SRE) across all engineering practices. Your leadership will ensure that every engineer and team is dedicated to crafting software that is not only resilient but also exceptionally reliable. You will collaborate with a diverse, cross-functional team of subject matter experts and on-call engineers, focused on maintaining high performance of our platform around the clock.
Overseeing a comprehensive suite of products, you will be responsible for the reliability of enterprise-grade applications that process thousands of queries per second. Elliptic is acclaimed for its extensive and dependable datasets, and your role will be pivotal in establishing a market-leading infrastructure for data quality and governance. This involves creating the processes, culture, and frameworks that will enhance observability, data quality, lineage, and remediation, forming a crucial backbone of our data and intelligence platform.
Your Responsibilities:
This role spans multiple teams, and you will receive full support from leadership and engineering while showcasing exemplary standards. Your main tasks will include:
Promote the principles of SRE and DRE throughout the engineering teams.
Lead the development of a data quality framework that assures our clients of the accuracy of our data and supports marketing and revenue initiatives.
Define and manage the on-call process within the SRE function:
Quickly gain an in-depth understanding of our systems.
Lead incident management.
Conduct post-incident reviews.
Ensure timely completion of follow-up actions.
Assess and enhance our existing end-to-end on-call processes.
Participate in the on-call rotation, approximately every 4 to 5 weeks, ensuring 24/7 coverage.
Evaluate, manage, and improve our current monitoring, alerting, paging, and documentation solutions.
Provide reports on system uptime, availability, and performance across our product range.
Draft post-mortem reports for both internal and external stakeholders.
Represent the SRE and DRE functions during discussions with top-tier enterprise financial institutions.
Join our dynamic team as a Senior Site Reliability Engineer at Bumble Inc., where your expertise in Linux and system-level operations will be pivotal in managing complex production environments. We seek a proactive engineer capable of independently troubleshooting incidents, leading post-incident recovery efforts, and implementing enhancements to boost overall system stability, performance, and observability. This role is ideal for hands-on SREs with a solid foundation in Linux infrastructure and third-party system operations, focusing on optimizing large-scale environments of over 5,000 hosts utilizing technologies such as Kafka, Redis, and Kubernetes. Please note, this position centers on operational excellence rather than application development, requiring deep technical acumen and advanced troubleshooting capabilities.
About Neo4j
Neo4j builds a graph intelligence platform used by 84 of the Fortune 100 and supported by the world’s largest graph community. The platform powers knowledge graphs for AI, delivers reliable graph capabilities across cloud environments, and integrates with a wide range of systems. Neo4j’s technology is designed for precision, accountability, and governance, helping organizations turn data into actionable insights for intelligent applications and AI systems. Engineered for seamless operation in any cloud, Neo4j supports dynamic, personalized, and autonomous AI solutions. The focus is on delivering swift results, contextual knowledge, and solutions that improve both customer and employee experiences.
Our Vision
Neo4j’s mission is to help the world understand data. As business and society become more interconnected, Neo4j’s technology enables organizations to find and understand relationships within their data. The company pioneered the graph database category and continues to lead in helping teams innovate and stay competitive.
About the Site Reliability Engineering Team
The Site Reliability Engineering (SRE) team supports Neo4j’s Database as a Service (DBaaS) product, Neo4j Aura. Aura operates globally across all major cloud providers, running hundreds of Kubernetes clusters and managing thousands of Neo4j instances in production. This team is redefining SRE within Neo4j Aura. Rather than simply reacting to incidents, the SRE group empowers teams to design for reliability from the start. The work centers on building tools, practices, and a culture that embed SRE principles into the foundation of Aura’s operations. Collaboration with product teams and a commitment to resilience and engineering excellence are central to the team’s approach.
What You Will Do
Automate for insight and scale: Build systems that enable fast, safe, and scalable troubleshooting across thousands of Neo4j instances. This includes developing internal tools that provide actionable insights.
Location
London
Wintermute Trading is a leading force in algorithmic trading for digital assets, providing liquidity across major cryptocurrency exchanges and platforms. The company works with both established blockchain projects and traditional financial institutions entering the crypto space. In addition to trading, Wintermute supports the broader blockchain ecosystem through investments, partnerships, and project incubation. Founded in 2017 by industry veterans, the team combines technical depth with the agility of a startup.
Role overview
The Tech Lead for Product Engineering will be based in London and will take a hands-on leadership role in developing the NODE platform. This platform is designed for institutional clients, enabling seamless crypto trading experiences. The position offers direct influence over both the direction and execution of the platform, working in a cross-functional setting.
What you will do
Lead the development of the NODE trading platform, focusing on institutional users
Collaborate with engineers, designers, core developers, and stakeholders
Define, build, and deliver a secure, scalable, and high-performance trading product
Work across teams to align the platform with business and technical objectives
Location
This role is based in London.
At Orgvue, we are at the forefront of organizational design and planning software, harnessing the transformative power of data visualization and modeling to help organizations become more adaptable and high-performing. Our platform empowers HR, finance, and business leaders to make swift, informed workforce decisions in an ever-evolving landscape.
Trusted by some of the world's largest enterprises and renowned management consulting firms, Orgvue enables organizations to visualize and proactively shape their futures. Headquartered in London, we also have offices in Philadelphia, The Hague, Toronto, and Sydney.
We are currently on the lookout for a Principal Site Reliability Engineer to join our team as a senior technical leader specializing in scaling and fortifying our AWS and Kubernetes-based infrastructure.
Role Overview
In this pivotal role, you will collaborate with product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale. This position marries hands-on technical proficiency with strategic foresight, enabling us to cultivate a world-class reliability culture and a strong engineering framework for growth. We seek an individual with robust technical skills, exceptional communication abilities, and a passion for cross-team collaboration.
Key Responsibilities
Establish and uphold SLOs, SLIs, and error budgets across vital services
Design and execute a comprehensive cloud infrastructure and tooling strategy
Elevate SRE practices organization-wide
Implement effective observability metrics, logs, and traces using our observability tools
Lead the team in creating automated, self-healing systems
Manage and refine our incident response protocols, including on-call practices and a post-mortem culture
Mentor engineers throughout the organization on reliability best practices, operational readiness, and scalable infrastructure
Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps methodologies
Work closely with security, DevOps, and software teams to guarantee compliance, scalability, and operational excellence
Assess and introduce tools, patterns, and practices that enhance the performance and reliability of our SaaS platform
Qualifications
Proven experience leading SRE transformations
Extensive hands-on expertise with Kubernetes (EKS preferred) in production settings
Strong proficiency with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
Expertise in Infrastructure as Code utilizing tools such as Terraform, with familiarity in GitOps workflows
Solid background in observability: metrics, visualization, logging, and tracing
Underst...
Role overview
The Forward Deployed Reliability Engineer at Palantir Technologies in London plays a key role in supporting the reliability and performance of Palantir's software as it becomes part of client operations. This position centers on ensuring that solutions remain stable and effective after deployment.
What you will do
Partner with clients to help integrate Palantir's technology into their daily workflows.
Troubleshoot and resolve complex technical challenges to keep systems stable.
Work to optimize performance and apply established reliability engineering practices.
Collaborate with teams across disciplines to enhance system functionality and deliver results for clients.
The Trade Desk is a leading global technology firm dedicated to establishing a more accessible and transparent internet through principled and intelligent advertising solutions. Our platform processes over 1 trillion queries daily, operating at an unparalleled scale. We pride ourselves on nurturing an award-winning culture rooted in trust, ownership, empathy, and collaboration. We celebrate the distinctive experiences and viewpoints that every individual contributes to The Trade Desk, and we are devoted to cultivating inclusive environments where everyone can express their authentic selves daily. Are you passionate about tackling complex challenges on a grand scale? Do you long to be part of a vibrant, globally-connected team where your efforts will significantly impact the creation of a better media ecosystem? Discover why Fortune magazine regularly recognizes The Trade Desk as one of the best small to medium-sized workplaces worldwide. About the Role: We are on the lookout for a Lead Systems Reliability Engineer to enhance our engineering team, focusing on the development and upkeep of our data-driven platform. Utilizing cutting-edge technologies such as Aerospike, MongoDB, and Kafka, we execute numerous real-time activities, achieving a remarkable p99 latency of under 1 millisecond on the backend! If you thrive on tuning, performance testing, troubleshooting, automation, and operating at scale, and if you find excitement in testing next-gen hardware, evaluating data access patterns, and architecting automation for distributed systems, we want to hear from you!
Full-time|£40K/yr - £60K/yr|Hybrid|Bristol, England, United Kingdom; Edinburgh, Scotland, United Kingdom; London, England, United Kingdom
Join our dynamic Release Engineering team at Kaluza as a Site Reliability Engineer. In this pivotal role, you will play a crucial part in enhancing our software development lifecycle by developing innovative engineering solutions that empower our software teams to deploy high-quality code efficiently. Your efforts will significantly boost engineering productivity through the optimization of testing, deployment, and release processes across all Kaluza engineering teams.
About the Role
At SumUp, our mission is to bridge the gap between merchants and consumers, empowering local communities to flourish. With SumUp Pay, we are innovating a banking experience that connects individuals to the small businesses they adore while facilitating free transactions for our merchants. SumUp Pay simplifies financial management for consumers, enhances everyday spending, and fortifies the bond between them and local merchants, delivering value and fostering growth for community businesses. This pivotal role offers the chance to influence our global strategy at a rapid pace, tackling diverse topics and opportunities. Your strategic acumen, creative problem-solving, and relentless drive will be essential as you lead initiatives within the Consumer Tribe, dedicated to expanding SumUp Pay and enhancing the consumer-merchant relationship while collaborating across the organization.
About Wheely
Wheely is revolutionizing premium transportation in major cities across Europe, the United States, and the Middle East. We seamlessly integrate cutting-edge technology with the artistry of five-star chauffeuring to provide an unparalleled experience that has earned the trust of over 100,000 active riders and 1,200 corporate clients.
As a profitable and rapidly growing scale-up, we have raised $43M and surpassed $100M in annual revenue. Following our recent launch in New York City, we are swiftly expanding across the US and EMEA. If you take pride in your craft and are eager to contribute to our next phase of growth, we invite you to connect with us.
Our infrastructure has been rebuilt almost from the ground up over the past few years, and we are now seeking to further expand our infrastructure team.
As a valued member of our team, you will focus on minimizing incidents related to availability, performance, and security. You will accelerate the delivery of new features to customers by building flexible, highly available, and secure infrastructure, ensuring a smooth journey for every customer.
Join our dynamic Systems Engineering team as a pivotal and trusted DevOps Engineer / Site Reliability Engineer. Collaborating closely with software engineers, you will design and implement mission-critical services and systems. Your role will involve managing infrastructure and services at scale, employing a diverse array of cutting-edge technologies that support our high-traffic, real-time Freelancer.com marketplace as well as various other business products deployed on Amazon Web Services. Our technology stack includes Nginx, MySQL, Redis, ElasticSearch, RabbitMQ, Consul, Docker, and Kubernetes. We aim to build highly resilient, dynamically scaling, self-healing systems by automating and monitoring all processes using tools such as Terraform, Puppet, Prometheus, Grafana, Kibana, and Jenkins.
Join Graphcore, a pioneering company at the forefront of artificial intelligence and machine learning technology, as a Senior Systems Engineer specializing in Performance and Reliability. In this role, you will be instrumental in ensuring that our systems deliver exceptional performance and reliability that sets us apart in the industry. Your expertise will contribute to designing, implementing, and optimizing system architectures that support our cutting-edge technology. You will collaborate closely with cross-functional teams to tackle complex challenges and drive innovation within the organization.
Join Pylon Labs and Shape the Future of B2B Post-Sales!
At Pylon Labs, we are revolutionizing B2B post-sales support with our comprehensive platform that harnesses conversational data and advanced intelligence to empower our clients to manage their operations in real time.
Supported by renowned investors like a16z, BCV, General Catalyst, and Y Combinator, we proudly serve over 1000 companies, including Linear, Cognition (creators of Devin), Modal Labs, and Incident.io. We are also featured on the Enterprise Tech 30 List.
Your Role
This position starts as an individual contributor role, with growth opportunities into a team lead or management position as we expand in the EMEA region. You will act as the primary support contact for our European clients, managing issues from start to finish, becoming a product authority, and establishing best practices for exceptional support in this area.
Location: Initially remote for a few months, transitioning to in-office at our East London office upon its opening. We seek candidates who are based in or willing to relocate to London.
Your Responsibilities
Address customer inquiries regarding our products across various topics.
Create and revise knowledge base articles, including troubleshooting guides and feature explanations.
Utilize Pylon's support tools, provide feedback, and help shape product development.
Collaborate closely with product and engineering teams to resolve bugs and troubleshoot customer issues.
Assist in building a scalable support team and processes for the EMEA region.
Experiment with new features, processes, and innovative AI solutions.
Qualifications
Must be located in London or willing to relocate, enthusiastic about working in the East London office once it opens (remote work for the initial months as we set up, with an office opening anticipated in September 2026).
1 month on-site training in our San Francisco office.
Skilled in engaging with customers through chat and video platforms.
A passion for product development and improvement.
1 to 8 years of relevant experience.
Leadership experience is a plus.
About xAI
At xAI, our mission is to develop advanced AI systems that can comprehend the universe and assist humanity in its quest for knowledge. Our dedicated team is small, highly motivated, and committed to engineering excellence, making it an ideal environment for individuals who thrive on challenges and curiosity. We foster a flat organizational structure where every employee plays a crucial role in driving our mission forward. We value initiative and excellence, rewarding those who consistently demonstrate strong work ethic and prioritization skills. Effective communication is essential, and all team members are expected to share their insights clearly and concisely.
About the Team
You will join a team responsible for the backend services that power our innovative products, including grok.com and our API. Our focus is on developing and maintaining highly scalable and reliable services capable of efficiently processing tens of thousands of queries per second, hosted across multiple Kubernetes clusters in both on-premises and cloud environments.
About the Role
We are looking for a candidate who meets the following criteria:
In-depth expertise in Kubernetes.
Proficiency with continuous deployment systems, including Buildkite and ArgoCD.
Extensive experience with monitoring tools such as Prometheus, Grafana, and PagerDuty.
Strong knowledge of infrastructure as code practices utilizing tools like Pulumi or Terraform.
Familiarity with systems programming languages such as Rust, C++, or Go.
Experience in traffic management and HTTP proxies, such as nginx and envoy.
Location
This position requires in-person attendance in London, UK. While we typically work from the office five days a week, we do provide flexibility for remote work when necessary. Candidates should be prepared to attend late meetings at least once a week to coordinate with our global teams.
Join Wayve Technologies as a Staff Cloud Site Reliability Engineer and play a pivotal role in shaping the future of autonomous driving technology. In this position, you will leverage your expertise to enhance the reliability, performance, and scalability of our cloud infrastructure. Collaborate with cross-functional teams to design robust systems that can handle high traffic and ensure seamless operation.
Blockchain.com is at the forefront of revolutionizing finance, providing millions globally with secure access to cryptocurrency. Established in 2011, we have gained the trust of over 90 million wallet holders and more than 40 million verified users, facilitating over $1 trillion in crypto transactions.
Blockchain is the world's premier software platform for digital assets. We operate the largest production blockchain platform globally, driven by our passion for coding and building an open, accessible, and equitable financial future, one innovative software solution at a time.
We are seeking a Site Reliability Engineer to join our Core team. This role involves advocating for infrastructure best practices across our organization, enabling us to securely scale a distributed financial platform that serves millions daily.
Our distributed financial platform addresses some of the most fascinating challenges in the crypto space for our vast customer base and is experiencing rapid growth. The Site Reliability Engineering (SRE) team at Blockchain merges software and systems engineering to create a platform that simplifies complexity, enhancing security, reliability, and swift product delivery.
The SRE organization at Blockchain is a dynamic environment focused on continual improvement. We foster a culture where team members can propose, discuss, design, and implement changes with a high degree of autonomy. We value abstract thinking to develop exceptionally effective tools and strive to eliminate toil.
As a member of the Core team, you will gain a comprehensive understanding of our products' infrastructure needs. Your role will include establishing and maintaining innovative engineering solutions to enhance our customers' experience through the development of essential tools. Importantly, you will also mentor and guide developer teams to deliver new features in a rapid, secure, and scalable manner.