Join us at the forefront of technological innovation as we revolutionize the data storage industry. At Pure Storage, you’ll lead with creative solutions, grow alongside industry experts, and be part of a team that is redefining what’s possible.
If you’re eager to make a meaningful impact and embrace limitless opportunities, we invite you to become a part of our dynamic organization.
THE ROLE
As a Site Reliability Engineer at Pure Storage, you will play a pivotal role in improving the performance and stability of our essential engineering infrastructure and production services worldwide. This hybrid role combines software engineering expertise with systems knowledge: you’ll own the reliability and operational excellence of the core applications that drive Pure’s innovative offerings. Working alongside your engineering peers, you will set the benchmark for operational efficiency through automation, collaborative postmortems, and ambitious service level objectives (SLOs).
WHAT YOU'LL DO
- Ensure exceptional service reliability for our cloud platforms and infrastructure through comprehensive monitoring, proactive incident response, and thorough root cause analysis (RCA) in a 24x7 environment.
- Transform operational practices by identifying manual cloud service operations and deployment tasks, then designing and implementing automation to replace them, significantly boosting efficiency and minimizing human error.
- Collaborate with development teams to integrate SRE principles early in the development lifecycle, defining enhancements to service architecture that promote high availability, scalability, and compliance with established SLAs.
- Enhance the observability stack by configuring and refining service health monitoring, collecting critical metrics, and developing effective alerting systems to maintain comprehensive insight into system performance.
- Champion modern cloud operations technologies by exploring and implementing new tools for Infrastructure as Code (IaC), container orchestration, and high availability (HA), continuously improving the reliability and scalability of our cloud services.
