About the job
Join NordLayer, where we are revolutionizing cybersecurity solutions that grow with your business.
Our adaptable platform empowers modern teams to excel without the burden of security concerns. Trusted by over 11,000 companies worldwide, NordLayer seamlessly integrates into any technology stack, ensuring user protection across borders.
Your Role: Play a pivotal part in helping businesses maintain robust security while advancing with innovative network protection.
At NordVPN, we manage a global edge infrastructure that serves millions of users. This position is designed to address the challenge of gaining real-time insights into our infrastructure's performance at scale, while minimizing operational noise.
We are seeking a Senior Site Reliability Engineer (SRE) with a focus on observability. Your responsibilities will include designing monitoring systems, improving signal quality, reducing alert fatigue, and collaborating with data teams on anomaly detection. You will own how we understand the health and performance of our distributed systems.
Key Responsibilities
- Design, build, and refine monitoring pipelines and observability tools across a globally distributed infrastructure.
- Define and implement service-level monitoring strategies based on the four golden signals (latency, traffic, errors, saturation).
- Mitigate alert fatigue by creating actionable alerts that engineers can rely on.
- Create and manage custom exporters, scripts, and integrations for metrics and log collection.
- Work alongside the data team on anomaly detection and data-driven operational insights.
- Understand service signals: which metrics to track, why they matter, and how to interpret them.
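
To give a sense of the signals involved, here is a minimal Python sketch that summarizes one time window into the four signals listed above (latency, traffic, errors, saturation). It is illustrative only and does not describe our production tooling: the names `Request`, `golden_signals`, `in_flight`, and `capacity` are hypothetical, and saturation is approximated as in-flight requests over an assumed capacity.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window into the four golden signals."""
    if len(requests) < 2:
        # statistics.quantiles needs at least two data points.
        raise ValueError("need at least two requests in the window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        # 95th percentile; "inclusive" keeps the value within the observed range.
        "latency_p95_ms": quantiles(latencies, n=20, method="inclusive")[-1],
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        # Assumption: saturation as the fraction of capacity in use.
        "saturation": in_flight / capacity,
    }


# Example window: ten requests over five seconds, one of them failed.
sample = [Request(latency_ms=10.0 * (i + 1), ok=(i != 3)) for i in range(10)]
summary = golden_signals(sample, window_s=5, in_flight=30, capacity=120)
```

A summary like this is the kind of input an actionable alert would be built on, for example paging on a sustained error ratio rather than on a single failed request.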
Core Requirements
- Experience in distributed systems observability, including monitoring architecture, signal design, and dashboarding.
- Proficiency in the golden signals methodology: designing monitoring that focuses on what truly matters rather than what is easy to measure.
- Expertise in alert design: reducing noise, creating actionable alerts, and managing on-call responsibilities.
- Strong Python skills for scripting, custom exporter development, automation, and data processing.
- Strong Linux administration and troubleshooting skills.
