Senior Network Observability Engineer

CoreWeave, London, England
On-site, Full-time

Experience Level

Senior

Qualifications

To excel in this role, you should possess:

  • Proficiency in Python and Golang for developing observability tools.
  • Experience with network protocols and observability frameworks.
  • Strong problem-solving skills and the ability to work collaboratively across teams.
  • Familiarity with observability tools like Prometheus, Grafana, and Alertmanager.
  • Excellent communication skills to articulate technical concepts effectively.

About the job

CoreWeave is The Essential Cloud for AI™. Engineered by innovators for innovators, our platform equips leading AI labs, startups, and enterprises with the cutting-edge technology and expert teams needed to scale AI solutions confidently. Since our inception in 2017 and our public listing on Nasdaq (CRWV) in March 2025, we've been at the forefront of AI infrastructure, facilitating breakthroughs by transforming compute into capability. Discover more at www.coreweave.com.
 
We are proud to be a Living Wage accredited employer.

Your Role

We are looking for an experienced Senior Network Observability Engineer to join our Network Observability team. In this role, you will shape the design, development, and maintenance of the monitoring, telemetry, and observability systems that keep CoreWeave’s GPU cloud network reliable and scalable. Your focus will be on building solutions that deliver real-time insight into network performance, enabling proactive issue detection and swift resolution.

Your mission is to enrich CoreWeave’s network with advanced observability features: comprehensive metrics, insightful analytics, and automated alerting, ensuring that any anomalies are identified before they impact our customers.

  • Design, optimize, and maintain network observability platforms. Leverage your expertise in Python and Golang to create and automate collectors, exporters, and dashboards that provide in-depth visibility into network health and performance.
  • Partner with Network Engineering and Platform teams to aggregate and unify logs, metrics, and events from diverse platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a cohesive observability pipeline.
  • Architect and implement scalable telemetry solutions using protocols such as gNMI and SNMP alongside streaming analytics. Build robust alerting and anomaly detection with tools like Prometheus, Grafana, and Alertmanager.
  • Collaborate closely with network developers, site reliability engineers, and security teams to integrate observability solutions throughout the wider infrastructure. Engage in design discussions, RFCs, and architectural decisions.

About CoreWeave

At CoreWeave, we empower innovation through our Essential Cloud for AI™, providing unparalleled infrastructure for AI development. Our commitment to excellence has made us a trusted partner for leading AI laboratories and enterprises around the globe.
