About the job
Your Role
We are looking for an experienced Senior Engineer to join our Network Observability team. In this pivotal role, you will shape the design, development, and maintenance of the monitoring, telemetry, and observability systems that keep CoreWeave’s GPU cloud network reliable and scalable. Your focus will be on building solutions that deliver real-time insight into network performance, enabling proactive issue detection and swift resolution.
Your mission is to enrich CoreWeave’s network with advanced observability features: comprehensive metrics, insightful analytics, and automated alerting, ensuring that any anomalies are identified before they impact our customers.
- Design, optimize, and maintain network observability platforms. Leverage your expertise in Python and Golang to create and automate collectors, exporters, and dashboards that provide in-depth visibility into network health and performance.
- Partner with Network Engineering and Platform teams to aggregate and unify logs, metrics, and events from diverse platforms (Arista EOS, NVIDIA Cumulus Linux, Nokia SR OS, SR Linux, etc.) into a cohesive observability pipeline.
- Architect and implement scalable telemetry solutions using protocols such as gNMI and SNMP, along with streaming analytics. Build robust alerting and anomaly detection frameworks with tools like Prometheus, Grafana, and Alertmanager.
- Collaborate closely with network developers, site reliability engineers, and security teams to integrate observability solutions throughout the wider infrastructure. Contribute to design discussions, RFCs, and architectural decisions.
