About the job
About Traversal
Traversal stands at the forefront of AI Site Reliability Engineering (SRE) for enterprises, earning the trust of some of the biggest names in the industry to troubleshoot, resolve, and proactively prevent intricate production incidents. Our mission is to liberate engineers from the cycle of constant crisis management, allowing them to concentrate on innovative, impactful projects.
With deep roots in AI research, we are channeling scientific rigor and creativity into establishing the leading AI agent lab tailored for enterprises. We take pride in assembling a diverse and talented team, featuring researchers from renowned institutions such as MIT, Harvard, and Berkeley, alongside top-tier engineers from prestigious organizations like Citadel Securities, Cockroach Labs, and Datadog. Together, we tackle one of the most challenging problems in AI, and it's our collaborative spirit that drives our success.
The Role
As an AI Engineer on the Data Platform team at Traversal, you will design, develop, and maintain the backend systems that fuel our AI-driven observability platform. Your work will span both cloud-based and on-premises deployments, ensuring that our systems are highly reliable, efficient, and capable of supporting extensive AI operations. This hands-on position combines distributed systems engineering, low-level system design, performance optimization, observability, and AI integration. You will work closely with engineers across the organization to deliver resilient infrastructure that empowers our AI agents to diagnose and rectify production incidents in real time.
Responsibilities
Architecture & Implementation: Participate in the design and execution of scalable, resilient infrastructure systems that support AI-driven root cause analysis and observability workflows across varied on-premises environments.
Low-Level System Design: Contribute to the foundational elements of our infrastructure, ensuring optimal resource utilization and high performance at scale.
Performance Optimization: Analyze and fine-tune backend systems to enhance throughput, decrease latency, and eliminate bottlenecks across the infrastructure.
Observability Systems: Assist in the development and maintenance of our internal observability stack (logs, metrics, and traces) utilized by our agents to comprehend and respond to production challenges.
Hybrid Infrastructure: Facilitate architectures for both cloud-based (SaaS) and on-premises deployments to serve enterprise needs effectively.

