About the job
Join Air as a Senior Database Reliability Engineer, where you will take charge of the performance, reliability, and scalability of our PostgreSQL Aurora infrastructure, which is integral to supporting the creative workflows of renowned brands. In this role, you will collaborate closely with backend engineering teams and leadership to design scalable solutions, implement industry-leading database operations practices, and create an observability framework that ensures Air operates with over 99.99% uptime.
Core Responsibilities
Guarantee Database Reliability & Performance
You will be responsible for maintaining the health, performance, and availability of Air's PostgreSQL Aurora infrastructure:
- Proactively tune database parameters, indexes, and query patterns to achieve sub-100 ms p95 response times.
- Improve migration practices and tooling so that schema changes remain zero-downtime as the platform grows.
- Establish and maintain comprehensive backup, restore, and disaster recovery procedures with clear RTO/RPO targets.
- Collaborate with backend engineers to implement database best practices within application code, including connection pooling, query optimization, and caching strategies.
Plan and Implement Long-Term Scaling
Formulate a multi-quarter roadmap aimed at scaling Air's database infrastructure to accommodate a tenfold increase in asset volume and user engagement:
- Work with backend engineers and product leadership to model data growth patterns and anticipate scaling inflection points.
- Assess and apply horizontal scaling strategies (such as read replicas, sharding, and partitioning) aligned with business objectives.
- Continuously evaluate AWS Aurora capabilities, PostgreSQL ecosystem advancements, and emerging database technologies to keep Air's infrastructure current.
- Design and execute database architectures that support Air's AI-powered features and real-time creative workflows.
Develop Observability and Data Health Framework
Establish comprehensive monitoring, alerting, and reporting systems to uphold database reliability and enable informed infrastructure decisions:
- Implement detailed instrumentation for database performance metrics, including query latency, connection pool utilization, replication lag, and disk I/O.
- Create automated alerts for anomalies in query performance, connection trends, and resource usage.
- Develop executive-level dashboards that display database health trends, capacity utilization, and cost efficiency.
- Facilitate regular database health reviews with engineering leadership to surface insights and drive continuous improvement.

