About Runware
Runware builds a full-stack AI media-creation platform that lets developers and businesses generate many types of media quickly and with little effort. As the company scales rapidly and adds more sophisticated models, we need better visibility, analytics, and monitoring across the entire platform stack.
We are looking for a Data Expert (Analytics, Monitoring, Observability) who will play a critical role in understanding, measuring, and optimizing the performance of the Runware platform at scale, both for internal operations and for our clients.
Mission
Your primary objective will be to give Runware complete visibility into:
- End-to-end inference performance
- Integration usage and model activity
- Errors, delays, bottlenecks, and regressions
- Internal and client-facing analytics dashboards
- Health and performance of production pipelines
By turning this data into actionable insights, you will help the engineering, ML, backend, DevOps, and leadership teams make informed decisions and continuously improve performance and reliability.
What You Will Do
Performance Monitoring & Benchmarking
- Develop and maintain end-to-end inference-time tracking (both globally and per model).
- Analyze how changes in implementation affect overall request latency.
- Identify regressions resulting from suboptimal code paths.
- Set up automated alerts and analyze historical trends.
Usage & Analytics Reporting
- Create dashboards for internal stakeholders (engineering, product, leadership).
- Deliver client-facing usage dashboards (covering requests, errors, success rates, and performance).
- Assist clients in debugging their integrations through enhanced visibility.
- Monitor model-level usage, API endpoint utilization, and adoption metrics.
Platform Observability
- Implement metrics, logs, and traces so that the platform can scale smoothly.
- Collaborate closely with DevOps and backend teams to enhance system observability.
- Provide insights to inform infrastructure decisions (GPU allocation, autoscaling, caching, batching, etc.).
Data Infrastructure Ownership
- Select and manage tools (e.g., Prometheus/Grafana, Datadog, OpenTelemetry, ELK, BigQuery).
- Ensure data pipelines are reliable, accessible, and consistently up to date.
- Create user-friendly dashboards tailored for both technical and non-technical teams.

