About the job
ML Ops Engineer, Agentic AI Lab (Founding Team)
Location: San Francisco Bay Area
Type: Full-Time
Compensation: Competitive salary + meaningful equity (founding tier)
At fabrion, supported by 8VC, we are assembling a premier team to address one of the industry's most significant infrastructure challenges.
About the Role
Join our AI Lab as we advance the future of intelligent infrastructure through pioneering open-source LLMs, agent-native pipelines, retrieval-augmented generation (RAG), and knowledge-graph-grounded models. We seek an ML Ops Engineer who will serve as the vital link between ML research and production systems, taking charge of automating the model training, deployment, versioning, and observability pipelines that empower our agents and AI data fabric.
In this role, you will take ownership of compute orchestration, GPU infrastructure management, and the fine-tuned model lifecycle, while ensuring model governance and security.
Key Responsibilities
Design and maintain secure, scalable, and automated pipelines for:
LLM fine-tuning (SFT, LoRA) and preference optimization (RLHF, DPO)
RAG embedding pipelines with real-time updates
Model conversion, quantization, and inference deployment
Oversee hybrid compute infrastructure (cloud, on-premises, GPU clusters) for training and inference workloads utilizing Kubernetes, Ray, and Terraform
Containerize models and agents using Docker, ensuring reproducible builds and CI/CD through GitHub Actions or ArgoCD
Establish and enforce model governance, including versioning, metadata management, lineage tracking, reproducibility, and evaluation capture
Develop and manage evaluation and benchmarking frameworks (e.g., OpenLLM-Evals, RAGAS, LangSmith)
Integrate with security and access control mechanisms (OPA, ABAC, Keycloak) to enforce per-tenant model policies
Implement observability practices for model latency, token usage, performance metrics, error tracing, and drift detection
Assist in deploying agentic applications using LangGraph, LangChain, and custom inference solutions
