ML Ops Engineer - Founding Team at Agentic AI Lab

fabrion, San Francisco Bay Area
On-site, Full-time

Experience Level

Entry Level

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field

  • Experience with ML Ops practices and tools

  • Proficiency in programming languages such as Python, and familiarity with containerization technologies like Docker

  • Understanding of cloud computing platforms and orchestration tools (Kubernetes, Terraform)

  • Knowledge of ML frameworks (TensorFlow, PyTorch, etc.) and model governance

  • Excellent problem-solving skills and ability to work collaboratively in a team environment

About the job

ML Ops Engineer, Agentic AI Lab (Founding Team)

Location: San Francisco Bay Area

Type: Full-Time

Compensation: Competitive salary + meaningful equity (founding tier)

At fabrion, supported by 8VC, we are assembling a premier team to address one of the industry's most significant infrastructure challenges.

About the Role

Join our AI Lab as we advance the future of intelligent infrastructure through pioneering open-source LLMs, agent-native pipelines, retrieval-augmented generation (RAG), and knowledge-graph-grounded models. We seek an ML Ops Engineer who will serve as the vital link between ML research and production systems, taking charge of automating the model training, deployment, versioning, and observability pipelines that empower our agents and AI data fabric.

In this role, you will handle compute orchestration, GPU infrastructure management, and the lifecycle of fine-tuned models, and you will ensure model governance and security.

Key Responsibilities

  • Design and maintain secure, scalable, and automated pipelines for:

      • LLM fine-tuning, SFT, LoRA, RLHF, and DPO training

      • RAG embedding pipelines with real-time updates

      • Model conversion, quantization, and inference deployment

  • Oversee hybrid compute infrastructure (cloud, on-premises, GPU clusters) for training and inference workloads utilizing Kubernetes, Ray, and Terraform

  • Containerize models and agents using Docker, ensuring reproducible builds and CI/CD through GitHub Actions or ArgoCD

  • Establish and enforce model governance, including versioning, metadata management, lineage tracking, reproducibility, and evaluation capture

  • Develop and manage evaluation and benchmarking frameworks (e.g. OpenLLM-Evals, RAGAS, LangSmith)

  • Integrate with security and access control mechanisms (OPA, ABAC, Keycloak) to implement model policies by tenant

  • Implement observability practices for model latency, token usage, performance metrics, error tracing, and drift detection

  • Assist in deploying agentic applications using LangGraph, LangChain, and custom inference solutions

About fabrion

fabrion is a forward-thinking company backed by 8VC, dedicated to building a world-class team that addresses critical infrastructure challenges in the AI domain. We are committed to innovation and excellence in the AI landscape.
