About the job
Join Integrant as a Lead AI Platform Engineer and play a pivotal role in advancing our AI capabilities.
The Lead AI Platform Engineer will integrate AI workloads with robust, production-grade infrastructure, leveraging the NVIDIA AI stack to deliver high-performance, scalable, and optimized AI systems.
This role emphasizes model optimization, runtime efficiency, and GPU utilization to ensure that AI workloads are production-ready, cost-effective, and reliable in enterprise environments.
Key Responsibilities:
- Transform AI/ML workloads into optimized infrastructure and deployment strategies.
- Enhance model performance within GPU environments, focusing on latency, throughput, and memory utilization.
- Design and execute inference and training pipelines using NVIDIA stack tools such as TensorRT, Triton Inference Server, and NIM.
- Convert and optimize models across various frameworks (e.g., PyTorch to ONNX to TensorRT).
- Identify and mitigate performance bottlenecks using GPU, memory, and network profiling tools.
- Boost GPU utilization and scheduling efficiency across clusters.
- Architect scalable distributed training and inference frameworks.
- Collaborate with clients to outline AI infrastructure strategies and deployment models.
- Oversee production deployments, ensuring effective monitoring, rollback, and performance validation.
- Conduct applied research to enhance model efficiency and infrastructure utilization.
- Mentor team members in AI infrastructure, model optimization, and GPU systems.
- Utilize experiment tracking tools (MLflow, W&B, Neptune) to log parameters, metrics, and artifacts for analysis.
- Detect post-deployment model degradation caused by issues such as concept drift, data pipeline changes, and shifting traffic patterns.
- Perform root cause analysis (RCA) on ML systems by isolating variables and reproducing issues.
