NVIDIA and Deutsche Telekom are collaborating to create the world's first industrial AI cloud tailored for European manufacturers. This AI factory in Germany will house 10,000 GPUs distributed across NVIDIA DGX B200 systems and RTX Pro Servers. Deutsche Telekom delivers a secure, sovereign, high-speed infrastructure encompassing data centers, operations, security, and AI solutions.

Role Overview:
We are seeking a Senior Network Engineer for Industrial AI Cloud to build and automate the network platform, covering its automation and operational components (switches, firewalls, routers, and border gateways), as a core part of the Industrial AI Cloud environment. In this role, you will provision and manage this stack, implement and refine monitoring, and deploy additional components when necessary. You will coordinate across multiple teams (such as Infrastructure and Platform) to deliver and continuously improve infrastructure services in alignment with ITIL processes.

The Senior Network Engineer will design and implement solutions for automated configuration management, release management, and build, test, and deployment activities. The position involves direct customer interaction to create customized solutions and implementations, including consultancy services.
The role uses technologies including InfiniBand, Cumulus OS, RoCE, UFM, FortiGate firewalls, and Cisco border gateways.

Key Responsibilities:
Operational Coordination: Collaborate with Data Center, IaaS, and PaaS teams to oversee and support network lifecycle activities, including installations, upgrades, changes, and firmware updates, while managing network interconnections and documentation.
Switch and Firewall Management: Provision and maintain InfiniBand switches in compliance with ITIL standards.
Automation Development: Create and maintain automation scripts that orchestrate the full scope of the network platform, fine-tuning configurations throughout the project lifecycle.
OS and Firmware Management: Maintain network environments, applying patches and managing firmware upgrades at scale.
Monitoring and Observability: Implement and oversee effective monitoring mechanisms.
ITIL Process Compliance: Adhere to and improve incident, problem, and change management workflows; document runbooks and standard operating procedures while following ZERO Outage guidelines.
Cross-Team Collaboration: Partner closely with Platform Engineers and AI solution teams to ensure smooth deployments and operations.
High-Speed Fabric Management: Manage a unified network fabric that uses both InfiniBand and Ethernet/RoCE technologies.
Management Network Setup: Oversee a dedicated 1 Gbps Ethernet network and serial consoles for out-of-band (OOB) network management.
PE/CE Data Center Connectivity: Manage CE routers, firewalls, and associated connectivity.
May 4, 2026