NVIDIA

Senior HPC AI Cluster Engineer

NVIDIA is seeking an experienced HPC Engineer to join the E2E software verification HPC/AI Infrastructure team. This role involves working with groundbreaking technologies to build supercomputers and HPC clusters, contributing to breakthroughs in artificial intelligence and GPU computing. You will provide insights on large-scale system design and tuning mechanisms.

You will collaborate with scientific researchers, developers, and customers, using the latest accelerated computing and deep learning software and hardware platforms to craft improved workflows and develop differentiated solutions. You will also work with HPC, OS, GPU compute, and systems specialists to architect, develop, and bring up large-scale performance platforms.

What you will be doing:

Designing, implementing, and maintaining large-scale HPC/AI clusters, including monitoring, logging, and alerting.
Managing Linux job/workload schedulers and orchestration tools.
Developing and maintaining continuous integration and delivery pipelines.
Creating tooling to automate the deployment and management of large-scale infrastructure environments, automate operational monitoring and alerting, and enable self-service resource consumption.
Deploying monitoring solutions for servers, networks, and storage.
Troubleshooting and resolving issues from bare metal to application level.
Serving as a technical resource for internal teams by developing, refining, and documenting standard methodologies.
Supporting Research & Development activities and participating in Proofs of Concept (POCs) and Proofs of Value (POVs) for future improvements.

What we need to see:

Bachelor's Degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience.
Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.
Experience with job schedulers and workload orchestration tools such as Slurm and Kubernetes (K8s).
Excellent knowledge of Windows and Linux (Red Hat/CentOS, Ubuntu) networking (sockets, firewalls, iptables, Wireshark, etc.), internals, ACLs, OS-level security, and common protocols (TCP, DHCP, DNS, etc.).
Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS, and familiarity with emerging storage technologies.
Proficiency in Python programming and bash scripting.
Experience with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef.
Deep knowledge of networking technologies such as InfiniBand and Ethernet.
Deep understanding of and experience with virtualization platforms (e.g., VMware, Hyper-V, KVM, Citrix).
Familiarity with cloud computing platforms (e.g., AWS, Azure, Google Cloud).

Ways to stand out from the crowd:

Knowledge of CPU and/or GPU architecture.
Knowledge of Kubernetes and container-related microservice technologies.
Experience with GPU-focused hardware/software (DGX, CUDA).
Background with RDMA (InfiniBand or RoCE) fabrics.

NVIDIA has a rich history of innovation in computer graphics, PC gaming, and accelerated computing. We are leveraging the potential of AI to define the future of computing, where GPUs power intelligent systems. Our teams are made up of driven, innovative professionals dedicated to technological advancement. We offer competitive salaries, comprehensive benefits, and a work environment that fosters diversity, inclusion, and flexibility.
