Principal HPC Engineer

At Hippocratic AI, we are at the forefront of technological innovation, leveraging advanced computing resources to solve complex problems. Our dedicated GPU clusters, including high-end NVIDIA A100 and H100 models, are crucial for our data processing, machine learning, and computational tasks, including the development and optimization of Large Language Models (LLMs).

Position Overview:

As Principal HPC Engineer, you will play a crucial role in designing, implementing, and maintaining our advanced computing infrastructure. Your in-depth knowledge of GPUs, networking, and file systems will enable you to optimize our system performance, ensure reliable operation, and support our growing computational needs.

Responsibilities:

GPU Cluster Management:
- Run high-performance compute services, such as Amazon SageMaker and SageMaker HyperPod, in public cloud environments (AWS, GCP, and Azure).
- Knowledge of hardware components, such as GPUs (including high-end models like NVIDIA A100 and H100), and familiarity with NVIDIA Container Toolkit.
- Experience in managing GPU nodes in cloud environments, ensuring optimal performance and reliability.
- Experience with parallel file systems such as Lustre.
Orchestration and Automation:
- Proficiency in Kubernetes for container orchestration and Slurm for workload management to efficiently distribute tasks across the GPU cluster.
- Experience in setting up and configuring these orchestration tools to ensure high availability and scalability of cluster resources.
Troubleshooting and Debugging:
- Ability to provide in-depth technical support for complex issues, including debugging and troubleshooting high-end GPUs.
- Familiarity with debugging tools and techniques specific to GPU hardware and software.
Performance Optimization:
- Continuous monitoring of system performance to identify bottlenecks and implement solutions to optimize resource utilization and throughput.
- Knowledge of performance tuning techniques for GPU clusters and the ability to apply them effectively.
Security and Compliance:
- Ensure adherence to security best practices and compliance requirements for GPU cluster infrastructure.
- Implementation and management of security protocols and disaster recovery strategies to safeguard cluster resources and data.
Collaboration and Support:
- Work closely with other engineering, research, and applied science teams to understand and support their computational needs.
- Offer guidance and expertise on utilizing the GPU cluster efficiently for various tasks and applications.
- Participate in planning and executing future expansion or enhancement of cluster capabilities to meet evolving computational requirements.

Requirements:

Education:
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field. Master’s degree preferred.
Experience:
- At least 5 years of experience in managing and maintaining GPU clusters, preferably in the cloud, with hands-on experience with NVIDIA A100 and H100 GPUs or similar high-end models.
Technical Skills:
- Proficiency in Kubernetes for container orchestration and management, with experience in deploying, scaling, and managing containerized applications within Kubernetes clusters, including familiarity with AWS Kubernetes services for cloud deployment and management.
- Experience with Slurm for workload management in GPU cluster environments.
- Deep understanding of GPU hardware, including experience with debugging and troubleshooting GPU issues.
- Strong background in Linux/Unix administration, scripting (e.g., Bash, Python), and automation tools, with expertise in Ansible for configuration management and automation tasks.
- Familiarity with network configuration, storage systems, and security protocols relevant to GPU clusters.
- Experience with infrastructure-as-code (IaC) tools such as Terraform.
Problem-Solving:
- Exceptional analytical and problem-solving skills, with the ability to handle complex technical challenges effectively.
Communication:
- Excellent communication and documentation skills, capable of collaborating effectively across diverse teams.

About Hippocratic AI

Hippocratic AI is dedicated to developing a safety-focused large language model (LLM) tailored for the healthcare sector. We firmly believe in the potential of generative AI to significantly enhance global healthcare accessibility, provided it is developed and tested responsibly. Mirroring the principles of the Hippocratic oath that guides medical professionals, our model is designed with the ethos of "Do no Harm."
