Linux System Engineer

About Us

At Phantom AI, we’ve built a team of incredibly talented and ambitious people challenging the norm in the automotive industry. We are building cost-effective L2/L3 solutions to reduce the burden of everyday driving and make the roads safe for everyone. For instance, we believe democratizing technologies such as Automatic Emergency Braking and Emergency Lane Support is the first priority before tackling a fully self-driving vehicle. Our main customers are Tier 1 automotive manufacturers who are focused on delivering L2/L3 solutions and in the future will deliver full autonomy.

We differentiate ourselves from other autonomous driving startups through a combination of state-of-the-art technological know-how and real automotive experience shipping ADAS systems at volume production scale. If you feel that you have the passion, commitment, and drive to challenge the status quo within the automotive industry, we would love to hear from you.

Key Responsibilities

Support the GPU-based AI/ML cluster infrastructure, focusing on systems automation, configuration management, and deployment at scale
Improve our cluster health monitoring and auto-recovery pipeline
Work with users on debugging application performance issues
Work with hardware and storage vendors to tune and optimize our servers, TrueNAS storage, and network
Automate and deploy GPU clusters with Ansible
Tune performance and provision operating systems on Linux systems
Manage HPC clusters, workloads and applications
Be available for 24x7 on-call support

Qualifications

Bachelor’s degree in computer science, electrical engineering, or a related field
Strong understanding of Linux fundamentals and performance optimizations (Ubuntu)
Advanced experience configuring and managing Slurm, including setting up clusters from scratch
Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
Experience collaborating with network and data center teams on large-scale cluster builds
Experience with configuration management software, systems monitoring and alerting (Prometheus, Grafana, Telegraf, Splunk, etc.), and/or administering HPC workload managers (Slurm)
Experience with high-throughput, low-latency networks, GPU-based computing systems, and/or high-performance storage systems
Experience with Slurm and storage management of distributed parallel file systems a plus
3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Benefits

This is a contract position
Office snacks & reimbursable meals* when in-office

Work Type

Remote or In-Office

Equal Opportunity for Diversity & Inclusion

Phantom AI provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.
