ML Performance Engineer, Deep Learning

About Stability:

Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most considerable advances grow from our diversity in working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D and Biology.

About the role:

We are looking for a talented ML Performance Engineer with a focus on Deep Learning and High-Performance Computing who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficiency of our computing capacity.

Responsibilities:

Optimizing Deep Learning Workflows:

Monitor reports and dashboards to detect low-utilization jobs, projects, and users
Partner with researchers to review their workflows when performance falls short
Identify bottlenecks and suggest scripting optimisations
For high-scale jobs, introduce AWS proprietary profilers and libraries to boost performance
Scale-up gating process: check script performance and vet requests to scale up
Build a knowledge base / best practices documentation for all researchers
Implement monitoring of CPU usage levels for our CPU clusters; identify users who need help writing code that makes full use of the CPUs
Train researchers on best practices for implementing automation strategies that minimize human oversight of jobs

Develop and Test Strategies for Future Workloads:

Benchmark the capabilities of new systems and identify strategies to utilize them properly (H100, TRN2, TPUv5, Intel Gaudi)
Define the minimum requirements for storage speed and find better data loading strategies to support the high processing demands of the new accelerators

Qualifications:

At least 8 years of relevant experience
Applied programming experience in Python, C, and/or C++
Experience with libraries and tools like PyTorch and CUDA
Experience in building, productizing and monitoring orchestration pipelines for AI and machine learning workloads
Experience with training frameworks like Megatron (NVIDIA) or similar frameworks
Experience in leading more junior engineers
Experience with AWS and/or GCP
Experience with or exposure to CI and infrastructure tools (e.g. Kubernetes) is a nice-to-have
Experience with Linux-based environments and scripting (Shell Scripting, Python, Powershell)
Ability to work well as an individual contributor as well as within a multidisciplinary team environment
Strong communicator with excellent interpersonal skills and a can-do attitude, able to thrive in a fast-paced team environment

Compensation:

The salary range for this role is between $146,300 and $271,700. Individual pay within the range is based on factors like job-related skills and experience. Total compensation also includes stock options and benefits.

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.
