ML Performance Engineer, Deep Learning
About Stability:
Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most considerable advances grow from our diversity in working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D and Biology.
About the role:
We are looking for a talented ML Performance Engineer with a focus on Deep Learning and High-Performance Computing who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve efficiency across our computing capacity as it scales.
Responsibilities:
Optimizing Deep Learning Workflows:
- Monitor reports and dashboards to detect low-utilization jobs, projects, and users
- Partner with researchers to review their workflows when performance falls short
- Identify bottlenecks and suggest scripting optimisations
- For high-scale jobs, introduce AWS proprietary profilers and libraries to boost performance
- Scale-up gating process: check script performance and vet requests to scale up
- Build a knowledge base / best practices documentation for all researchers
- Implement monitoring of CPU usage levels for our CPU clusters; identify users who need help writing code that makes full use of the CPUs
- Train researchers on best practices for implementing automation strategies that minimize the human oversight jobs require
Develop and Test Strategies for Future Workloads:
- Benchmark the capabilities of new systems and identify strategies to properly utilize them (H100, TRN2, TPUv5, Intel Gaudi)
- Define the minimum needs for storage speeds and find better data loading strategies to support high processing demands of the new accelerators
Qualifications:
- At least 8 years of relevant experience
- Applied programming experience in Python, C, and/or C++
- Experience with libraries and tools like PyTorch and CUDA
- Experience in building, productizing and monitoring orchestration pipelines for AI and machine learning workloads
- Experience with training frameworks like Megatron, NVIDIA frameworks, or similar
- Experience in leading more junior engineers
- Experience with AWS and/or GCP
- Experience with or exposure to CI and infrastructure tools (e.g. Kubernetes) is a nice-to-have
- Experience with Linux-based environments and scripting (Shell Scripting, Python, PowerShell)
- Ability to work well as an individual contributor as well as within a multidisciplinary team environment
- Strong communicator with excellent interpersonal skills and a can-do attitude, able to thrive in a fast-paced team environment
Compensation:
The salary range for this role is between $146,300 and $271,700. Individual pay within the range is based on factors like job-related skills and experience. Total compensation also includes stock options and benefits.
Equal Employment Opportunity:
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.