Senior Research Engineer, Language Model Evaluations

Hippocratic AI’s mission is to develop the first safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthcare accessibility and health outcomes in the world by bringing deep healthcare expertise to every human. No other technology has the potential to have this level of global impact on health. The company was co-founded by CEO Munjal Shah, alongside a group of physicians, hospital administrators, healthcare professionals, and artificial intelligence researchers from El Camino Health, Johns Hopkins, Washington University in St. Louis, Stanford, Google, and Nvidia. Hippocratic AI has received a total of $120M in funding and is backed by leading investors, including General Catalyst, Andreessen Horowitz, Premji Invest, and SV Angel.

About the role

We are looking for a Research Engineer to lead evaluations for Hippocratic AI’s 1 trillion+ parameter constellation of Large Language Models. Your job will be to design and implement evaluations that measure the performance and safety of our models. As a Research Engineer focused on evaluation, you’ll work closely with our research and applied science teams to design experiments and build evaluation infrastructure. You’ll help ensure that our LLMs are well benchmarked, with known performance and safety on a wide range of healthcare-related tasks, allowing us to compare against human feedback.


  • You have 5+ years of experience in Python programming / machine learning research
  • You have experience using Large Language Models, and have preferably trained or fine-tuned large models in the past
  • You are comfortable writing code
  • You want to learn more about machine learning research
  • You care about patient safety
  • You want to design and implement rigorous evaluations


  • Building user interfaces for data analysis
  • Developing robust evaluation metrics for language models
  • Handling textual dataset sourcing, curation, and processing tasks at scale
  • Statistics

Representative projects:

  • Designing and running a new evaluation that tests our model’s reasoning capabilities
  • Leading the vision of what it takes to safely evaluate patient safety in the world of Generative AI
  • Devising a consistent but representative evaluation suite for healthcare conversations
  • Running experiments to determine how prompting techniques affect results on industry benchmarks
  • Improving the tooling that researchers use to implement evaluations
  • Explaining our evaluations and their results to internal decision makers and stakeholders
  • Collaborating with a research team to develop a robust evaluation for a new model capability they are developing
