Lead Site Reliability Engineer

About Us

Observe.AI is the fastest way to boost contact center performance with live conversation intelligence. Built on the most accurate AI engine in the industry, Observe.AI uncovers insights from 100% of customer interactions and maximizes frontline team performance through coaching and end-to-end workflow automation. With Observe.AI, companies can act faster with real-time insights and guidance to improve performance, from more sales to higher retention.

Observe.AI is trusted by hundreds of customers and partners, including Pearson, Accolade, Group 1 Automotive, Southeast Trans, and Public Storage. Our recent $125M Series C, led by SoftBank Vision Fund 2 with participation from Zoom Video Communications, Inc., brings our total funding to date to $213M, with investments from Menlo Ventures, Next47, NGP Capital, Emergent Ventures, Scale Ventures, Nexus Ventures, and Y Combinator. For more information, visit www.observe.ai.

The Opportunity

We are looking for a seasoned Lead Site Reliability Engineer to join our team and ensure the highest levels of functionality and uptime for our production services across multiple clusters. This role offers the unique challenge of managing systems that rely on bidirectional communication with clients, primarily built over WebSockets, which introduces additional complexity in system design, uptime, and observability. As a key member of our SRE team, you will apply your expertise in maintaining high SLAs, incident resolution, problem-solving, and cloud infrastructure to maintain and enhance our system architecture.

About the Team

Our team is a dynamic mix of engineers, ranging from recent college graduates to seasoned principal engineers. Several team members have been with the company for over five years, witnessing and contributing to the significant evolution of our systems and the company's growth. Our engineers share a passion for a fast-paced work environment, emphasizing quality and fostering healthy competition.

In this role, you'll have the unique opportunity to work directly with principal engineers, the Director of Product, and other executives. This position offers high visibility within the organization, as it is central to a new product line that the company is heavily invested in. Your contributions will not only influence the immediate project but also the broader strategic direction of the company.

What you'll do day to day as a Lead Site Reliability Engineer

System Optimization: Implement and maintain strategies to enhance the reliability and observability of our complex systems. Ensure robust documentation and troubleshooting procedures are in place to support operational excellence.

Team Collaboration and Leadership: Work closely with a team of skilled engineers who will look to you for guidance and standards on reliability, resilience, and scalability. You will lead and collaborate on projects, sometimes working independently to deliver critical system improvements.

Cross-Geographical Collaboration: Collaborate with engineering teams in India, ensuring smooth coordination and communication.

Cluster Synchronization: Develop and execute strategies to ensure multiple production clusters are synchronized in terms of features, uptime, functionality, and SLA compliance. You will be responsible for creating repeatable processes that maintain consistency across different environments, enhancing the reliability and performance of our systems.

Comprehensive Testing Strategies: Design and document comprehensive test plans that encompass various aspects of system reliability, including integration tests and periodic production tests. These plans will aim to proactively identify and mitigate risks, ensuring continuous system integrity and performance.

Technical Troubleshooting and Problem-Solving: Engage in detailed analysis and troubleshooting of system issues. Develop and refine observability tools and practices to proactively monitor and address system performance.

Scripting and Automation for Efficiency: Identify common resolution, debugging, and release patterns, and take steps to make them more efficient and automated.

What you'll bring to the role

Experience: The ideal candidate has around 7 years of experience in site reliability engineering or related fields, with a strong background in managing high-availability systems.

Technical Expertise: Proficiency in Python or shell scripting, extensive experience with AWS cloud environments and infrastructure management, and knowledge of the system design complexities of real-time, bidirectional communication.

Leadership and Mentorship: Demonstrated ability to lead projects and mentor junior engineers, fostering a collaborative and productive environment.

Communication Skills: Excellent communication skills to effectively manage team interactions and articulate technical challenges and solutions to stakeholders.

Adaptability: Comfort working flexible hours, agreed in advance, to interact with teams across different time zones, ensuring project alignment and timely delivery.

Compensation, Benefits and Perks

Competitive compensation including equity

Excellent medical, dental, and vision insurance options

Flexible time off

Generous holidays and parental leave policies

401K plan

Learning & Development fund to support you in your continuing education journey and professional development

Fun events that support our culture and community of Connect, Collaborate, Celebrate

Our Commitment to Inclusion and Belonging

Observe.AI is an Equal Employment Opportunity employer that proudly pursues and hires a diverse workforce. Observe.AI does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Observe.AI also strives for a healthy and safe workplace and strictly prohibits harassment of any kind.

We welcome all people. We celebrate diversity of all kinds and are committed to creating an inclusive culture built on a foundation of respect for all individuals. We seek to hire, develop, and retain talented people from all backgrounds. Individuals from non-traditional backgrounds, historically marginalized or underrepresented groups are strongly encouraged to apply.

If you are ambitious, make an impact wherever you go, and you're ready to shape the future of Observe.AI, we encourage you to apply. For more information, visit www.observe.ai.
