Engineering Manager - Site Reliability
At Collective Health, we put our users first. Our mission is to create a healthcare management platform that is always on and ready when and where our members need it, at their most vulnerable times.
Site Reliability Management at Collective Health involves leading a hardworking team of engineers who are responsible for efficiently maintaining the availability and performance guarantees of products that are changing the way people experience healthcare. By combining systems knowledge with software engineering, we build more robust distributed systems to guarantee the uptime and performance our members need. Our first Site Reliability Manager hire will be responsible for recruiting talented and enthusiastic engineers and building out the SRE presence in Chicago.
The ideal candidate for this role is technical, with hands-on experience running and leading production services. They can communicate well, set the tone, and maintain relationships that span multiple engineers, engineering teams, problem domains, and time zones. They understand the bigger picture and can lead the effort to define and meet Collective Health's reliability, performance, and efficiency goals. They understand that in order to meet those goals, they may have to take things apart to understand why certain trade-offs were made, and rebuild them. They understand how to respond quickly and efficiently in an emergency and that the ultimate goal is to drive the frequency and cost of emergencies to zero.
SREs understand that every system fails, and when it does, we run blameless postmortems to identify areas for improvement so that we can iteratively improve product quality. We use a number of tools and approaches to solve a broad set of problems. We focus on optimizing systems, building infrastructure, and eliminating work through automation. We set hard limits on how much time is spent on operational work, and we actively work to identify and mitigate potential causes of outages.
- Build out and lead a team of software and systems engineers to deliver three or more nines (99.9%+) of reliability to our members
- Lead end-to-end availability and performance of critical services and reduce the operational costs of those services through automation
- Lead by example, care for your team, and establish credibility with the quality of your own and your team's technical execution
- Manage on-call duties such as incident response and management of rotation schedules
- Take an active role in the design and implementation of software and systems that help measure and improve the availability, performance, and efficiency of Collective Health services
- Bachelor's degree in Computer Science or a related technical field, and a track record of effective management and communication.
- Proven expertise in recruiting and developing small to medium-sized teams of enthusiastic, experienced engineers.
- Experience with any of the following or similar technologies: Kubernetes, Docker, Postgres, etcd, Elasticsearch, or related scheduling and persistence services; Apache Kafka, Kafka Streams, RabbitMQ, Apache Spark, or related messaging and data pipeline technologies
- Experience in at least one of the following areas of software development: refactoring code, test-driven development, build infrastructure, debugging, building tools and testing frameworks
- Hands-on technical experience
- Deep understanding of private and public cloud design considerations and limitations in the areas of infrastructure, distributed systems, load balancing and networking, data storage, ETL, and security
- Capable of technical deep dives into code, networking, operating systems, and storage, yet verbally and cognitively agile enough to hold your own in strategy discussions.
- Proficiency in algorithms, data structures, complexity analysis, and software design, and/or expertise in Unix/Linux systems, IP networking, and diagnosing performance and application issues.
- Expertise in problem solving and analyzing distributed systems.
At Collective Health, we care about creating a culture of diversity, openness, and transparency, while engaging our intellectual curiosity and our problem-solving and software engineering skills. This is vital to maintaining an agile engineering culture while putting a robust user experience front and center. We bring together people with a wide variety of backgrounds and perspectives, and create an environment where their passions are supported and where they are mentored so they can learn and grow.
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.