Site Reliability Engineer
At Collective Health, we put our users first. Our goal is to create a platform that is up and ready when and where they need it at their most vulnerable times.
Reliability Engineering at Collective Health is a discipline that combines software and systems engineering skills. We use it to build and run secure, fault-tolerant distributed applications that meet the challenges of the healthcare space. We extend and apply modern systems, software, architecture, and development practices to give our customers a more reliable overall healthcare management experience. Collective Health reliability engineers ensure that our critical services, both internal and externally visible, are always there when our users need them. We do this by delivering on uptime guarantees and by managing capacity and performance.
The SRE mindset enables us to deliver better-running production applications quickly and efficiently. We develop a broad understanding of how our systems relate to one another so that we can engineer solutions to hard operational problems. Our focus is on automating away operational effort and complexity, running blameless postmortems, and shifting from reactive responses to outages toward proactive identification and mitigation of operational risks. These practices allow us to iteratively achieve highly reliable services that earn user loyalty and trust, and they keep our work interesting and dynamic day to day.
At Collective Health, we care about creating a culture of diversity, openness, and transparency, while engaging our intellectual curiosity and our problem-solving and risk analysis skills. This is vital to maintaining an agile engineering culture while putting a reliable user experience front and center. We bring together people with a wide variety of backgrounds and perspectives, and we create an environment where their passions are supported and where they are mentored so they can learn and grow.
We’re building the next-generation healthcare platform, and we’re always there to make sure our users have the fastest and most reliable experience possible. We’re proud to be on the leading edge of this important mission.
- Measure and monitor availability, latency, and efficiency to build an overall picture of system health
- Scale systems through automation, and evolve systems by pushing for changes that improve reliability and velocity
- Engage in and improve the development lifecycle of applications—from concept and design, through commit to production deployment, and beyond into operation and iteration
- Practice sustainable incident response and blameless postmortems
- BS degree in Computer Science or a related technical field involving systems engineering and/or coding, or equivalent practical experience
- Passionate about solving challenging problems
- Experience in one or more of the following programming languages: Java, Go (golang), Python, C, C++, Perl, Ruby, or shell scripting
- Experience with Linux internals and/or network administration (e.g., filesystems, system calls, signals, process states, TCP/IP, routing, AWS VPCs, firewalls, AWS Security Groups, IP block management)
- Experience with algorithms, data structures, complexity analysis and software design
- Experience with container build, management, and orchestration
- Expertise in debugging and optimizing systems, and automating routine tasks
- Interest and expertise in designing, analyzing and troubleshooting distributed systems and APIs
- Methodical problem-solving approach, coupled with strong communication skills and an ability to own and drive projects to completion
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.