Lead Site Reliability Engineer
Ideally, you have a development background but have shifted away from day-to-day development, have a passion for improving reliability, maintainability, and application performance, and are an excellent troubleshooter. As a Site Reliability Engineer, you must also have experience with NewRelic APM or a similar platform and be able to guide us in adopting and implementing the best tools and practices.
- Proactively monitor and review application performance.
- Create monitoring dashboards and alerts using NewRelic, SolarWinds DPA for database monitoring, and Logentries for log aggregation
- Share responsibility for responding to production incidents that are escalated to the Engineering team, and help identify solutions and improvements that prevent issues from recurring
- Design and implement monitoring and alerting solutions to identify possible issues before they impact customers
- Work with the production support team to adopt monitoring tools and processes.
- Identify improvements to make the system more fault tolerant and scalable, and work with the development team to implement those improvements
- Participate in design reviews and make recommendations to improve the reliability and maintainability of the system
- Help triage and respond to incidents escalated to the Engineering team, including emergencies, escalating to the development team as needed
- Automate operations, including infrastructure changes and releases, by enhancing our existing Ansible- and Jenkins-based solution, and recommend automation improvements and additional tools as needed
- Participate in root cause analysis reviews to discuss the root cause of production issues, and identify improvements that prevent those issues from recurring
- Test the resiliency of the system via tools such as Chaos Monkey
- Ensure software has good logging and diagnostics
- Create and maintain operational runbooks
- Contribute to the overall product roadmap
- Work on feature requests, defects, and other development tasks, particularly those related to monitoring, reliability, and scalability
- Experience with load testing tools a plus!
- 5+ years of professional experience, starting in a developer role and transitioning to an SRE role.
- Software Development experience is required (.NET preferred but not required)
- Experience with NewRelic or similar infrastructure monitoring and APM solutions.
- Linux Administration experience
- Windows Administration experience
- Experience automating operations, including releases and infrastructure changes
- Experience with Ansible or similar
- Experience working with relational databases, including SQL and an understanding of relational table design
- Experience with ElasticSearch or other NoSQL databases
- Experience with Redis or other distributed caches
- Experience with containers and Kubernetes