Site Reliability Engineer - DevOps
ThinkTime, a start-up backed by parent company Productive Edge, is looking for a Site Reliability Engineer. Our SRE is responsible for making thinktime.com highly reliable, fault-tolerant, maintainable, and scalable through monitoring, automation, and identifying needed improvements to the system. The Site Reliability Engineer will participate in and improve the full development life cycle, from inception and design through implementation, production deployment, and ongoing operation. They will do this through a combination of design consulting, reviews, identification of additional tools and processes, capacity planning, and retrospective reviews.
Ideally, you have a development background but have shifted away from day-to-day development, have a passion for improving reliability, maintainability, and application performance, and are an excellent troubleshooter. As a Site Reliability Engineer, you must also have experience with NewRelic APM or a similar platform and be able to guide us in adopting and implementing the best tools and practices.
- Proactively monitor and review application performance.
- Create monitoring dashboards and alerts using NewRelic, SolarWinds DPA for database monitoring, and Logentries for log aggregation
- Share responsibility for responding to production incidents escalated to the Engineering team, and help identify solutions and improvements that prevent those issues from recurring
- Design and implement monitoring and alerting solutions to identify possible issues before they impact customers
- Work with the production support team to adopt monitoring tools and processes.
- Identify improvements to make the system more fault tolerant and scalable, and work with the development team to implement those improvements
- Participate in design reviews and make recommendations to improve the reliability and maintainability of the system
- Help triage and respond to incidents escalated to the Engineering team, including emergencies, escalating to the development team as needed
- Automate operations, including infrastructure changes and releases, by enhancing our existing Ansible and Jenkins-based solution, and recommend automation improvements and additional tools as needed
- Participate in root cause analysis reviews to discuss the root cause of production issues and identify improvements that prevent them from recurring
- Test the resiliency of the system via tools such as Chaos Monkey
- Ensure software has good logging and diagnostics
- Create and maintain operational run books
- Contribute to the overall product roadmap
- Work on feature requests, defects, and other development tasks, in particular those related to monitoring, reliability, and scalability
- Experience with load testing tools a plus!
- Experience with NewRelic or similar infrastructure monitoring and APM solutions.
- Linux Administration experience
- Windows Administration experience
- Experience automating operations, including releases and infrastructure changes
- Experience with Ansible or similar
- Experience working with relational databases, including an understanding of relational table design and SQL
- Experience with ElasticSearch or other NoSQL databases
- Experience with Redis or other distributed caches
- Experience with containers and Kubernetes
- Software Development experience (.NET preferred but not required)