Site Reliability Engineer - DevOps

Productive Edge

Sorry, this job was removed at 1:36 p.m. (CST) on Wednesday, July 31, 2019

Description

ThinkTime, a start-up company backed by parent company Productive Edge, is looking for a Site Reliability Engineer. Our SRE is responsible for making thinktime.com highly reliable, fault-tolerant, maintainable, and scalable through monitoring, automation and identifying improvements needed in the system. The Site Reliability Engineer will participate in and improve the overall life cycle of development, from inception and design through implementation and production deployment, and ongoing operation. They will do this through a combination of design consulting, conducting reviews, identifying additional tools and processes needed, capacity planning, and retrospective reviews.

Ideally, you have a development background but have shifted away from day-to-day development, have a passion for improving reliability, maintainability, and application performance, and are an excellent troubleshooter. As a Site Reliability Engineer, you must also have experience with NewRelic APM or similar platforms and be able to guide us in adopting and implementing the best tools and practices.

Responsibilities:

Proactively monitor and review application performance.
Create monitoring dashboards and alerts using NewRelic, SolarWinds DPA for database monitoring, and Logentries for log aggregation
Be partly responsible for responding to production incidents that are escalated to the Engineering team and helping identify solutions and improvements to prevent issues from occurring in the future
Design and implement monitoring and alerting solutions to identify possible issues before they impact customers
Work with the production support team to adopt monitoring tools and processes.
Identify improvements to make the system more fault tolerant and scalable, and work with the development team to implement those improvements
Participate in design reviews and make recommendations to improve the reliability and maintainability of the system
Help triage and respond to incidents escalated to the Engineering team, including emergencies, escalating to the development team as needed
Automate operations including infrastructure changes and releases by enhancing our existing Ansible & Jenkins-based solution. (Recommend automation improvements and additional tools as needed)
Participate in root cause analysis reviews to discuss the root cause of production issues, and identify improvements to prevent them from recurring
Test the resiliency of the system via tools such as Chaos Monkey
Ensure software has good logging and diagnostics
Create and maintain operational run books
Contribute to the overall product roadmap
Experience with load testing tools a plus!
Work on feature requests, defects and other development tasks, in particular, those related to monitoring, reliability, and scalability

Needed Skills:

Experience with NewRelic or similar infrastructure monitoring and APM solutions.
Linux Administration experience
Windows Administration experience
Experience automating operations, including releases and infrastructure changes
Experience with Ansible or similar
Experience working with relational databases, including an understanding of relational table designs, and SQL experience
Experience with ElasticSearch or other NoSQL databases
Experience with Redis or other distributed caches
Experience with containers and Kubernetes
Software Development experience (.NET preferred but not required)
