Lead/Senior Site Reliability Engineer

Productive Edge

Sorry, this job was removed at 11:23 a.m. (CST) on Thursday, December 16, 2021

Description

ThinkTime, a start-up company backed by parent company Productive Edge, is leading the retail industry with its revolutionary SaaS and mobile product. With our impressive list of premier global clients and exponential growth, we're looking to add a Lead/Senior Site Reliability Engineer to be part of this exciting time.

You'll be responsible for making thinktime.com highly reliable, fault-tolerant, maintainable, and scalable through monitoring, automation, and identifying improvements needed in the system. You will be involved as a key member in all stages of the software development lifecycle, from inception and design through implementation, production deployment, and ongoing operation. Your involvement will include participating in design reviews, identifying additional tools and processes needed, capacity planning, and retrospective reviews.

Ideally, you will have a development background but have shifted away from day-to-day development, have a passion for improving reliability, maintainability, and application performance, and be an excellent troubleshooter. As a Site Reliability Engineer, you must also have experience with the NewRelic APM or similar platforms and be able to guide us in adopting and implementing the best tools and practices.

Responsibilities:

Identify and champion strategies to meet our goals of high reliability, scalability, and minimizing manual work
Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others. You'll be a subject matter expert on how the platform operates and a contact point for software engineers
Build upon and improve our existing infrastructure automation solution based on Jenkins and Ansible, following an infrastructure-as-code approach, including automation of Kubernetes, GCP, Azure, Windows, and Linux platforms.
Proactively monitor and review application performance.
Create monitoring dashboards and alerts using NewRelic, and Solarwinds DPA for database monitoring.
Lead and coordinate emergency production incident response efforts:
  Help troubleshoot and resolve emergency production issues, pulling in development staff as needed.
  Coordinate with managers, developers, and other SRE staff
  Ensure customer emergency issues are addressed in a timely manner
  Conduct root cause analysis to identify systemic issues and prevent them from recurring
Identify improvements to make the system more fault tolerant and scalable, and work with the development team to implement those improvements
Participate in architecture design reviews, with a focus on performance, scalability, maintainability, and reliability.
Participate in root cause analysis reviews to identify the root cause of production issues, and recommend improvements to prevent them from recurring
Test the resiliency of the system via tools such as Chaos Monkey
Ensure software has necessary logging and telemetry to allow us to monitor the health of the system and quickly diagnose problems.
Create and maintain operational runbooks
Contribute to the overall product roadmap
Work on feature requests, defects and other development tasks, in particular, those related to monitoring, reliability, and scalability

Skills we're looking for:

5+ years of professional experience starting from a developer role and transitioning to an SRE role.
High standards for quality and attention to detail
Software Development experience in C#, Java, Python, Go, or another object-oriented language.
Experience with NewRelic or similar infrastructure monitoring and APM solutions.
Experience troubleshooting production issues for a high-profile, widely used SaaS application
Experience with cloud platforms such as Google Cloud Platform or Microsoft Azure
Linux Administration experience
Windows Administration experience
Experience with Ansible, Chef, Puppet, Terraform, or other modern infrastructure-as-code technologies
Experience working with relational databases, including an understanding of relational table designs, and SQL experience
Experience with distributed or microservice architecture systems
Experience with containers and Kubernetes is a plus
Experience with ElasticSearch or other NoSQL databases is a plus
Experience with Redis or other distributed caches is a plus
Experience with load testing tools is a plus
