Lead/Senior Site Reliability Engineer

Sorry, this job was removed at 11:23 a.m. (CST) on Thursday, December 16, 2021

Description

ThinkTime, a start-up company backed by parent company Productive Edge, is leading the retail industry with its revolutionary SaaS and mobile product. With our impressive list of premier global clients and exponential growth, we're looking to add a Lead/Senior Site Reliability Engineer to be part of this exciting time.

You'll be responsible for making thinktime.com highly reliable, fault-tolerant, maintainable, and scalable through monitoring, automation, and identifying improvements needed in the system. You will be involved as a key member in all stages of the software development lifecycle, from inception and design through implementation, production deployment, and ongoing operation. Your involvement will include participating in design reviews, identifying additional tools and processes needed, capacity planning, and retrospective reviews.

Ideally you will have a development background but have since shifted away from day-to-day development, have a passion for improving reliability, maintainability, and application performance, and be an excellent troubleshooter. As a Site Reliability Engineer, you must also have experience with the NewRelic APM or similar platforms and be able to guide us in adopting and implementing the best tools and practices.

Responsibilities:

  • Identify and champion strategies to meet our goals of high reliability, scalability, and minimizing manual work
  • Lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, Incident Postmortems, and others. You'll be a subject matter expert on how the platform operates and a contact point for software engineers
  • Build upon and improve our existing infrastructure automation solution based on Jenkins and Ansible, following an infrastructure-as-code approach, including automation of Kubernetes, GCP, Azure, Windows, and Linux platforms.
  • Proactively monitor and review application performance.
  • Create monitoring dashboards and alerts using NewRelic, plus Solarwinds DPA for database monitoring.
  • Lead and coordinate emergency production incident response efforts
    • Help troubleshoot and resolve emergency production issues, pulling in development staff as needed.
    • Coordinate with managers, developers, and other SRE staff
    • Ensure customer emergency issues are addressed in a timely manner
    • Conduct root cause analysis to identify systemic issues and prevent them from recurring
  • Identify improvements to make the system more fault tolerant and scalable, and work with the development team to implement those improvements
  • Participate in architecture design reviews, with a focus on performance, scalability, maintainability, and reliability.
  • Participate in root cause analysis reviews to identify the root cause of production issues, and recommend improvements to prevent similar issues in the future
  • Test the resiliency of the system via tools such as Chaos Monkey
  • Ensure software has necessary logging and telemetry to allow us to monitor the health of the system and quickly diagnose problems.
  • Create and maintain operational runbooks
  • Contribute to the overall product roadmap
  • Work on feature requests, defects and other development tasks, in particular, those related to monitoring, reliability, and scalability

Skills we're looking for:

  • 5+ years of professional experience starting from a developer role and transitioning to an SRE role.
  • High standards for quality and attention to detail
  • Software Development experience in C#, Java, Python, Go, or another object-oriented language.
  • Experience with NewRelic or similar infrastructure monitoring and APM solutions.
  • Experience troubleshooting production issues for a high-profile, widely used SaaS application
  • Experience with cloud platforms such as Google Cloud Platform or Microsoft Azure
  • Linux Administration experience
  • Windows Administration experience
  • Experience with Ansible, Chef, Puppet, Terraform, or other modern Infrastructure as code technologies
  • Experience working with relational databases, including an understanding of relational table designs, and SQL experience
  • Experience with distributed or microservice architecture systems
  • Experience with containers and Kubernetes is a plus
  • Experience with ElasticSearch or other NoSQL databases is a plus
  • Experience with Redis or other distributed caches is a plus
  • Experience with load testing tools is a plus

Location

PE is in trendy River North with great bars & restaurants nearby. Plus, the office is easy to get to with various train & bus stops being close!
