Lead Site Reliability Architect/Automation at Discover
Discover. A brighter future.
With us, you’ll do meaningful work from Day 1. Our collaborative culture is built on three core behaviors: We Play to Win, We Get Better Every Day & We Succeed Together. And we mean it — we want you to grow and make a difference at one of the world's leading digital banking and payments companies. We value what makes you unique so that you have an opportunity to shine.
Come build your future, while being the reason millions of people find a brighter financial future with Discover.
Job Description
Responsible for the technical design, deployment, monitoring and ongoing support and maintenance of a diverse set of cloud technologies. The role is a technical, hands-on opportunity with a heavy focus on automation, resilient design and deployment of cloud-ready systems and services. This role collaborates with Product teams internal and external to IS to provide world-class products and services in support of our application development community, and our business as a whole. This is a 'DevOps' position, responsible for the full-stack engineering and support of products that support our hybrid cloud capabilities.
A Site Reliability Engineer at Discover takes responsibility for new applications going into production to ensure operational excellence (availability, latency, performance, efficiency, problem management, monitoring, emergency response and capacity planning). You will get involved in anything that prevents a system or app from serving its intended purpose, whether slowness or an outage, and work out how we can improve Time to Detect, Time to Fix, and Time to Mitigate issues. You will improve our monitoring solutions and define SLIs/SLOs. You will develop automated solutions using a variety of coding languages. We are organized as a Chapter organization, so you will be expected to champion the SRE mindset across the organization.
In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics.
- Leads the design, build and maintenance of modern cloud platforms that support agile teams.
- Partners with key stakeholders as a platform champion for cloud-native systems, and coaches on how to use platform capabilities effectively through appropriate venues.
- Drives continuous improvement of cloud products & capabilities through internal user groups and external market research.
- Drives innovation and platform evolution, scaling cloud infrastructure to support our growing ecosystem
- Provides reliable, predictable deployment and maintenance of distributed systems while adhering to security best practices
- Writes and designs automation, monitoring, diagnostics and debug tooling to improve troubleshooting and recovery
- Participates in production support and on-call rotations
- Conducts incident management and contributes to associated retrospectives/post-mortems as needed
- Responsible for the stability and performance of critical business services
- Participates in Agile sprints and associated ceremonies
- Bachelor’s Degree in Information Technology or a related field
- 4+ years of application or platform development, consulting, or related experience
- 1+ years of team lead or related experience
- 3+ years in an SRE role
- Well versed with the entire software development lifecycle, DevOps, and SRE practices
- Experience with operational monitoring tools with a mindset towards predictive analysis
- Working knowledge of automation tools such as Ansible, Terraform, or Chef
- Familiar with Pivotal Cloud Foundry (PCF), OpenShift (OCP), Amazon Web Services (AWS), and Google Cloud Platform (GCP)
- A solid understanding of working with git
- Experience with troubleshooting and debugging issues at any level
- Strong knowledge and understanding of microservices-based architectures, APIs, etc.
- Good understanding of networking, including L2 and L3 concepts, firewalls, load balancing, routing and switching.
- A working knowledge of Linux-based systems and virtual machine (VM) technology
- Strong scripting skills, including the ability to write scripts from scratch using Python and/or Bash
- Can identify and mitigate reliability risks
- Excellent communication and troubleshooting skills
- Strong analytical and problem-solving skills
- Basic knowledge and understanding of Security (CIA Model and PCI compliance) is a plus
- Experience with Continuous Integration and Continuous Delivery models including Blue/Green and Canary release models is a plus
- Experience with continuous integration/deployment frameworks such as Jenkins
What are you waiting for? Apply today!
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.