Senior Principal Software Engineer - Site Reliability Engineering Practice
Discover. A brighter future.
With us, you’ll do meaningful work from Day 1. Our collaborative culture is built on three core behaviors: We Play to Win, We Get Better Every Day & We Succeed Together. And we mean it — we want you to grow and make a difference at one of the world's leading digital banking and payments companies. We value what makes you unique so that you have an opportunity to shine.
Come build your future, while being the reason millions of people find a brighter financial future with Discover.
Job Description
Site Reliability Engineering (SRE) applies software engineering techniques and discipline to production operations to attack reliability and performance issues and fix them for good. SREs focus on availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of their services.
As a Senior Principal Software Engineer, you will partner with DFS SREs and external consultants who will be responsible for defining the Direct Banking Application Development organization’s SRE practice. Along with defining the practice, you’ll be responsible for leading a group of external consultants who will work directly with application teams to implement and mature SRE functions. Your final responsibility will be to stand up a Community of Practice for the organization, where all SREs and enthusiasts can collaborate on SRE principles and best practices.
From a practice perspective, the focus will be on defining consistent best practices for teams:
- Define SRE framework
- Define reliable design patterns
- Define canned reliability user stories for feature delivery
- Observability: define what good looks like for baseline monitoring/alerting
- Develop scorecards, gates, and technical debt oversight for the organization
- Define Capacity Management processes: define what good looks like, stress tests, load tests
- Emergency Response: define a consistent problem management process, PIRs
- Culture: Job descriptions, training, common language, definitions
From a chapter perspective, SREs will be accountable for:
- Leading teams in developing SRE playbooks
- Ensuring reliability is built into new designs
- Ensuring canned reliability user stories are executed for every feature
- Performing design reviews of existing apps
- Performing production readiness reviews
- Executing capacity management processes
- Executing chaos testing
- Identifying operational functions that need to be automated
- Bachelor's Degree in Information Technology or related field
- 10+ years Computer Science, Information Technology or Equivalent Experience
- In Lieu of Education, 12+ years Computer Science, Information Technology or Equivalent Experience
- 5+ years of SRE experience in a highly customer-focused environment
- Proficiency in designing resilient app patterns
- Expertise in 24x7 site monitoring and ability to own uptime & performance SLAs for large scale distributed systems
- Expertise and operational experience in operating highly available, scalable and fault-tolerant systems using container platforms
- Familiar with OS tuning, optimization and system requirements for vertical scaling
- Proficiency in one or more general purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java
- Expertise with automation tools such as Chef, Puppet, and Ansible
- Strong leadership skills and the ability to motivate teams.
- Ability to drive change, and motivate engineers to develop simple solutions for complex operational challenges.
- Experience collaborating and partnering effectively with several other teams.
- Experience leading discussions with senior leadership and the ability to tailor the level of technical detail to suit the audience.
- Experience driving business efficiency across evolving platform and channel infrastructures
- Expertise in designing and building well-tested, efficient, reusable, and scalable software
- Experience architecting financial applications using a broad range of technologies, platforms and vendor offerings
- Knowledge of best practices around advanced cloud-based solutions and experience developing strategies for migrating existing workloads to the cloud
#Remote #BI-Remote #LI-LJ1
What are you waiting for? Apply today!
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.