Service Reliability Engineer
Discover. A brighter future.
With Discover, you’ll have the chance to make a difference at one of the world’s leading digital banking and payments companies. From Day 1, you’ll do meaningful work you’re passionate about, with the support and resources you need for success. We value what makes each employee unique and provide a collaborative, team-based culture that gives everyone an opportunity to shine. Be the reason millions of people find a brighter financial future, while building the future you want, here at Discover.
Job Description
At Discover, be part of a culture where diversity, teamwork and collaboration reign. Join a company that is as focused on its employees as it is on its customers and is consistently recognized for both. We’re all about people, and our employees are why Discover is a great place to work. Be the reason we help millions of consumers build a brighter financial future and achieve yours along the way with a rewarding career.
Responsible for the Operational Stability and Performance of one or more Critical Business Services used by Discover Customers and Employees.
Site Reliability Engineers (SREs) are a hybrid of systems and software engineers who are responsible for scaling, automation, and production issue support for applications. SREs have an intense passion for finding and improving efficiencies with infrastructure, development and deployment automation. As an SRE, you will lead the efforts of application deployment, reliability, scalability, availability and performance alongside the engineering and infrastructure teams. Site Reliability Engineers will work closely with our Software Development & Engineering teams to build mature, production-ready services and applications. As part of the SRE team, you will help define our standards for monitoring, alerting, scalability, and production-readiness. You will monitor and report on the uptime of our systems and services, the performance of our applications, and the capacity of our platform.
You will be empowered (yes, empowered) to apply engineering techniques and discipline to production operations and help us deliver the world’s greatest solutions. You will provide feedback into the architecture and application design for each next generation of Payment Services development. If you are the type of person who loves driving technology problem-solving sessions and has a tireless passion to increase the performance, resiliency and availability of IT solutions serving the greatest Customers and Partners in the World, we believe our SRE opportunity will allow you to be the superstar of all superstars!
Responsibilities:
- Operational Performance & Stability: Works with other members of their assigned Value Stream to ensure that the in-scope applications/platforms are meeting performance and stability requirements. This includes managing major incidents to mitigation/resolution.
- Problem Management: Performs post-incident reviews of all major incidents and determines the action items required to avoid similar issues and minimize downtime in future incidents.
- Monitoring and Metrics: Works with Application Development to ensure that assigned applications/platforms have appropriate monitoring and metrics in place to measure performance and stability. Identifies functional and nonfunctional improvements. Acts as the Operations representative in Value Stream planning and prioritization sessions to ensure that the operational needs of the assigned applications/platforms are addressed. Holds quarterly Operational Performance Reviews with Value Stream management.
- Operational Readiness: Ensures that applications/platforms in the Value Stream are operationally ready for production. This includes an annual review of all SOPs/Knowledge Articles, a monitoring review for any new feature launch or other significant change that may impact monitoring, and an SOP/Knowledge Article review for any new feature launch or other significant change that may impact support documentation. Trains Command Center and Application first-level Support on new SOPs, Knowledge Articles, and any other support-related needs. Performs monthly capacity analysis of applications/platforms within the Value Stream. Creates and maintains operationally focused ELK dashboards for the Value Stream.
- Release Planning & Coordination: Works with other members of their assigned Value Stream to ensure that the production releases for their in-scope applications/platforms are properly planned and coordinated. This includes holding Change/Release implementation reviews to ensure thorough and appropriate implementation plans. Provides review and sign-off/approval of change tickets for the assigned Value Stream. Represents the Value Stream in Change Advisory Board Meetings. Participates in Program Increment Planning Sessions as a liaison for Operations and Infrastructure support. Provides information regarding upcoming critical changes to the Value Stream.
- Ability to enhance and maintain complex software components and distributed systems.
- Proficient in monitoring, alerting, analyzing and troubleshooting large-scale distributed systems.
- Defines and drives adoption of best-in-class monitoring frameworks to accomplish end-to-end application or service monitoring.
- Experience with clustering technologies - High Availability, Resiliency, Reliability and Scaling.
- Monitor and report on SLAs/SLOs for a given application's services.
- Develop and maintain business and operational dashboards (ELK) to establish key performance indicators and trends.
- Good understanding of defining and executing High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering tests
- Lead and participate in non-functional testing (performance and resilience) to identify bottlenecks, opportunities for optimization and capacity demands.
- Experience in DevOps skills and methodologies: create and manage continuous build, integration, test, and deployment systems. Control application code deployment servers and code deployment methods.
- Control application log collection and analysis. Automate processes and systems configuration/deployment.
- Partner with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities.
- Design and architect operational solutions for managing applications and infrastructure, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
- Ownership of Release & Change Management – Includes CAB Representation and Implementation of change and software releases
- Partner with and train the L1 & L1.5 Teams, including creation and/or enhancement of SOPs, Knowledge Articles, etc.
- Proficiency in one or more general purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java
- Experience with automation tools such as Chef, Puppet, and Ansible. Develops monitoring and log analysis tools to manage operations.
- Analyze and participate in periodic on-call duties to prevent, solve and automate the response to problems on mission critical services
Minimum Qualifications
The Basics:
- Bachelor's Degree in Information Technology, Engineering, Social Sciences
- 2+ years of experience in Information Technology, or related experience
- In lieu of a degree, 4+ years of experience in Information Technology, or related experience
Preferred Qualifications
Bonus Points If You Have:
- 4+ years of experience in Information Technology, or related experience
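- At least 2, 5, or 10 years of experience in software engineering (for SRE levels 3, 2, and 1, respectively)
- 2, 3, or 5 years of coding experience using a strongly typed language such as Java or Golang (for SRE levels 3, 2, and 1, respectively)
- 2, 3, or 5 years of experience in an SRE, DevOps, or similar role (for SRE levels 3, 2, and 1, respectively)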
- 2 years of experience with scripting languages like Python / Bash
- Familiar with design principles of monitoring and alerting systems
- Understanding of Networking concepts and experience with HTTP protocol
What are you waiting for? Apply today!
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.