Site Reliability Engineer
Discover. A brighter future.
With Discover, you’ll have the chance to make a difference at one of the world’s leading digital banking and payments companies. From Day 1, you’ll do meaningful work you’re passionate about, with the support and resources you need for success. We value what makes each employee unique and provide a collaborative, team-based culture that gives everyone an opportunity to shine. Be the reason millions of people find a brighter financial future, while building the future you want, here at Discover.
Job Description
What You’ll Do
- Handle responsibilities for operational stability and performance of one or more critical business services used by Discover customers and employees.
How You’ll Do It
Operational stability and performance
Define and drive adoption of best-in-class monitoring frameworks to accomplish end-to-end application or service monitoring and noiseless alerting with proper telemetry.
Lead and participate in performance tests; identify bottlenecks, opportunities for optimization, and capacity demands.
Design and architect operational solutions for managing applications and infrastructure, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
Control application code deployment servers and code deployment methods.
Work with other members of the assigned value stream to ensure that in-scope applications/platforms meet performance and stability requirements. This includes managing major incidents to mitigation/resolution.
Self-manage the effort split between operational work and engineering work.
Problem management:
- Analyze and participate in periodic on-call duties to prevent, solve and automate the response to problems in mission critical services and automated deployments
- Maintain (evaluate and upgrade) all platform-required applications and libraries (Java, Python, etc.)
- Partner with security engineers to develop plans and automation to respond aggressively and safely to new risks and vulnerabilities.
Monitors and metrics:
- Control application log collection and analysis
- Automate processes and systems configuration/deployment
- Create and maintain monitoring technologies and processes that improve visibility into our applications' performance and business metrics and keep operational workload reasonable.
- Monitor and report on SLA/SLO for a given application's services. Work with business and product owners to establish key performance indicators.
- Work with Application Development to ensure that assigned applications/platforms have appropriate monitoring and metrics in place to appropriately measure performance and stability.
Identify functional and non-functional improvements:
- Act as the Operations representative in value stream planning and prioritization sessions to ensure that operational needs of assigned applications/platforms are addressed as needed. Hold quarterly operational performance reviews with value stream management.
Release planning and coordination:
- Work with the team and leadership to develop the long-term Site Reliability Engineering road map.
- Work with the Release Manager and development teams to deploy software releases
- Work with other members of the assigned value stream to ensure that the production releases for in-scope applications/platforms are properly planned and coordinated. This includes holding Change/Release implementation reviews to ensure thorough and appropriate implementation plans.
Review and sign-off/approval of change tickets for the assigned value stream:
- Represent the value stream at Change Advisory Board Meetings.
- Participate in Program Increment Planning Sessions as a liaison for Operations and Infrastructure support.
- Provide information regarding upcoming critical changes to the value stream.
Operational readiness:
- Ensure that applications/platforms in the value stream are operationally ready for production. This includes an annual review of all SOPs/knowledge articles.
- Review monitoring for any new feature launch or other significant change that may impact monitoring.
- Review SOPs/knowledge articles for any new feature launch or other significant change that may impact support documentation.
- Train Command Center and Application 1st level Support on new SOPs, knowledge articles, and any other support-related needs.
- Perform monthly capacity analysis of applications/platforms within the value stream. Create and maintain operationally focused ELK dashboards for the value stream.
Qualifications You’ll Need
The Basics
- Bachelor's degree in business, computer information systems, computer science, MIS, engineering, science, or related field
- 2+ years of experience in information technology, or related field
- In lieu of a degree, 4+ years of experience in Information Technology, or related field
Bonus Points If You Have:
Ability to enhance and maintain complex software components and distributed systems.
Experienced in DevOps skills and methodologies - Create and manage continuous build, integration, test, and deployment systems
Proficient in monitoring, alerting, analyzing and troubleshooting large scale distributed systems
Experience with clustering technologies - high availability, resiliency and horizontal scaling. Good understanding of defining and executing High Availability, Disaster Recovery, Sustained Resiliency, and Chaos Engineering tests
Familiar with OS tuning, optimization and system requirements for vertical scaling
Understanding of networking concepts and experience with the HTTP protocol
Experience with automation tools such as Chef, Puppet, or Ansible, and with developing monitoring and log analysis tools to manage operations
Continued curiosity regarding new technologies and evolving best practices
- 4+ years of experience in technology, or related field
Java
Spring framework
AppDynamics or similar monitoring tool
Kibana or similar logging tool
Jenkins
Cloud technologies – PCF, OCP, k8s
GitHub
REST Services
Chaos Engineering
What are you waiting for? Apply today!
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.