Discover. A brighter future.
With us, you’ll do meaningful work from Day 1. Our collaborative culture is built on three core behaviors: We Play to Win, We Get Better Every Day & We Succeed Together. And we mean it — we want you to grow and make a difference at one of the world's leading digital banking and payments companies. We value what makes you unique so that you have an opportunity to shine.
Come build your future, while being the reason millions of people find a brighter financial future with Discover.
Job Description
Site Reliability Engineers (SREs) are a hybrid of systems and software engineers who are responsible for scaling, automation, and production issue support for applications. SREs have an intense passion for finding and improving efficiencies with infrastructure, development and deployment automation. As an SRE, you will lead the efforts of application deployment, reliability, scalability, availability and performance alongside the engineering and infrastructure teams. Site Reliability Engineers will work closely with our engineering teams to build mature, production-ready services and applications. As part of the SRE team, you will help define our standards for monitoring, alerting, scalability, and production-readiness. You will monitor and report on the uptime of our systems and services, the performance of our applications, and the capacity of our platform.
Additionally, SREs are responsible for provisioning, benchmarking, tuning, and improving the end-to-end customer experience for the Discover website. In our industry, where millions of dollars move every day and milliseconds count in every transaction, you are always looking for ways to ensure our customers get the best response time. You will also be deeply involved in system roadmap planning and release management activities. Overall, you will become a rock star subject matter expert on the operation of the world-class core systems powering our great Fortune 300 company (which really operates like a startup). You will promote a risk-aware culture and ensure efficient and effective risk and compliance management practices by adhering to required standards and processes.
Analyze science, engineering, business, and other data processing problems to implement and improve computer systems. Analyze user requirements, procedures, and problems to automate or improve existing systems and review computer system capabilities, workflow, and scheduling limitations. May analyze or recommend commercially available software.
In this role, you will wear many hats but your skills will be crucial in the following:
- Lead the creation of an SRE team supporting Discover.com [card & bank] and the Discover Mobile application. Discover.com gets 3.5 million and Discover Mobile gets 2.5 million logins daily!
- Help architect how the SRE organization for Discover.com is formed from scratch!
- Supports and maintains software installations and hardware systems - lifecycle management, change management, request management and incident management.
- Designs, develops and tests vended software and non-vended solutions.
- Consults with customer base to gather requirements for solution set.
- Works closely with plan, build, run and infrastructure teams to support key business applications.
- Works closely with senior staff for process improvements and automation opportunities
- Ability to enhance and maintain complex software components and distributed systems.
- Experienced in DevOps skills and methodologies - Create and manage a continuous build, integration, test, and deployment system
- Proficient in monitoring, alerting, analyzing and troubleshooting large scale distributed systems
- Experience with clustering technologies - high availability, resiliency and horizontal scaling. Good understanding of defining and executing High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering tests
- Control application code deployment servers and code deployment methods
- Familiar with OS tuning, optimization and system requirements for vertical scaling
- Understanding of networking concepts and experience with HTTP protocol
- Lead and participate in performance tests; identify bottlenecks, opportunities for optimization and capacity demands
- Monitor and report on SLA/SLO/performance/capacity for a given application's services. Work with business and product owners to establish key performance indicators.
- Partner with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities.
- Control application log collection and analysis - Automate processes and systems configuration/deployment
- Design and architect operational solutions for managing applications and infrastructure, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
- Create and maintain monitoring technologies and processes that improve the visibility to our applications' performance and business metrics and keep operational workload reasonable.
- Proficiency in one or more general purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java
- Automation tools experience such as Chef, Puppet, Ansible. Developing monitoring tools and log analysis tools to manage operations
- Defines and drives adoption of best-in-class monitoring frameworks to accomplish end-to-end application or service monitoring and noiseless alerting with proper telemetry
- Analyze and participate in periodic on-call duties to prevent, solve and automate the response to problems in mission critical services and automated deployments
- Continued curiosity regarding new technologies and evolving best practices
- Work with Release Manager and development teams to deploy software releases
- Self manages the effort split between operational work and engineering work
- Test, maintain, and monitor computer programs and systems, including coordinating the installation of computer programs and systems.
- Troubleshoot program and system malfunctions to restore normal functioning.
- Expand or modify systems to serve new purposes or improve workflow.
- Use the computer in the analysis and solution of business problems, such as development of integrated production and inventory control and cost analysis systems.
- Consult with management to ensure agreement on system principles.
- Bachelor's degree in Information Technology
- 4+ years of experience in software testing or a related field
Bonus Points/Nice to have
- Bachelor’s Degree or Master’s Degree in Information Technology or Computer Science
- 7+ years of experience in software engineering
- 2 years of coding experience using a strongly typed language such as Java or Golang
- 2 years of experience in SRE, DevOps, or similar role
- 2 years of experience with scripting languages like Python / Bash
- Familiar with design principles of monitoring and alerting systems
- Deep knowledge of distributed pub-sub message systems
What are you waiting for? Apply today!
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.