Site Reliability Engineer at Discover
At Discover, be part of a culture where diversity, teamwork and collaboration reign. Join a company that is as focused on its employees as it is on its customers and is consistently awarded for both. We’re all about people, and our employees are why Discover is a great place to work. Be the reason we help millions of consumers build a brighter financial future and achieve yours along the way with a rewarding career.
Site Reliability Engineers (SREs) are a hybrid of systems and software engineers who are responsible for scaling, automation, and production issue support for applications. SREs have an intense passion for finding and improving efficiencies with infrastructure, development and deployment automation. As an SRE, you will lead the efforts of application deployment, reliability, scalability, availability and performance alongside the engineering and infrastructure teams. Site Reliability Engineers will work closely with our engineering teams to build mature, production-ready services and applications. As part of the SRE team, you will help define our standards for monitoring, alerting, scalability, and production-readiness. You will monitor and report on the uptime of our systems and services, the performance of our applications, and the capacity of our platform.
You will be empowered (yes, empowered) to apply software engineering techniques and discipline to production operations and help us deliver the world’s greatest solutions. You will provide feedback into the architecture and application design for each next generation of Payment Services development. If you are the type of person who loves driving technology problem-solving sessions and has a tireless passion to increase the performance, resiliency and availability of IT solutions serving the greatest Customers and Partners in the World, we believe our SRE opportunity will allow you to be the superstar of all superstars!
In our industry, where millions of dollars move every day and milliseconds count in every transaction, you are always looking for ways to ensure our customers get the best response time. You will also be deeply involved in system roadmap planning and release management activities. Overall, you will become a rock star subject matter expert on the operation of these world-class core systems powering our great Fortune 300 Company (which really operates like a startup). Additionally, you will promote a risk-aware culture and ensure efficient and effective risk and compliance management practices by adhering to required standards and processes.
In this role, you will wear many hats, but your skills will be crucial in the following:
- Ability to enhance and maintain complex software components and distributed systems.
- Experienced in DevOps skills and methodologies: create and manage continuous build, integration, test, and deployment systems
- Proficient in monitoring, alerting, analyzing and troubleshooting large scale distributed systems
- Experience with clustering technologies – high availability, resiliency and horizontal scaling. Good understanding of defining and executing High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering tests
- Control application code deployment servers and code deployment methods
- Lead and participate in performance tests; identify bottlenecks, opportunities for optimization and capacity demands
- Monitor and report on SLA/SLO for a given application’s services. Work with business and product owners to establish key performance indicators.
- Work with team and leadership to develop the long term Site Reliability Engineering road map.
- Maintain (evaluate and upgrade) all platform-required applications and libraries (Java, Python, etc.)
- Partner with security engineers to develop plans and automation to aggressively and safely respond to new risks and vulnerabilities.
- Control application log collection and analysis – Automate processes and systems configuration/deployment
- Design and architect operational solutions for managing applications and infrastructure, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
- Create and maintain monitoring technologies and processes that improve visibility into our applications’ performance and business metrics and keep operational workload reasonable.
- Proficiency in one or more general purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java
- Experience with automation tools such as Chef, Puppet, Ansible. Developing monitoring tools and log analysis tools to manage operations
- Analyze and participate in periodic on-call duties to prevent, solve and automate the response to problems in mission critical services and automated deployments
- Responsible for GemFire administration, including design, implementation, and ongoing support of the Pivotal GemFire database
- Must be proficient in configuration, debugging, troubleshooting and deployment of GemFire-based solutions in a clustered environment.
- Experience with and understanding of various database technologies: RDBMS, NoSQL, Hadoop and GemFire.
- Knowledge of virtualization technologies and concepts is a plus
- In-depth knowledge of network and distributed internet technologies including TCP/IP, firewalls, load balancers and web application architectures is a plus.
- Expert in Object Query Language (OQL) and indexing
- Experience in performance tuning
At a minimum, here’s what we need from you:
- Bachelor’s Degree in Business, Computer Information Systems, Computer Science, MIS, Engineering, Science, or related field
- 2+ years of experience in Information Technology, or related field
- In lieu of a degree, 4+ years of experience in Information Technology, or related field
If we had our say, we’d also look for:
- At least 5 years of experience in software engineering and/or database administration (GemFire or another in-memory database)
- 2 years of experience in SRE, DevOps, or similar role
- 2 years of experience with scripting languages like Python / Bash
- Proven ability to champion an idea and bring a group to consensus
- Proven ability to translate functional needs into technical solutions
- Excellent problem-solving skills involving complex and ambiguous issues
- Experience working in an Agile Scrum delivery framework
The same way we treat our employees is how we treat all applicants – with respect. Discover Financial Services is an equal opportunity employer (EEO is the law). We thrive on diversity & inclusion. You will be treated fairly throughout our recruiting process and without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status in consideration for a career at Discover.