Site Reliability Engineer
What you get to do:
As a Site Reliability Engineer, you will be an integral member of a dynamic team continuously improving our AWS cloud deployment platform, “automating all the things”, in support of our rapidly expanding product portfolio.
- Lead small-team initiatives to continuously refine our AWS deployment practices for improved reliability, repeatability and security. You’ll create plans, collaborate with other DevOps team members, and coordinate with development and business teams. These high-visibility initiatives will help to increase service levels, lower costs, and deliver features more quickly.
- Write code and scripts to automate provisioning of AWS services and to configure services, using tools and languages including AWS CLI / API, Terraform, Ansible, Python, Bash, and Git.
- Design effective monitoring and alerting (for conditions such as application errors and high memory usage) and log aggregation approaches (to quickly access logs for troubleshooting or to generate reports for trend analysis), using tools including AWS CloudWatch, Sumo Logic, and New Relic. You’ll work closely with business stakeholders to proactively notify them of issues and communicate metrics.
- Configure build pipelines to support automated testing and deployments using tools including Jenkins and AWS CodeDeploy. You’ll configure these pipelines for specific products and help optimize them for performance and scalability.
- Help refine, implement, and verify DevSecOps security practices (including regular security patching, minimum-permissions accounts and policies, and encrypt-everything) in compliance with government and other standards and regulations (such as PCI-DSS), using tools including Tenable and Veracode to analyze and verify compliance.
- Clearly document and diagram deployment-specific aspects of architectures and environments, working closely with Software Engineers, Software Engineers in Test, and others in DevOps.
- Troubleshoot issues in production and other environments, applying debugging and problem-solving techniques (e.g., log analysis, non-invasive tests), working closely with Development and QE teams.
- Suggest improvements to deployment patterns and practices based on lessons learned from past deployments and production issues, and collaborate with the DevOps team to implement them.
- Promote a DevOps culture, including building relationships with other technical and business teams.
What you will bring to the team:
- A strong understanding of Linux administration including Bash scripting
- Networking expertise including VPCs, SDNs (e.g., Amazon / Azure) / VLANs, routers and firewalls
- Familiarity with at least one IaC / CM tool such as Terraform, Ansible, Chef, or Puppet
- Familiarity with at least one code build / deploy tool such as Jenkins, AWS CodeBuild / CodeDeploy
- A bachelor's degree in science, technology, engineering, or a similar field (required)
Experience that will be a great addition to the team:
- AWS administration experience / training including provisioning EC2 instances, VPCs, Elastic Beanstalk, Lambda functions, RDS databases, S3 storage, IAM security, ECS containers, and CloudWatch metrics & logs
- Experience developing and / or deploying serverless functions using AWS Lambda, Azure Functions, or Google Cloud Functions
- Experience developing and / or deploying Docker containers on ECS or Kubernetes
- Experience with SQL, using RDS-PostgreSQL or other DBMS
- Experience with monitoring / alerting tools such as New Relic, Grafana, Prometheus, Sysdig
- Experience with log aggregation tools such as Sumo Logic, Fluentd, ELK, Splunk