Senior Site Reliability Engineer

Phenix Real Time Solutions

Sorry, this job was removed at 8:59 a.m. (CST) on Monday, June 21, 2021

Location: Chicago HQ (preferred) / US

Phenix is seeking an experienced Site Reliability Engineer who will be responsible for availability, latency, efficiency, change management, monitoring, emergency response, and capacity planning across our large-scale distributed real-time network. As a member of the Phenix team, you will be building the future of video communications.

We Are Looking For Someone Who:

Is experienced in areas such as automating infrastructure monitoring, release engineering, and continuous delivery
Has developed automated processes in support of the availability, performance, security, and maintainability of 24/7 systems
Understands the inherent tradeoff between frequently delivering features to customers and operating a reliable system
Has a passion for system-wide continuous improvement
Operates at a high level of effectiveness in a fast-paced startup environment

Responsibilities:

Proactively manage the risk associated with feature delivery
Develop service level objectives and determine indicators for platform reliability
Reduce the toil of standard operating procedures through automation
Participate in system design discussions, platform management, and capacity planning
Design and conduct load tests and analyze the results to better understand the limits of our system and how it performs under load
Improve our ability to monitor indicators for platform reliability and performance
Manage software releases from planning stage, through certification in staging, to production release across global PoPs, coordinating with Engineering and Product teams
Lead operational incident response team
Troubleshoot incidents through analysis of system logs
Contribute to Root Cause Analysis (RCA) investigations
Contribute to operations playbooks and documentation
Communicate clearly and openly with internal stakeholders regarding progress, roadblocks, and timelines

Requirements:

MS/BS in Computer Science or a related technical degree preferred
4+ years of experience as a Site Reliability and/or DevOps Engineer
Experience with high-level languages, such as Python, C/C++, and/or JavaScript
Experience with the bash scripting language
Experience with SQL database queries
Experience managing container-based apps using Docker
Experience with git
Experience with large-scale cloud-based operations
Experience with build management technologies
Experience using CI/CD server technologies
Experience with test-driven development
Strong problem-solving ability
Ability to troubleshoot issues in complex distributed software environments
Relentless focus on results and details

Bonus Points:

Familiarity with video streaming: WebRTC, RTP, RTMP, HLS, DASH
Experience with mobile audio/video development
Experience integrating with Slack
Familiarity with HTML5
Familiarity with Node.js

Perks:

Competitive benefits package
Collaborating with and learning from a world-class team of business professionals and technologists
Working with a global and diverse customer base