Site Ops Incident Manager
What We Do
Uptake is a Chicago-based predictive analytics SaaS platform provider that empowers major industry leaders to optimize performance, reduce asset failures and enhance safety. At Uptake, we combine our strengths in machine learning, analytics, data visualization and software development with the expertise of our industrial partners. The result is enormous savings in development time and resources for Uptake’s partners and a proven industrial-grade software platform that delivers value to partners and their end customers.
What You’ll Do
As an Incident Manager, you’ll perform Incident Management functions critical to Uptake’s applications and infrastructure. The IM will sit on our Site Operations team and be responsible for leading restoration of site-impacting incidents through ownership of outage bridge calls/meetings, triage and investigation of infrastructure and application health, and orchestration of available resources to drive resolution of degraded systems as quickly as possible. A strong understanding of SaaS and infrastructure fundamentals is key for this position, as are communication skills and the ability to work both individually and across a globally diverse group of engineers and support staff.
Responsibilities:
- Own and drive restoration and coordinate efforts for major incidents across multiple support teams
- Track, report on, follow up on, and maintain incident tickets, making sure they are resolved within set Service Level Agreements
- Foster IT best practices for Incident Management, including detection, triage, assessment, troubleshooting and restoration
- Identify problems affecting site/infrastructure resiliency, availability and performance
- Mentor colleagues and take part in onboarding new hires
- Perform Problem Management, including Root Cause Analysis (we use JIRA for tracking and documenting)
- Create, maintain, and assign Standard Operating Procedures to facilitate knowledge transfer
- Occasionally handle Change Management (we currently use JIRA for change tickets)
Requirements:
- 6+ years of experience supporting large-scale web applications and infrastructure
- 2 to 4 years in an operational or analytical role
- Experience as an Incident Manager, Operations Manager, or Problem Manager
- ITIL certification (at minimum, the ITIL Foundation Certificate in IT Service Management)
- Strong analytical and problem-solving skills
- Technical background or ability to pick up technology concepts quickly
- Familiarity with SaaS or e-commerce website architecture
- Exposure to ITSM/ITIL processes such as change, incident, problem and capacity management
- Documentation, reporting, and organizational skills: JIRA, JQL queries, Confluence, Excel
- Demonstrated statistical modeling capability
- Experience using monitoring tools such as Grafana, Zabbix, and New Relic
Preferred skills:
- Excellent written and oral communication and interpersonal skills
- Ability to identify goals and work independently
- Knowledge of core e-commerce technologies including cloud, web services and multi-tier architectures
- Ability to define and optimize processes
- Ability to work collaboratively in a fast-paced, entrepreneurial environment