Join our Reliability team as a Staff Software Engineer, where you’ll own the reliability of operating Temporal Cloud end to end. You will help define and measure reliability expectations, harden systems through gamedays and chaos testing, and build the tooling and practices that make reliability visible and continuously improving across services and operational processes. We’re looking for someone who thrives in ambiguity, enjoys turning reliability goals into concrete engineering work, and can lead cross-team efforts that make systems more resilient at scale.
What You’ll Do
- Own reliability outcomes for operating Temporal Cloud end to end, partnering across engineering, infrastructure, and product to drive measurable improvements.
- Define, implement, and evolve reliability targets and associated practices, including alerting thresholds, operational readiness criteria, and escalation paths.
- Plan and run gamedays to validate incident response, operational procedures, and cross-team coordination under realistic failure scenarios.
- Build and scale a chaos testing program that exercises failure modes safely and drives remediation work that reduces real risk.
- Define and maintain a reliability scorecard across services and key operational processes, and use it to prioritize reliability investments.
- Lead load testing and performance testing efforts, including test design, tooling, and analysis of bottlenecks and capacity constraints.
- Improve observability standards (metrics, logs, traces, dashboards) so reliability signals are consistent, actionable, and easy to audit.
- Drive post-incident learning and corrective actions, ensuring fixes are durable and reduce recurrence risk over time.
- Make system-level tradeoffs across reliability, performance, cost, and velocity, and document decisions clearly for long-term maintainability.
- Mentor other engineers and raise the bar on reliability engineering practices across teams.
- Strong computer science fundamentals, especially in distributed systems, concurrency, and performance.
- Demonstrated ability to design and build complex systems that operate reliably under high load and partial failure.
- Experience driving reliability improvements across multiple services, not just within a single codebase.
- Hands-on experience with at least one of: gamedays, chaos testing, load testing, or building reliability scorecards.
- Strong judgment in ambiguous situations, including the ability to prioritize reliability work based on risk and impact.
- Excellent communication skills, including the ability to align multiple stakeholders on reliability goals, plans, and tradeoffs.
- A collaborative mindset and a track record of mentoring and leveling up engineering practices.
- Experience operating multi-tenant systems and designing protections against noisy-neighbor behaviors.
- Deep expertise in observability (metrics design, tracing strategy, dashboard standards) and alert hygiene.
- Experience building internal platforms or tooling that enables other teams to meet reliability standards.
- Familiarity with workflow orchestration systems or durable execution platforms.
- Open source contributions, especially in infrastructure or distributed systems.
- Base Salary Range: $212,000 - $286,200, depending on qualifications and location
- Additionally, this role is eligible to participate in Temporal's equity plan.
- Unlimited PTO, 12 Holidays + 2 Floating Holidays
- 100% Premiums Coverage for Medical, Dental, and Vision
- AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
- Empower 401K Plan
- Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more!
Paid Time Off (PTO) and Benefits outside the United States vary by country, and are issued in partnership with Remote.com. Additionally, Temporal offers perks to all international employees for learning & career development, a lifestyle spending account, in-home office setup (in addition to company-issued hardware), professional memberships, work-from-home meals, and access to the Calm app for mental wellness.
Travel
Temporal is a globally distributed, collaborative team that values opportunities for in-person connection. Occasional travel may be required for company events, team offsites, and other meaningful moments that bring us together.
- $3,600 / Year Work from Home Meals
- $1,800 / Year Professional Enrichment (Career Development & Professional Memberships)
- $1,200 / Year Lifestyle Spending Account
- $1,000 / Year In-Home Office Setup (in addition to Temporal-issued equipment: laptop, monitor, keyboard, mouse, trackpad, and extension power cable, at no cost to you)
- $74 / Month Reimbursement for Internet
- Calm App Subscription for Mental Health & Wellness