Jump Trading Group

HPC Operations Lead

Posted 14 Days Ago
In-Office
Chicago, IL, USA
Senior level
The HPC Operations Lead oversees data center operations, ensuring reliability and excellence in HPC environments, leading teams, and implementing AI-driven improvements.
The summary above was generated by AI

HPC Infrastructure Operations Lead

Location: Chicago or New York (On-site 5 days/week; regular travel to HPC data center sites required)

Jump Trading Group is committed to world-class research. We empower exceptional talents in Mathematics, Physics, and Computer Science to seek scientific boundaries, push through them, and apply cutting-edge research to global financial markets. Our culture is unique. Constant innovation requires fearlessness, creativity, intellectual honesty, and a relentless competitive streak. We believe in winning together and unlocking unique individual talent by incentivizing collaboration and mutual respect. At Jump, research outcomes drive more than superior risk-adjusted returns. We design, develop, and deploy technologies that change our world, fund start-ups across industries, and partner with leading global research organizations and universities to solve problems.

Trading Infrastructure is a global organization of engineers who architect, build, and maintain our world-class infrastructure. From colo design and implementation, to optimizing our exchange connectivity, to building world-class low-latency wide area networks, we leverage research and automation to consistently adapt and innovate our infrastructure to scale and drive our trading and evolving business.

Jump's HPC infrastructure powers some of the most demanding computational workloads in the industry. As our HPC footprint grows, we need a seasoned operations leader to own the reliability, standards, and day-to-day excellence of these environments. This role leads the teams that keep the lights on across Jump's HPC data centers, ensuring maximum uptime through disciplined operations, proactive maintenance, and deep technical expertise in critical facility systems. Heavy, daily use of AI tools is expected in this role—to accelerate decision-making, automate operational workflows, analyze data center telemetry, and continuously raise the bar on how the team operates.

What You'll Do:

Team Leadership & Organizational Ownership
- Lead and manage data center site leads and their teams across multiple HPC facilities; site leads report directly to this role.
- Recruit, mentor, and develop team members while conducting performance reviews and building a culture of operational rigor.
- Direct onsite contractors by providing clear scope and validating completed work.

HPC Data Center Standards, Processes & Preventative Maintenance
- Develop, document, and enforce operational standards and procedures for Jump's HPC data centers covering power, cooling, cabling, and hardware lifecycle.
- Design and own the preventative maintenance program, including scheduled inspections, component replacements, and firmware/capacity reviews to minimize unplanned downtime.
- Drive continuous improvement of operational processes and pursue automation—including AI-driven approaches—to reduce manual effort and human error.

Critical Facility Systems Expertise
- Serve as the subject matter authority on HPC data center power distribution, power striping strategies, and failover/redundancy configurations.
- Own expertise across air cooling, liquid cooling (direct-to-chip, rear-door, CDU-based), and hybrid cooling architectures.
- Maintain deep knowledge of environmental monitoring and controls (temperature, humidity, airflow, leak detection) and ensure systems remain within design parameters.

Monitoring & Incident Response
- Own the HPC data center monitoring strategy end-to-end: define what is monitored, set alerting thresholds, and ensure comprehensive visibility into facility and hardware health.
- Leverage AI tools to analyze telemetry data, identify failure patterns, predict potential issues, and accelerate root cause analysis during incidents.
- Lead critical incident response and drive root cause analysis and corrective actions to prevent recurrence.
- Establish and track operational KPIs including availability, mean time to repair, and efficiency metrics.

Server & Switch Hardware Expertise
- Maintain deep, hands-on knowledge of server hardware architectures including multi-socket platforms, GPU/accelerator configurations, memory subsystems, NVMe/storage controllers, BMC/IPMI management, and firmware lifecycle.
- Maintain deep, hands-on knowledge of network switch hardware including line cards, optics/transceivers, switch fabrics, and platform-specific diagnostics for Arista and Cisco platforms.
- Evaluate new hardware platforms, drive hardware qualification and acceptance testing, and provide informed recommendations on hardware selection.

Hardware Break-Fix
- Own the overall hardware break-fix function across all HPC sites, ensuring rapid diagnosis and resolution for servers, GPUs, network equipment, storage, and facility infrastructure.
- Diagnose complex hardware failures at the component level—CPUs, DIMMs, GPUs, NICs, PSUs, fans, drives, switch line cards, and optics—and direct the team to resolve efficiently.
- Establish escalation paths, SLA targets, and reporting for hardware failures.

Inventory & Spares Management
- Own inventory processes and spares tracking across all HPC facilities, ensuring critical spares are stocked, tracked, and replenished to meet availability targets.
- Maintain accurate asset records for all serialized and consumable inventory.

Planning, Vendor & Budget Management
- Conduct capacity planning for space, power, cooling, and cabling to stay ahead of growth.
- Gather requirements and plan new hardware installations including physical placement, power/cooling needs, and cabling.
- Manage relationships with colocation providers and hardware vendors; negotiate contracts and SLAs.
- Develop and manage operational budgets for equipment, staffing, and facilities.

Networking & Linux
- Possess strong working knowledge of networking concepts including L2/L3 protocols, VLANs, BGP, OSPF, LACP, ECMP, and high-performance fabrics relevant to HPC environments.
- Understand network architectures such as spine-leaf, fat-tree, and high-radix topologies used in HPC clusters.
- Maintain strong Linux systems knowledge—comfortable navigating and troubleshooting at the OS level, including storage, networking, process management, log analysis, and system diagnostics.

AI-Driven Operations
- Use AI tools daily across all aspects of the role: writing and reviewing documentation, analyzing operational data, drafting procedures, managing communications, and problem-solving.
- Champion AI adoption within the team—set the expectation that every team member integrates AI into their daily workflows.
- Identify and implement opportunities where AI can replace or augment manual operational processes.

Cross-Team Partnership
- Partner with HPC Engineering, Network Engineering, and other teams to align operations with research and business needs.
- Ensure compliance with all safety, security, and regulatory requirements.

Travel
- Travel regularly to Jump's HPC data center sites for operational oversight, project execution, and team engagement. This is a core requirement of the role.

Additional duties as assigned or needed.

Skills You'll Need:

- Minimum of 7 years of data center operations experience, with at least 3 years leading teams in 24/7 critical infrastructure environments. HPC environment experience strongly preferred.
- In-depth knowledge of data center power systems, power distribution/striping, and failover/redundancy architectures.
- In-depth knowledge of cooling technologies including air cooling, liquid cooling (direct-to-chip, rear-door heat exchangers, CDUs), and environmental control systems.
- Proven experience building and maintaining preventative maintenance programs and operational standards/procedures.
- Strong experience with data center monitoring platforms (DCIM, BMS, environmental sensors) and defining monitoring/alerting strategies.
- High-energy, results-driven, and able to work under pressure and against tight deadlines.

Technical Skills:

- Deep knowledge of server hardware architectures: multi-socket platforms, GPU/accelerator systems, memory subsystems, NVMe storage, BMC/IPMI, and firmware management.
- Deep knowledge of network switch hardware: line cards, optics/transceivers, switch fabrics, and platform diagnostics across Arista and Cisco platforms.
- Proven hardware break-fix experience with the ability to diagnose failures at the component level (CPUs, DIMMs, GPUs, NICs, PSUs, drives, line cards, optics).
- Strong understanding of networking concepts: L2/L3 protocols, VLANs, BGP, OSPF, LACP, ECMP, and HPC network topologies (spine-leaf, fat-tree).
- Strong Linux systems proficiency—well beyond basic CLI usage. Comfortable with OS-level troubleshooting, storage and network configuration, process management, log analysis, and system diagnostics.
- Experience managing inventory and spares programs for critical infrastructure.
- Structured cabling standards expertise.
- Programming/scripting experience (Python preferred) is a plus.
- Demonstrated heavy use of AI tools (e.g., LLM-based assistants, AI coding tools, AI-driven analytics) in a professional setting. You should already be using AI daily and be eager to push its application further across operations.
- Strong project management skills with multi-site infrastructure deployment experience.
- Knowledge of industry standards including ASHRAE and TIA-942.
- Excellent written and verbal communication skills with the ability to communicate effectively across technical and non-technical audiences.
- Meet physical requirements including working on ladders/elevated platforms and lifting up to 50 lbs.
- Extremely high personal standards for work quality and operational discipline.
- Reliable and predictable availability, including ability to work evenings and weekends as required.
- Willingness and ability to travel regularly to data center sites.
- Bachelor's degree preferred.

HQ

Jump Trading Group Chicago, Illinois, USA Office

600 W Chicago Ave, Suite 600, Chicago, Illinois, United States, 60654
