This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.
About the Team
We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.
How You’ll Make an Impact
You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.
Responsibilities
Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.
Compute Orchestration: Manage and scale GPU clusters in the cloud (AWS) or on-premises, implementing efficient scheduling, auto-scaling, and spot-instance management to control costs.
Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.
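To make the inference-cost responsibility above concrete: a serving engine's sustained throughput and the GPU's hourly price together determine cost per token, the metric this role optimizes. A minimal sketch in Python (the hourly rate and throughput figures are hypothetical placeholders, not vendor quotes):

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens on a single GPU.

    gpu_hourly_usd: on-demand or spot price per GPU-hour (hypothetical here).
    tokens_per_second: sustained decode throughput of the serving engine.
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative numbers only: a $4.00/hr GPU sustaining 2,000 tok/s
# works out to about $0.56 per million tokens.
print(round(cost_per_million_tokens(4.00, 2000), 2))
```

Doubling throughput at the same hourly price halves cost per token, which is why engine-level optimization (batching, quantization, paged attention) shows up directly on the cloud bill.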
Experience You’ll Bring
5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.
Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.
Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.
Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.
Automation First: Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.
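The "automate toil" point above can be illustrated with a small Python sketch: encoding the judgment call of which GPU nodes to cordon for repair, so it runs as a script instead of a manual checklist. The health-signal fields and thresholds below are invented for illustration, not a real policy.

```python
from dataclasses import dataclass

@dataclass
class GpuNode:
    name: str
    xid_errors: int       # recent NVIDIA XID error count (hypothetical signal)
    ecc_uncorrected: int  # uncorrected ECC memory errors
    ready: bool           # Kubernetes node Ready condition

def nodes_to_cordon(nodes: list[GpuNode]) -> list[str]:
    """Return names of nodes that should be cordoned for repair.

    Illustrative policy: any uncorrected ECC error, repeated XID errors,
    or a NotReady condition takes the node out of the scheduling pool.
    """
    return [
        n.name for n in nodes
        if n.ecc_uncorrected > 0 or n.xid_errors >= 3 or not n.ready
    ]

fleet = [
    GpuNode("gpu-a100-01", xid_errors=0, ecc_uncorrected=0, ready=True),
    GpuNode("gpu-a100-02", xid_errors=5, ecc_uncorrected=0, ready=True),
    GpuNode("gpu-h100-01", xid_errors=0, ecc_uncorrected=1, ready=True),
]
print(nodes_to_cordon(fleet))  # -> ['gpu-a100-02', 'gpu-h100-01']
```

In practice the output would feed `kubectl cordon` or a cluster API call; keeping the decision logic as a pure function makes the policy itself testable.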
Technical Skills
Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.
Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.
Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.
Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.
Observability: Proficiency with Prometheus, Grafana, Datadog, and tracing tools (OpenTelemetry).
Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).
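On the observability side, much of the day-to-day reduces to percentile math over latency samples (the p50/p99 figures behind the dashboards and alerts listed above). A minimal nearest-rank sketch; the sample data and the 500 ms SLO threshold are illustrative:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies (ms) for one model endpoint.
latencies_ms = [42, 38, 51, 47, 120, 44, 39, 43, 46, 300]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Illustrative SLO: p99 under 500 ms for this endpoint.
print(p50, p99, p99 < 500)  # -> 44 300 True
```

Production systems would compute this server-side (e.g. Prometheus histograms) rather than over raw samples, but the tail-vs-median distinction it shows is exactly why p99 alerts exist.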