WEX Inc.

Senior AI Infrastructure Engineer

Posted 2 Days Ago
In-Office
6 Locations
122K-146K Annually
Senior level
Design and maintain a Kubernetes-based AI platform for distributed training; optimize GPU usage and automate infrastructure processes to support AI workloads.
The summary above was generated by AI

This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.

About the Team

We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

How You’ll Make an Impact

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Responsibilities
  • Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.

  • Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing engines (e.g., vLLM, TGI, Triton) to maximize throughput and minimize cost per token.

  • Compute Orchestration: Manage and scale GPU clusters in cloud (AWS) or on-prem environments, implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.

  • Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform/Ansible) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.

  • Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.

  • Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.

  • Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.

Experience You’ll Bring
  • 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on Machine Learning infrastructure.

  • Production Expertise: Proven track record of managing large-scale production clusters (Kubernetes) and distributed systems.

  • Hardware Fluency: Deep understanding of GPU architectures (NVIDIA A100/H100), CUDA drivers, and networking requirements for distributed workloads.

  • Serving Proficiency: Experience deploying and scaling open-source LLMs and embedding models using containerized solutions.

  • Automation First: Strong belief in "Everything as Code"—you automate toil wherever possible using Python, Go, or Bash.

Technical Skills
  • Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.

  • Orchestration & Containers: Mastery of Kubernetes (EKS/GKE), Helm, Docker, and container runtimes. Experience with Ray or Slurm is a huge plus.

  • Infrastructure as Code: Advanced skills with Terraform, CloudFormation, or Pulumi.

  • Model Serving: Hands-on experience with serving frameworks like Triton Inference Server, vLLM, Text Generation Inference (TGI), or TorchServe.

  • Cloud Platforms: Deep expertise in AWS (EC2, EKS, SageMaker) or GCP, specifically regarding GPU instance types and networking.

  • Observability: Proficiency with Prometheus, Grafana, DataDog, and tracing tools (OpenTelemetry).

  • Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking (RPC, gRPC).

The base pay range represents the anticipated low and high end of the pay range for this position. Actual pay rates will vary and will be based on various factors, such as your qualifications, skills, competencies, and proficiency for the role. Base pay is one component of WEX's total compensation package. Most sales positions are eligible for commission under the terms of an applicable plan. Non-sales roles are typically eligible for a quarterly or annual bonus based on their role and applicable plan. WEX's comprehensive and market competitive benefits are designed to support your personal and professional well-being. Benefits include health, dental and vision insurances, retirement savings plan, paid time off, health savings account, flexible spending accounts, life insurance, disability insurance, tuition reimbursement, and more. For more information, check out the "About Us" section.

Pay Range: $121,500.00 - $145,500.00

Top Skills

AWS
Bash
Docker
GCP
Go
Grafana
Kubernetes
Prometheus
Python
Terraform
Text Generation Inference
TorchServe
Triton Inference Server
vLLM


