Distributed Systems Engineer, Data & Inference Platform

adaption

Reposted 13 Days Ago

Remote or Hybrid

Hiring Remotely in CA

Senior level

Design and operate distributed inference systems for LLMs, build large-scale data pipelines, and debug production issues while collaborating with researchers and ML engineers.

The summary above was generated by AI

The Role

You'll build and operate the systems that turn raw compute into useful intelligence — the inference services that serve LLMs at scale and the data pipelines that feed them. One week you're hunting a tail-latency regression in a production inference service handling millions of requests; the next you're redesigning a Ray Data pipeline so it stops melting down at petabyte scale. The work spans architecture, implementation, and the on-call pager that keeps you honest about both. Researchers and ML engineers will hand you workloads that barely run; you'll hand them back systems that run reliably, efficiently, and cheaply enough to matter.

Responsibilities

Serve Models at Scale: Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost across heterogeneous GPU fleets. Batching, scheduling, KV cache management, autoscaling — you own the levers that make inference economical.
Move the Data: Build large-scale data pipelines (Ray Data, Spark, or equivalents) that ingest, transform, and curate the datasets behind training and evaluation. The bottleneck is rarely where people think it is, and you find it.
Debug the Undebuggable: Chase down the failure modes that only emerge under real production traffic — stragglers, head-of-line blocking, silent data corruption, GPU memory fragmentation — and write the postmortems that prevent the next ten. Define SLOs, build the observability to measure them, and own the on-call rotation that defends them.
Partner Across the Stack: Work directly with researchers and ML engineers to take experimental workloads from "runs on one node" to "runs in production." You're a systems partner, not a ticket queue.

Qualifications

5+ years building and operating distributed systems in production.
Deep experience with at least one large-scale data or compute framework (Ray, Spark, Flink, Beam, Dask).
Strong fluency in Python and at least one systems language (Go, Rust, C++).
Working knowledge of the GPU/accelerator stack: CUDA fundamentals, NCCL, mixed precision, memory layout. You don't need to write kernels, but you should know why a workload is bound by what it's bound by.
Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
A track record of owning hard production incidents end-to-end — diagnosis, mitigation, and the durable fix.
Bonus: hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI), modern lakehouse formats (Iceberg, Delta, Hudi), or open-source contributions to relevant projects.

Above all, we're looking for great teammates who make work feel lighter and aren't afraid to go out on a limb with bold ideas. You don't need to be perfect, but you do need to be adaptable. We encourage you to apply, even if you don't check every box.

About Us

Most AI is frozen in place: it doesn't adapt to the world. We think that's backwards. Our mandate is to build efficient intelligence that evolves in real time. Our vision is AI systems that are flexible, personalized, and accessible to everyone. We believe efficiency is what makes this possible: it's how we expand access and ensure innovation benefits the many, not the few. We believe in talent density: bringing together the best and most driven individuals to push the boundaries of continual adaptation. We're looking for builders and creative thinkers ready to shape the next era of intelligence.

Benefits

Flexible work: In-person collaboration in the Bay Area, a distributed global-first team, and team offsites.
Adaption Passport: Annual travel stipend to explore a country you've never visited. We're building intelligence that evolves alongside you, so we encourage you to keep expanding your horizons.
Lunch Stipend: Weekly meal allowance for take-out or grocery delivery.
Well-Being: Comprehensive medical benefits and generous paid time off.
