C-Gen.AI

HPC & MLOps Engineer

Posted 9 Days Ago
Remote or Hybrid
Hiring Remotely in United States
Mid level
Design, deploy, and maintain HPC and MLOps infrastructure across cloud and on-prem clusters. Manage schedulers (Slurm/PBS), optimize MPI/CUDA stacks, storage and networking, automate deployments with Python/Bash, instrument systems, and enable AI training/inference pipelines while collaborating with product and support teams.
The summary above was generated by AI
About C-Gen.AI

At C-Gen.AI, we are pioneering the next generation of AI infrastructure. As a leading technology company, our mission is to deliver world-class AI Infrastructure management solutions that drive innovation, efficiency, and performance at scale. We seek passionate, highly skilled engineers who excel in dynamic environments and are eager to work with cutting-edge technologies to transform the computational landscape.

Role Summary

As an HPC & MLOps Engineer, you will be a foundational member of our Data Center team, integral to the design, deployment, and maintenance of high-performance computing (HPC) and MLOps systems. In this role, you will ensure our clients have access to reliable, scalable, and secure HPC environments. You will collaborate closely with cross-functional teams and work in a dynamic startup environment that rewards initiative and technical excellence.

Key Responsibilities
  • HPC Infrastructure Support:

    • Configure, deploy, monitor, and maintain the C-Gen.AI Cluster solution for a diverse client base.

    • Manage both cloud and on-premise deployments, ensuring optimal job scheduling and resource allocation.

    • Troubleshoot and optimize HPC library stacks (including OpenMPI, CUDA, TensorFlow, and PyTorch) and manage parallel file systems and other storage back ends (e.g., Lustre, BeeGFS, Ceph, NFS, or object storage).

  • Cloud & On-Prem Automation:

    • Develop and oversee automated deployments across multiple platforms (Cloud providers or on-premise clusters).

    • Implement best practices for network configuration, security, and cost optimization tailored to HPC needs.

    • Create and maintain Bash/Python scripts that streamline workflows, gather essential metrics, and empower self-service HPC management.

  • MLOps Enablement: Contribute to the development and maintenance of HPC workflows for AI/ML teams, enhancing training and inference pipelines.

  • Potential HPC Development:

    • Collaborate with our in-house HPC experts on advanced projects involving performance tuning, MPI, and GPU parallelization.

    • While not mandatory at present, there is ample opportunity to expand your technical skill set in HPC development over time.

  • Reliability & Visibility: Instrument systems, collect metrics, and build dashboards/alerts that enable self-service and rapid incident response.

  • Collaboration: Work closely with product, support, and customer teams; document designs, runbooks, and standards.

What You Bring
  • Experience: 3+ years hands-on in HPC operations; strong familiarity with Slurm (or similar batch systems such as PBS Pro/OpenPBS).

  • Software skills: Solid Python (including asyncio/task-oriented patterns) and Bash for automation, tooling, and data handling.

  • Scheduling & process management: Deep understanding of batch queues, job lifecycle, and multi-tenant cluster policies.

  • Linux: Strong understanding of Red Hat- and Debian-based Linux distributions and system administration.

  • Cloud & hybrid: Proficiency with at least one major cloud (AWS/GCP/Azure/OCI) and interest in others; comfortable bridging cloud and on-premises deployments.

  • Storage & performance: Practical knowledge of Lustre/Ceph/NFS/object storage and the throughput/latency trade-offs common to HPC.

  • ML/HPC fluency: Understanding of AI training/inference workflows, GPU scheduling, drivers, and runtime management.

  • Networking: Confidence with high-performance networking (e.g., RDMA/InfiniBand/RoCE) and compiling Linux modules for network support.

  • Communication: Clear written/spoken English; able to collaborate across time zones and functions.

  • Mindset: Self-directed, detail-oriented, and comfortable in a fast-moving startup.

Nice to Have
  • Python libraries & SDKs: asyncio, aiohttp, cloud SDKs (e.g., boto3, google-cloud, azure-sdk, OCI).

  • Performance tuning: CUDA/NCCL profiling, MPI optimization, kernel/sysctl tuning, GPU/CPU/IO benchmarking.

  • Cost optimization: Experience balancing cost, performance, and reliability—especially in cloud bursting scenarios.

  • C/C++ Experience: Having C/C++ and systems programming experience is a big plus.

Why Join C-Gen.AI?
  • High Impact & Growth: Play a crucial role in a startup with unicorn potential, driving innovation at every level.

  • Ownership & Influence: Enjoy significant autonomy with a direct influence on both technical decisions and team culture.

  • Competitive Compensation: Competitive salary, stock options, and flexible remote work arrangements.

  • Mentorship & Learning: Collaborate directly with our technical CEO and an experienced HPC specialist, ensuring continuous professional growth.

  • Culture of Autonomy: Thrive in an environment that values trust, minimal supervision, and individual initiative.

If you’re excited to build the foundation of large-scale AI, we’d love to hear from you. Apply now and help shape the future of AI infrastructure.
