Fast, reliable, reproducible AI with GPU live migration
AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly difficult as workloads, hardware, and operations grow more complex.
Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable fast, transparent migration of GPU workloads across instances without losing work. Workloads migrate automatically, improving reliability and throughput while accelerating time to results. Our system operates at the kernel/OS level, requires no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprises, and research clusters.
Cedana's founding team has spent over a decade making computation for AI fast, productive, and reliable. Our research has appeared at NeurIPS and CVPR, and we published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet control planes, and OTA infrastructure that performs reliably over constrained networks. We also bring repeat-founder experience, having built and exited a healthcare AI company.
As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagements from end to end. You’ll work with customers to understand their environments and deploy in them: production SLURM at a university, bare-metal Kubernetes at an inference provider, or a hybrid setup at a Fortune 100 pharma enterprise. You’ll quickly identify their key pain points and use Cedana to solve them. For each customer, you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.
Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.
Cedana is /migrate/resume for compute workloads. We're building a global, real-time system for compute. This means a paradigm shift in how resources are allocated to high-performance computing, numerical simulation, and the training and running of machine learning models. We take a systems-level, deep-tech approach to these problems, working at the Linux kernel layer and with hardware.