{"slug": "reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc", "title": "Reducing container cold start times using SOCI index on DLAMI and DLC", "summary": "AWS has enabled SOCI snapshotter and index support on Deep Learning AMIs and Deep Learning Containers to reduce container cold start times for AI and ML workloads. The SOCI technology uses lazy loading to start containers with only necessary files, cutting startup times from minutes to near-instant for multi-gigabyte images. This addresses bottlenecks in production environments where slow container initialization wastes GPU compute resources and delays scaling events.", "body_md": "# Reducing container cold start times using SOCI index on DLAMI and DLC\n\n[Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html) and [AWS Deep Learning Containers](https://aws.github.io/deep-learning-containers/) are now enabled with support for SOCI snapshotter and index. [Seekable OCI (SOCI)](https://github.com/awslabs/soci-snapshotter) is a technology that enables efficient container image management through selective file downloading. It uses a layer-based indexing system to map file locations within container images, allowing containers to start with only the necessary files loaded (lazy loading). 
This approach reduces network bandwidth usage and improves container startup times, making it particularly valuable for organizations managing large container images in cloud environments.\n\nIn this post, we look at how to use SOCI on publicly available Deep Learning AMIs and Containers, when to use each SOCI mode, and how to adopt SOCI quickly in your workloads.\n\n## Background\n\nAs organizations deploy artificial intelligence (AI) and machine learning (ML) workloads at scale, container startup time has become a bottleneck in production environments. Whether it’s spinning up training jobs, serving inference endpoints, or scaling GPU clusters automatically, the time spent downloading multi-gigabyte container images directly impacts cost, user experience, and operational efficiency. Traditional container deployment approaches force teams to download entire images before workloads can begin. For the multi-gigabyte images commonly used in production, this download can take several minutes. During development, a few minutes of wait time is barely noticeable. In production, those same minutes add up fast.\n\nOrganizations deploying deep learning infrastructure at scale typically encounter several critical challenges:\n\n- Prolonged cold start times. Standard Docker image pulls of 15–20 GB can take 4–6 minutes per instance, delaying training jobs and inference endpoints during scaling events.\n- Wasted compute resources. GPU instances sit idle during image pulls, burning through expensive compute hours while waiting for container initialization to finish.\n- Scaling bottlenecks. When demand spikes trigger automatic scaling, slow container startup times prevent rapid response, leading to degraded performance or dropped requests.\n- Bandwidth constraints. Large-scale deployments pulling massive images simultaneously can saturate network bandwidth, creating cascading delays across the infrastructure.\n- Developer productivity. 
Data scientists and ML engineers waste valuable time waiting for containers to start during iterative development and experimentation cycles.\n\n## Container pulling mechanisms\n\nWhen pulling a container for your workloads, AWS Deep Learning AMIs (DLAMI) and Deep Learning Containers offer three options: the standard Docker pull, SOCI parallel pull, and SOCI lazy loading through SOCI index. Think of these as a sliding scale of tradeoffs. Docker pulls are sequential and slow. SOCI parallel pull provides faster startup times by chunking downloads at the cost of compute resources. SOCI lazy loading provides near-instant container loading but requires files to be fetched on demand. You can use the following guide to choose the right mechanism for your workloads:\n\n- The choice between lazy loading and parallel pull modes depends on the image, instance specifications, and storage configuration. Lazy loading requires images to have a SOCI index. Without one, the system falls back to standard pulling.\n- Lower-spec instances should use lazy loading to conserve resources, while high-spec instances with multiple vCPUs and high network bandwidth benefit from parallel pull mode. 
Storage performance varies: EBS volumes are bounded by their provisioned IOPS and volume type, potentially creating bottlenecks during unpacking, while NVMe instance store delivers maximum I/O performance at the cost of data persistence across instance stop/start cycles.\n\nThe following example shows the various mechanisms based on the vLLM Deep Learning Container:\n\n*Deep Learning Container Pull Mechanisms*\n\n## Solution architecture\n\nThe following diagram shows the architecture for using SOCI with DLAMI and Deep Learning Containers.\n\n## Container startup time comparison with SOCI snapshotter\n\nThe following benchmarks compare standard Docker pulls against SOCI snapshotter in both lazy loading and parallel pull modes.\n\n### Lazy loading mode\n\nLazy loading mode starts containers immediately by fetching only the necessary data on demand, with remaining layers loaded in the background as needed.\n\n#### Prerequisites\n\nSOCI index required\n\n**Important:** Lazy loading mode requires the container image to have a **SOCI index** stored in the registry. Without a SOCI index, the snapshotter will fall back to standard pull behavior, and you won’t see any performance improvement. **AWS Deep Learning Containers** (DLCs) with the `-soci` tag suffix come with SOCI indexes pre-created and pushed to the registry, enabling lazy loading out of the box. For custom images, you must [create and push SOCI indexes](https://github.com/awslabs/soci-snapshotter/blob/main/docs/getting-started.md).\n\n#### Environment\n\n- **Instance Type**: g5.2xlarge\n- **EBS**: Size 500GiB, IOPS 3000, Throughput 125 MB/s\n- **AMI**: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04) 20260413 (`ami-06abbbf2049359343`)\n- **Docker Image**: `public.ecr.aws/deep-learning-containers/vllm:0.19.0-gpu-py312-ec2-soci`\n- **Image Size**: 9.72GB (compressed), 32.7GB (disk usage)\n- **Network**: Corp\n\n#### Start container with Docker (non-SOCI)\n\nWe use Docker to start the inference server directly. 
Since no image exists locally, Docker pulls and extracts the entire image before starting the container.\n\n**Total time: 6m59.099s.**\n\n#### Start container with SOCI snapshotter (lazy loading)\n\nWe use nerdctl with SOCI snapshotter to start the inference container. Although no image exists locally, the SOCI-indexed image allows nerdctl to pull only the index and necessary layers to start the container, enabling lazy loading of remaining layers.\n\n**Total time: 21.125s.**\n\n#### Lazy loading summary\n\nUsing SOCI snapshotter with lazy loading, the container started in **21.125 seconds**, compared to **6 minutes 59.099 seconds** with standard Docker. This improvement is achieved because SOCI pulls only the necessary layers to start the container, with remaining layers loaded on demand as needed.\n\n### Parallel pull mode\n\nWhile lazy loading mode starts containers immediately by fetching only the required data on demand, **parallel pull mode** downloads the entire image before startup but does so with higher concurrency than standard Docker pulls. This mode is ideal when you need the full image available at startup or when running I/O-intensive workloads.\n\n#### Environment\n\n- **Instance Type**: g5.4xlarge\n- **EBS**: 500GiB gp3, 16000 IOPS, 1000 MB/s Throughput\n- **AMI**: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04) 20260413 (`ami-06abbbf2049359343`)\n- **Docker Image**: `763104351884.dkr.ecr.us-east-1.amazonaws.com/sglang:0.5.10-gpu-py312-cu129-ubuntu24.04-sagemaker`\n- **Image Size**: 19.32GB (compressed), 60.4GB (disk usage)\n- **Network**: Corp\n\n**Note:** We use a private ECR image for this benchmark because public ECR is fronted by Amazon CloudFront, which limits network bandwidth and affects parallel mode performance. Private ECR is served directly from Amazon Simple Storage Service (Amazon S3), providing higher throughput.\n\n#### Enabling parallel pull mode\n\nThe SOCI snapshotter on Deep Learning AMI defaults to lazy loading mode. 
To enable parallel pull mode, modify the configuration file at `/etc/soci-snapshotter-grpc/config.toml`, then apply the configuration by restarting the service.\n\n**Tip:** You can tune `max_concurrent_downloads_per_image` and `max_concurrent_unpacks_per_image` based on your instance type and network bandwidth. For detailed tuning guidance, see [Introducing Seekable OCI Parallel Pull Mode for Amazon EKS](https://aws.amazon.com/blogs/containers/introducing-seekable-oci-parallel-pull-mode-for-amazon-eks/).\n\n#### Verifying parallel mode is active\n\nMonitor the SOCI snapshotter logs during the image pull to confirm that parallel mode is enabled, and look for log entries indicating parallel pull and unpack operations.\n\n#### Pull image with Docker (non-SOCI)\n\nStandard Docker pull downloads and extracts layers with limited concurrency.\n\n**Total time: 4m 44.163s**\n\n#### Pull image with SOCI parallel mode\n\nnerdctl with SOCI parallel pull mode uses increased concurrency for both download and unpack operations.\n\n**Total time: 2m 12.846s**\n\n#### Parallel pull summary\n\nUsing SOCI parallel pull mode reduced image pull time from **4 minutes 44 seconds to 2 minutes 12 seconds**, representing a **2.2x improvement** in pull performance.\n\n## Conclusion\n\nSOCI snapshotter provides improvements for both container startup and image pull operations:\n\n- **Lazy loading mode**: Achieved a **20x improvement** in container startup time (from 6+ minutes to ~21 seconds)\n- **Parallel pull mode**: Achieved a **2.2x improvement** in image pull time (from 4 minutes 44 seconds to 2 minutes 12 seconds)\n\nChoose lazy loading mode when you need the fastest possible container startup, or parallel pull mode when you need the full image available before your workload begins.\n\n## Clean up\n\nIf you launched EC2 instances to test SOCI snapshotter, terminate them to avoid incurring ongoing charges. 
Delete any container images you pushed to Amazon Elastic Container Registry (Amazon ECR) during testing, and remove any SOCI indexes you no longer need.\n\n## Getting started with SOCI\n\nDLAMI and Deep Learning Containers are publicly available today with SOCI snapshotter and SOCI index. For more information on publicly available DLAMI and Deep Learning Containers, you can check out [SOCI Index DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/soci-supported-dlami.html) to select the images that support SOCI, and check out the [Deep Learning Container repository](https://gallery.ecr.aws/deep-learning-containers) to get more information on supported images with SOCI index.\n\nFor detailed configuration guidance and best practices, refer to the [SOCI documentation](https://github.com/awslabs/soci-snapshotter/blob/main/docs/parallel-mode.md) and the [Deep Learning Container SOCI documentation](https://github.com/aws-samples/sample-aws-deep-learning-containers/tree/main/SOCI).", "url": "https://wpnews.pro/news/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc", "canonical_source": "https://aws.amazon.com/blogs/machine-learning/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc/", "published_at": "2026-06-03 16:26:35+00:00", "updated_at": "2026-06-03 16:48:56.563662+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-tools"], "entities": ["AWS", "Deep Learning AMI", "AWS Deep Learning Containers", "Seekable OCI", "SOCI"], "alternates": {"html": "https://wpnews.pro/news/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc", "markdown": "https://wpnews.pro/news/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc.md", "text": "https://wpnews.pro/news/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc.txt", "jsonld": "https://wpnews.pro/news/reducing-container-cold-start-times-using-soci-index-on-dlami-and-dlc.jsonld"}}