Reducing container cold start times using SOCI index on DLAMI and DLC

AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers (DLC) now support the SOCI snapshotter and SOCI indexes. Seekable OCI (SOCI) is a technology that enables efficient container image management through selective file downloading. It uses a layer-based indexing system to map file locations within container images, allowing containers to start with only the necessary files loaded (lazy loading). This approach reduces network bandwidth usage and improves container startup times, making it particularly valuable for organizations managing large container images in cloud environments.

In this post, we look at how to use SOCI on publicly available Deep Learning AMIs and Containers, when to use the various SOCI modes provided by the tool, and how to quickly and efficiently use this tool in your workloads today.

Background #

As organizations deploy artificial intelligence (AI) and machine learning (ML) workloads at scale, container startup time has become a bottleneck in production environments. Whether it's spinning up training jobs, serving inference endpoints, or scaling GPU clusters automatically, the time spent downloading multi-gigabyte container images directly impacts cost, user experience, and operational efficiency. Traditional container deployment approaches force teams to download entire images before workloads can begin. For images commonly used in production, this can take several minutes. During development, a few minutes of wait time is barely noticeable. In production, those same minutes add up fast.

Organizations deploying deep learning infrastructure at scale typically encounter several critical challenges:

Prolonged cold start times. Standard Docker image pulls of 15–20 GB can take 4–6 minutes per instance, delaying training jobs and inference endpoints during scaling events.
Wasted compute resources. GPU instances sit idle during image pulls, burning through expensive compute hours while waiting for container initialization to finish.
Scaling bottlenecks. When demand spikes trigger automatic scaling, slow container startup times prevent rapid response, leading to degraded performance or dropped requests.
Bandwidth constraints. Large-scale deployments pulling massive images simultaneously can saturate network bandwidth, creating cascading delays across the infrastructure.
Developer productivity. Data scientists and ML engineers waste valuable time waiting for containers to start during iterative development and experimentation cycles.

Container pulling mechanisms #

When pulling a container for your workloads, AWS Deep Learning AMIs (DLAMI) and Deep Learning Containers offer three options: the standard Docker pull, SOCI parallel pull, and SOCI lazy loading through a SOCI index. Think of these as a sliding scale of tradeoffs. Docker pulls are sequential and slow. SOCI parallel pull provides faster startup times by chunking downloads, at the cost of compute resources. SOCI lazy loading provides near-instant container startup but requires files to be fetched on demand. You can use the following guide to choose the right mechanism for your workloads:

The choice between lazy loading and parallel pull modes depends on the image, instance specifications, and storage configuration. Lazy loading requires images to have a SOCI index. Without one, the system falls back to standard pulling.
Lower-spec instances should use lazy loading to conserve resources, while high-spec instances with multiple vCPUs and high network bandwidth benefit from parallel pull mode. Storage performance varies: EBS volumes are bounded by their provisioned IOPS and volume type, potentially creating bottlenecks during unpacking, while NVMe instance store delivers maximum I/O performance at the cost of data persistence across instance stop/start cycles.
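The guide above can be read as a small decision rule. The following Python sketch is one possible encoding of it, not official AWS guidance: the `Host` type, the `choose_pull_mode` function, and the `min_vcpus` and `min_network_gbps` cut-offs are all hypothetical names and values chosen for illustration, and you would tune them for your own fleet.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical description of an instance; not an AWS API type."""
    vcpus: int
    network_gbps: float


def choose_pull_mode(has_soci_index: bool, host: Host,
                     min_vcpus: int = 16,
                     min_network_gbps: float = 10.0) -> str:
    """Return 'parallel', 'lazy', or 'standard' for a given image and host.

    The thresholds are illustrative assumptions, not values from the post.
    """
    high_spec = (host.vcpus >= min_vcpus
                 and host.network_gbps >= min_network_gbps)
    if high_spec:
        # High-spec hosts can afford the extra CPU and bandwidth of parallel pull.
        return "parallel"
    if has_soci_index:
        # Lower-spec hosts conserve resources by fetching files on demand.
        return "lazy"
    # Without a SOCI index, lazy loading falls back to a standard pull.
    return "standard"


if __name__ == "__main__":
    print(choose_pull_mode(True, Host(vcpus=8, network_gbps=5.0)))
    print(choose_pull_mode(False, Host(vcpus=8, network_gbps=5.0)))
    print(choose_pull_mode(True, Host(vcpus=32, network_gbps=25.0)))
```

The sketch deliberately ignores storage configuration; if unpacking to EBS is your bottleneck, the vCPU and network thresholds alone will not capture it.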

The following example shows the various mechanisms based on the vLLM Deep Learning Container:

Figure: Deep Learning Container pull mechanisms

Solution architecture #

The following diagram shows the architecture for using SOCI with DLAMI and Deep Learning Containers.

Container startup time comparison with SOCI snapshotter #

The following benchmarks compare standard Docker pulls against SOCI snapshotter in both lazy and parallel pull modes.

Lazy mode

Lazy mode starts containers immediately by fetching only the necessary data on demand, with remaining layers loaded in the background as needed.

Prerequisites

SOCI index required

Important: Lazy mode requires the container image to have a SOCI index stored in the registry. Without a SOCI index, the snapshotter falls back to standard pull behavior, and you won't see any performance improvement. AWS Deep Learning Containers (DLCs) with the -soci tag suffix come with SOCI indexes pre-created and pushed to the registry, enabling lazy loading out of the box. For custom images, you must create and push SOCI indexes yourself.

Environment

**Instance Type**: g5.2xlarge
**EBS**: 500 GiB, 3000 IOPS, 125 MB/s throughput
**AMI**: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04) 20260413 (ami-06abbbf2049359343)
**Docker Image**: `public.ecr.aws/deep-learning-containers/vllm:0.19.0-gpu-py312-ec2-soci`
**Image Size**: 9.72 GB (compressed), 32.7 GB (disk usage)
**Network**: Corp

Start container with Docker (non-SOCI)

We use Docker to start the inference server directly. Since no image exists locally, Docker pulls and extracts the entire image before starting the container.

Total time: 6m59.099s.

Start container with SOCI snapshotter (lazy loading)

We use nerdctl with the SOCI snapshotter to start the inference container. Although no image exists locally, the SOCI-indexed image allows nerdctl to pull only the index and the layers needed to start the container, and to lazily load the remaining layers.

Total time: 21.125s.
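The exact command used for this measurement is not reproduced here. As a rough sketch, the following Python helper assembles a nerdctl invocation that selects the SOCI snapshotter through nerdctl's `--snapshotter` flag and checks for the `-soci` tag suffix mentioned in the prerequisites. The helper names are hypothetical, and GPU and inference server arguments are intentionally left out, so treat this as a starting point rather than the benchmark command.

```python
import shlex
from typing import List

# Image used in the lazy loading benchmark above.
IMAGE = "public.ecr.aws/deep-learning-containers/vllm:0.19.0-gpu-py312-ec2-soci"


def has_soci_tag(image: str) -> bool:
    """Heuristic from the prerequisites: DLC tags ending in -soci ship an index."""
    tag = image.rsplit(":", 1)[-1]
    return tag.endswith("-soci")


def nerdctl_run_command(image: str, snapshotter: str = "soci") -> List[str]:
    """Build (but do not execute) the argv for running an image via a snapshotter.

    GPU flags and server arguments are omitted; add what your workload needs.
    """
    return ["sudo", "nerdctl", "run", "--snapshotter", snapshotter, "--rm", image]


if __name__ == "__main__":
    if not has_soci_tag(IMAGE):
        print("warning: tag has no -soci suffix; expect a standard pull")
    print(shlex.join(nerdctl_run_command(IMAGE)))
```

Building the argument list separately from executing it makes it easy to log or review the command before running it on an instance.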

Lazy loading summary

Using the SOCI snapshotter with lazy loading, the container started in 21.125 seconds, compared to 6 minutes 59.099 seconds with standard Docker. This improvement is achieved because SOCI pulls only the layers necessary to start the container, with the remaining layers loaded on demand.

Parallel pull mode

While lazy mode starts containers immediately by fetching only the required data on-demand, parallel pull mode downloads the entire image before startup but does so with higher concurrency than standard Docker pulls. This mode is ideal when you need the full image available at startup or when running I/O-intensive workloads.

Environment

**Instance Type**: g5.4xlarge
**EBS**: 500 GiB gp3, 16000 IOPS, 1000 MB/s throughput
**AMI**: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04) 20260413 (ami-06abbbf2049359343)
**Docker Image**: `763104351884.dkr.ecr.us-east-1.amazonaws.com/sglang:0.5.10-gpu-py312-cu129-ubuntu24.04-sagemaker`
**Image Size**: 19.32 GB (compressed), 60.4 GB (disk usage)
**Network**: Corp

Note: We use a private ECR image for this benchmark because public ECR is fronted by Amazon CloudFront, which limits network bandwidth and affects parallel mode performance. Private ECR is served directly from Amazon Simple Storage Service (Amazon S3), providing higher throughput.

Enabling parallel pull mode

The SOCI snapshotter on the Deep Learning AMI defaults to lazy mode. To enable parallel pull mode, modify the configuration file at /etc/soci-snapshotter-grpc/config.toml, then restart the SOCI snapshotter service to apply the configuration.

Tip: You can tune max_concurrent_downloads_per_image and max_concurrent_unpacks_per_image based on your instance type and network bandwidth. For detailed tuning guidance, see Introducing Seekable OCI Parallel Pull Mode for Amazon EKS.
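Before and after tuning, it can help to confirm which values the two settings named in the tip currently have. The following Python sketch scans the text of the configuration file for those two keys without assuming which TOML table they live in. The `read_tunables` helper is a hypothetical name, and the `SAMPLE` excerpt, including its table name and values, is a placeholder for illustration rather than the real file layout.

```python
import re
from pathlib import Path
from typing import Dict

CONFIG_PATH = Path("/etc/soci-snapshotter-grpc/config.toml")

# The two settings named in the tip above.
TUNABLES = (
    "max_concurrent_downloads_per_image",
    "max_concurrent_unpacks_per_image",
)

# Matches lines such as `key = 10` or `key = -1  # comment`.
_SETTING = re.compile(r"^\s*(\w+)\s*=\s*(-?\d+)\s*(?:#.*)?$")


def read_tunables(config_text: str) -> Dict[str, int]:
    """Return the integer value of each tunable found in the config text."""
    found: Dict[str, int] = {}
    for line in config_text.splitlines():
        match = _SETTING.match(line)
        if match and match.group(1) in TUNABLES:
            found[match.group(1)] = int(match.group(2))
    return found


# Hypothetical excerpt: table name and values are placeholders, not defaults.
SAMPLE = """
[example_table]
max_concurrent_downloads_per_image = 10
max_concurrent_unpacks_per_image = 4  # illustrative value
"""

if __name__ == "__main__":
    text = CONFIG_PATH.read_text() if CONFIG_PATH.exists() else SAMPLE
    for name, value in read_tunables(text).items():
        print(f"{name} = {value}")
```

A line-based scan is used instead of a full TOML parser so the sketch runs on older Python versions; for anything beyond a quick check, a real TOML parser is the safer choice.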

Verifying parallel mode is active

Monitor the SOCI snapshotter logs during an image pull to confirm that parallel mode is enabled, and look for log entries indicating parallel pull and unpack operations.

Pull image with Docker (non-SOCI)

Standard Docker pull downloads and extracts layers with limited concurrency.

Total time: 4m 44.163s

Pull image with SOCI parallel mode

nerdctl with SOCI parallel pull mode uses higher concurrency for both download and unpack operations.

Total time: 2m 12.846s

Parallel pull summary

Using SOCI parallel pull mode reduced image pull time from 4 minutes 44 seconds to 2 minutes 12 seconds, roughly a 2.1x improvement in pull performance.
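To relate these timings to the storage and network limits discussed earlier, it can be useful to convert them into an effective end-to-end rate. The following Python sketch divides the compressed image size by the measured pull time. It assumes 1 GB = 1000 MB, and because the timings include unpacking, the result is a combined download-and-unpack rate rather than raw network throughput. The function name is hypothetical.

```python
COMPRESSED_GB = 19.32               # compressed image size from the environment
DOCKER_SECONDS = 4 * 60 + 44.163    # standard Docker pull
SOCI_SECONDS = 2 * 60 + 12.846      # SOCI parallel pull


def effective_rate_mb_s(compressed_gb: float, seconds: float) -> float:
    """Compressed megabytes processed per second, assuming 1 GB = 1000 MB."""
    return compressed_gb * 1000 / seconds


if __name__ == "__main__":
    docker_rate = effective_rate_mb_s(COMPRESSED_GB, DOCKER_SECONDS)
    soci_rate = effective_rate_mb_s(COMPRESSED_GB, SOCI_SECONDS)
    print(f"Docker pull:        {docker_rate:.1f} MB/s")
    print(f"SOCI parallel pull: {soci_rate:.1f} MB/s")
```

With the numbers above, this works out to roughly 68 MB/s for the Docker pull and roughly 145 MB/s for the SOCI parallel pull.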

Conclusion #

SOCI snapshotter provides improvements for both container startup and image pull operations:

Lazy mode: Achieved a **20x improvement** in container startup time (from 6+ minutes to ~21 seconds)
Parallel pull mode: Achieved a **2.1x improvement** in image pull time (from 4 minutes 44 seconds to 2 minutes 12 seconds)
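The improvement factors can be rechecked directly from the reported timings. The following Python sketch parses the durations as written in the benchmark sections and divides the baseline by the SOCI result. The `to_seconds` and `speedup` helpers are hypothetical names introduced for this example.

```python
import re

# Accepts durations such as "6m59.099s", "4m 44.163s", or "21.125s".
_DURATION = re.compile(r"^(?:(\d+)m)?\s*(\d+(?:\.\d+)?)s$")


def to_seconds(text: str) -> float:
    """Convert a duration string in the format used above into seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def speedup(baseline: str, improved: str) -> float:
    """How many times faster the improved run is than the baseline."""
    return to_seconds(baseline) / to_seconds(improved)


if __name__ == "__main__":
    print(f"lazy mode:     {speedup('6m59.099s', '21.125s'):.1f}x")
    print(f"parallel pull: {speedup('4m 44.163s', '2m 12.846s'):.1f}x")
```

The ratios come out to roughly 19.8 for lazy mode and roughly 2.1 for parallel pull.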

Choose lazy mode when you need the fastest possible container startup, or parallel pull mode when you need the full image available before your workload begins.

Clean up #

If you launched EC2 instances to test SOCI snapshotter, terminate them to avoid incurring ongoing charges. Delete any container images you pushed to Amazon Elastic Container Registry (Amazon ECR) during testing, and remove any SOCI indexes you no longer need.

Getting started with SOCI #

DLAMI and Deep Learning Containers with the SOCI snapshotter and SOCI indexes are publicly available today. To find the DLAMI images that support SOCI, see SOCI Index DLAMI. For Deep Learning Container images that ship with a SOCI index, see the Deep Learning Container repository.

For detailed configuration guidance and best practices, refer to the SOCI documentation and the Deep Learning Container SOCI documentation.

Source: aws.amazon.com (original article)
