Lambda isn't leaking memory, your metrics are lying to you

AWS Lambda's reported `@maxMemoryUsed` metric tracks a high-water mark across an entire execution environment, not per-invocation memory usage, according to AWS support. A customer's ONNX model inference Lambda function showed memory climbing to 9 GB and triggering out-of-memory kills, but reducing the cache size from 16 to 8 made the problem worse, causing 270+ SIGKILLs in three hours. The misleading metric led engineers to believe memory was leaking, when in fact transient spikes permanently raised the reported value, obscuring the true memory behavior.

A customer’s Lambda was climbing to 9 GB and getting OOM-killed. We reduced the cache size to fix it. That made it worse. Here’s what we learned about Linux memory along the way.

The incident

Among other things, we allow our customers to host ONNX models, running machine learning (ML) inference on AWS Lambda. One of our largest customers has 40 ONNX models, each around 250 MB. We cache the ONNX `InferenceSession` instances with a simple `functools.lru_cache`:

```python
import functools
from onnxruntime import InferenceSession

@functools.lru_cache(maxsize=16)
def get(s3_client, bucket, model_id):
    # Read the whole model from S3 into memory, then build the session from those bytes.
    response = s3_client.client.get_object(Bucket=bucket, Key=f"prefix/{model_id}/model.onnx")
    model_bytes = response["Body"].read()
    return InferenceSession(model_bytes)
```

The customer was seeing occasional OOMs, about 1 request in 100,000. The obvious fix: reduce the `maxsize` of the cache. We changed it from 16 to 10, then to 8. Fewer models in memory, less memory used.

Instead, we got 270+ SIGKILLs in three hours. Every Lambda execution environment climbed from 400 MB to 9,000 MB, got killed, restarted, and climbed again. Reducing the cache didn’t reduce memory; it seemed to accelerate a leak caused by more load/unload cycles.

Some quick fixes

The first fixes were straightforward, as we were keeping more things in memory than necessary.
In fact, the snippet above shows how we kept two copies of the model in memory (`model_bytes` and `InferenceSession(model_bytes)`) for a short period, increasing our peak footprint. We switched to loading via a temporary file so ONNX Runtime (ORT) reads the model from disk directly. Together, these changes dropped the customer’s p50 memory from ~7.5 GB to ~5 GB, and p99 latency improved as well.

The chart above shows the impact of some of the quick fixes on memory usage. The chart below shows the impact across execution environments; it also shows why it is important to plot by execution environment to truly understand what is going on.

But something still didn’t add up. A 19 MB ONNX model was using about 120 MB RSS after a few loading cycles. And it still looked like we were leaking memory.

Your metrics are lying to you

We started looking at `@maxMemoryUsed`, the memory metric Lambda reports in every REPORT line and exposes via CloudWatch Logs Insights. Supposedly, this gives you the maximum memory used within an invocation. While this is not explicitly stated anywhere, it is implied by, e.g., the accepted answer at https://stackoverflow.com/questions/55300504/aws-lambda-function-increasing-max-memory-used or AWS’s blog post at https://aws.amazon.com/blogs/mt/understanding-aws-lambda-behavior-using-amazon-cloudwatch-logs-insights/. If you ask any LLM as of June 2026, it will confirm that this is indeed per invocation.

We plotted it for several customers across multiple regions. The line only ever went up. Never down. Not once. We checked 5,949 invocations across 3 customers and 3 regions: zero decreases. Even a customer with zero ONNX models showed the same pattern, a monotonically increasing memory line from 325 MB to 384 MB over 138 invocations.

This seems extremely unlikely even if memory were being leaked: with a genuine per-invocation peak, a lighter request following a heavier one should occasionally produce a lower value. So we opened a ticket with AWS.
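The monotonicity check described above can be reproduced with a short script. The following is a minimal sketch, not necessarily the tooling used for this investigation. It assumes the REPORT lines have already been exported as `(log_stream, timestamp_ms, message)` tuples, and it treats each CloudWatch log stream as one execution environment (an assumption about how Lambda organizes its logs). `count_decreases` and `report` are hypothetical helper names, and the sample log lines are illustrative only, not customer data.

```python
import re
from collections import defaultdict

# Extracts the "Max Memory Used: <n> MB" field from a Lambda REPORT log line.
MAX_MEM_RE = re.compile(r"Max Memory Used: (\d+) MB")


def count_decreases(events):
    """Return (decreases, comparisons) for Max Memory Used across consecutive
    invocations of the same execution environment.

    `events` is an iterable of (log_stream, timestamp_ms, message) tuples.
    Assumption: each log stream corresponds to one execution environment.
    """
    per_env = defaultdict(list)
    for stream, ts, message in events:
        if not message.startswith("REPORT"):
            continue  # skip START/END lines and application logs
        match = MAX_MEM_RE.search(message)
        if match:
            per_env[stream].append((ts, int(match.group(1))))

    decreases = comparisons = 0
    for samples in per_env.values():
        samples.sort()  # order each environment's invocations by timestamp
        for (_, prev_mb), (_, cur_mb) in zip(samples, samples[1:]):
            comparisons += 1
            if cur_mb < prev_mb:
                decreases += 1
    return decreases, comparisons


# Hypothetical REPORT line builder, for illustration only.
def report(mb):
    return (
        "REPORT RequestId: example\tDuration: 10.00 ms\tBilled Duration: 10 ms\t"
        f"Memory Size: 10240 MB\tMax Memory Used: {mb} MB"
    )


events = [
    ("env-a", 1, report(325)),
    ("env-a", 2, report(330)),
    ("env-a", 3, report(330)),
    ("env-b", 1, report(400)),
    ("env-b", 2, report(390)),
    ("env-b", 3, "START RequestId: example Version: $LATEST"),
]
print(count_decreases(events))  # → (1, 3)
```

Grouping by environment before comparing matters: mixing invocations from different environments would produce spurious "decreases" whenever a fresh environment with a low baseline interleaves with an older one.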
Their response: “You’re right that the Max Memory Used value reported in the REPORT line behaves as a high water mark of the execution environment, not a per-invocation reset.”

Why this might be the case

Several Linux mechanisms for reporting memory usage — VmHWM in /proc/