Sharding Hot Partitions in DynamoDB: Why Your Single-Partition Log Table Will Break at Scale

A developer discovered that their DynamoDB table for an AI anti-counterfeiting platform had a hot partition anti-pattern, where all operation logs wrote to a single partition key, risking throughput limits at scale. They identified three hot spots and implemented daily-bucketed partition keys to distribute writes across partitions, avoiding table migration.

This post was created for the H0: Hack the Zero Stack hackathon. #H0Hackathon

I shipped a DynamoDB table with a hot partition and didn't notice for three weeks. At demo scale (700 items, a few writes per minute) everything worked. It would have been fine right up until it wasn't.

The anti-pattern was obvious in hindsight: every AI operation log entry was written to `PK: "OPS LOG"`. A single partition key for an append-only, high-throughput write stream. This is the exact workload that hits DynamoDB's per-partition throughput ceiling.

Here's what I found, why it matters, and the three patterns I used to fix it, all without a table migration.

DynamoDB scales horizontally by splitting data across partitions. Each partition handles up to:

- 3,000 RCU (read capacity units) per second
- 1,000 WCU (write capacity units) per second

When you use `PAY_PER_REQUEST` (on-demand) billing mode, DynamoDB auto-scales table-level capacity. But it doesn't auto-scale within a partition. If all your writes hit the same partition key, you're bottlenecked at 1,000 WCU on that one partition regardless of your table-level throughput.

A note on adaptive capacity: DynamoDB does have an adaptive capacity feature that can temporarily boost a hot partition's throughput by borrowing from underutilized partitions. But adaptive capacity is a safety net, not a design strategy. It activates reactively, has limits, and doesn't eliminate the per-partition ceiling. Designing around the constraint is always better than relying on the database to compensate for a bad access pattern.
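To make the per-partition ceiling concrete, here is a rough sketch of the arithmetic. It assumes standard (non-transactional) writes, which cost 1 WCU per 1 KB of item size, rounded up, and it uses the 1,000 WCU per-partition figure quoted above. The helper names `wcuPerWrite` and `partitionLoad` and the example numbers are made up for illustration and are not part of the project's code.

```javascript
// Per-partition write ceiling cited in this post (WCU per second).
const PARTITION_WCU_LIMIT = 1000;

// A standard write costs 1 WCU per 1 KB of item size, rounded up.
function wcuPerWrite(itemSizeBytes) {
  return Math.max(1, Math.ceil(itemSizeBytes / 1024));
}

// Hypothetical helper: WCU per second consumed on ONE partition key
// when every write lands on that key.
function partitionLoad(writesPerSecond, itemSizeBytes) {
  const consumed = writesPerSecond * wcuPerWrite(itemSizeBytes);
  return {
    consumed,
    utilization: consumed / PARTITION_WCU_LIMIT,
    throttled: consumed > PARTITION_WCU_LIMIT,
  };
}

// Illustrative numbers only: 400 log writes per second at 2.5 KB each.
const load = partitionLoad(400, 2560);
console.log(load);
```

Note that item size matters as much as request rate: under these assumptions a 2.5 KB log entry costs 3 WCU, so the single key saturates at roughly a third of the request rate a 1 KB entry would allow.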
For `PK: "OPS LOG"`, with every single AI operation landing on one partition key, this means: `ProvisionedThroughputExceededException`.

A real anti-counterfeiting platform processing scans across thousands of brands could easily hit this. And the failure mode is silent at first: the AWS SDK retries throttled requests automatically with exponential backoff. You only see it as increased latency, then as dropped writes.

I audited every PK pattern in my single-table design and found three hot spots. The simplest detection method: count the cardinality of each PK pattern. If a PK has a cardinality of 1 (every write goes to the same key), it's a hot partition by definition.

```js
// BEFORE: Every AI call writes to the same PK
{
  PK: "OPS LOG",
  SK: "2026-06-22T01:00:00Z threat detector",
  agent: "threat detector",
  latencyMs: 340,
  aiSeverity: "HIGH",
  ...
}
```

Problem: Unbounded write concentration. Every AI classification, regardless of brand, product, or time, lands on one partition key. PK cardinality: 1.

```js
// BEFORE: All threats for one brand in one partition
{
  PK: "THREAT brand-abc-123",
  SK: "ALERT 2026-06-22T01:00:00Z geographic anomaly",
  severity: "HIGH",
  ...
}
```

Problem: A brand under active counterfeiting attack generates hundreds of alerts per day. All writes concentrate on `THREAT brand-abc-123`. The brand being attacked the hardest gets the worst write performance. Exactly backwards from what you want.

```js
// Collection key for "list all brands" without Scan
{
  PK: "BRAND INDEX",
  SK: "BRAND 2026-06-22T01:00:00Z abc",
  name: "Luxe Watches",
  ...
}
```

Problem: If brand registrations spike (product launch, marketing campaign), all writes hit `BRAND INDEX`. Same issue as `OPS LOG`. PK cardinality: 1.

Instead of a single `OPS LOG` key, bucket writes by date:

```js
// AFTER: Daily-bucketed partition keys
const dateBucket = timestamp.slice(0, 10); // "2026-06-22"
{
  PK: `OPS LOG ${dateBucket}`, // OPS LOG 2026-06-22
  SK: `${timestamp} ${agent}`,
  GSI1PK: "OPS LOG", // For cross-day queries
  GSI1SK: timestamp,
  ...
}
```

Write path: Each day's ops entries go to a different partition key. Today's 1,000 writes go to `OPS LOG 2026-06-22`. Tomorrow's go to `OPS LOG 2026-06-23`. The per-partition WCU limit now applies to each day's key, not to one key for all time. PK cardinality goes from 1 to 365/year.

Read path: The dashboard needs recent ops entries across days. Two options:

Option A: Scatter-gather

```js
// Query each daily partition in parallel
const days = 7;
const buckets = [];
for (let i = 0; i < days; i++) {
  // Step back one day (86,400,000 ms) per iteration
  const d = new Date(Date.now() - i * 86400000);
  buckets.push(d.toISOString().slice(0, 10));
}
const results = await Promise.all(
  buckets.map((date) =>
    queryItems(`OPS LOG ${date}`, undefined, { limit: 50, scanForward: false })
  )
);
// Merge and sort, newest first
const logs = results
  .flat()
  .sort((a, b) => b.timestamp.localeCompare(a.timestamp))
  .slice(0, limit);
```

7 parallel queries, each hitting a different partition key. DynamoDB handles them concurrently. Total latency is that of the slowest single query, typically under 20ms.

Option B: GSI1 query

```js
// Single query across all days via GSI
const logs = await queryGSI1("OPS LOG", undefined, {
  limit: 50,
  scanForward: false,
});
```

The GSI1 projection has `GSI1PK: "OPS LOG"` across all daily partitions. This re-concentrates reads on one GSI partition key, but reads are less critical than writes (3,000 RCU vs 1,000 WCU limit), and the dashboard is low-frequency. I use scatter-gather as the primary path and GSI1 as a fallback.

Threats are read by brand, so the bucket needs to include the brand ID:

```js
// AFTER: Monthly-bucketed by brand
const monthBucket = timestamp.slice(0, 7); // "2026-06"
{
  PK: `THREAT ${brandId} ${monthBucket}`, // THREAT abc 2026-06
  SK: `ALERT ${timestamp} ${type}`,
  GSI1PK: `BRAND ${brandId}`, // Cross-month queries
  GSI1SK: `THREAT ${timestamp}`,
  ...
}
```

Why monthly, not daily? Threats are lower volume than ops logs. A busy brand might get 10-50 threats per day.
Monthly bucketing is sufficient to prevent hot-spotting while keeping the scatter-gather read path manageable (query last 3 months = 3 parallel queries vs 90 for daily).

Read path: A GSI1 query on `BRAND brandId` with SK prefix `THREAT ` returns threats across all monthly buckets, sorted by timestamp, no scatter-gather needed:

```js
const threats = await queryGSI1(`BRAND ${brandId}`, "THREAT ", {
  limit: 50,
  scanForward: false,
});
```

This is the ideal pattern: shard writes on the base table, unify reads on a GSI.

`BRAND INDEX` and `PRODUCT INDEX` are also single-partition keys. But brand and product registration is low-throughput: maybe 50 per day during a hackathon, maybe 500 per day in production. The 1,000 WCU per-partition limit won't be hit.

The decision: Don't shard collection keys. The engineering cost of scatter-gather reads on "list all brands" isn't justified when registration throughput will never approach the partition limit. If it did (say, an enterprise customer bulk-importing 10,000 products via the batch endpoint), I'd switch to PRODUCT INDEX