{"slug": "making-llm-calls-reliable-retry-semaphore-cache-and-batch", "title": "Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch", "summary": "The article describes a reliability stack for making LLM API calls in TestSmith, consisting of four layers: retry with exponential backoff (default 3 attempts), a semaphore to limit concurrent calls (default 5), a content-addressed cache using SHA-256 hashes to avoid redundant requests, and batch generation to collapse multiple calls into one. The middleware is assembled in the order retry → semaphore → raw provider, with retry acquiring its own semaphore slot per attempt to avoid blocking other goroutines. Batch generation uses OpenAI's JSON response format to produce structured output for multiple members in a single API call.", "body_md": "When TestSmith generates tests with --llm, it calls an LLM for every public member of every source file being processed. A project with 20 files and 5 public functions each means up to 100 API calls in a single run. That's a lot of surface area for things to go wrong.\nHere's the reliability stack we built, layer by layer.\nLLM APIs fail transiently. Rate limits, timeouts, and occasional 5xx responses are all recoverable if you wait and retry. We built a retry middleware that wraps any Provider:\ntype RetryProvider struct {\ninner Provider\nmaxRetries int // total attempts, including the first\n}\nfunc (r *RetryProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {\nvar lastErr error\nfor attempt := 0; attempt < r.maxRetries; attempt++ {\nif attempt > 0 {\n// Back off before each retry: 200ms, then 400ms, doubling each time.\nwait := time.Duration(math.Pow(2, float64(attempt))) * 100 * time.Millisecond\nselect {\ncase <-time.After(wait):\ncase <-ctx.Done(): // caller cancelled while we were waiting\nreturn CompletionResponse{}, ctx.Err()\n}\n}\nresp, err := r.inner.Complete(ctx, req)\nif err == nil {\nreturn resp, nil\n}\nlastErr = err\n}\nreturn CompletionResponse{}, fmt.Errorf(\"after %d attempts: %w\", r.maxRetries, lastErr)\n}\nMaxRetryAttempts defaults to 3. 
With exponential backoff: attempt 1 is immediate, attempt 2 waits 200ms, attempt 3 waits 400ms. The total worst-case backoff per call is 600ms, which is acceptable latency for a background tool.\nWith up to 100 calls to make, goroutine fan-out is the obvious approach. But hitting an LLM API with 100 concurrent requests triggers rate limiting immediately. A semaphore caps the in-flight calls:\ntype SemaphoreProvider struct {\ninner Provider\nsem chan struct{}\n}\nfunc NewSemaphoreProvider(inner Provider, maxConcurrent int) *SemaphoreProvider {\nreturn &SemaphoreProvider{inner: inner, sem: make(chan struct{}, maxConcurrent)}\n}\nfunc (s *SemaphoreProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {\n// Acquire a slot, or give up if the context is cancelled first.\nselect {\ncase s.sem <- struct{}{}:\ndefer func() { <-s.sem }()\ncase <-ctx.Done():\nreturn CompletionResponse{}, ctx.Err()\n}\nreturn s.inner.Complete(ctx, req)\n}\nMaxConcurrentCalls defaults to 5. Each retry attempt acquires its own semaphore slot, and this matters: if the retry logic held a slot while waiting between attempts, it would block other goroutines unnecessarily. That is why the retry wrapper is the outer layer and the semaphore is the inner layer.\nThe middleware stack assembled by the factory:\nretry → semaphore → raw provider\nMany test generation runs touch the same files repeatedly; watch mode is the extreme case. Calling the LLM for the same source code twice is wasteful. A content-addressed cache avoids it:\ntype ResultCache struct {\nmu sync.RWMutex\nentries map[string][]BodyGenResult\nhits int\nmisses int\n}\nfunc cacheKey(req BodyGenRequest) string {\nh := sha256.New()\nfmt.Fprintf(h, \"%s\\n%s\\n%s\\n%s\", req.Language, req.MemberName, req.SourceCode, req.Framework.Name)\n// Sum(nil) returns the finished digest; a hash.Hash cannot be sliced directly.\nreturn hex.EncodeToString(h.Sum(nil))\n}\nThe key is a SHA-256 hash of the language, member name, source code, and framework. 
If the source code changes, the hash changes and the cache misses, so you always get fresh results for changed code.\nAfter a run, --verbose prints the cache stats:\nLLM cache — hits: 12 misses: 8 entries: 8\nThe fan-out approach makes one API call per public member. For a file with 10 functions, that's 10 calls. Batch generation collapses this to one:\nfunc (g *LLMBodyGenerator) GenerateBatchBodies(\nctx context.Context,\nreqs []BodyGenRequest,\n) ([]BodyGenResult, error) {\nprompt := buildBatchPrompt(reqs)\nresp, err := g.provider.Complete(ctx, CompletionRequest{\nSystemPrompt: batchSystemPrompt,\nUserPrompt: prompt,\nModel: g.model,\nMaxTokens: g.maxTokens * len(reqs), // scale with request count\nTemperature: g.temperature,\nResponseFormat: \"json_object\", // structured output\n})\n// ...\n}\nWe use OpenAI's response_format: {\"type\": \"json_object\"} to get structured output. The model returns a JSON envelope with one entry per member:\n{\n\"tests\": [\n{\"name\": \"ProcessPayment\", \"code\": \"func TestProcessPayment(t *testing.T) { ... }\"},\n{\"name\": \"RefundPayment\", \"code\": \"func TestRefundPayment(t *testing.T) { ... }\"}\n]\n}\nWe parse that with a JSON parser first and fall back to a delimiter-regex parser for providers that don't support structured output.\nThe pipeline checks for the BatchBodyGenerator interface via type assertion. If the generator implements it, batch mode is used. If not (or if the driver explicitly opts out), it falls back to goroutine fan-out with individual calls. This keeps the interface opt-in and backward compatible.\nWith all this happening in the background, it's useful to know what actually ran. 
The cacheStatsReporter interface lets the CLI query stats without importing the llm package:\n// In cmd/testsmith/generate.go — avoids importing internal/llm from the CLI layer\ntype cacheStatsReporter interface {\nCacheStats() (hits, misses, size int)\n}\nfunc printCacheStats(bg domain.BodyGenerator) {\nif !verbose {\nreturn\n}\nif r, ok := bg.(cacheStatsReporter); ok {\nhits, misses, size := r.CacheStats()\nfmt.Printf(\"LLM cache — hits: %d misses: %d entries: %d\\n\", hits, misses, size)\n}\n}\nThis is the interface segregation principle at work: the CLI knows about domain.BodyGenerator (which it needs for the pipeline) and cacheStatsReporter (which it needs for stats output). It doesn't need to know anything else about the LLM implementation.\nIn practice, on a mid-size Go project with 40 source files and an average of 6 public functions each:\nThe cache and batch generation together turn what would be a \"go make coffee\" operation into something you can run while you're still in the flow.\nNext: how we structure context for both AI agents working on TestSmith itself and for the LLM generating tests for your project.", "url": "https://wpnews.pro/news/making-llm-calls-reliable-retry-semaphore-cache-and-batch", "canonical_source": "https://dev.to/orieken/making-llm-calls-reliable-retry-semaphore-cache-and-batch-46fi", "published_at": "2026-05-23 16:34:37+00:00", "updated_at": "2026-05-23 17:04:10.228054+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "artificial-intelligence"], "entities": ["TestSmith"], "alternates": {"html": "https://wpnews.pro/news/making-llm-calls-reliable-retry-semaphore-cache-and-batch", "markdown": "https://wpnews.pro/news/making-llm-calls-reliable-retry-semaphore-cache-and-batch.md", "text": "https://wpnews.pro/news/making-llm-calls-reliable-retry-semaphore-cache-and-batch.txt", "jsonld": "https://wpnews.pro/news/making-llm-calls-reliable-retry-semaphore-cache-and-batch.jsonld"}}