Making LLM Calls Reliable: Retry, Semaphore, Cache, and Batch

The article describes a reliability stack for making LLM API calls in TestSmith, consisting of four layers: retry with exponential backoff (default 3 attempts), a semaphore to limit concurrent calls (default 5), a content-addressed cache using SHA-256 hashes to avoid redundant requests, and batch generation to collapse multiple calls into one. The middleware is assembled in the order retry → semaphore → raw provider, with each retry attempt acquiring its own semaphore slot to avoid blocking other goroutines. Batch generation uses OpenAI's JSON response format to produce structured output for multiple members in a single API call.

When TestSmith generates tests with `--llm`, it calls an LLM for every public member of every source file being processed. A project with 20 files and 5 public functions each means up to 100 API calls in a single run. That's a lot of surface area for things to go wrong. Here's the reliability stack we built, layer by layer.

LLM APIs fail transiently. Rate limits, timeouts, occasional 5xx responses: all of these are recoverable if you wait and retry. We built a retry middleware that wraps any `Provider`:

```go
type RetryProvider struct {
	inner      Provider
	maxRetries int
}

func (r *RetryProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
	var lastErr error
	for attempt := 0; attempt < r.maxRetries; attempt++ {
		if attempt > 0 {
			// Back off 2^attempt * 100ms before every attempt after the first.
			wait := time.Duration(math.Pow(2, float64(attempt))) * 100 * time.Millisecond
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				return CompletionResponse{}, ctx.Err()
			}
		}
		resp, err := r.inner.Complete(ctx, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return CompletionResponse{}, fmt.Errorf("after %d attempts: %w", r.maxRetries, lastErr)
}
```

`MaxRetryAttempts` defaults to 3. With exponential backoff: attempt 1 is immediate, attempt 2 waits 200ms, attempt 3 waits 400ms.
Total worst-case backoff wait per call is 600ms, well under a second: acceptable latency for a background tool.

With up to 100 calls to make, goroutine fan-out is the obvious approach. But hitting an LLM API with 100 concurrent requests triggers rate limiting immediately. A semaphore caps the in-flight calls:

```go
type SemaphoreProvider struct {
	inner Provider
	sem   chan struct{}
}

func NewSemaphoreProvider(inner Provider, maxConcurrent int) *SemaphoreProvider {
	return &SemaphoreProvider{inner: inner, sem: make(chan struct{}, maxConcurrent)}
}

func (s *SemaphoreProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
	select {
	case s.sem <- struct{}{}: // acquire a slot
		defer func() { <-s.sem }() // release it when the call returns
	case <-ctx.Done():
		return CompletionResponse{}, ctx.Err()
	}
	return s.inner.Complete(ctx, req)
}
```

`MaxConcurrentCalls` defaults to 5. Each retry attempt acquires its own semaphore slot and releases it when the attempt finishes, which is important. If retry logic held a slot while waiting between attempts, other goroutines would be blocked unnecessarily. The retry wrapper is the outer layer; the semaphore is the inner layer. The middleware stack assembled by the factory:

```
retry → semaphore → raw provider
```

Many test generation runs touch the same files repeatedly; watch mode is the extreme case. Calling the LLM for the same source code twice is wasteful. A content-addressed cache avoids it:

```go
type ResultCache struct {
	mu      sync.RWMutex
	entries map[string]BodyGenResult
	hits    int
	misses  int
}

func cacheKey(req BodyGenRequest) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\n%s\n%s\n%s", req.Language, req.MemberName, req.SourceCode, req.Framework.Name)
	return hex.EncodeToString(h.Sum(nil))
}
```

The key is a SHA-256 hash of the language, member name, source code, and framework. If the source code changes, the hash changes and the cache misses, so you always get fresh results for changed code.

After a run, `--verbose` prints the cache stats:

```
LLM cache — hits: 12 misses: 8 entries: 8
```

The fan-out approach makes one API call per public member.
For a file with 10 functions, that's 10 calls. Batch generation collapses this to one:

```go
func (g *LLMBodyGenerator) GenerateBatchBodies(ctx context.Context, reqs []BodyGenRequest) ([]BodyGenResult, error) {
	prompt := buildBatchPrompt(reqs)
	resp, err := g.provider.Complete(ctx, CompletionRequest{
		SystemPrompt:   batchSystemPrompt,
		UserPrompt:     prompt,
		Model:          g.model,
		MaxTokens:      g.maxTokens * len(reqs), // scale with request count
		Temperature:    g.temperature,
		ResponseFormat: "json_object", // structured output
	})
	// ...
}
```

We use OpenAI's `response_format: {"type": "json_object"}` to get structured output. The model returns a JSON envelope with one entry per member:

```json
{
  "tests": [
    {"name": "ProcessPayment", "code": "func TestProcessPayment(t *testing.T) { ... }"},
    {"name": "RefundPayment", "code": "func TestRefundPayment(t *testing.T) { ... }"}
  ]
}
```

We parse that with a primary JSON parser, with a fallback to a delimiter-regex parser for providers that don't support structured output.

The pipeline checks for the `BatchBodyGenerator` interface via type assertion. If the generator implements it, batch mode is used. If not (or if the driver explicitly opts out), it falls back to goroutine fan-out with individual calls. This keeps the interface opt-in and backward compatible.

With all this happening in the background, it's useful to know what actually ran. The `cacheStatsReporter` interface lets the CLI query stats without importing the `llm` package:

```go
// In cmd/testsmith/generate.go: avoids importing internal/llm from the CLI layer.
type cacheStatsReporter interface {
	CacheStats() (hits, misses, size int)
}

func printCacheStats(bg domain.BodyGenerator) {
	if !verbose {
		return
	}
	// Only generators that expose cache stats satisfy this assertion.
	if r, ok := bg.(cacheStatsReporter); ok {
		hits, misses, size := r.CacheStats()
		fmt.Printf("LLM cache — hits: %d misses: %d entries: %d\n", hits, misses, size)
	}
}
```

This is the interface segregation principle at work: the CLI knows about `domain.BodyGenerator` (which it needs for the pipeline) and `cacheStatsReporter` (which it needs for stats output).
It doesn't need to know anything else about the LLM implementation.

In practice, on a mid-size Go project with 40 source files and an average of 6 public functions each:

The cache and batch generation together turn what would be a "go make coffee" operation into something you can run while you're still in the flow.

Next: how we structure context for both AI agents working on TestSmith itself and for the LLM generating tests for your project.