I Built a Chaos Engine That Goes Where No Tool Has Gone Before A developer built pastaay, a chaos engineering tool written in Go that operates across seven different protocols and the OS itself, going beyond the network-level disruptions of tools like Netflix's Chaos Monkey or Gremlin. The tool uses a type-indexed cache behind an atomic pointer to check for active policies on every request without heap allocations, enabling it to handle 50,000 requests per second without garbage collector overhead. Pastaay deploys through a CLI with a dead man's switch, a Kubernetes operator managing CRDs, or an AI that reads live Prometheus metrics to generate attack plans. No team. No budget. Just Go. This is the engineering deep dive behind pastaay, a chaos engine that breaks everything from HTTP headers to physical memory. I started this project with a simple question: why do all chaos engineering tools stop at the network? Netflix’s Chaos Monkey kills instances. That’s useful if your failure mode is “pod died.” Gremlin adds CPU spikes. Litmus runs pod level experiments in Kubernetes. They’re all valuable tools. But none of them let you walk into a single Go binary and say: “intercept this gRPC bidirectional stream, drop every third Kafka message, corrupt this MongoDB aggregation pipeline, and while you’re at it, eat 2GB of RAM and burn 4 CPU cores for 30 seconds.” None of them give you one YAML file that defines destruction across seven different protocols plus the OS itself, then deploys that destruction through a CLI with a dead man’s switch, a Kubernetes operator managing CRDs, or an AI that reads your live Prometheus metrics and writes the attack plan for you. This is the gap pastaay fills. It’s not a wrapper around existing tools. It’s not a YAML generator for litmus experiments. Every interceptor, every operator controller, every JavaScript panel, every line of Go is built from scratch. The name comes from paSta’ay, a ritual of Taiwan’s Saisiyat people honoring the Da’ai, a tribe of diminutive but uncommonly powerful spirits. Chaos, remembrance, and things that punch above their weight. Seemed fitting. Let me walk you through how each piece works, with real code, real architecture diagrams, and the real engineering decisions behind them. Before we dive into individual components, here’s how the entire system fits together: The engine has three entry points, web console, CLI, and Kubernetes operator, all feeding into the same Config Manager. The Manager holds the source of truth: an atomic pointer to the current configuration. Every interceptor reads from this pointer on every request. The file watcher updates this pointer when the YAML changes on disk. The telemetry bus collects events from every layer and feeds them to the web console, Prometheus, and OpenTelemetry. This is the most important performance decision in the entire project. On every HTTP request, literally every request your application serves, pastaay needs to check: “is there an active policy that targets this path?” If that check allocates memory, and you’re serving 50,000 requests per second, you’re creating 50,000 heap allocations per second. The garbage collector will eat you alive. The solution is a type-indexed cache behind an atomic pointer. T hanks to Go 1.19’s atomic.Pointer , we can do this safely and cleanly without interface{} type assertions: type Manager struct { mu sync.Mutex cfg atomic.Pointer PastaayConfig typedPolicies atomic.Pointer map string Policy startTime time.Time sensorStatus sync.Map } func m Manager GetActivePolicies policyType string Policy { ptr := m.typedPolicies.Load cfg := m.cfg.Load if ptr == nil || cfg = nil && time.Since m.startTime < cfg.WarmupDuration { return nil } return ptr policyType } When the configuration is updated by file watcher, webhook, or operator , the Manager takes a write lock once, rebuilds the type-indexed map, and atomically swaps the pointer. Every subsequent read, and there are potentially millions of reads per second, is just an atomic load followed by a map lookup. Zero allocation. Zero lock contention. The warmup check at the top prevents chaos from triggering during the initial configuration load. If your application starts and chaos policies fire before the server is ready, you get a false-positive outage. The warmup period default 10 seconds gives everything time to stabilize. Every policy gets a deterministic FNV-1a hash computed from all its fields: func generateStableHash p Policy uint64 { // FNV-1a: a non-cryptographic hash that distributes well // and is deterministic, same input always produces same output. var h uint64 = 14695981039346656037 // FNV offset basis const fnvPrime uint64 = 1099511628211 // Hash string fields with field separators to prevent // "ab" + "c" from colliding with "a" + "bc". sep := func { h ^= 0; h = fnvPrime } for , s := range string{p.Name, p.Target, p.Type, p.ErrorBody, p.StreamRollMode} { for i := 0; i < len s ; i++ { h ^= uint64 s i h = fnvPrime } sep } // ... numeric fields, match headers ... return h } Why does this matter? The resource sabotage daemon uses policy hashes for deduplication. If you change a policy’s RAM chunk from 256MB to 512MB, the hash changes, and the daemon kills the old RAM leaker and spawns a new one. If nothing changed, the daemon does nothing. This prevents the classic “restart all goroutines on every config reload” problem. SQL interception has a unique challenge: the same logical query can have different textual representations. SELECT FROM users and select from users and SELECT FROM users are the same query. The CleanSQLCommand function normalizes SQL text by stripping comments, collapsing whitespace, and uppercasing: js func CleanSQLCommand cmd string string { var result strings.Builder inString := false var stringChar byte for i := 0; i < len cmd ; i++ { char := cmd i if char == '\'' || char == '"' { // Handle escaped quotes isEscaped := false for j := i - 1; j = 0 && cmd j == '\\'; j-- { isEscaped = isEscaped } if isEscaped { if inString && char == stringChar { inString = false } else if inString { inString = true stringChar = char } } } if inString { if char == '-' && i+1 < len cmd && cmd i+1 == '-' { // Skip SQL line comment for i < len cmd && cmd i = '\n' { i++ } result.WriteByte ' ' continue } } result.WriteByte char } return strings.ToUpper strings.Trim result.String , " \r\n\t; " } This is one of those functions that looks simple but took three iterations to get right. The string aware comment stripping so inside a SQL string literal doesn't get treated as a comment was the hardest part. Version 1 used a regex. Version 2 used a simple loop but broke on escaped quotes. Version 3 this one is the first that handles all edge cases correctly. The Manager’s Update method is the only write path. It's protected by a mutex, but that mutex is never contended during normal operation because updates happen rarely once per config file change . Here's the full method: func m Manager Update newCfg PastaayConfig { m.mu.Lock defer m.mu.Unlock if newCfg = nil { // Pre-compute all policy metadata for i := range newCfg.Policies { p := &newCfg.Policies i // Generate metric tag type:target, truncated to 64 chars tag := p.Type + ":" + p.Target if len tag 64 { tag = tag :61 + "..." } p.MetricTag = tag // SQL regex compilation if strings.EqualFold p.Type, "sql" { // ... compile regex for SQL target matching ... } // Compute stable FNV-1a hash p.PolicyHash = generateStableHash p } m.cfg.Store newCfg } // Rebuild type-indexed cache cache := make map string Policy if newCfg = nil { for , p := range newCfg.Policies { cache p.Type = append cache p.Type , p } } m.typedPolicies.Store &cache } The critical design choice here: the metadata pre-computation happens under the write lock, not on the hot path. Every interceptor calls GetActivePolicies which does nothing but an atomic load and a map lookup. The regex compilation, hash computation, and metric tag generation all happen once during the update, not millions of times during request processing. The naive approach to watching a configuration file: watcher, := fsnotify.NewWatcher watcher.Add "pastaay.yaml" for event := range watcher.Events { if event.Has fsnotify.Write { reload } } This breaks in production. Here’s why: RENAME event, not WRITE . sed -i create a new file and rename it over the old one.Pastaay’s watcher handles all of these: js const debounceWindow = 300 time.Millisecond func WatchConfig filePath string, reloadCallback func PastaayConfig cancel func , err error { ctx, cancelCtx := context.WithCancel context.Background watcher, werr := fsnotify.NewWatcher if werr = nil { cancelCtx return nil, werr } if aerr := watcher.Add filePath ; aerr = nil { = watcher.Close cancelCtx return nil, aerr } var timer time.Timer timerMu sync.Mutex fireWG sync.WaitGroup // tracks in-flight reloads so cancel waits for them stopFlag = make chan struct{} // scheduleReload is the single entry point for ALL file events. // Remove, Rename, Chmod inode changes , Write, Create, and even // watcher errors all go through here. This eliminates the race // condition that existed when Remove and Write had separate paths. scheduleReload := func reason string { timerMu.Lock defer timerMu.Unlock // If stopFlag is closed, cancel has been called. // Don't schedule anything new. select { case <-stopFlag: return default: } // Reset the debounce timer. Every new event pushes the reload // 300ms into the future. This prevents 4 rapid reloads when an // editor saves a file. if timer = nil { timer.Stop } timer = time.AfterFunc debounceWindow, func { // Register BEFORE doing any work so cancel can track us. fireWG.Add 1 defer fireWG.Done // Re-check: if cancel was called during the 300ms wait, // bail out immediately. select { case <-stopFlag: return default: } // Reattach: remove the old inode, then retry Add. // vim writes a temp file then renames. sed -i creates a // new file. K8s ConfigMaps swap symlink targets. All of // these are handled by Remove + Add. = watcher.Remove filePath for attempt := 0; attempt < 10; attempt++ { if err := watcher.Add filePath ; err == nil { break } select { case <-stopFlag: return case <-time.After 50 time.Millisecond : } } newCfg, loadErr := LoadConfig filePath if loadErr = nil { log.Printf " Pastaay-Config reload skipped %s : %v", reason, loadErr return } // Validate rejects broken YAML. Engine keeps the last valid config. if vErr := newCfg.Validate ; vErr = nil { log.Printf " Pastaay-Config reload rejected %s : %v", reason, vErr return } // Final gate: if cancel was called during LoadConfig or // Validate, do NOT invoke the callback. The caller has // already torn down and a stale config push would be a bug. select { case <-stopFlag: return default: } reloadCallback newCfg } } var loopWG sync.WaitGroup loopWG.Add 1 go func { defer loopWG.Done defer watcher.Close for { select { case <-ctx.Done : return case event, ok := <-watcher.Events: if ok { return } // All event types go to the same debounced path. // No more separate CAS reattach goroutine. if event.Has fsnotify.Write || event.Has fsnotify.Create || event.Has fsnotify.Remove || event.Has fsnotify.Rename || event.Has fsnotify.Chmod { scheduleReload event.Op.String } case errEv, ok := <-watcher.Errors: if ok { return } log.Printf " Pastaay-Config watcher error forcing reattach : %v", errEv scheduleReload "watcher-error" } } } cancel = func { close stopFlag // signal all goroutines to stop cancelCtx // cancel the root context timerMu.Lock if timer = nil { timer.Stop // stop any pending debounce timer } timerMu.Unlock loopWG.Wait // wait for the event loop goroutine fireWG.Wait // wait for any in-flight reload callback } return cancel, nil } All file events; Write, Create, Remove, Rename, Chmod, and watcher errors, go through a single scheduleReload path. A 300ms debounce prevents burst reloads. Inside the callback, watcher.Remove + Add handles inode changes from vim, sed -i, and Kubernetes ConfigMap symlink swaps. If LoadConfig fails or Validate rejects, the engine keeps running with the last valid config. The critical detail: fireWG.Add 1 happens inside the AfterFunc body but BEFORE the I/O work. cancel closes stopFlag, stops the timer, then calls loopWG.Wait then fireWG.Wait . Any callback past the stopFlag gate completes before Wait returns. Any callback still in the debounce window sees stopFlag closed and bails. Zero use-after-cancel. Invalid configurations are rejected. If you write broken YAML, the engine keeps running with the last valid configuration. The error is logged. No crash, no rollback, no surprise. Every protocol interceptor in pastaay follows the same skeleton. Here’s the HTTP middleware in full: func Middleware mgr config.Manager func http.Handler http.Handler { return func next http.Handler http.Handler { return http.HandlerFunc func w http.ResponseWriter, r http.Request { // Step 1: Check if this path is protected whitelist if mgr.IsCommandIgnored "http", r.URL.Path { next.ServeHTTP w, r return } // Step 2: Get active policies for this protocol zero alloc atomic load policies := mgr.GetActivePolicies "http" // Step 3: Iterate policies, match path and headers for i := range policies { p := &policies i if matchPath r.URL.Path, p || matchHeaders r, p.MatchHeaders { continue } // Step 4a: Latency injection with context cancellation support if p.LatencyChance 0 && rand.Float64 < p.LatencyChance { metrics.InjectedFaultsTotal.WithLabelValues metricTag, "latency" .Inc ctx, span := tracing.StartChaosSpan r.Context , "pastaay.http.latency", p.Target, "latency" timer := time.NewTimer p.LatencyDuration select { case <-timer.C: span.End case <-ctx.Done : timer.Stop span.End w.WriteHeader 499 // 499: Nginx standard for Client Closed Request return } } // Step 4b: Error injection if p.ErrorChance 0 && rand.Float64 < p.ErrorChance { metrics.InjectedFaultsTotal.WithLabelValues metricTag, "error" .Inc , span := tracing.StartChaosSpan r.Context , "pastaay.http.error", p.Target, "error" defer span.End status := p.ErrorCode if status == 0 { status = http.StatusInternalServerError } w.Header .Set "Content-Type", "application/json" w.WriteHeader status if p.ErrorBody = "" { io.WriteString w, p.ErrorBody } return // Stop chain, don't call next handler } } // Step 5: No matching policy with active probability, pass through next.ServeHTTP w, r } } } Path matching isn’t as simple as strings.HasPrefix . Consider: Policy target: /api/users/ Request path: /api/users/123 ✓ Match Request path: /api/usersettings ✗ Should NOT match The solution requires checking that the character immediately after the prefix is either / or end of string: func matchPath reqPath string, p config.Policy bool { // Exact match or wildcard "all" catches everything. if strings.EqualFold p.Target, "all" || strings.EqualFold reqPath, p.Target { return true } if p.IsWildcard { reqPathUpper := strings.ToUpper reqPath if strings.HasPrefix reqPathUpper, p.WildcardPrefix { remaining := reqPathUpper len p.WildcardPrefix : // The character AFTER the prefix must be '/' or end-of-string. // This prevents /api/users from matching /api/usersettings. if strings.HasSuffix p.WildcardPrefix, "/" || len remaining == 0 || remaining 0 == '/' { return true } } } return false } The third condition remaining 0 == '/' is the key: after stripping the wildcard prefix, the next character must be a path separator. This prevents /api/users from matching /api/usersettings . gRPC interception is more complex than HTTP because of bidirectional streaming. A unary RPC that fails returns an error code to the client immediately. A streaming RPC that fails mid-stream leaves the client hanging unless the server actively closes the stream. The gRPC interceptor handles this by distinguishing between unary and streaming contexts. For streaming, the interceptor wraps the server stream to inject faults on individual messages rather than the entire stream. This allows you to drop message 3 in a stream of 100 messages without affecting the rest, a far more realistic failure mode than killing the entire connection. SQL interception is fundamentally different from HTTP interception. With HTTP, you wrap a handler. With SQL, you wrap the database driver itself, at the database/sql/driver interface level. type WrapperDriver struct { original driver.Driver // the real driver pgx, sqlite3, mysql, etc. cfgManager config.Manager } type WrapperConnector struct { original driver.Connector // Go 1.10+ connector interface cfgManager config.Manager } func d WrapperDriver Open name string driver.Conn, error { // Apply chaos BEFORE opening: can drop connections entirely // or add latency before the handshake. if err := applyConnectionChaos context.Background , d.cfgManager ; err = nil { return nil, err } conn, err := d.original.Open name if err = nil { return nil, err } // Wrap the real connection so every Query/Exec/Begin gets intercepted. return &WrapperConn{originalConn: conn, cfgManager: d.cfgManager}, nil } Every connection that passes through the wrapper gets a WrapperConn that intercepts Prepare , Exec , Query , Begin , and Commit . Each of these methods checks the policy engine before passing through to the real connection. The OpenConnector method handles Go 1.10+ driver connector interface, which is what modern database drivers use internally: func d WrapperDriver OpenConnector name string driver.Connector, error { // Go 1.10+ database/sql uses connectors internally. // We must wrap the connector too, or the driver bypasses our interceptor. if dc, ok := d.original. driver.DriverContext ; ok { connector, err := dc.OpenConnector name if err = nil { return nil, err } return &WrapperConnector{original: connector, cfgManager: d.cfgManager}, nil } // Fallback for older drivers that don't implement DriverContext. return &fallbackConnector{driver: d, name: name}, nil } SQL chaos has a critical safety constraint: never inject faults inside an active transaction’s lifecycle. Breaking a COMMIT is the worst thing you can do to a database, it leaves locks held, connections stuck, and the application in an undefined state. SQL chaos checks are applied at connection open and statement execution points. BEGIN also receives a chaos check, but Commit and Rollback are passed through without interception. You can still introduce latency the transaction takes longer , but you can't drop the connection or inject query errors during the transaction body. This is the kind of constraint that only comes from production experience. Version 1 of the SQL driver didn’t have this protection, and it corrupted test data. Version 2 added it after a very long debugging session. This is the feature that makes people do a double take. Most chaos tools stop at the network layer. Pastaay eats your CPU and RAM. func BurnCPU ctx context.Context, cores int, threshold int { if cores <= 0 { cores = runtime.NumCPU // saturate all cores by default } if threshold <= 0 { threshold = 100000 // empirically ~95% CPU on a single core } var wg sync.WaitGroup for i := 0; i < cores; i++ { wg.Add 1 go func { defer wg.Done payload := byte "pastaay-cpu-vector" var localSink 32 byte // compiler can't optimize this away for { for j := 0; j < threshold; j++ { // SHA-256 is in assembly crypto/sha256 . // The compiler cannot elide it as dead code. localSink = sha256.Sum256 payload } // runtime.KeepAlive tells the compiler "localSink is still // live", prevents the entire loop from being optimized out. runtime.KeepAlive localSink select { case <-ctx.Done : // clean exit when TTL expires or HALT is pressed return default: continue } } } } wg.Wait } The Go compiler is smart. If it detects that your computation has no observable side effects, it eliminates the entire loop. The empty for {} burns CPU in debug builds but vanishes in optimized builds. So we use SHA-256 hashing, a cryptographic operation the compiler cannot elide because it's implemented in assembly via the crypto/sha256 package. Two subtleties here: runtime.KeepAlive localSink , Without this, the compiler might notice that localSink is never read after the loop and eliminate the assignment. KeepAlive is a compiler directive that says "this variable is still live." It's a no op at runtime but prevents dead code elimination. select with ctx.Done , This is the clean escape hatch. The default branch keeps the loop spinning. When the context is cancelled experiment TTL expires, halt button pressed, policy removed , the goroutine exits immediately. No leaks, no orphans. The threshold parameter controls intensity through iteration count. Empirically: func LeakRAM ctx context.Context, chunkMB int, interval time.Duration { if interval <= 0 { interval = 1 time.Second } if chunkMB <= 0 { return } ticker := time.NewTicker interval defer ticker.Stop chunkSize := chunkMB 1024 1024 var pool byte defer func { atomic.AddInt64 ¤tPoolMB, -int64 len pool chunkMB pool = nil runtime.GC } allocate := func bool { if atomic.LoadInt64 ¤tPoolMB +int64 chunkMB maxRAMPoolMB { log.Printf " Pastaay-Resource RAM ceiling %dMB reached, refusing new chunk %dMB ", maxRAMPoolMB, chunkMB return false } chunk := make byte, chunkSize // Page forcing: touch every 4096th byte for i := 0; i < chunkSize; i += 4096 { chunk i = 1 } pool = append pool, chunk atomic.AddInt64 ¤tPoolMB, int64 chunkMB return true } allocate for { select { case <-ctx.Done : return case <-ticker.C: allocate } } } This function embodies three engineering principles I consider non negotiable for any chaos tool: 1. Page forcing. make byte, chunkSize allocates virtual memory but doesn't commit physical pages. The OS uses demand paging, physical RAM is only assigned when a page is actually touched. By writing to every 4096th byte the standard page size on both x86 and ARM64 , we force the kernel to commit real, physical memory. Without this, your /proc/meminfo looks scary but the kernel hasn't actually allocated anything meaningful. 2. Global ceiling. maxRAMPoolMB is hard coded at 4096MB. I hardcoded this to 4GB because it’s a safe upper bound that prevents catastrophic node eviction OOMKill on standard 8GB/16GB cloud worker nodes, while still being devastating enough for any application-level test. No matter what the YAML says, the engine will never allocate more than 4GB across all RAM leaking policies combined. The check is an atomic load compare at the allocation site, if you're already at 3.9GB and request another 256MB chunk, the allocation is refused and logged. The guard module catches unrealistic values at validation time, but the runtime ceiling is the absolute last line of defense. 3. Guaranteed cleanup, the Amnesia Protocol. The defer block zeros the pool slice, subtracts the atomic counter, and triggers runtime.GC . When the context is cancelled because the experiment's TTL expired, you clicked the HALT button, or the policy was removed from the YAML , all memory returns to the OS. No restart required. No dangling allocations. The name "Amnesia Protocol" comes from the idea that the engine should forget it ever held that memory, complete, verifiable amnesia. The resource sabotage daemon MonitorAndTrigger polls policies every 2 seconds, comparing policy hashes to detect changes. This is a state machine with exactly two states: running with a cancel function and idle: func MonitorAndTrigger ctx context.Context, mgr config.Manager { ticker := time.NewTicker 2 time.Second defer ticker.Stop var lastResourceHash uint64 // tracks policy identity across reloads var activeCancel context.CancelFunc for { select { case <-ctx.Done : if activeCancel = nil { activeCancel // kill any running sabotage goroutines } return case <-ticker.C: policies := mgr.GetActivePolicies "resource" // Kill-switch: if all resource policies are removed from YAML, // immediately stop sabotage and release all memory. if len policies == 0 { if activeCancel = nil { activeCancel activeCancel = nil lastResourceHash = 0 } continue } // Compute combined hash of all active resource policies. // Only restart sabotage if policies actually changed. var combinedHash uint64 for , p := range policies { combinedHash = combinedHash<<1 | combinedHash 63 ^ p.PolicyHash } if combinedHash = lastResourceHash { if activeCancel = nil { activeCancel // kill old sabotage before starting new } lastResourceHash = combinedHash var chaosCtx context.Context chaosCtx, activeCancel = context.WithCancel ctx for , p := range policies { TriggerResourceSabotage chaosCtx, buildResourcePolicy p } } } } } The guard is not a gatekeeper, it doesn’t stop you from deploying dangerous policies. It’s an analysis tool that tells you exactly how dangerous your policies are before you deploy them. func Analyze cfg config.PastaayConfig PlanResult { res := PlanResult{Issues: make string, 0 } if cfg == nil || len cfg.Policies == 0 { return PlanResult{Status: "SAFE", Score: 0, TotalRisk: 0.0, Issues: string{}} } systemSurvival := 1.0 // multiplicative survival probability // Core guard: disabled default-ignored = 15% flat risk penalty if cfg.EnableDefaultIgnored { res.Issues = append res.Issues, "CORE GUARD DISABLED: System base vulnerability increased." systemSurvival = 0.85 } targets := make map string string // detect overlapping policies for , p := range cfg.Policies { // Skip no-op policies all probabilities zero, no resource effects if p.LatencyChance == 0 && p.ErrorChance == 0 && p.DropConnection && p.RAMChunkMB == 0 && p.ThrottleThreshold == 0 { continue } // Factor 1: Scope — how much of the system does this hit? // 0.4 = targeted endpoint, 1.0 = entire protocol layer scopeWeight := 0.4 if p.Target == "all" || p.Target == "database" || p.Target == " " || p.Target == "" { scopeWeight = 1.0 res.Issues = append res.Issues, fmt.Sprintf " %s Global Target: Exposes entire '%s' infrastructure layer.", p.Name, p.Type } // Factor 2: Collision — two policies hitting the same target // compound. Each overlap adds +0.2 to scope weight. key := p.Type + ":" + p.Target if orig, exists := targets key ; exists { res.Issues = append res.Issues, fmt.Sprintf " %s Collision: Overlaps with '%s'. Cascading failure probability increased.", p.Name, orig scopeWeight = min 1.0, scopeWeight+0.2 } targets key = p.Name // Factor 3: Fault severity — the worst single effect this policy can cause maxSeverity := 0.0 if p.DropConnection { maxSeverity = 1.0 // hard TCP drop = maximum severity res.Issues = append res.Issues, fmt.Sprintf " %s Hard TCP Drop: Triggers immediate network circuit-breakers.", p.Name } if p.ErrorChance 0 { maxSeverity = max maxSeverity, p.ErrorChance } if p.LatencyChance 0 { latMultiplier := 0.3 if p.LatencyDuration = 5 time.Second { latMultiplier = 1.0 // thread pool exhaustion territory res.Issues = append res.Issues, fmt.Sprintf " %s 5s+ Timeout: Causes severe thread pool exhaustion.", p.Name } else if p.LatencyDuration = 1 time.Second { latMultiplier = 0.6 } maxSeverity = max maxSeverity, p.LatencyChance latMultiplier } if p.Type == "resource" { resSeverity := 0.0 if p.RAMChunkMB = 1024 { resSeverity = 0.9 // OOM territory } else if p.RAMChunkMB = 256 { resSeverity = 0.5 } else if p.RAMChunkMB 0 { resSeverity = 0.2 } if p.ThrottleThreshold = 100000 { resSeverity = max resSeverity, 0.8 // near-100% CPU lock } maxSeverity = max maxSeverity, resSeverity } // Policy risk = severity × scope. Multiplicative survival model. policyRisk := maxSeverity scopeWeight systemSurvival = 1.0 - policyRisk } finalRisk := 1.0 - systemSurvival res.TotalRisk = finalRisk res.Score = int finalRisk 100.0 switch { case res.Score = 75: res.Status = "CRITICAL" case res.Score = 50: res.Status = "HIGH" case res.Score = 25: res.Status = "ELEVATED" default: res.Status = "SAFE" } return res } The scoring model is multiplicative survival analysis: each policy reduces the “survival probability” of the system by maxSeverity × scopeWeight . Policies with global scope and high severity compound dramatically. Two CRITICAL policies don't add, they multiply. The telemetry system uses a lock-free circular buffer that can handle concurrent writes from multiple goroutines one per protocol and concurrent reads from the web console polling endpoint: type LogEntry struct { Pod string json:"pod" Source string json:"source" Name string json:"name" Message string json:"msg" Ts int64 json:"ts" } var mu sync.RWMutex buf 256 LogEntry head int size int nodeName string func Emit source, name, msg string { mu.Lock defer mu.Unlock buf head = LogEntry{ Source: source, Name: name, Pod: source + "/" + name, Message: msg, Ts: time.Now .UnixMilli , } head = head + 1 % 256 if size < 256 { size++ } } func Snapshot LogEntry { mu.RLock defer mu.RUnlock out := make LogEntry, size start := head - size + 256 % 256 for i := 0; i < size; i++ { out i = buf start+i %256 } return out } The ring buffer holds 256 entries. When it fills, older entries are silently overwritten. This is intentional, the telemetry bus is not a persistence layer. If you need long term storage, pipe the logs to your existing observability stack via OpenTelemetry or scrape the Prometheus metrics. The EmitError and EmitInfo helpers automatically attach OpenTelemetry trace and span IDs when available, creating a direct link between chaos events in the log and spans in your tracing backend: func EmitError protocol, target, msg, payload string, span trace.Span { logData := map string interface{}{ "level": "ERROR", "protocol": protocol, "target": target, "message": msg, "payload": payload, } if span = nil && span.SpanContext .IsValid { logData "trace id" = span.SpanContext .TraceID .String logData "span id" = span.SpanContext .SpanID .String } jsonLog, := json.Marshal logData Emit nodeName, protocol, string jsonLog } This trace correlation was one of the best decisions I made. When you see “Kafka message dropped” in the web console journal, you can copy the trace ID, paste it into Jaeger, and see the entire request journey, which services it touched, where the fault was injected, and what the downstream effects were. The Oracle is the feature that gets the most attention, so let me explain exactly how it works. The problem: designing good chaos engineering policies requires SRE expertise. Most developers don’t know what failure modes to test, what probability thresholds make sense, or how to combine faults into realistic scenarios. An experienced SRE might spend hours designing a multi vector attack. Most teams never do it at all. The solution: give an LLM structured context about your running system and let it figure out the attack plan. But LLMs hallucinate. They invent YAML fields. They use range durations like 5s-15s which are syntactically valid YAML but semantically meaningless. They output error chance: 0 instead of omitting the field. They generate single policy tests when you need multi vector attacks. The system prompt for the Oracle is 40 lines of engineering specification that exist to constrain the LLM into producing only valid, deployable output. The actual system prompt, constructed in Go and sent to the LLM, looks like this: You are Pastaay Oracle, a Senior Site Reliability Engineering SRE AI. Analyze the provided telemetry and system configuration matrices. Your ONLY output must be a highly complex, devastating, multi layered Chaos Engineering blueprint in valid Pastaay V1 YAML wrapped in a markdown yaml block. CRITICAL DIRECTIVES: 1. Output ONLY valid Pastaay V1 YAML wrapped in a markdown yaml block using triple backticks and yaml specifier . NO conversational text. 2. DO NOT write single policy basic tests. Generate a Multi Vector Attack containing at least 3 concurrent policies. 3. INTENSITY LEVEL HIGH: Use severe probabilities 0.7-0.9 , extreme latency 3s-8s , and enable drop connection where appropriate. 4. STRICT SCHEMA RULES: - NEVER use ranges for durations e.g., '5s-15s' is ILLEGAL. Use exactly '5s' or '15s' . - NEVER invent types like 'multi'. Stick EXACTLY to the provided schema. - FATAL GUARD: For 'resource' policies, NEVER exceed ram chunk mb: 512. - CLEAN YAML RULE: NEVER output error chance: 0 or latency chance: 0. If a probability is 0, completely OMIT the field from the YAML. - RESOURCE TYPE RULE: For 'resource' policies, completely OMIT error chance, error code, latency chance, and drop connection. They are invalid for OS level sabotage. Notice the style: imperative, negative constraints, all caps enforcement words NEVER, ILLEGAL, FATAL GUARD, STRICT, ONLY . I went through four iterations of this prompt. Version 1 was conversational and polite. It produced garbage YAML. Version 2 added schema rules but was too permissive. Version 3 added negative constraints but still allowed ranges. Version 4 this one works reliably across all four LLM providers. I learned something important about prompt engineering: LLMs are not creative collaborators. They are literal instruction followers. If you leave room for interpretation, they will interpret, usually incorrectly. If you say “don’t use ranges,” they might still use them. If you say “NEVER use ranges, this is ILLEGAL,” they comply. The difference in output quality between “avoid X” and “NEVER do X. X is ILLEGAL” is dramatic. The Oracle doesn’t operate in a vacuum. Before calling the LLM, the engine constructs a telemetry matrix from live Prometheus data: finalPrompt := fmt.Sprintf "User Request: %s\n\n--- LIVE TELEMETRY MATRIX ---\n%s", userPrompt, sysContext, The sysContext contains: pastaay injected faults total by target and fault typeThis means if you type “stress the payment service,” the Oracle sees that payments is already receiving latency faults but not errors, and generates an attack plan that introduces error injection to the payment endpoints while adding latency to the notification service for a multi vector scenario. It’s not guessing, it’s making data informed decisions. The Oracle supports four providers through a unified interface. OpenAI and DeepSeek share a code path because both use OpenAI compatible APIs: func callLLM ctx context.Context, apiKey, model, url, sysPrompt, userPrompt, provider string string, error { // OpenAI and DeepSeek share the same API shape: Chat Completions. payload := map string interface{}{ "model": model, "messages": map string string{ {"role": "system", "content": sysPrompt}, {"role": "user", "content": userPrompt}, }, "temperature": 0.5, } req, err := buildJSONRequest ctx, http.MethodPost, url, payload if err = nil { return "", err } req.Header.Set "Authorization", "Bearer "+apiKey // standard OpenAI-style auth return executeRequest req, provider, func b byte string, error { var res struct { Choices struct { Message struct { Content string json:"content" } json:"message" } json:"choices" } if err := json.Unmarshal b, &res ; err = nil { return "", fmt.Errorf "%s decode: %w", provider, err } if len res.Choices == 0 { return "", fmt.Errorf "%s: no choices", provider } return res.Choices 0 .Message.Content, nil } } Gemini uses a fundamentally different API: func callGemini ctx context.Context, apiKey, model, sysPrompt, userPrompt string string, error { // Gemini uses a different API shape than OpenAI/Anthropic. // The key goes into a header, NOT the URL query string. // Query params leak into proxy logs and OTel spans. url := "https://generativelanguage.googleapis.com/v1beta/models/" + model + ":generateContent" // Gemini's system instruction is a top-level field, not a message role. payload := map string interface{}{ "system instruction": map string interface{}{ "parts": map string string{"text": sysPrompt}, }, "contents": map string interface{}{ {"parts": map string string{{"text": userPrompt}}}, }, "generationConfig": map string interface{}{"temperature": 0.5}, } req, err := buildJSONRequest ctx, http.MethodPost, url, payload if err = nil { return "", err } req.Header.Set "x-goog-api-key", apiKey // NOT ?key= in URL return executeRequest req, "gemini", func b byte string, error { var res struct { Candidates struct { Content struct { Parts struct{ Text string json:"text" } json:"parts" } json:"content" } json:"candidates" } if err := json.Unmarshal b, &res ; err = nil { return "", fmt.Errorf "gemini decode: %w", err } // Gemini wraps responses deeper than OpenAI. // Guard against empty responses from rate limits or safety filters. if len res.Candidates == 0 || len res.Candidates 0 .Content.Parts == 0 { return "", fmt.Errorf "gemini: no candidates" } return res.Candidates 0 .Content.Parts 0 .Text, nil } } Anthropic uses yet another format: func callAnthropic ctx context.Context, apiKey, model, sysPrompt, userPrompt string string, error { // Anthropic uses its own Messages API, different from both // OpenAI and Gemini. System prompt is a top-level field. payload := map string interface{}{ "model": model, "max tokens": 4096, "system": sysPrompt, "messages": map string string{{"role": "user", "content": userPrompt}}, } req, err := buildJSONRequest ctx, http.MethodPost, "https://api.anthropic.com/v1/messages", payload if err = nil { return "", err } req.Header.Set "x-api-key", apiKey req.Header.Set "anthropic-version", "2023-06-01" return executeRequest req, "anthropic", func b byte string, error { var res struct { Content struct{ Text string json:"text" } json:"content" } if err := json.Unmarshal b, &res ; err = nil { return "", fmt.Errorf "anthropic decode: %w", err } if len res.Content == 0 { return "", fmt.Errorf "anthropic: no content" } return res.Content 0 .Text, nil } } Each model has a carefully chosen default: DeepSeek R1 deepseek-reasoner for the Oracle's reasoning heavy workload, GPT-4o-mini for cost efficiency, Gemini 2.5 Flash for speed, and Claude Sonnet for Anthropic's ecosystem. There is no React. No Node.js. No npm. No webpack. No build step. The entire frontend is embedded in the Go binary: js //go:embed templates/ var TemplatesFS embed.FS //go:embed static/ var StaticFS embed.FS When pastaay starts, it registers routes that serve these embedded files directly from memory: func RegisterHandlers mux http.ServeMux, mgr config.Manager { tmpl := template.Must template.ParseFS TemplatesFS, "templates/ .html" mux.Handle "/static/", http.FileServer http.FS StaticFS mux.Handle "/console/docs/raw/", http.StripPrefix "/console/docs/raw/", http.FileServer http.FS docs.FS mux.HandleFunc "/console", func w http.ResponseWriter, r http.Request { renderHTML tmpl, w, "dashboard" } mux.HandleFunc "/console/docs", func w http.ResponseWriter, r http.Request { renderHTML tmpl, w, "docs" } mux.HandleFunc "/console/builder", func w http.ResponseWriter, r http.Request { renderHTML tmpl, w, "builder" } mux.HandleFunc "/console/api/state", requireConsoleToken adminToken, handleState mgr mux.HandleFunc "/console/api/oracle", requireConsoleToken adminToken, handleOracle mux.HandleFunc "/console/api/plan", requireConsoleToken adminToken, handlePlan mux.HandleFunc "/console/api/rollback", requireConsoleToken adminToken, handleRollback mgr mux.HandleFunc "/console/api/probe", requireConsoleToken adminToken, handleProbe mux.HandleFunc "/console/api/metrics", handleMetrics mux.HandleFunc "/console/api/discover", requireConsoleToken adminToken, handleDiscover } The authentication middleware uses constant time comparison to prevent timing attacks on the API token. If PASTAAY WEBHOOK TOKEN is set as an environment variable, every API call requires it in the X-Pastaay-Token header. If it's empty, the console is open access, which is fine for local development but should be configured in production. The dashboard grid is a proper drag and drop implementation. Each panel is a self contained widget that manages its own data fetching, rendering, and lifecycle. The layout persists in localStorage : js // From widget.js, layout persistence saveLayout { const order = ; this.grid.querySelectorAll ' data-widget-id ' .forEach el = { order.push el.dataset.widgetId ; } ; localStorage.setItem 'pastaay layout', JSON.stringify order ; } loadLayout { const saved = localStorage.getItem 'pastaay layout' ; if saved return; const order = JSON.parse saved ; order.forEach id = { const el = this.grid.querySelector data-widget-id="${id}" ; if el this.grid.appendChild el ; } ; } Widgets communicate with the engine through a REST API served by the same Go process. The state endpoint returns the current policy state, telemetry snapshot, and sensor health in a single JSON response. Widgets poll this endpoint at configurable intervals and update themselves. There’s no WebSocket, no Server Sent Events, no complex state management, just HTTP polling that works through any firewall. A closer look at two of the panels. The Global Fault Velocity chart tracks the total fault injection rate across all targets in real time: The Blast Radius Matrix breaks down errors, latency spikes, and dropped connections across the most targeted services: The resilience probe uses Apdex scoring. Apdex Application Performance Index classifies response times into three buckets relative to a configurable threshold T: The Apdex score = Satisfied + Tolerating/2 / Total. A score of 0.94+ means the target is healthy. Below 0.70 means it’s degrading. Below 0.50 means it’s failing. The probe sends HTTP requests through a server side proxy to bypass browser CORS restrictions: js func handleProbe w http.ResponseWriter, r http.Request { var req struct{ URL string json:"url" } json.NewDecoder r.Body .Decode &req start := time.Now // Server-side proxy, bypasses browser CORS restrictions. // The probe targets services the browser can't reach directly. client := &http.Client{Timeout: 10 time.Second} resp, err := client.Get req.URL elapsed := time.Since start .Milliseconds result := map string interface{}{"elapsed ms": elapsed} if err = nil { result "status" = 0 result "error" = err.Error } else { defer resp.Body.Close result "status" = resp.StatusCode } json.NewEncoder w .Encode result } The probe supports multi target round robin probing, EMA smoothed scoring, and clickable diagnostic popovers that explain each field in plain English. The web console can stream pod logs in real time. The backend uses the Kubernetes Watch API to detect new pods and attach log streams: // Watch for new pods and stream their logs in real time. w, err := clientset.CoreV1 .Pods namespace .Watch ctx, metav1.ListOptions{} for event := range w.ResultChan { pod, ok := event.Object. corev1.Pod if ok { continue } switch event.Type { case watch.Added, watch.Modified: // Only attach to Running pods, pending/terminating pods ignored. if pod.Status.Phase == corev1.PodRunning { streamCtx, cancel := context.WithCancel ctx activeStreams pod.Name = cancel // store cancel for cleanup go startKubeLogStreamer streamCtx, clientset, namespace, pod.Name } case watch.Deleted: // Pod removed — cancel its log stream to free resources. if cancel, exists := activeStreams pod.Name ; exists { cancel delete activeStreams, pod.Name } } } Log streams use the Kubernetes Pod Log API with Follow: true . Each line is emitted to the telemetry bus, which feeds the System Output Journal panel in the web console. The journal supports hierarchical filtering Pod → Protocol → Method , text search, live/pause toggle, and clic -to decrypt payload tracing. The watch loop reconnects on failure with exponential backoff, capping at 30 seconds. If the watch channel closes which happens periodically , the loop restarts the watch and reattaches to any running pods. Here is the full console in action: The pastaay Kubernetes operator manages chaos policies as Custom Resources. When you apply a ChaosPolicy CRD to your cluster, the operator detects it, translates it into the engine's configuration format, and pushes it to the target pods through the webhook API: apiVersion: chaos.pastaay.io/v1 kind: ChaosPolicy metadata: name: payment-service-stress spec: policies: - name: http-latency-spike type: http target: /api/payments/ latency chance: 0.8 latency duration: 3s - name: db-connection-drop type: sql target: database drop connection: true error chance: 0.3 The operator controller watches for CRD changes and reconciles the desired state against the engine. It also handles the RBAC necessary for the log streaming feature, a separate ClusterRole that grants pod log read access. IMAGE: docs/assets/operator header.png, the operator architecture header Every policy lookup in pastaay is zero allocation. 1. Atomic pointers for configuration. The policy cache is behind an atomic.Pointer . Reading it requires no lock. The pointer swap during updates takes microseconds and doesn't block readers. 2. Type indexed slice cache. Instead of iterating all policies on every request, policies are pre grouped by protocol type. An HTTP request only checks HTTP policies. A Kafka message only checks Kafka policies. The map lookup is O 1 . 3. Lock free PCG random number generator. Go’s math/rand/v2 uses the PCG Permuted Congruential Generator algorithm, which is lock free and allocation free. Comparing rand.Float64 < p.ErrorChance is a single atomic operation. 4. No intermediate allocations. The policy struct is stack allocated. The path matching uses only stack variables. The context based delay uses time.NewTimer which pre allocates its internal structures. No fmt.Sprintf , no strings.Builder , no temporary slices. 5. OpenTelemetry BatchSpanProcessor. Spans are batched and flushed asynchronously. If your tracing backend is slow or offline, chaos spans queue up in memory and flush later. The critical path injecting the fault never blocks on span export. This section is the part you won’t find in the README or the documentation. These are the things that cost me days of debugging, rewrites, and staring at stack traces at 2 AM. I spent three days debugging why the watcher stopped working after a Kubernetes ConfigMap update. The ConfigMap gets mounted as a symlink to a directory containing the file. When the ConfigMap changes, Kubernetes replaces the symlink target, which means the old file’s inode is deleted and a new one appears. fsnotify sees this as RENAME followed by nothing, because the watcher was attached to the old inode. The fix took an hour to code and three days to discover was necessary. I learned more about Linux filesystems from this one bug than from any course or book. Actual knowledge: fsnotify watches inodes, not paths. Vim writes to a temp file and renames. sed -i does the same. Kubernetes ConfigMaps create symlink forests. Your watcher needs to handle all of these, or your tool silently stops working. My first Oracle prompt was friendly. “Please generate a chaos engineering YAML configuration.” The output: a single HTTP policy with error chance: 0 and a conversational preamble thanking me for the opportunity. I tried making it more specific. “Generate a multi vector attack with at least 3 policies.” The output: three policies, but one had type multi not a valid type , one used latency duration: 5s-15s range, not a valid duration , and one had ram chunk mb: 2048 which the guard would reject. The working prompt is imperatives and prohibitions. “NEVER use ranges. NEVER invent types. output chance: 0 is ILLEGAL. OMIT the field entirely.” The difference in output quality between “please avoid” and “NEVER do X” is the difference between a working tool and a demo that crashes. This is general advice for anyone building LLM powered tools: don’t be polite. Be precise. Negative constraints are more powerful than positive ones. Tell the model what NOT to do, and exactly why it’s wrong. The first version of the policy engine used sync.Map and allocated a new slice on every lookup. It worked. At 100 requests per second, nobody noticed. At 10,000 requests per second, the garbage collector woke up every 50 milliseconds and ate 30% of CPU. Rewriting to use atomic.Pointer and type indexed slices took two weeks. The first week was figuring out what to change. The second week was discovering all the places that implicitly allocated, a fmt.Sprintf here, a strings.Join there, a temporary map created for header matching. The lesson: design for zero allocation from the start. Adding it later is refactoring your entire hot path. The success rate of that kind of refactoring is low. Mine worked because pastaay was small enough at that point. For a larger codebase, it would have been impossible. I almost cut the OpenTelemetry integration from scope. It seemed like a “nice to have” that would add weeks of work for marginal benefit. I was wrong. The first time I traced a single HTTP request through latency injection, to a Kafka consumer that dropped the message, to a Redis hook that returned a timeout, all visible in a single Jaeger trace tree, the entire value proposition of distributed tracing clicked. Chaos engineering without tracing is guesswork. You inject a fault and observe the outcome, but you don’t know the causal chain. Did the payment service fail because of the latency, or because the notification service dropped a message that the payment service was waiting for? Without tracing, you can’t tell. With trace correlation, every chaos event has a span ID that links it to the specific request journey it affected. Pastaay is not finished. Here’s what’s on the roadmap: CEL Driven Rule Engine. Google’s Common Expression Language compiles expression strings to ASTs and evaluates them with zero allocation overhead. Imagine writing policies like: policies: - name: conditional-latency type: http target: /api/ condition: "request.header 'X-Priority' == 'low' && error chance 0.5" latency duration: 5s The condition is compiled once at policy load time and evaluated on every request inline with the existing interceptor. This turns pastaay from a probability based chaos tool into a truly programmable chaos platform. Trace Aware Injection. Using OpenTelemetry Baggage to propagate chaos decisions across service boundaries. If a request enters your system with a specific baggage header, pastaay can inject faults based on the complete distributed request journey, targeting specific end to end transaction flows rather than individual services. Community Interceptors. The interceptor pattern is designed to be extensible. Adding a new protocol requires implementing a single interface. I’d like to see community contributed interceptors for Cassandra, Elasticsearch, S3, and whatever else people are running. If you’ve read this far, you now understand pastaay better than most people who will ever use it. This article is my attempt to document not just what the code does, but why it does it that way, the constraints, the failed approaches, the late night realizations. Pastaay is at github.com/CemAkan/pastaay https://github.com/CemAkan/pastaay . It’s a single Go binary. It has no external runtime dependencies. It works on macOS, Linux, and anywhere Go compiles. Everything in this article is backed by real, working, tested code. I built this alone, from scratch, during my senior year. If you try it and something breaks, open an issue. If you build an interceptor for a protocol I haven’t covered, open a PR. If you use it to find a bug in production before your users do, that’s exactly why it exists.