How to analyze duplicate processing in an async flow


In one line: deduplication is about the evidence that a side effect has been applied. Make it atomic with the effect, visible to everyone, and tied to a unique identifier.

This is just how I think it through. It is not a tutorial and not the final answer. I'm sharing my reasoning, and I'd love to hear where it breaks.

Wait a minute, please. Let's step back from this topic and not get stuck on specific tech solutions.

So, my questions:

By the way, I tried to enumerate all the failure scenarios the typical implementations aim to prevent, but I gave up — there are too many possibilities to list them all. So I'll start from the essence of duplicate processing instead.

So, what is duplicate processing at its core?

It's not about how many times a message is delivered. It's about how many times the side effect is applied.

So at its root: duplicate processing means the same logical intent has its side effect applied more than once.

And one more thing: even when duplication happens, it only causes damage if the side effect is not idempotent. An idempotent side effect makes a duplicate harmless — but still wasteful, and real business logic is often hard to make idempotent. So idempotency isn't our goal here; the discussion below does not assume it.
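To make the distinction concrete, here is a minimal sketch of the same logical intent delivered twice. The class, method, and key names (DuplicateEffectDemo, addCredit, setStatus, "acct-1", "order-1") are hypothetical and exist only for illustration; the maps stand in for whatever state the real side effect touches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; nothing here comes from a real library beyond java.util.
class DuplicateEffectDemo {

    // Non-idempotent side effect: every application changes the balance again.
    static void addCredit(Map<String, Integer> balances, String account, int amount) {
        balances.merge(account, amount, Integer::sum);
    }

    // Idempotent side effect: applying it twice leaves the same state as applying it once.
    static void setStatus(Map<String, String> statuses, String orderId, String status) {
        statuses.put(orderId, status);
    }

    public static void main(String[] args) {
        Map<String, Integer> balances = new HashMap<>();
        Map<String, String> statuses = new HashMap<>();

        // One logical intent, delivered twice, with no deduplication in place.
        for (int delivery = 0; delivery < 2; delivery++) {
            addCredit(balances, "acct-1", 10);
            setStatus(statuses, "order-1", "PAID");
        }

        System.out.println(balances.get("acct-1"));  // prints 20: the duplicate did damage
        System.out.println(statuses.get("order-1")); // prints PAID: harmless, but the work was repeated
    }
}
```

Both effects were applied twice. Only the non-idempotent one shows it in the final state, which is why the rest of the reasoning counts applications of the effect rather than deliveries.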

Now, instead of jumping to solutions, let's think the other way around: under what conditions does the side effect get applied more than once?

Here is what I think — it happens if any of these is true:

Notice something: each condition above is just a way the guarantee breaks. So if we flip them around, we get the boundaries that guarantee non-duplication.

Flipping the failure conditions, here are the boundaries:

After that, we can decide what the orchestrator and collaborators look like.

Collaborators:

Orchestrator:

public class ConsumerHandler {

    private final EvidenceChecker checker;
    private final SideEffectHandler handler;
    private final OffsetCommitter committer;

    public ConsumerHandler(EvidenceChecker checker,
                           SideEffectHandler handler,
                           OffsetCommitter committer) {
        this.checker = checker;
        this.handler = handler;
        this.committer = committer;
    }

    public void consume(Message message, CommitHandle handle) {
        log.info(...);

        // Has the side effect for this identifier already been applied?
        boolean alreadyApplied = checker.check(message.identifierKey());

        // Already applied: skip the work, just commit and return.
        if (alreadyApplied) {
            committer.commit(handle);
            return;
        }

        handler.handle(message, msg -> {
            // Within the same atomic boundary:
            // 1. apply the side effect (business logic)
            // 2. write the evidence record for this identifier
        });

        // Confirm the message through the handle that was delivered with it.
        committer.commit(handle);
    }
}
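The collaborator interfaces are named in the orchestrator but their definitions are not shown, so the following is only one possible shape, reduced to a single-threaded, in-memory sketch. The signatures, the InMemoryWiring class, the counters, and the key "order-42" are all assumptions made for illustration, and the commit handle is left out to keep it short.

```java
import java.util.HashSet;
import java.util.Set;

// In-memory wiring of the three collaborators. Single-threaded on purpose:
// it shows the control flow only, not a real atomic boundary.
class InMemoryWiring {
    final Set<String> evidence = new HashSet<>();
    int effects = 0;
    int commits = 0;

    final EvidenceChecker checker = evidence::contains;

    // Stand-in for "apply the effect and write the evidence together".
    final SideEffectHandler handler = (key, businessLogic) -> {
        businessLogic.run();
        evidence.add(key);
    };

    final OffsetCommitter committer = () -> commits++;

    void consume(String identifierKey) {
        if (checker.check(identifierKey)) {
            committer.commit(); // already applied: only confirm the message
            return;
        }
        handler.handle(identifierKey, () -> effects++);
        committer.commit();
    }

    public static void main(String[] args) {
        InMemoryWiring wiring = new InMemoryWiring();
        wiring.consume("order-42"); // first delivery
        wiring.consume("order-42"); // redelivery of the same logical intent
        System.out.println(wiring.effects); // prints 1
        System.out.println(wiring.commits); // prints 2
    }
}

// Hypothetical collaborator shapes; the signatures are assumptions.
interface EvidenceChecker {
    boolean check(String identifierKey);
}

interface SideEffectHandler {
    void handle(String identifierKey, Runnable businessLogic);
}

interface OffsetCommitter {
    void commit();
}
```

The redelivery is still committed, so the queue can move on, but the effect counter stays at one.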

Take care of this:

- We haven't discussed any concrete tech (RDBMS, Redis) yet.
- The early alreadyApplied check is a performance optimization, not a correctness guarantee. Even with an idempotent side effect, reprocessing a duplicate still wastes resources (CPU, DB calls, external requests), so the check lets us skip that work and return fast. But it does NOT prevent duplication itself: a check-then-act still has a race window. The real guarantee comes from the unique constraint when the evidence record is written atomically.
- No matter what MQ we use (Kafka, RabbitMQ, or something else), the consumer always needs the message and a way to commit/confirm it; otherwise it can't know the message was consumed. That's why consume(Message message, CommitHandle handle) is written like this.
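The race window and the role of the unique constraint can be sketched without picking a technology. In the example below, a ConcurrentHashMap stands in for an evidence table with a unique key, and computeIfAbsent stands in for "apply the effect and write the evidence in one atomic step". That mapping is an assumption made for illustration, as are the names AtomicEvidenceSketch, applyOnce, and runConcurrently. A real system would need the same property from its own store.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: an in-memory stand-in, not a production deduplication store.
class AtomicEvidenceSketch {

    // Stand-in for an evidence table with a unique constraint on the identifier.
    private final ConcurrentHashMap<String, Boolean> evidence = new ConcurrentHashMap<>();
    private final AtomicInteger effectCount = new AtomicInteger();

    // Returns true if this call applied the effect, false if evidence already existed.
    boolean applyOnce(String identifierKey) {
        boolean[] applied = {false};
        // ConcurrentHashMap runs the mapping function atomically and only if the key is absent.
        evidence.computeIfAbsent(identifierKey, key -> {
            effectCount.incrementAndGet(); // the side effect
            applied[0] = true;
            return Boolean.TRUE;           // the evidence record
        });
        return applied[0];
    }

    int effectCount() {
        return effectCount.get();
    }

    // Simulates several consumers receiving the same logical intent at the same time.
    static int runConcurrently(int consumers, String identifierKey) {
        AtomicEvidenceSketch sketch = new AtomicEvidenceSketch();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> sketch.applyOnce(identifierKey));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.effectCount();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(8, "order-42")); // prints 1
    }
}
```

A separate contains-then-put sequence would leave a gap between the check and the write, and two consumers could both pass the check. Here there is no such gap, and if the effect throws, no evidence is recorded either.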

So far everything is still technology-agnostic — we went from the essence, to the failure conditions, to the boundaries, and finally to an abstract collaboration model. No Redis, no RDBMS yet.

The abstract model is clean. Reality usually isn't. The handler.handle(...) above still treats business logic as a black box, and that box might not be simple. When the side effect is more than one step, what does its evidence record look like then?

So I'll leave it here as a question:

What problems do you think are still hiding? What would you have to design or reason about next? If you have any comments that would help refine this post, feel free to share them. Thanks in advance.

The point of this post: find the boundaries first, and every later solution has a place to fit.

Source: dev.to (original article)
