Why We Put AI Agents in a Group Chat?

wpnews.pro

Most teams building with LLMs eventually hit the same wall. You have a handful of agents, each one reasonably capable at its specific task, but they cannot talk to each other. One agent handles data extraction, another does summarisation, a third manages scheduling, yet the only thing connecting them is a human copying and pasting between browser tabs.

We ran into this problem repeatedly while deploying AI agents inside enterprise environments at Mininglamp Technology. The agents were good at what they did individually, but the coordination overhead landed squarely on people. After a while, using agents felt more exhausting than not using them, because there were simply more contexts to keep track of.

This article is about a different approach we took, one that treats instant messaging protocols not as a UI layer bolted on top of agent infrastructure, but as the foundational fabric for agent distribution, orchestration, and permission management.

Most agent frameworks optimise for a single metric: how well one agent performs in isolation. Bigger context windows, more accurate tool calls, faster inference. These are all valuable, but they miss a structural limitation that only becomes visible in multi-person, multi-step enterprise workflows.

An agent's context is bounded by its own conversation thread. It has no native way to know what happened in another agent's thread, what decision a colleague made five minutes ago, or whether the data it is about to process has already been flagged as unreliable upstream. We call this the session island problem, and it gets worse as the number of agents in an organisation grows.

Picture a typical scenario. A marketing team member drops a pricing question into a group channel. Answering it properly requires product specs from the engineering team's agent, cost models from the finance team's agent, and competitive intelligence from the sales team's agent. Under the current paradigm, someone has to manually route information between these agents, adding context at each handoff. The agents are powerful; the connective tissue between them is duct tape.

Our first instinct, like that of most engineering teams, was to build an API gateway. Agents expose endpoints, other agents call those endpoints, and the gateway handles routing and authentication. It works, but three problems surfaced quickly during implementation.

The first is state management. API calls are inherently stateless: one request, one response, done. But agent-to-agent collaboration in enterprise settings is often a long-running process. A single task might span several hours, involve multiple rounds of interaction, require human approval at certain checkpoints, and generate intermediate artifacts that other agents need to reference. Building this on top of a stateless API layer means reinventing message queues, state machines, and event subscription systems from scratch.

The second is observability. When something goes wrong in an API-based agent mesh, you end up chasing logs across half a dozen services, trying to reconstruct the sequence of events from fragmented traces. In an IM-based architecture, the group chat itself is the complete audit trail: who said what and when, what decision was reached, and what was delivered are all there in a single chronological thread.

The third is permission management. API gateways typically handle access control through tokens and ACLs, which are coarse-grained and tedious to configure. In a group chat model, information visibility maps directly onto group membership. Add someone to a group and they can see everything in it; remove them and they lose access. No separate permission matrix to maintain.
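A minimal Python sketch of that last point, in which group membership is the only access check. The `Group` class and every name in it are illustrative assumptions, not Octo's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A chat group whose member set doubles as its access-control list."""
    name: str
    members: set[str] = field(default_factory=set)
    messages: list[str] = field(default_factory=list)

    def _require_member(self, user: str) -> None:
        if user not in self.members:
            raise PermissionError(f"{user} is not a member of {self.name}")

    def post(self, sender: str, text: str) -> None:
        self._require_member(sender)
        self.messages.append(f"{sender}: {text}")

    def read(self, reader: str) -> list[str]:
        # Visibility is decided by membership alone; no separate matrix exists.
        self._require_member(reader)
        return list(self.messages)


group = Group("pricing-review", members={"alice", "finance-agent"})
group.post("finance-agent", "Cost model attached.")

group.members.add("bob")            # adding bob exposes the whole thread to him
visible_to_bob = group.read("bob")
group.members.discard("bob")        # removing bob revokes that access again
```

After the last line, any further `group.read("bob")` raises `PermissionError`, which is the whole permission model in this toy version.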

These observations pushed us toward a fundamentally different design.

Octo is an open-source platform we built around this insight. At its core is octo-server, a Go backend that simultaneously handles REST APIs, WebSocket connections, and IM message routing through WuKongIM. The key design decision is that agents (which we call Lobsters in Octo) are not external services invoked through webhooks. They are participants in conversations, with the same messaging capabilities as human users.

When a Lobster agent joins a group chat, it receives the full conversation context (chat history, member roster, read receipts), not just a trigger event. The agent can proactively send messages, reply to specific people, notify relevant parties when a task completes, or ask follow-up questions when it needs more information.

The request processing pipeline in octo-server follows a clear sequence. First, it authenticates the request source (supporting tokens, cookies, and DH-encrypted WebSocket frames). Second, it authorises the request with organisation-aware RBAC and per-channel ACL checks. Third, it executes the business logic, which may involve spawning or resuming a Lobster agent session. Fourth, it fans out the resulting message through WuKongIM to all group members, triggering adapters if the channel requires an external bridge. Finally, it returns a unified JSON response with tracing and metrics tags.
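The five steps can be sketched end to end in a few lines of Python. Everything here is a hypothetical stand-in: the `Server` and `Request` classes, the in-memory tables and the role names are assumptions for illustration only, and agent sessions, adapters, WuKongIM itself and the tracing tags are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    token: str
    channel: str
    text: str


@dataclass
class Server:
    # Hypothetical in-memory tables standing in for real stores.
    tokens: dict[str, str]              # token -> user id
    org_roles: dict[str, str]           # user id -> organisation role
    channel_acl: dict[str, set[str]]    # channel -> users allowed in it
    inboxes: dict[str, list[str]] = field(default_factory=dict)

    def handle(self, req: Request) -> dict:
        # 1. Authenticate the request source.
        user = self.tokens.get(req.token)
        if user is None:
            return {"ok": False, "error": "unauthenticated"}
        # 2. Authorise: organisation-level check, then channel-level check.
        if self.org_roles.get(user) not in {"member", "admin"}:
            return {"ok": False, "error": "forbidden: organisation"}
        members = self.channel_acl.get(req.channel, set())
        if user not in members:
            return {"ok": False, "error": "forbidden: channel"}
        # 3. Execute the business logic (here: just build the message).
        message = f"[{req.channel}] {user}: {req.text}"
        # 4. Fan out to every member of the channel.
        for member in sorted(members):
            self.inboxes.setdefault(member, []).append(message)
        # 5. Return one uniform response shape.
        return {"ok": True, "delivered_to": sorted(members)}


server = Server(
    tokens={"t-alice": "alice"},
    org_roles={"alice": "member"},
    channel_acl={"pricing": {"alice", "finance-agent"}},
)
accepted = server.handle(Request("t-alice", "pricing", "What is the unit cost?"))
rejected = server.handle(Request("t-unknown", "pricing", "hello"))
```

The point of the ordering is that nothing is fanned out until both checks have passed, so a rejected request leaves every inbox untouched.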

One important side effect of this design is message ordering. The IM layer places the messages of a channel in a single chronological sequence, so every agent sees the same conversation history in the same order. For multi-step collaborative tasks, this shared ordering is a significantly more reliable basis than the independent request-response calls you would get from a traditional API mesh.
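One common way to get this property is to let the server stamp each message with a per-channel sequence number. The sketch below assumes that mechanism; the `Channel` class is hypothetical and does not describe WuKongIM's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Append-only log; the server, not the sender, assigns the order."""
    log: list[tuple[int, str, str]] = field(default_factory=list)

    def append(self, sender: str, text: str) -> int:
        seq = len(self.log) + 1
        self.log.append((seq, sender, text))
        return seq

    def since(self, last_seen: int) -> list[tuple[int, str, str]]:
        # Everything after the caller's last acknowledged sequence number.
        return [entry for entry in self.log if entry[0] > last_seen]


channel = Channel()
channel.append("alice", "Can we discount the Pro tier?")
channel.append("finance-agent", "Checking the margin now.")
channel.append("sales-agent", "Pulling competitor pricing.")

# Two agents that last synced at different points replay the same order.
finance_view = channel.since(0)
sales_view = channel.since(1)
```

Because each reader only tracks the last sequence number it has seen, an agent that reconnects late catches up by asking for everything after that number.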

Improving a single agent's capabilities is scaling up: bigger models, longer contexts, more tools. What enterprises actually need, though, is scaling out: organising multiple specialised agents to collaborate on tasks that no single agent can handle alone.

The analogy is straightforward. One person working overtime can only produce so much. A well-coordinated team of ten can dramatically outperform that individual. The challenge in the agent domain has always been that "well-coordinated" requires infrastructure that did not exist.

Octo's group mechanism fills this gap. You create a group for each project or workstream, add the relevant agents and humans, and let them collaborate through the same messaging interface. Agents in a group behave essentially the same way as human members do: they send messages, reply to threads, mark tasks complete, and request confirmation. Every interaction is automatically recorded as chat history, so newcomers (human or agent) can read through the backlog to get up to speed.
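Reading the backlog to catch up can be made concrete as replaying events through a small state machine. This is a sketch under stated assumptions: the event names, the states and the `replay` helper are all hypothetical and are not part of Octo.

```python
# Hypothetical task lifecycle: (current state, event) -> next state.
TRANSITIONS = {
    ("open", "draft_ready"): "awaiting_approval",
    ("awaiting_approval", "approved"): "done",
    ("awaiting_approval", "rejected"): "open",
}


def replay(history: list[tuple[str, str]]) -> str:
    """Derive the current task state from (sender, event) pairs in chat order."""
    state = "open"
    for _sender, event in history:
        # Events that do not apply in the current state are ignored.
        state = TRANSITIONS.get((state, event), state)
    return state


backlog = [
    ("writer-agent", "draft_ready"),
    ("alice", "rejected"),
    ("writer-agent", "draft_ready"),
    ("alice", "approved"),
]
current = replay(backlog)
```

An agent added to the group halfway through needs no separate state store in this model; replaying the first three entries tells it the task is waiting for approval.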

We also built octo-smart-summary, a service that uses LLMs to periodically summarise group chat content, extracting key decisions, action items, and open questions. This addresses another common pain point: chat channels accumulate information fast, and late joiners struggle to find what matters.

┌─────────────────────────────────────────────────┐
│                OCTO Architecture                │
│                                                 │
│  ┌────────┐  ┌────────┐  ┌────────────┐         │
│  │octo-web│  │octo-ios│  │octo-android│         │
│  │(React) │  │(Swift) │  │  (Kotlin)  │         │
│  └───┬────┘  └───┬────┘  └─────┬──────┘         │
│      └───────────┼─────────────┘                │
│                  ▼                              │
│        ┌──────────────────┐                     │
│        │   octo-server    │                     │
│        │   (Go Backend)   │                     │
│        │  · REST + WS     │                     │
│        │  · Lobster sched │                     │
│        │  · WuKongIM ctrl │                     │
│        └──┬────┬────┬─────┘                     │
│           │    │    │                           │
│    ┌──────┘    │    └──────┐                    │
│    ▼           ▼           ▼                    │
│  ┌──────┐  ┌───────┐  ┌──────────┐              │
│  │octo- │  │smart- │  │  octo-   │              │
│  │matter│  │summary│  │ adapters │              │
│  └──────┘  └───────┘  └──────────┘              │
└─────────────────────────────────────────────────┘

Enterprise deployments demand precise control over what agents can access, what operations they can perform, and who can see their output. Octo ties the permission model directly to the group model.

Every channel in Octo has its own ACL configuration. When an agent joins a group, it inherits that group's permission settings, and everything the agent produces within the group is subject to the same access constraints. There is no separate permission system to maintain for agents; the group's access control is the agent's access control.

For edge-agent scenarios, where computation happens on a user's local device (think of something like Mano-P, a GUI agent that runs natively on Apple Silicon), Octo's adapter mechanism can bridge execution results securely into IM groups. The actual computation and data never leave the device; only the necessary summaries and status updates flow into the group conversation.
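A rough sketch of what such a bridge has to guarantee: only an explicit allow-list of fields ever leaves the device. The field names and the `to_group_message` helper are assumptions made for illustration; the source does not describe the adapter's actual interface.

```python
# Hypothetical allow-list: the only fields permitted to leave the device.
SHAREABLE_FIELDS = ("task_id", "status", "summary")


def to_group_message(local_result: dict) -> dict:
    """Project a local execution result onto the fields that are safe to share."""
    return {key: local_result[key] for key in SHAREABLE_FIELDS if key in local_result}


local_result = {
    "task_id": "t-17",
    "status": "done",
    "summary": "Exported the quarterly report to PDF.",
    "screenshots": [b"<raw image bytes>"],           # stays on the device
    "file_paths": ["/Users/alice/reports/q3.pdf"],   # stays on the device
}
outgoing = to_group_message(local_result)
```

An allow-list is used rather than a deny-list so that any field added to the local result later is withheld by default.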

The authentication layer supports multiple mechanisms: traditional tokens, cookies, and Diffie-Hellman encrypted WebSocket frames. The authorisation layer implements organisation-aware RBAC, so every request passes through both organisation-level and channel-level permission checks before execution.

Plugging agents into an IM system is not as simple as adding another consumer to a message queue. Several engineering challenges deserve attention.

Message format unification. Agent output is more structured than human messages. You need a consistent envelope that distinguishes between plain text, tool call results, status updates, and error reports. Octo defines a unified message schema that all agent output passes through before entering the IM channel.
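The source does not show Octo's schema, so the following is only a sketch of what such an envelope might look like. The `Envelope` and `Kind` names and the three fields are assumptions; the four kinds mirror the categories listed above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Kind(str, Enum):
    TEXT = "text"
    TOOL_RESULT = "tool_result"
    STATUS = "status"
    ERROR = "error"


@dataclass
class Envelope:
    sender: str
    kind: Kind
    body: dict

    def to_json(self) -> str:
        payload = {"sender": self.sender, "kind": self.kind.value, "body": self.body}
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        data = json.loads(raw)
        # Kind(...) raises ValueError for any kind outside the schema.
        return cls(sender=data["sender"], kind=Kind(data["kind"]), body=data["body"])


wire = Envelope("finance-agent", Kind.TOOL_RESULT, {"tool": "cost_model", "ok": True}).to_json()
decoded = Envelope.from_json(wire)
```

Rejecting unknown kinds at the boundary means clients only ever have to render the four cases they know about.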

Concurrency conflicts. Multiple agents might operate on the same group simultaneously, replying to the same message at the same time or updating the same task status. octo-server handles message ordering and conflict detection at the server level to prevent logical contradictions in the chat.
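The article does not say how octo-server detects conflicts. One widely used technique is an optimistic version check, sketched here as an assumption rather than as Octo's implementation; the `Task` and `Conflict` names are hypothetical.

```python
from dataclasses import dataclass


class Conflict(Exception):
    """Raised when an update was based on a stale view of the task."""


@dataclass
class Task:
    status: str = "open"
    version: int = 0

    def update(self, new_status: str, seen_version: int) -> int:
        # Accept the write only if the caller saw the latest version.
        if seen_version != self.version:
            raise Conflict(f"expected version {self.version}, got {seen_version}")
        self.status = new_status
        self.version += 1
        return self.version


task = Task()
seen_by_a = seen_by_b = task.version       # both agents read version 0
task.update("in_progress", seen_by_a)      # agent A's write is applied
try:
    task.update("done", seen_by_b)         # agent B's view is now stale
    outcome = "applied"
except Conflict:
    outcome = "conflict"
```

The rejected agent can re-read the task, see that it is already in progress, and decide whether its update still makes sense.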

Context management. Agent context windows are finite, but group histories can grow indefinitely. This is where octo-smart-summary plays a critical role. It periodically compresses chat history into summaries, allowing agents to read the digest first and drill into specific segments only when needed.
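The digest-first idea can be sketched as a simple budget allocation: the summary goes in first, then as many of the newest raw messages as still fit. The `build_context` function is hypothetical, and a whitespace word count stands in for a real tokenizer; drilling into older segments on demand is not shown.

```python
def cost(text: str) -> int:
    # Whitespace word count as a crude stand-in for a tokenizer.
    return len(text.split())


def build_context(digest: str, recent: list[str], budget: int) -> list[str]:
    """Return the digest followed by the newest messages that fit the budget."""
    remaining = budget - cost(digest)
    kept: list[str] = []
    for message in reversed(recent):        # walk from newest to oldest
        if cost(message) > remaining:
            break
        kept.append(message)
        remaining -= cost(message)
    return [digest] + kept[::-1]            # restore chronological order


digest = "Summary: pricing question open; finance owes cost model."
recent = [
    "alice: any update?",
    "finance-agent: cost model uploaded",
    "bob: thanks, reviewing now",
]
context = build_context(digest, recent, budget=16)
```

With a budget of 16 words, the digest takes 8 and the two newest messages take 4 each, so the oldest message is the one that gets dropped.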

The entire Octo project is open-sourced under Apache 2.0 on GitHub (github.com/Mininglamp-OSS/octo-server), including the backend, web client, mobile apps, admin console, and all supporting tools. We adopted a local-first design philosophy throughout; chat records, vector indices, and agent execution can all run on the user's own infrastructure, with cloud deployment as an option rather than a requirement.

If you are thinking about how to move AI agents from isolated tools to genuine nodes in an enterprise collaboration network, we would love to hear from you. Whether it is a GitHub star, an issue report, or a feature request from your own use case, every piece of feedback helps us build this infrastructure better.

Source: dev.to (original article)