[ARTICLE · art-23569] src=dev.to pub= topic=ai-research verified=true sentiment=↑ positive

We're still the only one to hit #1 on both LoCoMo and LongMemEval. Here is how to use it.

Backboard has achieved the top position on both the LoCoMo and LongMemEval benchmarks for long-term AI memory, a feat no other system has matched without modifying the original evaluation guidelines. The company attributes its success to message-level memory architecture that builds and retrieves facts during conversations, rather than relying on larger context windows that degrade over long horizons. Backboard's memory feature is available as a single parameter, `memory="Auto"`, which stores and recalls facts across conversations using the same `assistant_id`.

4 min read · published Jun 6, 2026

Backboard is #1 on LoCoMo and LongMemEval, the two academic benchmarks for long-term AI memory, without changing the original guidelines. Other companies have gamed the benchmarks by using newer models with bigger context windows. This post explains why the result matters anyway, what it actually measures, and how to use the memory that earned it.

These are not "find a fact in a wall of text" tests. They measure whether a system can build, maintain, and reason over memory across many conversations.

LoCoMo (Long-term Conversational Memory) evaluates very long-term memory over multi-session dialogues that span weeks. It tests single-session recall, cross-session reasoning, temporal reasoning, outside knowledge, and adversarial questions.

LongMemEval scores five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates (noticing when a fact about the user changes), and abstention (knowing when it does not know). Its own paper reports that commercial assistants and long-context models lose around 30% accuracy on sustained memory.

That last point is the whole story.

A few honest notes about the result.

We are still #1 on the original academic benchmarks. Other systems have since posted high numbers too, but they got there by pointing a stronger model at the problem and leaning on ever-larger context windows. At the top, everyone is near the ceiling of what these tests can even measure, so the raw number stops being interesting. What is interesting is how you got there.

The difference is where the work happens. We solve memory at the message level. Memory is built as the conversation happens, fact by fact, then retrieved when relevant. We do not stuff a giant context window to paper over a memory architecture that cannot actually remember. A bigger context window is brute force, and the benchmarks already show brute force degrades on long horizons. Message-level memory is the thing the test is supposed to reward. Fixing problems with brute force isn't scalable over months or years, and it guides users to inflated token usage and higher spend. No thanks.
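To make the contrast concrete, here is a minimal toy sketch of the message-level idea. It is not Backboard's implementation: each incoming message is broken into small facts as it arrives, and a question pulls back only the facts that overlap with it instead of replaying the whole transcript. The `MessageMemory` class, its sentence-splitting "extraction", and the word-overlap scoring are all assumptions made for illustration; a real system would presumably use a model for extraction and a proper retriever.

```python
import re
from dataclasses import dataclass, field

# Words too common to help with matching (a deliberately tiny, made-up list).
STOPWORDS = {"a", "an", "the", "i", "my", "is", "do", "to", "in", "of", "and", "what", "where"}


def tokens(text: str) -> set[str]:
    """Lowercase word set with the stopwords above removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


@dataclass
class MessageMemory:
    """Hypothetical toy store: one fact per sentence, kept as messages arrive."""
    facts: list[str] = field(default_factory=list)

    def add_message(self, message: str) -> None:
        # Stand-in for fact extraction: split the message into sentences.
        for sentence in re.split(r"(?<=[.!?])\s+", message.strip()):
            if sentence:
                self.facts.append(sentence)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Score each fact by word overlap with the query; drop non-matches.
        q = tokens(query)
        scored = [(len(q & tokens(fact)), fact) for fact in self.facts]
        scored = [(score, fact) for score, fact in scored if score > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:k]]


memory = MessageMemory()
memory.add_message("My name is Sarah. I just moved from Chicago to Toronto.")
memory.add_message("I work as a nurse. My dog is called Pixel.")

relevant = memory.retrieve("What is my dog called?")
print(relevant)
```

Only the matching fact is handed to the model, so the prompt stays small no matter how long the conversation history grows.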

We did not run these benchmarks ourselves. Third-party organizations did. We do not build for benchmarks and we do not tune to a leaderboard. We build the best memory product for our customers. It just happens to be the best.

One more thing, and we will not name names: several of the top open-source memory projects on GitHub run on Backboard for their paid cloud offering. The thing people benchmark against us is, in some cases, us. We think that is funny.

So we let the score sit quietly and we ship the product. Here is how to use it.

The memory that tops these benchmarks is one parameter. Store it on the assistant with `memory="Auto"`, reuse the same `assistant_id`, and facts carry across every conversation.

```bash
pip install backboard-sdk
```

```python
import asyncio
from backboard import BackboardClient

async def main():
    client = BackboardClient(api_key="YOUR_API_KEY")

    # First message: with memory="Auto", the facts in it are stored on the assistant.
    await client.send_message(
        "My name is Sarah. I just moved from Chicago to Toronto.",
        assistant_id="your-assistant-id",
        memory="Auto",
    )

    # Second message: the same assistant_id lets the stored facts be recalled.
    reply = await client.send_message(
        "Where do I live now?",
        assistant_id="your-assistant-id",
        memory="Auto",
    )
    print(reply.content)  # Toronto

asyncio.run(main())
```
```js
const send = (body) =>
  fetch("https://app.backboard.io/api/threads/messages", {
    method: "POST",
    headers: {
      "X-API-Key": "YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  }).then((r) => r.json());

await send({
  content: "My name is Sarah. I just moved from Chicago to Toronto.",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

const reply = await send({
  content: "Where do I live now?",
  assistant_id: "your-assistant-id",
  memory: "Auto",
});

console.log(reply.content);
```
```bash
curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "My name is Sarah. I just moved from Chicago to Toronto.", "assistant_id": "your-assistant-id", "memory": "Auto"}'

curl -X POST "https://app.backboard.io/api/threads/messages" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"content": "Where do I live now?", "assistant_id": "your-assistant-id", "memory": "Auto"}'
```

Each benchmark ability is just a memory mode in practice:

- `memory="Auto"` saves the new fact and supersedes the old one, no code from you.
- Facts carry across every conversation that reuses the same `assistant_id`.
- Switch `memory="Auto"` to `memory_pro="Auto"` when precision matters more than cost.
- With `Readonly`, the assistant recalls what it has and does not invent what it does not.

```python
# Runs inside an async function, with `client` created as in the example above.
response = await client.send_message(
    "What were my project deadlines?",
    assistant_id="your-assistant-id",
    memory_pro="Auto",
)
```
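Two of the behaviours in the list above, superseding an outdated fact and declining to answer when nothing is stored, are easy to picture with a toy key-value store. The sketch below is an illustration under stated assumptions, not Backboard's API or internals: `FactStore`, `remember`, and `recall` are hypothetical names, and facts are keyed by hand rather than extracted from text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FactStore:
    """Hypothetical toy store: one current value per key, plus superseded values."""
    current: dict[str, str] = field(default_factory=dict)
    history: list[tuple[str, str]] = field(default_factory=list)

    def remember(self, key: str, value: str) -> None:
        # Knowledge update: a changed value replaces the old one,
        # and the old one is kept only as history.
        old = self.current.get(key)
        if old is not None and old != value:
            self.history.append((key, old))
        self.current[key] = value

    def recall(self, key: str) -> Optional[str]:
        # Abstention: return None rather than guessing when nothing is stored.
        return self.current.get(key)


store = FactStore()
store.remember("city", "Chicago")
store.remember("city", "Toronto")    # supersedes Chicago

city = store.recall("city")          # the updated fact
employer = store.recall("employer")  # never stored, so no answer
print(city, employer)
```

Returning `None` for the unknown key is the design choice that matters here: the caller can say "I don't know" instead of filling the gap with a plausible guess.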

The benchmark number says we are first. The architecture says why it will hold: memory at the message level, not a context window stretched to hide a weaker design. You do not have to take the leaderboard's word for it. Set `memory="Auto"` and feel the difference in your own app.

Grab a key and try it: app.backboard.io

Memory docs: docs.backboard.io/concepts/memory
