Testing MiniMax M2.7 via API on three real ML and coding workflows

I recently got access to some MiniMax M2.7 API credits, so I plugged the model directly into Claude Code and ran it on three workflows I do regularly. I ran the same tasks with Claude Opus 4.7 as the comparison baseline.

The three workflows: scaffolding an entry for an active Kaggle competition, drafting and auditing knowledge-base notes for my Obsidian vault, and updating an old PyTorch project that had become outdated. I wanted to find out how well M2.7 works inside an agentic loop when the task has clear boundaries.

The results were consistent across the three runs: M2.7 was useful when the constraints were explicit and the output format was concrete. It stumbled when important context was left implicit, though some of the same gaps appeared with Opus 4.7 as well. For the more open-ended cases, I would still keep a human review pass in the loop.

Setup

I added a claude-mm command that points Claude Code at the MiniMax API and ran M2.7 with thinking set to max in the Claude Code interface. I ran on MiniMax’s Plus tier (High-Speed, $40/month), where the context window and per-day throughput were no longer bottlenecks for multi-step agentic work.
    # Shell function: launch Claude Code against the MiniMax endpoint,
    # with every model slot mapped to MiniMax-M2.7.
    claude-mm() {
      ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic" \
      ANTHROPIC_AUTH_TOKEN="$MINIMAX_API_KEY" \
      ANTHROPIC_MODEL="MiniMax-M2.7" \
      ANTHROPIC_DEFAULT_SONNET_MODEL="MiniMax-M2.7" \
      ANTHROPIC_DEFAULT_OPUS_MODEL="MiniMax-M2.7" \
      ANTHROPIC_DEFAULT_HAIKU_MODEL="MiniMax-M2.7" \
      ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2.7" \
      API_TIMEOUT_MS="3000000" \
      CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
      claude "$@"
    }

In agentic work, the harness can be as important as the model itself. Most of the failures I describe below had the same cause: the prompt did not explicitly state a constraint the task depended on, and the model filled the gap with a plausible default. In practice, model quality and harness design are hard to separate. A stronger model may infer missing constraints; a better harness may make those constraints explicit. I treated this as a workflow test, not a pure model benchmark.

Refactoring an old PyTorch project

The first workflow was a refactor: my pytorch_tempest repo (https://github.com/Erlemar/pytorch_tempest) is a framework for training neural nets using Hydra + PyTorch Lightning. I wanted to update dependencies, modernize the tooling, and clean up the code issues that had accumulated over time. The merged result is PR: refactoring old code and updating dependencies (https://github.com/Erlemar/pytorch_tempest/pull/68).

The changes:

- Updated CI versions and pre-commit hooks.
- Replaced black and flake8 with ruff for both linting and formatting.
- Enabled fsdp sharding strategy in the Lightning trainer config.
- Refreshed the documentation.
- Added uv for environment management.
- Switched to modern Python typing (list[X] over List[X], X | None over Optional[X]).
- Removed duplicate code paths.
- Fixed a lot of small issues.

I guided M2.7 explicitly: I provided step-by-step requirements (“switch black + flake8 to ruff”, “update the pre-commit config”), reviewed each change before moving to the next, and gave feedback when the diff went outside scope.
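To make the typing item in the list of changes concrete, here is a minimal sketch of the newer annotation style. The function `mean_loss` is a hypothetical example written for this post and is not taken from the pytorch_tempest repo.

```python
# Hypothetical illustration only; this function does not come from the repo.
# The __future__ import keeps the annotations unevaluated, so the modern
# spellings also parse on Python versions older than 3.10.
from __future__ import annotations


def mean_loss(losses: list[float], default: float | None = None) -> float | None:
    """Return the mean of `losses`, or `default` when the list is empty.

    Older spelling of the same signature:
        def mean_loss(losses: List[float], default: Optional[float] = None) -> Optional[float]
    """
    if not losses:
        return default
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(mean_loss([1.0, 3.0]))
    print(mean_loss([]))
```

The two spellings mean the same thing to a type checker; the newer one simply drops the `typing` imports. Ruff's pyupgrade-derived `UP` rules can flag the older forms, which is one reason this change pairs naturally with the move to ruff.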
I had enough tests to check whether anything broke after the changes, and rerunning model training took only a few minutes. I had some trouble getting CI to run, and the agent helped me fix the problems one by one.

A lot of engineers I know do not want to give an agent free rein over a codebase they care about; they want to supervise the execution and know every existing line of code. M2.7 fits this approach well. You can write short, narrow-scope prompts, review the diff line by line, and then move to the next step.

Knowledge notes for the Obsidian vault

The second workflow was writing and auditing notes for my Obsidian vault (https://dswok.com/), where I keep ML reference notes. I write most of them by hand; sometimes I have an LLM draft a parallel version to compare against and take inspiration from.

Different models prefer different prompt styles. A 100-line prompt tuned for Opus 4.7 does not transfer one-to-one to M2.7. To handle that, I did a small bootstrap: I asked both models to generate notes from the same starting prompt, then asked M2.7 to read both notes and propose an improved prompt for itself. The next iteration used the M2.7-tuned prompt.

I used two prompts (a writer command and a critic agent), each around 100 lines. Here is a condensed version of the first one:

    Fill one broken-link stub in the DSWoK vault: research the topic, draft the note in DSWoK voice, run draft-critic-mm, save to the right folder.

    1. Read context: writing style guide, frontmatter taxonomy, alias rule.
    2. Pick the stub.
    3. Locate references: Grep for