Closing the feedback loop: how mistake classification drives adaptive problem selection in NumPath

The NumPath AI math tutor for children with dyscalculia fixed a flaw in its adaptive engine: although the system classified student mistakes (e.g., digit reversal, borrowing errors), it ignored these classifications when selecting the next problem. The solution was a 60-line code change that implements a rule-based system: if a specific mistake type appears at least twice in the last three attempts, the system reduces the difficulty of the associated knowledge component. This change closes the feedback loop between error diagnosis and problem selection, making the system a true Intelligent Tutoring System rather than a simple difficulty slider.

NumPath is an AI math tutor for children with dyscalculia. At its core is an adaptive engine that picks the next problem for each student based on their Bayesian Knowledge Tracing (BKT) mastery estimate. In this post I'll walk through a problem we had, and solved, in the rule-based phase: classified mistakes were being logged but completely ignored by the selection engine. The fix was a 60-line change across two files. The research implication is significant.

Our MistakeClassifier already tagged every wrong answer with a structured code: BORROW_SKIP when a student adds instead of subtracts with borrowing, DIGIT_REVERSAL when they write 51 for 15, MAGNITUDE_MISJUDGE when they pick the smaller number as larger. These MistakeEvent records were hitting the database on every incorrect attempt. But GetNextProblemUseCase, the code that decides what problem a student gets next, never read them.

The engine was selecting problems purely on BKT p_mastery. A student could hit BORROW_SKIP three sessions in a row and still receive problems at the same difficulty, on the same skill, with zero response to the pattern. This violates what MacLellan et al. call the "Error as Diagnostic Signal" principle: mistakes should trigger targeted remediation, not generic retry.

The core question was: when should a mistake pattern trigger a response, and what should that response be?
We settled on three rules, each encoded as a named constant:

    MISTAKE_WINDOW = 3                     # look back this many MistakeEvents
    threshold = ceil(MISTAKE_WINDOW / 2)   # = 2: dominant code must appear ≥ 2× in window
    MISTAKE_KC_MAP = {
        "DIGIT_REVERSAL": "PLACE_VALUE",
        "BORROW_SKIP": "SUB_BORROW",
        "MAGNITUDE_MISJUDGE": "PLACE_VALUE",
        "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
        "OPERATION_CONFUSION": "OPERATION_SIGN",
    }

When detect_mistake_signal fires, two things happen. First, the engine targets the KC mapped to the dominant mistake code rather than selecting on p_mastery alone. Second, the difficulty steps down by DIFFICULTY_STEP (0.2), floored at ENTRY_DIFFICULTY (0.3) to prevent over-scaffolding students who are already at entry level.

What we explicitly rejected: resetting difficulty to zero (too harsh for students who've been making progress), and weighting by mistake severity (too complex for Phase 1, with no real data to calibrate against).

The reason field on every NextProblemResponse now names the triggering pattern:

    "Remediation: BORROW_SKIP detected 2× on SUB_BORROW (p_mastery=0.41)"

This is the explainability requirement. A teacher looking at this in the dashboard can understand exactly why the system chose what it did.

The central claim of NumPath's RCT will be that adaptive, mistake-aware tutoring produces better outcomes than static worksheets for dyscalculic learners. Before this change, we had a system that adapted difficulty based on streaks but ignored the type of error a student was making. That's not meaningfully different from a worksheet that repeats problems when you get them wrong.

Closing this loop (mistake code → KC target → difficulty adjustment → reason field) is what makes the system an Intelligent Tutoring System rather than a difficulty slider. Every MistakeEvent record is now a longitudinal data point that shapes the student's next experience, and that chain of causality is fully traceable.

The implementation was straightforward. The harder question was the threshold: why 2 of 3, not 3 of 3?
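Before answering that, here is a minimal sketch of the rule in Python. It is an illustration under assumptions, not NumPath's actual code: the underscore spellings of the identifiers, the signature of detect_mistake_signal, the helper remediate, and the newest-first list of mistake codes are all hypothetical. Only the constant values, the KC map, and the reason format come from the description above.

```python
from collections import Counter
from math import ceil

MISTAKE_WINDOW = 3       # look back this many MistakeEvents
DIFFICULTY_STEP = 0.2    # how far difficulty drops on remediation
ENTRY_DIFFICULTY = 0.3   # floor, so entry-level students are not over-scaffolded

MISTAKE_KC_MAP = {
    "DIGIT_REVERSAL": "PLACE_VALUE",
    "BORROW_SKIP": "SUB_BORROW",
    "MAGNITUDE_MISJUDGE": "PLACE_VALUE",
    "PLACE_VALUE_CONFUSION": "PLACE_VALUE",
    "OPERATION_CONFUSION": "OPERATION_SIGN",
}


def detect_mistake_signal(recent_codes, window=MISTAKE_WINDOW, threshold=None):
    """Return (code, count) for the dominant mistake code, or None.

    recent_codes is assumed to be a list of mistake codes, newest first.
    The default threshold is ceil(window / 2), i.e. 2 for a window of 3.
    """
    if threshold is None:
        threshold = ceil(window / 2)
    window_codes = recent_codes[:window]
    if not window_codes:
        return None
    code, count = Counter(window_codes).most_common(1)[0]
    if count >= threshold and code in MISTAKE_KC_MAP:
        return code, count
    return None


def remediate(code, count, difficulty, p_mastery):
    """Return (target_kc, new_difficulty, reason) for a detected signal."""
    kc = MISTAKE_KC_MAP[code]
    # Round to avoid floating-point noise such as 0.39999999999999997.
    stepped = round(difficulty - DIFFICULTY_STEP, 2)
    new_difficulty = max(ENTRY_DIFFICULTY, stepped)
    reason = (
        f"Remediation: {code} detected {count}× on {kc} "
        f"(p_mastery={p_mastery:.2f})"
    )
    return kc, new_difficulty, reason
```

With the sequence BORROW_SKIP, DIGIT_REVERSAL, BORROW_SKIP, the default threshold returns ("BORROW_SKIP", 2), while passing threshold=3 returns None. That difference is the trade-off discussed next.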
Three-of-three is too strict: a student who makes BORROW_SKIP, then DIGIT_REVERSAL, then BORROW_SKIP again has a clear pattern, but the strict threshold misses it. Two-of-three catches the pattern earlier at the cost of occasional false positives. We don't yet have real student data to validate this choice, so it remains a hypothesis. We've logged it as a research note for Phase 4.

The one thing I'd do differently: add the MistakeEvent index to the model on day one. It was missing and only caught during the performance review pass. A composite index on (student_id, created_at) is obvious in hindsight for any table you're going to query with ORDER BY created_at DESC LIMIT N.

Next up: wiring the KC states into the teacher dashboard so educators can see p_mastery per student, not just 7-day accuracy. That is the final piece of the MacLellan "Teacher-in-the-Loop" principle.

- Wiring MistakeEvent into select_next_problem is a 60-line change with a meaningful research impact.
- The reason field is not a nice-to-have: every adaptive decision must be explainable to a teacher, and a string-formatted rationale on each NextProblemResponse is the minimum viable explainability.
- MISTAKE_WINDOW, FRUSTRATION_WINDOW, and MASTERY_WINDOW sit side by side; when we have real data to calibrate thresholds, we change one line each.
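On the indexing point, the sketch below shows the shape of the composite index and the "latest N" query it serves. It uses Python's built-in sqlite3 module purely for illustration. The post does not say which database or ORM NumPath uses, so the table name mistake_event, the column types, the index name, and the helper recent_mistake_codes are all assumptions.

```python
import sqlite3


def make_db():
    """Create an in-memory table with the composite index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE mistake_event (
            id INTEGER PRIMARY KEY,
            student_id INTEGER NOT NULL,
            code TEXT NOT NULL,
            created_at TEXT NOT NULL  -- ISO-8601 text sorts chronologically
        )
        """
    )
    # Composite index: filter by student, then walk created_at in order.
    conn.execute(
        "CREATE INDEX ix_mistake_event_student_created "
        "ON mistake_event (student_id, created_at)"
    )
    return conn


def recent_mistake_codes(conn, student_id, limit=3):
    """Return the student's most recent mistake codes, newest first."""
    rows = conn.execute(
        "SELECT code FROM mistake_event WHERE student_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (student_id, limit),
    ).fetchall()
    return [code for (code,) in rows]


if __name__ == "__main__":
    conn = make_db()
    conn.executemany(
        "INSERT INTO mistake_event (student_id, code, created_at) VALUES (?, ?, ?)",
        [
            (1, "BORROW_SKIP", "2024-01-01T10:00:00"),
            (1, "DIGIT_REVERSAL", "2024-01-02T10:00:00"),
            (1, "BORROW_SKIP", "2024-01-03T10:00:00"),
        ],
    )
    print(recent_mistake_codes(conn, 1))
    # Inspect the plan to check whether the index is picked up.
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT code FROM mistake_event "
        "WHERE student_id = ? ORDER BY created_at DESC LIMIT ?",
        (1, 3),
    ).fetchall()
    print(plan)
```

Putting student_id first lets the database narrow to one student before reading created_at in order, which is why the column order in the index matters for this query.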