Two Mistakes Hiding Behind One Good-Looking Number

An incoming Electronics and Communication Engineering student built a model to detect abnormal heartbeats from raw ECG signals, and its first version reported 98% accuracy. The student later found two evaluation mistakes that inflated that figure: a record selection that left rare classes underrepresented, and intra-patient cross-validation. After switching to a patient-disjoint split, accuracy dropped to about 88% and macro F1 fell from 0.64 to about 0.32, revealing the model's true limitations.

I'm starting an Electronics and Communication Engineering degree this year, and a few weeks before classes began I decided to build something real instead of waiting for a syllabus to tell me what to learn: a model that detects abnormal heartbeats from raw ECG signal, small enough to run on a microcontroller, not a cloud GPU.

The first version of this project hit 98% accuracy. That number was almost meaningless, and it took me two separate rounds of being wrong to find out why.

The number that looked great

The task is beat classification on the MIT-BIH Arrhythmia Database, a public dataset of annotated heartbeats used across decades of cardiology ML research. Each heartbeat gets sorted into one of five categories defined by the AAMI standard: Normal, Supraventricular ectopic, Ventricular ectopic, Fusion, and Unclassifiable.

My first pipeline extracted twenty features from each heartbeat's waveform (amplitude, statistical moments, frequency content) and trained three tree-based models at different size tiers, the smallest exportable to plain C for a microcontroller with a few kilobytes of flash.

I evaluated with five-fold cross-validation and got 96–98% accuracy depending on model size. It looked like a finished project. It wasn't.

The gap that gave it away

Accuracy was 98%. Macro F1, the unweighted average of F1 score across all five classes, was 0.64.

That gap is the single most reliable tell of a model quietly failing on minority classes while a dominant class hides the damage. If 90% of your heartbeats are Normal, a model that mostly just learns to recognize Normal beats can score very well on accuracy while being nearly useless at the actual clinically interesting task: catching the abnormal ones.

Digging into why led to two separate, compounding mistakes, not one.

Mistake one was in how I'd selected which patient recordings to train on.
I'd picked fifteen records somewhat arbitrarily, aiming for "rhythm diversity" without actually checking how many examples of each rare class they contained. Some classes ended up with far fewer real examples than I'd assumed.

Mistake two was bigger, and I only found it by reading the actual research literature on this dataset rather than trusting my own instincts about cross-validation. My five-fold split was intra-patient: beats from the same person could land in both the training and test fold, just different individual heartbeats. A model evaluated this way can partly learn to recognize an individual patient's specific heart rhythm rather than learning arrhythmias in general, and that inflates the score in a way that doesn't reflect how the model would perform on someone it's never seen.

The foundational paper in this exact research area, by de Chazal, O'Dwyer and Reilly (2004), defines a standard fix: split the dataset by patient, not by individual heartbeat, so the model trains on one group of people and is tested only on a completely disjoint group. It's the comparison baseline that essentially every subsequent paper in this subfield evaluates against. I hadn't used it. I was evaluating my model on a question ("can you also recognize beats from people you've already partly seen?") that wasn't the question that actually mattered.

Doing it right made the number look worse

I rebuilt the evaluation around the standard patient-disjoint split. The honest result: accuracy near 88%, macro F1 around 0.32.

That's not a regression. It's the same underlying weakness becoming visible instead of staying hidden. The model hadn't gotten worse; my ability to see how it was actually performing had gotten better. A worse-looking number that's telling the truth is strictly more valuable than a better-looking number that isn't.
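To make the patient-disjoint idea concrete, here is a minimal sketch in plain Python. It is not the evaluation code from the project, and it is not the fixed record lists from the de Chazal paper: it shuffles patients at random as a stand-in. The function name `patient_disjoint_split`, the `test_fraction` and `seed` parameters, and the record IDs "A" to "D" are all made up for illustration.

```python
import random
from collections import defaultdict


def patient_disjoint_split(patient_ids, test_fraction=0.5, seed=0):
    """Split beat indices so that no patient lands on both sides.

    patient_ids[i] is the (hypothetical) record label of beat i.
    Returns (train_indices, test_indices).
    """
    # Group beat indices by the patient they were recorded from.
    beats_by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        beats_by_patient[patient].append(index)

    # Shuffle patients, not beats, then cut the patient list in two.
    patients = sorted(beats_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    held_out = set(patients[:n_test])

    train, test = [], []
    for patient, indices in beats_by_patient.items():
        (test if patient in held_out else train).extend(indices)
    return sorted(train), sorted(test)


# Toy data: six beats from each of four made-up record IDs.
patient_ids = [rec for rec in ("A", "B", "C", "D") for _ in range(6)]
train_idx, test_idx = patient_disjoint_split(patient_ids)

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print("train patients:", sorted(train_patients))
print("test patients:", sorted(test_patients))
print("patients on both sides:", sorted(train_patients & test_patients))
```

The only design decision that matters here is that the shuffle happens over patients rather than over beats. A beat-level shuffle of the same data would almost certainly put every record on both sides, which is exactly the leak described above.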
Diagnosing again, and finding I'd built a second blind spot

Here's the part I'm least proud of, and the part most worth writing down honestly: when I went to figure out why the new number was low, I discovered I'd computed per-class precision and recall internally the whole time and never actually looked at them. I had the diagnostic data sitting in memory and hadn't surfaced it anywhere: not in a log, not in a saved file. I'd built exactly the kind of blind spot I was trying to fix in my evaluation methodology, just one layer further down, in my own tooling.

Once I fixed that and could actually see the breakdown, the pattern was specific:

Ventricular beats were classifying well: strong, distinctive waveform shape, so the model had something real to grab onto.

Supraventricular beats were failing badly, and not in the "guessing wrong but trying" sense: recall was under 10%. The model had essentially given up on that class.

The reason, once I looked into it, made sense: Supraventricular beats are often nearly identical to Normal beats in raw shape. What actually defines them clinically is usually timing: they fire earlier than the heart's expected rhythm. My features only ever looked at a single isolated one-second window around each heartbeat. There was no timing information anywhere in the feature set. I'd built a classifier that was structurally blind to the thing that mostly defines one of the five categories I was asking it to detect.

Adding the missing signal

I added four features describing the interval between consecutive heartbeats: the gap before the current beat, the gap after it, the ratio between them, and a short rolling average of recent gaps. These come from exactly the same R-peak timestamps that beat detection already needs to produce. On a microcontroller, they're nearly free, far cheaper than the frequency-domain features already being computed.
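The four interval features described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project's actual feature code: the function name `rr_features`, the dictionary keys, the rolling window of 8 gaps, and the example timestamps are all invented here.

```python
def rr_features(r_peak_times, window=8):
    """Return one feature dict per beat that has both a previous and a next beat.

    r_peak_times: R-peak timestamps in seconds, strictly ascending.
    window: how many recent gaps feed the rolling average (assumed value).
    """
    # gaps[i] is the interval between beat i and beat i + 1.
    gaps = [later - earlier
            for earlier, later in zip(r_peak_times, r_peak_times[1:])]

    features = []
    for i in range(1, len(r_peak_times) - 1):
        pre_rr = gaps[i - 1]                 # gap before the current beat
        post_rr = gaps[i]                    # gap after the current beat
        recent = gaps[max(0, i - window):i]  # last gaps up to and including pre_rr
        features.append({
            "beat_index": i,
            "pre_rr": pre_rr,
            "post_rr": post_rr,
            "rr_ratio": pre_rr / post_rr,
            "mean_recent_rr": sum(recent) / len(recent),
        })
    return features


# Made-up timestamps: a steady 0.8 s rhythm with one early beat at 2.1 s.
r_peaks = [0.0, 0.8, 1.6, 2.1, 3.2, 4.0]
beats = rr_features(r_peaks)
for beat in beats:
    print(beat["beat_index"],
          f"pre={beat['pre_rr']:.2f}",
          f"post={beat['post_rr']:.2f}",
          f"ratio={beat['rr_ratio']:.2f}",
          f"recent={beat['mean_recent_rr']:.2f}")
```

In the toy rhythm, the early beat shows up as a short gap before it and a long gap after it, so its ratio drops well below 1 even though nothing about its waveform shape is used. Note that the gap after a beat is only known once the next beat arrives, so a streaming implementation would classify each beat one beat late.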
The "reliable" macro F1 (averaged only across classes with enough training examples to produce a statistically meaningful score, which in this dataset excludes only the near-empty Unclassifiable class) moved from not computable before to 0.52. Genuine improvement, properly earned.

What's still broken, stated out loud on purpose

Supraventricular and Fusion beats are still hard. Their F1 scores sit between 0.08 and 0.23 depending on which model tier you look at, and reading further into the published literature made me recalibrate my own expectations: serious research using far more sophisticated features than mine (explicit P-wave detection, QRS boundary measurement) still reports precision around 31% for the same class. This is a genuinely difficult classification problem in this exact research area, not a problem I should have expected to fully solve with a feature set built in a week.

I also noticed something I hadn't anticipated: my three model sizes don't just differ in accuracy, they fail differently. The smallest model is far more willing to guess "Fusion" than the largest one (high recall, terrible precision), likely because a shallow decision tree with very few splits available, combined with class-balancing that inflates the importance of rare classes, ends up carving one broad region of feature space and labeling all of it Fusion. The largest model, with room for finer decision boundaries, makes that same call more conservatively. Choosing a deployment size for a microcontroller isn't just a tradeoff between accuracy and memory footprint; it's also an implicit choice about which kind of mistake the model will make.

What this actually taught me

None of the real lessons here were about random forests or feature engineering specifically. They were about the difference between a number that looks good and a number that's actually telling you something true.
A high-level aggregate metric can be a near-perfect hiding place for a specific, severe failure, and the only way to find that failure is to deliberately go look for the gap between two metrics that should usually agree. An evaluation method that quietly leaks information from train into test will reward you with a better number and a worse model, and the only defense is knowing the standard methodology for your specific problem well enough to notice when you've deviated from it. And the data you don't bother to surface might as well not exist: I had the exact diagnostic I needed sitting in a variable, unused, for an entire iteration cycle.

The actual code (full pipeline, test suite, and the C export, verified to compile for an ARM Cortex-M4 and produce identical predictions to the Python model it came from) is public: github.com/bl4zeh3x/tinyml-explorations

If you've hit a similar gap between a metric that looks fine and a model that isn't, I'd be glad to hear how you found it.