{"slug": "building-knn-from-scratch-because-import-sklearn-feels-like-cheating", "title": "Building KNN from Scratch (Because import sklearn Feels Like Cheating)", "summary": "A developer built a K-Nearest Neighbors classifier from scratch in pure Python to understand the algorithm's mechanics beyond using scikit-learn. The implementation includes Euclidean distance calculation with lazy evaluation and a deterministic tie-breaking mechanism using minimum distance per label. The project addresses edge cases like vector length mismatches and voting ties to ensure robustness.", "body_md": "Let's be real: in a production machine learning environment, we all just import `scikit-learn` and call it a day. But treating algorithms like black boxes can come back to bite you when those abstractions leak. Building K-Nearest Neighbors (KNN) from scratch is a fantastic exercise to actually understand the mechanics working under the hood.\n\nAt its core, KNN relies on a surprisingly simple geometric premise: a data point probably belongs to the same category as its closest spatial neighbors. Rebuilding this classifier from the ground up forces you to tackle some critical engineering challenges, such as vector length mismatches and voting ties.\n\nIn this post, we'll walk through the architectural decisions behind a pure Python implementation of KNN, translating textbook math into functional code.\n\n``` python\nimport math\n\ndef _check_length(x, y):\n    \"\"\"Ensure both vectors have the same length.\"\"\"\n    if len(x) != len(y):\n        raise ValueError(\"Vectors must be of same length\")\n\ndef euclidean_distance(x, y):\n    \"\"\"Compute Euclidean distance between two vectors.\"\"\"\n    _check_length(x, y)\n\n    sum_of_sq = sum(\n        (i - j) ** 2\n        for i, j in zip(x, y)\n    )\n\n    return math.sqrt(sum_of_sq)\n```\n\nCalculate spatial similarity between vectors using Euclidean distance.\n\nKNN is fundamentally a geometry problem. 
The entire algorithm hinges on quantifying exactly how far apart points are in a given space.\n\nIf we're looking at a 2D plane, Euclidean distance comes straight from the Pythagorean theorem:\n\n$$d(x, y) = \\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$\n\nIn machine learning, however, we're rarely dealing with just two dimensions. Fortunately, the formula generalizes neatly to an arbitrary number of dimensions:\n\n$$d(x, y) = \\sqrt{\\sum_{i=1}^{n} (x_i - y_i)^2}$$\n\nwhere $x_i$ and $y_i$ are the $i$-th components of the two vectors and $n$ is the number of dimensions.\n\nIn the `euclidean_distance()` function above, notice the generator expression inside `sum()`. It evaluates lazily, meaning we avoid constructing an intermediate list of squared differences in memory. This is a deliberate design choice that scales well when working with high-dimensional data.\n\nAlso, don't overlook the `_check_length()` guard: it is structurally critical. Comparing vectors of different dimensions is mathematically invalid. Since Python's `zip()` function silently truncates to the shortest iterable, omitting this check would let the function return a distance computed from only the overlapping components, with no error raised.\n\n``` python\ndef _get_majority_vote(neighbors):\n    # ... 
early-exit validation skipped ...\n\n    vote_count = {}\n    min_distance_per_label = {}\n\n    for neighbor in neighbors:\n        label = neighbor[\"label\"]\n        distance = neighbor[\"distance\"]\n\n        vote_count[label] = vote_count.get(label, 0) + 1\n\n        min_distance_per_label[label] = min(\n            distance,\n            min_distance_per_label.get(label, float(\"inf\"))\n        )\n\n    best_label = None\n    best_vote = -1\n    best_distance = float(\"inf\")\n\n    for label in vote_count:\n        votes = vote_count[label]\n        dist = min_distance_per_label[label]\n\n        if votes > best_vote or (\n            votes == best_vote and dist < best_distance\n        ):\n            best_label = label\n            best_vote = votes\n            best_distance = dist\n\n    return best_label\n```\n\nTally the neighbors' labels to determine the final classification while using spatial proximity as a deterministic tie-breaker.\n\nCounting votes is straightforward with a hash map, but robust edge-case management is what separates a demo implementation from production-ready code.\n\nImagine a scenario where $k=4$ and the vote splits evenly: two **Class A** neighbors and two **Class B** neighbors. A naive implementation might simply choose whichever label appears first. That makes the result depend on data ordering rather than on geometry.\n\nTo address this, the implementation maintains a secondary dictionary:\n\n```\nmin_distance_per_label\n```\n\nAs the algorithm iterates through the neighbors, it tracks the minimum distance observed for each class label. If a voting tie occurs, the algorithm chooses the class whose nearest representative is closest to the query point.\n\nMathematically:\n\n$$\\hat{c} = \\arg\\min_{c \\in T} \\; \\min_{i \\in N_k : c_i = c} d(s_i, q)$$\n\nwhere $T$ is the set of labels tied for the most votes, $N_k$ is the set of the $k$ nearest neighbors, $s_i$ and $c_i$ are the sample and label of neighbor $i$, $q$ is the query point, and $d$ is the Euclidean distance.\n\nThis approach anchors tie-breaking in actual spatial proximity rather than arbitrary ordering or randomness.\n\n``` python\nimport numpy as np\n\ndef knn_predict(training_data, labels, query_point, k):\n    # ... 
input validation skipped ...\n\n    distances = [\n        euclidean_distance(sample, query_point)\n        for sample in training_data\n    ]\n\n    nearest_idx = np.argsort(distances)[:k]\n\n    neighbors = [\n        {\n            \"distance\": distances[i],\n            \"label\": labels[i]\n        }\n        for i in nearest_idx\n    ]\n\n    return _get_majority_vote(neighbors)\n```\n\nCreate a controller function that computes distances, isolates the top k candidates, and delegates classification to the voting logic.\n\nAlthough a production implementation would include comprehensive validation and optimization, the core algorithm relies on one key operation:\n\n```\nnp.argsort(distances)[:k]\n```\n\nKeeping multiple arrays synchronized during sorting can quickly become messy. Rather than zipping distances and labels together, sorting them, and then unpacking them again, we sort only the distance values and retrieve the corresponding indices.\n\n`numpy.argsort()` returns the indices that would sort an array. This allows us to select the nearest neighbors without mutating the original data structures.\n\nMathematically, we're selecting:\n\n$$\\operatorname{argsort}(D)_{1:k}$$\n\nwhere $D$ is the vector of computed distances, so the result is the indices of the $k$ smallest distances.\n\nThis is a common scientific-computing pattern because it preserves the relationship between distances and labels while avoiding unnecessary data transformations.\n\nAfter selecting the nearest indices, we package the neighbors into lightweight dictionaries and pass them directly to the voting mechanism.\n\nK-Nearest Neighbors is a unique machine learning algorithm because it behaves less like a traditional parameter-learning model and more like a combination of geometric reasoning and voting heuristics.\n\nUnlike algorithms such as linear regression or neural networks, KNN performs no explicit training. 
It simply stores the dataset and uses distance as a proxy for similarity during prediction.\n\nBuilding algorithms from scratch is one of the fastest ways to demystify machine learning. You quickly discover that much of the perceived complexity comes from standard software engineering concerns such as input validation, deterministic tie-breaking, and keeping distances and labels in sync.\n\nThe mathematics matters, but so does thoughtful implementation.", "url": "https://wpnews.pro/news/building-knn-from-scratch-because-import-sklearn-feels-like-cheating", "canonical_source": "https://dev.to/pixie2468/building-knn-from-scratch-because-import-sklearn-feels-like-cheating-3h0p", "published_at": "2026-06-14 07:02:50+00:00", "updated_at": "2026-06-14 07:28:48.172531+00:00", "lang": "en", "topics": ["machine-learning", "developer-tools"], "entities": ["K-Nearest Neighbors", "Python", "scikit-learn"], "alternates": {"html": "https://wpnews.pro/news/building-knn-from-scratch-because-import-sklearn-feels-like-cheating", "markdown": "https://wpnews.pro/news/building-knn-from-scratch-because-import-sklearn-feels-like-cheating.md", "text": "https://wpnews.pro/news/building-knn-from-scratch-because-import-sklearn-feels-like-cheating.txt", "jsonld": "https://wpnews.pro/news/building-knn-from-scratch-because-import-sklearn-feels-like-cheating.jsonld"}}