Building HITL Feedback RAG: Embeddings, Retrieval, and Reranking

A technical guide demonstrates how to build a human-in-the-loop feedback RAG system that captures team corrections and retrieves them into LLM prompts. The pipeline covers modeling corrections as structured FeedbackNote objects, embedding lessons at write time, and using cosine similarity for semantic retrieval. The approach aims to improve LLM accuracy on enterprise-specific tasks by injecting curated lessons from past corrections.

This is the hands-on companion to Part 1: Your LLM Isn’t Dumb — It Just Lacks Your Context (https://medium.com/ai-in-plain-english/your-llm-isnt-dumb-it-just-lacks-your-context-e3f65e2d4d63?sharedUserId=nitingummidela). There, we covered the idea: LLMs fail on your code because they lack your enterprise context, and human-in-the-loop (HITL) feedback RAG fixes that by capturing your team’s corrections and retrieving them back into the prompt. If you have not read it, start there for the why.

This piece is the how. We will build the pipeline end-to-end: how to model a correction, how to store and index it, how retrieval actually works (embeddings, approximate nearest-neighbor search, hybrid filtering, reranking), how to assemble a safe prompt, and how to evaluate and operate the whole thing. The code is in Python with a pgvector-style store, but the patterns apply to any stack.

A note needs more than free text if you want to retrieve, filter, deduplicate, and expire it later. Model it explicitly. At minimum, you want a stable ID, the structured correction, metadata to filter on, the embedding vector, and lifecycle fields:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FeedbackNote:
    id: str            # stable, content-hashed so re-ingest is idempotent
    task_type: str     # e.g. "sql_injection_scan"; used for metadata filtering
    wrong_answer: str  # what the model claimed
    correction: str    # what was actually true
    lesson: str        # the one-line rule injected into prompts
    embedding: list[float] | None = field(default=None)  # filled at ingest time
    source: str = "review_ui"  # provenance: who/what produced it
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    status: str = "active"  # active | deprecated; never hard-delete, flip status
```

The `lesson` is the part that gets injected; the rest is for retrieval, governance, and debugging. Two design choices are worth calling out. The `id` is a hash of the normalized content, so re-submitting the same correction updates it rather than duplicating it. And `status` is a soft-delete flag, so retiring a bad lesson is reversible and auditable instead of a destructive delete.

Ingest is then: embed the lesson once, then upsert. You embed at write time, not read time, so retrieval never pays the embedding cost for stored notes.

```python
def ingest(note: FeedbackNote, embed, store):
    text = f"{note.task_type}: {note.lesson}"  # embed task + lesson together
    note.embedding = embed(text)               # one embedding call, cached on the row
    store.upsert(note)                         # keyed by note.id, so idempotent
```

Embed in batches when backfilling a large history. Embedding APIs are usually far more efficient per item in batches of, say, 64–256, and rate limits bite if you call them one at a time.

Semantic search rests on one idea: an embedding, a list of numbers that captures the meaning of a piece of text. Two texts that mean similar things get vectors that point in similar directions, even if they share no words. “The query is parameterized” and “this uses a bound placeholder” land close together; “delete the user account” lands far away.

An embedding model maps text to a fixed-length vector, typically 384, 768, or 1024 dimensions, depending on the model.
The dimension is a property of the model; every vector from a given model has the same length, and you can only compare vectors produced by the same model. Swap the embedding model, and you must re-embed everything.

You measure “close” with cosine similarity, the cosine of the angle between two vectors, scored from -1 to 1. A useful trick: if you L2-normalize every embedding to unit length at write time, cosine similarity becomes a plain dot product, which is faster and is what most vector indexes optimize for. So normalize once on the way in:

```python
import numpy as np


def normalize(v):
    v = np.asarray(v, dtype=np.float32)
    # The small epsilon avoids division by zero for an all-zero vector.
    return v / (np.linalg.norm(v) + 1e-12)  # unit length, so cosine == dot product
```

The naive search is a linear scan: score the query against every note, then sort:

```python
import numpy as np


def brute_force_search(q, notes, k=20):
    # dot == cosine if both q and the stored embeddings are normalized
    scored = [(float(np.dot(q, n.embedding)), n) for n in notes]
    scored.sort(reverse=True, key=lambda x: x[0])
    return scored[:k]
```

That is O(n) per query and fine up to a few thousand notes. Beyond that, you use a vector database with an approximate nearest-neighbor (ANN) index, usually HNSW (a navigable graph) or IVF (inverted lists of clusters). ANN trades a little recall for a large speedup, turning an O(n) scan into roughly O(log n) and typically keeping queries in single-digit milliseconds across millions of vectors. The knobs that matter on HNSW are `m` (graph connectivity) and `ef_search` (how hard it looks at query time): higher `ef_search` means better recall but slower queries.

Crucially, you seldom want pure vector search. You want hybrid retrieval: a metadata pre-filter plus vector similarity, so you only rank notes that are actually applicable.
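
A minimal in-memory sketch of that hybrid pattern may help: filter on metadata first, then rank only the survivors by dot product. `Note` and `hybrid_search` are hypothetical names used for illustration, the lessons and 2-dimensional vectors are invented, and the sketch assumes embeddings were already L2-normalized at write time.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Note:  # hypothetical, trimmed-down stand-in for FeedbackNote
    id: str
    task_type: str
    lesson: str
    embedding: np.ndarray
    status: str = "active"


def hybrid_search(q, notes, task_type, k=20):
    """Metadata pre-filter, then rank the remaining notes by similarity."""
    eligible = [n for n in notes if n.task_type == task_type and n.status == "active"]
    scored = [(float(np.dot(q, n.embedding)), n) for n in eligible]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: the vectors are made up, not output from a real embedding model.
notes = [
    Note("a", "sql_injection_scan", "Bound placeholders are safe.", np.array([1.0, 0.0])),
    Note("b", "sql_injection_scan", "Retired rule.", np.array([1.0, 0.0]), status="deprecated"),
    Note("c", "code_review", "Unrelated task.", np.array([1.0, 0.0])),
    Note("d", "sql_injection_scan", "String concatenation is unsafe.", np.array([0.6, 0.8])),
]
hits = hybrid_search(np.array([1.0, 0.0]), notes, task_type="sql_injection_scan", k=2)
hit_ids = [n.id for _, n in hits]
print(hit_ids)
```

Notes `b` and `c` have embeddings identical to the query, yet neither is returned: one is deprecated and the other belongs to a different task. That is the point of filtering before ranking.
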
In pgvector, that is one SQL statement:

```sql
-- normalized embeddings + cosine distance operator (<=>), filtered by task and status
SELECT id, lesson, 1 - (embedding <=> :query_vec) AS score
FROM feedback_notes
WHERE task_type = :task_type        -- metadata filter first
  AND status = 'active'             -- never retrieve retired lessons
ORDER BY embedding <=> :query_vec   -- then nearest-neighbor
LIMIT :candidate_k;                 -- over-fetch for reranking
```

You embed each note once when you save it, and the store handles indexing and search. The same pattern exists in Qdrant (https://qdrant.tech/), Weaviate (https://weaviate.io/), and others as a filter plus a vector query.

Nearest-neighbor search is fast but rough, because it compares two embeddings that were each produced without seeing the other. A cross-encoder reranker is a second model that reads the query and a candidate together and scores their relevance directly. It is too slow to run over the whole corpus, but well suited to re-scoring the ~20 candidates ANN returned, so the 3 notes you actually inject are the best ones, not just the closest vectors:

```python
RELEVANCE_THRESHOLD = 0.5  # placeholder value; tune it for your reranker's score scale


def retrieve(query, store, embed, reranker, task_type, candidate_k=20, final_k=3):
    q = normalize(embed(query))
    candidates = store.search(q, task_type=task_type, k=candidate_k)  # ANN + filter
    if not candidates:
        return []
    pairs = [(query, n.lesson) for n in candidates]
    scores = reranker.score(pairs)  # cross-encoder, reads query + note jointly
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
    return [n for s, n in ranked[:final_k] if s >= RELEVANCE_THRESHOLD]
```

Note the `RELEVANCE_THRESHOLD`: if nothing clears the bar, you inject nothing. Retrieving an irrelevant note is worse than retrieving none, because it actively misleads the model. "Return zero notes" must be a valid outcome.

One more retrieval-quality rule: keep each note to a single lesson. If you cram several lessons into one note, its embedding becomes a blurry average that sits between all of them and matches none well.
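
The "blurry average" effect is easy to see with toy numbers. In this sketch, two unrelated lessons are modeled as orthogonal unit vectors (an idealized assumption; real embeddings are rarely exactly orthogonal), and a note holding both lessons is approximated as their normalized mean. The helper `unit` and all vectors are hypothetical and chosen only for illustration.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v)


lesson_a = unit([1.0, 0.0])           # toy embedding for lesson A
lesson_b = unit([0.0, 1.0])           # toy embedding for an unrelated lesson B
combined = unit(lesson_a + lesson_b)  # stand-in for one note holding both lessons

query_a = lesson_a  # a query that is purely about lesson A

score_single = float(np.dot(query_a, lesson_a))    # one lesson per note
score_combined = float(np.dot(query_a, combined))  # two lessons in one note
print(f"single-lesson note: {score_single:.3f}")
print(f"combined note:      {score_combined:.3f}")
```

Under these assumptions the combined note scores about 0.71 against a query that the single-lesson note matches at 1.0, and the gap widens as more lessons are averaged in.
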
One lesson per note keeps each vector sharp and each retrieval precise.

Now assemble the question and the retrieved notes into one prompt. Three things matter here that the naive version gets wrong: ordering, fencing, and a token budget.

```python
MAX_CONTEXT_TOKENS = 800  # hard cap so retrieved notes can't crowd out the task


def build_prompt(query, notes, count_tokens):
    lessons, used = [], 0
    for i, n in enumerate(notes, 1):
        line = f"{i}. {n.lesson}"
        if used + count_tokens(line) > MAX_CONTEXT_TOKENS:
            break  # budget guard: stop adding notes
        lessons.append(line)
        used += count_tokens(line)
    context = "\n".join(lessons) if lessons else "(no relevant prior corrections)"
    return f"""{BASE_INSTRUCTIONS}
```