{"slug": "why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag", "title": "why a simple string match beat apple's nlembedding for local rag", "summary": "A developer building a personal AI agent found that Apple's NLEmbedding performed poorly for local RAG, with a Turkish query scoring 0.587 for a CV and 0.60 for a junk file, and an English query scoring only 0.17. The developer attributes the failure to NLEmbedding's likely use of static word vectors like GloVe, which lack contextual understanding and struggle with agglutinative languages. A simple string match ultimately outperformed the embedding-based search.", "body_md": "Why a simple string match beat Apple's NLEmbedding for local RAG\n\nhow apple's nlembedding drove me crazy and how i built my own hybrid search engine\n\nrecently, while working on my personal ai agent (pheronagent), i was focused on perfecting its memory and retrieval system.\n\neveryone is talking about that famous acronym: rag (retrieval-augmented generation).\n\nthe system is simple: i feed the agent my documents, it converts them into vectors (embeddings), and when i ask a question, it finds the most similar vectors and answers me. sounds perfect on paper, right?\n\nso, like any loyal apple ecosystem developer, instead of downloading massive models from external sources (or burning money on apis), i decided to use nlembedding—the native capability of the operating system that runs directly on-device. after all, apple had embedded this into the os; it was both fast and privacy-focused.\n\nbut real life, as it turns out, doesn't progress as smoothly as wwdc presentations...\n\nwhere have i worked? - the first explosion\n\nit all started with a very innocent question. i had uploaded my cv to the system. while chatting with my agent, i casually asked:\n\n\"where have i worked?\"\n\ni expected the agent to fire up the metal cores in the background within seconds, find my cv, and list the companies for me. 
instead, the agent stared blankly. i opened the logs to see what the hell the search engine was doing behind the scenes. the shocking scenario was exactly this:\n\nit missed it by a hair! \"no worries,\" i thought. \"we can just lower the threshold a bit, make it 0.55, and call it a day.\"\n\nbut then i saw the truly terrifying thing just one line below. for the exact same query, guess what score a completely irrelevant, junk record in the system—a list of files containing .ds_store—got? 0.59 - 0.60!\n\nwait a minute... my detailed, multi-page resume gets a score of 0.587 just because it doesn't contain the words \"which\", \"company\", \"work\" in that exact order; yet a meaningless list of hidden files scraped from some corner of the disk gets a higher score than my cv!\n\nthe \"it must be language incompatibility\" fallacy\n\ni immediately started theorizing. apple's nlembedding.sentenceembedding(for: .english) model, as the name suggests, was optimized for english. because i asked a question in turkish, the model was likely tagging the words as \"out of vocabulary\" (oov) and throwing them to a completely random point in the vector space. the high score of the .ds_store list was just a product of this randomness—it happened to land near a similar vector by pure luck.\n\n\"okay,\" i said. \"since the model is english, i will ask in english. after all, ai speaks every language anyway.\"\n\ni changed the prompt: \"which companies have i worked at?\"\n\ni watched the logs with anticipation. my expectation was that the english model would perfectly understand this query in its native language and boost my cv's score to somewhere around 0.80.\n\nthe result? 0.17.\n\nyes, you read that right. 0.17. by asking in english, the score crashed even further. my language compatibility theory collapsed like a house of cards before my eyes.\n\nwhat's under the hood of apple's nlembedding?\n\nafter this disaster, i decided to do some research. 
how does apple's nlembedding class actually work under the hood?\n\ni learned that nlembedding on apple devices (especially the structures inherited from older ios/macos versions) doesn't function like massive, dynamic transformer-based models (like bert or gpt). it most likely relies on static word vector representations like glove (global vectors for word representation) or highly lightweight neural network architectures based on word-level compression.\n\nthe biggest weakness of such models is that their contextual understanding is extremely limited. meaning:\n\nconsequently, agglutinative languages like turkish become a complete nightmare for these models. unable to properly extract word roots for variations like \"çalıştım\", \"çalışmışım\", or \"çalışıyordum\" (all forms of \"worked\"), the model treats the words as completely foreign. in the end, we are left with meaningless 512-dimensional float arrays carrying close to zero semantic information—essentially just \"noise\".\n\nspeeding up with metal, choking on vectors\n\nthe tragicomic part of it was that i spared no expense in terms of performance in the search infrastructure of the project. 
in the experiencevault.swift file representing the agent's memory vault, i had written a metal gpu kernel so i wouldn't waste time iterating through similarity calculations one by one on the cpu!\n\ni had some fancy metal shader code like this:\n\n```\n#include <metal_stdlib>\nusing namespace metal;\n\nkernel void cosine_similarity_batch(\n    device const float* query [[buffer(0)]],\n    device const float* documents [[buffer(1)]],\n    device float* results [[buffer(2)]],\n    constant uint& vector_dim [[buffer(3)]],\n    uint id [[thread_position_in_grid]]) \n{\n    // we calculate cosine similarity by scanning hundreds of memory records simultaneously on the gpu...\n    float dot_product = 0.0;\n    float query_norm = 0.0;\n    float doc_norm = 0.0;\n\n    uint offset = id * vector_dim;\n    for (uint i = 0; i < vector_dim; i++) {\n        float q = query[i];\n        float d = documents[offset + i];\n        dot_product += q * d;\n        query_norm += q * q;\n        doc_norm += d * d;\n    }\n\n    results[id] = dot_product / (sqrt(query_norm) * sqrt(doc_norm));\n}\n```\n\nthink about it: i had descended to the hardware level, running gpu threads in parallel, calculating cosine similarity on the order of nanoseconds... but the vectors i was calculating were junk!\n\nactually, the story of this metal kernel was even more tragic. a while before writing these lines, i had discovered that this kernel wasn't running in any environment at all—neither in cli tests, nor in a separate xpc service, nor inside the actual .app bundle. the reason was a pure swiftpm trap: the device.makedefaultlibrary() call only looks for the compiled metal library in the top-level resources folder of bundle.main. but swiftpm embeds a package target's .metal files into its own nested, separate resource bundle (pheronagent_pheronagentcore.bundle)—which makedefaultlibrary() never checks. 
this meant that this clever gpu code, sitting there for months, was quietly returning nil every time and bypassing calculations without executing anything in the background. the solution was equally elegant: compiling the kernel not from a resource file, but directly from a string embedded in swift at runtime using device.makelibrary(source:options:). no bundle dependency, completely agnostic of which process it runs in.\n\nonce i fixed that, the kernel actually started working—but as you will see in a moment, this was only the tip of the iceberg.\n\nthe oldest rule of computer science had hit me in the face once again: garbage in, garbage out. no matter how fast you calculate, using metal doesn't matter if those vectors coming from apple's nlembedding are meaningless.\n\nthe bitter truth: apple's model is not discriminative\n\nat that moment, i saw clearly that apple's on-device nlembedding model did not have real discriminative power over my small, personal, and noisy dataset. both relevant and completely irrelevant content clustered closely together, somewhere between 0.50 and 0.60. the model was mapping a general \"semantic map\" of the text, but it wasn't fine-tuned enough to answer specific questions.\n\ni couldn't solve this by playing with threshold values. if i pulled the threshold down to 0.5, i would get junk files. if i raised it to 0.7, the system would turn into a blind robot that finds nothing. it had become a pure hit-or-miss game.\n\ni had made many fixes in the agent's memory system today: switching to content-based embedding, patiently re-embedding all 903 historical records, setting up threshold-triggered searches in chat mode, and refining the system prompts. these were all correct, logical, and architecturally necessary steps. but a chain is only as strong as its weakest link. and my weakest link was the underlying similarity engine upon which this whole fancy architecture relied.\n\ni was building a structure on an unreliable foundation. 
without fixing this similarity engine, that cv scenario—or any personal data assistant scenario—would never work stably.\n\ncrossroads: a new model or new intellect?\n\ni was faced with two choices:\n\nbringing out the big guns: throw apple's toy nlembedding in the trash, and run a full huggingface model (like all-minilm-l6-v2 or a multilingual model) via mlx (apple's machine learning framework for apple silicon).\n\nblending old school wisdom with ai: why rely solely on the ai's \"semantic understanding\" capability anyway? ai can be smart, but sometimes it's dumb. the human brain, on the other hand, forms semantic connections and catches literal (exact) matches in a flash.\n\nand then, lightning struck: hybrid search!\n\nthe birth of \"keyword + embedding\" hybrid search\n\nthe root of the problem was this: words like \"turgay\", \"cv\", or \"apple\" are proper nouns or concrete facts. an embedding model generalizes these meanings to \"human\", \"document\", or \"company\". but when i search, i'm not looking for some general company; i'm searching for companies on my own cv. here, a literal (exact) match was far more valuable than semantic similarity.\n\nwhy not combine both?\n\nthe plan was simple but deadly:\n\ni thought:\n\nif there's a literal word or name match between the query and the record text (for instance, \"turgay\" or \"cv\" appears in both), let's add that to the embedding score.\n\nthis was an incredibly elegant solution, especially for personal data containing proper nouns or concrete facts: much more reliable, codeable in seconds, and most importantly, requiring no extra heavyweight ai model.\n\nthe stop-word menace and the short word trap\n\nwhen i started coding, the first trap that came to mind was the infamous turkish casing issue—the I/ı/İ/i character pairs can easily mismatch without a locale-sensitive lowercased() call. 
honestly, though, in the first version, i bypassed this and went with plain lowercased(); since the queries were freeform user input and the words were searched using contains(), it didn't cause problems in practice. (note to self: this is actual tech debt; one day, when \"İstanbul\" doesn't match \"istanbul\", it will come back to haunt me.)\n\nthe second trap i took seriously was stop-words and short/meaningless tokens. words like \"and\", \"of\", \"which\", \"what\", or \"a\" in a query occur in almost every document. if i gave bonus points for those, that .ds_store file would jump right back to the top and poison my search results. similarly, 1-2 letter word fragments left behind from punctuation parsing were creating noise.\n\ni set up a two-layer filter—supporting both turkish and english (since the agent operates in both languages):\n\n```swift\n// words too common to signal relevance (english + turkish)\nprivate static let stopwords: Set<String> = [\n    \"the\", \"a\", \"an\", \"is\", \"are\", \"was\", \"were\", \"do\", \"does\", \"did\", \"i\", \"you\", \"me\",\n    \"my\", \"have\", \"has\", \"had\", \"what\", \"which\", \"who\", \"where\", \"when\", \"how\",\n    \"hangi\", \"ne\", \"ben\", \"beni\", \"benim\", \"kim\", \"nerede\", \"ne zaman\", \"nasıl\",\n    \"mi\", \"mı\", \"mu\", \"mü\", \"misin\", \"mısın\", \"musun\", \"müsün\", \"miyim\", \"mıyım\", \"de\", \"da\", \"ve\", \"bir\"\n]\n\nprivate func keywordBoost(query: String, candidateText: String) -> Float {\n    // split the query on non-alphanumerics, then drop 1-2 letter fragments and stop-words\n    let tokens = query.lowercased()\n        .components(separatedBy: CharacterSet.alphanumerics.inverted)\n        .filter { $0.count > 2 && !Self.stopwords.contains($0) }\n    guard !tokens.isEmpty else { return 0 }\n    let lowerCandidate = candidateText.lowercased()\n    // count query tokens that appear anywhere in the candidate text (substring match)\n    let matches = tokens.filter { lowerCandidate.contains($0) }.count\n    // +0.15 per matching token, capped at 0.6\n    return min(Float(matches) * 0.15, 0.6)\n}\n```\n\nthe count > 2 filter automatically weeds out meaningless 1-2 letter fragments without requiring every short suffix or abbreviation to be explicitly listed in 
the stopword set. thus, when the user asks \"which companies have i worked at,\" the system extracts only \"companies\" and \"worked\" and awards bonus points for those matches.\n\nmathematical weighting in hybrid search\n\nnow for the most satisfying part: formulation.\n\nrather than blindly adding raw points, i wanted to control the impact of word matching. a word appearing by chance in a very long document shouldn't carry the same weight as in a concise and focused one. furthermore, the added bonus shouldn't completely dominate the cosine similarity, reducing the system to a basic keyword search tool. the semantic intelligence still needed to carry weight.\n\ni devised a formula like this:\n\nfinal score = w * semantic score + (1 - w) * keyword score\n\ni experimented to find the optimal weight (w) parameter through trial and error.\n\nin my case, integrating the keyword score directly as a \"bonus points\" system was more intuitive because cosine similarity ranged between 0.0 and 1.0. adding a +0.15 bonus per matching word directly propelled spot-on matches (especially proper nouns) to the very top of the list.\n\none crucial tweak was necessary: capping the bonus. if left uncapped, a long document with 10 random matches but zero actual relevance could artificially inflate its score and override everything else. i capped the bonus at a maximum of 0.6—meaning keyword matching gives a powerful push but cannot completely hijack the system; semantic scoring still holds ground:\n\n```swift\nvar finalScore = baseCosineSimilarity\n\n// dynamic boost for each matching meaningful word\nlet matchingCount = queryTokens.filter { token in\n    !stopWords.contains(token) && documentText.lowercased().contains(token)\n}.count\n\nif matchingCount > 0 {\n    let lexicalBonus = min(Double(matchingCount) * 0.15, 0.6)\n    finalScore += lexicalBonus\n}\n```\n\nthe result... 
i won't lie, it didn't work on the first try\n\ni compiled the code, restarted the agent, and asked the same question: \"where have i worked?\" with hybrid scoring, everything should have been resolved. i looked at the logs.\n\nthe cv still wasn't there. it wasn't even in the top 5 results.\n\ni could have easily gotten frustrated, but i kept digging through the logs and uncovered three distinct, interconnected issues—each one a lightbulb moment:\n\nissue 1: generic labels. my agent had a \"deep continuity\" mechanism that automatically saved every tool result to memory in the background. the problem was that this mechanism assigned the exact same generic label (\"turn-based data find\") to everything it saved—including my cv. this meant there was no distinct label for keyword matching to latch onto; the cv's body was full, but its header was meaningless. i fixed this by writing a custom label describing the cv record in turkish (\"kullanıcının özgeçmişi (cv) — iş geçmişi, çalıştığı firmalar...\").\n\nissue 2 (even more surprising): long text diluting short labels. even after fixing the label, the score remained low. when computing the embedding, i was appending the first 500 characters of the solution text to the label—thinking \"more context, better embedding.\" but when i tested it, i saw that the embedding of the label alone scored 0.80 against the query, whereas the label combined with 500 characters of english cv text dragged the score down to 0.40! sentence embeddings calculate an average meaning over the entire text—a long, out-of-domain (relative to the turkish query) body text was swallowing the strength of the short, concise label. solution: i reduced the appended solution snippet from 500 characters down to 120 characters.\n\nissue 3: the invasion of duplicate records. 
in the final check, i realized that the automatic recording mechanism saved the same generic message (like a calculator error or a \"sound file detected\" notification) every single time it occurred. out of 903 records, hundreds were duplicates, occupying top ranks purely by sheer volume. i added a quick check to prevent saving duplicate content during recording and cleaned up existing duplicates: 903 records → 627 records.\n\nafter fixing all three, i tried again. this time, the cv record made it into the top 3 out of ~600 records with a score of 0.70—comfortably exceeding the 0.60 threshold i set.\n\n0.70 might not sound as spectacular as 0.88, but this was achieved not in a sterile sandbox, but in a messy, real-world dataset of 600+ records. and that's the whole point: the system must work under actual usage conditions, not just in \"clean\" scenarios.\n\nand what happened to that nuisance .ds_store file, you ask? since it contained neither \"company\" nor \"work,\" it was left with only its mediocre ~0.59 embedding score, falling safely below the threshold.\n\nagent's brain surgery: the leap in llm response quality\n\nthis small hybrid search adjustment acted like brain surgery on the agent's response quality.\n\nunder the old system, when the search engine erroneously retrieved .ds_store contents, the prompt passed to the agent's llm looked like this:\n\n```\nuser question: hangi firmalarda çalışmışım?\nretrieved memory records:\n- .ds_store, .git, sources/pheronagentcore/memory/experiencevault.swift, readme.md, ...\n```\n\nfaced with this input, the llm was forced to hallucinate or helplessly surrender: \"i couldn't find any information in my memory about which companies you worked for, i only see file lists.\"\n\nafter hybrid search, however, the data sent to the llm was pristine:\n\n```\nuser question: hangi firmalarda çalışmışım?\nretrieved memory records:\n- turgay savacı - cv: \"... 
between 2019-2024 as founder & general manager at savacı proje, and from 2019 to present as strategic software engineer & devops architect at sonaraura...\"\n```\n\nas soon as the agent saw this context, it came alive and listed the companies i had worked for one by one, along with dates and roles. this was the true rag experience!\n\nbut there was another overlooked detail: my agent has two distinct response pathways—a \"task\" mode that can call tools and plan, and a lightweight \"chat\" mode for quick conversations that bypasses tools and answers directly. the rule i added to the system prompt (\"search memory when asked about personal information\") only served the first mode. short, conversational questions like \"which companies have i worked at?\" routed to the second mode never triggered this rule because there was no tool calling in that pathway. therefore, the second pathway required a separate, code-level solution: now, that mode embeds the query on every message and automatically appends relevant memories to the context if there's a match above the threshold—even if the model doesn't explicitly request it.\n\na developer's confession: the overengineering trap\n\nthis minor crisis taught me a valuable lesson about modern software development and ai integration: don't leave everything to neural networks.\n\nas developers, when we get a new toy (in this case embeddings, vector databases, gpu-based shaders), we tend to completely forget old, proven, and \"boring\" methods. 
we disregard fundamental information retrieval algorithms, thinking \"the ai will understand.\" yet, giants like google or elasticsearch still produce their stellar search results by blending bm25 (classic tf-idf-based term frequency counts) with vector searches (hybrid search).\n\nhad i stubbornly insisted, \"no, i will solve this with vectors alone,\" i would probably be trying to integrate a 2 gb model into my system right now, heating up the device, and drowning in unnecessary complexity. instead, i placed a simple if string.contains() logic alongside the ai, and the problem was resolved 100%.\n\nsometimes the smartest solution isn't the most complex one, but putting an old-school string matching if statement in the right place.\n\nnow, if you'll excuse me, i'm off to gossip with my perfectly functioning agent about the former companies on my cv!", "url": "https://wpnews.pro/news/why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag", "canonical_source": "https://dev.to/turgaysavaci/why-a-simple-string-match-beat-apples-nlembedding-for-local-rag-1l4", "published_at": "2026-06-21 12:22:24+00:00", "updated_at": "2026-06-21 12:36:31.964228+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-agents", "developer-tools"], "entities": ["Apple", "NLEmbedding", "PheronAgent", "GloVe"], "alternates": {"html": "https://wpnews.pro/news/why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag", "markdown": "https://wpnews.pro/news/why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag.md", "text": "https://wpnews.pro/news/why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag.txt", "jsonld": "https://wpnews.pro/news/why-a-simple-string-match-beat-apple-s-nlembedding-for-local-rag.jsonld"}}