Why a simple string match beat Apple's NLEmbedding for local RAG

A developer building a personal AI agent found that Apple's NLEmbedding performed poorly for local RAG, with a Turkish query scoring 0.587 for a CV and 0.60 for a junk file, and an English query scoring only 0.17. The developer attributes the failure to NLEmbedding's likely use of static word vectors like GloVe, which lack contextual understanding and struggle with agglutinative languages. A simple string match ultimately outperformed the embedding-based search.

How Apple's NLEmbedding drove me crazy and how I built my own hybrid search engine

Recently, while working on my personal AI agent (pheronagent), I was focused on perfecting its memory and retrieval system. Everyone is talking about that famous acronym: RAG (retrieval-augmented generation). The system is simple: I feed the agent my documents, it converts them into vectors (embeddings), and when I ask a question, it finds the most similar vectors and answers me.

Sounds perfect on paper, right?

So, like any loyal Apple ecosystem developer, instead of downloading massive models from external sources (or burning money on APIs), I decided to use NLEmbedding, the native capability of the operating system that runs directly on-device. After all, Apple had built this into the OS; it was both fast and privacy-focused.

But real life, as it turns out, doesn't progress as smoothly as WWDC presentations...

"Where have I worked?" - the first explosion

It all started with a very innocent question. I had uploaded my CV to the system. While chatting with my agent, I casually asked: "Where have I worked?"

I expected the agent to fire up the Metal cores in the background, find my CV within seconds, and list the companies for me. Instead, the agent stared blankly. I opened the logs to see what the hell the search engine was doing behind the scenes.
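For reference, the retrieval step described above (score every stored vector against the query vector, keep the ones above a cutoff) can be sketched in plain Swift as follows. This is a minimal CPU-only illustration, not the project's actual code: `MemoryRecord`, `cosineSimilarity`, and `search` are hypothetical names, and the threshold is just a parameter.

```swift
import Foundation

// Hypothetical record type; the names here are illustrative only.
struct MemoryRecord {
    let title: String
    let vector: [Float]
}

// Cosine similarity of two equal-length vectors.
// Returns 0 for a length mismatch, an empty vector, or a zero vector.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    return denominator == 0 ? 0 : dot / denominator
}

// Scores every record against the query, drops anything below `threshold`,
// and returns the survivors with the best match first.
func search(query: [Float],
            in records: [MemoryRecord],
            threshold: Float) -> [(record: MemoryRecord, score: Float)] {
    return records
        .map { (record: $0, score: cosineSimilarity(query, $0.vector)) }
        .filter { $0.score >= threshold }
        .sorted { $0.score > $1.score }
}
```

With a scheme like this, everything hinges on two things: where the cutoff sits, and whether the similarity scores actually reflect meaning.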
The shocking scenario was exactly this: my CV scored 0.587 and missed the cutoff by a hair.

"No worries," I thought. "We can just lower the threshold a bit, make it 0.55, and call it a day."

But then I saw the truly terrifying thing just one line below. For the exact same query, guess what score a completely irrelevant junk record in the system (a list of files containing .DS_Store) got?

0.59 - 0.60

Wait a minute... My detailed, multi-page resume gets a score of 0.587 just because it doesn't contain the words "which", "company", "work" in that exact order, yet a meaningless list of hidden files scraped from some corner of the disk gets a higher score than my CV?

The "it must be language incompatibility" fallacy

I immediately started theorizing. Apple's NLEmbedding.sentenceEmbedding(for: .english) model, as the name suggests, was optimized for English. Because I had asked the question in Turkish, the model was likely tagging the words as "out of vocabulary" (OOV) and throwing them to a completely random point in the vector space. The high score of the .DS_Store list was just a product of this randomness: it happened to land near a similar vector by pure luck.

"Okay," I said. "Since the model is English, I will ask in English. After all, AI speaks every language anyway."

I changed the prompt: "Which companies have I worked at?"

I watched the logs with anticipation. My expectation was that the English model would perfectly understand this query in its native language and boost my CV's score to somewhere around 0.80.

The result? 0.17.

Yes, you read that right. 0.17. By asking in English, the score crashed even further. My language compatibility theory collapsed like a house of cards before my eyes.

What's under the hood of Apple's NLEmbedding?

After this disaster, I decided to do some research. How does Apple's NLEmbedding class actually work under the hood?
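If you want to poke at the class yourself, a minimal sketch using the NaturalLanguage framework might look like the one below. It only runs on Apple platforms, the sample texts are made up, and `cosineDistances` is a hypothetical helper name. Note that `distance(between:and:distanceType:)` returns a cosine distance (lower means closer), so its numbers are not directly comparable to the similarity scores in my logs.

```swift
import NaturalLanguage

// Compares one query against several candidate texts using Apple's built-in
// English sentence embedding (available on iOS 14 / macOS 11 and later).
// Returns an empty array if the embedding is not available on this device.
func cosineDistances(query: String,
                     candidates: [String]) -> [(text: String, distance: Double)] {
    guard let embedding = NLEmbedding.sentenceEmbedding(for: .english) else {
        return []
    }
    return candidates.map { text in
        // NLDistance is a Double; smaller values mean the strings are closer.
        (text: text,
         distance: embedding.distance(between: query, and: text, distanceType: .cosine))
    }
}

// Hypothetical sample texts, purely for illustration.
let results = cosineDistances(
    query: "Which companies have I worked at?",
    candidates: [
        "Senior iOS developer at an example company for five years.",
        ".DS_Store .DS_Store .DS_Store"
    ]
)
for result in results {
    print(result.distance, result.text)
}
```

Running a handful of relevant and irrelevant texts through a probe like this is a quick way to see whether the distances separate them at all.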
I learned that NLEmbedding on Apple devices (especially the structures inherited from older iOS/macOS versions) doesn't function like massive, dynamic transformer-based models (like BERT or GPT). It most likely relies on static word vector representations like GloVe (Global Vectors for Word Representation) or highly lightweight neural network architectures based on word-level compression.

The biggest weakness of such models is that their contextual understanding is extremely limited: a static model gives each word one fixed vector, no matter which sentence it appears in.

Consequently, agglutinative languages like Turkish become a complete nightmare for these models. Unable to properly extract word roots for variations like "çalıştım", "çalışmışım", or "çalışıyordum" (all forms of "worked"), the model treats the words as completely foreign. In the end, we are left with meaningless 512-dimensional float arrays carrying close to zero semantic information, essentially just "noise".

Speeding up with Metal, choking on vectors

The tragicomic part was that I had spared no expense on performance in the project's search infrastructure. In the experiencevault.swift file representing the agent's memory vault, I had written a Metal GPU kernel so I wouldn't waste time iterating through similarity calculations one by one on the CPU. I had some fancy Metal shader code like this:

include