{"slug": "why-glean-builds-custom-embedding-models-for-every-customer", "title": "Why Glean Builds Custom Embedding Models for Every Customer", "summary": "Glean builds custom embedding models for each customer because enterprise data is highly heterogeneous, containing diverse sources like Slack messages, GitHub code, and meeting transcripts, along with company-specific terminology that generic models struggle to handle. By fine-tuning smaller, BERT-based models on each customer's unique data and continuously updating them based on user feedback, Glean achieves a 20% improvement in search performance over six months. This approach prioritizes retrieval quality, which the company argues is the critical foundation for effective enterprise AI and RAG systems.", "body_md": "Why Glean Builds Custom Embedding Models for Every Customer\nI hosted Manav from Glean for a guest lecture on enterprise search and fine-tuning embedding models. This session revealed a surprisingly underutilized approach that can dramatically improve RAG system performance - building custom embedding models for each customer rather than using generic solutions.\nWhat is Glean and why does their approach to enterprise search matter?\nGlean has built a comprehensive Work AI platform that unifies enterprise data across various applications (Google Drive, GitHub, Jira, Confluence) into a single system. 
Their flagship product, the Glean Assistant, leverages this unified data model to generate relevant answers to user questions and automate workflows.\nThe foundation of their system is semantic search capability, which Manav emphasized is absolutely critical for enterprise AI success:\n\"Search quality matters - you can't have a good RAG system, you can't have a good overall enterprise AI product unless you have good search.\"\nThis makes intuitive sense - without retrieving the right context from your enterprise data, even the best LLMs will produce hallucinations and incorrect information.\nKey Takeaway: Search quality is the foundation of enterprise AI success. Without effective retrieval from enterprise data, even the most advanced LLMs will generate hallucinations and provide incorrect information.\nWhat makes enterprise data uniquely challenging?\nUnlike internet data, which has a significant \"head problem\" where most searches target popular websites or common information sources, enterprise data is far more heterogeneous and doesn't fit neatly into a single mold. Manav explained:\n\"Enterprise data is very different than internet data... You have your basic document data sources like Google Drive, Google Docs, Confluence, Notion... But you're also working with a bunch of different types of applications, like Slack, which is a messaging platform. You have meetings, which doesn't really meet the standard concept of what a document is. You have GitHub and GitLab... They all behave in slightly different ways.\"\nThis diversity requires a robust, generalized unified data model that can handle the nuances of different data types while maintaining security and privacy. Additionally, company-specific language (project names, initiatives, internal terminology) creates another layer of complexity that generic models struggle with.\nKey Takeaway: Enterprise search is fundamentally different from web search because of data heterogeneity and company-specific language. 
A unified data model that can handle diverse data types while preserving security is essential for effective enterprise AI.\nWhy fine-tune embedding models for each customer?\nOne of the most fascinating aspects of Glean's approach is that they build custom embedding models for each customer. While many companies focus on using large, general-purpose embedding models, Glean has found that smaller, fine-tuned models often perform better for specific enterprise contexts.\nGlean's Custom Model Process:\n- Start with a high-performance base model (typically BERT-based)\n- Perform continued pre-training on company data using masked language modeling\n- Convert the language model into an embedding model through various training techniques\n- Continuously update the model as the company evolves\nThe results are impressive - after six months, they typically see a 20% improvement in search performance just from learning from user feedback and adapting to company changes.\nManav emphasized the power of smaller, specialized models:\n\"When you're thinking about building really performant enterprise AI... you want to also think about using smaller embedding models when you can, because small embedding models when fine-tuned to the domain and the specific task you have in hand can give you a lot better performance compared to just using large LLMs.\"\nKey Takeaway: Smaller, fine-tuned embedding models often outperform large general-purpose models for enterprise contexts. Glean achieves 20% search performance improvements through continuous model adaptation to company-specific language and user feedback.\nHow do they generate high-quality training data?\nCreating effective training data for fine-tuning embedding models is challenging, especially with enterprise privacy constraints. 
Glean uses several creative approaches:\nTraining Data Sources:\n- Title-body pairs: Mapping document titles to passages from the document body\n- Anchor data: Using documents that reference other documents to create relevance pairs\n- Co-access data: Identifying documents accessed together by users in short time periods\n- Public datasets: Incorporating high-quality public datasets like MS MARCO\n- Synthetic data: Using LLMs to create question-answer pairs for documents\nApplication-Specific Intelligence\nWhat's most impressive is their attention to application-specific nuances. For example, with Slack data, they don't just treat each message as a document. Instead, they create \"conversation documents\" from threads or messages within a short timespan, then use the first message as a title and the rest as the body.\nThis understanding of how different applications work leads to much higher quality training data than generic approaches.\nKey Takeaway: Generating high-quality training data requires understanding the nuances of different enterprise applications. 
Creative approaches like title-body pairs, anchor data, co-access signals, and synthetic data generation can provide valuable training signals even with privacy constraints.\nHow do they learn from user feedback?\nOnce users start interacting with their products, Glean incorporates those signals to further improve their models:\nSearch Product Feedback:\n- Query-click pairs: Direct signals of relevance from user interactions\nRAG Assistant Feedback (More Challenging):\nFor RAG-only settings like their Assistant product, where users don't explicitly click on documents, they face a more challenging problem:\n- Upvote/downvote systems (though these tend to get sparse usage)\n- Citation tracking when users click on citations to read more about a topic\n- Interaction pattern monitoring to infer relevance from various user behaviors\nManav candidly acknowledged the difficulty:\n\"This is like a pretty hard open question\"\nTheir approach of combining multiple weak signals seems pragmatic given the inherent challenge of getting explicit feedback signals for generative AI products.\nKey Takeaway: Learning from user feedback in RAG systems is challenging, especially for generative interfaces. Combining multiple weak signals (upvotes, citation clicks, interaction patterns) provides a more robust approach than relying on any single feedback mechanism.\nHow do they evaluate embedding model quality?\nEvaluating embedding models in enterprise settings is particularly challenging because:\n- Privacy constraints: You can't access customer data directly\n- Unique models: Each customer has a different model\n- Complex systems: End-to-end RAG evaluation involves many moving parts\nGlean's \"Unit Test\" Approach\nGlean's solution is to build \"unit tests\" for their models - targeted evaluations for specific behaviors they want their models to exhibit. 
For example, they test how well models understand paraphrases of the same query.\nThis approach allows them to:\n- Set performance targets for each customer's model\n- Identify underperforming models before customers experience issues\n- Focus optimization efforts on specific areas\nManav emphasized the importance of component-level optimization:\n\"If you want to really make good tangible progress day by day, isolating and optimizing individual components is always going to be much more scalable than trying to improve everything all together all at once.\"\nKey Takeaway: Evaluating embedding models in enterprise settings requires targeted \"unit tests\" that isolate specific behaviors. This component-level approach enables scalable optimization and prevents customer-facing issues.\nWhat role does traditional search play alongside embeddings?\nDespite all the focus on embedding models, Manav emphasized that traditional search techniques remain crucial:\n\"You don't want to over-index on semanticness or LLM-based scoring as the only thing that your search system should use... you can get a lot more bang for your buck by not using any semanticness at all to answer most questions.\"\nThe 60-70% Rule\nManav estimated that for 60-70% of enterprise search queries, basic lexical search with recency signals works perfectly well. Semantic search becomes more important for complex queries, particularly in agent-based systems.\nThis aligns with practical experience - getting 80% of the way there with full-text search and then adding semantic search as the cherry on top is often the most effective approach.\nKey Takeaway: Don't abandon traditional search techniques in pursuit of embedding-based approaches. 
A hybrid system that leverages both lexical and semantic search, along with signals like recency and authority, will deliver the best results for enterprise search.\nHow do they handle document relevance over time?\nOne interesting question addressed how Glean handles outdated documents that have been superseded by newer information. Their approach centers around a concept they call \"authoritativeness,\" which incorporates:\nAuthoritativeness Factors:\n- Recency: Newer documents are generally more relevant\n- Reference patterns: Documents that continue to be linked to or accessed remain authoritative\n- User satisfaction signals: Documents that consistently satisfy user queries maintain relevance\nReal-World Example\nA document containing WiFi password information might be old but still highly relevant if people continue to reference it when answering related questions.\nThis multi-faceted approach to document authority is more sophisticated than simply prioritizing recent content, which would miss important evergreen documents.\nKey Takeaway: Document relevance over time requires a multi-faceted \"authoritativeness\" approach that balances recency with reference patterns and user satisfaction signals, rather than simply prioritizing the newest content.\nFinal thoughts on building enterprise search systems\nManav concluded with several key insights for building effective enterprise search systems:\nCore Principles:\n- Unified data model is critical for handling heterogeneous enterprise data\n- Company-specific language matters tremendously for search quality\n- Fine-tuned smaller models often outperform generic large models for specific tasks\n- User feedback learning, though challenging, provides invaluable signals\n- Targeted \"unit tests\" enable scalable model quality assessment\n- Traditional search techniques remain powerful and shouldn't be discarded\nThe Pragmatic Approach\nGlean's approach is refreshingly pragmatic. 
They've learned that the path to high-quality enterprise search isn't just about using the latest, largest models, but about understanding the unique characteristics of enterprise data and building systems that address those specific challenges.\nThe emphasis on company-specific language models is particularly noteworthy - this is an area where many companies struggle when they try to apply generic embedding models to their unique terminology and document structures.\nKey Takeaway: Successful enterprise search requires a pragmatic approach that combines custom embedding models, unified data architecture, hybrid search techniques, and continuous learning from user feedback rather than relying solely on off-the-shelf solutions.", "url": "https://wpnews.pro/news/why-glean-builds-custom-embedding-models-for-every-customer", "canonical_source": "https://jxnl.co/writing/2025/09/11/why-glean-builds-custom-embedding-models-for-every-customer/", "published_at": "2025-09-11 00:00:00+00:00", "updated_at": "2026-05-19 22:17:27.385404+00:00", "lang": "en", "topics": ["enterprise-software", "artificial-intelligence", "machine-learning", "data"], "entities": ["Glean", "Manav", "Google Drive", "GitHub", "Jira", "Confluence", "Glean Assistant"], "alternates": {"html": "https://wpnews.pro/news/why-glean-builds-custom-embedding-models-for-every-customer", "markdown": "https://wpnews.pro/news/why-glean-builds-custom-embedding-models-for-every-customer.md", "text": "https://wpnews.pro/news/why-glean-builds-custom-embedding-models-for-every-customer.txt", "jsonld": "https://wpnews.pro/news/why-glean-builds-custom-embedding-models-for-every-customer.jsonld"}}