Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

IBM has released two new multilingual embedding models under the Apache 2.0 license, built on ModernBERT: a compact 97M-parameter model and a full-size 311M-parameter model. Both support over 200 languages, handle context lengths up to 32,768 tokens, and achieve top retrieval quality scores on MTEB benchmarks for their size classes. The models are designed for enterprise use, include code retrieval across nine programming languages, and are compatible with popular frameworks like LangChain and LlamaIndex.

Published May 14, 2026

TL;DR: Two new Apache 2.0 multilingual embedding models built on ModernBERT: a 97M-parameter compact model that beats every open sub-100M multilingual embedder on MTEB Multilingual Retrieval (60.3), and a 311M full-size model with Matryoshka support that scores 65.2 on MTEB Multilingual Retrieval (#2 among open models under 500M parameters). Both cover 200+ languages, are tuned on 52 languages, handle 32K-token context (64x R1), and add code retrieval across 9 programming languages.

In this post: Enterprise-Ready by Design · A Strong Sub-100M Multilingual Model · What Changed from R1 · Training the Full-Size 311M Model · Building the Compact 97M Multilingual Model · Benchmark Results · Matryoshka Embeddings (311M) · Deployment Options · For Framework Integrators · Which Model Should You Use? · Try The Models

Multilingual embedding models face a persistent tension: broad language coverage usually comes at the cost of model size, and small models usually sacrifice languages. If you work across languages (retrieval-augmented generation over multilingual corpora, cross-lingual search, code retrieval in international teams), you've likely had to choose between a model that's fast enough and one that's good enough. The Granite Embedding Multilingual R2 release narrows that gap considerably.
We're releasing two new multilingual embedding models:

- granite-embedding-311m-multilingual-r2: a 311M-parameter full-size model with 768-dimensional embeddings, Matryoshka dimension support, and top-tier multilingual retrieval quality.
- granite-embedding-97m-multilingual-r2: a 97M-parameter compact model with 384-dimensional embeddings that delivers strong retrieval quality for its size.

Both models support 200+ languages (with enhanced retrieval quality for 52 languages and programming code), handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and are released under the Apache 2.0 license. They work out of the box with `sentence-transformers` and `transformers`, require no task-specific instructions, and are compatible as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a one-line model name change. For frameworks currently using an English-only default, that one line gives every user in your community support for 200+ languages: no API changes, no new dependencies, no code changes required on their end. Both models ship with ONNX and OpenVINO weights for CPU-optimized inference.

**52 enhanced-support languages**

The underlying encoder was pretrained on text from 200+ languages, producing general-purpose embeddings for any of them.
The following 52 languages receive explicit retrieval-pair and cross-lingual training for higher-quality retrieval: Albanian (sq), Arabic (ar), Azerbaijani (az), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Khmer (km), Korean (ko), Latvian (lv), Lithuanian (lt), Malay (ms), Marathi (mr), Norwegian (no), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Tagalog (tl), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi).

Additionally, the models are trained on programming code (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, C++) and support cross-lingual code retrieval.

## Enterprise-Ready by Design

Both embedding models are trained on a mixture of IBM-curated datasets, publicly available data, and internally generated or synthetic data. Public web-derived data used in training is selected and filtered using IBM-developed quality, deduplication, and governance processes intended to reduce risk in downstream commercial use. We intentionally avoid the use of the MS-MARCO training dataset and datasets with explicit non-commercial licensing restrictions.

The models are pretrained using [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb), an IBM-curated dataset derived from publicly available web content and processed using IBM's data preparation and governance tooling, along with additional IBM-curated and other publicly available sources. Datasets undergo IBM governance review to assess licensing considerations, ownership signals, and personal data risks. These processes are designed to contribute to responsible use and enterprise deployment.
## A Strong Sub-100M Multilingual Model

The standout of this release is granite-embedding-97m-multilingual-r2. At 97 million parameters, it scores 60.3 on Multilingual MTEB Retrieval across 18 languages, the highest retrieval score we've found for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9 on the same benchmark, a +9.4 point gap on a mature benchmark.

At roughly one-third the size of the 311M full-size model, it retains the majority of that model's retrieval quality across multilingual, code, and long-document benchmarks. It also gains +12.2 points on MTEB Multilingual Retrieval over its direct predecessor, driven by a new architecture, better training data, and a novel pruning methodology (more on that below).

The full-size granite-embedding-311m-multilingual-r2 scores 65.2 on the same benchmark, a +13.0 point gain over its R1 predecessor.

## What Changed from R1

The Granite Embedding Multilingual R1 models were built on XLM-RoBERTa encoders with a 512-token context window. The R2 generation is a ground-up rebuild. [ModernBERT](https://huggingface.co/blog/modernbert) is a recent encoder architecture that revisits the original BERT design with techniques from the last five years of transformer research. The shift brings several practical benefits: alternating attention lengths reduce computation and significantly improve throughput on long sequences, rotary position embeddings allow the 32K context window without the positional interpolation hacks that plague older architectures, and Flash Attention 2.0 support speeds up encoding on modern GPUs.

The new multilingual tokenizers are worth highlighting. Rather than reusing XLM-RoBERTa's 250K-token vocabulary, we adopted existing tokenizers with strong multilingual and code coverage.
The 311M model uses the Gemma 3 tokenizer (262K tokens); the 97M model starts from the GPT-OSS tokenizer and prunes it down to a compact 180K-token vocabulary that preserves broad multilingual coverage while reducing the embedding table's parameter footprint. Tokenizer efficiency matters more than people realize: a 32K-token window sounds impressive until your tokenizer burns half of it encoding a single paragraph of Thai.

## Training the Full-Size 311M Model

The 311M model is a 22-layer ModernBERT encoder with a 262K-token multilingual vocabulary, trained through a multi-stage pipeline:

- **Knowledge distillation**: The model learns from multiple teacher models simultaneously. The teachers are Granite 3.3 Instruct and Mistral v0.2 Instruct decoder models, further fine-tuned for text embeddings, which transfer retrieval-specific knowledge into the 311M encoder architecture.
- **Contrastive fine-tuning**: Standard contrastive training on multilingual retrieval pairs (queries matched with relevant and hard-negative passages across 52 languages and code) sharpens the model's ability to distinguish relevant from irrelevant results.
- **Model merging**: After training, we merge checkpoints from different training stages and configurations. This combines the strengths of models optimized for different objectives (e.g., multilingual breadth vs. English depth) into a single set of weights without additional training compute.
- **Matryoshka Representation Learning**: The model is trained with Matryoshka objectives so that its 768-dimensional embeddings can be truncated to 512, 384, 256, or 128 dimensions with minimal quality loss (see Matryoshka Embeddings below).

The result is a model that scores 65.2 on MTEB Multilingual Retrieval and 56.3 on the overall average, a +14.5 point average gain over its R1 predecessor.
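To make the contrastive fine-tuning step more concrete, here is a minimal NumPy sketch of an InfoNCE-style loss that uses both in-batch negatives and one hard negative per query. This is an illustration under stated assumptions, not the training code used for these models: the post only says the contrastive training is "standard", so the exact loss form, the temperature value, the one-hard-negative-per-query layout, and the function name `info_nce_loss` are all hypothetical.

```python
import numpy as np


def info_nce_loss(queries, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style contrastive loss (illustrative sketch, not IBM's code).

    queries:        (B, D) query embeddings
    positives:      (B, D) one relevant passage per query
    hard_negatives: (B, D) one hard-negative passage per query
    The temperature of 0.05 is an assumed value, not taken from the post.
    """
    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q = l2_normalize(queries)
    # Candidate pool: all positives first, then all hard negatives -> (2B, D)
    candidates = l2_normalize(np.concatenate([positives, hard_negatives], axis=0))

    # Cosine similarity of each query to every candidate, scaled by temperature
    logits = q @ candidates.T / temperature  # shape (B, 2B)

    # Numerically stable log-softmax over the candidate axis
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct candidate for query i is positive i, which sits in column i
    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())


# Toy data: random vectors stand in for real query and passage embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
positives = queries + 0.01 * rng.normal(size=(8, 16))  # near-duplicates of the queries
hard_negatives = rng.normal(size=(8, 16))              # unrelated stand-ins

aligned = info_nce_loss(queries, positives, hard_negatives)
shuffled = info_nce_loss(queries, np.roll(positives, 1, axis=0), hard_negatives)
print(f"aligned pairs: {aligned:.3f}   mismatched pairs: {shuffled:.3f}")
```

The loss is low when each query is most similar to its own positive and high when the pairs are mismatched, which is the signal that pushes relevant passages above both in-batch and hard negatives.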
## Building the Compact 97M Multilingual Model

The 97M model is trained through a combination of vocabulary selection and knowledge distillation:

- **Vocabulary selection**: The 262K-token vocabulary is reduced to a purpose-trained 180K-token vocabulary that preserves broad multilingual coverage while cutting the embedding table size substantially.
- **Knowledge distillation**: The pruned model is then fine-tuned with contrastive training and knowledge distillation from multiple teacher models, including [Granite 4.1 8B](https://huggingface.co/ibm-granite/granite-4.1-8b) and Mistral Instruct decoder-based teachers, to improve retrieval quality.

This approach transfers retrieval-specific knowledge from multiple strong teachers while reducing the model parameters without sacrificing language coverage. The result is a highly efficient compact model: it scores 60.3 on MTEB Multilingual Retrieval vs. 65.2 for the full-size model while being approximately 3x smaller.

## Benchmark Results

### Multilingual Retrieval

Performance across the main benchmark suite, sorted by model size.
Scores are averages across tasks within each benchmark (higher is better):

| Model | Params | Active Params | Embed Dim | MTEB Multilingual Retrieval (18) | Code (12) | English Retrieval (10) | LongEmbed (6) | RaR-b (17) |
|---|---|---|---|---|---|---|---|---|
| F2LLM-v2-80M | 80M | 32M | 320 | 50.1 | 68.0 | 47.5 | 31.7 | 17.9 |
| multilingual-e5-small | 118M | 22M | 384 | 50.9 | 53.5 | 46.5 | 38.8 | 20.3 |
| granite-embedding-107m-multilingual (R1) | 107M | 11M | 384 | 48.1 | 40.7 | 47.9 | 34.3 | 17.1 |
| paraphrase-multilingual-MiniLM-L12-v2 | 118M | 22M | 384 | 36.6 | 23.5 | 35.9 | 20.9 | 10.9 |
| jina-embeddings-v5-text-nano | 212M | 113M | 768 | 63.3 | 71.2 | 58.8 | 63.6 | 25.2 |
| harrier-oss-v1-270m | 268M | 100M | 640 | 66.4 | 62.4 | 52.1 | 64.9 | 32.9 |
| multilingual-e5-base | 278M | 86M | 768 | 52.7 | 52.6 | 49.0 | 40.5 | 23.4 |
| granite-embedding-278m-multilingual (R1) | 278M | 86M | 768 | 52.2 | 48.5 | 51.5 | 37.7 | 18.9 |
| embeddinggemma-300m | 308M | 106M | 768 | 62.5 | 68.7 | 54.6 | 55.4 | 26.1 |
| gte-multilingual-base | 305M | 113M | 768 | 57.2 | 57.5 | 50.8 | 62.1 | 19.0 |
| snowflake-arctic-embed-m-v2.0 | 305M | 113M | 768 | 54.8 | 55.2 | 58.4 | 55.4 | 23.3 |
| multilingual-e5-large | 560M | 304M | 1024 | 53.7 | 55.8 | 51.5 | 40.4 | 25.4 |
| text-embedding-3-small (OpenAI, API only) | n/a | n/a | 1536 | 50.7 | n/a | 53.8 | 53.6 | 23.2 |
| **granite-embedding-97m-multilingual-r2** | 97M | 28M | 384 | 60.3 | 60.4 | 50.1 | 65.6 | 24.9 |
| **granite-embedding-311m-multilingual-r2** | 311M | 110M | 768 | 65.2 (#2) | 63.8 (#3) | 52.6 (#5) | 71.7 (#1) | 28.0 (#2) |

A few things stand out:

- **The 97M R2 model beats multilingual-e5-base and gte-multilingual-base** (~300M parameter models) on average and on most individual benchmarks, despite being roughly 3x smaller.
- **paraphrase-multilingual-MiniLM-L12-v2**, a widely used framework default, scores 36.6, a full 23.7 points behind the 97M R2 model, which is also slightly smaller (97M vs 118M parameters) with the same 384-dimensional output.
- **LongEmbed is the biggest R1-to-R2 gain**: +31.3 points for the 97M model, +34.0 for the 311M. This is the direct payoff of the 32K context window: R1's 512-token limit meant your legal contract was being judged by its first page. Many practical multilingual workloads involve long documents (legal contracts, technical manuals, research papers, multi-page reports) that R1 simply could not see in full.
- **Code retrieval improves dramatically**: +19.7 (97M) and +15.3 (311M) over R1, reflecting the new code training set, larger context window, and better training methodology.
- **In the broader competitive field**, harrier-oss-v1-270m leads on MTEB Multilingual Retrieval (66.4) and RaR-b (32.9), while jina-embeddings-v5-text-nano leads on Code (71.2) and English Retrieval (58.8). The 311M Granite model is competitive on average (56.3) and leads on LongEmbed (71.7), while offering substantially higher encoding throughput than jina-embeddings-v5-text-nano (see speed table below).

### Speed and Throughput

Encoding speed matters for production workloads, especially when you're indexing millions of documents or need low-latency query encoding. We measured latency and throughput on a single NVIDIA H100 GPU using 512-token chunks.

The 97M model encodes over 2,500 documents per second, comparable throughput to multilingual-e5-small, while delivering substantially higher retrieval quality. The 311M model, at ~1,800 docs/sec, outperforms jina-embeddings-v5-text-nano on retrieval quality (65.2 vs. 63.3) at over 5.5x the encoding speed. (Note: speed numbers are computed with the latest `transformers` code, which had a speed regression relative to version 4.57 for both the Jina and Granite models; see our technical report for details.) harrier-oss-v1-270m offers the best combination of speed and retrieval score among the competitors listed here.
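To put those throughput figures in perspective, the following back-of-the-envelope sketch converts documents per second into indexing time. The corpus size of 10 million chunks is a hypothetical example, and the rates are the rounded figures quoted above ("over 2,500" and "~1,800" docs/sec on a single H100 with 512-token chunks), so treat the results as rough estimates rather than measurements.

```python
def indexing_hours(num_chunks: int, docs_per_second: float) -> float:
    """Estimated wall-clock hours to encode num_chunks at a steady rate."""
    return num_chunks / docs_per_second / 3600


# Rounded throughput figures quoted in the text (single H100, 512-token chunks)
THROUGHPUT = {
    "granite-embedding-97m-multilingual-r2": 2500,
    "granite-embedding-311m-multilingual-r2": 1800,
}

corpus_chunks = 10_000_000  # hypothetical corpus size, chosen for illustration

for name, rate in THROUGHPUT.items():
    print(f"{name}: {indexing_hours(corpus_chunks, rate):.2f} GPU-hours")
# granite-embedding-97m-multilingual-r2: 1.11 GPU-hours
# granite-embedding-311m-multilingual-r2: 1.54 GPU-hours
```

Real pipelines will add tokenization, I/O, and batching overhead, and longer chunks encode more slowly, so the numbers are best read as a lower bound on encoding time.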
## Matryoshka Embeddings (311M)

The 311M model supports [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which lets you truncate embeddings from the full 768 dimensions down to 512, 384, 256, or 128 with graceful quality degradation. This is useful when storage, memory, or similarity-computation cost is a concern: a 256-dimensional embedding takes one-third the storage of a 768-dimensional one, and cosine similarity is proportionally cheaper to compute.

Here's how retrieval quality holds up across embedding dimensions.

The quality loss from dimension reduction is remarkably small. Cutting from 768 to 256 dimensions (a 3x reduction in storage and similarity-computation cost) drops MTEB Multilingual Retrieval by just 0.5 points (65.2 → 64.7) and Code Retrieval by 0.5 points (63.9 → 63.4). Even at 128 dimensions (a 6x reduction), the model still scores 63.7 on MTEB Multilingual Retrieval and 62.3 on Code, retaining over 97% of its full-dimension performance. In practice, this means you can substantially reduce your index size and search latency with minimal impact on result quality.

Note: the results in the figure above were evaluated with a context length of 1024 for English and Multilingual Retrieval, and 8192 for Code.

For comparison, the 311M model truncated to 384 dimensions (the same dimensionality as the 97M model's native output) still outperforms the 97M model across all three benchmarks. If you need 384-dimensional embeddings and can afford the 311M model's encoding cost, Matryoshka truncation is the stronger option.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Full 768-dimensional embeddings
full = model.encode(["example text"])
print(full.shape)  # (1, 768)

# Truncated to 384 dimensions
small = model.encode(["example text"], truncate_dim=384)
print(small.shape)  # (1, 384)
```

The 97M model does not support Matryoshka; 384 dimensions is already compact.
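If you already store full 768-dimensional vectors and want to shrink them after the fact, the same truncation can be sketched with plain NumPy: keep the leading dimensions, then re-normalize so that dot products still behave like cosine similarities. This is a minimal sketch, not part of the model's API; the helper names `truncate_and_normalize` and `index_size_mb` are hypothetical, the storage estimate assumes float32 vectors with no index overhead, and the random vectors below are only placeholders for real embeddings.

```python
import numpy as np


def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of each row and L2-normalize the result."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms


def index_size_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in megabytes, assuming float32 and no index overhead."""
    return num_vectors * dim * bytes_per_value / 1e6


# Placeholder data: random unit vectors stand in for real 768-dim embeddings.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_normalize(full, 256)
print(small.shape)  # (4, 256)

# Storage for one million vectors at 768 vs. 256 dimensions
print(index_size_mb(1_000_000, 768), index_size_mb(1_000_000, 256))  # 3072.0 1024.0
```

Only Matryoshka-trained embeddings concentrate most of their information in the leading dimensions; truncating the random placeholders here demonstrates the mechanics and the 3x storage saving, not the retrieval quality.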
### Cross-lingual Retrieval

Average performance on cross-lingual tasks within MTEB Retrieval. [Belebele](https://huggingface.co/datasets/facebook/belebele) measures cross-lingual passage matching across 122 languages; MLQA measures extractive cross-lingual question answering retrieval across 7 languages.

| Model | Belebele Retrieval | MLQA Retrieval |
|---|---|---|
| granite-embedding-107m-multilingual (R1) | 55.1 | 60.5 |
| granite-embedding-278m-multilingual (R1) | 62.2 | 63.0 |
| **granite-embedding-97m-multilingual-r2** | 52.9 | 60.5 |
| **granite-embedding-311m-multilingual-r2** | 66.5 | 67.1 |

The 311M R2 model gains +4.3 on Belebele and +4.1 on MLQA over its R1 predecessor, showing improved cross-lingual transfer at the larger scale across both benchmarks. The 97M R2 model scores lower on Belebele (52.9 vs 55.1, −2.2) while matching its R1 predecessor on MLQA (60.5). The Belebele gap is a tradeoff inherent in the pruning and vocabulary reduction process: the R2 model's training prioritized the broader 18-language MTEB Multilingual Retrieval set (where it gains +12.2 over R1) and long-document retrieval (+31.3), while the smaller vocabulary (180K vs. 250K tokens) and reduced layer count (12 vs. 22) affect narrow cross-lingual transfer tasks. If cross-lingual transfer across many language pairs is your primary use case, the full-size 311M model is the better choice.

## Deployment Options

Both models ship with multiple deployment paths for production use.
Install the core library with `pip install sentence-transformers`.

**Sentence Transformers** (recommended for most users):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

queries = [
    "What is the tallest mountain in Japan?",  # English
    "Wer hat das Lied Achy Breaky Heart geschrieben?",  # German
    "ドイツの首都はどこですか？",  # Japanese
]
passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese
    "Achy Breaky Heart is a country song written by Don Von Tress.",  # English
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",  # German
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)

print(util.cos_sim(q_emb, p_emb))
# Each query scores highest against its matching passage, across languages
```

**LangChain** (`pip install langchain-huggingface`):

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)

docs = embeddings.embed_documents([
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
])
query = embeddings.embed_query("What is Japan's tallest mountain?")
```
This is a drop-in replacement anywhere LangChain accepts an `Embeddings` object.

**LlamaIndex** (`pip install llama-index-embeddings-huggingface`):

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)
Settings.embed_model = embed_model  # applies globally to any index or pipeline
```

**Haystack** (`pip install sentence-transformers haystack-ai`):

```python
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_embedder = SentenceTransformersDocumentEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
query_embedder = SentenceTransformersTextEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
doc_embedder.warm_up()
query_embedder.warm_up()

# Embed and index documents
document_store = InMemoryDocumentStore()
result_docs = doc_embedder.run(documents=[
    Document(content="富士山は日本最高峰の独立峰です。"),
    Document(content="Mount Fuji is Japan's highest peak."),
    Document(content="Achy Breaky Heart is a country song written by Don Von Tress."),
    Document(content="Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland."),
])
document_store.write_documents(result_docs["documents"])

# Embed query and retrieve
result_query = query_embedder.run(text="What is Japan's tallest mountain?")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
results = retriever.run(query_embedding=result_query["embedding"], top_k=2)

for doc in results["documents"]:
    print(f"{doc.score:.3f}  {doc.content}")
# 0.961  Mount Fuji is Japan's highest peak.
# 0.913  富士山は日本最高峰の独立峰です。
```

**Milvus** (`pip install pymilvus sentence-transformers`):

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

# Use "./milvus.db" for local persistence or a server URI for production
client = MilvusClient(":memory:")
client.create_collection(collection_name="multilingual_docs", dimension=384)

docs = [
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
    "Achy Breaky Heart is a country song written by Don Von Tress.",
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",
]
embeddings = model.encode(docs).tolist()
client.insert(
    collection_name="multilingual_docs",
    data=[
        {"id": i, "vector": emb, "text": doc}
        for i, (emb, doc) in enumerate(zip(embeddings, docs))
    ],
)

query_emb = model.encode(["What is Japan's tallest mountain?"]).tolist()
results = client.search(
    collection_name="multilingual_docs",
    data=query_emb,
    limit=2,
    output_fields=["text"],
)
for hit in results[0]:
    print(f"{hit['distance']:.3f}  {hit['entity']['text']}")
# 0.961  Mount Fuji is Japan's highest peak.
# 0.913  富士山は日本最高峰の独立峰です。
```

Both models also ship with pre-converted ONNX and OpenVINO weights for optimized CPU/accelerator inference, work as embedding endpoints via vLLM (`vllm serve ... --task embed`), and can be converted to GGUF for use with [Ollama](https://ollama.com/) or [llama.cpp](https://github.com/ggerganov/llama.cpp). See the model cards for full deployment examples.

## For Framework Integrators

If you maintain an embedding framework, vector store, or RAG pipeline library and are evaluating these models as a default, here's what you need to know:

- **License**: Apache 2.0, trained without MS-MARCO.
- **Drop-in behavior**: No task-specific instruction prefix required; behaves like all-MiniLM-L6-v2 at the API level. Existing code that calls `.encode` works unchanged.
- **Dimensionality**: 384-dimensional output (97M) and 768-dimensional output (311M), matching the most common existing defaults.
No index migration required.
- **Model size**: The 97M model's weights are 195 MB (safetensors), less than half the size of paraphrase-multilingual-MiniLM-L12-v2 (471 MB), the most common multilingual default. The quantized ONNX weights are just 98 MB, comparable to all-MiniLM-L6-v2 (91 MB) while covering 200+ languages.
- **CPU-friendly**: Ships with ONNX and OpenVINO weights for optimized CPU inference. No GPU dependency for a getting-started tutorial.
- **Multilingual by default**: If your current default is English-only, this is a one-line swap that gives every user in your community support for 200+ languages without touching their code.
- **Stable identifier**: `ibm-granite/granite-embedding-97m-multilingual-r2` on Hugging Face, maintained by IBM under the Granite model family.

To discuss adopting these models as a default in your project, open an issue at [ibm-granite/granite-embedding-models](https://github.com/ibm-granite/granite-embedding-models).

## Which Model Should You Use?

These two multilingual models are part of the broader Granite Embedding R2 family, which also includes two high-performing English-focused models: [granite-embedding-english-r2](https://huggingface.co/ibm-granite/granite-embedding-english-r2) (149M parameters) and [granite-embedding-small-english-r2](https://huggingface.co/ibm-granite/granite-embedding-small-english-r2) (47M parameters). If your data is predominantly English, the English models offer higher retrieval quality on English benchmarks at a smaller footprint, since they don't need to allocate capacity across 200+ languages.

| If you need... | Use |
|---|---|
| Best multilingual retrieval quality | granite-embedding-311m-multilingual-r2 |
|  | [granite-embedding-97m-multilingual-r2](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) |
|  | [granite-embedding-english-r2](https://huggingface.co/ibm-granite/granite-embedding-english-r2) or [granite-embedding-small-english-r2](https://huggingface.co/ibm-granite/granite-embedding-small-english-r2) |

## Try The Models

Both models are available now on Hugging Face under the [IBM Granite Embedding collection](https://huggingface.co/collections/ibm-granite/granite-embedding-models).

You can try the small models interactively on CPU via the [Granite Embedding demo](https://huggingface.co/spaces/ibm-granite/granite-embedding) on Hugging Face Spaces, or run the full examples notebook in Google Colab.

Our detailed technical report, covering the full training methodology, per-language evaluations, and pruning ablations, is available here: [Granite Multilingual Embedding R2 report](https://arxiv.org/abs/2605.13521).

For questions, feedback, or issues, visit [ibm-granite/granite-embedding-models](https://github.com/ibm-granite/granite-embedding-models) on GitHub.

Framework maintainers: If you'd like to adopt these models as a default in your project, open an issue at [ibm-granite/granite-embedding-models](https://github.com/ibm-granite/granite-embedding-models). We're happy to help with integration, testing, and any questions about licensing or deployment.

Give them a try, and if the embeddings spark joy, smash that ❤️ button on Hugging Face. Our models have feelings too, and every +1 keeps them warm at night.