AI Models

Granite Embedding Multilingual R2 for RAG

IBM's Granite Embedding Multilingual R2 combines an Apache 2.0 license, multilingual retrieval across more than 200 languages, a 32K-token context window, and framework-friendly deployment in 97M and 311M ModernBERT models. Here is what is verified and what still needs your own evaluation.

May 16, 2026 · 6 min read · 1,186 words

IBM published Granite Embedding Multilingual R2 in the IBM Granite organization on Hugging Face on May 14, 2026. The release adds two Apache 2.0 multilingual embedding models built on ModernBERT, with stated retrieval scope across more than 200 languages and context lengths up to 32,768 tokens. For teams comparing open multilingual embedding options for retrieval-augmented generation, vector search, or code search, the announcement is worth a careful read.

This article summarizes what IBM actually claims, where the models plausibly fit, and which questions still need your own evaluation. Toolhalla has not run hands-on tests of either model; the benchmark numbers below come from IBM's announcement rather than independent Toolhalla measurement.

Sources:

  • IBM Granite announcement on Hugging Face: https://huggingface.co/blog/ibm-granite/granite-embedding-multilingual-r2
  • 311M model card: https://huggingface.co/ibm-granite/granite-embedding-311m-multilingual-r2
  • 97M model card: https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2

What IBM/Granite announced

According to the Hugging Face announcement, the release includes two models:

  • Granite Embedding 97M Multilingual R2 — a compact ModernBERT-based embedding model that outputs 384-dimensional vectors.
  • Granite Embedding 311M Multilingual R2 — a full-size ModernBERT-based embedding model that outputs 768-dimensional vectors and supports Matryoshka truncation to smaller dimensions.
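
Matryoshka truncation can be sketched in plain NumPy: keep the leading dimensions of a normalized embedding, then renormalize. The vectors below are random stand-ins for illustration; real embeddings would come from the 311M model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for 768-dimensional embeddings from the 311M model
# (random here; real vectors would come from the model).
full = rng.standard_normal((4, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

def matryoshka_truncate(vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the leading `dim` components and renormalize.

    Matryoshka-trained models concentrate the most useful signal in the
    leading dimensions, which is what makes truncation meaningful.
    """
    cut = vecs[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

small = matryoshka_truncate(full, 256)
print(small.shape)                        # (4, 256)
print(np.linalg.norm(small, axis=1))      # all ~1.0 after renormalizing
```

The truncated vectors stay unit-length, so cosine similarity still works against a smaller, cheaper index.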

Both models are published under the Apache 2.0 license. IBM lists support for more than 200 languages, with enhanced or tuned retrieval quality for 52 languages, context lengths up to 32,768 tokens, and code retrieval across 9 programming languages.

IBM also says both models ship with ONNX and OpenVINO weights for CPU-optimized inference, and that deployment options described in the announcement include Hugging Face Transformers, sentence-transformers, Text Embeddings Inference, vLLM (via its embed task), and conversion to GGUF for llama.cpp or Ollama.

Why this matters for Toolhalla readers

Most production retrieval stacks still rely on a small set of embedding models — some open, some hosted. Adding another open, permissively licensed multilingual option matters for three reasons:

1. Multilingual coverage as a default. Teams building search or retrieval-augmented generation workflows across languages often have to choose between a stronger English-only model and a weaker multilingual one. IBM positions Granite Embedding Multilingual R2 specifically at the multilingual case, with a long list of supported languages and a smaller tuned subset.

2. A long context window for chunking. A 32K-token context lets you embed longer passages without splitting them as aggressively. Whether that actually improves recall on your corpus is something you have to test, but the headroom is there.
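
That headroom is easy to quantify: the number of chunks a corpus needs falls roughly in proportion to chunk size. A back-of-the-envelope sketch with hypothetical document lengths:

```python
import math

# Hypothetical document lengths, in tokens.
doc_tokens = [1_200, 8_500, 30_000, 3_300]

def chunk_count(tokens: int, max_tokens: int, overlap: int = 0) -> int:
    """Number of chunks needed to cover a document of `tokens` tokens."""
    stride = max_tokens - overlap
    return max(1, math.ceil((tokens - overlap) / stride))

for size in (512, 8_192, 32_768):
    total = sum(chunk_count(t, size) for t in doc_tokens)
    print(f"max {size:>6} tokens -> {total} chunks")
```

At a 32,768-token window, every document above fits in a single chunk; at 512 tokens the same corpus fragments into dozens. Fewer chunks is not automatically better recall, which is why the test on your own corpus still matters.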

3. Open license and framework fit. Apache 2.0 weights with sentence-transformers and transformers support keep the migration cost low for stacks that already use LangChain, LlamaIndex, or Haystack, or that build directly on Hugging Face primitives.

97M vs 311M: which model should you try first?

IBM frames the two models as a deliberate size trade-off.

The 97M model outputs 384-dimensional embeddings. IBM claims it scores 60.3 on MTEB Multilingual Retrieval and describes it as the best open multilingual embedding model under 100M parameters in their comparison. That is an IBM-stated claim, not a Toolhalla benchmark. The smaller footprint and lower-dimensional output make it attractive when index size, CPU-side inference, or batch throughput on modest machines is the binding constraint.

The 311M model outputs 768-dimensional embeddings and adds Matryoshka truncation, which lets you cut to smaller dimensions when storage or latency budgets demand it. IBM claims a score of 65.2 on MTEB Multilingual Retrieval and ranks it #2 among open models under 500M parameters in the same comparison. Again, this is the source's claim, not independent verification.

A reasonable default is to try the 97M model first when the workload is mostly short documents or when index size matters, and to evaluate the 311M model when retrieval quality on harder multilingual queries is the priority. Either way, run the evaluation on your own corpus before committing.
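
The index-size side of that trade-off is simple arithmetic: vectors × dimensions × bytes per value. A sketch for a hypothetical 10-million-vector index stored as float32:

```python
def index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only; ignores index overhead (HNSW links, etc.)."""
    return n_vectors * dim * bytes_per_value

n = 10_000_000  # hypothetical corpus size
for label, dim in [("97M model", 384), ("311M model", 768), ("311M truncated", 256)]:
    gib = index_bytes(n, dim) / 2**30
    print(f"{label:>15} ({dim:>3}d): {gib:.1f} GiB")
```

At 10M vectors, the 768-dimensional index is roughly twice the size of the 384-dimensional one, and Matryoshka truncation to 256 dimensions would shrink it below either, at whatever quality cost your evaluation reveals.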

Deployment and integration notes

IBM's announcement describes a wide deployment surface. The models work with sentence-transformers and transformers, require no task-specific instruction prefix, and are positioned, in many cases, as drop-in replacements for embedding models in stacks built on LangChain, LlamaIndex, Haystack, or Milvus.

For CPU-side serving, ONNX and OpenVINO weights ship alongside the standard checkpoints. For higher-throughput serving, IBM lists Text Embeddings Inference and vLLM (via its embed task). Teams that prefer local runtimes can convert to GGUF for use through llama.cpp or Ollama.

A few practical cautions:

  • "Drop-in replacement" is a useful framing, but every embedding swap changes vector geometry. You will need to re-embed your corpus and re-evaluate retrieval before cutting over.
  • Framework support being listed is not the same as a one-line migration in every stack. Check the model card and the framework's embedding adapter for any version constraints.
  • Long context support is a model capability, not a guarantee that the surrounding tokenizer, chunker, or retriever in your stack handles 32K-token passages efficiently.
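
The last caution can be enforced with a cheap guard: estimate token counts before embedding and flag passages that blow past the budget. This sketch uses a crude whitespace-based estimate; a real pipeline would use the model's own tokenizer.

```python
def rough_token_count(text: str, tokens_per_word: float = 1.3) -> int:
    """Crude token estimate; swap in the model's tokenizer for real use."""
    return int(len(text.split()) * tokens_per_word)

def check_passages(passages: list[str], max_tokens: int = 32_768) -> list[int]:
    """Return indices of passages that likely exceed the context window."""
    return [i for i, p in enumerate(passages)
            if rough_token_count(p) > max_tokens]

passages = ["short passage", "word " * 40_000]  # second one is far too long
print(check_passages(passages))  # [1]
```

Passages the guard flags should be split (or summarized) before embedding rather than silently truncated by whatever component hits the limit first.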

Benchmark claims and caveats

The MTEB Multilingual Retrieval numbers — 60.3 for the 97M model and 65.2 for the 311M model — come from IBM's Hugging Face announcement. They are useful for filtering candidates, not for choosing a production model.

Before replacing an existing embedding model, evaluate at minimum:

  • retrieval quality on the languages and domains your users actually query;
  • behavior on edge cases like mixed-language passages, code snippets, and very short queries;
  • latency in the serving framework you plan to use;
  • index size at the chosen embedding dimension, including the Matryoshka truncation you actually intend to ship;
  • the operational cost and effort of re-embedding the existing corpus.
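
At minimum, the first item reduces to a side-by-side recall comparison on your own relevance judgments. A minimal sketch, where the query IDs, document IDs, judgments, and rankings are all hypothetical; real rankings would come from each candidate model's index.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k ranking."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical judgments: query id -> set of relevant doc ids.
judgments = {"q1": {"d3", "d7"}, "q2": {"d1"}}

# Hypothetical top-5 rankings from two candidate embedding models.
rankings = {
    "current":    {"q1": ["d3", "d9", "d2", "d5", "d4"],
                   "q2": ["d8", "d1", "d5", "d6", "d2"]},
    "granite-r2": {"q1": ["d7", "d3", "d2", "d9", "d4"],
                   "q2": ["d1", "d8", "d5", "d6", "d2"]},
}

mean_recall = {}
for model, ranks in rankings.items():
    mean_recall[model] = sum(
        recall_at_k(ranks[q], rel) for q, rel in judgments.items()
    ) / len(judgments)
    print(f"{model}: mean recall@5 = {mean_recall[model]:.2f}")
```

Two queries prove nothing, of course; the point is the shape of the harness — the same judgments scored against every candidate, on traffic that looks like yours.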

None of these are answerable from a leaderboard score alone.

Directory update recommendation

For Toolhalla's AI Model Directory, Granite Embedding Multilingual R2 is an add-and-watch candidate. The relevant tagging is provider IBM, model family Granite, type embedding/retrieval, access open weights on Hugging Face, license Apache 2.0. Confidence is high for the release itself and the licensing claim; confidence on benchmark superiority should stay medium until the models are independently tested.

For readers comparing inference providers and where embedding workloads might live, see our overview of Hugging Face, Replicate, and Together AI.

FAQ

Is Granite Embedding Multilingual R2 open source?

IBM's announcement and the model cards state both the 97M and 311M models are released under the Apache 2.0 license, with weights available on Hugging Face.

What is the difference between the 97M and 311M models?

The 97M model is a compact ModernBERT-based embedding model that outputs 384-dimensional vectors. The 311M model is a full-size ModernBERT-based embedding model that outputs 768-dimensional vectors and supports Matryoshka truncation to smaller dimensions. IBM claims stronger MTEB Multilingual Retrieval scores for the 311M model.

Does Granite Embedding Multilingual R2 generate text?

No. This is an embedding and retrieval model family, not a generative language model. It produces vector representations of text for use in retrieval, search, and similar workflows.

Can I use it with LangChain or LlamaIndex?

IBM positions the models as drop-in replacements in many embedding integrations, including LangChain, LlamaIndex, Haystack, and Milvus, via sentence-transformers and transformers. Confirm version compatibility in your specific stack and re-evaluate retrieval after switching.

Should I replace my current embedding model?

Not on the basis of the announcement alone. Re-embedding a corpus is an operational cost, and benchmark leaderboard positions do not always translate to your own queries. Run a side-by-side evaluation on representative traffic before deciding.


Tags: IBM Granite · embeddings · multilingual AI · RAG · open models · Hugging Face