
Perspectives on R in RAG


Retrieval-augmented generation (RAG) has led to a surge in the
number of developers interested in working on retrieval. In this
blog post, I share some perspectives on the R in RAG.

The case for hybrid search and ranking

Hybrid retrieval and ranking pipelines allow you to combine signals
from unsupervised methods (such as BM25) with supervised methods
(such as neural rankers). We have shown that combining unsupervised
and supervised techniques increases ranking accuracy compared to
using either method independently. The rise in popularity of hybrid
models can be attributed to the fact that most teams lack the tools,
data, time and resources to fine-tune text embedding models
specifically for their retrieval tasks. Extensive research and
experimentation have shown that hybrid ranking outperforms either
method used alone in a new setting or a new domain with slightly
different texts than what the model was trained on.
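
A minimal sketch of one common fusion technique, reciprocal rank
fusion (RRF), is shown below. The function name and the toy rankings
are illustrative assumptions, not code from any particular library:

# Hedged sketch: reciprocal rank fusion of a BM25 ranking and a
# dense (embedding-based) ranking. Document ids are made up.
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]    # e.g. from a BM25 index
dense_ranking = ["doc1", "doc7", "doc9"]   # e.g. from nearest-neighbor search

print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# doc1 and doc7 rise to the top because both methods agree on them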

What is often overlooked in this hybrid search discussion is the
ability to perform standard full-text search (FTS) functionality
like exact and phrase matching. Text embedding models are limited
by their fixed vocabulary, leading to poor search results for words
that are not in the vocabulary. This is particularly evident in
cases such as searching for a product identifier, a phone number, a
zip code, or a code snippet, where text embedding models with fixed
vocabularies fail. For example, BERT, one of the most popular
language models, does not have "2024" as a single token in its
default vocabulary.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

>>> tokenizer.tokenize("2024")
['202', '##4']

>>> tokenizer.encode("2024", add_special_tokens=False)
[16798, 2549]

>>> tokenizer.tokenize("90210")
['90', '##21', '##0']

>>> tokenizer.encode("90210", add_special_tokens=False)
[3938, 17465, 2692]

We highly recommend this video tutorial for understanding
tokenization for language models.

In real-world RAG applications, these search cases are essential.
However, the relevancy datasets used to evaluate retrieval and
ranking techniques often lack queries of these types. Consequently,
when comparing and evaluating retrieval methods on various
benchmarks, we end up covering only a limited set of search use
cases.

As more developers address retrieval challenges in the context of
RAG, it's important to remember that text embedding models alone
cannot handle simple table-stakes search functionality.
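
As a trivial sketch of what table stakes means here, exact and
phrase matching over identifiers can be expressed with plain string
containment, something a fixed-vocabulary embedding model cannot
guarantee. The documents and queries below are made up for
illustration:

# Hedged sketch: exact/phrase matching over raw text, independent
# of any embedding model. Documents and queries are toy examples.
docs = [
    "Order SKU-90210-AA shipped to zip code 90210.",
    "Release notes for version 2024.1 of the product.",
]

def phrase_match(query, documents):
    """Return indices of documents containing the query string verbatim."""
    return [i for i, doc in enumerate(documents) if query in doc]

print(phrase_match("90210", docs))   # [0]
print(phrase_match("2024.1", docs))  # [1]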

Different languages have unique characteristics that require specific
approaches to tokenization, stemming, and normalization. BM25 works
well in multilingual settings, but it requires attention to diverse
character sets and language-specific features.

Tokenization splits text into tokens such as words or subwords, and
it should take language-specific traits into account.

Normalization often includes converting text to a consistent case,
such as lowercase or uppercase, to eliminate case-sensitive
variations. A fun fact in this respect is that many multilingual
text embedding models are built on multilingual tokenizer
vocabularies that are case-sensitive. That means the vector
representation of “Can” is different from that of “can”.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
>>> tokenizer.tokenize("Can can")
['▁Can', '▁can']
>>> tokenizer.encode("Can can", add_special_tokens=False)
[4171, 831]

These text processing techniques influence the shape of the
precision-recall curve. In scenarios where high recall is crucial,
such as free-text search, it is generally undesirable for casing to
be a decisive factor. However, in other contexts, preserving case
may be necessary, especially when distinguishing named entities
from other text components.
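
As a minimal sketch (plain Python, not tied to any particular search
engine), lowercasing and Unicode normalization can be applied
consistently to both documents and queries before BM25-style
tokenization, trading case information for recall:

import unicodedata

def normalize(text):
    # Unicode normalization plus lowercasing; drops case as a signal.
    return unicodedata.normalize("NFKC", text).lower()

def tokenize(text):
    # Naive whitespace tokenizer; real engines use language-specific
    # tokenizers instead.
    return normalize(text).split()

print(tokenize("Can CAN can"))
# ['can', 'can', 'can'] — all three variants now match the same term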

Vespa, as a flexible text search platform, integrates linguistic
processing components (Apache OpenNLP, Apache Lucene) that provide
text processing capabilities for more than 40 languages. Plus, you
can roll your own custom linguistic implementation. In addition to
the linguistic text processing capabilities, Vespa offers a wide
range of matching capabilities like prefix, fuzzy, exact,
case-sensitive and n-gram matching, catering to a wide range of
full-text search use cases.
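
As an illustration, here is a small sketch of how a couple of these
matching modes can be expressed in Vespa's YQL query language,
issued through the pyvespa client. The endpoint and field names are
assumptions for the example, and the exact annotations should be
checked against the Vespa documentation:

# Hedged sketch: querying a Vespa application with different matching
# modes via pyvespa. URL and field names are placeholders.
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)  # assumed local instance

# Prefix matching (type-ahead style search).
prefix_query = 'select * from sources * where content contains ({prefix: true} "ran")'

# Fuzzy matching, tolerating small spelling differences.
fuzzy_query = 'select * from sources * where content contains ({maxEditDistance: 2} fuzzy("retreival"))'

for yql in (prefix_query, fuzzy_query):
    response = app.query(body={"yql": yql, "hits": 5})
    print(yql, "->", len(response.hits), "hits")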

To chunk or not to chunk

While advancements have introduced LLMs with longer context windows,
text embedding models still face limitations in handling long text
representations and are outperformed by simple BM25 baselines when
used with longer documents. In other words, to produce meaningful
text embedding representations for search, we must split longer
texts into manageable chunks that can be consumed by the text
embedding model.
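
A minimal sketch of such chunking, assuming a generic Hugging Face
tokenizer (the same one used above) and illustrative values for
chunk size and overlap:

# Hedged sketch: split a long text into overlapping, token-bounded
# chunks so each chunk fits the embedding model's input limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, max_tokens=128, overlap=32):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
    return chunks

chunks = chunk_text("A very long document about retrieval. " * 100)
print(len(chunks), "chunks, first chunk:", chunks[0][:60])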

To address this challenge, developers working with single-vector
databases like Pinecone have chunked documents into independently
retrievable units, or rows, in the database. This means the original
context, the surrounding chunks and other document-level metadata
are not preserved unless they are duplicated into the chunk-level
retrievable row.

Developers using Vespa, the most versatile vector database for RAG,
don't need to segment the original long document into smaller
retrievable units. Multi-vector indexing per document prevents
losing the original context and provides easy access to all the
chunks from the same document. As a result, developers can retrieve
entire documents, not individual chunks.

Another advantage of this representation is that it retains the
complete context of the document, including metadata. This allows
us to employ hybrid retrieval and ranking, which combines signals
from both the document level and the chunk level. This technique
can be used for candidate retrieval, where relevant documents are
identified based on the entire context. The chunk-level text
embedding representations can then be used to further refine or
re-rank the results. Additionally, in the final step of a RAG
pipeline, including adjacent chunks or even all the chunks of the
document becomes straightforward, provided that the generative
model supports a long context window.
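
To illustrate this two-stage pattern outside of any specific engine,
here is a small sketch in plain Python with NumPy: candidates are
first scored at the document level (a stand-in for BM25 or another
document-wide signal), then re-ranked by the best-matching chunk
embedding, while every document keeps all of its chunks available
for the generation step. The scores, vectors and weights are toy
values chosen for illustration:

# Hedged sketch: document-level retrieval followed by chunk-level
# re-ranking. Real pipelines would use a text embedding model and a
# retrieval engine such as Vespa instead of random vectors.
import numpy as np

documents = {
    "doc1": {"doc_score": 2.1, "chunks": ["chunk a", "chunk b"],
             "chunk_vectors": np.random.rand(2, 8)},
    "doc2": {"doc_score": 1.4, "chunks": ["chunk c", "chunk d", "chunk e"],
             "chunk_vectors": np.random.rand(3, 8)},
}
query_vector = np.random.rand(8)

def rerank(documents, query_vector, top_k=2):
    scored = []
    for doc_id, doc in documents.items():
        # Best-matching chunk: max dot product with the query vector.
        best_chunk = int(np.argmax(doc["chunk_vectors"] @ query_vector))
        chunk_score = float(doc["chunk_vectors"][best_chunk] @ query_vector)
        # Combine document-level and chunk-level signals (arbitrary weights).
        final_score = 0.5 * doc["doc_score"] + 0.5 * chunk_score
        scored.append((final_score, doc_id, best_chunk))
    return sorted(scored, reverse=True)[:top_k]

for score, doc_id, best_chunk in rerank(documents, query_vector):
    # All chunks of the document stay available for the generation step.
    print(doc_id, round(score, 3), documents[doc_id]["chunks"])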
