Modern AI applications often rely on large language models to generate answers, but
these models are only as reliable as the information they can access. Retrieval-augmented
generation, or RAG, is a widely used way to improve reliability by pulling in
relevant documents at runtime and conditioning the model on those documents
before it produces an answer. In practice, however, the effectiveness of this
approach is tightly coupled to how well the retrieval step works. If the system
retrieves irrelevant or incomplete documents, even a strong model can produce
weak or incorrect outputs.

This limitation becomes especially visible when dealing with multi-step or
“multi-hop” questions. These are questions where the answer depends on
combining facts from multiple sources rather than finding a single sentence in
a single document. A simple RAG system treats the input question as one query,
embeds it, and retrieves the top matching documents. That works well when all
relevant information happens to live together, but it breaks down when the
facts are scattered. In those cases, the retriever might return broad summaries
or partially relevant material instead of the precise pieces of evidence
required to construct the answer.
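
As a point of reference, the single-query baseline described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the bag-of-words `embed` function stands in for a real dense encoder, and the corpus, company names, and figures are made up for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; an assumption standing in for a dense embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Single-query retrieval: embed the question once, return the k closest documents.
    q_vec = embed(query)
    return sorted(corpus, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:k]

# Hypothetical corpus used only for illustration.
CORPUS = [
    "Acme reported a profit of 10 million in 2023",
    "Globex reported a profit of 12 million in 2023",
    "Initech opened a new office in Austin",
]

top = retrieve("Acme profit", CORPUS, k=1)
```

Everything the generator later sees depends on that one ranked list, which is why a question whose facts sit in several documents is sensitive to how the single query happens to be phrased.
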

A paper on Question decomposition for RAG [1] treats complex questions not as a
single retrieval problem, but as a collection of smaller, focused retrieval
problems. Instead of querying the system once, the approach uses a language
model to decompose the original question into several simpler sub-questions.
Each sub-question targets a specific piece of the information needed. For
example, instead of asking which company had the highest profit among a set of
companies, the system asks separate questions about each company’s profit,
which makes it much easier to retrieve exact, relevant data points.
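
A rough sketch of the decomposition step might look like the following. The prompt wording, the `decompose` helper, and the `llm` callable are assumptions made for illustration and are not taken from the paper; `fake_llm` is a canned stand-in so the example runs without a model.

```python
import re
from typing import Callable

# Hypothetical prompt; the wording actually used in [1] may differ.
PROMPT = (
    "Break the question below into simpler sub-questions, one per line, "
    "each answerable from a single document.\n\nQuestion: {question}"
)

# Matches leading list markers such as "1. ", "2) ", "- " or "* ".
_BULLET = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")

def decompose(question: str, llm: Callable[[str], str], max_subs: int = 5) -> list[str]:
    # Ask the model for sub-questions and parse one per non-empty line.
    raw = llm(PROMPT.format(question=question))
    subs: list[str] = []
    for line in raw.splitlines():
        cleaned = _BULLET.sub("", line).strip()
        if cleaned and cleaned not in subs:
            subs.append(cleaned)
    # Fall back to the original question if the model returned nothing usable.
    return subs[:max_subs] or [question]

def fake_llm(prompt: str) -> str:
    # Canned response standing in for a real LLM call.
    return "1. What was Acme's profit?\n2. What was Globex's profit?\n"

subs = decompose("Which had the higher profit, Acme or Globex?", fake_llm)
```

The fallback to the original question is a defensive choice in this sketch, so that a malformed model response degrades to ordinary single-query retrieval instead of an empty search.
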

This decomposition step significantly increases the chances that the system will
find all the necessary evidence, because different documents often cover
different aspects of a problem. However, it also introduces a new challenge:
retrieving documents for multiple sub-questions produces a much larger pool of
candidate passages, many of which are only loosely related or even irrelevant
to the original query. The system therefore needs a way to filter and
prioritize these results so that only the most useful pieces of evidence are
passed to the language model.
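
To make the growth of the candidate pool concrete, here is a small sketch of per-sub-question retrieval with merging. The `(doc_id, text)` passage shape, the `build_candidate_pool` helper, and the fake index are all assumptions for illustration; deduplicating by document id is one reasonable choice, not something the source specifies.

```python
from typing import Callable, Iterable

Passage = tuple[str, str]  # (doc_id, text); an assumed shape for this sketch

def build_candidate_pool(
    sub_questions: Iterable[str],
    retrieve: Callable[[str, int], list[Passage]],
    k_per_query: int = 3,
) -> dict[str, str]:
    # Retrieve independently for each sub-question and merge into one pool,
    # keeping the first copy of any document returned more than once.
    pool: dict[str, str] = {}
    for sq in sub_questions:
        for doc_id, text in retrieve(sq, k_per_query):
            pool.setdefault(doc_id, text)
    return pool

# Hypothetical retrieval results, keyed by sub-question.
FAKE_INDEX = {
    "What was Acme's profit?": [
        ("d1", "Acme profit was 10 million"),
        ("d3", "Industry profits rose in 2023"),
    ],
    "What was Globex's profit?": [
        ("d2", "Globex profit was 12 million"),
        ("d3", "Industry profits rose in 2023"),
    ],
}

def fake_retrieve(query: str, k: int) -> list[Passage]:
    return FAKE_INDEX.get(query, [])[:k]

pool = build_candidate_pool(FAKE_INDEX.keys(), fake_retrieve, k_per_query=2)
```

Even in this tiny case the pool holds three passages, one of which ("d3") is only loosely related to the original question, which is the noise the next stage has to deal with.
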

To solve this, the approach adds a reranking stage after retrieval. The reranker
is a more precise but more computationally expensive model that scores each
candidate document based on how relevant it is to the original, undecomposed
question. Unlike the initial retrieval step, which relies on vector similarity,
the reranker jointly processes the query and the document, allowing it to
capture finer-grained relationships between them. The system then selects the
top-ranked documents and discards the rest.
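
The reranking stage can be sketched as a scoring pass over query-document pairs. In the code below, `overlap_score` is a deliberately crude stand-in for a pretrained cross-encoder, and the function names and passages are hypothetical; the point is only the shape of the step: every candidate is scored against the original question, and only the top few survive.

```python
from typing import Callable

def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 2,
) -> list[str]:
    # Score each (original question, passage) pair and keep the best top_k.
    # sorted() is stable, so ties keep their original retrieval order.
    ranked = sorted(candidates, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, doc: str) -> float:
    # Toy relevance score: fraction of query tokens that appear in the passage.
    # A real system would call a cross-encoder on the pair here.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0

# Hypothetical candidate pool.
candidates = [
    "Acme profit was 10 million",
    "Globex profit was 12 million",
    "Initech opened an office in Austin",
    "Acme was founded in 1990",
]

kept = rerank("which had the higher profit acme or globex", candidates, overlap_score)
```

Because the scorer is passed in as a function, swapping the toy overlap measure for a real cross-encoder would not change the surrounding logic.
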

The overall pipeline can be thought of as a three-step process. First, the system
expands the query into a set of sub-queries using a language model. Second, it
retrieves documents independently for each sub-query, merging all results into
a single candidate pool. Third, it applies reranking to filter that pool and
extract the most relevant passages. These final passages are then concatenated
with the original query and passed into the language model for answer
generation.
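
Put together, the three steps plus generation could be orchestrated roughly as follows. This is a sketch under stated assumptions: the four callables are injected placeholders for whatever LLM, retriever, and reranker a system already has, and the prompt layout (evidence first, then the question) is one plausible way to concatenate passages with the query, not necessarily the layout used in [1].

```python
from typing import Callable

def answer_with_decomposition(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    # Step 1: expand the question into sub-queries.
    sub_queries = decompose(question)
    # Step 2: retrieve per sub-query and merge into one deduplicated pool.
    pool: list[str] = []
    for sq in sub_queries:
        for passage in retrieve(sq):
            if passage not in pool:
                pool.append(passage)
    # Step 3: rerank against the ORIGINAL question and keep the top passages.
    evidence = rerank(question, pool)[:top_k]
    # Generation: concatenate evidence with the original question.
    prompt = "\n\n".join(evidence) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

# Stub components so the sketch runs end to end; all are placeholders.
def demo_decompose(q: str) -> list[str]:
    return ["What was Acme's profit?", "What was Globex's profit?"]

def demo_retrieve(sq: str) -> list[str]:
    company = "Acme" if "Acme" in sq else "Globex"
    return [f"{company} profit figure (placeholder passage)", "Unrelated shared passage"]

def demo_rerank(q: str, pool: list[str]) -> list[str]:
    return sorted(pool, key=lambda p: "profit" in p, reverse=True)

# The stub generator simply echoes the prompt so it can be inspected.
result = answer_with_decomposition(
    "Which had the higher profit, Acme or Globex?",
    demo_decompose, demo_retrieve, demo_rerank, generate=lambda p: p, top_k=2,
)
```
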

One of the key advantages of this approach is that it does not require training new
models or building specialized indexes. It relies entirely on off-the-shelf
components: a general-purpose LLM for decomposition, a standard dense retriever
for initial search, and a pretrained cross-encoder for reranking. This makes it
easy to plug into existing RAG systems with minimal engineering effort.

Empirical results show that this combination of decomposition and reranking provides
meaningful improvements. On multi-hop benchmarks, the system retrieves more
relevant evidence and produces more accurate answers than standard RAG.
The gains come from a clear division of responsibilities: decomposition
improves coverage by ensuring that different aspects of the problem are
retrieved, while reranking restores precision by filtering noise from the
expanded result set.

There are, however, trade-offs that matter for real-world systems. The largest cost
comes from generating sub-questions with a language model, which adds
noticeable latency. Reranking also increases computational load because it
evaluates each query–document pair individually. While techniques such as
caching can amortize some of this overhead, the approach is still slower than a
naive single-query pipeline.
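
One simple form of caching is to memoize the decomposition call so that repeated questions skip the LLM round trip. The sketch below uses Python's standard `functools.lru_cache`; `slow_decompose` is a hypothetical stand-in for the model call, and the call counter exists only to show the effect.

```python
from functools import lru_cache

calls = {"n": 0}

def slow_decompose(question: str) -> tuple[str, ...]:
    # Placeholder for an expensive LLM call; counts how often it actually runs.
    calls["n"] += 1
    return (f"sub-question about: {question}",)

@lru_cache(maxsize=1024)
def cached_decompose(question: str) -> tuple[str, ...]:
    # Returns a tuple so the cached value cannot be mutated by callers.
    return slow_decompose(question)

first = cached_decompose("Which had the higher profit, Acme or Globex?")
second = cached_decompose("Which had the higher profit, Acme or Globex?")
```

This only helps when identical question strings recur, so it reduces average latency across a workload without changing the cost of a question seen for the first time.
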

Another important limitation is that decomposition is not always beneficial. When a
query is already specific and well-formed, breaking it into sub-questions can
actually introduce noise and reduce performance. The quality of the
decomposition also depends heavily on the language model and the prompt used to
guide it. In addition, the system operates in a single pass, meaning it does
not iteratively refine queries based on retrieved evidence, which could limit
its ability to handle extremely complex reasoning chains.

For engineers building AI applications, the takeaway is straightforward. If your
system struggles with questions that require combining information from
multiple sources, simply improving embeddings or increasing the number of
retrieved documents may not be enough. Instead, treating retrieval as a
structured process—where you explicitly break down the problem and then
carefully filter the results—can yield significant improvements without
changing your underlying models. The combination of query decomposition and
reranking offers a practical, modular way to do this while staying compatible
with existing RAG architectures.

References:
1. Question decomposition for RAG. arXiv:2507.00355v1. https://arxiv.org/pdf/2507.00355