Cluster computing

Modern AI applications often rely on large language models to generate answers, but these models are only as reliable as the information they can access. Retrieval‑augmented generation, or RAG, is a widely used way to improve reliability by pulling in relevant documents at runtime and conditioning the model on those documents before it produces an answer. In practice, however, the effectiveness of this approach is tightly coupled to how well the retrieval step works. If the system retrieves irrelevant or incomplete documents, even a strong model can produce weak or incorrect outputs.

This limitation becomes especially visible when dealing with multi-step or “multi-hop” questions. These are questions where the answer depends on combining facts from multiple sources rather than finding a single sentence in a single document. A simple RAG system treats the input question as one query, embeds it, and retrieves the top matching documents. That works well when all relevant information happens to live together, but it breaks down when the facts are scattered. In those cases, the retriever might return broad summaries or partially relevant material instead of the precise pieces of evidence required to construct the answer.
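The failure mode described above can be reproduced in a small sketch. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real dense encoder, and the three-document corpus is made up. The single-query retriever ends up preferring the broad overview over the documents that hold the actual figures.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Single-query retrieval: embed the query once, return the top-k documents."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

# Hypothetical corpus: two documents with the facts, one broad summary.
corpus = [
    "acme reported a profit of 12 million in 2023",
    "globex reported a profit of 30 million in 2023",
    "an overview of company profit trends across the industry",
]

top = retrieve("which company had the highest profit", corpus, k=1)
print(top)  # the broad overview wins, not the documents holding the numbers
```

A real embedding model behaves differently in detail, but the sketch shows why a single query phrased at the level of the whole question tends to match summary-like text.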

A paper on question decomposition for RAG [1] treats a complex question not as a single retrieval problem but as a collection of smaller, focused retrieval problems. Instead of querying the system once, the approach uses a language model to decompose the original question into several simpler sub-questions, each targeting a specific piece of the information needed. For example, instead of asking which company had the highest profit among a set of companies, the system asks a separate question about each company’s profit, which makes it much easier to retrieve exact, relevant data points.
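A minimal sketch of the decomposition step follows. The prompt wording, the `decompose` helper, and the `fake_llm` stub are all hypothetical and are not taken from the paper; in a real system `llm` would be a call to whatever general-purpose model you already use.

```python
import re
from typing import Callable

# Hypothetical prompt; the paper's actual prompt may differ.
PROMPT = (
    "Break the question into simpler sub-questions, one per line.\n"
    "Question: {question}\n"
    "Sub-questions:"
)

def decompose(question: str, llm: Callable[[str], str], max_subq: int = 5) -> list[str]:
    """Ask the model for sub-questions and parse one per line."""
    raw = llm(PROMPT.format(question=question))
    subqs: list[str] = []
    for line in raw.splitlines():
        # Strip list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned not in subqs:
            subqs.append(cleaned)
    return subqs[:max_subq]

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call, so the sketch runs offline."""
    return "1. What was Acme's profit?\n2. What was Globex's profit?\n"

subqs = decompose("Which company had the highest profit, Acme or Globex?", fake_llm)
print(subqs)
```

Capping the number of sub-questions with `max_subq` is a design choice made here to bound retrieval cost, not something the source specifies.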

This decomposition step significantly increases the chances that the system will find all the necessary evidence, because different documents often cover different aspects of a problem. However, it also introduces a new challenge: retrieving documents for multiple sub-questions produces a much larger pool of candidate passages, many of which are only loosely related or even irrelevant to the original query. The system therefore needs a way to filter and prioritize these results so that only the most useful pieces of evidence are passed to the language model.

To solve this, the approach adds a reranking stage after retrieval. The reranker is a more precise but more computationally expensive model that scores each candidate document based on how relevant it is to the original, undecomposed question. Unlike the initial retrieval step, which relies on vector similarity, the reranker jointly processes the query and the document, allowing it to capture finer-grained relationships between them. The system then selects the top-ranked documents and discards the rest.
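The reranking stage can be sketched as below. `toy_pair_score` is a made-up stand-in for a pretrained cross-encoder: it looks at the query and the document together, but only counts shared words. The candidate passages are hypothetical. What matters is the shape of the step: score every candidate against the original question, keep the top few, drop the rest.

```python
from typing import Callable

def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 2,
) -> list[str]:
    """Score each unique candidate against the original question and keep the best."""
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep first-seen order
    ranked = sorted(unique, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:top_k]

# Hypothetical candidate pool, including a duplicate and an off-topic passage.
candidates = [
    "acme profit in 2023 was 12 million",
    "globex profit in 2023 was 30 million",
    "acme was founded in 1990",
    "acme profit in 2023 was 12 million",
]

kept = rerank("which company had the highest profit in 2023", candidates, toy_pair_score)
print(kept)
```

Because the scorer runs once per query-document pair, its cost grows with the size of the candidate pool, which is why it is applied after retrieval rather than over the whole corpus.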

The overall pipeline can be thought of as a three-step process. First, the system expands the query into a set of sub-queries using a language model. Second, it retrieves documents independently for each sub-query, merging all results into a single candidate pool. Third, it applies reranking to filter that pool and extract the most relevant passages. These final passages are then concatenated with the original query and passed into the language model for answer generation.
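The three steps can be wired together roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: `toy_decompose`, `toy_retrieve`, and `overlap` are hypothetical stubs for the LLM, the dense retriever, and the cross-encoder, and the prompt layout at the end is invented for illustration.

```python
from typing import Callable

def build_prompt(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    k_per_query: int = 3,
    top_k: int = 2,
) -> str:
    # Step 1: expand the question into sub-queries.
    sub_queries = decompose(question)
    # Step 2: retrieve per sub-query and merge into one de-duplicated pool.
    pool: list[str] = []
    for sub_query in sub_queries:
        for doc in retrieve(sub_query, k_per_query):
            if doc not in pool:
                pool.append(doc)
    # Step 3: rerank the pool against the ORIGINAL question and keep the best.
    best = sorted(pool, key=lambda d: score(question, d), reverse=True)[:top_k]
    context = "\n".join(f"- {doc}" for doc in best)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# --- Hypothetical stand-ins so the sketch runs without any model ---
CORPUS = [
    "acme profit in 2023 was 12 million",
    "globex profit in 2023 was 30 million",
    "acme was founded in 1990",
    "globex has offices in nine countries",
]

def overlap(a: str, b: str) -> float:
    """Shared-word count, used as both toy retriever score and toy reranker."""
    return float(len(set(a.lower().split()) & set(b.lower().split())))

def toy_decompose(question: str) -> list[str]:
    return ["acme profit in 2023", "globex profit in 2023"]

def toy_retrieve(query: str, k: int) -> list[str]:
    return sorted(CORPUS, key=lambda d: overlap(query, d), reverse=True)[:k]

prompt = build_prompt(
    "did acme or globex have the higher profit in 2023",
    toy_decompose, toy_retrieve, overlap,
)
print(prompt)
```

Passing the three components in as callables mirrors the modular framing of the approach: each one can be swapped for a real model without touching the pipeline logic.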

One of the key advantages of this approach is that it does not require training new models or building specialized indexes. It relies entirely on off-the-shelf components: a general-purpose LLM for decomposition, a standard dense retriever for initial search, and a pretrained cross-encoder for reranking. This makes it easy to plug into existing RAG systems with minimal engineering effort.

Empirical results show that this combination of decomposition and reranking provides meaningful improvements. On multi-hop benchmarks, the system retrieves more relevant evidence and produces more accurate answers compared to standard RAG. The gains come from a clear division of responsibilities: decomposition improves coverage by ensuring that different aspects of the problem are retrieved, while reranking restores precision by filtering noise from the expanded result set.

There are, however, trade-offs that matter for real-world systems. The largest cost comes from generating sub-questions with a language model, which adds noticeable latency. Reranking also increases computational load because it evaluates each query–document pair individually. While techniques such as caching can amortize some of this overhead, the approach is still slower than a naive single-query pipeline.
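One way caching might amortize the decomposition cost is to memoize sub-questions per input question, so repeated questions skip the model call. The sketch below uses Python's standard `functools.lru_cache`; `slow_decompose` is a hypothetical stub that counts calls instead of invoking a model, and its "split on and" rule is only there to make the example run.

```python
from functools import lru_cache

calls = {"llm": 0}  # counts how often the expensive step actually runs

def slow_decompose(question: str) -> tuple[str, ...]:
    """Stub for an LLM call; splits on ' and ' purely for illustration."""
    calls["llm"] += 1
    parts = question.rstrip("?").split(" and ")
    return tuple(f"{part.strip()}?" for part in parts)

@lru_cache(maxsize=1024)
def cached_decompose(question: str) -> tuple[str, ...]:
    """Memoized wrapper: identical questions reuse the earlier result."""
    return slow_decompose(question)

question = "what was acme profit and what was globex profit?"
first = cached_decompose(question)
second = cached_decompose(question)  # served from the cache
print(first, calls["llm"])
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot accidentally modify what later callers receive. This only helps when the same questions recur; it does nothing for first-time queries.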

Another important limitation is that decomposition is not always beneficial. When a query is already specific and well-formed, breaking it into sub-questions can actually introduce noise and reduce performance. The quality of the decomposition also depends heavily on the language model and the prompt used to guide it. In addition, the system operates in a single pass, meaning it does not iteratively refine queries based on retrieved evidence, which could limit its ability to handle extremely complex reasoning chains.

For engineers building AI applications, the takeaway is straightforward. If your system struggles with questions that require combining information from multiple sources, simply improving embeddings or increasing the number of retrieved documents may not be enough. Instead, treating retrieval as a structured process, where you explicitly break down the problem and then carefully filter the results, can yield significant improvements without changing your underlying models. The combination of query decomposition and reranking offers a practical, modular way to do this while staying compatible with existing RAG architectures.

References:

1. Question decomposition for RAG, arXiv:2507.00355v1: https://arxiv.org/pdf/2507.00355


Monday, June 8, 2026
