In the realm of large language models (LLMs), new techniques are constantly emerging to make the most of them. In recent weeks, a lot of buzz has surrounded a new competitor to the well-known and widely used RAG technique (which we explained in detail a few months ago in “Retrieval Augmented Generation and its corporate usage”): the CAG technique. CAG has gained popularity since December 20, 2024, when the paper “Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks” was published.

As part of Paradigma Digital’s commitment to staying at the forefront of the latest breakthroughs in Generative AI, we have researched this new addition to the field and will share our insights here. First, we will briefly revisit what RAG is, followed by a detailed explanation of CAG—covering its pros and cons, its use and implications in the business domain, and how to decide whether to continue using RAG or make the switch to CAG.

Retrieval-Augmented Generation (RAG)

In the world of generative AI, RAG is a well-established technique. There are countless posts, courses, and tutorials from both the community and the major companies behind the technology that explain how it works. By way of reminder, RAG combines the power of large language models (LLMs) with the ability to access external data in real time. Its goal is to solve the problem that classic LLMs can only work with the information available at the time of their training, which limits their ability to stay up to date and remain accurate. In corporate environments, this limitation becomes more evident, as information tends to be scattered across multiple repositories such as Confluence, Jira, SharePoint, or Google Drive.

RAG addresses this shortcoming by linking the model to up-to-date knowledge repositories, enabling precise, context-aware responses. To achieve this, an ingestion pipeline is created in which documents are divided into chunks, vectorized, and stored in a vector database. At query time, the most relevant chunks are retrieved and passed to the model to generate the answer. It is vital to re-run this vectorization process regularly to reflect changes in corporate information and ensure that the system always responds with up-to-date data.
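
To make the two stages more concrete, here is a minimal sketch of that pipeline in Python. It assumes the sentence-transformers package for embeddings and keeps the “vector database” as a simple in-memory matrix; the ask_llm() call at the end is a hypothetical placeholder for whatever LLM client you already use.

```python
# Minimal RAG sketch: chunk, embed, store, retrieve, and build a grounded prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on document structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Ingestion: vectorize the corpus once and re-run it whenever documents change.
documents = ["...text exported from Confluence, SharePoint, Google Drive..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = embedder.encode(chunks, normalize_embeddings=True)  # tiny in-memory "vector database"

def retrieve(query: str, k: int = 3) -> list[str]:
    # Query flow: embed the question and return the k most similar chunks.
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since the vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return ask_llm(prompt)  # hypothetical call to whichever LLM client you use
```

In a production setup the in-memory matrix would be replaced by a managed vector store, but the shape of the flow stays the same: ingest and embed offline, retrieve and generate per query.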

Retrieval-augmented generation offers multiple advantages for those who need to leverage external information during response generation. One of its greatest strengths is its ability to access updated data in real time, enabling responses based on the most recent and reliable information. This approach is also highly flexible and can be applied to a wide range of scenarios requiring specific or rapidly evolving knowledge, especially when dealing with very large databases or constantly changing domains.

Despite its advantages, the RAG technique also has certain limitations worth noting. On the one hand, the real-time search and retrieval process can cause significant delays, particularly when working with large corpora. On the other hand, integrating different components (such as search engines, indexes, and embedding pipelines) with the model’s generative component increases system complexity and may therefore require more infrastructure and maintenance. Additionally, there is always a risk of retrieval errors if documents are selected or prioritized incorrectly, which can impact the accuracy of the answers.

Cache-Augmented Generation (CAG)

Cache-Augmented Generation (CAG) is conceived as an evolution of the RAG architecture that dispenses with real-time retrieval and takes advantage of the extended context capabilities of modern language models. Instead of fetching documents for each query, the CAG approach preloads all the necessary information into the model’s context and precomputes a key-value (KV) cache that encapsulates the resulting inference state. Thanks to this preliminary step, the model can respond immediately, without depending on the external retrieval systems that typically introduce delays and complexity into RAG implementations.
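
As an illustration of what “precomputing the KV cache” looks like in practice, here is a minimal sketch using the Hugging Face transformers library. The model name and generation settings are illustrative assumptions rather than the paper’s exact setup, and it assumes a recent transformers version in which generate() accepts a precomputed past_key_values.

```python
# Minimal CAG sketch: encode the whole knowledge base once, keep the resulting
# key-value cache, and reuse it for every query so no retrieval step is needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any long-context causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Preloading step: run the full corpus through the model once and keep the
# KV cache, i.e. the "inference state" described above.
knowledge = "<full text of the manuals or reports to be cached>"
kb_ids = tokenizer(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    kb_cache = model(kb_ids, use_cache=True).past_key_values

def answer(question: str, max_new_tokens: int = 200) -> str:
    # Query step: only the question tokens need to be processed; the cached
    # keys/values stand in for the preloaded documents.
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = torch.cat([kb_ids, q_ids], dim=-1)
    out = model.generate(full_ids, past_key_values=kb_cache,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)

# Note: generation appends to the cache, so in practice you would copy or trim
# it back to the knowledge-base length before serving the next query.
```

The key design point is that the expensive pass over the documents happens once, offline; each query then only pays for its own tokens, which is where the latency savings described in the paper come from.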

According to the original study on CAG, this approach eliminates the latency arising from dynamic retrieval and minimizes document selection errors. In evaluation settings—where researchers use SQuAD and HotPotQA—the model achieves results that are competitive with or superior to RAG, especially when the knowledge base is limited and fits within the context window. Besides saving time during the query stage, CAG maintains a unified perspective on the content, which favors consistency in responses and boosts multi-hop reasoning without splitting knowledge across multiple retrieval sources. A typical use case would be a static technical manual: all the content is “packaged” into the cache, and during inference, the model obtains answers directly from this precomputed data, avoiding search or ranking processes that could introduce incorrect information.

Although Cache-Augmented Generation is presented as a revolutionary milestone in AI, it is crucial to examine in detail which problems it truly solves and what main challenges companies face when trying to deploy these technologies in a production environment. One key point relates to information updates. While RAG is more effective in domains where data changes constantly—such as news, recent publications, or integrations with third-party APIs—CAG is better suited to scenarios where information remains relatively stable or is updated less frequently, for example, technical manuals, internal reports, or fixed legal repositories.

Another critical aspect is performance and latency in generating responses. Because RAG requires real-time searches, it can introduce notable delays, especially in large repositories. CAG, on the other hand, preloads all the information and removes the dynamic retrieval stage; according to the original study, this makes responses up to 40 times faster in some experiments. This difference is also evident in system complexity. RAG requires maintaining a retrieval pipeline, which may include search engines, embeddings, and indexing methodologies, thus increasing configuration and maintenance effort. CAG, by contrast, simplifies the architecture by relying solely on preloaded information, although this demands strict control over the data included in the cache and the processes used to refresh that information.

Data governance and quality represent another major challenge. Many organizations discover that beyond the technical advantages, their real problems lie in conceptualization and a lack of observability in their systems. CAG may mask the complexity associated with information retrieval, but at the same time, it raises new questions about the expiration of cached data. In fact, some professionals point to data quality and meticulous labeling as decisive factors for the success of Generative AI—above even the efficiency of retrieval mechanisms.

Regarding scale and knowledge-base size, RAG is generally more suitable for contexts involving massive volumes of documents, perhaps millions of articles, where loading all the information into the model’s context window would be unfeasible. And it is not only a matter of fitting the knowledge base into the context: with CAG, every query effectively carries the entire content rather than only the most relevant chunks, which significantly increases the number of tokens processed. In contrast, CAG stands out in applications where the document corpus is finite and manageable enough to be preloaded without exhausting the LLM’s capacity. This distinction again underscores that the decision between RAG and CAG hinges more on business requirements and the nature of the data than on purely technological considerations.
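
A quick back-of-the-envelope check can help with this decision: count the tokens in the candidate knowledge base and compare them against the model’s context window. The sketch below uses tiktoken as an approximate tokenizer; the window size and the margin reserved for the question and answer are illustrative assumptions.

```python
# Rough feasibility check: does the whole corpus fit in the context window?
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximate, model-agnostic tokenizer

corpus = ["...document 1...", "...document 2..."]   # candidate knowledge base
corpus_tokens = sum(len(enc.encode(doc)) for doc in corpus)

context_window = 128_000        # e.g. a long-context model
dialogue_margin = 8_000         # room reserved for the question and the answer

if corpus_tokens <= context_window - dialogue_margin:
    print(f"{corpus_tokens} tokens: small enough to preload, CAG is an option")
else:
    print(f"{corpus_tokens} tokens: too large to preload, RAG-style retrieval is safer")
```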

Despite these drawbacks, the research on CAG suggests that, in scenarios with manageable and relatively static knowledge bases, removing the retrieval step yields a system that is faster, simpler to maintain, and less prone to errors than RAG. The findings highlight the usefulness of this approach as a robust alternative wherever the information can be collected and preloaded without causing a constant drain on resources or lag in data updates. This advance demonstrates the potential of LLMs with long context windows to deliver coherent, accurate responses without the complexity of traditional retrieval strategies, making them an ideal solution for use cases that prioritize immediacy and simplicity in the inference flow.

Which one to choose?

The choice between RAG and CAG largely depends on the specific needs of the application and the problems you want to solve. If the priority is to have real-time information with constantly updated data, RAG becomes the most suitable alternative, since CAG could quickly become outdated. However, when the main requirement is to maximize speed and reduce dependencies on external systems, CAG is highly recommended, especially in static domains where all relevant information can be preloaded without major issues.

Organizational infrastructure and the team’s expertise in managing retrieval systems also play a decisive role. If advanced technical resources are available and the team is familiar with implementing search engines and indexing pipelines, RAG offers greater flexibility and access to a potentially broader volume of data. Nevertheless, if information governance and project conceptualization represent the main challenge, neither RAG nor CAG is a definitive solution. Before undertaking any complex implementation, it is wise to define a clear strategy for managing, versioning, and ensuring data quality, since that is where the real obstacles to successful AI adoption in business environments are often found.

Business use cases

Having explored the situations in which RAG or CAG might be recommended, here are three examples where CAG could be used instead of RAG, provided that the previously mentioned drawbacks do not come into play:

Use case 1: Technical or internal process manuals with infrequent changes

In companies that handle a considerable volume of technical documentation—such as machinery operation manuals or internal procedure guides—but whose contents do not change very often, CAG can be highly advantageous compared to RAG. Once these manuals are preloaded into the model’s cache, the AI can provide answers almost instantly and with no risk of pulling in irrelevant information. With RAG, by contrast, a search would be performed for each query, adding an undesirable layer of complexity and latency when, in reality, the base information rarely changes. In this way, the organization gets an efficient, fast support system that maintains consistency in its answers, since they all come from the same “frozen” and preloaded source.

Use case 2: Training or e-learning assistants with stable content

Human Resources or Training departments can leverage CAG in their corporate learning platforms, as long as the curriculum does not undergo continuous modifications. Suppose the company has instructional materials or quizzes that are only updated on specific dates, such as at the start of each quarter. In that case, preloading content into the model’s cache allows the AI system to respond more quickly, without needing to query external databases. This results in a much smoother user experience, avoiding the delays inherent in real-time searches and simplifying the infrastructure—since there is no need to maintain a constant retrieval pipeline for the training materials.

Use case 3: Help systems in applications with critical and limited use

There are business applications in which response speed is crucial and the underlying knowledge base is relatively small and stable. Examples include assistance tools for legal departments that review contracts with recurring, minimally changing clauses, or support systems for internal software that is infrequently updated. In these scenarios, CAG could greatly streamline access to documentation while reducing the likelihood of retrieval errors. The system immediately accesses the preloaded information, without relying on indices or searches that could lengthen the process. This characteristic is particularly valuable when the immediacy of response generation makes a significant difference in internal or external client satisfaction, and no substantial changes are anticipated in the underlying documentation.

Conclusion

Both RAG and CAG represent innovative approaches to optimizing language generation in large models. RAG benefits from dynamic data retrieval, making it particularly useful in environments where information is constantly updated and flexibility is essential. In contrast, CAG aims for faster response times and a simpler architecture, as long as the preloaded data is relatively stable and well-structured.

In any case, experience demonstrates that technology alone is not the only key to success. Both RAG and CAG require data governance, clarity regarding use cases, and an organization capable of conceptualizing and scaling AI projects. Although testing indicates that CAG can reach much higher speeds than RAG in certain scenarios, this does not replace the need for a solid strategy nor does it automatically solve the challenges around information quality.

Therefore, before deciding on one methodology or the other, it is advisable to analyze the specific needs of each situation and keep in mind that the real differentiating factor ultimately lies in data quality and in the corporate strategy that supports AI adoption.
