As we've already discussed, LLMs are pre-trained up to a cutoff date and on public data (by default they have no access to a company's internal data or to our devices, for example). We have also seen how important tokens are for aspects like billing and the context window: we can't send everything we want in the input or the output, although some LLMs, like Gemini, offer context windows of millions of tokens.

The RAG pattern helps us work around these limitations of LLMs in order to get better results, with fewer hallucinations and a lower cost per use.

Before going into detail, you can take a look at the rest of the Spring AI series at the following links, in case you missed any of the previous articles:

Some of the foundations of RAG already appeared in the posts on Advisors and Tool Calling, and in the discussion of augmenting the context; in this post we will focus on the essence of RAG itself and on the technologies and abstractions that support it.

RAG

Retrieval Augmented Generation (RAG) is a widely used technique for addressing the limitations of language models around large content, precision, and context awareness. The approach is based on a batch process in which unstructured data is first read from documents (or other data sources), transformed, and then written to a vector database (roughly speaking, an ETL process).

One of the transformations in this flow is the division of the original document (or data source) into smaller pieces, with two key steps:

1. Splitting the document while preserving the semantic boundaries of the content (for example, not cutting in the middle of a paragraph or a code block).
2. Keeping each resulting piece at a size that is only a small fraction of the model's token limit.

Image showing the RAG structure with Document ingestion - offline ETL and runtime

The next phase in RAG would be the processing of the user's input, using similarity search to enrich the context with similar documents to be sent to the model.

Spring AI supports RAG by providing a modular architecture that allows creating custom RAG flows or using those created through the use of the Advisor API.

String rag() {
    return ChatClient.builder(chatModel)
        .build().prompt()
        .advisors(new QuestionAnswerAdvisor(vectorStore))
        .user("what can you tell me about US tariffs in 2025")
        .call()
        .content();
}

In the example, QuestionAnswerAdvisor performs a similarity search over the documents in the database; the search can also be restricted to certain documents through the SearchRequest class (including at runtime).
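As a sketch of what that configuration could look like (assuming the builder API of recent Spring AI versions, so check it against the version you use), the advisor can carry its own SearchRequest:

```java
String ragFiltered() {
    return ChatClient.builder(chatModel)
        .build().prompt()
        .advisors(QuestionAnswerAdvisor.builder(vectorStore)
            .searchRequest(SearchRequest.builder()
                .topK(6)                   // retrieve up to 6 documents
                .similarityThreshold(0.75) // ignore weakly related matches
                .build())
            .build())
        .user("what can you tell me about US tariffs in 2025")
        .call()
        .content();
}
```

This keeps the retrieval tuning next to the advisor instead of relying on the defaults.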

Modules

Spring AI implements a modular RAG architecture based on the paper "Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks". This support is still in an experimental phase, so it is subject to change. The available modules cover pre-retrieval (query transformation and expansion), retrieval (document search and joining), post-retrieval (document ranking, selection, and compression), and generation. For example:

// Pre-retrieval: transform the user's query (here, translating Danish to English)
Query query = new Query("Hvad er Danmarks hovedstad?"); // "What is the capital of Denmark?"

QueryTransformer queryTransformer = TranslationQueryTransformer.builder()
        .chatClientBuilder(chatClientBuilder)
        .targetLanguage("english")
        .build();

Query transformedQuery = queryTransformer.transform(query);

// Retrieval: combine the documents retrieved for each query into a single list
Map<Query, List<List<Document>>> documentsForQuery = ...
DocumentJoiner documentJoiner = new ConcatenationDocumentJoiner();
List<Document> documents = documentJoiner.join(documentsForQuery);
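These modules can also be composed into a complete flow through the RetrievalAugmentationAdvisor. The following is only a sketch assuming the experimental API of recent Spring AI versions (RetrievalAugmentationAdvisor and VectorStoreDocumentRetriever), so verify the names against your version:

```java
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
    .queryTransformers(queryTransformer)        // pre-retrieval: rewrite/translate the query
    .documentRetriever(VectorStoreDocumentRetriever.builder()
        .vectorStore(vectorStore)
        .similarityThreshold(0.70)              // retrieval: fetch similar documents
        .build())
    .build();

String answer = chatClient.prompt()
    .advisors(ragAdvisor)
    .user("What is the capital of Denmark?")
    .call()
    .content();
```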

Embeddings

Embeddings are numerical representations of text, images, or videos that capture the relationships between the input data. They work by transforming these assets into arrays of numbers called vectors, which are designed to capture their meaning. The similarity of two pieces of content can then be determined by calculating the numerical distance between their two vectors.
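To make the idea concrete, here is a minimal, self-contained illustration (not Spring AI code): cosine similarity is one of the most common ways to compare two embedding vectors, and the closer the result is to 1, the more related the two inputs are.

```java
// Minimal illustration (not Spring AI code): cosine similarity between two embeddings.
public class CosineSimilarity {

    public static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // accumulate the dot product
            normA += a[i] * a[i]; // squared magnitude of a
            normB += b[i] * b[i]; // squared magnitude of b
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] cat = {1.0f, 0.2f, 0.0f};    // toy 3-dimensional "embeddings"
        float[] kitten = {0.9f, 0.3f, 0.1f};
        float[] car = {0.0f, 0.1f, 1.0f};
        // Related concepts end up with a higher similarity than unrelated ones.
        System.out.println(cosine(cat, kitten) > cosine(cat, car)); // prints true
    }
}
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works exactly the same way.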

The EmbeddingModel interface is designed for integration with embedding models. Its primary function is to convert assets into numerical vectors that are used for semantic analysis and text classification, among other things. The interface focuses mainly on portability across providers and simplicity of use.

API Overview

The EmbeddingModel interface extends the Model interface, just as EmbeddingRequest and EmbeddingResponse extend from their corresponding ModelRequest and ModelResponse. The following image shows the relationships between the Embedding API, the Model API, and the Embedding Models:

Image showing the relationships between the Embedding API, the Model API, and the Embedding Models

public interface EmbeddingModel extends Model<EmbeddingRequest, EmbeddingResponse> {

    @Override
    EmbeddingResponse call(EmbeddingRequest request);

    float[] embed(Document document);

    default float[] embed(String text) {...}

    default List<float[]> embed(List<String> texts) {...}

    default EmbeddingResponse embedForResponse(List<String> texts) {...}

    default int dimensions() {...}
}

public class EmbeddingRequest implements ModelRequest<List<String>> {
    private final List<String> inputs;
    private final EmbeddingOptions options;
    ...
}

public class EmbeddingResponse implements ModelResponse<Embedding> {
    private List<Embedding> embeddings;
    private EmbeddingResponseMetadata metadata = new EmbeddingResponseMetadata();
    ...
}

public class Embedding implements ModelResult<float[]> {
    private float[] embedding;
    private Integer index;
    private EmbeddingResultMetadata metadata;
    ...
}

Available Implementations

On this page, you can find the available implementations. Following the premises of previous posts, we will focus on the Ollama implementation.

Properties

For autoconfiguration, a set of properties is provided that is very similar to the properties of the chat models; the embedding autoconfiguration is selected with the property:

spring.ai.model.embedding
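For instance, with Ollama the autoconfigured embedding model could be selected like this (the property names follow the Spring AI reference documentation, but the concrete model name is just an assumption for the example):

```yaml
spring:
  ai:
    model:
      embedding: ollama             # which provider autoconfigures the EmbeddingModel
    ollama:
      embedding:
        options:
          model: nomic-embed-text   # assumed embedding model pulled in Ollama
```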

Implementation

Below are some examples of using both an autoconfigured EmbeddingModel and a manually created OllamaEmbeddingModel:

@Autowired
private EmbeddingModel autoEmbeddingModel;
...
@GetMapping("/auto")
public EmbeddingResponse embed() {
    return autoEmbeddingModel.embedForResponse(List.of("Tech trends 2025", "Tech trend AI", "Tech trend accessibility"));
}
To create the model manually instead of relying on autoconfiguration, first include the Ollama dependency:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama</artifactId>
</dependency>

Then, create the corresponding instance:

private OllamaApi ollamaApi;
private OllamaEmbeddingModel embeddingModel;

public EmbeddingController() {
    ollamaApi = new OllamaApi();
    embeddingModel = OllamaEmbeddingModel.builder()
        .ollamaApi(ollamaApi)
        .defaultOptions(OllamaOptions.builder()
            .model(OllamaModel.LLAMA3_2_1B)
            .build())
        .build();
}

@GetMapping("/plain")
public EmbeddingResponse embedPlain() {
    return embeddingModel.call(new EmbeddingRequest(List.of("Tech trends 2025", "Tech trend AI", "Tech trend accessibility"),
        OllamaOptions.builder()
            .model(OllamaModel.LLAMA3_2_1B)
            .truncate(false)
            .build()));
}

Vector Databases

Vector databases are a specialized type of database that, instead of performing exact matches, performs similarity searches: given a vector as input, the result is the set of "similar" vectors.

First, the data is loaded into the vector database. Later, when a request is sent to the model, similar vectors are retrieved first and sent to the model as context for the user's question.

API Overview

To work with vector databases, Spring AI provides the VectorStore interface:

public interface VectorStore extends DocumentWriter {
    default String getName() {...}
    void add(List<Document> documents);
    void delete(List<String> idList);
    void delete(Filter.Expression filterExpression);
    default void delete(String filterExpression) {...}
    List<Document> similaritySearch(String query);
    List<Document> similaritySearch(SearchRequest request);
    default <T> Optional<T> getNativeClient() {...}
}

In addition to the SearchRequest:

public class SearchRequest {
    public static final double SIMILARITY_THRESHOLD_ACCEPT_ALL = 0.0;
    public static final int DEFAULT_TOP_K = 4;
    private String query = "";
    private int topK = DEFAULT_TOP_K;
    private double similarityThreshold = SIMILARITY_THRESHOLD_ACCEPT_ALL;
    @Nullable
    private Filter.Expression filterExpression;
    public static Builder from(SearchRequest originalSearchRequest) {...}
    public static class Builder {...}
    public String getQuery() {...}
    public int getTopK() {...}
    public double getSimilarityThreshold() {...}
    public Filter.Expression getFilterExpression() {...}
}
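As an example of tuning a search (a sketch assuming the SearchRequest builder API of recent Spring AI versions), the query, result count, threshold, and metadata filter can all be combined:

```java
SearchRequest request = SearchRequest.builder()
    .query("tech trends in AI")
    .topK(5)                               // return at most 5 documents
    .similarityThreshold(0.7)              // discard weakly related matches
    .filterExpression("tendencia == 'ia'") // filter on document metadata
    .build();

List<Document> results = vectorStore.similaritySearch(request);
```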

To insert data into the vector database, you must encapsulate it in a Document object. When it is inserted, the text is transformed into an array of numbers known as an embedding vector (generating the embedding is the job of the embedding model; the vector database's job is to perform similarity searches, not to generate the embeddings themselves).

The similaritySearch methods return the documents most similar to a query, and the search can be tuned with the parameters of SearchRequest shown above: topK (the maximum number of results), similarityThreshold (the minimum similarity to accept), and an optional Filter.Expression over the document metadata, which can be built with the FilterExpressionBuilder:

FilterExpressionBuilder builder = new FilterExpressionBuilder();
Expression expression = builder.eq("tendencia", "cloud").build();

Schema Initialization

Some databases require the schema to be initialized before use. With Spring Boot, you can set the …initialize-schema property to true, although it is advisable to check this information against each specific implementation.

Batching Strategy

It's common when working with this type of database to have to embed many documents.

Although the first idea that comes to mind is to embed all of these documents at once, doing so can cause problems, mainly because of the models' token limits (window size), which would lead to errors or truncated embeddings.

This is precisely why the batch strategy is used, where large sets of documents are divided into smaller sets that fit within the window size. This not only solves the problem of token limits but can also improve performance and the request limits of the various APIs.
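To make the idea concrete, here is a minimal, self-contained sketch of the batching logic (it is not Spring AI's actual implementation; tokens are roughly estimated as characters divided by four, a common heuristic):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: group texts into batches whose estimated token count
// stays under a limit, mirroring what a batching strategy has to do.
public class NaiveBatching {

    public static List<List<String>> batch(List<String> docs, int maxTokensPerBatch) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentTokens = 0;
        for (String doc : docs) {
            int tokens = Math.max(1, doc.length() / 4); // rough token estimate
            if (tokens > maxTokensPerBatch) {
                // A single oversized document cannot be embedded at all.
                throw new IllegalArgumentException("Single document exceeds the token limit");
            }
            if (currentTokens + tokens > maxTokensPerBatch && !current.isEmpty()) {
                batches.add(current); // close the current batch and start a new one
                current = new ArrayList<>();
                currentTokens = 0;
            }
            current.add(doc);
            currentTokens += tokens;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Production strategies use a real token estimator instead of a character heuristic, but the grouping logic is essentially this.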

Spring offers this functionality through the BatchingStrategy interface:

public interface BatchingStrategy {
    List<List<Document>> batch(List<Document> documents);
}

The default implementation is TokenCountBatchingStrategy, which splits batches based on the number of tokens so that the input token limit is never exceeded. The strategy estimates the number of tokens per document, groups documents into batches without exceeding the limit, and throws an exception if any single document exceeds the limit on its own. You can also customize the strategy by creating a new instance through a @Configuration class:

@Configuration
public class EmbeddingConfig {
    @Bean
    public BatchingStrategy customTokenCountBatchingStrategy() {
        return new TokenCountBatchingStrategy(
            EncodingType.CL100K_BASE,  // Specify the encoding type
            8000,                      // Set the maximum input token count
            0.1                        // Set the reserve percentage
        );
    }
}

Once this bean is defined, it will be used automatically by the EmbeddingModel implementations instead of the default strategy. Additionally, you can provide your own implementation of TokenCountEstimator (the class that calculates a document's tokens), as well as parameters for formatting content and metadata, or even a fully custom implementation.

And, as with other components, you can also create a completely custom implementation:

@Configuration
public class EmbeddingConfig {
    @Bean
    public BatchingStrategy customBatchingStrategy() {
        return new CustomBatchingStrategy();
    }
}

Available Implementations

You can find the available implementations on this page. In this case, we will focus on the PGvector implementation for PostgreSQL databases, which is simply an open-source extension for PostgreSQL that allows for saving and searching embeddings.

Prerequisites

First, you need access to PostgreSQL with the vector, hstore and uuid-ossp extensions. On startup, PgVectorStore will attempt to install the necessary extensions on the database and create the vector_store table with an index. You can also do this manually with:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE IF NOT EXISTS vector_store (
    id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
    content text,
    metadata json,
    embedding vector(1536) -- 1536 is the default embedding dimension
);

CREATE INDEX ON vector_store USING HNSW (embedding vector_cosine_ops);
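Once the table is populated, a similarity search can also be run manually in SQL; pgvector's <=> operator computes cosine distance (the vector literal below is just a truncated placeholder, not a real embedding):

```sql
-- Hypothetical manual query: a lower cosine distance means a more similar document
SELECT content, embedding <=> '[0.12, -0.03, ...]' AS distance
FROM vector_store
ORDER BY distance
LIMIT 4;
```

This is essentially what the VectorStore implementation executes under the hood on a similarity search.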

Configuration

Let's start by including the corresponding dependency:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
</dependency>

The vector store implementation can initialize the schema automatically, but you must opt in by setting the initializeSchema flag in the corresponding constructor or through the …​initialize-schema=true property.

Of course, an EmbeddingModel will also be necessary, and you must include its corresponding dependency in the project. And, as is often the case, you will also have to specify the connection values via properties:

spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/postgres
    username: postgres
    password: postgres
  ai:
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1024
        max-document-batch-size: 10000
        initialize-schema: true

Additionally, other properties exist for further customization of the vector store. And, as usual with Spring, you can also create the configuration manually instead of relying on autoconfiguration.

Local Execution

You can run a PGVector instance using the following Docker command:

docker run -it --rm --name postgres -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres pgvector/pgvector:pg16

And connecting to the instance with the command:

psql -U postgres -h localhost -p 5432

Usage Examples

As expected, you must choose an embedding model alongside the general chat model being used. An example of loading Documents into the vector store (in practice, this load would run as a batch process) would be:

Document document1 = new Document("Tech trend 2025: Voice and video in AI", Map.of("tendencia", "ia"));
...
List<Document> documents = Arrays.asList(document1, document2, document3, document4, document5, document6, document7, document8, document9, document10);
vectorStore.add(documents);

Later, when a user sends a question, a similarity search will be performed to obtain similar documents that will be passed as context for the prompt.

Deleting Documents

Multiple methods are provided for deleting documents:

// Delete using a Filter.Expression built directly
Filter.Expression filterExpression = new Filter.Expression(Filter.ExpressionType.EQ, new Filter.Key("tendencia"), new Filter.Value("dragdrop"));

// Or build the expression with the FilterExpressionBuilder DSL
FilterExpressionBuilder builder = new FilterExpressionBuilder();
Expression expression = builder.eq("tendencia", "cloud").build();

try {
    vectorStore.delete(filterExpression);
    vectorStore.delete(expression);
} catch (Exception e) {
    log.error("Invalid filter expression", e);
}

Note: According to the documentation, it is recommended to wrap these delete calls in try-catch blocks.

Additionally, keep performance in mind: deleting large volumes of documents or using complex filter expressions can be costly, so it is worth testing such operations with representative data beforehand.

Demo

To see the components discussed in this post in action, we created an app. Once the application starts with the …initialize-schema: true property, you can use a database client to verify that the vector_store table has been created:

Table created in vector_store with the columns id, content, metadata, and embedding

To test the functionality, we have the following endpoints:

Calling the embedding endpoint returns a response similar to the following (truncated):

"metadata": { "model": "llama3.2:1b", "usage": { "empty": true } },
"result": { "index": 0, "metadata": {}, "output": [0.008033557, 0.05002698, 0.008954761, 0.013158767, 0.012868241, -0.023812277, 0.018066008, 0.05619622, -0.010666855, 0.017404212, -0.028421931, ...] }
Image showing the previously created table with the id, content, metadata, and embedding fields filled in for a list of 10 items
The similarity search endpoint returns the most similar documents, including their distance and score (truncated):

"id": "bb767982-5564-4791-9649-b508700f3367", "text": "Tech Trend 2025: Evidence-Driven Transformation", "media": null, "metadata": { "distance": 0.20401171, "tendencia": "transformation" }, "score": 0.7959882915019989
"id": "048ad20b-d24c-4545-82cb-94bf9c67dcac", "text": "Tech Trend 2025: Humanistic Systemic Leadership", "media": null, "metadata": { "distance": 0.22337313, "tendencia": "humanistic" }, "score": 0.7766268700361252
We retrieve the previously created table, which initially had 10 items, and see that it now has 8: 2 have been deleted.
The ingestion endpoint's logs show the PDF being read, split into chunks, and loaded into the vector store:

o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 1
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 2
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 3
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 4
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 5
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 6
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing PDF page: 7
o.s.ai.reader.pdf.PagePdfDocumentReader : Processing 7 pages
o.s.a.transformer.splitter.TextSplitter : Splitting up document into 2 chunks.
o.s.a.transformer.splitter.TextSplitter : Splitting up document into 2 chunks.
o.s.a.transformer.splitter.TextSplitter : Splitting up document into 2 chunks.
o.s.a.transformer.splitter.TextSplitter : Splitting up document into 2 chunks.
c.e.s.d.s.application.IngestionService : Data loaded in vectorStore
I am sorry, but I cannot provide updated information on US tariffs for specific countries or products up to the year 2023. Tariffs are a complex political issue, and their regulation can change rapidly due to shifts in international trade policies, government decisions, and market evolution. However, I can offer you a general idea of how tariffs work and what factors can influence their implementation: 1. Import Exemptions: Products not considered "disruptive" or "detrimental" to the country's national security may be exempt from tariffs. 2. Content-Based Tariffs: Depending on the product's content, tariffs may be applied. For example, if a product contains certain prohibited or restricted items, it may be subject to tariffs. 3. Agricultural Protection Tariffs: Agricultural protection products, such as certificates of origin and genetically modified seeds, may have tariffs to protect national commercial interests. 4. Food Health and Safety Tariffs: Tariffs may be applied to products that violate regulations regarding impurities, contamination, or food safety. To obtain updated and specific information on US tariffs for a country or product in 2025, I would recommend consulting the following sources:
Dear user, Regarding the information on the initial budgets of the Assessment and Monitoring Consultancies (AA. PP.) in 2025, according to the report "Report on the initial budgets of the AA. PP. 2025" provided by AIReF, it can be summarized as follows: * The initial budgets for 2025 are $500 million. * GDP growth in the United States is expected to decrease by around half a percentage point in 2025, compared to a scenario without uncertainty. * In 2024, Spanish exports to the United States represented 15.6% of the total weight of Spanish exports and 11.6% of the total weight of Spanish imports. Regarding the main destinations of Spanish exports and origins of Spanish imports in 2024, it can be seen that: * Spanish exports represent 15.6% of France's GDP (1.1%), 11.6% of Germany's GDP (3.7%) and 8.9% of Italy's GDP (3.43%). * The main import for Spain is the South American country. If you wish to obtain more information about the initial budgets or any other topic related to the Consultancies for the...

You can download the example application code from this link.

Conclusion

On this occasion, we have seen what the RAG pattern consists of and how it is implemented in Spring AI, along with the functionality that supports it (embeddings and vector databases) and examples of the corresponding implementations (Ollama and PGVector).

In the next chapter, we will address the offline part of the RAG pattern (the ETL phase), as well as the rise of MCP and how Spring AI supports it. See you in the comments!
