In the previous post, Deep Learning about Spring AI: RAG, Embeddings, and Vector Databases, we explored the RAG pattern and saw that it basically consists of two phases: the first one for data ingestion and transformation, and the second for execution.

This time, we’ll look at the options Spring AI provides for the first phase of the RAG pattern (ETL).

We’ll also dive into one of the most relevant concepts in AI today: the MCP (Model Context Protocol).

If you’ve missed any of the previous posts in the series, you can check them out below:

  1. Deep Learning about Spring AI: Getting Started
  2. Deep Learning about Spring AI: Multimodality, Prompts and Observability
  3. Deep Learning about Spring AI: Advisors, Structured Output and Tool Calling
  4. Deep Learning about Spring AI: RAG, Embeddings and Vector Databases

ETL

Within the RAG pattern, the ETL framework organizes the data processing flow: from obtaining the raw data to storing structured Documents in a vector database.

API Overview

ETL pipelines create, transform, and store Documents using three main components:

  - DocumentReader: obtains documents from a source (a Supplier<List<Document>>).
  - DocumentTransformer: transforms a batch of documents as part of the processing flow (a Function<List<Document>, List<Document>>).
  - DocumentWriter: manages the final stage, persisting the documents (a Consumer<List<Document>>).

To build an ETL pipeline, you can chain together an instance of each of the above:

ETL Pipeline

For example, using the following instances:

  - pdfReader: a PagePdfDocumentReader (DocumentReader)
  - textSplitter: a TokenTextSplitter (DocumentTransformer)
  - vectorStore: a VectorStore implementation (DocumentWriter)

you could run the data ingestion phase of the RAG pattern with the following code:

vectorStore.accept(textSplitter.apply(pdfReader.get()));
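
For completeness, here is a minimal sketch of how these pieces could be wired together in a Spring component (the class and method names are just illustrative):

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Component;

@Component
public class EtlPipelineRunner {

    private final VectorStore vectorStore;

    EtlPipelineRunner(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void ingest(Resource pdfResource) {
        var pdfReader = new PagePdfDocumentReader(pdfResource);      // Extract
        var textSplitter = new TokenTextSplitter();                  // Transform
        List<Document> chunks = textSplitter.apply(pdfReader.get());
        this.vectorStore.accept(chunks);                             // Load
    }
}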

Interfaces

The following image shows the interfaces and implementations that support this ETL phase in Spring AI:

Interfaces and implementations supporting this ETL phase in Spring AI.
public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return get();
    }
}
public interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    default List<Document> transform(List<Document> transform) {
        return apply(transform);
    }
}
public interface DocumentWriter extends Consumer<List<Document>> {
    default void write(List<Document> documents) {
        accept(documents);
    }
}
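
Since a DocumentReader is just a Supplier<List<Document>>, writing your own reader is straightforward. As a purely illustrative sketch (not one of the built-in implementations), a reader that turns each line of an in-memory string into a Document could look like this:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;

// Hypothetical example, not part of Spring AI: a minimal custom reader
public class InMemoryLineReader implements DocumentReader {

    private final String text;

    public InMemoryLineReader(String text) {
        this.text = text;
    }

    @Override
    public List<Document> get() {
        // One Document per non-blank line
        return this.text.lines()
                .filter(line -> !line.isBlank())
                .map(Document::new)
                .toList();
    }
}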

In the following sections, we'll take a closer look at each of them.

Document Readers

Some existing implementations include:

1. JSON: processes JSON documents by converting them into Document objects.

public class CustomJsonReader {
    ...
    public List<Document> loadJson() {
        JsonReader jsonReader = new JsonReader(this.resource, "etiqueta", "content");
        return jsonReader.get();
    }
}

Constructor parameters:

  - resource: the Resource pointing to the JSON file.
  - jsonKeysToUse: the JSON keys (here "etiqueta" and "content") whose values are used to build each Document's content.

Behavior for each JSON object (within an array or as a standalone object):

  - The values of the specified keys are extracted and combined into the content of a new Document.
  - Additional metadata can be attached by passing a JsonMetadataGenerator to the constructor.

2. Text: processes plain text documents by converting them into Document objects.

public class CustomTextReader {
    ...
    public List<Document> loadText() {
        TextReader textReader = new TextReader(this.resource);
        textReader.getCustomMetadata().put("filename", "text-source.txt");
        return textReader.read();
    }
}

Constructor parameters:

  - resource: the plain-text file to read (a Resource or a resource URL as a String).

Behavior:

  - The whole file is read into a single Document (UTF-8 by default; the charset can be changed with setCharset).
  - Extra metadata, such as the filename in the example, can be attached through getCustomMetadata().

3. HTML with Jsoup: processes HTML documents and transforms them into Document objects using the JSoup library.
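
This reader isn't used in the sample application; as a rough sketch, assuming the spring-ai-jsoup-document-reader dependency and its JsoupDocumentReader class are available, it follows the same pattern as the other readers:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.jsoup.JsoupDocumentReader;
import org.springframework.core.io.Resource;

public class CustomHtmlReader {

    private final Resource resource;

    CustomHtmlReader(Resource resource) {
        this.resource = resource;
    }

    public List<Document> loadHtml() {
        // Extracts the text content of the HTML page into Document objects
        JsoupDocumentReader htmlReader = new JsoupDocumentReader(this.resource);
        return htmlReader.get();
    }
}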

4. Markdown: processes Markdown documents by converting them into Document objects. The following dependency must be included:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-markdown-document-reader</artifactId>
</dependency>
public class CustomMarkdownReader {
    ...
    public List<Document> loadMarkdown() {
        MarkdownDocumentReaderConfig config = MarkdownDocumentReaderConfig.builder()
                .withHorizontalRuleCreateDocument(true)
                .withIncludeCodeBlock(false)
                .withIncludeBlockquote(false)
                .withAdditionalMetadata("filename", "README.md")
                .build();

        MarkdownDocumentReader reader = new MarkdownDocumentReader(this.resource, config);
        return reader.get();
    }
}

The MarkdownDocumentReaderConfig class allows for some customizations:

  - withHorizontalRuleCreateDocument: when true, horizontal rules in the Markdown start a new Document.
  - withIncludeCodeBlock: whether code blocks are included in the same Document as the surrounding text.
  - withIncludeBlockquote: whether blockquotes are included in the same Document as the surrounding text.
  - withAdditionalMetadata: extra metadata added to every resulting Document (the filename in the example).

5. PDF Page: thanks to the Apache PDFBox library, it’s possible to parse PDF files with the PagePdfDocumentReader class, which by default produces one Document per page. The following dependency must be added:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pdf-document-reader</artifactId>
</dependency>
private Resource arancelesPdf;
...
var pdfReader = new PagePdfDocumentReader(arancelesPdf);

6. PDF Paragraph: using the same library as PDF Page, this option splits the PDF into paragraphs and turns each one into a Document. Note that ParagraphPdfDocumentReader relies on the PDF’s table of contents (catalog), so it doesn’t work with every PDF.

private Resource arancelesPdf;
...
var pdfReader = new ParagraphPdfDocumentReader(arancelesPdf);

7. Tika: uses the Apache Tika library to extract text from files in a variety of formats (PDF, DOC/DOCX, PPT/PPTX, HTML, etc.).

The following dependency must be included:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-tika-document-reader</artifactId>
</dependency>
public class CustomTikaReader {
    ...
    public List<Document> loadData() {
        TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
        return tikaDocumentReader.read();
    }
}

Transformers

Some existing implementations are:

1. TextSplitter: the abstract base class for splitters that divide documents so they don’t exceed token limits.

2. TokenTextSplitter: a TextSplitter implementation that splits the text into chunks based on the number of tokens:

public class CustomTokenTextSplitter {

    public List<Document> splitCustomized(List<Document> documents) {
        // Arguments: defaultChunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator
        TokenTextSplitter splitter = new TokenTextSplitter(10, 5, 2, 15, true);
        return splitter.apply(documents);
    }
}

Possible constructor parameters:

  - defaultChunkSize: target size of each chunk, in tokens (10 in the example).
  - minChunkSizeChars: minimum chunk size, in characters (5).
  - minChunkLengthToEmbed: chunks shorter than this are discarded (2).
  - maxNumChunks: maximum number of chunks generated per document (15).
  - keepSeparator: whether separators such as line breaks are kept in the chunks (true).

Behavior:

  - The text is tokenized, split into chunks of roughly defaultChunkSize tokens (trying to cut at a sentence boundary once minChunkSizeChars is reached), and each chunk becomes a new Document.
  - The original document’s metadata is copied to every chunk derived from it.

3. ContentFormatTransformer: ensures consistent content formats across all documents.
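
A rough sketch of how it might be applied, using the default content formatter:

import java.util.List;

import org.springframework.ai.document.DefaultContentFormatter;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.ContentFormatTransformer;

public class CustomContentFormatTransformer {

    public List<Document> formatDocuments(List<Document> documents) {
        // Applies the same (default) content formatter to every document in the batch
        ContentFormatTransformer transformer =
                new ContentFormatTransformer(DefaultContentFormatter.defaultConfig());
        return transformer.apply(documents);
    }
}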

4. KeywordMetadataEnricher: uses generative AI to extract keywords from the file and add them as metadata.

public class CustomKeywordEnricher {

    private final ChatModel chatModel;

    CustomKeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<Document> enrichDocuments(List<Document> documents) {
        // Arguments: the chat model and the number of keywords to extract per document
        KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(this.chatModel, 2);
        return enricher.apply(documents);
    }
}

Possible constructor parameters:

  - chatModel: the generative model used to extract the keywords.
  - keywordCount: the number of keywords to extract per document (2 in the example).

Behavior:

  - For each Document, the content is sent to the model with a prompt asking for the configured number of keywords.
  - The result is stored in the document’s metadata under the key excerpt_keywords.

5. SummaryMetadataEnricher: uses generative AI to create summaries of the files and add them as metadata.

public class CustomSummaryEnricher {

    private final SummaryMetadataEnricher enricher;

    CustomSummaryEnricher(SummaryMetadataEnricher enricher) {
        this.enricher = enricher;
    }

    public List<Document> enrichDocuments(List<Document> documents) {
        return this.enricher.apply(documents);
    }
}

@Configuration
public class SummaryMetadataConfig {

    @Bean
    SummaryMetadataEnricher summaryMetadata(ChatModel chatModel) {
        return new SummaryMetadataEnricher(chatModel,
                List.of(SummaryType.PREVIOUS, SummaryType.CURRENT, SummaryType.NEXT));
    }
}

Possible constructor parameters:

  - chatModel: the generative model used to produce the summaries.
  - summaryTypes: the summaries to generate for each document (PREVIOUS, CURRENT and/or NEXT).

Behavior:

  - For each Document, a summary of its own content is generated and, depending on the configured types, summaries of the previous and next documents as well.
  - The summaries are added to the metadata under keys such as section_summary, prev_section_summary and next_section_summary.

Writers

Some existing implementations are:

1. File: implementation that writes the content of a list of Documents to a file.

public class CustomDocumentWriter {

    public void writeDocuments(List<Document> documents) {
        // Arguments: file name, withDocumentMarkers, metadataMode, append
        FileDocumentWriter writer = new FileDocumentWriter("./src/main/resources/static/docs/output.txt", true, MetadataMode.ALL, false);
        writer.accept(documents);
    }
}

Possible constructor parameters:

  - fileName: the path of the output file.
  - withDocumentMarkers: whether a marker with the document index and page numbers is written before each document (true in the example).
  - metadataMode: which metadata is written along with the content (ALL in the example).
  - append: whether to append to the file instead of overwriting it (false).

Behavior:

  - The formatted content of every Document in the list is written sequentially to the target file.

2. VectorStore: integration with the supported vector databases. Since VectorStore itself implements DocumentWriter, any of them can be used directly as the final step of the pipeline.
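
Following the same pattern as the file writer, a minimal sketch (the class name is just illustrative) of using an injected VectorStore to close the pipeline:

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;

public class CustomVectorStoreWriter {

    private final VectorStore vectorStore;

    CustomVectorStoreWriter(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void writeDocuments(List<Document> documents) {
        // VectorStore implements DocumentWriter, so it can be used directly as the sink
        this.vectorStore.accept(documents);
    }
}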

Demo ETL

We define the following endpoints to demonstrate the behavior of readers, transformers, and writers:

Readers:

Reader behavior: /readers/json
Reader behavior: /readers/text
Reader behavior: /readers/markdown
Reader behavior: /readers/tika

Transformers:

Transformer behavior: /transformers/token
Transformer behavior: /transformers/keyword
Transformer behavior: /transformers/summary

Writers:

Writer behavior: /writers/file

RAG: by combining some of the previous pieces, you can create an endpoint that ingests one or more existing code files and then lets you request code recommendations based on them. The following class is one of the files used as input; a sketch of the service behind the endpoint follows it:

package com.example.springai.demo.springai_demo.application;

import org.springframework.stereotype.Service;
import lombok.extern.slf4j.Slf4j;

@Service
@Slf4j
public class ParadigmaSpecialService {

    public void callingSpecialService() {
        log.info("This is the implementation for the special service with Paradigma rules");
    }
}
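
As a rough sketch of what the service behind the /rag/code endpoint might look like (class and method names here are hypothetical), the previously ingested code can be retrieved at query time through a QuestionAnswerAdvisor:

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class CodeRecommendationService {

    private final ChatClient chatClient;

    CodeRecommendationService(ChatClient.Builder builder, VectorStore vectorStore) {
        // The advisor retrieves the most similar code chunks from the vector store
        // and injects them into the prompt before calling the model
        this.chatClient = builder
                .defaultAdvisors(QuestionAnswerAdvisor.builder(vectorStore).build())
                .build();
    }

    public String recommend(String question) {
        return this.chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}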
RAG behavior: /rag/no-rag
RAG behavior: /rag/code

You can download the code of the sample application from this link.

MCP

Originally created by Anthropic, the Model Context Protocol (MCP) standardizes the interaction between applications and LLMs.

As people say online, MCP is kind of like USB-C. Just as USB-C is a standard connection between devices, MCP is a standard protocol for connecting AI models to applications or data sources.

It was created to streamline the integration of data and tools with LLMs, offering:

  - A growing catalog of pre-built integrations that an LLM can plug into directly.
  - The flexibility to switch between LLM providers and vendors.
  - Best practices for keeping your data secure within your own infrastructure.

In essence, it pursues a goal similar to the previously discussed Tool Calling (though with a different approach): making functionality reusable across different clients and applications in a seamless way.

MCP Demo

As is often the case, the concept becomes clearer with a practical example. Let’s run a demo where, for example, we ask an LLM which files exist in a specific folder on our system.

To explore how MCP helps, we choose one of the many example clients and servers listed in the MCP spec. In this case, we use the LibreChat client and the filesystem server.

First, if the Ollama service is running on your system, it needs to be stopped (on Linux use: systemctl stop ollama.service).

Next, run LibreChat with the default configuration for Ollama using Docker Compose. Besides this default config, you’ll need to enter the container running Ollama to download the desired LLM.

After accessing the web interface (by default at http://localhost:3080/) and registering/logging in (the account is local), you’ll see a familiar chat interface:

Chat interface similar to ChatGPT.

As shown, the Ollama model has been selected and we ask about files in a folder on our system (since Ollama runs in a container, the folder must be accessible inside that container).

Ollama model selected.

Since the model doesn’t have access to our filesystem, it responds that it can't help.

Now, we restart Docker Compose but this time enabling the MCP servers configuration. Back in the UI, we now see the “MCP Servers” option enabled in the chat area with the corresponding server listed:

“MCP Servers” option enabled in the UI.

Running the same query again, we can now verify that the MCP server function accesses the filesystem and the model responds accordingly:

MCP server function verification on the filesystem – part 1
MCP server function verification on the filesystem – part 2

At this point, what we’ve achieved with MCP is to extend the available context so the LLM can help us with tasks that were previously impossible.

Conclusions

In this post from our Spring AI series, we continued to explore the data ingestion phase of the RAG pattern and how to implement it.

We also covered one of the hottest new capabilities in the AI world: MCP (Model Context Protocol), showcasing its potential with a hands-on example.

In the next installment, we’ll dive deeper into implementing MCP in Spring AI — both on the client and server side.
